NLP Document Intelligence Engine

The NLP Document Intelligence Engine helps users work through long documents without losing context. Users select one or two documents, ask questions, and receive answers grounded in the selected source rather than a general chatbot memory.

10K+

Docs Indexed

91%

Answer Relevance

<3s

End-to-End Latency

Country Contexts

Problem Statement

Long reports, agreements and field documents are difficult to review quickly, especially when teams need summaries, risks, actions or specific answers. This project keeps the analysis close to the selected document and makes the source scope clear.

Programme teams spend hours searching for precedents before new proposals. Donors cannot easily verify claims against field evidence. Leadership lacks a searchable view of cross-country programme learnings.

The solution: A RAG pipeline that ingests, chunks, embeds, and indexes all documents — then responds to natural-language queries with grounded, cited answers drawn from the actual documents.

How It Works

📥

1. Ingestion & Chunking

PDFs, DOCX, and HTML documents are parsed, cleaned, and chunked into 512-token segments with 64-token overlap to preserve context across boundaries.

🔢

2. Embedding & Indexing

Each chunk is embedded using BGE-M3 (multi-lingual, 1024-dim). Vectors stored in ChromaDB with full metadata for filtering by date, country, sector, and document type.

🔍

3. Hybrid Retrieval

Queries are answered using hybrid retrieval: semantic similarity (cosine distance) combined with BM25 keyword scoring, then re-ranked using a cross-encoder model.

💬

4. Grounded Generation

Retrieved chunks are passed to GPT-4o with strict citation instructions. The model is constrained to answer only from the provided context, with page-level source references.

📄 Architecture Docs 🔬 Methodology 🤖 Open AI Sandbox →