The NLP Document Intelligence Engine helps users work through long documents without losing context. Users select one or two documents, ask questions, and receive answers grounded in the selected source rather than a general chatbot memory.
Problem Statement
Long reports, agreements and field documents are difficult to review quickly, especially when teams need summaries, risks, actions or specific answers. This project keeps the analysis close to the selected document and makes the source scope clear.
Programme teams spend hours searching for precedents before new proposals. Donors cannot
easily verify claims against field evidence. Leadership lacks a searchable view of
cross-country programme learnings.
The solution: A RAG pipeline that ingests, chunks, embeds, and indexes all
documents โ then responds to natural-language queries with grounded, cited answers drawn
from the actual documents.
๐ฅ
1. Ingestion & Chunking
PDFs, DOCX, and HTML documents are parsed, cleaned, and chunked into 512-token segments with 64-token overlap to preserve context across boundaries.
๐ข
2. Embedding & Indexing
Each chunk is embedded using BGE-M3 (multi-lingual, 1024-dim). Vectors stored in ChromaDB with full metadata for filtering by date, country, sector, and document type.
๐
3. Hybrid Retrieval
Queries are answered using hybrid retrieval: semantic similarity (cosine distance) combined with BM25 keyword scoring, then re-ranked using a cross-encoder model.
๐ฌ
4. Grounded Generation
Retrieved chunks are passed to GPT-4o with strict citation instructions. The model is constrained to answer only from the provided context, with page-level source references.