NLP Document Intelligence Engine — Documentation

Browser-first RAG · PDF / DOCX / TXT · Document-scoped Q&A · No server upload in demo

Purpose

The NLP Document Intelligence Engine is an interactive document assistant that helps users understand bulky reports, agreements, evaluations, policy notes, operational files and evidence packs. It is designed to reduce the time spent manually reading long documents by allowing users to select one or two documents, ask natural-language questions, and receive structured answers grounded in the selected document scope.

The sandbox demonstrates a practical Retrieval-Augmented Generation workflow using browser-based extraction, chunking, lightweight vector-style scoring and structured answer generation. The goal is not only to answer questions, but to help non-technical users understand documents in a clear, decision-ready way.

Core capabilities

  • Document selection: users choose one or two documents so the assistant knows exactly which source to answer from.
  • Sample documents: the tool includes donor, field monitoring and evaluation examples for demonstration.
  • Upload support: the sandbox accepts PDF, DOCX, TXT, MD, CSV and JSON files.
  • Processing feedback: the interface shows reading, extraction, chunking, vectorising and ready-for-querying stages.
  • Document-grounded answers: responses show the selected source scope and retrieved evidence chips.
  • Structured outputs: summaries, key notes, risks, recommendations, comparisons and ROI analysis are formatted with headings, paragraphs, bullets and numbered steps.
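
As an illustration of the upload support listed above, the sketch below routes an uploaded filename to an extraction path by its extension. The extractor labels and the function name `pickExtractor` are hypothetical, chosen for this example; the demo's actual parser names are not given in this documentation.

```typescript
// Hypothetical labels for the three extraction paths (not the demo's real names).
type Extractor = "pdf-parser" | "docx-parser" | "file-reader";

// Returns the extraction path for a filename, or null if the type is not accepted.
function pickExtractor(filename: string): Extractor | null {
  const dot = filename.lastIndexOf(".");
  if (dot < 0 || dot === filename.length - 1) {
    return null; // no extension, or a trailing dot
  }
  const ext = filename.slice(dot + 1).toLowerCase();
  switch (ext) {
    case "pdf":
      return "pdf-parser";
    case "docx":
      return "docx-parser";
    case "txt":
    case "md":
    case "csv":
    case "json":
      return "file-reader"; // plain text and data files are read as text
    default:
      return null;
  }
}
```

Rejecting unknown extensions up front lets the interface show a clear message before any processing stage starts.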

Security and data protection design

Privacy-first sandbox design: in this portfolio demo, document processing happens inside the user’s browser. The portfolio code does not upload the selected document to a backend server or external AI API for analysis.

This design minimises document movement. Uploaded files are read by the browser, text is extracted locally, chunks are created in memory, and the retrieval logic runs inside the current browser session. The default operating principle is to keep sensitive document content as close to the user as possible, which is what makes the sandbox a useful demonstration of data protection thinking.
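
The in-memory chunking step described above could look roughly like the following. This is a minimal sketch, not the demo's actual code: the `Chunk` shape, the `chunkText` name and the 400-character default are assumptions made for illustration, and the sentence splitter is deliberately naive.

```typescript
// Hypothetical chunk record held only in memory for the current session.
interface Chunk {
  docId: string;
  index: number;
  text: string;
}

// Splits text into paragraphs on blank lines, then packs whole sentences into
// chunks of at most `maxChars` characters. A single sentence longer than
// `maxChars` is kept whole rather than cut mid-sentence.
function chunkText(docId: string, text: string, maxChars = 400): Chunk[] {
  const chunks: Chunk[] = [];
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.replace(/\s+/g, " ").trim())
    .filter((p) => p.length > 0);

  for (const paragraph of paragraphs) {
    // Naive sentence split: break after ., ! or ? when followed by a space.
    const sentences = paragraph.replace(/([.!?]) /g, "$1\n").split("\n");
    let current = "";
    for (const sentence of sentences) {
      const wouldOverflow =
        current.length > 0 && current.length + 1 + sentence.length > maxChars;
      if (wouldOverflow) {
        chunks.push({ docId, index: chunks.length, text: current });
        current = sentence;
      } else {
        current = current.length > 0 ? `${current} ${sentence}` : sentence;
      }
    }
    if (current.length > 0) {
      chunks.push({ docId, index: chunks.length, text: current });
    }
  }
  return chunks;
}
```

Because the chunks live in an ordinary array, they disappear when the page is closed or reloaded, which matches the session-only behaviour described above.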

For production deployment, the same privacy principle can be extended through secure architecture: local or private-cloud document processing, encrypted storage, role-based access control, audit logs, tenant isolation, data retention rules, and clear user consent before any document is sent to external AI services. If CDN libraries are replaced with locally bundled assets, the project can further reduce external dependencies.

User workflow

  1. Select source: the user ticks one or two processed documents.
  2. Upload if needed: the user can attach a PDF, DOCX or text/data file.
  3. Process document: the tool extracts text, chunks content and creates searchable vector-style entries.
  4. Ask a question: the user asks for a summary, risks, action points, recommendations, ROI or custom answers.
  5. Read response: the assistant streams the answer progressively and clearly states the selected document scope.
  6. Review evidence: retrieved source chips show which document chunks informed the answer.
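
The "one or two documents" rule from step 1 and the scope statement from step 5 can be sketched as a single guard. The `ProcessedDoc` shape, the function name and the banner wording are hypothetical; the demo's exact text is not specified here.

```typescript
// Hypothetical record for a document that has finished processing.
interface ProcessedDoc {
  id: string;
  title: string;
}

// Enforces the one-or-two selection rule and returns the scope banner text.
function buildScopeBanner(selected: ProcessedDoc[]): string {
  if (selected.length < 1 || selected.length > 2) {
    throw new Error("Select one or two processed documents before asking a question.");
  }
  const titles = selected.map((d) => d.title).join(" and ");
  const noun = selected.length === 1 ? "document" : "documents";
  return `Answering from ${selected.length} selected ${noun}: ${titles}`;
}
```

Running the guard before retrieval means an answer is never generated without an explicit source scope.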

Architecture

Browser UI
  ├── Sample documents loaded into memory
  ├── User file upload input
  ├── Local text extraction
  │     ├── PDF parser for PDF files
  │     ├── DOCX parser for Word files
  │     └── FileReader for TXT / MD / CSV / JSON
  ├── Chunking layer
  │     └── Sentence and paragraph based segmentation
  ├── Lightweight vector-style retrieval
  │     ├── Token scoring
  │     ├── Intent-aware boosting
  │     └── Top evidence chunk selection
  └── Structured response generator
        ├── Source scope banner
        ├── Summary / risk / recommendation / ROI templates
        ├── Evidence chips
        └── Streaming ChatGPT-style display
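
To make the retrieval layer in the diagram more concrete, here is a minimal sketch of token scoring with intent-aware boosting and top-chunk selection. It is one plausible reading of "lightweight vector-style retrieval", not the demo's confirmed implementation: the intent rules, stop-word list, weights (1 for a query match, 0.5 for a boost term) and length normalisation are all assumptions.

```typescript
interface Chunk { docId: string; index: number; text: string; }
interface ScoredChunk extends Chunk { score: number; }
interface IntentRule { triggers: string[]; boostTerms: string[]; }

// Hypothetical intent rules: a trigger word in the question boosts related terms.
const INTENT_RULES: IntentRule[] = [
  { triggers: ["risk", "risks"], boostTerms: ["risk", "delay", "concern", "gap"] },
  { triggers: ["recommendation", "recommendations"], boostTerms: ["recommend", "should", "action"] },
  { triggers: ["roi"], boostTerms: ["cost", "saving", "budget", "efficiency"] },
];

// Small illustrative stop-word list so common words do not count as matches.
const STOP_WORDS = new Set(["the", "a", "an", "of", "and", "to", "in", "is", "are", "what", "for"]);

function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Scores each chunk against the question and returns the best `topK` matches.
function retrieve(question: string, chunks: Chunk[], topK = 3): ScoredChunk[] {
  const queryTokens = new Set(tokenize(question).filter((t) => !STOP_WORDS.has(t)));
  const rule = INTENT_RULES.find((r) => r.triggers.some((t) => queryTokens.has(t)));
  const boostTerms = rule ? rule.boostTerms : [];

  const scored: ScoredChunk[] = chunks.map((chunk) => {
    const tokens = tokenize(chunk.text);
    let score = 0;
    for (const token of tokens) {
      if (queryTokens.has(token)) score += 1; // direct token overlap
      if (boostTerms.some((term) => token.startsWith(term))) score += 0.5; // intent boost
    }
    // Divide by sqrt(length) so long chunks do not win on size alone.
    const normalised = tokens.length > 0 ? score / Math.sqrt(tokens.length) : 0;
    return { ...chunk, score: normalised };
  });

  return scored
    .filter((c) => c.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

The returned chunks are what an interface like this would surface as evidence chips; chunks with a zero score are dropped so that unrelated text is never cited.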

Important limitations

  • This is a portfolio sandbox, not a full enterprise backend RAG platform.
  • PDF and DOCX extraction quality depends on how readable the uploaded file is.
  • Scanned image-only PDFs require OCR, which is not included in the lightweight browser demo.
  • Uploaded documents are processed for the current browser session and are not persisted as a secure knowledge base.
  • For production, the retrieval layer should be connected to a full vector database, access controls, audit logs and model governance.
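
Since OCR is not included, one way a browser demo could warn users about image-only PDFs is to check how much text the extraction step produced per page. The sketch below is a heuristic under stated assumptions: the `ExtractionResult` shape, the function name and the threshold of 20 visible characters per page are hypothetical and would need tuning on real files.

```typescript
// Hypothetical summary of what a PDF extraction step returned.
interface ExtractionResult {
  pageCount: number;
  text: string;
}

// Flags a document as likely image-only when very little text was extracted
// relative to its page count.
function looksImageOnly(result: ExtractionResult, minCharsPerPage = 20): boolean {
  if (result.pageCount <= 0) {
    return true; // nothing readable at all
  }
  const visibleChars = result.text.replace(/\s+/g, "").length;
  return visibleChars / result.pageCount < minCharsPerPage;
}
```

A positive result could be used to show a message that the file appears to be scanned and needs OCR before it can be queried.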

Business and portfolio impact

The project demonstrates how Pharaoh can turn unstructured documents into practical decision intelligence. This is valuable for donor reporting, compliance review, programme management, field monitoring, evaluations, grant management, HR document review, policy interpretation and operational reporting.

  • ROI driver: reduced manual reading time and fewer rework cycles.
  • Quality driver: answers are scoped to selected documents and supported by retrieved evidence.
  • Compliance driver: risks, obligations and missing evidence are surfaced earlier.
  • Leadership value: decision-makers can move from long-form reading to action faster.