Methodology

How the pipeline classifies documents, extracts structured fields, and decides routing with quantified confidence.

Representative methodology — actual production LLM prompts, fine-tuned weights, and client-specific rule configurations are proprietary.
1
Document ingestion & pre-processing

Documents arrive via three channels: email (IMAP watch with attachment extraction), SharePoint (Graph API file events), and REST API (multipart upload). A validation layer checks MIME type and file size (<25 MB), then runs a hash-based deduplication check to prevent re-processing.
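
The validation layer described above can be sketched as follows. This is a minimal illustration, not the production code: the MIME allow-list, the choice of SHA-256, the reading of 25 MB as 25 MiB, and the in-memory hash set (standing in for a persistent store) are all assumptions.

```python
import hashlib
from dataclasses import dataclass, field

MAX_BYTES = 25 * 1024 * 1024  # assumed reading of the 25 MB limit
ALLOWED_MIME = {"application/pdf", "image/png", "image/jpeg", "image/tiff"}  # assumed allow-list


@dataclass
class IngestValidator:
    """Rejects unsupported, oversized, or previously seen documents."""

    # In-memory stand-in for whatever persistent store holds known hashes.
    seen_hashes: set = field(default_factory=set)

    def check(self, payload: bytes, mime_type: str) -> tuple[bool, str]:
        """Return (accepted, digest) on success or (False, reason) on rejection."""
        if mime_type not in ALLOWED_MIME:
            return False, "unsupported MIME type"
        if len(payload) >= MAX_BYTES:
            return False, "file too large"
        digest = hashlib.sha256(payload).hexdigest()
        if digest in self.seen_hashes:
            return False, "duplicate"
        self.seen_hashes.add(digest)
        return True, digest
```

Hashing the raw bytes only catches byte-identical resubmissions; a rescanned copy of the same paper document would pass this check.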

PDFs are rendered at 300 dpi using pdf2image. Images undergo deskewing (Hough transform), binarisation, and contrast normalisation before OCR submission.
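
To make two of these steps concrete, here is a dependency-free sketch of contrast normalisation and binarisation on a flat list of 8-bit grey values. The source does not name the binarisation method; Otsu's threshold is used here purely as an assumption, and deskewing is left out because it needs an image library.

```python
def normalise_contrast(pixels: list[int]) -> list[int]:
    """Min-max stretch 8-bit grey values to the full 0-255 range."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0] * len(pixels)  # flat image: nothing to stretch
    return [round((p - lo) * 255 / (hi - lo)) for p in pixels]


def otsu_threshold(pixels: list[int]) -> int:
    """Return the threshold t that maximises between-class variance
    for the split (value <= t) versus (value > t)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue  # no pixels at or below t yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the background class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarise(pixels: list[int]) -> list[int]:
    """Normalise contrast, then map each pixel to 0 or 255."""
    stretched = normalise_contrast(pixels)
    t = otsu_threshold(stretched)
    return [255 if p > t else 0 for p in stretched]
```

In practice these operations would run on full image arrays through an imaging library; the list-based version only shows the arithmetic.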

2
OCR with Azure Form Recognizer

Azure Form Recognizer's prebuilt-document model extracts text blocks, tables, and key-value pairs with bounding-box coordinates. Layout analysis identifies headers, footers, signature blocks, and stamped fields.

Tables are parsed into structured JSON arrays. Handwritten fields are passed through an additional handwriting recognition step (prebuilt-read), which typically achieves 88–94% accuracy on field names and 82–91% on values.
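
As an illustration of the table-to-JSON step, the sketch below rebuilds a grid from per-cell records and turns it into an array of row objects. The cell dictionaries and their keys (row_index, column_index, content) are hypothetical stand-ins for the OCR output, and treating the first row as the header is an assumption.

```python
import json


def table_to_grid(cells: list[dict], row_count: int, column_count: int) -> list[list[str]]:
    """Place each cell's text at its (row, column) position; gaps stay empty."""
    grid = [["" for _ in range(column_count)] for _ in range(row_count)]
    for cell in cells:
        grid[cell["row_index"]][cell["column_index"]] = cell["content"].strip()
    return grid


def grid_to_records(grid: list[list[str]]) -> list[dict]:
    """Treat the first row as the header and emit one dict per body row."""
    header, *body = grid
    return [dict(zip(header, row)) for row in body]


cells = [
    {"row_index": 0, "column_index": 0, "content": "Item"},
    {"row_index": 0, "column_index": 1, "content": "Qty"},
    {"row_index": 1, "column_index": 0, "content": "Widget "},
    {"row_index": 1, "column_index": 1, "content": "3"},
]
records = grid_to_records(table_to_grid(cells, row_count=2, column_count=2))
table_json = json.dumps(records)
```

Merged cells and multi-row headers are ignored here; a real parser would also need to handle row and column spans.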

3
Document classification

A DistilBERT classifier fine-tuned on 12 document classes uses the first 512 tokens of the OCR output. Training used ~4,000 labelled samples per class with data augmentation (synonym replacement, back-translation).

Per-class classification accuracy ranges from 91% (medical letters) to 99.1% (invoices). The predicted class probability doubles as a confidence signal: documents whose probability is below 0.72 go to the human review queue and skip extraction.
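
The 0.72 gate can be sketched in a few lines. The function names and the returned dictionary shape are hypothetical; the sketch assumes the classifier exposes raw logits and that the probability in question is the softmax probability of the top class.

```python
import math

REVIEW_THRESHOLD = 0.72  # threshold stated in the text


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over raw classifier logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classification_gate(logits: list[float], labels: list[str]) -> dict:
    """Pick the top class and decide whether extraction may proceed."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    p = probs[idx]
    next_step = "human_review" if p < REVIEW_THRESHOLD else "extraction"
    return {"label": labels[idx], "probability": p, "next_step": next_step}
```

For example, logits of [4.0, 0.5, 0.1] give a top-class probability of about 0.95 and proceed to extraction, while [1.0, 0.8, 0.6] give about 0.40 and go to review.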

4
LLM field extraction (LangChain + GPT-4o)

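Type-specific extraction prompts instruct GPT-4o to return JSON conforming to a Pydantic schema. Structured output mode (response_format: json_schema) constrains generation to that schema on the API side, and the response is then validated against the Pydantic model client-side. This prevents fields outside the schema from appearing, although values inside known fields can still be wrong.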
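
A minimal sketch of the client-side validation, assuming Pydantic v2. The InvoiceFields model and its field names are hypothetical examples rather than the production schema, and the LLM call itself is omitted.

```python
from datetime import date
from decimal import Decimal
from typing import Optional

from pydantic import BaseModel, ConfigDict, ValidationError


class InvoiceFields(BaseModel):
    """Hypothetical schema for one document type."""

    # extra="forbid" makes any field outside the schema a validation error.
    model_config = ConfigDict(extra="forbid")

    invoice_number: str
    invoice_date: date
    total_amount: Decimal
    currency: str


def parse_extraction(raw_json: str) -> Optional[InvoiceFields]:
    """Validate the model's JSON reply; return None if it does not conform."""
    try:
        return InvoiceFields.model_validate_json(raw_json)
    except ValidationError:
        return None
```

Returning None keeps the sketch short; a pipeline would more likely keep the validation errors so they can feed the completeness score or the review queue.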

A self-consistency check re-runs extraction on a 20% random sample with temperature=0.4 and compares key numeric fields. Divergence >5% on critical fields (amounts, dates) triggers a confidence penalty and human review flag.
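
The comparison step might look like the sketch below. Several details are assumptions: divergence is measured relative to the larger of the two values, non-numeric fields such as dates are flagged on any mismatch (the source does not say how a 5% divergence applies to dates), and the two extraction runs are represented as plain dictionaries.

```python
import random

SAMPLE_RATE = 0.20          # share of documents re-run, from the text
MAX_REL_DIVERGENCE = 0.05   # 5% tolerance on critical fields, from the text


def should_recheck(rng: random.Random) -> bool:
    """Select roughly 20% of documents for a second extraction run."""
    return rng.random() < SAMPLE_RATE


def relative_divergence(a: float, b: float) -> float:
    """Absolute difference relative to the larger magnitude (assumed definition)."""
    denom = max(abs(a), abs(b))
    return 0.0 if denom == 0 else abs(a - b) / denom


def divergent_fields(first: dict, second: dict, critical: list[str]) -> list[str]:
    """Return the critical fields on which the two runs disagree."""
    flagged = []
    for name in critical:
        x, y = first.get(name), second.get(name)
        if isinstance(x, (int, float)) and isinstance(y, (int, float)):
            if relative_divergence(x, y) > MAX_REL_DIVERGENCE:
                flagged.append(name)
        elif x != y:
            # Assumption: non-numeric values (e.g. dates) must match exactly.
            flagged.append(name)
    return flagged
```

A non-empty result would trigger the confidence penalty and the human review flag.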

5
Confidence scoring & routing

A composite confidence score combines classifier probability (40%), required-field completeness (30%), and self-consistency agreement (30%). Scores ≥ 0.90 auto-route; scores from 0.72 up to but not including 0.90 escalate; scores < 0.72 queue for review.
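
The weighting and the three routing bands can be written out as a short sketch. The function names are hypothetical, completeness is assumed to be the share of required fields that are present and non-empty, and the source does not say what consistency value applies to documents outside the re-run sample, so the caller supplies it.

```python
WEIGHTS = {"classifier": 0.40, "completeness": 0.30, "consistency": 0.30}


def completeness(fields: dict, required: list[str]) -> float:
    """Share of required fields that are present and non-empty (assumed definition)."""
    if not required:
        return 1.0
    filled = sum(1 for name in required if fields.get(name) not in (None, ""))
    return filled / len(required)


def composite_score(classifier_prob: float, fields: dict,
                    required: list[str], consistency: float) -> float:
    """Weighted sum of the three signals, each expected in [0, 1]."""
    return (WEIGHTS["classifier"] * classifier_prob
            + WEIGHTS["completeness"] * completeness(fields, required)
            + WEIGHTS["consistency"] * consistency)


def route(score: float) -> str:
    """Map a score onto the three bands; boundaries are half-open."""
    if score >= 0.90:
        return "auto_route"
    if score >= 0.72:
        return "escalate"
    return "review_queue"
```

For instance, a classifier probability of 0.80 with three of four required fields present and full agreement gives 0.32 + 0.225 + 0.30 = 0.845, which escalates.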

Routing actions POST extracted JSON to the configured target endpoint (ERP, CRM, or SharePoint list) via authenticated API calls. The audit log records the document hash, confidence score, routing decision, and timestamp — retained for 7 years for compliance.
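
An audit entry with the four recorded fields could be built as in the sketch below. The key names, the use of SHA-256, and the JSON serialisation are assumptions; the routing POST and the retention policy are outside this example.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Optional


def audit_record(payload: bytes, score: float, decision: str,
                 now: Optional[datetime] = None) -> str:
    """Serialise one audit entry as a JSON line with stable key order."""
    ts = now if now is not None else datetime.now(timezone.utc)
    entry = {
        "document_hash": hashlib.sha256(payload).hexdigest(),
        "confidence_score": round(score, 4),
        "routing_decision": decision,
        "timestamp": ts.isoformat(),
    }
    return json.dumps(entry, sort_keys=True)
```

Logging the hash rather than the document keeps the audit trail free of document content while still letting an entry be matched to a stored file.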

Technical stack

OCR: Azure Form Recognizer
Classification: DistilBERT (fine-tuned)
LLM extraction: LangChain + GPT-4o
Orchestration: Apache Airflow
Validation: Pydantic v2
Storage: Azure Blob + CosmosDB
Deployment: Azure Container Apps
Monitoring: Application Insights