
AI Document Processing Workflow

This pipeline automates the intake, classification, field extraction, and routing of 500+ documents per day. It replaces a 4-person manual review team, reaching a 92% straight-through processing rate with same-hour routing for routine document types.

Note: This documentation covers the architecture and configuration of a representative deployment pattern. Specific client configurations and proprietary schemas are omitted.

Pipeline stages

The workflow operates as a 5-stage event-driven pipeline:

Ingestion: Documents arrive via email attachment, SharePoint watch folder, or REST API upload. A file validator checks format and size and runs a virus scan before passing the file to the queue.
OCR & Pre-processing: Azure Form Recognizer extracts text layout, tables, and key-value pairs. PDFs are rendered at 300 dpi; images are deskewed and contrast-normalised.
Classification: A fine-tuned text classifier (DistilBERT) assigns each document to one of 12 document types. Documents classified with confidence below 0.72 are routed to the human review queue.
Field Extraction: LangChain + GPT-4o extracts structured fields using extraction prompts specific to each document type, with JSON schema enforcement.
Routing & Action: Extracted records are validated against business rules and then routed. Auto-approved records are POSTed to the target system API; exceptions are queued for human review with an annotated preview.
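
The two stages with explicit pass/fail rules, the ingestion validator and the classification confidence gate, can be sketched roughly as follows. This is a minimal illustration rather than the deployed code: the accepted file suffixes, the 20 MB size cap, and the names `validate_upload` and `gate_classification` are assumptions made for the example, and the virus scan result is taken as an input rather than computed. Only the 0.72 threshold comes from the stage description.

```python
from dataclasses import dataclass
from pathlib import Path

# Assumed values for illustration; the documentation does not list
# accepted formats or a maximum upload size.
ALLOWED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}
MAX_BYTES = 20 * 1024 * 1024

# Taken from the Classification stage description.
CLASSIFIER_THRESHOLD = 0.72


def validate_upload(filename: str, size_bytes: int, virus_scan_clean: bool) -> list[str]:
    """Return a list of problems; an empty list means the file may be queued."""
    problems = []
    if Path(filename).suffix.lower() not in ALLOWED_SUFFIXES:
        problems.append("unsupported format")
    if size_bytes <= 0 or size_bytes > MAX_BYTES:
        problems.append("size out of range")
    if not virus_scan_clean:
        problems.append("virus scan failed")
    return problems


@dataclass
class ClassificationResult:
    label: str
    confidence: float
    needs_human_review: bool


def gate_classification(class_scores: dict[str, float]) -> ClassificationResult:
    """Pick the top-scoring class and flag it for review when below the threshold."""
    label, confidence = max(class_scores.items(), key=lambda kv: kv[1])
    return ClassificationResult(label, confidence, confidence < CLASSIFIER_THRESHOLD)
```

In this sketch, `class_scores` stands in for the per-class probabilities a classifier such as DistilBERT would produce after a softmax; a document whose best score is 0.65 would be flagged for the human review queue regardless of which class won.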

Supported document types

Type	Sectors	Avg extraction accuracy
Insurance claim forms	Insurance	97.2%
Grant progress reports	NGO / Development	94.1%
Purchase invoices	Finance, Procurement	96.8%
ID documents (passport/national ID)	KYC, HR	98.4%
Medical referral letters	Healthcare	92.3%
Supplier contracts	Legal, Procurement	89.7%

Extraction schema (example — invoice)

{
  "document_type": "invoice",
  "vendor_name": "string",
  "invoice_number": "string",
  "invoice_date": "ISO 8601 date",
  "due_date": "ISO 8601 date",
  "line_items": [
    {"description": "string", "quantity": "number", "unit_price": "number", "total": "number"}
  ],
  "subtotal": "number",
  "tax": "number",
  "total_amount": "number",
  "currency": "ISO 4217",
  "payment_terms": "string",
  "confidence": "0.0–1.0"
}
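
A record returned by the extraction step could be checked against this schema before routing. The sketch below uses only the Python standard library and rests on several assumptions: that every field in the example is required, that line item totals should sum to the subtotal, and that subtotal plus tax should equal the total amount within a small tolerance. The function name `validate_invoice` is hypothetical, and the currency check tests only the three-letter shape of the code, not membership in the ISO 4217 list.

```python
import re
from datetime import date

# Assumption: every field in the example schema is treated as required.
REQUIRED_FIELDS = (
    "document_type", "vendor_name", "invoice_number", "invoice_date",
    "due_date", "line_items", "subtotal", "tax", "total_amount",
    "currency", "payment_terms", "confidence",
)


def validate_invoice(record: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of validation errors; an empty list means the record passed."""
    errors = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    if errors:
        return errors  # later checks need all fields to be present

    for name in ("invoice_date", "due_date"):
        try:
            date.fromisoformat(record[name])
        except (TypeError, ValueError):
            errors.append(f"{name} is not an ISO 8601 date")

    # Shape check only: three uppercase letters, e.g. "USD" or "EUR".
    if not re.fullmatch(r"[A-Z]{3}", str(record["currency"])):
        errors.append("currency is not a three-letter code")

    conf = record["confidence"]
    if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append("confidence is outside 0.0-1.0")

    # Assumed arithmetic consistency checks, not stated in the schema itself.
    line_sum = sum(item["total"] for item in record["line_items"])
    if abs(line_sum - record["subtotal"]) > tolerance:
        errors.append("line item totals do not add up to subtotal")
    if abs(record["subtotal"] + record["tax"] - record["total_amount"]) > tolerance:
        errors.append("subtotal plus tax does not equal total_amount")
    return errors
```

Returning a list of errors instead of raising on the first failure lets a reviewer see every problem with a record in one pass, which suits an exception queue with annotated previews.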

Routing rules

After extraction, each document passes through a configurable routing engine:

Auto-approve: confidence ≥ 0.90 AND all required fields present AND amount within policy limits.
Escalate to senior reviewer: confidence ≥ 0.72 but amount above threshold or policy exceptions detected.
Human review queue: confidence < 0.72 OR required fields missing OR new document type.
Reject: malformed document, virus detected, or failed validation checksum.
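
One way to express these rules in code is shown below. It is a sketch under stated assumptions, not the routing engine itself: the `Extracted` record, the `Route` names, and the order in which the rules are evaluated (reject first, then human review, then escalation) are all choices made for the example. The rules above do not say what happens to a record with confidence between 0.72 and 0.90 that has no exceptions, so the sketch sends it to expedited review, following the confidence thresholds table.

```python
from dataclasses import dataclass
from enum import Enum

AUTO_APPROVE_THRESHOLD = 0.90
REVIEW_THRESHOLD = 0.72


class Route(Enum):
    REJECT = "reject"
    HUMAN_REVIEW = "human_review"
    ESCALATE = "escalate_senior"
    AUTO_APPROVE = "auto_approve"
    EXPEDITED_REVIEW = "expedited_review"  # assumed fallback, see lead-in


@dataclass
class Extracted:
    """Hypothetical summary of one extracted document."""
    confidence: float
    amount: float
    required_fields_present: bool = True
    known_document_type: bool = True
    policy_exceptions: bool = False
    passed_validation: bool = True  # False for malformed file, virus, or bad checksum


def route(doc: Extracted, amount_limit: float) -> Route:
    """Apply the routing rules in an assumed order of precedence."""
    if not doc.passed_validation:
        return Route.REJECT
    if (
        doc.confidence < REVIEW_THRESHOLD
        or not doc.required_fields_present
        or not doc.known_document_type
    ):
        return Route.HUMAN_REVIEW
    if doc.amount > amount_limit or doc.policy_exceptions:
        return Route.ESCALATE
    if doc.confidence >= AUTO_APPROVE_THRESHOLD:
        return Route.AUTO_APPROVE
    return Route.EXPEDITED_REVIEW
```

Checking rejection and human review before escalation means a low-confidence record never reaches a senior reviewer on amount alone; a deployment could reasonably order these differently.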

Confidence thresholds

Score range	Action	SLA
0.90 – 1.00	Straight-through processing	<1 hour
0.72 – 0.89	Expedited human review	Same day
0.00 – 0.71	Standard review queue	24–48 hours
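
The table can be read as a lookup from score to action and SLA. The sketch below assumes each band includes its lower bound, so a score such as 0.895 lands in the middle band; the function name `confidence_band` is hypothetical.

```python
from bisect import bisect_right

# Lower bounds of the second and third bands, in ascending order.
BAND_STARTS = [0.72, 0.90]

# (action, SLA) for each band, lowest scores first.
BANDS = [
    ("Standard review queue", "24–48 hours"),
    ("Expedited human review", "Same day"),
    ("Straight-through processing", "<1 hour"),
]


def confidence_band(score: float) -> tuple[str, str]:
    """Return the (action, SLA) pair for a confidence score in [0.0, 1.0]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence out of range: {score}")
    # bisect_right counts how many band starts are <= score.
    return BANDS[bisect_right(BAND_STARTS, score)]
```

Keeping the boundaries in one sorted list means a threshold change is a one-line edit, which fits a routing engine described as configurable.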

Deployment

The pipeline is containerised (Docker) and can deploy to:

Azure Container Apps — recommended for existing Azure Form Recognizer users.
AWS Lambda + S3 — event-triggered serverless variant for low-volume deployments.
On-premise VM — for air-gapped environments with sensitive document types.

The sandbox below simulates the classification and extraction stages client-side for demonstration purposes. Production deployments use Azure Form Recognizer for OCR and GPT-4o for extraction.