How the pipeline classifies documents, extracts structured fields, and decides routing with quantified confidence.
Documents arrive via three channels: email (IMAP watch with attachment extraction), SharePoint (Graph API file events), and REST API (multipart upload). A validation layer checks MIME type, file size (<25MB), and runs a hash-based deduplication check to prevent re-processing.
PDFs are rendered at 300dpi using pdf2image. Images undergo deskewing (Hough transform), binarisation, and contrast normalisation before OCR submission.
Azure Form Recognizer's prebuilt-document model extracts text blocks, tables, and key-value pairs with bounding-box coordinates. Layout analysis identifies headers, footers, signature blocks, and stamped fields.
Tables are parsed into structured JSON arrays. Hand-written fields are passed through an additional handwriting recognition step (prebuilt-read), which typically achieves 88–94% accuracy on field names and 82–91% on values.
A DistilBERT classifier fine-tuned on 12 document classes uses the first 512 tokens of the OCR output. Training used ~4,000 labelled samples per class with data augmentation (synonym replacement, back-translation).
Classification accuracy by class ranges from 91% (medical letters) to 99.1% (invoices). Class probability is used as a confidence signal — probabilities below 0.72 route to the human review queue without extraction.
Type-specific extraction prompts instruct GPT-4o to return JSON conforming to a Pydantic schema. Structured output mode (response_format: json_schema) enforces the schema client-side, eliminating hallucination of unknown fields.
A self-consistency check re-runs extraction on a 20% random sample with temperature=0.4 and compares key numeric fields. Divergence >5% on critical fields (amounts, dates) triggers a confidence penalty and human review flag.
A composite confidence score combines: classifier probability (40%), required-field completeness (30%), and self-consistency agreement (30%). Scores ≥ 0.90 auto-route; 0.72–0.89 escalate; <0.72 queue for review.
Routing actions POST extracted JSON to the configured target endpoint (ERP, CRM, or SharePoint list) via authenticated API calls. The audit log records the document hash, confidence score, routing decision, and timestamp — retained for 7 years for compliance.