AGENT · DOCUMENT EXTRACTOR

Information extracted from documents in structured form, ready for the system.

Document Extractor reads the complex documents that arrive at the company (signed contracts, invoices, certificates, regulatory documents, surveyor reports) and extracts the relevant information in structured form. The resulting data lands in the management system, the CRM, or the document archive without manual data entry.

02 · AGENT IN ACTION

Document Extractor at work.

Context

Why it exists.

A mid-sized to large company receives complex documents every day: signed contracts, incoming invoices, conformity certificates, technical surveys, regulatory documents. Each one contains information needed in downstream systems. Manual data entry is still the standard: it takes time, generates errors, and scales poorly.

What it does

How it works each day.

Document Extractor activates when a document arrives (webhook upload, front-desk scan, configured channel). It recognises the document type, extracts the relevant information in structured form, validates the data against configured rules (valid VAT number, positive numeric values, correct date format), and passes the structured data to the downstream system via API.
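The validation step can be pictured with a minimal sketch of the three example rules named above. Everything here is an assumption made for illustration: the `ExtractedInvoice` shape, the function names, the Italian-style VAT format (11 digits with an optional `IT` prefix) and the ISO date format are hypothetical and do not describe the agent's actual rule engine.

```typescript
// Hypothetical shape of an extracted record; field names are illustrative only.
interface ExtractedInvoice {
  vatNumber: string;
  netAmount: number;
  issueDate: string; // assumed format: YYYY-MM-DD
}

interface Issue {
  field: keyof ExtractedInvoice;
  message: string;
}

// Format-only check (assumption: Italian VAT number, optional "IT" prefix + 11 digits).
// Looking the number up in master data is a separate step not shown here.
function isVatFormatValid(vat: string): boolean {
  return /^(IT)?\d{11}$/.test(vat.replace(/\s+/g, "").toUpperCase());
}

// Accepts only real calendar dates written as YYYY-MM-DD.
function isIsoDate(value: string): boolean {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const parsed = new Date(`${value}T00:00:00Z`);
  return !Number.isNaN(parsed.getTime()) && parsed.toISOString().slice(0, 10) === value;
}

// Returns one issue per failed rule; an empty list means the record passed.
function validateInvoice(inv: ExtractedInvoice): Issue[] {
  const issues: Issue[] = [];
  if (!isVatFormatValid(inv.vatNumber)) {
    issues.push({ field: "vatNumber", message: "VAT number format not recognised" });
  }
  if (!Number.isFinite(inv.netAmount) || inv.netAmount <= 0) {
    issues.push({ field: "netAmount", message: "amount must be a positive number" });
  }
  if (!isIsoDate(inv.issueDate)) {
    issues.push({ field: "issueDate", message: "date is not a valid YYYY-MM-DD date" });
  }
  return issues;
}
```

Returning a list of issues rather than a boolean keeps the failing field visible, which is what a downstream review step would need.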

Supervision

The decision stays with the team.

For standard validated documents the agent proceeds on its own. For documents with anomalies — missing fields, out-of-range values, unrecognised formats — it flags to the operator with the specific point to review. Better one extra flag than a wrong datum in the system.
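A minimal sketch of this routing decision follows, assuming a hypothetical `Anomaly` type and function names that do not come from the product. It only illustrates the idea that any anomaly stops the automatic flow and carries the specific point to review.

```typescript
// Hypothetical anomaly categories mirroring the three cases named in the text.
type Anomaly = {
  field: string;
  reason: "missing" | "out_of_range" | "unrecognised_format";
};

// Either the document proceeds on its own, or it goes to the operator
// together with the exact points to review.
type Decision =
  | { action: "forward" }
  | { action: "flag"; reviewPoints: Anomaly[] };

// Reports required fields that are absent or empty in the extracted record.
function findMissing(record: Record<string, unknown>, required: string[]): Anomaly[] {
  return required
    .filter((name) => record[name] === undefined || record[name] === null || record[name] === "")
    .map((name) => ({ field: name, reason: "missing" as const }));
}

// Conservative rule: a single anomaly is enough to flag the document.
function decide(anomalies: Anomaly[]): Decision {
  return anomalies.length === 0
    ? { action: "forward" }
    : { action: "flag", reviewPoints: anomalies };
}
```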

03 · WHO IT SERVES

Who it serves and where it applies.

Head of operations

Reclaims for the team the time spent on manual data entry. Capacity is sized on real document volume, not on the hours the team has available. Month-end and quarter-end peaks no longer turn into backlogs.

Illustration: SAD #SAD-2024-877 (Reg. 952/2013); goods code 8703.23; origin DE (EU); ICS2 pre-arrival transmitted; AEO record updated.

Validation clerk

Concentrates on the cases that require judgement (documents with missing fields, out-of-range values, a new supplier not in master data) instead of transcribing data already present in the document. The work becomes skilled work again.

Illustration: Proposal no. 2024-081, in review. Missing disclosure: MiFID II art. · regulated financial instrument. Alt. 1: "…in compliance with MiFID II and applicable supervisory provisions." Alt. 2: "…with full disclosure attached to the offer document." Audit trace recorded at 14:31.

Head of administration or compliance

Sees systemic transcription errors shrink: wrong amounts, mis-classifications, missing regulatory references. The trace of processed documents is inspectable at period end for internal audit.

Illustration: audit trace. fnol.receive 09:14:22 ALLOW; triage.classify 09:14:25 ALLOW; idd.check 09:14:31 WARN; liquidation.propose 09:15:02 ALLOW. Query: SELECT * FROM audit_log WHERE claim_id = '2024-0847'

04 · EXAMPLE OF A PROCESS

A concrete example.

Document arrival

An invoice from supplier X arrives in the system.

Take a mid-market B2B company with 800 incoming invoices per month. The integration between the intake channel and the invoice management system is delivered during the project. An invoice from supplier X arrives in the configured system. The agent activates and recognises the document type (incoming invoice).

Extraction and validation

Structured data passes to the management system via API.

The agent extracts the data: supplier (VAT number validated against master data), issue date, payment due in 60 days, billing lines, net amount €12,450, VAT 22%, total €15,189, IBAN (validated as a formally correct European IBAN). The structured data is passed to the invoice management system via API. The invoice is marked as "received, validated, awaiting liquidation".
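Two of the checks in this example can be worked through explicitly: the consistency of the amounts (€12,450 net at 22% VAT gives €2,739 of VAT and a total of €15,189) and the formal validity of the IBAN. The sketch below assumes the formal check is the standard mod-97 checksum defined for IBANs; the function names are hypothetical, and per-country length rules are deliberately left out.

```typescript
// Work in integer cents so that VAT arithmetic does not suffer floating-point drift.
function totalCents(netCents: number, vatRatePercent: number): number {
  return netCents + Math.round((netCents * vatRatePercent) / 100);
}

// Formal IBAN check: basic structure plus the mod-97 checksum.
// It does not verify that the account exists, nor country-specific lengths.
function isIbanFormallyValid(raw: string): boolean {
  const iban = raw.replace(/\s+/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;

  // Move country code and check digits to the end, then read letters as 10..35.
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    const value = code >= 65 ? code - 55 : code - 48; // 'A' is 65, '0' is 48
    // A letter contributes two decimal digits, a digit contributes one.
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}

// Amounts from the example: 1,245,000 cents net at 22% VAT.
const expectedTotal = totalCents(1_245_000, 22); // 1,518,900 cents, i.e. EUR 15,189.00
```

Computing the remainder digit by digit avoids big-number arithmetic, since the intermediate value never exceeds a few thousand.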

The exception

For 95% of the standard cases the flow closes in seconds.

For 95% of standard invoices from a known supplier, the flow closes in seconds without manual intervention. The remaining 5%, those with anomalies (amount significantly above the supplier's average, new billing line, changed master data), reach the person responsible for validation with the specific point highlighted. The event is recorded in the audit registry, readable with a standard SQL client.
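As an illustration of how such anomalies might be detected, here is a minimal sketch under stated assumptions: the `SupplierProfile` shape is hypothetical, and "significantly above the supplier's average" is interpreted as exceeding a configurable multiple of the historical mean (a factor of 2 is an arbitrary placeholder, not a product default).

```typescript
// Hypothetical per-supplier history; in practice this would come from master data.
interface SupplierProfile {
  pastAmounts: number[];
  knownLineCodes: string[];
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Returns the specific points a reviewer should look at; empty means "standard case".
function reviewPoints(
  amount: number,
  lineCodes: string[],
  profile: SupplierProfile,
  factor = 2, // assumed threshold, to be set by the customer's team
): string[] {
  const points: string[] = [];
  if (profile.pastAmounts.length > 0) {
    const avg = mean(profile.pastAmounts);
    if (amount > factor * avg) {
      points.push(`amount ${amount} exceeds ${factor}x the supplier average ${avg}`);
    }
  }
  for (const code of lineCodes) {
    if (profile.knownLineCodes.indexOf(code) === -1) {
      points.push(`new billing line: ${code}`);
    }
  }
  return points;
}
```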

05 · CONFIGURATION

Configuration and technical resources.

The Document Extractor rules are declarative. The customer's operations or compliance team defines in a readable format the document taxonomy, the extraction schemas for each type, the validation rules (numeric ranges, expected formats, reference master data), and the routing rules to the downstream system. The rules live in the customer's repository, versioned, validated at agent startup. For scanned documents or complex PDFs, integration with an external OCR service (Microsoft Document Intelligence, Google Document AI, or equivalent) is delivered during the project by the Exelab team.
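To make the idea of declarative, startup-validated rules concrete, the following sketch shows one possible shape. The source does not specify the rule format, so the types, field names, and the `checkRulesAtStartup` function are all hypothetical.

```typescript
// Hypothetical rule vocabulary: numeric ranges, expected formats, master-data lookups.
type FieldRule =
  | { kind: "number"; min?: number; max?: number }
  | { kind: "date"; format: "YYYY-MM-DD" }
  | { kind: "masterData"; source: string };

// One entry per document type in the taxonomy, with its schema and routing target.
interface DocumentTypeRule {
  type: string;
  fields: Record<string, FieldRule>;
  routeTo: string;
}

const rules: DocumentTypeRule[] = [
  {
    type: "incoming_invoice",
    fields: {
      netAmount: { kind: "number", min: 0.01 },
      issueDate: { kind: "date", format: "YYYY-MM-DD" },
      supplierVat: { kind: "masterData", source: "suppliers" },
    },
    routeTo: "invoice-management",
  },
];

// Sanity checks run once at startup; a non-empty result should block the agent.
function checkRulesAtStartup(all: DocumentTypeRule[]): string[] {
  const errors: string[] = [];
  const seen: Record<string, boolean> = {};
  for (const rule of all) {
    if (seen[rule.type]) errors.push(`duplicate document type: ${rule.type}`);
    seen[rule.type] = true;
    if (rule.routeTo.trim() === "") errors.push(`${rule.type}: missing routing target`);
    const names = Object.keys(rule.fields);
    if (names.length === 0) errors.push(`${rule.type}: no fields defined`);
    for (const name of names) {
      const field = rule.fields[name];
      if (field.kind === "number" && field.min !== undefined && field.max !== undefined && field.min > field.max) {
        errors.push(`${rule.type}.${name}: min is greater than max`);
      }
    }
  }
  return errors;
}
```

Keeping the rules as plain data is what makes them versionable in a repository and checkable before any document is processed.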

SPEC SHEET
Language: TypeScript (Node.js)
LLM model: customer's choice among Anthropic, OpenAI, Mistral, open source models hosted internally, or AWS Bedrock for a private model
Built-in controls used: pii-detector, credential-detector, tool-param-validator, message-length-limit
Native channels: OpenAI-compatible HTTP (upload webhook), Telegram, Slack, WhatsApp
Corporate email as intake channel: integration delivered during the project
OCR and complex PDF parsing: external service (Microsoft Document Intelligence, Google Document AI, or equivalent) configured during the project; not a built-in runtime capability
Downstream management system integration: dedicated adapter delivered during the project
Memory: persistent per instance, pgvector + PostgreSQL FTS on the customer's document patterns
Registry: immutable, queryable with a standard SQL client

06 · FREQUENTLY ASKED QUESTIONS

Frequently asked questions about the agent.

For scanned documents or complex PDFs, the typical flow uses an external OCR service — Microsoft Document Intelligence, Google Document AI, or proprietary services — configured during the project. OCR is not a built-in runtime capability. The pattern is external OCR + the agent's structured extraction on the resulting text.
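The two-stage pattern can be sketched as follows. Both interfaces are hypothetical: `OcrClient` is not the API of any real OCR product, and `Extractor` stands in for the agent's LLM-based extraction. The stubs exist only so the sketch runs on its own.

```typescript
// Hypothetical contracts for the two stages; real services would sit behind them.
interface OcrClient {
  recognise(file: Uint8Array): Promise<string>;
}

interface Extractor {
  extract(text: string): Promise<Record<string, string>>;
}

// Stage 1: the external OCR service turns the scan into text.
// Stage 2: structured extraction runs on that text, never on the image itself.
async function processScan(file: Uint8Array, ocr: OcrClient, extractor: Extractor) {
  const text = await ocr.recognise(file);
  const fields = await extractor.extract(text);
  return { text, fields };
}

// Stub OCR returning fixed text, in place of a real service call.
const fakeOcr: OcrClient = {
  recognise: async () => "Invoice total: 15189.00 EUR",
};

// Stub extractor using a regular expression, in place of the LLM step.
const regexExtractor: Extractor = {
  extract: async (text) => {
    const match = /total:\s*([\d.]+)/i.exec(text);
    return match ? { total: match[1] } : {};
  },
};
```

Injecting both stages as interfaces keeps the OCR provider replaceable, which fits the "or equivalent" wording above.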

Every extraction is validated against rules configured by the team. When a value does not pass validation (unrecognised VAT number, amount out of range, wrong date format), the agent does not proceed on its own: it flags the anomaly to the operator with the specific point highlighted. The applied rule is conservative: better one extra flag than a wrong datum in the system.

Supported types depend on the declarative rules configured by the customer's team during the project. The extraction logic is generic: the LLM reads structured text and applies parsing rules. Patterns already proven include European invoices, standard contracts, ISO certificates, and insurance surveys.

The typical timeline for Document Extractor is 8-14 weeks: discovery 2 weeks, extraction rule configuration for the main document types 3-5 weeks, OCR integration (if required) and downstream management system integration 3-5 weeks, hand-off 1-2 weeks. Duration depends on the number of document types covered.

From a 30-minute conversation to the squad in production.

A 30-45 minute conversation to understand how Document Extractor would be configured for the customer's case: which document types, which volume, which downstream system.