AGENT · DOCUMENT EXTRACTOR

Information extracted from documents in structured form, ready for the system.

Document Extractor reads the complex documents that arrive at the company (signed contracts, invoices, certificates, regulatory documents, surveyor reports) and extracts the relevant information in structured form. The resulting data lands in the management system, the CRM, or the document archive without manual data entry.

Talk to the team → See the other agents for operations

Agent Builder · document-extractor.yaml ● live

role: document-extractor
tools: [doc_intake, vat_validate, erp_write]
channels: webhook · slack · email
rules: 5 doc types · live

09:14:02 · ev_d2a1b8 · invoice rossi-spa · validated → ERP
09:14:11 · ev_e7c3f0 · warn new line "extraordinary consulting"
09:14:17 · ev_f1k9m4 · block contract-x · VAT number not recognised

02 AGENT IN ACTION

Document Extractor at work.

Ops Team #incoming-invoices

Anna · Head of Admin · 09:12

The invoice from supplier Smith Components has arrived. Can you process it?

Document Extractor · APP · 09:14

Incoming invoice recognised. Supplier Smith Components Ltd (VAT in master data). Net €12,450 · VAT 22% · total €15,189. IBAN formally correct. All validations passed, data forwarded to the management system.

Document Extractor · APP · 09:14

One billing line is new compared to this supplier's baseline ("Extraordinary technical consulting"). I have flagged the line for review before payment is released.

Context

Why it exists.

A mid-sized or large company receives complex documents every day: signed contracts, incoming invoices, conformity certificates, technical surveys, regulatory documents. Each one contains information needed in downstream systems. Manual data entry is still the standard: it takes time, generates errors, and scales poorly.

What it does

How it works each day.

Document Extractor activates when a document arrives (webhook upload, front-desk scan, configured channel). It recognises the document type, extracts the relevant information in structured form, validates the data against configured rules (valid VAT number, positive numeric values, correct date format), and passes the structured data to the downstream system via API.
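The validation step described above can be sketched in TypeScript as follows. This is a minimal illustration, not the product's actual rule engine: the `ExtractedInvoice` shape, the `validateInvoice` function and the VAT format pattern are assumptions made for the example, and a real deployment would check the VAT number against master data rather than by format alone.

```typescript
// Hypothetical shape of an extracted invoice; field names are illustrative.
interface ExtractedInvoice {
  vatNumber: string; // e.g. "IT01234567890"
  netAmount: number;
  issueDate: string; // expected as YYYY-MM-DD
}

interface ValidationIssue {
  field: keyof ExtractedInvoice;
  message: string;
}

// Assumed format-only check: two-letter country prefix plus 2-12 alphanumerics.
const VAT_FORMAT = /^[A-Z]{2}[A-Z0-9]{2,12}$/;

function isIsoDate(value: string): boolean {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const parsed = new Date(`${value}T00:00:00Z`);
  if (Number.isNaN(parsed.getTime())) return false;
  // Rejects impossible dates such as 2024-02-30 that an engine may roll over.
  return parsed.toISOString().slice(0, 10) === value;
}

function validateInvoice(doc: ExtractedInvoice): ValidationIssue[] {
  const issues: ValidationIssue[] = [];
  if (!VAT_FORMAT.test(doc.vatNumber)) {
    issues.push({ field: "vatNumber", message: "VAT number format not recognised" });
  }
  if (!(Number.isFinite(doc.netAmount) && doc.netAmount > 0)) {
    issues.push({ field: "netAmount", message: "amount must be a positive number" });
  }
  if (!isIsoDate(doc.issueDate)) {
    issues.push({ field: "issueDate", message: "date is not a valid YYYY-MM-DD value" });
  }
  return issues;
}
```

An empty result means the document can be passed on; any entry names the specific field to review.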

Supervision

The decision stays with the team.

For standard documents that pass validation, the agent proceeds on its own. For documents with anomalies (missing fields, out-of-range values, unrecognised formats), it flags the document to the operator with the specific point to review. Better one extra flag than a wrong value in the system.
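As an illustration of this conservative policy, the sketch below forwards a document only when no anomaly is found and otherwise returns the specific points to review. The `Extraction`, `Policy` and `decide` names are hypothetical and do not reflect the agent's internal API.

```typescript
// Hypothetical extraction result: a recognised type plus loosely typed fields.
interface Extraction {
  docType: string; // "unknown" when the format was not recognised
  fields: Record<string, string | number | undefined>;
}

// Hypothetical policy: required field names and allowed numeric ranges.
interface Policy {
  required: string[];
  ranges: Record<string, { min: number; max: number }>;
}

type Decision =
  | { action: "forward" }
  | { action: "flag"; reasons: string[] };

function decide(doc: Extraction, policy: Policy): Decision {
  const reasons: string[] = [];
  if (doc.docType === "unknown") reasons.push("unrecognised format");
  for (const name of policy.required) {
    const value = doc.fields[name];
    if (value === undefined || value === "") reasons.push(`missing field: ${name}`);
  }
  for (const [name, range] of Object.entries(policy.ranges)) {
    const value = doc.fields[name];
    if (typeof value === "number" && (value < range.min || value > range.max)) {
      reasons.push(`out of range: ${name}`);
    }
  }
  // Any single anomaly is enough to hand the document to the operator.
  return reasons.length === 0 ? { action: "forward" } : { action: "flag", reasons };
}
```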

03 WHO IT SERVES

Who it serves and where it applies.

Head of operations

Reclaims the time the team spends on manual data entry. Capacity is sized on the real document volume, not on the hours the team has available. Month-end and quarter-end peaks no longer turn into backlogs.

SAD · #SAD-2024-877 Reg. 952/2013

Goods code 8703.23

Origin DE · EU

ICS2 · pre-arrival transmitted

AEO record updated

Validation clerk

Concentrates on the cases that require judgement (documents with missing fields, out-of-range values, a new supplier not in master data) instead of transcribing data already present in the document. The work becomes skilled work again.

Proposal no. 2024-081 In review

⚠

Missing disclosure

MiFID II art. · regulated financial instrument

Alt. 1 …in compliance with MiFID II and applicable supervisory provisions.

Alt. 2 …with full disclosure attached to the offer document.

Audit trace recorded · 14:31

Head of administration or compliance

Sees systematic transcription errors shrink: wrong amounts, misclassifications, missing regulatory references. The trace of processed documents can be inspected at period end for internal audit.

fnol.receive 09:14:22 ALLOW

triage.classify 09:14:25 ALLOW

idd.check 09:14:31 WARN

liquidation.propose 09:15:02 ALLOW

SELECT * FROM audit_log WHERE claim_id = '2024-0847'

04 EXAMPLE OF A PROCESS

A concrete example.

Document arrival

An invoice from supplier X arrives in the system.

For a mid-market B2B company with 800 incoming invoices per month, the integration between the intake channel and the invoice management system is delivered during the project. An invoice from supplier X arrives in the configured system. The agent activates and recognises the document type (incoming invoice).

Extraction and validation

Structured data passes to the management system via API.

The agent extracts the data: supplier (VAT number validated against master data), issue date, payment due in 60 days, billing lines, net amount €12,450, VAT 22%, total €15,189, IBAN (validated as a formally correct European IBAN). The structured data is passed to the invoice management system via API. The invoice is marked as "received, validated, awaiting payment".
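Two of the checks named above lend themselves to a short worked example: the formal IBAN check (the standard mod-97 test) and the consistency of net, VAT and total. With the figures from the example, 22% of €12,450 is €2,739, and €12,450 + €2,739 = €15,189. The function names below are illustrative assumptions, and amounts are held in cents to avoid floating-point rounding.

```typescript
// Formal IBAN check (mod-97): move the first four characters to the end,
// map letters to numbers (A=10 ... Z=35), and require remainder 1 modulo 97.
function isFormallyValidIban(raw: string): boolean {
  const iban = raw.replace(/\s+/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    const value = code >= 65 ? code - 55 : code - 48; // letter or digit
    // Letters contribute two decimal digits, digits contribute one.
    remainder = (remainder * (value > 9 ? 100 : 10) + value) % 97;
  }
  return remainder === 1;
}

// Checks that net + VAT equals the stated total, with all amounts in cents.
function totalsAreConsistent(netCents: number, vatRatePercent: number, totalCents: number): boolean {
  const expectedVatCents = Math.round((netCents * vatRatePercent) / 100);
  return netCents + expectedVatCents === totalCents;
}
```

This only confirms that the IBAN is well formed; it says nothing about whether the account exists or belongs to the supplier.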

The exception

For 95% of the standard cases the flow closes in seconds.

For 95% of standard invoices from a known supplier, the flow closes in seconds without manual intervention. The 5% with anomalies (an amount significantly above the supplier's average, a new billing line, changed master data) reach the responsible manager for validation with the specific point highlighted. Each event stays in the audit registry, readable with a standard SQL client.
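How an anomaly might be detected against a supplier's history can be sketched as below. The baseline structure, the `findAnomalies` function and the 1.5x threshold standing in for "significantly above average" are assumptions for illustration; actual thresholds would come from the rules configured by the team.

```typescript
// Hypothetical per-supplier baseline built from previously processed invoices.
interface SupplierBaseline {
  averageNet: number;
  knownLines: string[];
}

interface IncomingInvoice {
  netAmount: number;
  lines: string[];
}

// Compare billing lines case-insensitively and ignore surrounding whitespace.
function normaliseLine(line: string): string {
  return line.trim().toLowerCase();
}

function findAnomalies(
  invoice: IncomingInvoice,
  baseline: SupplierBaseline,
  maxRatio: number = 1.5, // assumed threshold, not taken from the source
): string[] {
  const anomalies: string[] = [];
  if (invoice.netAmount > baseline.averageNet * maxRatio) {
    anomalies.push("net amount significantly above supplier average");
  }
  const known = new Set(baseline.knownLines.map(normaliseLine));
  for (const line of invoice.lines) {
    if (!known.has(normaliseLine(line))) anomalies.push(`new billing line: ${line}`);
  }
  return anomalies;
}
```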

05 CONFIGURATION

Configuration and technical resources.

The Document Extractor rules are declarative. The customer's operations or compliance team defines, in a readable format, the document taxonomy, the extraction schema for each type, the validation rules (numeric ranges, expected formats, reference master data), and the routing rules to the downstream system. The rules live in the customer's repository, where they are versioned, and are validated at agent startup. For scanned documents or complex PDFs, integration with an external OCR service (Microsoft Document Intelligence, Google Document AI, or equivalent) is delivered during the project by the Exelab team.
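The rule format itself is not shown here, so the following is only a sketch of what a declarative rule set and its startup check could look like, written as a TypeScript object. All type names, field names, the `erp` routing target and the `checkRuleSet` function are hypothetical.

```typescript
type FieldType = "string" | "number" | "date";

// Hypothetical extraction schema entry for one field of a document type.
interface FieldRule {
  type: FieldType;
  required: boolean;
  min?: number; // only meaningful for numeric fields
  max?: number;
}

interface DocumentTypeRule {
  fields: Record<string, FieldRule>;
  routeTo: string; // hypothetical name of the downstream target
}

type RuleSet = Record<string, DocumentTypeRule>;

const rules: RuleSet = {
  invoice: {
    fields: {
      vatNumber: { type: "string", required: true },
      netAmount: { type: "number", required: true, min: 0 },
      issueDate: { type: "date", required: true },
    },
    routeTo: "erp",
  },
};

// Startup check: returns a list of configuration errors, empty when the set is usable.
function checkRuleSet(ruleSet: RuleSet): string[] {
  const errors: string[] = [];
  for (const [docType, rule] of Object.entries(ruleSet)) {
    if (rule.routeTo.trim() === "") errors.push(`${docType}: missing routing target`);
    if (Object.keys(rule.fields).length === 0) errors.push(`${docType}: no fields defined`);
    for (const [name, field] of Object.entries(rule.fields)) {
      const hasRange = field.min !== undefined || field.max !== undefined;
      if (field.type !== "number" && hasRange) {
        errors.push(`${docType}.${name}: range on non-numeric field`);
      }
      if (field.min !== undefined && field.max !== undefined && field.min > field.max) {
        errors.push(`${docType}.${name}: min greater than max`);
      }
    }
  }
  return errors;
}
```

Failing fast at startup on a non-empty error list keeps a misconfigured rule from silently reaching production documents.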

SPEC SHEET

Language: TypeScript (Node.js)
LLM: customer's choice among Anthropic, OpenAI, Mistral, open-source models hosted internally, or AWS Bedrock for a private model
Built-in controls used: pii-detector, credential-detector, tool-param-validator, message-length-limit
Native channels: OpenAI-compatible HTTP (upload webhook), Telegram, Slack, WhatsApp
Corporate email as intake channel: integration delivered during the project
OCR and complex PDF parsing: external service (Microsoft Document Intelligence, Google Document AI, or equivalent) configured during the project — not a built-in runtime capability
Downstream management system integration: dedicated adapter delivered during the project
Memory: persistent per instance, pgvector + PostgreSQL FTS on the customer's document patterns
Registry: immutable, queryable with a standard SQL client

06 FREQUENTLY ASKED QUESTIONS

Frequently asked questions about the agent.

01 Does the agent also read scanned documents?

For scanned documents or complex PDFs, the typical flow uses an external OCR service — Microsoft Document Intelligence, Google Document AI, or proprietary services — configured during the project. OCR is not a built-in runtime capability. The pattern is external OCR + the agent's structured extraction on the resulting text.

02 How are false positives (incorrect extractions) handled?

Every extraction is validated against rules configured by the team. When a value does not pass validation (unrecognised VAT number, amount out of range, wrong date format), the agent does not proceed on its own: it flags the anomaly to the operator with the specific point highlighted. The applied rule is conservative: better one extra flag than a wrong datum in the system.

03 Which document types are supported?

Supported types depend on the declarative rules configured by the customer's team during the project. The extraction logic is generic: the LLM reads structured text and applies parsing rules. Patterns already proven include European invoices, standard contracts, ISO certificates, and insurance surveys.

04 How long does the first production deploy take?

The typical timeline for Document Extractor is 8-14 weeks: discovery 2 weeks, extraction rule configuration for the main document types 3-5 weeks, OCR integration (if required) and downstream management system integration 3-5 weeks, hand-off 1-2 weeks. Duration depends on the number of document types covered.

From a 30-minute conversation to the squad in production.

A 30-45 minute conversation to understand how Document Extractor would be configured for the customer's case: which document types, which volume, which downstream system.
