The invoice from supplier Smith Components has arrived. Can you process it?
Information extracted from documents in structured form, ready for the system.
Document Extractor reads complex documents that arrive at the company — signed contracts, invoices, certificates, regulatory documents, surveyor reports — and extracts the relevant information in structured form. The ready data lands in the management system, the CRM, the document archive without manual data entry.
Document Extractor at work.
Incoming invoice recognised. Supplier Smith Components Ltd (VAT in master data). Net €12,450 · VAT 22% · total €15,189. IBAN formally correct. All validations passed, data forwarded to the management system.
One billing line is new compared to this supplier's baseline ("Extraordinary technical consulting"). I have flagged the line for review before payment.
Why it exists.
A mid-sized or large company receives complex documents every day: signed contracts, incoming invoices, conformity certificates, technical surveys, regulatory documents. Each one contains information needed in downstream systems. Manual data entry is still the standard: it takes time, generates errors, and scales poorly.
How it works each day.
Document Extractor activates when a document arrives (webhook upload, front-desk scan, configured channel). It recognises the document type, extracts the relevant information in structured form, validates the data against configured rules (valid VAT number, positive numeric values, correct date format), and passes the structured data to the downstream system via API.
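The three example rules named above (VAT number, positive amount, date format) can be sketched as small functions that each return either nothing or the point to review. This is a minimal illustration, not the product's rule engine: the `ExtractedInvoice` shape, the field names, and the VAT pattern (country prefix plus 8 to 12 alphanumerics, a format check only) are assumptions.

```typescript
// Hypothetical shape of one extracted invoice; field names are illustrative.
interface ExtractedInvoice {
  supplierVat: string;
  issueDate: string; // expected as YYYY-MM-DD
  netAmount: number;
}

// A rule returns null when the value passes, or a message naming the point to review.
type Rule = (doc: ExtractedInvoice) => string | null;

function isIsoDate(value: string): boolean {
  if (!/^\d{4}-\d{2}-\d{2}$/.test(value)) return false;
  const parsed = new Date(value + "T00:00:00Z");
  if (isNaN(parsed.getTime())) return false;
  // Round-trip check rejects impossible dates such as 2024-02-30.
  return parsed.toISOString().slice(0, 10) === value;
}

const rules: Rule[] = [
  // Assumed pattern: two-letter country prefix plus 8 to 12 alphanumerics.
  (d) => (/^[A-Z]{2}[A-Z0-9]{8,12}$/.test(d.supplierVat) ? null : "supplierVat: unexpected format"),
  (d) => (isFinite(d.netAmount) && d.netAmount > 0 ? null : "netAmount: must be a positive number"),
  (d) => (isIsoDate(d.issueDate) ? null : "issueDate: expected YYYY-MM-DD"),
];

// Runs every rule and collects the messages of the ones that failed.
function validate(doc: ExtractedInvoice): string[] {
  return rules.map((rule) => rule(doc)).filter((msg): msg is string => msg !== null);
}
```

An empty result means the document can move on to the downstream system; any message in the list is a candidate for operator review.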
The decision stays with the team.
For standard validated documents the agent proceeds on its own. For documents with anomalies (missing fields, out-of-range values, unrecognised formats) it flags the document to the operator, with the specific point to review. Better one extra flag than a wrong datum in the system.
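The conservative rule described here can be pictured as a routing decision in which a single finding is enough to stop automatic processing. The sketch below assumes hypothetical `Finding` and `Decision` types; the actual agent's data model is not documented in this page.

```typescript
// Hypothetical finding produced by extraction or validation; names are illustrative.
interface Finding {
  field: string;
  kind: "missing" | "out_of_range" | "unrecognised_format";
}

type Decision =
  | { action: "forward" }
  | { action: "flag"; reviewPoints: string[] };

// Conservative routing: any finding sends the document to the operator.
function decide(findings: Finding[]): Decision {
  if (findings.length === 0) {
    return { action: "forward" };
  }
  return {
    action: "flag",
    reviewPoints: findings.map((f) => f.field + ": " + f.kind),
  };
}
```

The `reviewPoints` list carries the specific point to review, so the operator does not have to re-read the whole document.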
Who it serves and where it applies.
Head of operations
Reclaims the time the team spends on manual data entry. Capacity is sized on real document volume, not on the hours the team has available. Month-end and quarter-end peaks no longer turn into backlogs.
Validation clerk
Concentrates on the cases that require judgement (documents with missing fields, out-of-range values, a new supplier not in master data) instead of transcribing data already present in the document. The work becomes skilled work again.
Head of administration or compliance
Sees systemic transcription errors shrink: wrong amounts, mis-classifications, missing regulatory references. The trail of processed documents can be inspected at period end for internal audit.
A concrete example.
An invoice from supplier X arrives in the system.
For a mid-market B2B company with 800 incoming invoices per month, the integration of the intake channel and the invoice management system is delivered during the project. An invoice from supplier X arrives in the configured system. The agent activates and recognises the document type (incoming invoice).
Structured data passes to the management system via API.
The agent extracts the data: supplier (VAT number validated against master data), issue date, payment due in 60 days, billing lines, net amount €12,450, VAT 22%, total €15,189, IBAN (validated as a formally correct European IBAN). The structured data is passed to the invoice management system via API. The invoice is marked as "received, validated, awaiting payment".
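Two of the checks in this example are plain arithmetic. The totals are consistent because 22% of €12,450 is €2,739, and €12,450 + €2,739 = €15,189. A "formally correct" IBAN is usually taken to mean one that passes the standard mod-97 check. The sketch below shows both under assumptions: the function names are hypothetical, amounts are held in integer cents, and the IBAN check covers structure and checksum only, not country-specific lengths or whether the account exists.

```typescript
// Amounts in integer cents to avoid floating-point drift.
function totalsAreConsistent(netCents: number, vatRatePercent: number, totalCents: number): boolean {
  const vatCents = Math.round((netCents * vatRatePercent) / 100);
  return netCents + vatCents === totalCents;
}

// Formal IBAN check: general structure plus the mod-97 checksum.
function isFormallyValidIban(input: string): boolean {
  const iban = input.replace(/\s+/g, "").toUpperCase();
  if (!/^[A-Z]{2}\d{2}[A-Z0-9]{11,30}$/.test(iban)) return false;
  // Move country code and check digits to the end.
  const rearranged = iban.slice(4) + iban.slice(0, 4);
  let remainder = 0;
  for (let i = 0; i < rearranged.length; i++) {
    const code = rearranged.charCodeAt(i);
    // Digits stay as they are; letters map to 10..35 (A=10 ... Z=35).
    const digits = code >= 65 ? String(code - 55) : rearranged.charAt(i);
    for (let j = 0; j < digits.length; j++) {
      // Reduce digit by digit so the number never exceeds safe integer range.
      remainder = (remainder * 10 + Number(digits.charAt(j))) % 97;
    }
  }
  return remainder === 1;
}
```

For the invoice above, `totalsAreConsistent(1245000, 22, 1518900)` holds; an invoice whose total does not match its own net and VAT would be a natural candidate for a flag.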
For 95% of the standard cases the flow closes in seconds.
For 95% of standard invoices from known suppliers, the flow closes in seconds without manual intervention. The 5% with anomalies (amount significantly above the supplier's average, new billing line, changed master data) reach the responsible manager for validation, with the specific point highlighted. The event stays in the audit registry, readable with a standard SQL client.
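A comparison against the supplier's baseline could look like the sketch below. Everything in it is an assumption made for illustration: the `SupplierBaseline` shape, the `AMOUNT_FACTOR` threshold of 1.5 times the average, and the case-insensitive match on line descriptions. The real thresholds are whatever the customer's team configures.

```typescript
// Hypothetical per-supplier baseline built from previously processed invoices.
interface SupplierBaseline {
  averageNetAmount: number;
  knownLines: string[];
}

interface IncomingInvoice {
  netAmount: number;
  lines: string[];
}

// Assumed threshold: flag amounts above 1.5x the supplier's average.
const AMOUNT_FACTOR = 1.5;

function normalise(text: string): string {
  return text.trim().toLowerCase();
}

// Returns one message per anomaly; an empty list means the invoice matches the baseline.
function findAnomalies(invoice: IncomingInvoice, baseline: SupplierBaseline): string[] {
  const anomalies: string[] = [];
  if (invoice.netAmount > baseline.averageNetAmount * AMOUNT_FACTOR) {
    anomalies.push("net amount is significantly above the supplier average");
  }
  const known = baseline.knownLines.map(normalise);
  invoice.lines.forEach((line) => {
    if (known.indexOf(normalise(line)) === -1) {
      anomalies.push("new billing line: " + line);
    }
  });
  return anomalies;
}
```

With a baseline that knows only "Components supply", an invoice that adds "Extraordinary technical consulting" yields a single anomaly naming that line, which is the kind of specific point the reviewer sees.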
Configuration and technical resources.
The Document Extractor rules are declarative. The customer's operations or compliance team defines in a readable format the document taxonomy, the extraction schemas for each type, the validation rules (numeric ranges, expected formats, reference master data), and the routing rules to the downstream system. The rules live in the customer's repository, versioned, validated at agent startup. For scanned documents or complex PDFs, integration with an external OCR service (Microsoft Document Intelligence, Google Document AI, or equivalent) is delivered during the project by the Exelab team.
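To make "declarative rules validated at agent startup" concrete, here is a minimal sketch of what one rule set and its startup check might look like. The source does not specify the actual rule format, so the TypeScript object, the type names (`FieldRule`, `DocumentTypeRule`), the field names, and the `invoice-management` routing target are all hypothetical.

```typescript
type FieldType = "string" | "number" | "date";

// Hypothetical extraction schema entry for one field.
interface FieldRule {
  name: string;
  type: FieldType;
  required: boolean;
  min?: number;
  max?: number;
}

// Hypothetical rule for one document type: schema plus routing target.
interface DocumentTypeRule {
  type: string;
  fields: FieldRule[];
  routeTo: string;
}

const ruleSet: DocumentTypeRule[] = [
  {
    type: "incoming_invoice",
    fields: [
      { name: "supplierVat", type: "string", required: true },
      { name: "netAmount", type: "number", required: true, min: 0.01 },
      { name: "issueDate", type: "date", required: true },
    ],
    routeTo: "invoice-management",
  },
];

// Startup check: returns configuration errors; an empty list means the rule set is usable.
function checkRuleSet(rules: DocumentTypeRule[]): string[] {
  const errors: string[] = [];
  const seenTypes: string[] = [];
  rules.forEach((rule) => {
    if (seenTypes.indexOf(rule.type) !== -1) errors.push("duplicate document type: " + rule.type);
    seenTypes.push(rule.type);
    if (rule.fields.length === 0) errors.push(rule.type + ": no fields defined");
    if (rule.routeTo.trim() === "") errors.push(rule.type + ": missing routing target");
    rule.fields.forEach((f) => {
      if (f.min !== undefined && f.max !== undefined && f.min > f.max) {
        errors.push(rule.type + "." + f.name + ": min is greater than max");
      }
    });
  });
  return errors;
}
```

Keeping the rules as data rather than code is what lets them live in the customer's repository and be reviewed and versioned like any other configuration.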
- Language: TypeScript (Node.js)
- LLM model: customer's choice: Anthropic, OpenAI, Mistral, open source models hosted internally, AWS Bedrock for a private model
- Built-in controls used: pii-detector, credential-detector, tool-param-validator, message-length-limit
- Native channels: OpenAI-compatible HTTP (upload webhook), Telegram, Slack, WhatsApp
- Corporate email as intake channel: integration delivered during the project
- OCR and complex PDF parsing: external service (Microsoft Document Intelligence, Google Document AI, or equivalent) configured during the project, not a built-in runtime capability
- Downstream management system integration: dedicated adapter delivered during the project
- Memory: persistent per instance, pgvector + PostgreSQL FTS on the customer's document patterns
- Registry: immutable, queryable with a standard SQL client
Frequently asked questions about the agent.
For scanned documents or complex PDFs, the typical flow uses an external OCR service — Microsoft Document Intelligence, Google Document AI, or proprietary services — configured during the project. OCR is not a built-in runtime capability. The pattern is external OCR + the agent's structured extraction on the resulting text.
Every extraction is validated against rules configured by the team. When a value does not pass validation (unrecognised VAT number, amount out of range, wrong date format), the agent does not proceed on its own: it flags the anomaly to the operator with the specific point highlighted. The applied rule is conservative: better one extra flag than a wrong datum in the system.
Supported types depend on the declarative rules configured by the customer's team during the project. The extraction logic is generic — the LLM model reads structured text and applies parsing rules. Patterns already proven include European invoices, standard contracts, ISO certificates, insurance surveys.
The typical timeline for Document Extractor is 8-14 weeks: discovery 2 weeks, extraction rule configuration for the main document types 3-5 weeks, OCR integration (if required) and downstream management system integration 3-5 weeks, hand-off 1-2 weeks. Duration depends on the number of document types covered.
From a 30-minute conversation to the squad in production.
A 30-45 minute conversation to understand how Document Extractor would be configured for the customer's case: which document types, which volume, which downstream system.