SYSTEM Cited by 1 source

Amazon Textract

What it is

Amazon Textract is AWS's managed document text and metadata extraction service — OCR plus form / table / key-value / signature detection over scanned and digital documents. Output is structured JSON with bounding-box positions and per-element confidence scores.

Stub page — expand as additional Textract-internals sources are ingested.

How it appears in architectures

Textract is the canonical OCR + structure-extraction layer in AWS document-intelligence pipelines. It typically sits between a document-upload tier (S3) and a downstream LLM / processing tier (Bedrock + custom logic), invoked by an orchestrator (Lambda):

S3 (upload) → Lambda (orchestrator) → Textract (OCR + metadata)
                                         │
                                         ▼
                              Custom processing / LLM
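
The orchestration step in the diagram above can be sketched as a Lambda handler. This is a minimal illustration, not the pipeline from any cited source: it assumes the function is triggered by S3 object-created events, that the asynchronous `StartDocumentAnalysis` operation is the desired entry point, and that `FORMS` and `TABLES` are the features of interest. The optional `textract` parameter is a hypothetical addition so the handler can be exercised with a stub client.

```python
import urllib.parse


def handler(event, context, textract=None):
    """Start one asynchronous Textract analysis job per uploaded S3 object."""
    if textract is None:
        # Imported lazily so the module can be loaded without AWS dependencies.
        import boto3
        textract = boto3.client("textract")

    job_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded
        # (for example, spaces arrive as '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        response = textract.start_document_analysis(
            DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
            FeatureTypes=["FORMS", "TABLES"],  # assumption: adjust per use case
        )
        job_ids.append(response["JobId"])
    return {"job_ids": job_ids}
```

The handler only starts jobs; collecting results and handing them to the downstream tier would be a separate step.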

Capabilities

Plain OCR for scanned and digital documents.
Forms extraction — key-value pair detection.
Tables extraction — row / column / cell structure.
Queries — ask natural-language questions directly against the extracted document content.
Signature detection, layout analysis, handwritten text recognition.
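
As an illustration of the forms capability listed above, the following sketch pairs keys with values from a Textract response. It relies on the documented `Blocks` layout, where `KEY_VALUE_SET` blocks carry `EntityTypes` of `KEY` or `VALUE`, a `KEY` block points at its value through a `VALUE` relationship, and both point at their `WORD` children through `CHILD` relationships. The function names and the `min_confidence` parameter are hypothetical.

```python
def _text_of(block, blocks_by_id):
    """Join the text of a block's CHILD words and selection marks."""
    parts = []
    for rel in block.get("Relationships", []):
        if rel["Type"] != "CHILD":
            continue
        for child_id in rel["Ids"]:
            child = blocks_by_id[child_id]
            if child["BlockType"] == "WORD":
                parts.append(child["Text"])
            elif child["BlockType"] == "SELECTION_ELEMENT":
                parts.append(child["SelectionStatus"])
    return " ".join(parts)


def key_value_pairs(response, min_confidence=0.0):
    """Return (key_text, value_text, key_confidence) tuples from a response."""
    blocks_by_id = {b["Id"]: b for b in response["Blocks"]}
    pairs = []
    for block in response["Blocks"]:
        if block["BlockType"] != "KEY_VALUE_SET":
            continue
        if "KEY" not in block.get("EntityTypes", []):
            continue
        # Textract confidence scores are on a 0-100 scale.
        if block.get("Confidence", 0.0) < min_confidence:
            continue
        value_text = ""
        for rel in block.get("Relationships", []):
            if rel["Type"] == "VALUE":
                value_text = " ".join(
                    _text_of(blocks_by_id[i], blocks_by_id) for i in rel["Ids"]
                )
        pairs.append((_text_of(block, blocks_by_id), value_text, block.get("Confidence")))
    return pairs
```

The result is still raw text: a key such as "Effective Date:" is returned verbatim, and mapping it to a business field is left to a downstream layer.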

Asynchronous-vs-synchronous APIs

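Textract exposes synchronous operations (such as DetectDocumentText and AnalyzeDocument) for single-page documents, and asynchronous job-based operations (such as StartDocumentTextDetection and StartDocumentAnalysis) for multi-page documents stored in S3. With the asynchronous operations, the caller either polls the matching Get operation until the job completes or subscribes to an SNS completion notification.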
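
The polling variant of the asynchronous flow can be sketched as follows. This is a simplified illustration using the boto3 Textract client methods `start_document_analysis` and `get_document_analysis`; the function name, the polling interval, and the `max_polls` limit are assumptions, and the `sleep` parameter exists only so the loop can be tested without waiting. In production an SNS notification is often preferable to polling.

```python
import time


def run_async_analysis(textract, bucket, key, feature_types=("FORMS", "TABLES"),
                       poll_seconds=5.0, max_polls=120, sleep=time.sleep):
    """Start an analysis job, wait for it, and return all result blocks."""
    job = textract.start_document_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        FeatureTypes=list(feature_types),
    )
    job_id = job["JobId"]

    for _ in range(max_polls):
        page = textract.get_document_analysis(JobId=job_id)
        status = page["JobStatus"]
        if status == "IN_PROGRESS":
            sleep(poll_seconds)
            continue
        if status != "SUCCEEDED":
            # FAILED and PARTIAL_SUCCESS are both treated as errors here.
            raise RuntimeError(f"Textract job {job_id} ended with status {status}")
        # Results are paginated; follow NextToken until it is absent.
        blocks = list(page.get("Blocks", []))
        while "NextToken" in page:
            page = textract.get_document_analysis(
                JobId=job_id, NextToken=page["NextToken"]
            )
            blocks.extend(page.get("Blocks", []))
        return blocks

    raise TimeoutError(f"Textract job {job_id} still running after {max_polls} polls")
```

A caller would pass `boto3.client("textract")` as the first argument.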

When to use Textract vs do it yourself

The main argument for Textract is that OCR and structure extraction are fully managed: Textract handles model maintenance, multi-language support, accuracy improvements over time, and capacity provisioning. The trade-off is pay-per-page pricing, versus the largely fixed cost of running open-source OCR (e.g. Tesseract) or commercial OCR libraries on self-managed compute.
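
The pricing trade-off reduces to a break-even page volume. The sketch below is a back-of-the-envelope model only: the function and parameter names are hypothetical, the figures in the usage note are placeholders rather than AWS prices, and it ignores engineering time and accuracy differences, which often dominate the real decision.

```python
def break_even_pages(fixed_monthly_cost, per_page_price, self_hosted_per_page_cost=0.0):
    """Monthly page volume above which self-managed OCR is cheaper.

    fixed_monthly_cost: fixed cost of the self-managed stack per month.
    per_page_price: managed-service price per page.
    self_hosted_per_page_cost: marginal cost per page on the self-managed stack.
    """
    margin = per_page_price - self_hosted_per_page_cost
    if margin <= 0:
        raise ValueError("managed per-page price must exceed self-hosted marginal cost")
    return fixed_monthly_cost / margin
```

With placeholder inputs of 3000 per month in fixed cost and 0.0015 per page, `break_even_pages(3000.0, 0.0015)` comes out at roughly two million pages per month; below that volume the pay-per-page option is cheaper under this model.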

Composition with downstream LLMs

Textract output is text plus structure, not semantic understanding. Converting it into business-meaningful fields typically requires a downstream layer:

Rules-based extraction — historical baseline (~55% accuracy on contract-class documents per AArete's pre-2024 system).
LLM-based extraction — current state of the art (~99% accuracy on the same contract-class documents per Doczy.ai's AWS-blog-disclosed numbers).
Hybrid pipelines with custom preprocessing (smart chunking) and clustering (dual clustering) between Textract and the LLM, which produce "grounded" representations the LLM can reason over more accurately.
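
Whichever downstream layer is used, it first needs Textract's blocks turned into plain text. The sketch below groups `LINE` blocks by page in top-to-bottom reading order and drops low-confidence lines. It is a naive preprocessing step shown for illustration, not the smart chunking or dual clustering mentioned above; the function name and the default threshold of 80 are assumptions, and sorting by bounding-box position will misorder multi-column layouts.

```python
from collections import defaultdict


def lines_by_page(blocks, min_confidence=80.0):
    """Return {page_number: text} built from LINE blocks, in reading order."""
    pages = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") != "LINE":
            continue
        # Confidence is on a 0-100 scale; skip lines the OCR was unsure about.
        if block.get("Confidence", 0.0) < min_confidence:
            continue
        box = block["Geometry"]["BoundingBox"]
        # Coordinates are ratios of page size, so Top/Left sort works per page.
        pages[block.get("Page", 1)].append((box["Top"], box["Left"], block["Text"]))
    return {
        page: "\n".join(text for _, _, text in sorted(items))
        for page, items in sorted(pages.items())
    }
```

Each page's text could then be passed to a rules engine or placed in an LLM prompt, with the page number retained for traceability back to the source document.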

Seen in

sources/2026-06-02-aws-automating-contract-intelligence-with-doczyai-on-aws: Doczy.ai uses Textract as the OCR + metadata extraction layer before applying smart chunking and dual clustering. "An AWS Lambda function triggers Amazon Textract to extract text and metadata from documents in various formats." Production scale: 50 M pages over 22 months on this pipeline. Textract is invoked once per document, so the 2.5 M documents processed correspond to ~2.5 M Textract jobs (about 20 pages per document on average), billed per page across the 50 M pages.

systems/aws-lambda — typical orchestrator for Textract jobs
systems/aws-s3 — typical input storage tier
systems/amazon-bedrock — typical downstream LLM tier
systems/doczy-ai — production wiki-canonicalised consumer of Textract
concepts/smart-chunking — chunking pattern applied to Textract output
concepts/multimodal-document-understanding — sibling concept at the LLM-direct-on-pixels altitude (alternative to Textract + text-LLM pipelines)
patterns/managed-ai-document-intelligence-pipeline-on-aws
patterns/visual-first-document-extraction — sibling extraction pattern using multimodal LLMs directly on document images instead of OCR + text-LLM