# Document Extraction
Every B2B application eventually needs to parse documents: invoices from suppliers, CVs from candidates, contracts from partners, identity documents for compliance. Traditional parsers (regex, iText, rule-based) are fragile — they break on every layout variation.
Granit.AI.Extraction takes a different approach: send the document text to an LLM using
ChatResponseFormat.ForJsonSchema<T>() (MEAI native structured output), and get back a
strongly-typed C# object. The schema is derived automatically from the C# type — no manual
prompt engineering. The LLM handles layout variations, multilingual content, and messy
formatting — you get clean data.
## The problem

| Approach | Handles layout variations | Multilingual | Maintenance |
|---|---|---|---|
| Regex/rule-based | No | No | One parser per format, constant updates |
| Template-based OCR | Partially | No | One template per provider/layout |
| LLM extraction | Yes | Yes | Zero maintenance — the model adapts |
The trade-off: LLM extraction is slower (5-15s) and costs tokens. That’s why it’s designed to run asynchronously via Wolverine handlers, not in HTTP request paths.
## How it works

```mermaid
sequenceDiagram
    participant C as Client
    participant API as Minimal API
    participant W as Wolverine Handler
    participant LLM as IChatClient
    participant V as FluentValidation
    C->>API: POST /invoices/upload (PDF)
    API->>API: Extract text from PDF
    API-->>C: 202 Accepted + { jobId }
    API->>W: InvoiceUploadedEvent
    W->>LLM: "Extract InvoiceData from this text"
    LLM-->>W: JSON { supplier, amount, currency, date }
    W->>V: Validate InvoiceData
    alt Validation passes
        V-->>W: Valid
        W->>W: Save to database
        W->>C: Push notification (SignalR/SSE)
    else Validation fails
        V-->>W: Errors
        W->>W: Status = NeedsReview
        W->>C: Push notification (needs review)
    end
```
Key point: the HTTP endpoint returns 202 Accepted immediately. The extraction runs in the background via a Wolverine handler, with automatic retries if the LLM is temporarily unavailable.
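The 202-first flow can be sketched as a minimal API endpoint. This is illustrative only: `PdfTextExtractor` is a hypothetical helper, and the `InvoiceUploaded(Guid JobId, string DocumentText)` message shape is assumed; `IMessageBus` is Wolverine's message bus.

```csharp
app.MapPost("/invoices/upload", async (
    IFormFile file,
    IMessageBus bus,                     // Wolverine's message bus
    CancellationToken cancellationToken) =>
{
    // Text extraction is cheap compared to the LLM call, so it can stay inline.
    // PdfTextExtractor is a hypothetical helper for this sketch.
    string text = await PdfTextExtractor.ExtractAsync(
        file.OpenReadStream(), cancellationToken);

    // Publish and return immediately; the LLM call happens in the handler.
    var jobId = Guid.NewGuid();
    await bus.PublishAsync(new InvoiceUploaded(jobId, text));

    return Results.Accepted($"/invoices/jobs/{jobId}", new { jobId });
});
```

Returning `Results.Accepted` with a job location lets the client poll or subscribe (SignalR/SSE, as in the diagram) for the outcome.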
```csharp
builder.AddGranitAI();
builder.AddGranitAIOllama(); // or Azure OpenAI for production
builder.AddGranitAIExtraction();

// Register extractors for your domain types
builder.Services.AddDocumentExtractor<InvoiceData>();
builder.Services.AddDocumentExtractor<CandidateProfile>();
```

## Defining a result type
The result type is a plain C# record. The LLM infers the mapping from property names:
```csharp
public sealed record InvoiceData
{
    public string? Supplier { get; init; }
    public string? InvoiceNumber { get; init; }
    public decimal Amount { get; init; }
    public decimal? VatAmount { get; init; }
    public string? Currency { get; init; }
    public DateOnly? InvoiceDate { get; init; }
    public DateOnly? DueDate { get; init; }
}
```

## Using the extractor
### In a Wolverine handler (recommended)
```csharp
public class InvoiceUploadedHandler(
    IDocumentExtractor<InvoiceData> extractor,
    InvoiceDbContext dbContext)
{
    public async Task HandleAsync(
        InvoiceUploaded message,
        CancellationToken cancellationToken)
    {
        ExtractionResult<InvoiceData> result = await extractor
            .ExtractAsync(message.DocumentText, cancellationToken)
            .ConfigureAwait(false);

        switch (result.Status)
        {
            case ExtractionStatus.Succeeded:
                // result.Data is fully typed — save it
                dbContext.Invoices.Add(Invoice.FromExtraction(result.Data!));
                await dbContext.SaveChangesAsync(cancellationToken)
                    .ConfigureAwait(false);
                break;

            case ExtractionStatus.NeedsReview:
                // Low confidence — save as draft, notify user
                dbContext.InvoiceDrafts.Add(InvoiceDraft.FromExtraction(
                    result.Data!, result.ConfidenceScore, result.Warnings));
                await dbContext.SaveChangesAsync(cancellationToken)
                    .ConfigureAwait(false);
                break;

            case ExtractionStatus.Failed:
                // Could not extract — log and notify
                // result.ErrorMessage has details
                break;
        }
    }
}
```

## The ExtractionResult
| Property | Type | Description |
|---|---|---|
| Status | ExtractionStatus | Succeeded, NeedsReview, or Failed |
| Data | TResult? | The extracted object (null when Failed) |
| ConfidenceScore | double? | 0.0 to 1.0, based on LLM response metadata |
| ErrorMessage | string? | Error details when Failed |
| Warnings | IReadOnlyList<string> | Non-blocking issues (e.g., “Currency not found, defaulted to EUR”) |
The ReviewThreshold option (default 0.7) controls the boundary between Succeeded
and NeedsReview. Below 0.7 confidence, the extraction is flagged for human review
even if the data looks valid.
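These options are typically set at registration time. A sketch, assuming `AddGranitAIExtraction` accepts a configuration delegate and that the option names match the configuration reference below (the exact overload may differ):

```csharp
builder.AddGranitAIExtraction(options =>
{
    options.WorkspaceName = "invoices"; // dedicated AI workspace
    options.ReviewThreshold = 0.8;      // stricter: more extractions routed to review
    options.TimeoutSeconds = 60;        // allow more time for long documents
});
```

Raising the threshold trades throughput for safety: more extractions land in the human-review queue, but fewer wrong values reach the database unchecked.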
## Advantages and risks

### Advantages

- Zero parser maintenance — the LLM adapts to layout variations automatically
- Multilingual — works with French invoices, German contracts, Japanese receipts
- Strongly typed — the result is a C# record, not a dictionary of strings
- Validation integration — results go through FluentValidation before persistence
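The validation step shown in the sequence diagram can be an ordinary FluentValidation validator over the extracted record. A sketch for InvoiceData; the specific rules are illustrative, not part of the library:

```csharp
using FluentValidation;

// Catches semantic problems the JSON schema alone cannot express,
// e.g. a hallucinated negative amount or a due date before the invoice date.
public sealed class InvoiceDataValidator : AbstractValidator<InvoiceData>
{
    public InvoiceDataValidator()
    {
        RuleFor(x => x.Supplier).NotEmpty();
        RuleFor(x => x.Amount).GreaterThan(0);
        RuleFor(x => x.Currency).Length(3)
            .WithMessage("Currency must be an ISO 4217 code, e.g. EUR.");
        RuleFor(x => x.DueDate)
            .GreaterThanOrEqualTo(x => x.InvoiceDate)
            .When(x => x.InvoiceDate.HasValue && x.DueDate.HasValue);
    }
}
```

Failures here are what route an extraction to the NeedsReview path in the handler above.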
### Risks and mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| Hallucination | LLM invents data not in the document | ChatResponseFormat.ForJsonSchema<T>() constrains the response to the schema; FluentValidation catches semantic inconsistencies |
| PII in documents | Documents contain personal data | Use Azure OpenAI with DPA or Ollama (local). Never send medical/legal docs to public APIs without a DPA |
| Cost | Each extraction costs tokens | Batch processing, cache similar documents, use smaller models (GPT-4o-mini) for simple docs |
| Latency | 5-15s per extraction | Always async (Wolverine handler), never in HTTP request path |
| Wrong extraction | LLM misreads a field | NeedsReview status for low confidence, human review workflow |
## Real-world use cases

| Domain | Document | Result type | Typical confidence |
|---|---|---|---|
| Accounting | Supplier invoices (PDF) | InvoiceData | 0.85-0.95 |
| HR | CVs/Resumes (PDF) | CandidateProfile | 0.70-0.85 |
| Legal | Contracts (PDF) | ContractSummary | 0.65-0.80 |
| Compliance | Identity documents | IdentityDocument | 0.80-0.90 |
| Healthcare | Lab results | LabResult | Use local model only |
## Configuration reference

| Property | Type | Default | Description |
|---|---|---|---|
| WorkspaceName | string | "default" | AI workspace for extraction |
| ReviewThreshold | double | 0.7 | Below this → NeedsReview |
| TimeoutSeconds | int | 30 | Extraction timeout |
## See also

- Granit.AI overview — core module, providers, workspaces
- AI: Import Mapping — AI for CSV/Excel column mapping
- AI: Semantic Search — index extracted content for RAG