
Document Extraction

Every B2B application eventually needs to parse documents: invoices from suppliers, CVs from candidates, contracts from partners, identity documents for compliance. Traditional parsers (regex, iText, rule-based) are fragile — they break on every layout variation.

Granit.AI.Extraction takes a different approach: send the document text to an LLM using ChatResponseFormat.ForJsonSchema<T>() (MEAI native structured output), and get back a strongly-typed C# object. The schema is derived automatically from the C# type — no manual prompt engineering. The LLM handles layout variations, multilingual content, and messy formatting — you get clean data.
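To make the idea concrete, here is a minimal sketch of the MEAI primitive this builds on, using only Microsoft.Extensions.AI — not the library's actual internals. The prompt wording and the `ExtractInvoiceAsync` helper name are illustrative; `InvoiceData` is the record shown later on this page.

```csharp
using Microsoft.Extensions.AI;

// Sketch of the underlying MEAI call. GetResponseAsync<T> derives a
// JSON schema from the C# type and constrains the model's reply to it.
static async Task<InvoiceData> ExtractInvoiceAsync(
    IChatClient chatClient, string documentText)
{
    ChatResponse<InvoiceData> response = await chatClient
        .GetResponseAsync<InvoiceData>(
            $"Extract the invoice fields from this text:\n{documentText}");

    return response.Result; // strongly typed, no manual JSON parsing
}
```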

| Approach | Handles layout variations | Multilingual | Maintenance |
| --- | --- | --- | --- |
| Regex/rule-based | No | No | One parser per format, constant updates |
| Template-based OCR | Partially | No | One template per provider/layout |
| LLM extraction | Yes | Yes | Zero maintenance — the model adapts |

The trade-off: LLM extraction is slower (5-15s) and costs tokens. That’s why it’s designed to run asynchronously via Wolverine handlers, not in HTTP request paths.

```mermaid
sequenceDiagram
    participant C as Client
    participant API as Minimal API
    participant W as Wolverine Handler
    participant LLM as IChatClient
    participant V as FluentValidation

    C->>API: POST /invoices/upload (PDF)
    API->>API: Extract text from PDF
    API-->>C: 202 Accepted + { jobId }
    API->>W: InvoiceUploadedEvent

    W->>LLM: "Extract InvoiceData from this text"
    LLM-->>W: JSON { supplier, amount, currency, date }
    W->>V: Validate InvoiceData
    alt Validation passes
        V-->>W: Valid
        W->>W: Save to database
        W->>C: Push notification (SignalR/SSE)
    else Validation fails
        V-->>W: Errors
        W->>W: Status = NeedsReview
        W->>C: Push notification (needs review)
    end
```

Key point: the HTTP endpoint returns 202 Accepted immediately. The extraction happens in the background via Wolverine, with automatic retries if the LLM is temporarily down.
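A minimal endpoint following this pattern might look like the sketch below. The route, the `InvoiceUploaded` message type, and the `ExtractPdfTextAsync` helper are illustrative, not part of the library; only `IMessageBus.PublishAsync` is Wolverine's own API.

```csharp
using Wolverine;

app.MapPost("/invoices/upload", async (IFormFile file, IMessageBus bus) =>
{
    var jobId = Guid.NewGuid();

    // Hypothetical helper: pull plain text out of the PDF up front,
    // so the background handler only deals with text.
    string text = await ExtractPdfTextAsync(file);

    // Hand off to Wolverine; the handler runs (and retries) in the background.
    await bus.PublishAsync(new InvoiceUploaded(jobId, text));

    // Return immediately; the client polls or listens for a push notification.
    return Results.Accepted($"/invoices/{jobId}", new { jobId });
});
```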

```csharp
builder.AddGranitAI();
builder.AddGranitAIOllama(); // or Azure OpenAI for production
builder.AddGranitAIExtraction();

// Register extractors for your domain types
builder.Services.AddDocumentExtractor<InvoiceData>();
builder.Services.AddDocumentExtractor<CandidateProfile>();
```

The result type is a plain C# record. The LLM infers the mapping from property names:

```csharp
public sealed record InvoiceData
{
    public string? Supplier { get; init; }
    public string? InvoiceNumber { get; init; }
    public decimal Amount { get; init; }
    public decimal? VatAmount { get; init; }
    public string? Currency { get; init; }
    public DateOnly? InvoiceDate { get; init; }
    public DateOnly? DueDate { get; init; }
}

public class InvoiceUploadedHandler(
    IDocumentExtractor<InvoiceData> extractor,
    InvoiceDbContext dbContext)
{
    public async Task HandleAsync(
        InvoiceUploaded message,
        CancellationToken cancellationToken)
    {
        ExtractionResult<InvoiceData> result = await extractor
            .ExtractAsync(message.DocumentText, cancellationToken)
            .ConfigureAwait(false);

        switch (result.Status)
        {
            case ExtractionStatus.Succeeded:
                // result.Data is fully typed — save it
                dbContext.Invoices.Add(Invoice.FromExtraction(result.Data!));
                await dbContext.SaveChangesAsync(cancellationToken)
                    .ConfigureAwait(false);
                break;

            case ExtractionStatus.NeedsReview:
                // Low confidence — save as draft, notify user
                dbContext.InvoiceDrafts.Add(InvoiceDraft.FromExtraction(
                    result.Data!, result.ConfidenceScore, result.Warnings));
                await dbContext.SaveChangesAsync(cancellationToken)
                    .ConfigureAwait(false);
                break;

            case ExtractionStatus.Failed:
                // Could not extract — log and notify
                // result.ErrorMessage has details
                break;
        }
    }
}
```
| Property | Type | Description |
| --- | --- | --- |
| Status | ExtractionStatus | Succeeded, NeedsReview, or Failed |
| Data | TResult? | The extracted object (null when Failed) |
| ConfidenceScore | double? | 0.0 to 1.0, based on LLM response metadata |
| Warnings | IReadOnlyList<string> | Non-blocking issues (e.g., "Currency not found, defaulted to EUR") |
| ErrorMessage | string? | Error details when Failed |

The ReviewThreshold option (default 0.7) controls the boundary between Succeeded and NeedsReview. Below 0.7 confidence, the extraction is flagged for human review even if the data looks valid.

  • Zero parser maintenance — the LLM adapts to layout variations automatically
  • Multilingual — works with French invoices, German contracts, Japanese receipts
  • Strongly typed — the result is a C# record, not a dictionary of strings
  • Validation integration — results go through FluentValidation before persistence
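As an illustration of the validation step, a validator for the `InvoiceData` record above might look like this. The rules are examples, not shipped defaults; they show the kind of semantic check that catches output that is schema-valid but nonsensical for the domain.

```csharp
using FluentValidation;

// Illustrative semantic rules for the extracted invoice.
public sealed class InvoiceDataValidator : AbstractValidator<InvoiceData>
{
    public InvoiceDataValidator()
    {
        RuleFor(x => x.Supplier).NotEmpty();
        RuleFor(x => x.Amount).GreaterThan(0);
        RuleFor(x => x.Currency)
            .Length(3)
            .WithMessage("Currency must be an ISO 4217 code, e.g. EUR.");

        // A due date before the invoice date is almost certainly a misread.
        RuleFor(x => x.DueDate)
            .GreaterThanOrEqualTo(x => x.InvoiceDate!.Value)
            .When(x => x.InvoiceDate.HasValue && x.DueDate.HasValue);
    }
}
```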
| Risk | Impact | Mitigation |
| --- | --- | --- |
| Hallucination | LLM invents data not in the document | ChatResponseFormat.ForJsonSchema<T>() constrains the response to the schema; FluentValidation catches semantic inconsistencies |
| PII in documents | Documents contain personal data | Use Azure OpenAI with a DPA, or Ollama (local). Never send medical/legal docs to public APIs without a DPA |
| Cost | Each extraction costs tokens | Batch processing, cache similar documents, use smaller models (GPT-4o-mini) for simple docs |
| Latency | 5-15s per extraction | Always async (Wolverine handler), never in the HTTP request path |
| Wrong extraction | LLM misreads a field | NeedsReview status for low confidence, human review workflow |
| Domain | Document | Result type | Typical confidence |
| --- | --- | --- | --- |
| Accounting | Supplier invoices (PDF) | InvoiceData | 0.85-0.95 |
| HR | CVs/Resumes (PDF) | CandidateProfile | 0.70-0.85 |
| Legal | Contracts (PDF) | ContractSummary | 0.65-0.80 |
| Compliance | Identity documents | IdentityDocument | 0.80-0.90 |
| Healthcare | Lab results | LabResult | Use local model only |
| Property | Type | Default | Description |
| --- | --- | --- | --- |
| WorkspaceName | string | "default" | AI workspace for extraction |
| ReviewThreshold | double | 0.7 | Below this → NeedsReview |
| TimeoutSeconds | int | 30 | Extraction timeout |
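Assuming these options follow the usual .NET options pattern, overriding the defaults might look like the sketch below. The `GranitAIExtractionOptions` type name is a guess; only the three property names come from the table above.

```csharp
// Hypothetical wiring via the standard options pattern.
builder.Services.Configure<GranitAIExtractionOptions>(options =>
{
    options.WorkspaceName = "invoices";
    options.ReviewThreshold = 0.8; // stricter: more documents go to review
    options.TimeoutSeconds = 60;   // large scans need more time
});
```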