Indexing — Pluggable full-text + semantic search backends
Every product team builds the same search feature twice. First a quick LIKE '%term%' on a single table. Then — when the corpus grows — a hasty rewrite onto
tsvector or Elasticsearch, with a re-indexing batch glued together over a long
weekend. Both attempts share the same blind spots: no per-resource ACL filter
past tenant scope, no language-aware analysis, no GDPR-grade erasure when a data
subject revokes consent, no graceful path from lexical-only to hybrid semantic
retrieval, and every endpoint silently re-implements query parsing — usually
forwarding raw operator syntax straight into to_tsquery.
Granit.Indexing is the horizontal full-text + semantic search framework.
One write port (IIndexer<TKey>), one read port
(ISearchService<TKey, TResult>), pluggable backends behind both. The default
backend is Postgres tsvector (Granit.Indexing.EntityFrameworkCore) — zero
new infrastructure for a Granit host. Elasticsearch 8.x is opt-in
(Granit.Indexing.Elasticsearch) for low-millions-of-rows corpora or per-language
analyzer needs. Embeddings + hybrid Reciprocal Rank Fusion ship as a separate
opt-in (Granit.Indexing.Embeddings), as do AI summaries
(Granit.Indexing.AI), background reindex with checkpoint resume
(Granit.Indexing.BackgroundJobs), and the GDPR Art. 17 bridge
(Granit.Indexing.Privacy).
| Pain | This package’s answer |
|------|----------------------|
| Hand-rolled LIKE '%…%' per endpoint with no ACL filter past tenant scope | One ISearchService<TKey, TResult> with consumer-supplied ISearchResultAuthorizer<TKey> on every hit |
| tsquery / Lucene injection from anonymous traffic | plainto_tsquery / simple_query_string (restricted flags) on the default path; advanced syntax gated on a Search.Advanced.Execute permission |
| Existence-oracle probing via empty-result pings | Per-principal sliding window — MaxEmptyResultQueriesPerPrincipalPerMinute = 10 by default; 11th hits 429 |
| Long-tail “no results” page reveals row counts to restricted principals | Exponential over-fetch loop with a MaxAuthorizationDepth = 5 000 ceiling; aggregated HitAuthorizationLimit hint, not a per-query oracle |
| Index outlives the source row after a GDPR Art. 17 request | IIndexedDataEraser + Granit.Indexing.Privacy bridge: one bulk statement per backend, atomic with the source delete |
| Embedding sidecar table de-syncs from Content on erasure | Embeddings live on the same row as Content — pinned by an architecture test (Embeddings_must_live_on_the_same_row_as_Content_no_sidecar_entity_types) |
| Reindex after a tokenizer change requires a custom batch | RebuildIndexJob<TKey> + checkpoint store; survives worker restarts |
| Switching from Postgres to ES means rewriting consumers | Same IIndexer<TKey> / ISearchService<TKey, TResult> contracts — backend swap is a DI line |
Package structure
- Granit.Indexing/ Contracts: IIndexer<TKey>, ISearchService<TKey, TResult>, ISearchBackend<TKey, TResult>, ISearchResultAuthorizer<TKey>, IIndexedEntrySource<TKey>, IIndexedDataEraser, IndexedEntry<TKey>, SearchPage<TResult>
  - Granit.Indexing.EntityFrameworkCore/ Default backend: Postgres tsvector (GENERATED ALWAYS … STORED) + GIN index, PersonalDataDeletionHandler
    - …
  - Granit.Indexing.Elasticsearch/ ES 8.x backend, Shared / PerTenant strategies, restricted simple_query_string
    - …
  - Granit.Indexing.AI/ ISummarizer: one-shot LLM call, JSON-schema-pinned, AIQuota-rate-limited
    - …
  - Granit.Indexing.Embeddings/ Decorator: IEmbeddingGenerator write + RRF hybrid retriever (k=60 default, dense ranking)
    - …
  - Granit.Indexing.BackgroundJobs/ RebuildIndexJob<TKey> + checkpoint store (in-memory or EF)
    - …
  - Granit.Indexing.Privacy/ Wolverine handler bridging PersonalDataDeletionRequestedEto → every IIndexedDataEraser
    - …
| Package | Role | Depends on |
|---------|------|------------|
| Granit.Indexing | Contract root — IIndexer<TKey>, ISearchService<TKey, TResult>, ISearchBackend<TKey, TResult>, ISearchResultAuthorizer<TKey>, IIndexedEntrySource<TKey>, IIndexedDataEraser, IndexedEntry<TKey>, SearchPage<TResult>, DefaultSearchService, empty-result rate limiter | Granit, Granit.LanguageDetection |
| Granit.Indexing.EntityFrameworkCore | Default Postgres tsvector backend — IndexingDbContext, IndexedEntryRow<TKey>, HasGeneratedTsVectorColumn, EfIndexer<TKey>, EfSearchBackend<TKey, TResult>, EfIndexedDataEraser | Granit.Indexing, Granit.Persistence.EntityFrameworkCore |
| Granit.Indexing.Elasticsearch | ES 8.x backend — BM25 multi-field per-language analyzers, Shared / PerTenant strategy, delete_by_query eraser | Granit.Indexing, Elastic.Clients.Elasticsearch |
| Granit.Indexing.AI | ISummarizer LLM provider — JSON-schema response, AIQuota counter, prompt-injection isolation | Granit.Indexing, Granit.AI |
| Granit.Indexing.Embeddings | Decorator pair: IIndexer<TKey> writer (embeds Content) + ISearchBackend<TKey, TResult> hybrid retriever (BM25 ∪ kNN → RRF) | Granit.Indexing, Microsoft.Extensions.AI.Abstractions |
| Granit.Indexing.BackgroundJobs | RebuildIndexJob<TKey> (on-demand, host-dispatched), InMemoryRebuildCheckpointStore<TKey> default, EF persistent checkpoint opt-in | Granit.Indexing, Granit.BackgroundJobs |
| Granit.Indexing.Privacy | PersonalDataDeletionHandler Wolverine handler — fans PersonalDataDeletionRequestedEto to every registered IIndexedDataEraser | Granit.Indexing, Granit.Privacy |
Contracts
Write port — IIndexer<TKey>

    public interface IIndexer<TKey>
    {
        Task IndexAsync(IndexedEntry<TKey> entry, CancellationToken cancellationToken = default);
        Task RemoveAsync(TKey key, Guid? tenantId, CancellationToken cancellationToken = default);
    }

    public sealed record IndexedEntry<TKey>
    {
        public required TKey Key { get; init; }
        public required Guid? TenantId { get; init; }
        public required string Content { get; init; }
        public string? Language { get; init; }        // ISO 639-1, from ILanguageDetector
        public string? Summary { get; init; }         // optional, from ISummarizer
        public IReadOnlyList<string>? Tags { get; init; }
        public ReadOnlyMemory<float>? Embedding { get; init; }
        public IReadOnlyDictionary<string, string>? Facets { get; init; }
        public bool IsTruncated { get; init; }        // from TextExtractionResult
        public int CharCount { get; init; }
        public Guid? DataSubjectId { get; init; }     // drives GDPR Art. 17 erasure
    }

IndexAsync is idempotent: re-indexing (TenantId, Key) overwrites the existing
row in place. RemoveAsync takes tenantId explicitly because background
workers — which may not have ICurrentTenant in scope — must still be able to
remove rows from any tenant without leaking through ambient state.
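
The idempotency and explicit-tenant contract can be pictured with a toy in-memory store. This is only a sketch: `InMemoryIndexer<TKey>` and the simplified `Entry<TKey>` record are hypothetical stand-ins invented for illustration, not types shipped by Granit.Indexing.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical, simplified stand-in for IndexedEntry<TKey> (illustration only).
public sealed record Entry<TKey>(TKey Key, Guid? TenantId, string Content);

// Hypothetical in-memory indexer: every row is addressed by (TenantId, Key).
public sealed class InMemoryIndexer<TKey> where TKey : notnull
{
    private readonly ConcurrentDictionary<(Guid? TenantId, TKey Key), Entry<TKey>> _rows = new();

    public int Count => _rows.Count;

    // Upsert: indexing the same (TenantId, Key) twice overwrites the row in place.
    public void Index(Entry<TKey> entry) => _rows[(entry.TenantId, entry.Key)] = entry;

    // The tenant is an explicit argument, so no ambient tenant context is required.
    public bool Remove(TKey key, Guid? tenantId) => _rows.TryRemove((tenantId, key), out _);

    public string? ContentOf(TKey key, Guid? tenantId) =>
        _rows.TryGetValue((tenantId, key), out var row) ? row.Content : null;
}
```

Indexing the same key twice leaves a single row holding the latest content, and removing a key under one tenant never touches a row with the same key under another tenant.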
Read port — ISearchService<TKey, TResult>

    public interface ISearchService<TKey, TResult>
    {
        Task<SearchPage<TResult>> SearchAsync(SearchRequest request, CancellationToken ct = default);
    }

    public sealed record SearchRequest(
        string Query,
        int Page = 1,
        int PageSize = 20,
        string? Language = null,
        string? PrincipalIdentifier = null,
        bool UseAdvancedSyntax = false);

SearchRequest.Query is treated as a phrase by default. UseAdvancedSyntax
opts the backend into operator-aware parsing (to_tsquery on Postgres,
Lucene query_string on ES) — endpoints MUST gate this on a dedicated
Search.Advanced.Execute permission before forwarding the flag.
PrincipalIdentifier (typically User.GetSubjectId()) is hashed before any
log emission and used as the bucket key for the empty-result rate limiter.
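
To make the bucket-key idea concrete, here is a minimal sliding-window sketch. It is not the framework's limiter: `EmptyResultWindow`, `HashPrincipal`, and `TryRecordEmptyResult` are hypothetical names, SHA-256 is an assumed hash choice, and the sketch assumes the hashed identifier also serves as the bucket key.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical sketch of a per-principal sliding window (not the framework's limiter).
public sealed class EmptyResultWindow
{
    private readonly int _maxPerWindow;
    private readonly TimeSpan _window;
    private readonly Dictionary<string, Queue<DateTimeOffset>> _buckets = new();
    private readonly object _gate = new();

    public EmptyResultWindow(int maxPerWindow = 10, TimeSpan? window = null)
    {
        _maxPerWindow = maxPerWindow;
        _window = window ?? TimeSpan.FromSeconds(60);
    }

    // SHA-256 hex digest (assumed choice), so the raw principal id is never stored or logged.
    public static string HashPrincipal(string principalIdentifier) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(principalIdentifier)));

    // Records one empty-result query; returns false once the principal is over the limit.
    public bool TryRecordEmptyResult(string principalIdentifier, DateTimeOffset now)
    {
        string bucketKey = HashPrincipal(principalIdentifier);
        lock (_gate)
        {
            if (!_buckets.TryGetValue(bucketKey, out var hits))
            {
                hits = new Queue<DateTimeOffset>();
                _buckets[bucketKey] = hits;
            }

            // Drop timestamps that have slid out of the window.
            while (hits.Count > 0 && now - hits.Peek() >= _window)
            {
                hits.Dequeue();
            }

            if (hits.Count >= _maxPerWindow)
            {
                return false; // the caller would translate this into HTTP 429
            }

            hits.Enqueue(now);
            return true;
        }
    }
}
```

With the defaults, ten empty-result queries inside 60 s are accepted and the eleventh is rejected, which matches the documented default of 10.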
Authorization boundary
The single most important contract in Granit.Indexing:
The framework enforces tenant isolation only. Per-resource ACL is the consumer module’s responsibility and is enforced at read time via
ISearchResultAuthorizer<TKey>; it is never serialised into the index.
sequenceDiagram
autonumber
participant E as Endpoint
participant S as ISearchService<TKey, TResult><br/>(DefaultSearchService)
participant B as ISearchBackend<TKey, TResult><br/>(EfSearchBackend / EsSearchBackend)
participant A as ISearchResultAuthorizer<TKey><br/>(consumer-supplied)
E->>S: SearchAsync(request)
loop Over-fetch loop
S->>B: SearchAsync(request, offset, limit)<br/>limit = pageSize × multiplier × 2^i
B-->>S: hits + HasMore (tenant-filtered)
S->>A: FilterAsync(keys)
A-->>S: AuthorizedResult(keys)
Note over S: Stop when page full,<br/>backend exhausted,<br/>or MaxAuthorizationDepth hit
end
S-->>E: SearchPage<TResult>
| Layer | Concern | Default |
|-------|---------|---------|
| ISearchBackend<TKey, TResult> | Tenant isolation — every query scoped to ICurrentTenant, applied by the backend (never by the orchestrator) | EF: GranitDbContext parameterised tenant filter rewritten into every SQL statement. ES: mandatory term tenant_id on every read/write |
| ISearchResultAuthorizer<TKey> | Per-resource ACL — workspace, role-based row-level, public-link grants | NullSearchResultAuthorizer<TKey> (authorises every hit) — appropriate only when tenant isolation is the complete authorization story |
| DefaultSearchService<TKey, TResult> | Exponential over-fetch loop fills the requested page with authorised hits without leaking row counts | Iteration 1: pageSize × RecommendedInitialMultiplier. Doubles per iteration. Stops at MaxAuthorizationDepth = 5 000 |
The over-fetch loop avoids a class of leaks:
- No per-page existence oracle. A restricted principal who never sees more than n rows cannot bisect-search a private term: the response always carries the same authorised-page shape; the only signal is the aggregated HitAuthorizationLimit flag, which endpoints MUST throttle to at most one display per principal per 60 s (the framework cannot enforce this because it has no UI state).
- No empty-result probing. A principal that exceeds MaxEmptyResultQueriesPerPrincipalPerMinute (default 10) inside a 60 s window gets EmptyResultRateLimitedException, which the endpoint adapter converts to Problem(429).
- No backend hit count leak. SearchPage<TResult>.BackendHitCount is marked [JsonIgnore] + [EditorBrowsable(Never)] so it never round-trips through HTTP; an architecture test (BackendHitCount_must_not_be_referenced_from_any_Endpoints_package) forbids cross-package access from .Endpoints projects.
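
The over-fetch loop itself can be sketched in a few lines. This is an illustration under stated assumptions, not DefaultSearchService: `OverFetch.FillFirstPage`, the `Page` record, and the two delegates standing in for the backend and the authorizer are hypothetical, keys are plain integers, and only the first page is handled.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OverFetch
{
    // Hypothetical result shape: the authorised page plus the aggregated ceiling flag.
    public sealed record Page(IReadOnlyList<int> Items, bool HitAuthorizationLimit);

    public static Page FillFirstPage(
        Func<int, int, (IReadOnlyList<int> Hits, bool HasMore)> backend, // (offset, limit), already tenant-filtered
        Func<IReadOnlyList<int>, IReadOnlyList<int>> authorize,          // keeps only readable keys
        int pageSize,
        int initialMultiplier,
        int maxAuthorizationDepth = 5000)
    {
        var authorised = new List<int>();
        int offset = 0;
        int limit = pageSize * initialMultiplier; // iteration 1

        while (true)
        {
            // Never inspect more than maxAuthorizationDepth candidates in total.
            int take = Math.Min(limit, maxAuthorizationDepth - offset);
            var (hits, hasMore) = backend(offset, take);
            authorised.AddRange(authorize(hits));
            offset += hits.Count;

            if (authorised.Count >= pageSize)
                return new Page(authorised.Take(pageSize).ToList(), false); // page full
            if (!hasMore || hits.Count == 0)
                return new Page(authorised, false);                          // backend exhausted
            if (offset >= maxAuthorizationDepth)
                return new Page(authorised, true);                           // depth ceiling hit

            limit *= 2; // exponential growth: pageSize × multiplier × 2^i
        }
    }
}
```

The caller only ever sees the authorised items and one boolean; the number of candidates the backend returned stays inside the loop.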
Writing an authorizer

    public sealed class WorkspaceAclAuthorizer(IWorkspaceAccess access)
        : ISearchResultAuthorizer<Guid>
    {
        // Restricted principal sees ~10 % of hits, so over-fetch 10× on iteration 1.
        public int RecommendedInitialMultiplier => 10;

        public async Task<AuthorizedResult<Guid>> FilterAsync(
            IReadOnlyList<Guid> candidates, CancellationToken ct)
        {
            IReadOnlyList<Guid> allowed = await access
                .FilterReadableAsync(candidates, ct).ConfigureAwait(false);
            return new AuthorizedResult<Guid>(allowed);
        }
    }

    // Composition root
    services.AddGranitIndexing();
    services.AddSingleton<ISearchResultAuthorizer<Guid>, WorkspaceAclAuthorizer>();

Use the rule of thumb ceil(1 / expected_authorized_ratio) for the multiplier.
Too low costs an extra round-trip on the common path; too high wastes backend
rows on the rare path. Unknown principals default to 3 — covers admin and
restricted alike within two iterations.
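
As a worked example of the rule of thumb, the helper below computes the multiplier and the number of candidates inspected after a given number of doubling iterations. `OverFetchMath` and both method names are hypothetical; the cumulative formula simply sums the geometric series implied by doubling the limit on each iteration.

```csharp
using System;

public static class OverFetchMath
{
    // ceil(1 / expected_authorized_ratio), never below 1.
    public static int RecommendedMultiplier(double expectedAuthorizedRatio)
    {
        if (expectedAuthorizedRatio <= 0 || expectedAuthorizedRatio > 1)
            throw new ArgumentOutOfRangeException(nameof(expectedAuthorizedRatio));
        return Math.Max(1, (int)Math.Ceiling(1.0 / expectedAuthorizedRatio));
    }

    // Candidates inspected after n iterations when the limit doubles each time:
    // pageSize × multiplier × (1 + 2 + ... + 2^(n-1)) = pageSize × multiplier × (2^n − 1).
    public static int CumulativeCandidates(int pageSize, int multiplier, int iterations) =>
        pageSize * multiplier * ((1 << iterations) - 1);
}
```

A principal expected to read a quarter of the hits gets a multiplier of 4; with a page size of 20 and the default multiplier of 3, two iterations inspect 180 candidates.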
Backends
Postgres tsvector (default)
Granit.Indexing.EntityFrameworkCore is the default backend: zero new
infrastructure for any Granit host that already runs Postgres.
| Aspect | Behaviour |
|--------|-----------|
| Storage | IndexedEntryRow<TKey> with Content, SearchVector tsvector (GENERATED ALWAYS … STORED), Language, Summary, Tags string[], IsTruncated, CharCount, DataSubjectId. One physical table per registered TKey. |
| Query syntax | plainto_tsquery by default; websearch_to_tsquery via IndexingEntityFrameworkCoreOptions.UseWebSearchSyntax. to_tsquery is not reachable from the default path — operator characters (&, \|, !, parentheses) are treated as literals. |
| Index | GIN over SearchVector, emitted by the HasGeneratedTsVectorColumn(...) ModelBuilder extension. |
| Tenant isolation | Inherited from GranitDbContext — parameterised tenant filter rewritten into every SQL statement at execution time (no closure-leak risk). |
| GDPR Art. 17 | EfIndexedDataEraser fans out a single ExecuteDelete() per registered TKey filtered by (TenantId, DataSubjectId). |
| Architecture pins | Granit.Indexing.EntityFrameworkCore is the only package allowed to reference EF Core NuGets (EntityFrameworkCore_NuGets_only_in_the_EntityFrameworkCore_backend). IgnoreQueryFilters usage is on an audit allowlist. |

    builder.Services.AddGranitIndexing();
    builder.Services.AddGranitIndexingEntityFrameworkCore(
        opts => opts.UseNpgsql(connectionString),
        typeof(Guid));

    builder.Services.AddGranitIndexingBackend<Guid, MyHitResponse>(
        row => new MyHitResponse(row.Key, row.Summary ?? string.Empty, row.Tags));

The package ships no EF migrations; the consumer host owns them:

    dotnet ef migrations add InitIndexing \
      --context IndexingDbContext \
      --project YourHost/YourHost.csproj

Elasticsearch 8.x (opt-in)
Granit.Indexing.Elasticsearch swaps the backend wholesale: registering it
strips any previously-registered IIndexer<TKey> and IIndexedDataEraser to
guarantee the host runs a single backend.
Reach for it when:
- The corpus exceeds what a single Postgres tsvector index can comfortably serve (low-millions of rows or multi-GB content).
- Per-language analyzers, synonym maps, or phrase scoring are core to UX.
- An Elasticsearch cluster is already operated and consolidating full-text workloads makes sense.

    builder.Services.AddGranitIndexing();
    builder.Services.AddGranitIndexingElasticsearch(
        configureClient: null,
        typeof(Guid));
    builder.Services.AddGranitIndexingElasticsearchBackend<Guid, MyResponse>(
        keyProjection: doc => Guid.Parse(doc.Key),
        resultProjection: doc => new MyResponse(doc.Key, doc.Summary, doc.Tags));

    {
      "Indexing": {
        "Elasticsearch": {
          "Uri": "https://es.internal:9200",
          "ApiKey": "your-api-key",
          "Strategy": "Shared",
          "IndexPrefix": "granit-indexing",
          "BulkBatchSize": 500,
          "StoreFullContentInIndex": true,
          "UseSimpleQueryString": true,
          "DefaultAnalyzer": "standard"
        }
      }
    }

| Setting | Choice |
|---------|--------|
| Strategy: Shared (default) | One index per TKey; tenants isolated by mandatory term tenant_id filter on every read / write. |
| Strategy: PerTenant | One index per (TKey, tenant) pair. Stricter physical isolation, one extra index per tenant. The framework still applies the tenant_id filter as defence-in-depth for misrouted bulk imports. |
| UseSimpleQueryString: true (default) | simple_query_string with the restricted flag set AND \| OR \| PHRASE \| PREFIX. Lucene’s full query_string (regex, fuzzy, field-targeted operators) is reachable only when the request carries UseAdvancedSyntax = true and the endpoint has gated on Search.Advanced.Execute. |
| StoreFullContentInIndex | Trade-off — see below. |
Granit.Indexing.Elasticsearch ships an IIndexedDataEraser that fans out a
single delete_by_query across every registered TKey. delete_by_query is a
logical delete; physical disposal happens at the next segment merge or via an
explicit forcemerge schedule — Article 17 is satisfied because the data is no
longer addressable, but bit-level disposal depends on the host’s storage policy.
Backend comparison
| Concern | Postgres tsvector | Elasticsearch 8.x |
|---------|---------------------|--------------------|
| Infrastructure cost | None beyond Postgres | Dedicated cluster |
| Default query parser | plainto_tsquery (operator characters → literals) | simple_query_string with restricted flags |
| Advanced syntax | to_tsquery — gated on Search.Advanced.Execute | Full Lucene query_string — gated on Search.Advanced.Execute |
| Tenant isolation | GranitDbContext parameterised filter | Mandatory term tenant_id filter, plus optional PerTenant physical isolation |
| Per-language analyzers | One tsvector config per row (chosen from Language) | One sub-field per analyzer; synonym maps and phrase scoring built-in |
| GDPR Art. 17 | ExecuteDelete() per TKey (synchronous, atomic) | delete_by_query (logical delete; physical disposal on next merge / forcemerge) |
| Embeddings | vector(N) pgvector column on the same row as Content | dense_vector(dims: N) field on the same document as Content |
Language-aware analysis
IndexedEntry.Language is consumed by every backend at index time to pick the
right analyser (Postgres tsvector configuration, ES <lang>_<analyzer>). The
value comes from
Granit.LanguageDetection — a
cross-cutting ILanguageDetector with a deterministic trigram default and
optional priority-chain overrides:

    public sealed class MyDocumentSource(
        IDocumentRepository repo,
        ITextExtractionPipeline extraction,
        ILanguageDetector languageDetector) : IIndexedEntrySource<Guid>
    {
        public string Name => "document";

        public async IAsyncEnumerable<Guid> EnumerateKeysAsync(
            Guid? tenantId,
            Guid? resumeAfter,
            [EnumeratorCancellation] CancellationToken ct)
        {
            await foreach (Guid id in repo.EnumerateIdsAsync(tenantId, resumeAfter, ct))
                yield return id;
        }

        public async Task<IndexedEntry<Guid>?> BuildEntryAsync(Guid key, CancellationToken ct)
        {
            Document? doc = await repo.GetAsync(key, ct).ConfigureAwait(false);
            if (doc is null) return null;

            TextExtractionResult body = await extraction.ExtractAsync(
                doc.OpenRead(), doc.ContentType, ct).ConfigureAwait(false);

            // Prefer the language found during extraction; fall back to the detector.
            string? language = body.DetectedLanguage
                ?? await languageDetector.DetectAsync(body.Content, ct).ConfigureAwait(false);

            return new IndexedEntry<Guid>
            {
                Key = key,
                TenantId = doc.TenantId,
                Content = body.Content,
                Language = language,
                IsTruncated = body.IsTruncated,
                CharCount = body.CharCount,
                DataSubjectId = doc.OwnerPartyId, // GDPR Art. 17 hook
            };
        }

        public Task<Guid?> GetDataSubjectIdAsync(Guid key, CancellationToken ct)
            => repo.GetOwnerPartyIdAsync(key, ct);
    }

AI providers (opt-in)
Every AI add-on is opt-in. The base Granit.Indexing pipeline runs fully
lexical with the deterministic trigram detector — no network calls, no embedded
LLM. Bring in AI providers package by package when the cost/quality trade-off
makes sense.
flowchart LR
subgraph base["Always-on baseline"]
ext["Granit.TextExtraction<br/>bytes → text"] --> lang["Granit.LanguageDetection.Trigram<br/>ISO 639-1"]
lang --> entry["IndexedEntry<TKey>"]
entry --> ix["IIndexer<TKey>"]
end
subgraph ai["Opt-in AI providers"]
lang -. higher priority .-> aiLang["Granit.LanguageDetection.AI<br/>(short / mixed corpora)"]
entry -. before IndexAsync .-> sum["Granit.Indexing.AI<br/>ISummarizer (LLM snippet)"]
ix -. decorator .-> emb["Granit.Indexing.Embeddings<br/>IEmbeddingGenerator + RRF retriever"]
end
| Package | Adds | Cost ceiling |
|---------|------|--------------|
| Granit.LanguageDetection.AI | LLM-backed ILanguageDetectorProvider at priority 200 — disambiguates short or mixed-language inputs the trigram detector cannot reliably classify | Inherits Granit.AI AIQuotaOptions.MaxRequestsPerTenantPerHour |
| Granit.Indexing.AI | ISummarizer — one-shot LLM call producing a SERP-style snippet for IndexedEntry.Summary. JSON-schema-pinned, content wrapped in <untrusted_document>...</untrusted_document> (OWASP LLM01) | MaxAICallsPerHourPerTenant (default 1 000); on cap, returns null — the entry persists without a summary |
| Granit.Indexing.Embeddings | Decorator pair — embeds Content via IEmbeddingGenerator at write time, fuses BM25/tsvector + cosine kNN with Reciprocal Rank Fusion at read time | Wraps the host’s IEmbeddingGenerator; native cost ceiling tracked under follow-up (the cost-accounting contract is fleshed out in I-F3.2) |
Hybrid retrieval — Reciprocal Rank Fusion
Granit.Indexing.Embeddings decorates both ports: write-time embedding of
Content on the same row as the body (atomic GDPR erasure), and read-time
fusion of the lexical and dense channels via Reciprocal Rank Fusion
(Cormack et al. 2009, k = 60, dense ranking on score ties). The full
registration order, the RrfFetchPoolSize / pagination-after-fusion contract,
graceful lexical-only degradation, and the HNSW ghost-vector operations
concern are covered on the dedicated Embeddings (RRF) sub-page.
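
For readers unfamiliar with Reciprocal Rank Fusion, the scoring rule is short enough to sketch. Each document scores the sum of 1 / (k + rank) over the channels that returned it, with 1-based ranks. `Rrf.Fuse` is a hypothetical helper, not the package's retriever: it takes two already-ordered key lists and ignores the tie-handling and fetch-pool details mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class Rrf
{
    // score(d) = sum over channels of 1 / (k + rank(d)), with 1-based ranks.
    public static IReadOnlyList<(TKey Key, double Score)> Fuse<TKey>(
        IReadOnlyList<TKey> lexicalRanking,
        IReadOnlyList<TKey> denseRanking,
        int k = 60) where TKey : notnull
    {
        var scores = new Dictionary<TKey, double>();
        foreach (var ranking in new[] { lexicalRanking, denseRanking })
        {
            for (int i = 0; i < ranking.Count; i++)
            {
                // A key missing from the dictionary starts at 0.
                scores.TryGetValue(ranking[i], out double current);
                scores[ranking[i]] = current + 1.0 / (k + i + 1);
            }
        }

        // Highest fused score first; OrderByDescending is stable for equal scores.
        return scores
            .OrderByDescending(pair => pair.Value)
            .Select(pair => (Key: pair.Key, Score: pair.Value))
            .ToList();
    }
}
```

A document that appears near the top of both channels outranks one that tops a single channel, because its two reciprocal terms add up.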
GDPR Art. 17 — atomic erasure path
Indexed copies must not outlive the source row. Three pieces co-operate:
sequenceDiagram
autonumber
participant P as Granit.Privacy
participant Bus as Wolverine bus
participant H as PersonalDataDeletionHandler<br/>(Granit.Indexing.Privacy)
participant E1 as EfIndexedDataEraser
participant E2 as EsIndexedDataEraser
P->>Bus: PersonalDataDeletionRequestedEto(tenantId, dataSubjectId)
Bus->>H: handler picks it up
par per backend, in parallel
H->>E1: EraseAsync(tenantId, dataSubjectId)
E1->>E1: ExecuteDelete()<br/>per registered TKey
E1-->>H: rows deleted
and
H->>E2: EraseAsync(tenantId, dataSubjectId)
E2->>E2: delete_by_query<br/>per registered TKey
E2-->>H: rows deleted
end
| Layer | Role |
|-------|------|
| Producer — IIndexedEntrySource<TKey>.GetDataSubjectIdAsync | Returns the natural-person id the indexed body refers to; null for non-personal data (system documents, public reference data). Populated into IndexedEntry.DataSubjectId at build time. |
| Backend — IIndexedDataEraser | One implementation per backend (EfIndexedDataEraser, EsIndexedDataEraser). Bulk-deletes rows filtered by (TenantId, DataSubjectId) in a single statement. Idempotent — Wolverine retries and manual replays converge safely. |
| Bridge — Granit.Indexing.Privacy | PersonalDataDeletionHandler subscribed to PersonalDataDeletionRequestedEto. Resolves every IIndexedDataEraser from DI and fans out the request. Skip the package on hosts that do not need the cascade. |
Why a separate hook instead of IIndexer<TKey>.RemoveAsync?
IIndexer<TKey>.RemoveAsync removes a single known (TenantId, Key) tuple. The
GDPR cascade does not know the keys — only the DataSubjectId the entries
reference. Backends store that value at index time and expose EraseAsync to
erase by subject in one bulk statement; calling RemoveAsync in a loop would
require enumerating every key first — slower and racier.
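
The difference between key-based removal and subject-based erasure can be pictured with an in-memory sketch. `SubjectRow` and `InMemoryEraser` are hypothetical stand-ins, not the shipped IIndexedDataEraser implementations; the point is only that one predicate over (TenantId, DataSubjectId) replaces a loop over keys.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical row shape: only the columns the eraser filters on, plus the key.
public sealed record SubjectRow(Guid? TenantId, Guid Key, Guid? DataSubjectId);

public sealed class InMemoryEraser
{
    private readonly List<SubjectRow> _rows = new();

    public int Count => _rows.Count;

    public void Add(SubjectRow row) => _rows.Add(row);

    // One bulk predicate over (TenantId, DataSubjectId); the keys are never enumerated.
    // Calling it again for the same subject removes nothing, so replays are harmless.
    public int Erase(Guid? tenantId, Guid dataSubjectId) =>
        _rows.RemoveAll(row => row.TenantId == tenantId && row.DataSubjectId == dataSubjectId);
}
```

Rows with a null DataSubjectId (non-personal data) never match the predicate, so they survive any erasure request.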
When Granit.Indexing.Embeddings is wired, embeddings are persisted on the
same row as Content in every storage backend — pinned by
Embeddings_must_live_on_the_same_row_as_Content_no_sidecar_entity_types. When
the eraser fires, both atoms vanish atomically in a single DELETE / delete_by_query.
No sidecar table, no orphan vectors.
Background reindex with checkpoint resume
Granit.Indexing.BackgroundJobs ships RebuildIndexJob<TKey>, an on-demand
job that iterates every key emitted by IIndexedEntrySource<TKey> and calls
IIndexer<TKey>.IndexAsync per entry. Use it after a tokenizer / analyzer
change, after a new tenant backfill, after wiring a new embedding model, or
after a data-quality incident.
The job is on-demand only — the framework ships no [RecurringJob]. The
permission check MUST live at the dispatch site (the Wolverine handler has no
HTTP context), and a MaxEntriesPerRun / MaxRunDuration budget should be
set in production to bound the blast radius of a runaway or hostile dispatch.
Resource budget, lifecycle events, checkpoint store options, and the full
dispatch-controller example live on the dedicated page:
→ Indexing — Background reindex
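
Checkpoint resume under an entry budget can be sketched as follows. `RebuildSketch.Run` is hypothetical and far simpler than RebuildIndexJob<TKey>: keys are integers assumed to arrive in ascending order, the checkpoint is just the last processed key, and the budget mirrors the idea behind MaxEntriesPerRun.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RebuildSketch
{
    // Processes keys in ascending order, starting after the checkpoint, and stops
    // once maxEntriesPerRun entries have been handled.
    // Returns the new checkpoint (last processed key), or the old one if nothing ran.
    public static int? Run(
        IEnumerable<int> sortedKeys,
        int? resumeAfter,
        int maxEntriesPerRun,
        Action<int> indexOne)
    {
        int? checkpoint = resumeAfter;
        int processed = 0;

        foreach (int key in sortedKeys.Where(k => resumeAfter is null || k > resumeAfter))
        {
            if (processed >= maxEntriesPerRun)
            {
                break; // budget exhausted; a later run resumes from the checkpoint
            }

            indexOne(key);
            checkpoint = key; // a real store would persist this durably
            processed++;
        }

        return checkpoint;
    }
}
```

Feeding the returned checkpoint into the next run continues where the previous one stopped, so each key is indexed once even when the work is split across several runs.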
Configuration cookbook
Default Postgres backend:

    [DependsOn(
        typeof(GranitIndexingEntityFrameworkCoreModule),
        typeof(GranitLanguageDetectionTrigramModule))]
    public sealed class SearchModule : GranitModule
    {
        public override void ConfigureServices(ServiceConfigurationContext context)
        {
            context.Services.AddGranitIndexing();
            context.Services.AddGranitIndexingEntityFrameworkCore(
                opts => opts.UseNpgsql(context.Configuration.GetConnectionString("Indexing")!),
                typeof(Guid));

            context.Services.AddGranitIndexingBackend<Guid, MyHitResponse>(
                row => new MyHitResponse(row.Key, row.Summary ?? string.Empty, row.Tags));

            context.Services.AddSingleton<ISearchResultAuthorizer<Guid>, WorkspaceAclAuthorizer>();
            context.Services.AddScoped<IIndexedEntrySource<Guid>, MyDocumentSource>();
        }
    }

    {
      "Indexing": {
        "DefaultPageSize": 20,
        "MaxPageSize": 100,
        "MaxAuthorizationDepth": 5000,
        "MaxEmptyResultQueriesPerPrincipalPerMinute": 10,
        "AuthorizationOverfetchMultiplier": 3
      }
    }

Elasticsearch backend:

    [DependsOn(typeof(GranitIndexingElasticsearchModule))]
    public sealed class SearchModule : GranitModule
    {
        public override void ConfigureServices(ServiceConfigurationContext context)
        {
            context.Services.AddGranitIndexing();
            context.Services.AddGranitIndexingElasticsearch(
                configureClient: null,
                typeof(Guid));

            context.Services.AddGranitIndexingElasticsearchBackend<Guid, MyHitResponse>(
                keyProjection: doc => Guid.Parse(doc.Key),
                resultProjection: doc => new MyHitResponse(doc.Key, doc.Summary, doc.Tags));
        }
    }

    {
      "Indexing": {
        "Elasticsearch": {
          "Uri": "https://es.internal:9200",
          "ApiKey": "your-api-key",
          "Strategy": "Shared",
          "IndexPrefix": "granit-indexing",
          "StoreFullContentInIndex": false,
          "UseSimpleQueryString": true
        }
      }
    }

AI summaries with the Privacy bridge:

    [DependsOn(
        typeof(GranitIndexingAIModule),
        typeof(GranitIndexingPrivacyModule))]
    public sealed class SearchAIModule : GranitModule
    {
        public override void ConfigureServices(ServiceConfigurationContext context)
        {
            context.Services.AddGranitIndexingAISummarizer();
        }
    }

    {
      "Indexing": {
        "AI": {
          "WorkspaceName": "default",
          "MaxAICallsPerHourPerTenant": 1000,
          "RedactPIIBeforeLLMCall": true,
          "MaxContentLength": 8192,
          "MaxSummaryLength": 500,
          "TimeoutSeconds": 20
        }
      }
    }

Consumers retrieve ISummarizer from DI and call SummarizeAsync(content)
before building the IndexedEntry, then assign the result to
entry.Summary. The framework does NOT auto-wire a decorator on IIndexer<TKey> —
the call cost lives where the consumer can control it.
Architecture invariants
The following invariants are pinned by tests in IndexingArchitectureTests and run on every CI build:
| Invariant | Rationale |
|-----------|-----------|
| BackendHitCount_must_not_be_referenced_from_any_Endpoints_package | The raw backend hit count is a tenant-wide row-count oracle. Endpoint adapters must read only SearchPage.Items + HitAuthorizationLimit. |
| Embeddings_must_live_on_the_same_row_as_Content_no_sidecar_entity_types | Atomic GDPR Art. 17 erasure — embeddings and content vanish in the same DELETE. |
| IgnoreQueryFilters_calls_in_Granit_Indexing_EntityFrameworkCore_stay_within_the_audit_allowlist | Bypassing tenant filters is allowed only at named, reviewed call sites. |
| Granit_Indexing_packages_must_not_reference_Microsoft_AspNetCore | Indexing is a horizontal framework — CLI tools and background workers consume it without dragging in the web stack. |
| EntityFrameworkCore_NuGets_only_in_the_EntityFrameworkCore_backend | EF Core stays confined to the EF backend; the base contract and other backends remain provider-pure. |
| IIndexer_implementations_must_not_call_Console_or_Trace | All telemetry flows through IndexingMetrics + IndexingActivitySource — never stdout. |
See also
Sub-pages
- Indexing — Embeddings (RRF): opt-in hybrid retrieval with an IEmbeddingGenerator writer + RRF fusion of BM25 / tsvector with cosine kNN; HNSW reindex cadence.
- Indexing — Background reindex: RebuildIndexJob<TKey> with checkpoint resume, dispatch-site permission, resource budgets, lifecycle events.
Related
- TextExtraction: upstream producer of IndexedEntry.Content (and DetectedLanguage) for unstructured uploads.
- LanguageDetection: ILanguageDetector populates IndexedEntry.Language; backends pick the analyser from it.
- Query Engine: peer building block for structured filter / sort / paging over EF Core. Indexing handles full-text + semantic; QueryEngine handles structured admin grids.
- Data Exchange: peer building block for CSV / Excel import / export. Often paired: imported rows are indexed at the same time.
- Documents: downstream consumer that uploads bytes, runs them through TextExtraction, then indexes the result.
- AI — Semantic Search & RAG: overview of the hybrid-retrieval story across the AI feature family.
- Compliance — Privacy: Granit.Privacy publishes PersonalDataDeletionRequestedEto; Granit.Indexing.Privacy is its indexing bridge.