Spike — SQL ad-hoc metric definitions for Granit.Metering
Date: 2026-04-24 Author: Jean-Francois Meyers Issue: #1171 (Phase 2 of EPIC #1155 ORB alignment) Status: Concluded — recommendation DEFER
ORB lets users define a metric as an arbitrary SQL expression evaluated over the
events table at billing time — the most expressive billing primitive on the
market today. This spike investigates whether Granit.Metering should match
that surface, and concludes:
- DEFER. The five built-in aggregation types
(
Sum/Max/Count/Last/CountDistinct) cover ≥ 95 % of the scenarios that the framework’s B2B SaaS target audience has shipped to date. - The three safe ways to expose raw SQL (parameterised + AST-validated SQL, constrained DSL, embedded SQLite re-execution) each have non-trivial implementation cost and ongoing maintenance burden, with a security failure mode that would directly leak per-tenant billing data.
- Re-evaluate when at least three independent users request the capability for a scenario the existing primitives demonstrably cannot serve. At that point, the constrained DSL path is the recommended GO route — see Path forward.
Context
Section titled “Context”What ORB does
Section titled “What ORB does”ORB metric definitions accept arbitrary SQL evaluated against an internal events table. A typical metric reads roughly:
SELECT SUM(properties->>'bytes_uploaded'::numeric)FROM eventsWHERE properties->>'region' = 'us-east-1' AND properties->>'cold_storage' = 'false'This is enormously expressive — it covers conditional aggregation, joins to reference tables, expression-based grouping, percentile computations, etc. ORB runs these on a columnar warehouse (probably ClickHouse or Snowflake) where predicate pushdown and per-tenant partitioning make multi-million-event scans cheap. They also accept the operational cost of maintaining a SQL parser/ validator for every release.
What Granit.Metering does today
Section titled “What Granit.Metering does today”After Phase 2 (#1158), MeterDefinition.AggregationType admits five values:
| Type | Semantics |
|---|---|
Sum | SUM(quantity) over the period |
Max | MAX(quantity) over the period |
Count | COUNT(*) over the period |
Last | quantity of the latest event in the period |
CountDistinct | COUNT(DISTINCT metadata.<DistinctProperty>) over the period (#1166) |
Plus the on-demand recompute (#1167), event soft-deprecation (#1169), and 365-day backfill (#1168) covered by neighbouring stories. The combined surface handles flat-rate, per-seat, per-call, and MAU-style metering — the four pricing models that show up in 95 % of B2B SaaS billing implementations the maintainer has audited (Granit.Showcase included).
Why we even consider raw SQL
Section titled “Why we even consider raw SQL”There are real workloads the existing primitives cannot serve:
- Conditional aggregation: “Sum bandwidth, but only for non-cached requests”
— would require a
WHERE metadata->>'cached' = 'false'clause. - Expression-based grouping: “Bill differently per region in the same
meter” — needs
GROUP BY metadata->>'region'. - Reference-table joins: “Bill by customer tier looked up in another table” — needs a join.
For these, the user’s current options are:
- Pre-compute the predicate at ingestion time (spread the metric across N
MeterDefinitionrows) — works for low-cardinality grouping, falls over when the grouping key is open-ended. - Materialise an aggregate themselves and call
POST /metering/events/backfillwith the result — works for batch reporting, not for real-time quotas. - Drop down to raw SQL on the application’s own database (Granit isolates the
MeteringDbContextbut the application owns the schema and can read it) — bypasses the framework, no quota integration, no audit trail.
None of these are great. But they exist — the lack of native ad-hoc SQL is not currently a blocker for any concrete user the maintainer has talked to.
The eight questions
Section titled “The eight questions”1. Injection model
Section titled “1. Injection model”| Approach | Defense | Cost | Verdict |
|---|---|---|---|
| Parameterised values + raw SQL body | None on the SQL itself; predicate clauses can still drop the tenant filter or join across tables | None — direct passthrough | ❌ unsafe |
| AST-validated SQL (parse the user’s SQL with e.g. PgSqlParser, walk the tree, reject anything beyond a whitelist of node types) | Rejects DML, DDL, sub-queries, joins outside whitelist, function calls outside whitelist | High — must track every PostgreSQL syntax extension across versions; SQL Server divergence requires a second parser | 🟡 viable but heavy |
Whitelisted columns + WHERE-fragment only (user supplies only the WHERE body and the aggregate function from a closed list, framework wraps it) | Same blast radius as AST-validated, simpler implementation | Medium — closed grammar; need to publish the whitelist as part of the public contract | 🟢 safest within “raw SQL” approaches |
Constrained DSL (custom predicate expression + closed list of aggregates, transpiled to IQueryable<MeterEvent>) | EF Core LINQ provider does the SQL generation → no string concatenation possible | Medium-high — design + parser + transpiler + docs | 🟢 safest, most composable |
| Embedded SQLite re-execution (export the events to an in-memory SQLite DB, run user SQL there) | Total — the user’s SQL touches a copy of the data, no path to the production DB | Medium — copy the events to memory per query (10M events ≈ 1.5 GB raw) | 🟡 safe but expensive |
Decision driver: the framework’s threat model treats per-tenant data isolation as the strongest invariant (GDPR, ISO 27001 A.9.4). Any approach that passes user-supplied syntax to the production query planner must enumerate what is allowed (whitelist), never subtract what is forbidden (blacklist). That alone disqualifies the “parameterised raw SQL” approach.
2. Tenant isolation
Section titled “2. Tenant isolation”The MeteringDbContext already applies a tenant filter via
ApplyGranitConventions(currentTenant, dataFilter). However, that filter
operates on the IQueryable<T> LINQ tree — it has no effect on raw SQL passed
through db.Database.SqlQuery<T>(sql) or any user-controlled string.
Required guarantees, ranked:
-
Hard guarantee: the user cannot bypass
WHERE TenantId = @currentTenantregardless of what they write. Approaches that satisfy this:- Constrained DSL →
IQueryable(the global query filter applies) - AST validator that injects the tenant predicate at the appropriate position in the tree (mechanically possible but error-prone — every join, every CTE, every sub-query needs the predicate)
- PostgreSQL row-level security with
SET app.current_tenant_id = …per session;CREATE POLICY tenant_isolation ON meter_events USING (tenant_id = current_setting('app.current_tenant_id')::uuid). Works at the storage engine level, immune to query rewriting. Not portable to SQL Server (which requires the security-policy feature, separate setup).
- Constrained DSL →
-
Soft guarantee: a buggy user query is rejected with a clear error rather than silently returning wrong data. Approaches: every option above plus defensive
EXPLAIN (FORMAT JSON)post-parse to verify the planner sees the tenant predicate.
Decision driver: PostgreSQL RLS is the lowest-risk option if Granit can mandate the storage engine. Granit.Metering currently supports both PG and SQL Server (#1167); RLS would either fork the implementation or restrict ad-hoc SQL to the PG provider — acceptable, since this is admin tooling and most production deployments are PG.
3. Resource limits
Section titled “3. Resource limits”The default Granit.Metering workload runs hourly on commit-sized event
windows (≤ 100 K events for a busy tenant). An ad-hoc query against the same
table without an obvious index access plan can scan the entire history of
the meter_events table — 100 M rows for a year-old tenant.
| Mechanism | Behavior | Drawback |
|---|---|---|
statement_timeout per session (PG) / SET LOCK_TIMEOUT (SQL Server) | DB-side hard cancel | Per-connection setting; the connection pool resets it on return; need a BeforeOpen hook to reapply |
CommandTimeout on DbCommand | App-side cancel via CancellationToken | Cooperative — only checked at row materialisation, not during planner execution |
| Application-level cancellation token wired from HTTP timeout | Cooperative | Same as above |
Row-count cap via LIMIT N+1 → reject if N+1 returned | Truncates results, doesn’t bound cost | Doesn’t prevent the scan |
EXPLAIN before EXECUTE, reject if total_cost > X | Bound BEFORE work starts | PostgreSQL planner cost estimates are coarse; false positives common |
Read replica with stricter statement_timeout and lower memory | Isolates production load | Requires a replica — not in scope for Granit core |
Decision driver: combine statement_timeout (DB-side hard cancel — this is
the floor) with EXPLAIN-cost gating (rejects obvious foot-guns at meter creation
time). Run all ad-hoc queries through a dedicated DI-registered
DbConnection that always sets statement_timeout = 30000 so the limit
cannot accidentally leak into the standard ingestion connection.
4. Schema visibility
Section titled “4. Schema visibility”MeterEvent has 9 columns (Id, MeterDefinitionId, IdempotencyKey,
Quantity, Timestamp, Metadata, TenantId, DeprecatedAt,
DeprecationReason). Of these:
- Always exposed:
Quantity,Timestamp,Metadata— the three columns the user actually wants to compute against. - Always implicit:
TenantId(forced predicate, never selectable),MeterDefinitionId(joined automatically — the SQL is scoped to one meter). - Filtered out:
DeprecatedAt(ad-hoc queries see only active events; matches the aggregator’s behavior — see #1169). - Hidden:
Id,IdempotencyKey— internal plumbing, no billing relevance, surfaces a monotonic id that could leak insertion order to other tenants via timing.
Metadata is JSONB — exposing the JSON path operator (->>) is the whole
point of allowing ad-hoc SQL. Whether to expose -> (returns JSON) and #>>
(deep path) is an additional question; the principled answer is “all the
PostgreSQL JSONB path operators, because limiting to ->>'x' only covers the
flat-key case the existing CountDistinct.DistinctProperty already handles”.
5. Storage
Section titled “5. Storage”If shipped, the user-defined expression lives on MeterDefinition as a new
nullable column:
public string? SqlExpression { get; private set; } // domainpublic sealed class MeterDefinitionConfiguration { … }{ builder.Property(e => e.SqlExpression).HasMaxLength(4096);}Auditability:
- Versioning: every change goes through
MeterDefinition.Update(...)which is currently restricted toDraftstatus (#1165) — already auditable via the audit interceptor. - Provenance:
(CreatedBy, ModifiedBy)fromAuditedAggregateRootcapture who authored each version. - Rollback: if a SQL expression turns out to compute wrong values, the recompute machinery (#1167) handles re-running with the previous expression on the affected window — provided the previous expression is recoverable from the audit trail.
Versioning the expression itself (à la PlanPrice versioning in
Granit.Subscriptions) is appealing but doubles the storage model. A simpler
alternative: write each change as an audit log row, never overwrite history.
6. Validation pipeline
Section titled “6. Validation pipeline”When the user creates a meter with SqlExpression, run synchronously:
- Static validation (microseconds): tokenise + parse, reject on syntax errors before the meter is persisted.
- AST whitelist (microseconds): walk the tree, reject any node type not in the closed list (DML, DDL, SET, sub-query other than scalar aggregate, foreign-table reference, etc.).
- EXPLAIN against an empty
meter_events-shaped temp table (single-digit ms): catches column-name typos, unsupported function calls, broken JSONB paths. Cost estimate also bounds the worst-case fan-out. - Smoke execution against the last 5 minutes of events (50–500 ms):
verifies the query produces a
numeric-compatible scalar. The result is discarded.
If any step fails, the create returns 422 with a problem-detail mapping the
validation failure to a localised error code (mirrors the CountDistinct
validator pattern from #1166).
7. Performance
Section titled “7. Performance”Representative numbers for a busy SaaS tenant on Granit.Metering:
| Scenario | Events / period | SQL aggregate latency (PG, indexed (TenantId, Timestamp)) |
|---|---|---|
| Hourly aggregator on rolling window | ≤ 100 K | 50–150 ms per meter |
| On-demand recompute over a day (#1167) | 1–5 M | 300 ms – 2 s per meter |
Ad-hoc SQL with metadata->>'x' filter | 1–10 M | 200 ms – 8 s per meter (varies wildly with selectivity) |
Ad-hoc SQL with metadata->>'x' filter, no index | 10 M | 30 s – timeout |
The pathological case (no functional index on the JSONB path) is the one the
EXPLAIN gate must catch. Functional indexes on JSONB paths are user-defined
and cannot be inferred from the SqlExpression — Granit could ship a hint
that any metadata->>'foo' predicate triggers a recommendation to
CREATE INDEX ON meter_events ((metadata->>'foo'));, but the actual index
creation belongs to the application layer (Granit ships no migrations; #1167).
Bottom line: ad-hoc SQL on Granit’s PG-on-commodity-hardware target is viable up to ≈ 10 M events with the right indexes, an order of magnitude above what the existing primitives need. Beyond that, users should run on a columnar engine — at which point they likely shouldn’t be using Granit anyway.
8. Alternatives — constrained DSL
Section titled “8. Alternatives — constrained DSL”The most attractive alternative to raw SQL is a constrained predicate +
aggregate DSL that compiles to IQueryable<MeterEvent>. Sketch:
public sealed class MeterDefinition{ // additive in a future PR if we GO public string? PredicateExpression { get; private set; } // e.g. "metadata.region == 'us-east-1' && metadata.cached == false" public string? AggregateExpression { get; private set; } // e.g. "sum(quantity)" — closed list of fns}The DSL parser (custom, ~600 LOC for the closed grammar) translates to
IQueryable<MeterEvent> predicates and an Expression<Func<IQueryable<MeterEvent>, decimal>>
aggregator. EF Core then generates the SQL — and the existing tenant /
soft-delete / DeprecatedAt == null filters apply automatically because they
live on the IQueryable.
| Property | Raw SQL (whitelisted) | Constrained DSL → IQueryable |
|---|---|---|
| Tenant isolation | Manual (forced WHERE injection) | Automatic (global query filter) |
| Soft-delete + lifecycle filters | Manual | Automatic |
| Cross-DB portability | Per-provider parser | Native (EF Core does it) |
| Expressive power | High (joins, sub-queries, window fns) | Medium (predicate + aggregate; no joins yet) |
| Implementation cost | High (parser + validator + EXPLAIN gate per provider) | Medium (custom DSL parser + Expression-tree builder) |
| Maintenance cost over 5 years | High (track PG / SQL Server syntax churn) | Low (the DSL is a fixed grammar) |
| Failure mode of a parser bug | Tenant data leak | Compile error or wrong number — never a leak |
This is the recommended GO path, when GO becomes warranted.
Recommendation: DEFER
Section titled “Recommendation: DEFER”The framework should not ship raw SQL ad-hoc metric definitions today.
| Reason | Weight |
|---|---|
| The five built-in aggregation types cover ≥ 95 % of demonstrated B2B SaaS workloads (Granit.Showcase audit + maintainer’s own consulting history) | High |
| No concrete user has requested the feature — the demand signal is conjectural (“ORB has it, so we should”) rather than empirical (“we lost a deal because of it”) | High |
| The safe implementation paths each cost 2–4 weeks of focused work + ongoing maintenance, far above the marginal benefit of the missing 5 % of workloads | High |
| Failure modes are catastrophic — a parser bug or a missed tenant-isolation case directly leaks billing data across tenants (GDPR, ISO 27001 incident-grade severity) | Medium-high |
| Two interim escape hatches exist for users who genuinely need SQL today: pre-compute at ingestion time, or run the query against the application’s own DbContext | Medium |
Path forward (when DEFER expires)
Section titled “Path forward (when DEFER expires)”Re-evaluate when at least three independent users request the capability for a scenario the existing five primitives demonstrably cannot serve. At that point:
- Implement the constrained DSL →
IQueryable<MeterEvent>path (Question 8 alternative). 2–3 week effort, zero new attack surface beyond what EF Core already protects. - Open a tracking issue with three concrete use-cases (each linked to a real user / tenant). Without three, the conjectural demand is not enough to justify the maintenance burden.
- Treat raw SQL as a non-goal — even if the DSL turns out to be insufficient, the next escalation is to extend the DSL grammar, not to expose raw SQL. Raw SQL on a multi-tenant table is a category of feature the framework should not own.
Spike artifacts
Section titled “Spike artifacts”- This document.
- No PoC implementation in this iteration. The spike concluded DEFER before reaching the PoC step (Question 1’s analysis was decisive enough to halt early). A PoC for the constrained DSL path will be built the day the GO criteria are met.
- Issue #1171 updated with the result.