IoT Operations
Granit.IoT.BackgroundJobs ships three recurring jobs that keep a
multi-tenant IoT system predictable from day 1 to year 5: retention
enforcement (GDPR), offline-device detection (with anti-flapping), and
partition maintenance. Retention windows and heartbeat timeouts are
configurable per tenant through Granit.Settings; partition maintenance
needs no configuration.
What problem does this package solve?
Telemetry tables grow without supervision. Three failure modes recur in every IoT deployment:
- Unbounded storage. At 100,000 devices × 6 publishes/min, you add 600,000 rows per minute, or roughly 26 billion rows per 30-day month. A year in, your cheapest query is slow and your backups are unworkable.
- Silent device failures. A sensor stops publishing. Nobody notices until the dashboard goes flat. Without automated heartbeat detection, “is my fleet online?” is a human question.
- Partition bookkeeping. Monthly partitioning works — until the day a write arrives for next month and PostgreSQL errors out because the partition was never created.
This package closes all three.
The three jobs
```mermaid
flowchart LR
    subgraph "Granit.IoT.BackgroundJobs"
        P["StaleTelemetryPurgeJob<br/>03:00 UTC daily"]
        H["DeviceHeartbeatTimeoutJob<br/>every 5 min"]
        M["TelemetryPartitionMaintenanceJob<br/>Sundays 01:00 UTC"]
    end
    P -->|ExecuteDeleteAsync per retention bucket| DB[("iot_telemetry_points")]
    H -->|FindStaleAsync + publish| ETO["DeviceOfflineDetectedEto"]
    M -->|CREATE TABLE IF NOT EXISTS| DB
    ETO -.->|Granit.IoT.Notifications| NOT["Email / Push / SMS"]
```
All three jobs are IBackgroundJob records with [RecurringJob] attributes.
Wolverine’s single-leader scheduling makes sure they run on exactly one
node — safe in a horizontally scaled deployment.
StaleTelemetryPurgeJob — GDPR retention enforcement
Schedule: `0 3 * * *` (daily, 03:00 UTC). Purpose: delete telemetry
older than the per-tenant retention window.
How it scales — bucketed deletes, not N+1
Naively looping `DELETE ... WHERE tenant_id = @t AND recorded_at < @cutoff`
per tenant scales linearly, which is painful past 1,000 tenants. The purge job
does something smarter:
- Scan distinct `TenantId` values from `iot_devices` (cheap: tens of thousands of devices, not hundreds of millions of telemetry rows).
- Resolve each tenant’s effective `IoT:TelemetryRetentionDays` via `Granit.Settings` (FusionCache-backed, microsecond lookups after warmup).
- Group tenants by their effective retention value. In production, ~80% of tenants use the default, so 10,000 tenants typically collapse into 2-4 buckets.
- Issue one `ExecuteDeleteAsync` per bucket with `WHERE tenant_id = ANY(@array) AND recorded_at < @cutoff`.
| Scenario | Tenant count | SQL DELETEs issued |
|---|---|---|
| All tenants on default 365 days | 10,000 | 1 |
| 90% default, 10% custom (30 or 730 days) | 10,000 | 3 |
| Every tenant different | 10,000 | 10,000 (worst case) |
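The grouping step can be sketched in a few lines of C#. This is a minimal illustration, not the package's implementation: the `RetentionBuckets` class, its method names, and the use of `Guid` for tenant IDs are assumptions made for the example. It also applies the "0 disables purge" rule by dropping those tenants before grouping.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: names and the Guid tenant key are assumptions for illustration.
public static class RetentionBuckets
{
    // Groups tenant IDs by their effective retention (in days).
    // Tenants whose retention is 0 are skipped, since 0 disables the purge.
    public static IReadOnlyDictionary<int, Guid[]> Group(
        IEnumerable<(Guid TenantId, int RetentionDays)> tenants)
    {
        return tenants
            .Where(t => t.RetentionDays > 0)
            .GroupBy(t => t.RetentionDays)
            .ToDictionary(g => g.Key, g => g.Select(t => t.TenantId).ToArray());
    }

    // One cutoff per bucket: rows recorded before this instant are eligible for deletion.
    public static DateTimeOffset Cutoff(DateTimeOffset nowUtc, int retentionDays)
        => nowUtc.AddDays(-retentionDays);
}
```

Each dictionary entry then maps to a single delete: the array becomes the `ANY(@array)` parameter and `Cutoff` supplies `@cutoff`, so the number of statements equals the number of distinct retention values rather than the number of tenants.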
A 30-minute hard deadline wraps the job on a 24 h cycle — a runaway delete never overlaps the next run.
Configuration
| Key | Default | Purpose |
|---|---|---|
| `IoT:TelemetryRetentionDays` | 365 | Days of telemetry kept per tenant. Set 0 to disable purge for a tenant |
DeviceHeartbeatTimeoutJob — offline detection
Schedule: `*/5 * * * *` (every 5 minutes). Purpose: flag devices
whose last heartbeat is older than the tenant’s timeout, publish
DeviceOfflineDetectedEto, and suppress re-alerts on flaky links via
an in-memory tracker cache.
```mermaid
sequenceDiagram
    participant J as DeviceHeartbeatTimeoutJob
    participant R as IDeviceReader
    participant C as DeviceOfflineTrackerCache
    participant B as IDistributedEventBus
    participant N as Granit.IoT.Notifications
    J->>R: FindStaleAsync(tenantBucket, cutoff, 5000)
    R-->>J: Device[] (LastHeartbeatAt null or < cutoff)
    loop per stale device
        J->>C: TryAdd(deviceId, ttl)
        alt newly added
            C-->>J: true
            J->>B: Publish DeviceOfflineDetectedEto
            B->>N: DeviceOfflineNotificationHandler
        else already tracked
            C-->>J: false (suppressed)
        end
    end
```
Why the tracker cache matters
A device on a flaky cellular link can drop and reconnect multiple times
per hour. Publishing an alert on every 5-minute cycle is spam.
DeviceOfflineTrackerCache uses IMemoryCache keyed by device ID with
TTL = IoT:HeartbeatOfflineNotificationCacheMinutes (default 60 min).
The first offline detection publishes; subsequent detections within
the TTL silently skip. When telemetry resumes, the
TelemetryIngestedHandler calls Forget(deviceId) so future
disappearances are re-eligible.
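The suppression semantics can be illustrated with a small, dependency-free sketch. It is not the package's `DeviceOfflineTrackerCache`: the class name `OfflineTrackerSketch` is hypothetical, and a `ConcurrentDictionary` of expiry timestamps stands in for `IMemoryCache` so the example runs without extra packages. An injectable clock is added only to make the TTL behaviour easy to exercise.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for the tracker cache; the real one is described as IMemoryCache-backed.
public sealed class OfflineTrackerSketch
{
    private readonly ConcurrentDictionary<Guid, DateTimeOffset> _expiry = new();
    private readonly Func<DateTimeOffset> _clock;

    public OfflineTrackerSketch() : this(() => DateTimeOffset.UtcNow) { }

    public OfflineTrackerSketch(Func<DateTimeOffset> clock) => _clock = clock;

    // Returns true when the device was not tracked (or its entry expired),
    // meaning the caller should publish the offline event.
    public bool TryAdd(Guid deviceId, TimeSpan ttl)
    {
        var now = _clock();
        var added = false;
        _expiry.AddOrUpdate(
            deviceId,
            _ => { added = true; return now + ttl; },
            (_, existing) =>
            {
                // The factory may be retried under contention, so recompute the flag each time.
                added = existing <= now;
                return added ? now + ttl : existing;
            });
        return added;
    }

    // Called when telemetry resumes, so the next disappearance alerts again.
    public void Forget(Guid deviceId) => _expiry.TryRemove(deviceId, out _);
}
```

With a 60-minute TTL, a device that is still stale on the next twelve 5-minute cycles produces one event instead of thirteen; calling `Forget` on recovery resets that budget immediately rather than waiting for expiry.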
Bucketed per-tenant timeout
Same bucketing pattern as the purge job: tenants are grouped by their
effective IoT:HeartbeatTimeoutMinutes, and FindStaleAsync is called
once per bucket with tenant_id = ANY(@array).
A 4-minute hard deadline on a 5-minute cron prevents overlapping runs even under unusual load.
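One common way to impose a hard deadline like this in .NET is a linked `CancellationTokenSource` with `CancelAfter`. The sketch below shows that general technique only; how the Granit jobs actually enforce their deadlines is not stated here, and the `JobDeadline` helper is a hypothetical name. It assumes the job body honours the token it is given.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper illustrating a deadline wrapper; not part of the package.
public static class JobDeadline
{
    // Runs `work` with a token that fires when the caller cancels or the
    // deadline elapses, whichever comes first. Returns false on a deadline hit.
    public static async Task<bool> RunAsync(
        Func<CancellationToken, Task> work,
        TimeSpan deadline,
        CancellationToken outer = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(outer);
        cts.CancelAfter(deadline);
        try
        {
            await work(cts.Token);
            return true; // finished inside the deadline
        }
        catch (OperationCanceledException)
            when (cts.IsCancellationRequested && !outer.IsCancellationRequested)
        {
            return false; // deadline reached; the next scheduled run starts clean
        }
    }
}
```

A cancellation requested by the caller still propagates as an exception, so a host shutdown is not mistaken for a deadline overrun.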
Configuration
| Key | Default | Purpose |
|---|---|---|
| `IoT:HeartbeatTimeoutMinutes` | 15 | Minutes without heartbeat before flagging offline. Set 0 per tenant to disable |
| `IoT:HeartbeatOfflineNotificationCacheMinutes` | 60 | Tracker TTL preventing alert spam on flaky links |
TelemetryPartitionMaintenanceJob — ahead-of-time partition creation
Schedule: `0 1 * * 0` (Sundays, 01:00 UTC). Purpose: provision
next-month and next-next-month partitions so writes never fail at a
boundary transition.
The job is a graceful no-op when the parent table is not partitioned —
it checks pg_partitioned_table and logs a single warning. This lets
non-partitioned deployments keep the job registered without side effects.
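To make the "next-month and next-next-month" window concrete, here is a rough sketch of how the month ranges and an idempotent statement could be derived. The `PartitionPlanner` class and the `<table>_yyyy_MM` naming scheme are assumptions for illustration; the job's real naming and SQL are not shown in this page.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical planner; partition naming is an assumption, not the package's scheme.
public static class PartitionPlanner
{
    // Yields the [from, to) ranges for next month and the month after.
    public static IEnumerable<(DateTime From, DateTime To)> UpcomingMonths(DateTime todayUtc)
    {
        var firstOfThisMonth =
            new DateTime(todayUtc.Year, todayUtc.Month, 1, 0, 0, 0, DateTimeKind.Utc);
        for (var offset = 1; offset <= 2; offset++)
        {
            var from = firstOfThisMonth.AddMonths(offset);
            yield return (from, from.AddMonths(1));
        }
    }

    // Builds PostgreSQL DDL that is safe to re-run every week.
    public static string BuildDdl(string parentTable, DateTime from, DateTime to)
    {
        var inv = CultureInfo.InvariantCulture;
        var suffix = from.ToString("yyyy_MM", inv);
        var lower = from.ToString("yyyy-MM-dd", inv);
        var upper = to.ToString("yyyy-MM-dd", inv);
        return $"CREATE TABLE IF NOT EXISTS {parentTable}_{suffix} " +
               $"PARTITION OF {parentTable} " +
               $"FOR VALUES FROM ('{lower}') TO ('{upper}');";
    }
}
```

Run on any day in December 2025, the planner yields January and February 2026. Because the statement uses `IF NOT EXISTS`, a weekly schedule can revisit the same months several times without error.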
Enabling partitioning
```csharp
protected override void Up(MigrationBuilder migrationBuilder)
{
    migrationBuilder.EnableTelemetryPartitioning();
    migrationBuilder.CreateTelemetryPartition(2026, 4);
    migrationBuilder.CreateTelemetryPartition(2026, 5);
    // Subsequent months provisioned automatically by the job
}
```

Every partition carries its own BRIN(recorded_at) and GIN(metrics)
indexes. Dropping the partition drops the indexes with it, which is why
GDPR erasure at a month boundary is O(1).
Configuration
None. The job infers the schema and table name from IoTDbContext.
Registration
The jobs are bundled in Granit.Bundle.IoT:

```csharp
builder.Services.AddGranit(builder.Configuration).AddIoT();
```

Or register individually (e.g. when not using the bundle):

```csharp
builder.Services.AddGranitIoTBackgroundJobs();
```

GranitIoTBackgroundJobsModule depends on GranitIoTModule,
GranitBackgroundJobsModule, and GranitSettingsModule. DI order is
driven by [DependsOn], not by the order of these calls.
Observability
| Metric | Tags | Fires when |
|---|---|---|
| `granit.iot.background.telemetry_purged` | `tenant_id` | Purge job deleted rows for a tenant |
| `granit.iot.device.offline_detected` | `tenant_id` | Heartbeat job flagged a device (first detection only) |
| `granit.iot.background.partition_created` | `partition_name` | Future partition created |
| `granit.iot.alerts.throttled` | `tenant_id`, `metric_name` | Notification bridge suppressed an alert |
Wolverine also emits job-level telemetry (wolverine.job.execution_duration,
wolverine.job.failures) — wire both to Grafana for a single pane of glass.
Anti-patterns to avoid
See also
- Device management: `RecordHeartbeat` and `LastHeartbeatAt`
- Telemetry ingestion: the pipeline that feeds the heartbeat
- Notifications bridge: where `DeviceOfflineDetectedEto` becomes an alert
- Timeline bridge: device state changes as audit chatter
- Time-series storage: when to enable partitioning