IoT Operations

Granit.IoT.BackgroundJobs ships three recurring jobs that keep a multi-tenant IoT system predictable from day 1 to year 5: retention enforcement (GDPR), offline-device detection (with anti-flapping), and partition maintenance. Retention and heartbeat thresholds are configurable per tenant through Granit.Settings.

Telemetry tables grow without supervision. Three failure modes recur in every IoT deployment:

  • Unbounded storage. At 1,000 devices × 6 publishes/min, you add ~260 million rows per month (1,000 × 6 × 43,200 minutes in a 30-day month). A year in, your cheapest query is slow and your backups are unworkable.
  • Silent device failures. A sensor stops publishing. Nobody notices until the dashboard goes flat. Without automated heartbeat detection, “is my fleet online?” is a human question.
  • Partition bookkeeping. Monthly partitioning works — until the day a write arrives for next month and PostgreSQL errors out because the partition was never created.

This package closes all three.

flowchart LR
  subgraph "Granit.IoT.BackgroundJobs"
    P["StaleTelemetryPurgeJob<br/>03:00 UTC daily"]
    H["DeviceHeartbeatTimeoutJob<br/>every 5 min"]
    M["TelemetryPartitionMaintenanceJob<br/>Sundays 01:00 UTC"]
  end
  P -->|ExecuteDeleteAsync per retention bucket| DB[("iot_telemetry_points")]
  H -->|FindStaleAsync + publish| ETO["DeviceOfflineDetectedEto"]
  M -->|CREATE TABLE IF NOT EXISTS| DB
  ETO -.->|Granit.IoT.Notifications| NOT["Email / Push / SMS"]

All three jobs are IBackgroundJob records with [RecurringJob] attributes. Wolverine’s single-leader scheduling makes sure they run on exactly one node — safe in a horizontally scaled deployment.

StaleTelemetryPurgeJob — GDPR retention enforcement

Schedule: 0 3 * * * (daily, 03:00 UTC). Purpose: delete telemetry older than the per-tenant retention window.

How it scales — bucketed deletes, not N+1

Naively looping DELETE WHERE tenant_id = @t AND recorded_at < @cutoff per tenant scales linearly — painful past 1,000 tenants. The purge job does something smarter:

  1. Scan distinct TenantId values from iot_devices (cheap — tens of thousands of devices, not hundreds of millions of telemetry rows).
  2. Resolve each tenant’s effective IoT:TelemetryRetentionDays via Granit.Settings (FusionCache-backed, microsecond lookups after warmup).
  3. Group tenants by their effective retention value — in production, ~80% of tenants use the default, so 10,000 tenants typically collapse into 2-4 buckets.
  4. Issue one ExecuteDeleteAsync per bucket with WHERE tenant_id = ANY(@array) AND recorded_at < @cutoff.

| Scenario | Tenant count | SQL DELETEs issued |
| --- | --- | --- |
| All tenants on default 365 days | 10,000 | 1 |
| 90% default, 10% custom (30 or 730 days) | 10,000 | 3 |
| Every tenant different | 10,000 | 10,000 (worst case) |
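
The grouping step can be sketched in a few lines of LINQ. This is a minimal illustration, not the package's implementation: `PurgeBucket`, `RetentionBucketing`, and the `Guid` tenant key are hypothetical names, and the per-tenant retention values are assumed to have been resolved already (the real job reads them from Granit.Settings).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape: one bucket corresponds to one bulk delete.
public sealed record PurgeBucket(int RetentionDays, DateTimeOffset Cutoff, Guid[] TenantIds);

public static class RetentionBucketing
{
    // retentionByTenant: effective IoT:TelemetryRetentionDays per tenant,
    // resolved by the caller (assumed input for this sketch).
    public static IReadOnlyList<PurgeBucket> Build(
        IReadOnlyDictionary<Guid, int> retentionByTenant,
        DateTimeOffset now)
    {
        return retentionByTenant
            // 0 disables purge for a tenant; negative values are treated the same here.
            .Where(kv => kv.Value > 0)
            // Tenants sharing a retention value share a cutoff, hence one DELETE.
            .GroupBy(kv => kv.Value, kv => kv.Key)
            .OrderBy(g => g.Key)
            .Select(g => new PurgeBucket(g.Key, now.AddDays(-g.Key), g.ToArray()))
            .ToList();
    }
}
```

Each resulting bucket maps to one statement of the form `WHERE tenant_id = ANY(@array) AND recorded_at < @cutoff`, which is why the statement count tracks the number of distinct retention values rather than the number of tenants.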

A 30-minute hard deadline wraps the job on a 24 h cycle — a runaway delete never overlaps the next run.
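
One common way to impose such a deadline in .NET is a linked `CancellationTokenSource` with `CancelAfter`. The sketch below assumes that approach; `JobDeadline` is a hypothetical helper, and the source does not say how the package implements its deadline.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class JobDeadline
{
    // Runs `work` with a token that is cancelled when either the host token
    // fires or the deadline elapses, whichever comes first.
    public static async Task RunAsync(
        Func<CancellationToken, Task> work,
        TimeSpan deadline,
        CancellationToken hostToken = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(hostToken);
        cts.CancelAfter(deadline);
        await work(cts.Token);
    }
}
```

Work that observes the token (for example by passing it to EF Core's async methods) ends with an `OperationCanceledException` once the deadline passes. A call such as `JobDeadline.RunAsync(ct => purge(ct), TimeSpan.FromMinutes(30))` would mirror the 30-minute window described above.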

| Key | Default | Purpose |
| --- | --- | --- |
| IoT:TelemetryRetentionDays | 365 | Days of telemetry kept per tenant. Set 0 to disable purge for a tenant |

DeviceHeartbeatTimeoutJob — offline detection

Schedule: */5 * * * * (every 5 minutes). Purpose: flag devices whose last heartbeat is older than the tenant’s timeout, publish DeviceOfflineDetectedEto, and suppress re-alerts on flaky links via an in-memory tracker cache.

sequenceDiagram
  participant J as DeviceHeartbeatTimeoutJob
  participant R as IDeviceReader
  participant C as DeviceOfflineTrackerCache
  participant B as IDistributedEventBus
  participant N as Granit.IoT.Notifications
  J->>R: FindStaleAsync(tenantBucket, cutoff, 5000)
  R-->>J: Device[] (LastHeartbeatAt null or < cutoff)
  loop per stale device
    J->>C: TryAdd(deviceId, ttl)
    alt newly added
      C-->>J: true
      J->>B: Publish DeviceOfflineDetectedEto
      B->>N: DeviceOfflineNotificationHandler
    else already tracked
      C-->>J: false (suppressed)
    end
  end

A device on a flaky cellular link can drop and reconnect multiple times per hour. Publishing an alert on every 5-minute cycle is spam. DeviceOfflineTrackerCache uses IMemoryCache keyed by device ID with TTL = IoT:HeartbeatOfflineNotificationCacheMinutes (default 60 min). The first offline detection publishes; subsequent detections within the TTL silently skip. When telemetry resumes, the TelemetryIngestedHandler calls Forget(deviceId) so future disappearances are re-eligible.
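
The suppress-within-TTL behaviour can be sketched without any framework dependency. `OfflineTracker` below is a hypothetical stand-in built on `ConcurrentDictionary` with an injectable clock; the package's own DeviceOfflineTrackerCache uses IMemoryCache, and its exact signatures are not shown in this document.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical stand-in for the tracker, for illustration only.
public sealed class OfflineTracker
{
    private readonly ConcurrentDictionary<Guid, DateTimeOffset> _expiresAt = new();
    private readonly TimeSpan _ttl;
    private readonly Func<DateTimeOffset> _clock;

    public OfflineTracker(TimeSpan ttl) : this(ttl, () => DateTimeOffset.UtcNow) { }

    public OfflineTracker(TimeSpan ttl, Func<DateTimeOffset> clock)
    {
        _ttl = ttl;
        _clock = clock;
    }

    // True: device was untracked or its entry expired, so the caller publishes.
    // False: an unexpired entry exists, so the caller suppresses the alert.
    public bool TryAdd(Guid deviceId)
    {
        var now = _clock();
        var added = false;
        // Every branch sets the flag, so the delegate that finally wins decides it
        // even if AddOrUpdate retries under contention.
        _expiresAt.AddOrUpdate(
            deviceId,
            _ => { added = true; return now + _ttl; },
            (_, existing) =>
            {
                if (existing > now) { added = false; return existing; }
                added = true;
                return now + _ttl;
            });
        return added;
    }

    // Called when telemetry resumes, so the next outage alerts again.
    public void Forget(Guid deviceId) => _expiresAt.TryRemove(deviceId, out _);
}
```

With a 60-minute TTL, the first `TryAdd` for a device returns true, repeated calls on later 5-minute cycles return false, and a call after `Forget` or after the TTL has elapsed returns true again.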

Same bucketing pattern as the purge job: tenants are grouped by their effective IoT:HeartbeatTimeoutMinutes, and FindStaleAsync is called once per bucket with tenant_id = ANY(@array).

A 4-minute hard deadline on a 5-minute cron prevents overlapping runs even under unusual load.

| Key | Default | Purpose |
| --- | --- | --- |
| IoT:HeartbeatTimeoutMinutes | 15 | Minutes without heartbeat before flagging offline. Set 0 per tenant to disable |
| IoT:HeartbeatOfflineNotificationCacheMinutes | 60 | Tracker TTL preventing alert spam on flaky links |

TelemetryPartitionMaintenanceJob — ahead-of-time partition creation

Schedule: 0 1 * * 0 (Sundays, 01:00 UTC). Purpose: provision next-month and next-next-month partitions so writes never fail at a boundary transition.

The job is a graceful no-op when the parent table is not partitioned — it checks pg_partitioned_table and logs a single warning. This lets non-partitioned deployments keep the job registered without side effects.
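
To make the "next month and the month after" targets concrete, here is a sketch of how the two month boundaries and the idempotent DDL could be derived. `PartitionPlanner` and the `<parent>_yyyy_MM` naming scheme are assumptions for illustration; the actual partition names and SQL used by the job are not given in this document.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class PartitionPlanner
{
    // First day of next month and of the month after, relative to `today`.
    // AddMonths handles the December-to-January rollover.
    public static IReadOnlyList<DateOnly> UpcomingMonths(DateOnly today)
    {
        var first = new DateOnly(today.Year, today.Month, 1);
        return new[] { first.AddMonths(1), first.AddMonths(2) };
    }

    // Builds idempotent DDL for one monthly range partition.
    // The partition naming scheme is hypothetical.
    public static string CreateSql(string parentTable, DateOnly monthStart)
    {
        var next = monthStart.AddMonths(1);
        var name = $"{parentTable}_{monthStart.Year:D4}_{monthStart.Month:D2}";
        var from = monthStart.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
        var to = next.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture);
        // The upper bound is exclusive in PostgreSQL range partitioning.
        return $"CREATE TABLE IF NOT EXISTS {name} PARTITION OF {parentTable} " +
               $"FOR VALUES FROM ('{from}') TO ('{to}');";
    }
}
```

Because the statement uses `IF NOT EXISTS`, running it every Sunday is harmless: weeks in which both partitions already exist change nothing.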

protected override void Up(MigrationBuilder migrationBuilder)
{
    migrationBuilder.EnableTelemetryPartitioning();
    migrationBuilder.CreateTelemetryPartition(2026, 4);
    migrationBuilder.CreateTelemetryPartition(2026, 5);
    // Subsequent months are provisioned automatically by the job
}

Every partition carries its own BRIN(recorded_at) and GIN(metrics) indexes — dropping the partition drops the indexes with it, which is why GDPR erasure at a month boundary is O(1).

Settings: none. The job infers the schema and table name from IoTDbContext.

The jobs are bundled in Granit.Bundle.IoT:

builder.Services.AddGranit(builder.Configuration).AddIoT();

Or register individually (e.g. when not using the bundle):

builder.Services.AddGranitIoTBackgroundJobs();

GranitIoTBackgroundJobsModule depends on GranitIoTModule, GranitBackgroundJobsModule, and GranitSettingsModule — DI order is driven by [DependsOn], not by the order of these calls.

| Metric | Tags | Fires when |
| --- | --- | --- |
| granit.iot.background.telemetry_purged | tenant_id | Purge job deleted rows for a tenant |
| granit.iot.device.offline_detected | tenant_id | Heartbeat job flagged a device (first detection only) |
| granit.iot.background.partition_created | partition_name | Future partition created |
| granit.iot.alerts.throttled | tenant_id, metric_name | Notification bridge suppressed an alert |

Wolverine also emits job-level telemetry (wolverine.job.execution_duration, wolverine.job.failures) — wire both to Grafana for a single pane of glass.