IoT Operations

Granit.IoT.BackgroundJobs ships three recurring jobs that keep a multi-tenant IoT system predictable from day 1 to year 5: retention enforcement (GDPR), offline-device detection (with anti-flapping), and partition maintenance. Retention and heartbeat thresholds are configurable per tenant through Granit.Settings; partition maintenance needs no configuration.

What problem does this package solve?

Telemetry tables grow without supervision. Three failure modes recur in every IoT deployment:

Unbounded storage. At 100,000 devices × 6 publishes/min, you add 600,000 rows per minute, or roughly 26 billion rows per 30-day month. A year in, your cheapest query is slow and your backups are unworkable.
Silent device failures. A sensor stops publishing. Nobody notices until the dashboard goes flat. Without automated heartbeat detection, “is my fleet online?” is a human question.
Partition bookkeeping. Monthly partitioning works — until the day a write arrives for next month and PostgreSQL errors out because the partition was never created.

This package closes all three.

The three jobs

flowchart LR
  subgraph "Granit.IoT.BackgroundJobs"
    P["StaleTelemetryPurgeJob<br/>03:00 UTC daily"]
    H["DeviceHeartbeatTimeoutJob<br/>every 5 min"]
    M["TelemetryPartitionMaintenanceJob<br/>Sundays 01:00 UTC"]
  end
  P -->|ExecuteDeleteAsync per retention bucket| DB[("iot_telemetry_points")]
  H -->|FindStaleAsync + publish| ETO["DeviceOfflineDetectedEto"]
  M -->|CREATE TABLE IF NOT EXISTS| DB
  ETO -.->|Granit.IoT.Notifications| NOT["Email / Push / SMS"]

All three jobs are IBackgroundJob records with [RecurringJob] attributes. Wolverine’s single-leader scheduling makes sure they run on exactly one node — safe in a horizontally scaled deployment.

StaleTelemetryPurgeJob — GDPR retention enforcement

Schedule: 0 3 * * * (daily, 03:00 UTC). Purpose: delete telemetry older than the per-tenant retention window.

How it scales — bucketed deletes, not N+1

Naively looping DELETE WHERE tenant_id = @t AND recorded_at < @cutoff per tenant scales linearly — painful past 1,000 tenants. The purge job does something smarter:

Scan distinct TenantId values from iot_devices (cheap — tens of thousands of devices, not hundreds of millions of telemetry rows).
Resolve each tenant’s effective IoT:TelemetryRetentionDays via Granit.Settings (FusionCache-backed, microsecond lookups after warmup).
Group tenants by their effective retention value — in production, ~80% of tenants use the default, so 10,000 tenants typically collapse into 2-4 buckets.
Issue one ExecuteDeleteAsync per bucket with WHERE tenant_id = ANY(@array) AND recorded_at < @cutoff.
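
The grouping step above can be sketched in a few lines of LINQ. This is a minimal illustration, not the package's code: the `PurgeBucket` record, the `RetentionBucketing` class, and the use of `Guid` for tenant IDs are assumptions made for the example, and the retention values are taken as already resolved from Granit.Settings.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical shape for this sketch: one bucket corresponds to one DELETE statement.
public sealed record PurgeBucket(int RetentionDays, Guid[] TenantIds, DateTimeOffset Cutoff);

public static class RetentionBucketing
{
    // retentionByTenant maps tenant id -> effective IoT:TelemetryRetentionDays.
    public static IReadOnlyList<PurgeBucket> Build(
        IReadOnlyDictionary<Guid, int> retentionByTenant,
        DateTimeOffset now)
    {
        return retentionByTenant
            .Where(kv => kv.Value > 0)               // 0 disables purge for that tenant
            .GroupBy(kv => kv.Value, kv => kv.Key)   // group tenant ids by retention value
            .Select(g => new PurgeBucket(g.Key, g.ToArray(), now.AddDays(-g.Key)))
            .OrderBy(b => b.RetentionDays)
            .ToList();
    }
}
```

Each resulting bucket carries exactly the two parameters the delete needs: the tenant array for `tenant_id = ANY(@array)` and the cutoff for `recorded_at < @cutoff`. The number of buckets, not the number of tenants, determines how many statements are issued.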

Scenario	Tenant count	SQL DELETEs issued
All tenants on default 365 days	10,000	1
90% default, 10% custom (30 or 730 days)	10,000	3
Every tenant different	10,000	10,000 (worst case)

A 30-minute hard deadline wraps the job on a 24 h cycle — a runaway delete never overlaps the next run.
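
One common way to express such a deadline in .NET is a linked `CancellationTokenSource` with `CancelAfter`. The sketch below assumes that approach; the source does not say how the job enforces its deadline, and `JobDeadline` is a hypothetical helper name.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class JobDeadline
{
    // Runs `work` with a token that is cancelled when the caller cancels
    // or when the deadline elapses, whichever happens first.
    public static async Task<bool> RunAsync(
        Func<CancellationToken, Task> work,
        TimeSpan deadline,
        CancellationToken outer = default)
    {
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(outer);
        cts.CancelAfter(deadline);
        try
        {
            await work(cts.Token);
            return true;    // finished inside the deadline
        }
        catch (OperationCanceledException) when (!outer.IsCancellationRequested)
        {
            // Deadline hit. For a retention delete, rows left behind still
            // satisfy the cutoff on the next scheduled run.
            return false;
        }
    }
}
```

A caller would pass the delete loop as `work` and `TimeSpan.FromMinutes(30)` as the deadline. Cancellation requested by the host itself is deliberately not swallowed, so shutdown still propagates.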

Configuration

Key	Default	Purpose
`IoT:TelemetryRetentionDays`	`365`	Days of telemetry kept per tenant. Set `0` to disable purge for a tenant

DeviceHeartbeatTimeoutJob — offline detection

Schedule: */5 * * * * (every 5 minutes). Purpose: flag devices whose last heartbeat is older than the tenant’s timeout, publish DeviceOfflineDetectedEto, and suppress re-alerts on flaky links via a distributed tracker cache.

Flow

sequenceDiagram
  participant J as DeviceHeartbeatTimeoutJob
  participant R as IDeviceReader
  participant C as DeviceOfflineTrackerCache
  participant B as IDistributedEventBus
  participant N as Granit.IoT.Notifications
  J->>R: FindStaleAsync(tenantBucket, cutoff, 5000)
  R-->>J: Device[] (LastHeartbeatAt null or &lt; cutoff)
  loop per stale device
    J->>C: TryAddAsync(deviceId, ttl)
    alt newly added
      C-->>J: true
      J->>B: Publish DeviceOfflineDetectedEto
      B->>N: DeviceOfflineNotificationHandler
    else already tracked
      C-->>J: false (suppressed)
    end
  end

Why the tracker cache matters

A device on a flaky cellular link can drop and reconnect multiple times per hour. Publishing an alert on every 5-minute cycle is spam. DeviceOfflineTrackerCache uses IDistributedCache — Redis in production, with an in-process distributed-memory fallback for standalone hosts — keyed by device ID with TTL = IoT:HeartbeatOfflineNotificationCacheMinutes (default 60 min). A distributed store is deliberate: the single-leader heartbeat job sets the flag on one node while any replica’s recovery handler clears it, so both paths must observe the same cluster-wide store. The first offline detection publishes; subsequent detections within the TTL silently skip. When telemetry resumes, a separate TelemetryRecoveredHandler calls ForgetAsync(deviceId) so future disappearances are re-eligible.
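
The suppression semantics can be illustrated with an in-memory stand-in. This sketch is not the package's `DeviceOfflineTrackerCache` (which the text says is backed by `IDistributedCache`); `OfflineTrackerSketch` is a hypothetical class that only mirrors the described behaviour: first detection wins, repeats inside the TTL are suppressed, and forgetting a device re-arms it.

```csharp
using System;
using System.Collections.Concurrent;

// In-memory illustration only; a real deployment needs a cluster-wide store.
public sealed class OfflineTrackerSketch
{
    private readonly ConcurrentDictionary<Guid, DateTimeOffset> _expiresAt = new();
    private readonly TimeSpan _ttl;

    public OfflineTrackerSketch(TimeSpan ttl) => _ttl = ttl;

    // Returns true only for the first detection inside the TTL window.
    public bool TryAdd(Guid deviceId, DateTimeOffset now)
    {
        var newExpiry = now + _ttl;
        while (true)
        {
            if (_expiresAt.TryGetValue(deviceId, out var current))
            {
                if (current > now) return false;   // still tracked: suppress the alert
                // Entry has expired: re-arm it, unless another thread changed it first.
                if (_expiresAt.TryUpdate(deviceId, newExpiry, current)) return true;
            }
            else if (_expiresAt.TryAdd(deviceId, newExpiry))
            {
                return true;                       // first detection: publish
            }
            // Lost a race with another caller; loop and re-evaluate.
        }
    }

    // Called when telemetry resumes so the next disappearance alerts again.
    public void Forget(Guid deviceId) => _expiresAt.TryRemove(deviceId, out _);
}
```

With a 60-minute TTL, a device detected at 12:00 and again at 12:05 produces one alert. If telemetry resumes at 12:10 and `Forget` is called, a new outage at 12:30 alerts again without waiting for the TTL to lapse.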

Bucketed per-tenant timeout

Same bucketing pattern as the purge job: tenants are grouped by their effective IoT:HeartbeatTimeoutMinutes, and FindStaleAsync is called once per bucket with tenant_id = ANY(@array).
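
As an in-memory analogue of what one bucketed `FindStaleAsync` call selects, the sketch below applies the staleness rule from the flow diagram (`LastHeartbeatAt` null or older than the cutoff) to a set of devices. The `DeviceSnapshot` record and `StaleDeviceFilter` class are assumptions for illustration; the real query runs in the database.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical projection of a device row, used only in this sketch.
public sealed record DeviceSnapshot(Guid Id, Guid TenantId, DateTimeOffset? LastHeartbeatAt);

public static class StaleDeviceFilter
{
    public static IReadOnlyList<DeviceSnapshot> FindStale(
        IEnumerable<DeviceSnapshot> devices,
        IReadOnlySet<Guid> tenantBucket,   // tenants sharing one IoT:HeartbeatTimeoutMinutes value
        int timeoutMinutes,
        DateTimeOffset now,
        int maxResults)
    {
        if (timeoutMinutes <= 0)
            return Array.Empty<DeviceSnapshot>();   // 0 disables detection

        var cutoff = now.AddMinutes(-timeoutMinutes);
        return devices
            .Where(d => tenantBucket.Contains(d.TenantId))
            // A device that never reported counts as stale, same as one past the cutoff.
            .Where(d => d.LastHeartbeatAt is null || d.LastHeartbeatAt < cutoff)
            .Take(maxResults)
            .ToList();
    }
}
```

The `maxResults` parameter plays the role of the batch limit (5000 in the sequence diagram), which bounds the work done in a single cycle.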

A 4-minute hard deadline on a 5-minute cron prevents overlapping runs even under unusual load.

Configuration

Key	Default	Purpose
`IoT:HeartbeatTimeoutMinutes`	`15`	Minutes without heartbeat before flagging offline. Set `0` per tenant to disable
`IoT:HeartbeatOfflineNotificationCacheMinutes`	`60`	Tracker TTL preventing alert spam on flaky links

TelemetryPartitionMaintenanceJob — ahead-of-time partition creation

Schedule: 0 1 * * 0 (Sundays, 01:00 UTC). Purpose: provision next-month and next-next-month partitions so writes never fail at a boundary transition.

The job is a graceful no-op when the parent table is not partitioned — it checks pg_partitioned_table and logs a single warning. This lets non-partitioned deployments keep the job registered without side effects.
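
To make the "next month and the month after" computation concrete, here is a minimal sketch of how the partition bounds and DDL could be derived. The `PartitionPlanner` class, the `MonthPartition` record, and the `<parent>_yyyy_MM` naming scheme are assumptions for illustration; the source does not state the actual partition naming convention.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public sealed record MonthPartition(string Name, DateTime FromInclusive, DateTime ToExclusive);

public static class PartitionPlanner
{
    // Returns the partitions for the next month and the month after it.
    public static IReadOnlyList<MonthPartition> Upcoming(
        DateTime todayUtc, string parentTable = "iot_telemetry_points")
    {
        var firstOfThisMonth = new DateTime(todayUtc.Year, todayUtc.Month, 1, 0, 0, 0, DateTimeKind.Utc);
        var result = new List<MonthPartition>();
        for (var offset = 1; offset <= 2; offset++)
        {
            var from = firstOfThisMonth.AddMonths(offset);   // rolls over year boundaries
            var to = from.AddMonths(1);                      // exclusive upper bound
            // Naming scheme is hypothetical.
            var name = parentTable + "_" + from.ToString("yyyy_MM", CultureInfo.InvariantCulture);
            result.Add(new MonthPartition(name, from, to));
        }
        return result;
    }

    // PostgreSQL range-partition DDL; IF NOT EXISTS keeps the weekly run idempotent.
    public static string ToDdl(MonthPartition p, string parentTable = "iot_telemetry_points") =>
        "CREATE TABLE IF NOT EXISTS " + p.Name + " PARTITION OF " + parentTable +
        " FOR VALUES FROM ('" + p.FromInclusive.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture) +
        "') TO ('" + p.ToExclusive.ToString("yyyy-MM-dd", CultureInfo.InvariantCulture) + "');";
}
```

Because the job runs every Sunday and always looks two months ahead, each partition is attempted several times before it is needed, so one failed run does not leave a gap.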

Enabling partitioning

protected override void Up(MigrationBuilder migrationBuilder)
{
    migrationBuilder.EnableTelemetryPartitioning();
    migrationBuilder.CreateTelemetryPartition(2026, 4);
    migrationBuilder.CreateTelemetryPartition(2026, 5);
    // Subsequent months provisioned automatically by the job
}

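Every partition carries its own BRIN(recorded_at) and GIN(metrics) indexes. Dropping a partition drops its indexes with it, so removing an entire month of telemetry is a single metadata operation rather than a row-by-row delete, which is why GDPR erasure at a month boundary is effectively O(1).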

Configuration

None. The job infers the schema and table name from IoTDbContext.

Registration

The jobs are bundled in Granit.Bundle.IoT:

builder.Services.AddGranit(builder.Configuration).AddIoT();

Or register individually (e.g. when not using the bundle):

builder.Services
    .AddGranit(builder.Configuration)
    .AddModule<GranitIoTBackgroundJobsModule>();

GranitIoTBackgroundJobsModule depends on GranitIoTModule, GranitBackgroundJobsModule, and GranitSettingsModule — DI order is driven by [DependsOn], not by the order of these calls.

Observability

Metric	Tags	Fires when
`granit.iot.background.telemetry_purged`	`tenant_id`	Purge job deleted rows for a tenant
`granit.iot.device.offline_detected`	`tenant_id`	Heartbeat job flagged a device (first detection only)
`granit.iot.background.partition_created`	`partition_name`	Future partition created
`granit.iot.alerts.throttled`	`tenant_id`, `metric_name`	Notification bridge suppressed an alert

Wolverine also emits job-level telemetry (wolverine.job.execution_duration, wolverine.job.failures) — wire both to Grafana for a single pane of glass.
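
For readers unfamiliar with how tagged metrics like these are emitted in .NET, the sketch below uses `System.Diagnostics.Metrics`. It assumes the metric is a counter and uses a placeholder meter name; neither detail is confirmed by the source, and `IoTJobMetricsSketch` is a hypothetical class.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class IoTJobMetricsSketch
{
    // Meter name is a placeholder chosen for this sketch.
    public static readonly Meter Meter = new("Sketch.IoT.BackgroundJobs");

    // Assumption: the metric is a monotonically increasing counter.
    private static readonly Counter<long> OfflineDetected =
        Meter.CreateCounter<long>("granit.iot.device.offline_detected");

    // Records one first-time offline detection, tagged with the owning tenant.
    public static void RecordOfflineDetected(string tenantId) =>
        OfflineDetected.Add(1, new KeyValuePair<string, object?>("tenant_id", tenantId));
}
```

An OpenTelemetry exporter subscribed to the meter can then forward the measurements to the backend that feeds Grafana. Keeping `tenant_id` as the only tag keeps cardinality proportional to tenant count rather than device count.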
