AI/ML model OTA

Deploying a model

Deploying a model uses the same loop as every other artifact: verify, swap atomically, probe, and roll back on failure. A small agent on each device does this work - the platform is not agentless. The zero-integration part is your model and application code, which ride on the signed base without changes.

The on-device sequence

When a device is assigned a new model version, its agent:

1. Fetches the artifact through the device-facing facade.
2. Verifies it against the pinned trust anchor and cross-checks the digest. Unverified content is never applied - there is no insecure path.
3. Swaps the model file atomically, keeping the previous version in place.
4. Probes health after the swap.
5. Rolls back to the previous version on its own if the probe fails.

The previous model is retained precisely so rollback is instant and local - the device does not need to reach the platform to recover.
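
The sequence above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the agent's actual code: the function name `apply_model`, the `.previous` file suffix, and the returned status strings are all hypothetical. The signature check against the pinned trust anchor is left out, so only the digest cross-check is shown.

```python
import hashlib
import os
import shutil
from pathlib import Path
from typing import Callable


def sha256_hex(path: Path) -> str:
    """Stream the file through SHA-256 so large models are not read into memory at once."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def apply_model(staged: Path, active: Path, expected_digest: str,
                probe: Callable[[Path], bool]) -> str:
    """Return 'rejected', 'updated', or 'rolled_back' (hypothetical status names)."""
    # Verify before touching the active model. A real agent would also check
    # a signature against its trust anchor; that step is omitted here.
    if sha256_hex(staged) != expected_digest:
        return "rejected"

    # Keep the previous version on local storage so rollback needs no network.
    previous = active.with_name(active.name + ".previous")  # assumed naming
    had_previous = active.exists()
    if had_previous:
        shutil.copy2(active, previous)

    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(staged, active)

    if probe(active):
        return "updated"

    # Probe failed: restore the retained copy locally.
    if had_previous:
        os.replace(previous, active)
    else:
        active.unlink()
    return "rolled_back"
```

The `probe` argument is any callable that takes the active path and returns a boolean, which keeps the swap logic independent of how health is actually checked.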

Health probes

A rollout is only as safe as its health check. A probe is one of:

exec - run an allowlisted health-check binary on the device; exit zero means healthy. The path to the active artifact is provided to the binary so a check can load and exercise the model. No shell is involved.
HTTP - a GET that must return a success status, for an inference service that exposes one.
file - a readiness file your application writes once the new model is loaded and serving.

You also set a timeout, the number of attempts, and an initial delay so a model that takes time to warm up is not failed prematurely.
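
To show how the three probe types and the timing settings fit together, here is a minimal sketch. Every name in it (`ProbePolicy`, `exec_probe`, `http_probe`, `file_probe`, `run_probe`) is hypothetical. It also makes two assumptions the text does not state: the artifact path is passed to the health-check binary as its first argument, and there is a fixed pause between attempts.

```python
import subprocess
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass
class ProbePolicy:
    timeout_s: float = 5.0          # per-attempt timeout
    attempts: int = 3               # total tries before the probe fails
    initial_delay_s: float = 0.0    # warm-up time before the first try
    retry_interval_s: float = 1.0   # assumption: pause between tries


def exec_probe(binary: str, artifact: Path, allowlist: set[str], timeout_s: float) -> bool:
    """Run an allowlisted binary with the artifact path; exit code 0 means healthy."""
    if binary not in allowlist:
        return False
    try:
        # An argument list with the default shell=False means no shell is involved.
        result = subprocess.run([binary, str(artifact)], timeout=timeout_s, check=False)
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def http_probe(url: str, timeout_s: float) -> bool:
    """GET the URL; any 2xx status counts as healthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


def file_probe(readiness_file: Path) -> bool:
    """Healthy once the application has written its readiness file."""
    return readiness_file.is_file()


def run_probe(check: Callable[[], bool], policy: ProbePolicy,
              sleep: Callable[[float], None] = time.sleep) -> bool:
    """Wait out the initial delay, then try up to `attempts` times."""
    sleep(policy.initial_delay_s)
    for attempt in range(policy.attempts):
        if check():
            return True
        if attempt + 1 < policy.attempts:
            sleep(policy.retry_interval_s)
    return False
```

For a model that needs time to warm up, a call might look like `run_probe(lambda: file_probe(Path("/run/app/ready")), ProbePolicy(initial_delay_s=10, attempts=5))`, where the readiness path is a made-up example.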

Targeting and strategy

A rollout targets a group of devices and runs through waves expressed as cumulative percentages - for example 10%, then 50%, then 100%. Each wave can advance automatically once the prior waves are healthy, or wait for a manual approval gate. A halt rule sets how many device failures are tolerated before the rollout pauses itself fleet-wide; the default halts on the first failure.

rollout - yolo-v8n@2.1.0 → plant-7
  wave 1/3  10%   canary, manual gate
  wave 2/3  50%   auto when wave 1 healthy
  wave 3/3 100%   auto
  halt rule: pause after 1 failed device
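
Because the percentages are cumulative, each wave contains only the devices not already covered by earlier waves. The sketch below works through that arithmetic and the halt rule. The function names and the round-up behaviour are assumptions made for illustration; the platform's own rounding and device ordering may differ.

```python
def plan_waves(devices: list[str], cumulative_pcts: list[int]) -> list[list[str]]:
    """Split devices into waves from cumulative percentages such as [10, 50, 100]."""
    if not cumulative_pcts or cumulative_pcts[-1] != 100:
        raise ValueError("the last wave must reach 100%")
    if any(b <= a for a, b in zip(cumulative_pcts, cumulative_pcts[1:])):
        raise ValueError("cumulative percentages must strictly increase")

    waves: list[list[str]] = []
    done = 0
    for pct in cumulative_pcts:
        # Integer ceiling of len(devices) * pct / 100 (rounding up is an assumption).
        target = (len(devices) * pct + 99) // 100
        waves.append(devices[done:target])
        done = target
    return waves


def should_halt(failed_devices: int, pause_after: int = 1) -> bool:
    """Pause once failures reach the threshold; the default pauses on the first failure."""
    return failed_devices >= pause_after


devices = [f"dev-{i:02d}" for i in range(20)]
waves = plan_waves(devices, [10, 50, 100])
print([len(w) for w in waves])  # [2, 8, 10]
```

With 20 devices, the 10% canary wave covers 2 devices, the 50% wave adds 8 more to reach 10, and the final wave covers the remaining 10.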

When something is wrong

If a model regresses on a device, that device rolls back by itself and reports the outcome. If failures cross the halt threshold, the rollout pauses before the bad version spreads. Operators can acknowledge failures and resume, or abort - choosing whether to keep what is already installed or revert affected devices to the previous version. A device that is unreachable never blocks the rest of the fleet.

Every step - published, approved, updated, rolled back - is written to the append-only audit trail your compliance evidence reads from.
