Rollback safety

Rollback is a feature, not error handling. Every update the agent applies is gated by a health check, and the previous good version is kept recoverable the whole time. If the new version does not come up healthy, the agent restores the old one on the device itself — no operator action, no return trip to the platform required.

This is one reason each device runs a small agent rather than no agent at all. The agent verifies the artifact, swaps it in atomically, runs the health probe, and reverts if the probe fails. Your application and model code ride on top of this base unchanged; the safety machinery is ours, not something you integrate.
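The verify, swap, probe, revert sequence can be sketched roughly as follows. This is an illustration only: `apply_update`, its four callables, and the `"retry"` label are hypothetical names chosen for the example, not the agent's actual interface.

```python
from typing import Callable


def apply_update(
    verify: Callable[[], bool],
    swap_in: Callable[[], None],
    probe: Callable[[], bool],
    revert: Callable[[], None],
) -> str:
    """Run one update attempt and return an outcome label (hypothetical)."""
    if not verify():
        # Nothing has been swapped in yet, so there is nothing to undo.
        return "retry"
    swap_in()          # activate the new version
    if probe():
        return "healthy"
    revert()           # restore the previous good version on the device
    return "rolled_back"
```

The point of the shape is that `revert` runs locally and unconditionally when the probe fails, so no operator or platform round trip sits on the recovery path.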

The update lifecycle

Each artifact assignment moves through a fixed set of states, which the agent reports back as it goes:

State        Meaning
downloading  Fetching the artifact through the update-metadata client.
verifying    Checking signatures and cross-checking the expected digest.
applying     Swapping the new version into place.
healthy      The health probe passed; the new version is live.
rolled_back  The probe failed; the previous version was restored.
failed       The update could not be applied (and, where relevant, could not be reverted).

Only healthy, rolled_back, and failed are terminal. Download and verification problems are treated as transient: the agent retries on a later heartbeat instead of ending the rollout with a terminal failure.
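One way to model these states in code is a small enum plus a terminal-state check. The state names come from the lifecycle above; the helper names (`is_terminal`, `should_retry_on_next_heartbeat`) are assumptions made for this sketch.

```python
from enum import Enum


class State(Enum):
    DOWNLOADING = "downloading"
    VERIFYING = "verifying"
    APPLYING = "applying"
    HEALTHY = "healthy"
    ROLLED_BACK = "rolled_back"
    FAILED = "failed"


# Only these three states end an assignment.
TERMINAL = frozenset({State.HEALTHY, State.ROLLED_BACK, State.FAILED})


def is_terminal(state: State) -> bool:
    return state in TERMINAL


def should_retry_on_next_heartbeat(state: State, step_ok: bool) -> bool:
    """Download and verification problems are transient: try again later."""
    return (not step_ok) and state in {State.DOWNLOADING, State.VERIFYING}
```

A consumer of agent reports could use `is_terminal` to decide when to stop waiting on a device.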

How "the previous version" is kept

The mechanism depends on the artifact kind, but the principle is the same: the old version stays intact until the new one has proven itself.

  • Files (models, raw artifacts). The new version is staged alongside the current one, then activated by an atomic swap. The previously active version is retained, so reverting is just swapping back.
  • Containers. The image is content-addressed by digest. The agent records the previously running digest before starting the new one; a failed probe restarts the previous digest.
  • Config. A delivered config file is placed at its destination with the prior contents retained. If the probe fails, the previous file is restored and its reload is re-run.

In every case, if there is no previous version (a first install) and the probe fails, the agent does not leave a half-applied artifact in place: it deactivates or removes what it just installed and reports failed.
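For the file case, a minimal sketch of "stage alongside, swap atomically, swap back on failure" is shown below. It assumes a POSIX filesystem and a hypothetical layout in which each version lives in its own directory under `root` and a `current` symlink marks the active one. The real agent's layout and mechanism may differ.

```python
import os
import shutil
from pathlib import Path
from typing import Callable, Optional


def activate(root: Path, version: str) -> Optional[str]:
    """Point root/current at root/<version>; return the previously active version."""
    current = root / "current"
    previous = os.readlink(current) if current.is_symlink() else None
    tmp = root / "current.tmp"
    if tmp.is_symlink():
        tmp.unlink()              # leftover from an interrupted swap
    os.symlink(version, tmp)      # relative link to the staged directory
    os.replace(tmp, current)      # rename is atomic on POSIX filesystems
    return previous


def install(root: Path, version: str, probe: Callable[[], bool]) -> str:
    previous = activate(root, version)
    if probe():
        return "healthy"
    if previous is not None:
        activate(root, previous)  # reverting is just swapping back
        return "rolled_back"
    # First install: nothing to return to, so remove what was just installed.
    (root / "current").unlink()
    shutil.rmtree(root / version)
    return "failed"
```

Because the old version's directory is never modified during the update, the revert path is the same cheap operation as the activation path.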

Atomic and crash-safe

The swap into place is atomic, so a device never runs a torn or half-written artifact. Terminal outcomes are also persisted on the device before they are reported, so a reboot or a redelivered assignment mid-update does not make the agent repeat work or lose its place: re-applying a version it has already settled is a no-op.
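The "persist first, then report" ordering can be sketched as below. The JSON outcome file, the function names, and the choice to re-send the stored outcome on redelivery are all assumptions made for illustration, not a description of the agent's on-disk format.

```python
import json
import os
from pathlib import Path
from typing import Callable, Optional


def load_outcome(path: Path, assignment_id: str) -> Optional[str]:
    if not path.exists():
        return None
    return json.loads(path.read_text()).get(assignment_id)


def record_outcome(path: Path, assignment_id: str, outcome: str) -> None:
    data = json.loads(path.read_text()) if path.exists() else {}
    data[assignment_id] = outcome
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "w") as f:
        json.dump(data, f)
        f.flush()
        os.fsync(f.fileno())   # flush to disk before the rename
    os.replace(tmp, path)      # readers see the old file or the new one, never a partial one


def handle(
    path: Path,
    assignment_id: str,
    apply: Callable[[], str],
    report: Callable[[str, str], None],
) -> str:
    settled = load_outcome(path, assignment_id)
    if settled is None:
        settled = apply()                             # do the work once
        record_outcome(path, assignment_id, settled)  # persist first
    report(assignment_id, settled)                    # then report
    return settled
```

If the device reboots after `record_outcome` but before `report`, the next delivery of the same assignment finds the stored outcome and skips `apply` entirely.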