Halt rules & abort policy
Rollback safety is a feature, not error handling. A rollout watches its own health as it runs and stops the bleeding the moment failures cross a threshold — without waiting for an operator. An operator can also stop a rollout deliberately, and choose whether updated devices stay where they are or return to their previous version.
The halt rule
A strategy carries a tolerated failure rate, max_failure_rate, expressed as a
fraction in the range [0, 1). As devices report, the rollout tracks the share of
targeted devices that ended in failed or rolled_back. If that share exceeds the
tolerance, the rollout halts fleet-wide — even mid-wave — and waits for an
operator. The default tolerance is 0, which means the first failure halts the
rollout.
```json
{
  "waves": [{ "percent": 10 }, { "percent": 100 }],
  "max_failure_rate": 0.05
}
```
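The halt check itself is small enough to sketch. The following is a minimal illustration of the rule described above, not the product's implementation: the `Strategy` class, the `should_halt` function, and the list-of-states input shape are all hypothetical names chosen for the example. Only `max_failure_rate`, its `[0, 1)` range, its default of 0, and the `failed` and `rolled_back` states come from the text.

```python
from dataclasses import dataclass

# States that count against the tolerance, per the halt rule above.
FAILURE_STATES = {"failed", "rolled_back"}


@dataclass
class Strategy:
    # Tolerated failure share in [0, 1); the default 0 halts on the first failure.
    max_failure_rate: float = 0.0

    def __post_init__(self) -> None:
        if not 0 <= self.max_failure_rate < 1:
            raise ValueError("max_failure_rate must be in [0, 1)")


def should_halt(strategy: Strategy, device_states: list[str]) -> bool:
    """Return True when the failure share exceeds the tolerance.

    device_states holds one state per targeted device (hypothetical shape).
    """
    if not device_states:
        return False
    failures = sum(1 for s in device_states if s in FAILURE_STATES)
    # Strictly greater than: a share equal to the tolerance does not halt.
    return failures / len(device_states) > strategy.max_failure_rate
```

With 100 targeted devices and a tolerance of 0.05, five failures sit exactly at the tolerance and do not halt; the sixth does. With the default tolerance of 0, a single failure is enough.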
A halt freezes progression: no further waves start and no new assignments are served while the rollout is halted. The halt is recorded as an audit event with the wave, the failure count and the tolerance that was crossed, so the reason is always on the record.
Resuming a halted rollout
An operator with approval rights can resume a halted rollout. Resuming acknowledges the failures that tripped the halt: only failures beyond the count already seen can trip the rule again. This lets you investigate, accept a known set of failures, and continue — without the same failures immediately re-halting the rollout. Resuming re-evaluates progress straight away, so a wave that is already complete advances at once.
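One way to picture the acknowledgement is as a baseline that resume moves forward. The sketch below is one possible reading of the behaviour above, under a stated assumption: after a resume, only failures beyond the acknowledged count are measured against the tolerance. The `HaltTracker` class and its field names are hypothetical, and wave advancement is left out.

```python
from dataclasses import dataclass


@dataclass
class HaltTracker:
    max_failure_rate: float = 0.0
    targeted: int = 0        # devices targeted by the rollout
    failures: int = 0        # devices in failed or rolled_back so far
    acknowledged: int = 0    # failures accepted by an earlier resume
    halted: bool = False

    def evaluate(self) -> bool:
        """Halt when unacknowledged failures exceed the tolerance."""
        if self.targeted == 0:
            return self.halted
        # Assumption: acknowledged failures no longer count toward the share.
        unacknowledged = self.failures - self.acknowledged
        if unacknowledged / self.targeted > self.max_failure_rate:
            self.halted = True
        return self.halted

    def record_failure(self) -> bool:
        """Count one more failed device and re-check the rule."""
        self.failures += 1
        return self.evaluate()

    def resume(self) -> bool:
        """Acknowledge every failure seen so far, then re-evaluate at once."""
        self.acknowledged = self.failures
        self.halted = False
        return self.evaluate()
```

In this sketch, a rollout of 100 devices with a tolerance of 0.05 halts on its sixth failure. After `resume()`, those six are the baseline, so the rollout halts again only once a further six failures arrive.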
Aborting a rollout
An operator can abort a running or halted rollout to stop it for good. Abort takes a policy that decides what happens to devices that already took the update:
| Policy | Effect |
|---|---|
| keep (default) | Every device stays on whatever artifact it reached. The rollout simply stops. |
| revert | Devices that already applied the update are returned to their previous artifact. |
Under revert, each updated device is pinned to its prior version and served a
reversal assignment on its next heartbeat — the device performs the swap itself,
exactly as it would for any other update. A device with no known prior version is
marked failed with a clear reason rather than left in limbo. Unreachable devices
simply stay in the reverting state until they come back; abort never blocks on a
device that is offline. As reverting devices return, a healthy re-apply of the
previous version is recorded as reverted; a probe failure during the revert is
recorded as failed.
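The revert flow above can be sketched as two small steps: the abort marks devices, and each device's later report settles its final state. This is an illustrative model only. The `Device` class, its fields, the `abort` and `on_revert_report` functions, and the initial `updated` state are hypothetical; the `keep` and `revert` policies and the `reverting`, `reverted`, and `failed` states come from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    device_id: str
    applied_update: bool
    previous_version: Optional[str]
    state: str = "updated"              # hypothetical starting state
    pinned_version: Optional[str] = None
    reason: str = ""


def abort(devices: list[Device], policy: str = "keep") -> int:
    """Apply the abort policy; return how many devices were set to revert."""
    if policy not in ("keep", "revert"):
        raise ValueError(f"unknown abort policy: {policy}")
    if policy == "keep":
        return 0  # every device stays on the artifact it reached
    reverting = 0
    for d in devices:
        if not d.applied_update:
            continue  # nothing to undo on this device
        if d.previous_version is None:
            # No known prior version: fail with a reason, never leave in limbo.
            d.state, d.reason = "failed", "no known prior version"
            continue
        # Pin to the prior version; the device swaps on its next heartbeat.
        d.pinned_version = d.previous_version
        d.state = "reverting"
        reverting += 1
    return reverting


def on_revert_report(device: Device, probe_healthy: bool) -> None:
    """Record the outcome when a reverting device reports back."""
    if device.state != "reverting":
        return
    if probe_healthy:
        device.state = "reverted"
    else:
        device.state, device.reason = "failed", "probe failure during revert"
```

Note that `abort` never waits on a device: an offline device simply keeps the `reverting` state until `on_revert_report` is eventually called for it. The returned count corresponds to the number of devices set to revert, which the audit log entry records.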
The abort and its policy are written to the audit log, including how many devices were set to revert.