Runbooks

Short, repeatable procedures for the situations operators meet most often. Each one assumes you are logged into the console and have access to the host. For the design behind these flows, see How it works.

Halt or abort a rollout

A rollout advances through waves, watching device health as it goes. If your own monitoring tells you something is wrong before the platform's halt rules fire, intervene directly.

Halt stops a rollout from progressing to further devices while leaving already-updated devices in place. Use it to freeze the situation while you investigate.
Abort ends the rollout. When you abort, you choose what happens to the devices that already took the update:
- keep leaves them on the new artifact;
- revert instructs each updated device to re-apply its previous artifact. The revert is applied per device and never blocks on unreachable devices: an offline device reverts when it next checks in.
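The keep and revert choices can be sketched as a small model. This is an illustration only, not the platform's implementation: the Device fields, abort_rollout, and check_in are hypothetical names, and the sketch assumes a revert is recorded as a pending flag that the device acts on at its next check-in.

```python
from dataclasses import dataclass


@dataclass
class Device:
    # Hypothetical device record; field names are assumptions for this sketch.
    name: str
    current: str            # artifact the device runs now
    previous: str           # artifact it ran before the update
    online: bool = True
    pending_revert: bool = False


def check_in(device: Device) -> None:
    """Apply a pending revert when the device contacts the platform."""
    if device.pending_revert:
        device.current = device.previous
        device.pending_revert = False


def abort_rollout(devices: list, new_artifact: str, mode: str) -> None:
    """Abort a rollout; mode is 'keep' or 'revert'."""
    if mode not in ("keep", "revert"):
        raise ValueError(f"unknown abort mode: {mode}")
    if mode == "keep":
        return  # updated devices stay on the new artifact
    for device in devices:
        if device.current != new_artifact:
            continue  # this device never took the update
        device.pending_revert = True
        if device.online:
            check_in(device)  # reachable devices revert right away
        # Offline devices are not waited on; they revert at next check-in.
```

Because the revert is only flagged for an offline device, the abort returns without waiting; calling check_in on that device later completes its revert.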

A rollout can also halt itself automatically when the failure rate across a wave crosses its configured threshold. Once you have a fix, resume a halted rollout to let the remaining waves proceed.
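As a rough sketch of the automatic halt and the resume that follows it, the model below advances wave by wave and halts once a wave's failure rate exceeds the threshold. The Rollout class, its fields, and the update_device callback are hypothetical names. Two details are assumptions: "crosses" is read as strictly greater than the threshold, and resume continues with the waves after the one that triggered the halt.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    # Hypothetical model; names and semantics are assumptions for this sketch.
    waves: list              # each wave is a list of device names
    threshold: float         # tolerated failure rate per wave, 0.0 to 1.0
    next_wave: int = 0
    state: str = "running"   # running | halted | completed

    def advance(self, update_device) -> None:
        """Run waves in order; update_device(name) returns True if healthy."""
        while self.state == "running" and self.next_wave < len(self.waves):
            wave = self.waves[self.next_wave]
            results = [update_device(name) for name in wave]
            self.next_wave += 1
            failures = results.count(False)
            if wave and failures / len(wave) > self.threshold:
                self.state = "halted"  # automatic halt: stop before the next wave
        if self.state == "running":
            self.state = "completed"

    def halt(self) -> None:
        """Operator-initiated halt; already-updated devices are left in place."""
        if self.state == "running":
            self.state = "halted"

    def resume(self, update_device) -> None:
        """Let the remaining waves proceed after a halt."""
        if self.state != "halted":
            raise ValueError("only a halted rollout can be resumed")
        self.state = "running"
        self.advance(update_device)
```

In this model a halt, whether automatic or manual, leaves already-updated devices alone; only progression to later waves stops.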

Rotate console credentials

Administrative credentials and platform secrets are rotated by updating the node's configuration and recreating the services so they pick up the change. Re-enter any outbound-integration secrets afterward. Credential changes are administrative actions of the kind the audit trail records.

Recover or re-enroll a device

A device that is wiped, replaced, or has lost its identity is recovered by enrolling it again — there is no special repair path, because re-enrollment is already safe.

1. In the console, open Devices → Add device and mint a fresh enrollment token.
2. Run the printed installer command on the device.
3. The device generates a new private key locally, obtains a freshly signed certificate, pulls the current trust anchor, and resumes heartbeating.

The previous certificate becomes irrelevant; the private key never leaves the device at any point. See Enroll a device and Device identity.

Force a device back to a known-good state

If a device is on a bad artifact and you do not want to wait for a rollout, abort the rollout with revert, or target the device with a rollout of the last-known-good bundle. Either way the agent verifies the signature before it applies anything and runs the bundle's health probe afterward — the recovery path is itself a signed, checked update, not an out-of-band override.
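The order of operations on the agent can be sketched as follows. This is a minimal illustration, not the agent's actual code: apply_signed_bundle and the three callbacks are hypothetical, and the returned status strings are invented for the example. The point it shows is that nothing is applied unless the signature check passes, and that the health probe runs only after the apply.

```python
def apply_signed_bundle(bundle, verify_signature, apply_bundle, run_health_probe):
    """Verify first, then apply, then probe.

    All three callbacks are placeholders supplied by the caller:
      verify_signature(bundle) -> bool
      apply_bundle(bundle)     -> None
      run_health_probe(bundle) -> bool
    """
    if not verify_signature(bundle):
        return "rejected"  # an unverified bundle is never applied
    apply_bundle(bundle)
    return "healthy" if run_health_probe(bundle) else "unhealthy"
```

A recovery to the last-known-good bundle would go through the same function as any other update, which is what makes it a checked update and not an override.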

Re-key the update repository for production