Operations & runbooks

Troubleshooting

Most problems fall into one of four buckets: a device cannot enroll, a device cannot stay connected, an update will not verify, or a rollout is not making progress. Work from the symptom down. The control-plane services and the device agent both emit logs, and the agent runs as a managed service unit, so its logs are available with the usual journal tooling on the device.

A device will not enroll

Symptom	Likely cause	What to check
Installer fails before the device appears	Token spent or expired	Mint a fresh enrollment token. A token authorizes a single enrollment and is consumed even if signing fails.
TLS handshake fails	Address mismatch	The host the device connects to must be present in the server certificate's subject alternative names. If you changed the node's address, regenerate certificates for the new name.
Installer cannot download the agent	Web entrypoint or network	Confirm the console host is reachable from the device and serving the installer.

A device enrolls but goes offline

A device shows online while it is heartbeating and goes stale when heartbeats stop. Check, in order:

Connectivity to the device endpoints. The device gRPC endpoint and the update repository are mutual-TLS only. If a firewall or NAT sits between the device and the node, both device-facing ports must be reachable.
The device's certificate and trust anchor. A device authenticates with its own certificate and verifies the platform's. If either has been lost or the trust anchor no longer matches, re-enroll the device.
Clock skew. Mutual TLS and signed, time-bounded update metadata both depend on a roughly correct clock. A badly wrong device clock will reject otherwise valid certificates and metadata.

An update will not verify or apply

This is verification doing its job, not a bug to route around — there is no bypass flag, in development or production.

Signature or metadata rejected. The agent checks every artifact against the pinned trust anchor before applying it. If the repository was re-keyed but the device still pins the old root, the device correctly refuses the update; re-enroll it so it picks up the current anchor. See TUF.
Health probe fails after the swap. Each update carries a health check — an exec, HTTP, or file probe. When the probe fails, the agent automatically rolls back to the previous artifact. A device that keeps reporting a failed apply usually has a failing probe; inspect the probe definition and the device's logs.
Exec probe never runs. Exec probes run only operator-allowlisted binaries with no shell. If the allowlist on the device is empty, exec probes are refused by design — add the intended binary to the device's allowlist.

A rollout is stuck

Symptom	Likely cause
Rollout halted on its own	The wave's failure rate crossed its threshold. Investigate the failing devices, then resume once fixed.
No further devices updating	Manual gate awaiting approval, or a maintenance window not yet open.
A reverted device still shows reverting	The device is offline and will complete its reversal when it next checks in — reversal never blocks on unreachable devices.

When you need to roll back the platform itself