Shadow testing
The safest time to learn that a model is bad is before it touches production. Two ideas serve that goal: pre-flighting a candidate against representative devices, and running a candidate alongside production on real inputs without affecting the live path.
The principle
A candidate model — and the bundle it ships in — should pass on hardware that looks like your fleet before any production device receives it. Because a model artifact carries its target hardware profile and framework, the platform can match a candidate to the right device profiles and exercise it there first, rather than discovering a mismatch as a failed swap in the field.
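As a rough illustration of that matching step, the sketch below filters device profiles by the hardware profile and framework recorded on an artifact. The class names, field names, and example values are assumptions made for this example, not the platform's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ModelArtifact:
    # Hypothetical fields: the text only says an artifact carries
    # its target hardware profile and framework.
    name: str
    hardware_profile: str
    framework: str


@dataclass(frozen=True)
class DeviceProfile:
    # Hypothetical description of one class of device in the fleet.
    profile_id: str
    hardware_profile: str
    frameworks: frozenset[str]


def matching_profiles(
    artifact: ModelArtifact, profiles: list[DeviceProfile]
) -> list[DeviceProfile]:
    """Return the profiles a candidate should be exercised on first."""
    return [
        p
        for p in profiles
        if p.hardware_profile == artifact.hardware_profile
        and artifact.framework in p.frameworks
    ]
```

A profile that shares the hardware but lacks the framework is excluded, which is the kind of mismatch that would otherwise surface as a failed swap in the field.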
A shadow evaluation goes one step further: the candidate runs next to the production model, sees the same inputs, and has its outputs compared against production's, while the production path keeps serving. Nothing the candidate does reaches the workload until you promote it.
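The shape of a shadow evaluation can be sketched in a few lines. This is a minimal illustration under assumptions, not the platform's runner: `serve_with_shadow`, `ShadowReport`, and the `agree` comparison callback are hypothetical names, and the models are plain callables.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable


@dataclass
class ShadowReport:
    # Counts only: the report holds no inputs or outputs.
    total: int = 0
    mismatches: int = 0
    candidate_errors: int = 0


def serve_with_shadow(
    production: Callable[[Any], Any],
    candidate: Callable[[Any], Any],
    inputs: Iterable[Any],
    agree: Callable[[Any, Any], bool],
) -> tuple[list[Any], ShadowReport]:
    """Serve production outputs while scoring the candidate on the side."""
    served = []
    report = ShadowReport()
    for x in inputs:
        live = production(x)
        served.append(live)  # only the production output is ever served
        report.total += 1
        try:
            shadow = candidate(x)
        except Exception:
            # A crashing candidate is recorded, never propagated.
            report.candidate_errors += 1
            continue
        if not agree(live, shadow):
            report.mismatches += 1
    return served, report
```

The candidate's result is used only for comparison, and its exceptions are caught, so the served outputs are identical to what production alone would have returned.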
What protects you today
Even before a dedicated shadow runner exists, the deployment loop is built so a candidate cannot quietly harm a fleet:
- A canary wave exposes the candidate to a small cohort first.
- A health probe runs after the swap and triggers automatic rollback on failure.
- A halt rule pauses the rollout fleet-wide if failures cross your threshold.
- The previous model is retained on-device for instant local rollback.
In practice you get much of the value of shadow testing — catch a regression on a few devices, not all of them — through canary plus probes plus halt.
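To make the interplay of those four safeguards concrete, here is a simplified sketch of a two-wave rollout. All names (`Device`, `swap_with_probe`, `rollout`, `halt_failure_rate`) are hypothetical, and the halt rule is assumed to be a failure-rate threshold checked after each wave; the real deployment loop may evaluate it differently.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class Device:
    device_id: str
    active: str                  # model currently serving
    previous: str | None = None  # model retained for local rollback


def swap_with_probe(
    device: Device, candidate: str, probe: Callable[[Device], bool]
) -> bool:
    """Swap in the candidate, probe it, and roll back locally on failure."""
    device.previous, device.active = device.active, candidate
    if probe(device):
        return True
    # Probe failed: restore the retained model.
    device.active, device.previous = device.previous, None
    return False


def rollout(
    devices: list[Device],
    candidate: str,
    probe: Callable[[Device], bool],
    canary_size: int,
    halt_failure_rate: float,
) -> str:
    """Run a canary wave, then the rest, halting if failures cross the threshold."""
    waves = [devices[:canary_size], devices[canary_size:]]
    attempted = failed = 0
    for wave in waves:
        for device in wave:
            attempted += 1
            if not swap_with_probe(device, candidate, probe):
                failed += 1
        if attempted and failed / attempted > halt_failure_rate:
            return "halted"  # later waves never receive the candidate
    return "completed"
```

With a canary of two devices and one failed probe, the failure rate is 0.5; any threshold below that halts the rollout before the remaining devices are touched.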
On the roadmap
A first-class pre-flight and shadow-testing capability is a planned direction:
- Emulated pre-flight. Spin up a set of emulated device profiles, apply the candidate bundle, and run smoke checks plus your own defined checks as a gate on the rollout — so a bundle proves itself before any real device sees it.
- Lab cohort. Register physical lab devices as a cohort that pre-flight targets first, so candidates meet real hardware before the fleet does.
- On-device shadow evaluation. Run the candidate model side by side with production against recorded or live inputs, compare outputs, and never touch the production path.
Throughout, the platform's hard line holds: it works with metadata and your declared checks, never the contents of the data your application processes.
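Since the pre-flight gate is still a planned capability, the following is only a guess at its shape: run every declared check against every emulated profile and block the rollout on any failure. `preflight_gate`, the metadata keys, and the profile names are invented for illustration, and the emulation step itself is out of scope here. In keeping with the rule above, checks receive bundle metadata only.

```python
from typing import Callable, Mapping

# A check sees metadata about the bundle and the profile, nothing else.
Check = Callable[[Mapping[str, str]], bool]


def preflight_gate(
    bundle_metadata: Mapping[str, str],
    emulated_profiles: list[str],
    checks: Mapping[str, Check],
) -> tuple[bool, list[tuple[str, str]]]:
    """Return (passed, failures), where failures are (profile, check) pairs."""
    failures = []
    for profile in emulated_profiles:
        context = {**bundle_metadata, "profile": profile}
        for name, check in checks.items():
            if not check(context):
                failures.append((profile, name))
    return (not failures), failures
```

Returning the failing (profile, check) pairs rather than a bare boolean lets a rollout report say which profile rejected the bundle and why.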