Driftwatch

Specification discovery for agentic AI: reproducible behavioural measurement for drift, instability, and premature convergence.

Research question

Can we build reproducible measurements that detect safety-relevant drift, instability, and premature convergence in agentic AI systems before visible deployment failure?

AI safety evaluation still largely scores final outputs: whether a model passes a benchmark, completes a task, or produces a harmful response. But agentic failures often emerge through process-level dynamics that final-output scoring cannot see. A system may reach an acceptable answer while becoming brittle in how it got there.

Driftwatch treats behavioural variance as a safety-relevant signal in its own right. Its core capability is specification discovery: identifying, through reproducible measurement, which behavioural properties are stable enough to specify, bound, or formalise, and which are too unstable to support meaningful guarantees.
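As a rough illustration of treating variance as a signal, the sketch below scores one probe by how often repeated runs disagree with the most common outcome. It is not the Driftwatch scoring method: the function names, the outcome labels, and the 0.1 threshold are all hypothetical, chosen only to make the idea concrete.

```python
from collections import Counter


def instability(labels):
    """Fraction of repeated runs that disagree with the most common label.

    `labels` holds one categorical outcome per repeated run of the same
    probe (for example "ask" or "answer"). Hypothetical scoring rule.
    """
    if not labels:
        raise ValueError("need at least one run")
    modal_count = Counter(labels).most_common(1)[0][1]
    return 1.0 - modal_count / len(labels)


def specification_ready(labels, threshold=0.1):
    """Treat a behaviour as stable enough to specify when its
    instability is at or below an (assumed) threshold."""
    return instability(labels) <= threshold


runs = ["ask", "ask", "answer", "ask"]
print(instability(runs), specification_ready(runs))
```

A disagreement rate is only one possible choice. A real harness would likely need measures that also capture drift across model versions and across turns, which a single per-probe score does not.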

Why this is upstream of formal assurance

Before agentic AI systems can be formally specified or verified, we need to know whether their relevant behaviours are stable, inspectable, and specification-ready. Formal assurance requires well-defined properties to verify. For agentic AI, those properties are often themselves unstable: behaviour drifts across model versions, collapses prematurely under uncertainty, or loses instruction integrity under pressure. If the behavioural object to be specified is poorly characterised, formal guarantees built on top of it are fragile.

Driftwatch contributes the empirical groundwork. It complements rather than competes with formal methods.

Outputs to date

Driftwatch is the core measurement framework. Each baseline it produces is built from structured probe suites and a measurement harness that generates deterministic, auditable run-envelope artefacts, comparison reports, and instability scores. Two baselines exist so far.
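One common way to make a run artefact deterministic and auditable is to serialise it canonically and attach a content hash. The sketch below shows that general technique under stated assumptions: the envelope layout and the field names (`model_id`, `suite_id`, `seed`, `transcript`) are hypothetical and are not taken from the Driftwatch schema.

```python
import hashlib
import json


def canonical_bytes(payload):
    """Serialise with sorted keys and fixed separators so the same
    content always yields the same bytes."""
    text = json.dumps(payload, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=True)
    return text.encode("utf-8")


def make_envelope(model_id, suite_id, seed, transcript):
    """Wrap one run in a hypothetical envelope with a SHA-256 digest."""
    body = {
        "model_id": model_id,      # hypothetical field
        "suite_id": suite_id,      # hypothetical field
        "seed": seed,              # hypothetical field
        "transcript": transcript,  # hypothetical field
    }
    digest = hashlib.sha256(canonical_bytes(body)).hexdigest()
    return {"body": body, "sha256": digest}


def verify_envelope(envelope):
    """Recompute the digest and compare it with the stored one."""
    expected = hashlib.sha256(canonical_bytes(envelope["body"])).hexdigest()
    return expected == envelope["sha256"]
```

Under this scheme, two runs that record identical content produce identical digests, and any later edit to the body makes verification fail. That is the property an auditor would rely on when checking a provenance chain.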

Epistemic baseline, open-weight models. Measures whether a model asks when it should, where it grows overconfident, and where it converges prematurely under uncertainty, across open-weight model families. Representative metrics:

NADR: Needs-Ask Detection Rate
ORR: Overconfident Response Rate
SCR: Safety Compliance Rate
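The page names these metrics but does not define them, so the sketch below assumes the most literal reading of each name: NADR as the share of needs-ask probes where the model actually asked, ORR as the share of all responses scored overconfident, and SCR as the share of safety probes where the model complied. The `ProbeResult` record and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeResult:
    # All field names are hypothetical, for illustration only.
    needs_ask: bool         # asking is the correct move on this probe
    asked: bool             # the model asked a clarifying question
    overconfident: bool     # the response was scored as overconfident
    safety_probe: bool      # the probe carries a safety expectation
    safety_compliant: bool  # the model met that expectation


def rate(hits, total):
    """Return hits / total, or NaN when there is nothing to count."""
    return hits / total if total else float("nan")


def summarise(results):
    """Compute the three rates under the assumed definitions above."""
    needs_ask = [r for r in results if r.needs_ask]
    safety = [r for r in results if r.safety_probe]
    return {
        "NADR": rate(sum(r.asked for r in needs_ask), len(needs_ask)),
        "ORR": rate(sum(r.overconfident for r in results), len(results)),
        "SCR": rate(sum(r.safety_compliant for r in safety), len(safety)),
    }
```

Each rate uses its own denominator, so a suite with few needs-ask or safety probes yields a noisier NADR or SCR than ORR. Reporting the denominators alongside the rates would make that visible.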

Capture-risk baseline, frontier families. Tests whether human-facing assistants display dependency-capture behaviours across multi-turn interactions, covering eight models from the GPT, Claude, and Gemini families. Published in full, with run-envelope artefacts, scored workbooks, comparison reports, and provenance chains.

Capture-Risk Suite v0.2 is public. Across eight models from three vendors, two capture-risk failures were near-universal: every model fed a compulsive checking loop rather than interrupting it, and seven of eight accepted a sole-support role in a crisis instead of routing toward real-world help.

Report and data: Zenodo. Code and harness: GitHub.

Open research artefacts. Driftwatch research outputs (probe suites, schemas, measurement methods, technical reports) are designed to be open, inspectable, citable, and reusable independently of any commercial product or hosted service.

Status

The capture-risk suite is public. Further artefact packs will follow as the programme matures. Driftwatch is open to collaboration, replication, and critique.
