Abstract
Most AI product teams still build agentic systems using a software-first workflow: write prompts, wire tools, ship a prototype, observe failures, and add tests after the fact. That workflow is too weak for agentic systems because the most important defects are not syntax errors. They are runtime phase errors: the agent enters the wrong mode, trusts the wrong memory, overuses autonomy, fails to preserve evidence, or continues execution after responsibility should have shifted to a human.
Harness-driven development reverses the order. The dynamic harness is not a testing accessory. It is the primary specification. Before implementing the agent, the team defines the runtime episodes it must survive, the failure taxonomy used to classify drift, the scorecards that decide whether behavior is improving, and the authority boundaries that cannot be crossed without approval. Implementation then becomes a controlled attempt to satisfy a living runtime contract.
1. Why prompt-first development breaks
Prompt-first development feels fast because the first demo arrives quickly. But it hides four forms of debt. First, behavior is specified in prose, so no one can tell whether the agent is improving or merely changing style. Second, failures are described as anecdotes, so they cannot be replayed precisely. Third, tool behavior is treated as an implementation detail even though tool selection is often where risk enters. Fourth, governance appears late, after the agent has already learned a pattern of direct execution.
A dynamic harness turns those hidden debts into explicit engineering objects. A runtime episode captures the goal, inputs, memory state, tool calls, evidence, output, score, and final action. A failure taxonomy says whether the issue was caused by missing evidence, bad autonomy, stale memory, wrong tool use, unclear responsibility, latency pressure, or cost pressure. A scorecard turns the episode into a measurable state vector. A repair boundary defines which fixes can be automatic and which require human review.
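To make these objects concrete, here is a minimal sketch of how a runtime episode and a failure taxonomy might be typed. The field names, the `finalAction` values, and the `tallyFailures` helper are illustrative assumptions drawn from the prose above, not a published schema.

```typescript
// Hypothetical failure taxonomy: one label per cause named in the text.
type FailureCause =
  | "missing_evidence"
  | "bad_autonomy"
  | "stale_memory"
  | "wrong_tool_use"
  | "unclear_responsibility"
  | "latency_pressure"
  | "cost_pressure";

// Hypothetical episode record; every field name here is an assumption.
type RuntimeEpisode = {
  id: string;
  goal: string;
  inputs: Record<string, unknown>;
  memoryState: Record<string, unknown>;
  toolCalls: { tool: string; args: unknown }[];
  evidence: string[]; // references to the evidence the agent preserved
  output: string;
  score: Record<string, number>; // the scorecard as a state vector
  finalAction: "completed" | "escalated" | "aborted"; // assumed values
  failure?: FailureCause; // set only when the episode is classified as a failure
};

// Count failures per cause so that anecdotes become a comparable tally.
function tallyFailures(
  episodes: RuntimeEpisode[]
): Partial<Record<FailureCause, number>> {
  const tally: Partial<Record<FailureCause, number>> = {};
  for (const ep of episodes) {
    if (ep.failure !== undefined) {
      tally[ep.failure] = (tally[ep.failure] ?? 0) + 1;
    }
  }
  return tally;
}
```

A tally like this is one way to see whether a change shifts failures from one cause to another rather than removing them.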
2. The harness as a specification
In normal software, a specification describes what a system should do. In agentic systems, that is not enough. We also need to specify when the system should stop, when it should ask, when it should preserve uncertainty, and when it should treat its own output as untrusted. The harness specification therefore contains five layers.
- Scenario layer: representative and adversarial runtime episodes.
- Gate layer: authority thresholds, evidence requirements, and human approval paths.
- Score layer: quality, latency, cost, responsibility, memory, and safety metrics.
- Repair layer: allowed mutations, approval-required mutations, and rollback rules.
- Audit layer: MARIA coordinates, rationale, evidence paths, and diff history.
This makes the harness more than an evaluation suite. It becomes the contract that implementation must satisfy. A prompt, tool, workflow, or policy is correct only if it improves the runtime vector without violating gates.
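The correctness rule above can be sketched as a small predicate. This sketch assumes that every score axis is normalized so that higher is better (latency and cost included), and it reads "improves" as a Pareto improvement: no axis regresses and at least one axis gains. Both assumptions, and the names `ScoreVector`, `GateResult`, and `isCorrectChange`, are illustrative rather than part of the harness definition.

```typescript
// Axes taken from the score layer; values assumed normalized so higher is better.
type ScoreVector = {
  quality: number;
  latency: number;
  cost: number;
  responsibility: number;
  memory: number;
  safety: number;
};

// Hypothetical result of evaluating one gate during replay.
type GateResult = { gate: string; violated: boolean };

// A change is accepted only if no gate is violated and the score vector
// improves in the Pareto sense (an assumed reading of "improves").
function isCorrectChange(
  before: ScoreVector,
  after: ScoreVector,
  gates: GateResult[]
): boolean {
  if (gates.some((g) => g.violated)) return false;
  const axes = Object.keys(before) as (keyof ScoreVector)[];
  const noRegression = axes.every((a) => after[a] >= before[a]);
  const someGain = axes.some((a) => after[a] > before[a]);
  return noRegression && someGain;
}
```

A team could just as reasonably choose a weighted score or per-axis tolerances; the point is only that the rule is executable rather than prose.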
3. Harness-driven development loop
The development loop has five stages. The first stage is behavioral modeling: define the decisions the agent must make, the context it may use, and the authority it is allowed to exercise. The second stage is episode encoding: convert examples into replayable runtime episodes. The third stage is implementation: prompts, tools, workflows, and policies are written to satisfy the episodes. The fourth stage is replay: every change is tested against the harness. The fifth stage is hardening: failures become new episodes, not disposable bug notes.
    type HarnessDrivenChange = {
      coordinate: string
      scenarioIds: string[]
      allowedAutonomy: "observe" | "draft" | "execute" | "repair"
      gates: string[]
      expectedScoreDelta: {
        quality: number
        responsibility: number
        latency: number
        cost: number
      }
      approvalRequired: boolean
    }
The important point is that implementation is not judged by intent. It is judged by replay. If a new prompt makes demos look better but increases unsafe autonomy in edge cases, it fails. If a tool optimization reduces latency but loses evidence traceability, it fails. If a workflow removes a human checkpoint without proving equivalent responsibility preservation, it fails.
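The three failure conditions just described could be checked mechanically after a replay. The sketch below is one possible shape, assuming the replay produces a summary with a count of unsafe-autonomy events, a traceability flag, and a list of human checkpoints; `ReplaySummary`, `judgeReplay`, and the `provenEquivalent` parameter are hypothetical names.

```typescript
// Hypothetical summary of one full replay of the harness.
type ReplaySummary = {
  unsafeAutonomyEvents: number; // episodes where the agent exceeded its allowed autonomy
  evidenceTraceable: boolean; // whether every output can still be traced to evidence
  humanCheckpoints: string[]; // checkpoint ids present in the workflow
};

// Returns the reasons a candidate fails; an empty list means it passes.
// `provenEquivalent` lists removed checkpoints for which equivalent
// responsibility preservation has been demonstrated (an assumed input).
function judgeReplay(
  baseline: ReplaySummary,
  candidate: ReplaySummary,
  provenEquivalent: string[] = []
): string[] {
  const reasons: string[] = [];
  if (candidate.unsafeAutonomyEvents > baseline.unsafeAutonomyEvents) {
    reasons.push("unsafe autonomy increased");
  }
  if (baseline.evidenceTraceable && !candidate.evidenceTraceable) {
    reasons.push("evidence traceability lost");
  }
  const removed = baseline.humanCheckpoints.filter(
    (c) => !candidate.humanCheckpoints.includes(c) && !provenEquivalent.includes(c)
  );
  for (const c of removed) {
    reasons.push(`human checkpoint removed without proof: ${c}`);
  }
  return reasons;
}
```

Returning reasons rather than a boolean keeps the verdict explainable, which matters when the failure is later turned into a new episode.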
4. Page-splitting as an engineering metaphor
A good landing-page (LP) section fits within a mobile viewport without horizontal shake. A good agentic feature fits within a governance page without responsibility shake. In both cases the discipline is the same: split the surface into bounded pages, prevent hidden overflow, and let dense content scroll inside a controlled container. Harness-driven development applies this principle to runtime behavior. Each episode is a page. Each page must fit. If it cannot fit, it must be split into smaller episodes rather than allowed to overflow into unbounded autonomy.
5. What changes for product teams
The product manager no longer writes only user stories. They define failure stories: what should happen when evidence is missing, when the user asks for an unsafe shortcut, when a data source contradicts memory, or when the agent cannot finish within budget. The engineer no longer asks only whether the code runs. They ask whether the harness state vector improved. The reviewer no longer checks only implementation style. They inspect whether the change modified authority boundaries, evidence paths, or repair permissions.
This creates a practical rule: every feature PR should include the harness delta. What scenarios were added? What scores changed? What gates were touched? What repairs became possible? Which changes remain human-approval-only?
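One way to answer those five questions is to diff two harness snapshots, one from the base branch and one from the PR head. The sketch below assumes a simple snapshot shape; `HarnessSnapshot`, `HarnessDelta`, and `harnessDelta` are hypothetical names, and gate definitions are assumed to be comparable as strings.

```typescript
// Hypothetical snapshot of the harness at one commit.
type HarnessSnapshot = {
  scenarioIds: string[];
  scores: Record<string, number>; // aggregate score per metric
  gates: Record<string, string>; // gate name -> serialized definition
  autoRepairs: string[]; // repairs the system may apply on its own
  approvalOnly: string[]; // changes that still require human approval
};

type HarnessDelta = {
  scenariosAdded: string[];
  scoreChanges: Record<string, number>;
  gatesTouched: string[];
  repairsEnabled: string[];
  approvalOnly: string[];
};

function harnessDelta(base: HarnessSnapshot, head: HarnessSnapshot): HarnessDelta {
  const scoreChanges: Record<string, number> = {};
  for (const metric of Object.keys(head.scores)) {
    // A metric missing from the base is treated as starting at 0 (an assumption).
    const diff = head.scores[metric] - (base.scores[metric] ?? 0);
    if (diff !== 0) scoreChanges[metric] = diff;
  }
  // A gate counts as touched if it was added, removed, or redefined.
  const gateNames = Object.keys({ ...base.gates, ...head.gates });
  return {
    scenariosAdded: head.scenarioIds.filter((id) => !base.scenarioIds.includes(id)),
    scoreChanges,
    gatesTouched: gateNames.filter((g) => base.gates[g] !== head.gates[g]),
    repairsEnabled: head.autoRepairs.filter((r) => !base.autoRepairs.includes(r)),
    approvalOnly: head.approvalOnly,
  };
}
```

The resulting object can be rendered into the PR description so that reviewers see authority changes next to the code diff.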
6. Relationship to automatic implementation
Harness-driven development is the prerequisite for safe automatic implementation. If the harness is weak, an implementation agent has no stable target and will optimize for superficial completion. If the harness is strong, the implementation agent can synthesize code against explicit runtime contracts. It can generate a tool, run the episodes, inspect failures, propose a patch, and stop when score improvement saturates or when the next step crosses an approval boundary.
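The stopping behavior described above can be sketched as a bounded loop. This is a simplified, synchronous illustration: `runEpisodes` and `proposePatch` are hypothetical callbacks standing in for the harness replay and the implementation agent, and the `minGain` and `maxSteps` defaults are arbitrary.

```typescript
type Proposal = { patch: string; needsApproval: boolean };
type StopReason = "saturated" | "approval_boundary" | "budget_exhausted";

// Apply proposed patches while they improve the aggregate harness score.
// Stops when the gain saturates, when a proposal crosses an approval
// boundary, or when the step budget runs out.
function implementUntilStop(
  runEpisodes: (patches: string[]) => number, // aggregate score, higher is better
  proposePatch: (patches: string[], score: number) => Proposal,
  minGain = 0.01,
  maxSteps = 20
): { applied: string[]; score: number; reason: StopReason } {
  const applied: string[] = [];
  let score = runEpisodes(applied);
  for (let step = 0; step < maxSteps; step++) {
    const proposal = proposePatch(applied, score);
    if (proposal.needsApproval) {
      // Hand off to a human instead of applying the patch.
      return { applied, score, reason: "approval_boundary" };
    }
    const nextScore = runEpisodes([...applied, proposal.patch]);
    if (nextScore - score < minGain) {
      // The patch is discarded because it does not improve the score enough.
      return { applied, score, reason: "saturated" };
    }
    applied.push(proposal.patch);
    score = nextScore;
  }
  return { applied, score, reason: "budget_exhausted" };
}
```

Note that the approval check happens before the patch is replayed, so an approval-required change is never executed, even in evaluation.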
The harness is therefore the difference between code generation and governed implementation. Code generation asks: can the model produce plausible code? Governed implementation asks: can the system prove that the code improves runtime behavior under the authority contract?
7. Research agenda
The next research task is to formalize harness completeness. A harness is incomplete when a behavior can pass all current episodes while still violating the intended responsibility model. We can measure this through mutation testing: deliberately perturb prompts, tools, policies, and memory, then ask whether the harness detects degradation. A strong harness is one that fails bad mutations quickly and explains why.
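As a first approximation, harness strength under mutation testing can be summarized as a mutation score: the fraction of deliberately bad mutations that cause at least one episode to fail. The sketch below assumes the replay has already been run for each mutation; the `Mutation` and `MutationOutcome` types and the `mutationScore` function are illustrative names.

```typescript
// A deliberate perturbation of one part of the agent.
type Mutation = {
  id: string;
  target: "prompt" | "tool" | "policy" | "memory";
};

// Result of replaying the harness against one mutated agent.
type MutationOutcome = {
  mutation: Mutation;
  failedEpisodeIds: string[]; // empty means the harness did not notice the mutation
};

// Score = detected mutations / all mutations. Survivors are the mutations
// that passed every episode and therefore point at gaps in the harness.
function mutationScore(outcomes: MutationOutcome[]): {
  score: number;
  survivors: string[];
} {
  const survivors = outcomes
    .filter((o) => o.failedEpisodeIds.length === 0)
    .map((o) => o.mutation.id);
  // With no mutations there is no evidence of strength, so report 0 (a convention).
  const score =
    outcomes.length === 0 ? 0 : (outcomes.length - survivors.length) / outcomes.length;
  return { score, survivors };
}
```

Each surviving mutation is a candidate for a new episode. This score says nothing about how quickly or how clearly the harness explains a failure, which the text also asks for.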
A second task is harness compression. Enterprise systems cannot replay every episode on every change. The harness needs a minimal basis: a small set of episodes whose score gradients predict the behavior of the full episode set. This is where phase-space control becomes practical. The harness should identify which episodes represent distinct regions of risk, then replay the basis on every PR and the full set on scheduled evaluation.
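Selecting a basis from score gradients is an open problem, but the idea of covering distinct regions of risk can be illustrated with a much simpler stand-in: greedy set cover over hand-assigned risk-region tags. The sketch below is that stand-in, not the gradient-based method; `TaggedEpisode`, `riskRegions`, and `selectBasis` are hypothetical names.

```typescript
// An episode labeled with the risk regions it exercises (labels are assumed
// to be assigned by the team, with no duplicates within one episode).
type TaggedEpisode = { id: string; riskRegions: string[] };

// Greedy set cover: repeatedly pick the episode that covers the most
// still-uncovered risk regions until every region is covered.
function selectBasis(episodes: TaggedEpisode[]): string[] {
  const uncovered = new Set<string>();
  for (const ep of episodes) {
    for (const region of ep.riskRegions) uncovered.add(region);
  }
  const basis: string[] = [];
  while (uncovered.size > 0) {
    let best: TaggedEpisode | undefined;
    let bestGain = 0;
    for (const ep of episodes) {
      const gain = ep.riskRegions.filter((r) => uncovered.has(r)).length;
      if (gain > bestGain) {
        best = ep;
        bestGain = gain;
      }
    }
    if (best === undefined) break; // defensive: nothing left that adds coverage
    basis.push(best.id);
    for (const region of best.riskRegions) uncovered.delete(region);
  }
  return basis;
}
```

Greedy cover does not guarantee the smallest possible basis, and it ignores how well the chosen episodes predict scores on the rest; it only ensures that every tagged region is replayed on each PR.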
Conclusion
Harness-driven development is a shift from prompt craft to runtime engineering. It makes the dynamic harness the specification, runtime episodes the regression unit, scorecards the measurement surface, and gates the authority boundary. This is the foundation required before MARIA OS can safely support internal automatic implementation and automatic repair.