Abstract
The central question for agentic systems is shifting from model intelligence to runtime phase control. A long-running agent is not a single response generator. It is a dynamic system with goals, memory, identity, authority, quality, latency, cost pressure, and responsibility boundaries. Once those variables start moving together, a conventional evaluation harness can tell us whether one output passed, but it cannot tell us whether the system is drifting into retry loops, memory decay, identity fragmentation, or governance leakage.
This article defines the Dynamic Harness as a Runtime Governance Layer that observes, evaluates, and controls the phase space of an agent runtime. It connects MARIA OS research with implementation lessons from bonginkan/virtual-talent, where Producer AI already normalizes jobs into runtime episodes, classifies failures, builds dynamic scorecards, proposes repair scopes, and routes safe self-healing actions through explicit approval boundaries.
The result is a practical research frame: a harness is no longer only a test wrapper. It becomes the operating surface that converts runtime drift into reruns, quarantine, draft repair PRs, human approvals, policy changes, and measurable improvement loops.
1. From Test Harness to Control Harness
Traditional software harnesses isolate a unit, run it under fixed conditions, and compare the result against an expected contract. That remains necessary. Agentic systems still need type checks, schema checks, UI contracts, tenant boundaries, regression tests, and quality gates.
But agent runtime behavior is not a point. It is a trajectory. The system reads memory, chooses tools, coordinates agents, retries, hides failures, takes shortcuts under latency pressure, and learns from prior outcomes. A single passing output can coexist with a worsening runtime phase: correction rates rise, retry loops thicken, identity signals degrade, or an advisory that once improved quality begins to poison future runs.
The Dynamic Harness therefore asks a different question: not only "did this output pass?" but also "what phase is the runtime entering?"
2. The virtual-talent Reference Pattern
The virtual-talent Producer AI work provides a concrete implementation pattern. Producer jobs are normalized into runtime episodes. Each episode can include intent, stages, participating agents, quality gates, advisories, generated assets, retries, holds, failures, event counts, and duration.
That structure turns operational noise into a governable object. Once episodes exist, failures can be classified, owners can be assigned, scorecards can be produced, repair proposals can be scoped, and self-healing can be bounded.
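As a rough illustration of what normalizing job events into an episode could look like, here is a minimal sketch. The event schema (`kind`, `ts`, `agent`, `stage`, `failure`), the `RuntimeEpisode` class, and the `normalize` function are all assumptions made for this example; they are not taken from the virtual-talent codebase, and only a subset of the episode fields listed above is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class RuntimeEpisode:
    """One analyzable unit built from the raw events of a single job."""
    job_id: str
    intent: str
    stages: list[str] = field(default_factory=list)
    agents: set[str] = field(default_factory=set)
    failures: list[str] = field(default_factory=list)
    retries: int = 0
    holds: int = 0
    event_count: int = 0
    duration_s: float = 0.0


def normalize(job_id: str, intent: str, events: list[dict]) -> RuntimeEpisode:
    """Fold raw events (hypothetical schema) into one episode."""
    episode = RuntimeEpisode(job_id=job_id, intent=intent)
    for event in events:
        episode.event_count += 1
        if "agent" in event:
            episode.agents.add(event["agent"])
        kind = event["kind"]
        if kind == "stage_started":
            episode.stages.append(event["stage"])
        elif kind == "retry":
            episode.retries += 1
        elif kind == "hold":
            episode.holds += 1
        elif kind == "failure":
            episode.failures.append(event["failure"])
    if events:
        # Duration is the span between the earliest and latest timestamp.
        timestamps = [event["ts"] for event in events]
        episode.duration_s = max(timestamps) - min(timestamps)
    return episode


# Illustrative events only; real jobs would emit far more.
events = [
    {"kind": "stage_started", "ts": 0.0, "stage": "plan", "agent": "planner"},
    {"kind": "failure", "ts": 4.0, "failure": "provider_failure", "agent": "renderer"},
    {"kind": "retry", "ts": 5.0, "agent": "renderer"},
    {"kind": "stage_started", "ts": 9.5, "stage": "render", "agent": "renderer"},
]
episode = normalize("job-1", "demo intent", events)
print(episode.retries, episode.failures, episode.duration_s)
```

The point of the fold is that every later layer (taxonomy, scorecard, repair) can read one typed record instead of re-parsing the event stream.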
| Dynamic Harness layer | virtual-talent pattern | MARIA OS expansion |
|---|---|---|
| Runtime episode | Producer job events become one analyzable unit | Decisions, audits, sales flows, meetings, code changes |
| Failure taxonomy | intent mismatch, identity drift, retry loop, provider failure | memory drift, authority leak, responsibility mismatch |
| Owner mapping | planning, UX, quality, provider, platform | Planet, Zone, Agent, Human Gate, Executive Gate |
| Scorecard | completion, pass rate, retry, advisory usage | business, trust, responsibility, and governance KPIs |
| Repair proposal | scoped fix plus verification commands | PRs, policy updates, gate changes, memory pruning |
| Controlled healing | rerun, quarantine, draft PR, human approval | fail-closed autonomy management |
The important move is that the harness does not stop at diagnosis. It produces the next operational action.
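A small sketch can make the step from diagnosis to action concrete. The failure and owner names below come from the table above, but the pairing of failures with owners, the two severity levels, and the rules that choose an action are assumptions chosen for illustration, not the actual virtual-talent or MARIA OS logic.

```python
from dataclasses import dataclass

# Hypothetical pairing of failure kinds with owners (illustration only).
OWNER_BY_FAILURE = {
    "intent_mismatch": "planning",
    "identity_drift": "quality",
    "retry_loop": "platform",
    "provider_failure": "provider",
}


@dataclass(frozen=True)
class TypedFailure:
    kind: str
    severity: str      # "low" or "high" (assumed scale)
    owner: str
    next_action: str   # rerun, quarantine, draft_pr, or human_approval


def classify(kind: str, severity: str) -> TypedFailure:
    """Attach an owner and a next operational action to a raw failure kind."""
    owner = OWNER_BY_FAILURE.get(kind)
    if owner is None:
        # Unknown failure kinds have no owner, so they go to a human.
        return TypedFailure(kind, severity, "unassigned", "human_approval")
    if severity == "high":
        action = "quarantine"
    elif kind == "provider_failure":
        # Assumed to be transient, so a rerun is the cheapest response.
        action = "rerun"
    else:
        action = "draft_pr"
    return TypedFailure(kind, severity, owner, action)


print(classify("provider_failure", "low"))
print(classify("unknown_kind", "low"))
```

Whatever the real rules are, the design property worth keeping is that every classified failure leaves the function with both an owner and an action.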
3. Agent Runtime as Phase Space
MARIA OS can represent an agent runtime as a state vector.
$$
x_t = (G_t, M_t, I_t, Q_t, L_t, C_t, R_t, A_t)
$$

The components are:

- G_t: goal coherence
- M_t: memory integrity
- I_t: identity continuity
- Q_t: quality state
- L_t: latency pressure
- C_t: cost pressure
- R_t: responsibility demand
- A_t: authority boundary
The harness does not directly observe x_t. It observes y_t: logs, outputs, user corrections, gate decisions, tool calls, memory references, latency, cost events, and approval traces. The harness is therefore both an observation layer and a controller: it maps the observation history to a control input.

$$
u_t = H(y_{0:t})
$$
The control input u_t may be a rerun, quarantine, draft repair PR, policy update, memory pruning, gate escalation, or human approval request. This makes the harness a runtime controller rather than a static checklist.
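The following is one possible reading of u_t = H(y_{0:t}) as code: a function that takes the full observation history and returns a single control input. The `Observation` fields, the `loop_window` parameter, and the specific rules are hypothetical. A real harness would presumably use richer signals and learned or tuned thresholds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """One observed step y_t (fields are illustrative, not a real schema)."""
    failure: Optional[str] = None       # typed failure seen at this step, if any
    authority_violation: bool = False   # a gate reported a boundary breach


def harness(history: list[Observation], loop_window: int = 3) -> str:
    """Sketch of u_t = H(y_{0:t}): map the observation history to a control input."""
    if not history:
        return "continue"
    latest = history[-1]
    if latest.authority_violation:
        # Boundary breaches always go to a human, regardless of history.
        return "human_approval"
    recent = history[-loop_window:]
    same_failure_repeating = (
        len(recent) == loop_window
        and latest.failure is not None
        and all(obs.failure == latest.failure for obs in recent)
    )
    if same_failure_repeating:
        # The same failure filled the whole window: stop retrying.
        return "quarantine"
    if latest.failure is not None:
        return "rerun"
    return "continue"


history = [Observation("provider_failure")] * 3
print(harness(history[:1]), harness(history))
```

Note that the decision depends on the history, not only on the latest observation: the same failure yields a rerun the first time and a quarantine once it fills the window.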
4. Phase-Level Failure Modes
A phase-level harness detects regions of self-reinforcing behavior. These states are not single failures. They are runtime attractors.
| Phase | Symptom | Control action |
|---|---|---|
| Stable production | Quality, latency, and correction rates are steady | Lightweight monitoring |
| Retry loop | The same class of failure repeats | Suppress loop, hold, route to owner |
| Identity drift | Persona, face, role, or voice continuity weakens | Identity gate, reference lock, memory pruning |
| Goal mutation | The agent optimizes away from the original goal | Goal consistency check, human gate |
| Governance leak | Authority or responsibility boundaries blur | Fail closed, escalate approval |
| Latency freeze | Slow paths collapse quality | Budgeted fallback, degradation policy |
| Advisory poisoning | Learned guidance makes future runs worse | ON/OFF evaluation, quarantine |
This is where Dynamic Harness becomes more than evaluation. It sees the slope, not just the point.
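To illustrate the difference between a point check and a slope check, the sketch below fits a least-squares line to a short series of retry rates. The thresholds (`point_limit`, `slope_limit`) and the intermediate `"drifting"` label are assumptions for this example and do not appear in the phase table above.

```python
def slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


def retry_phase(
    retry_rates: list[float],
    point_limit: float = 0.30,   # assumed static threshold on the latest value
    slope_limit: float = 0.03,   # assumed threshold on the per-window trend
) -> str:
    """Label a series of per-window retry rates using both point and slope."""
    if retry_rates and retry_rates[-1] >= point_limit:
        return "retry_loop"
    if slope(retry_rates) >= slope_limit:
        # Every point passes the static check, but the trend is upward.
        return "drifting"
    return "stable"


print(retry_phase([0.05, 0.10, 0.15, 0.20]))
print(retry_phase([0.10, 0.12, 0.09, 0.11]))
```

In the first series every individual window would pass a static 0.30 threshold, yet the trend check flags it. That is the case a point-only harness misses.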
5. The Five-Layer Harness Stack
The minimal MARIA OS Dynamic Harness has five layers.
- Runtime Episode Layer. Normalize every meaningful agent action into a durable episode with coordinates, intent, memory, tools, gates, evidence, corrections, and final state.
- Failure Taxonomy Layer. Convert raw failure signals into typed failures with severity, confidence, owner, user visibility, suggested action, and verification.
- Dynamic Scorecard Layer. Track completion, quality pass rate, retry rate, human correction rate, advisory lift, owner failure density, duration, and release blockers over time.
- Repair Proposal Layer. Convert repeated failures and scorecard drift into scoped changes with tests or harnesses that can verify the improvement.
- Controlled Self-Healing Layer. Allow low-risk reruns or quarantine while requiring human approval for schema, deployment, global policy, core prompt, tenant boundary, or authority changes.
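The approval boundary in the last layer can be sketched as a small fail-closed gate. The protected scopes and low-risk actions mirror the list above, but the string identifiers and the `authorize` function are hypothetical names introduced for this example.

```python
# Scopes the text says always need a human; the string names are illustrative.
HUMAN_ONLY_SCOPES = frozenset({
    "schema", "deployment", "global_policy",
    "core_prompt", "tenant_boundary", "authority",
})

# Actions the text treats as low risk.
LOW_RISK_ACTIONS = frozenset({"rerun", "quarantine"})


def authorize(action: str, scopes: set[str]) -> str:
    """Return "auto" only for a low-risk action that touches no protected scope."""
    if scopes & HUMAN_ONLY_SCOPES:
        return "needs_human"
    if action in LOW_RISK_ACTIONS:
        return "auto"
    # Fail closed: anything unrecognized waits for a person.
    return "needs_human"


print(authorize("rerun", set()))
print(authorize("rerun", {"core_prompt"}))
print(authorize("draft_pr", set()))
```

The design choice is that the default branch denies. Adding a new self-healing action then requires an explicit entry in the allow list rather than an explicit entry in a deny list.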
6. Why This Matters for MARIA OS
MARIA OS is not only an agent management surface. It is an operating system for human-agent organizations. That means it must govern the runtime, not merely orchestrate tasks.
The Dynamic Harness becomes the kernel boundary for autonomy. It determines when an agent can continue, when it must degrade gracefully, when a policy must be rewritten, when a memory should be pruned, when a draft PR is appropriate, and when the system must stop and return authority to a human.
This is also why the harness is a values layer. Values are not executed because they are written in a document. They are executed when the runtime knows when to stop, when to ask, when to quarantine, and when to preserve responsibility even if automation would be faster.
7. Research Agenda
Dynamic Harness research sits at the intersection of control theory, runtime assurance, anomaly detection, process mining, causal inference, and self-healing systems.
The open problems are clear:
- Observability: infer hidden runtime state from partial logs, outputs, corrections, gates, and memory traces.
- Causality: distinguish whether a quality lift came from a prompt, advisory, provider, memory, or random variation.
- Stability: prevent self-healing from becoming control oscillation.
- Topology: detect phase changes in high-dimensional agent state spaces.
- Legitimacy: define who sets thresholds, who approves autonomy, and who audits the harness itself.
The last point matters most. A harness that controls autonomy is itself a governance object. It must be visible, testable, accountable, and bounded.
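For the stability problem in the list above, one standard device from control engineering is hysteresis: use a higher threshold to start an intervention than to end it, so a metric hovering near one threshold cannot toggle the intervention on every step. The sketch below is a generic illustration of that idea, not a description of how MARIA OS handles oscillation. The class name and threshold values are assumptions.

```python
class HysteresisGate:
    """Two-threshold switch: trips at `enter`, releases only at or below `exit_`."""

    def __init__(self, enter: float, exit_: float) -> None:
        if exit_ >= enter:
            raise ValueError("exit_ must be below enter")
        self.enter = enter
        self.exit_ = exit_
        self.active = False

    def update(self, value: float) -> bool:
        """Feed one metric value and return whether the intervention is active."""
        if not self.active and value >= self.enter:
            self.active = True
        elif self.active and value <= self.exit_:
            self.active = False
        return self.active


# Hypothetical retry rates that hover around the upper threshold.
gate = HysteresisGate(enter=0.30, exit_=0.15)
states = [gate.update(rate) for rate in [0.10, 0.31, 0.29, 0.31, 0.20, 0.14, 0.16]]
print(states)
```

With a single threshold at 0.30, the series above would switch the intervention on, off, on, and off again. With the gap between `enter` and `exit_`, it switches on once and off once.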
8. Conclusion
The next AI infrastructure race is not only about larger models. It is about the ability to operate intelligence without breaking it.
Static harnesses preserve contracts. Dynamic Harnesses control phases. Static harnesses say whether the build passed. Dynamic Harnesses say whether the runtime is drifting into a dangerous attractor and what action should happen next.
The implementation pattern emerging from virtual-talent gives MARIA OS a concrete path: runtime episodes, failure taxonomy, dynamic scorecards, repair proposals, and controlled self-healing. Extending that pattern from Producer AI to companies, governance systems, and agentic society is the next step.
To run intelligence safely, we need more than smart agents. We need harnesses that can observe the phase space, detect unstable attractors, and apply responsible control inputs before the system breaks.