Architecture | May 30, 2026 | 19 min read

Governed Auto-Implementation: How a Dynamic Harness Turns Research Intent into Code

From design note to implementation plan, patch, replay, and approval-gated merge

Engineering Case Study

Applies established engineering and mathematical methods to MARIA OS implementation and industry operations. The value is reproducible design, not novelty theater.

Provenance: ARIA-RD-01 G1.U1.P9.Z3.A1
Reviewed by: ARIA-TECH-01, ARIA-QA-01, ARIA-WRITE-01

Abstract

Automatic implementation is often framed as a code-generation problem. That framing is too narrow. The hard part is not producing code. The hard part is preserving intent, responsibility, evidence, and reversibility while code changes. A dynamic harness can turn automatic implementation from a speculative assistant into a governed runtime actor.

The governed auto-implementation loop begins with research intent: a note, issue, design sketch, failure episode, or product request. The harness parses that intent into scope, identifies affected coordinates, generates an implementation plan, applies a bounded patch, replays relevant episodes, classifies risk, and routes the result to automatic merge, agent review, or human approval.

1. The difference between code generation and auto-implementation

Code generation produces files. Auto-implementation changes a system. The distinction matters because a generated file can be impressive while the system becomes worse. A real implementation must preserve interfaces, tests, product intent, runtime evidence, accessibility, security, governance, and operational cost. This requires a control loop, not just a model call.

Governed auto-implementation therefore has three invariants. First, every implementation must be linked to an intent object. Second, every implementation must be evaluated by replaying relevant harness episodes. Third, every implementation must be classified by authority risk before it can be merged.

2. Intent objects

An intent object is a structured representation of why a change should exist. It may originate from a human request, a failed runtime episode, a regression detector, a product roadmap item, or a research note. The harness converts the raw request into a machine-checkable object.

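type ImplementationIntent = {
  id: string
  source: "human" | "episode" | "regression" | "roadmap" | "research" // where the request originated
  summary: string
  coordinates: string[] // affected coordinates identified by the harness
  expectedBehavior: string[]
  forbiddenChanges: string[] // negative scope: what the patch must not touch
  evidenceRequired: string[]
  approvalPolicy: "auto" | "agent-review" | "human" // route the change takes before merge
}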

The key field is forbiddenChanges. Automatic implementation must know not only what to do but what not to touch. Without negative scope, implementation agents tend to solve adjacent problems, refactor unrelated areas, or modify authority boundaries because those changes make the immediate task easier.
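A minimal sketch of how negative scope might be enforced against a patch. It assumes that forbiddenChanges entries are path prefixes, which the type above does not specify; the NegativeScope type, the findForbiddenTouches helper, and the example paths are hypothetical.

```typescript
// Assumption: forbiddenChanges entries are path prefixes such as "src/auth/".
// The ImplementationIntent type does not specify their format.
type NegativeScope = { forbiddenChanges: string[] }

// Returns every changed path that falls under a forbidden prefix.
function findForbiddenTouches(scope: NegativeScope, changedPaths: string[]): string[] {
  return changedPaths.filter((path) =>
    scope.forbiddenChanges.some((prefix) => path.indexOf(prefix) === 0)
  )
}

// Hypothetical example: a UI fix that also edited an escalation config file.
const forbiddenTouches = findForbiddenTouches(
  { forbiddenChanges: ["src/auth/", "config/escalation"] },
  ["src/ui/Button.tsx", "config/escalation.json"]
)
// A non-empty result means the patch left its allowed scope.
```

A harness built this way could reject the patch outright, or escalate it, whenever the returned list is non-empty.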

3. Seven-stage loop

The governed loop has seven stages. Intent parse converts prose into a structured target. Scope resolution maps that target to files, APIs, data contracts, UI surfaces, and MARIA coordinates. Plan generation proposes the smallest viable implementation. Patch synthesis edits the code. Replay runs the harness basis, the set of harness episodes relevant to the change, against the patched system. Risk classification determines whether the patch touched authority, data, security, schema, prompt, or workflow boundaries. Approval routing decides whether the change can merge automatically, needs agent review, or must wait for a human.

This loop deliberately separates patch synthesis from approval. An implementation agent may write the patch, but the harness decides whether that patch is allowed to move forward.
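The ordering of the loop can be sketched as a small pipeline that halts at the first failed stage. This is an illustration under stated assumptions, not the MARIA OS implementation: the stage identifiers, the StageResult shape, and runLoop are hypothetical names.

```typescript
// The seven stages named in the text, in order.
type Stage =
  | "intent-parse"
  | "scope-resolution"
  | "plan-generation"
  | "patch-synthesis"
  | "replay"
  | "risk-classification"
  | "approval-routing"

const STAGES: Stage[] = [
  "intent-parse",
  "scope-resolution",
  "plan-generation",
  "patch-synthesis",
  "replay",
  "risk-classification",
  "approval-routing",
]

// Hypothetical result shape: each stage reports success or failure.
type StageResult = { ok: boolean }

// Runs the stages in order and halts at the first failure, so a patch that
// fails replay never reaches risk classification or approval routing.
function runLoop(runStage: (stage: Stage) => StageResult): {
  completed: Stage[]
  haltedAt: Stage | null
} {
  const completed: Stage[] = []
  for (const stage of STAGES) {
    if (!runStage(stage).ok) {
      return { completed, haltedAt: stage }
    }
    completed.push(stage)
  }
  return { completed, haltedAt: null }
}

// Example with a stubbed stage runner in which replay fails.
const outcome = runLoop((stage) => ({ ok: stage !== "replay" }))
```

In this sketch the agent that synthesizes the patch has no say over the last two stages; they run only if the harness lets the patch through replay.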

4. Risk classes

We classify auto-implementation changes into four risk classes. Class 1 is cosmetic: copy, layout, styles, and documentation that do not alter behavior. Class 2 is local behavior: a component, utility, or route changes behavior within a bounded surface. Class 3 is workflow behavior: the change modifies how steps are ordered, retried, escalated, or evaluated. Class 4 is authority mutation: the change affects who can decide, when gates fire, what evidence is required, or what can be modified automatically.

| Risk class | Examples | Default route |
| --- | --- | --- |
| Class 1 | Text, responsive CSS, docs | Auto after build |
| Class 2 | Local UI logic, helper behavior | Agent review after tests |
| Class 3 | Workflow DAG, retry policy, scoring | Human approval if production-bound |
| Class 4 | Authority, schema, policy, prompt core | Human approval required |

The dynamic harness must be conservative here. A change that appears small in code can be large in authority. For example, changing a threshold from 0.82 to 0.75 may be one line, but if that threshold controls human escalation, it is an authority mutation.
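One way to encode this conservatism is to classify a patch by the highest-risk boundary it touches rather than by its size. The sketch below assumes a diff analyzer has already tagged the patch with boundary labels; the Boundary tags, BOUNDARY_CLASS mapping, and function names are hypothetical, and the route for a Class 3 change that is not production-bound is an assumption because the table does not state it.

```typescript
type RiskClass = 1 | 2 | 3 | 4
type Route = "auto" | "agent-review" | "human"

// Hypothetical boundary tags that a diff analyzer might attach to a patch.
type Boundary =
  | "copy" | "styles" | "docs"
  | "local-logic"
  | "workflow" | "scoring"
  | "authority" | "schema" | "policy" | "prompt-core"

// Mapping derived from the examples column of the risk table.
const BOUNDARY_CLASS: Record<Boundary, RiskClass> = {
  copy: 1, styles: 1, docs: 1,
  "local-logic": 2,
  workflow: 3, scoring: 3,
  authority: 4, schema: 4, policy: 4, "prompt-core": 4,
}

// Conservative rule: a patch takes the highest class among the boundaries
// it touches. An empty list falls back to Class 1 here; a stricter harness
// might treat untagged patches as Class 4 instead.
function classify(touched: Boundary[]): RiskClass {
  let highest: RiskClass = 1
  for (const boundary of touched) {
    if (BOUNDARY_CLASS[boundary] > highest) highest = BOUNDARY_CLASS[boundary]
  }
  return highest
}

// Default routes from the table. Assumption: a Class 3 change that is not
// production-bound goes to agent review.
function defaultRoute(risk: RiskClass, productionBound: boolean): Route {
  if (risk === 1) return "auto"
  if (risk === 2) return "agent-review"
  if (risk === 3) return productionBound ? "human" : "agent-review"
  return "human"
}

// The one-line escalation-threshold change from the text, tagged "authority".
const thresholdRisk = classify(["authority"])
const thresholdRoute = defaultRoute(thresholdRisk, true)
```

Under these assumptions the threshold edit is Class 4 and routes to a human, even though the diff is a single line.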

5. Implementation plans as reviewable artifacts

The implementation plan should be reviewed before patching when risk is high. It contains affected files, intended diffs, expected tests, expected score changes, and rollback strategy. This makes the implementation agent accountable before it writes code.

A good plan is small. It prefers local patches over broad refactors, existing patterns over new abstractions, and testable behavior over architectural ambition. The harness should penalize plans that expand scope without evidence.
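A plan can be made reviewable by giving it a fixed shape and checking it mechanically before any code is written. The sketch below is illustrative: the ImplementationPlan field names, the planIssues helper, and the example file paths are hypothetical, and scope is assumed to be a list of exact file paths produced by scope resolution.

```typescript
// Hypothetical shape for the plan contents listed in the text.
type ImplementationPlan = {
  intentId: string
  affectedFiles: string[]
  intendedDiffs: string[]
  expectedTests: string[]
  expectedScoreChanges: { [metric: string]: number }
  rollbackStrategy: string
}

// Files the plan touches that scope resolution did not identify.
function scopeExpansion(plan: ImplementationPlan, resolvedScope: string[]): string[] {
  return plan.affectedFiles.filter((file) => resolvedScope.indexOf(file) === -1)
}

// Reasons a reviewer, human or agent, might reject the plan before patching.
function planIssues(plan: ImplementationPlan, resolvedScope: string[]): string[] {
  const issues: string[] = []
  const extra = scopeExpansion(plan, resolvedScope)
  if (extra.length > 0) issues.push("scope expansion: " + extra.join(", "))
  if (plan.expectedTests.length === 0) issues.push("no expected tests")
  if (plan.rollbackStrategy.trim() === "") issues.push("no rollback strategy")
  return issues
}

// Example: a banner fix that also edits a shared date helper and omits rollback.
const examplePlan: ImplementationPlan = {
  intentId: "intent-001",
  affectedFiles: ["src/ui/Banner.tsx", "src/lib/date.ts"],
  intendedDiffs: ["Fix banner date format"],
  expectedTests: ["Banner renders localized date"],
  expectedScoreChanges: { quality: 0.01 },
  rollbackStrategy: "",
}
const exampleIssues = planIssues(examplePlan, ["src/ui/Banner.tsx"])
```

Here the plan is flagged twice: once for touching a file outside the resolved scope and once for the missing rollback strategy.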

6. Replay as the merge predicate

The central merge predicate is not whether the patch compiles. Compilation is necessary but insufficient. The patch must improve or preserve the relevant runtime state vector. That vector includes quality, responsibility, evidence completeness, latency, cost, and reversibility.

merge(patch) = pass(build) ∧ pass(types) ∧ pass(harness_basis) ∧ ¬authority_violation(patch)

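Here pass(harness_basis) means that replaying the basis leaves the runtime state vector improved or preserved. For Class 1 changes, the harness basis may be small. For Class 3 or 4 changes, the basis must include adversarial episodes, rollback tests, and approval-path verification.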
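The merge predicate translates directly into a boolean function. The sketch below also shows one possible reading of "improve or preserve" for the state vector; it assumes each dimension is a scalar score, that latency and cost are better when lower, and that no dimension may regress. The type and function names are hypothetical, and a real harness might tolerate bounded trade-offs between dimensions.

```typescript
// Assumption: every dimension of the runtime state vector is a scalar score.
type StateVector = {
  quality: number
  responsibility: number
  evidenceCompleteness: number
  latency: number
  cost: number
  reversibility: number
}

// Strict reading of "improve or preserve": no dimension may get worse.
// Latency and cost are treated as lower-is-better, the rest as higher-is-better.
function preservesState(before: StateVector, after: StateVector): boolean {
  return (
    after.quality >= before.quality &&
    after.responsibility >= before.responsibility &&
    after.evidenceCompleteness >= before.evidenceCompleteness &&
    after.reversibility >= before.reversibility &&
    after.latency <= before.latency &&
    after.cost <= before.cost
  )
}

// One flag per term of the merge predicate.
type PatchChecks = {
  build: boolean
  types: boolean
  harnessBasis: boolean
  authorityViolation: boolean
}

// merge(patch) = pass(build) AND pass(types) AND pass(harness_basis)
//                AND NOT authority_violation(patch)
function canMerge(checks: PatchChecks): boolean {
  return checks.build && checks.types && checks.harnessBasis && !checks.authorityViolation
}

// Example: the patch compiles but replay shows a latency regression.
const before: StateVector = {
  quality: 0.9, responsibility: 1, evidenceCompleteness: 0.8,
  latency: 120, cost: 1.0, reversibility: 1,
}
const after: StateVector = {
  quality: 0.92, responsibility: 1, evidenceCompleteness: 0.8,
  latency: 180, cost: 1.0, reversibility: 1,
}
const mergeAllowed = canMerge({
  build: true,
  types: true,
  harnessBasis: preservesState(before, after),
  authorityViolation: false,
})
```

In this example the build and type checks pass, yet the merge is blocked because replay shows the latency dimension getting worse.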

7. Internal auto-implementation in MARIA OS

Inside MARIA OS, auto-implementation should be treated as an internal agent with limited authority. It can propose patches, run tests, inspect failures, and open draft pull requests. It cannot silently modify production schemas, deploy global policy changes, rewrite core prompts, or expand its own authority. Those operations require explicit gates.

This design preserves the benefit of autonomous implementation while preventing recursive authority creep. The implementation agent can improve the system, but it cannot decide the constitution under which it improves the system.
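The authority limits described above can be sketched as an allow-list plus an external gate. The action identifiers and the authorize function are hypothetical; the key assumption is that the gateApproved flag is set by an approval process outside the implementation agent and cannot be set by the agent itself.

```typescript
// Actions named in the text, as hypothetical identifiers.
type AgentAction =
  | "propose-patch"
  | "run-tests"
  | "inspect-failures"
  | "open-draft-pr"
  | "modify-production-schema"
  | "deploy-global-policy"
  | "rewrite-core-prompt"
  | "expand-own-authority"

// Actions the implementation agent may take without an explicit gate.
const UNGATED_ACTIONS: AgentAction[] = [
  "propose-patch",
  "run-tests",
  "inspect-failures",
  "open-draft-pr",
]

// Anything outside the allow-list is denied unless an explicit gate approved it.
// Assumption: gateApproved comes from outside the agent, never from the agent.
function authorize(action: AgentAction, gateApproved: boolean): "allow" | "deny" {
  if (UNGATED_ACTIONS.indexOf(action) !== -1) return "allow"
  return gateApproved ? "allow" : "deny"
}

const draftPr = authorize("open-draft-pr", false)
const selfExpansion = authorize("expand-own-authority", false)
```

Because the list of ungated actions is itself part of the authority boundary, a change to it would count as a Class 4 change under the risk classes above.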

Conclusion

Governed auto-implementation is not code generation with better prompts. It is a runtime control loop around code change. The dynamic harness supplies the missing structure: intent, scope, replay, risk classification, approval, and rollback. With that structure, internal automatic implementation becomes a measurable engineering capability rather than an uncontrolled productivity demo.

R&D BENCHMARKS

Loop stages: 7. Intent parse, scope, plan, patch, replay, risk classify, approval route.

Merge rule: evidence-first. No automatic merge without scenario evidence and diff rationale.

Risk boundary: 4 classes. Cosmetic, local behavior, workflow behavior, and authority mutation.

Output: repairable PR. Every auto-implementation artifact is designed for review, replay, rollback, and future repair.

Published by Bonginkan and reviewed by the MARIA OS Editorial Pipeline.

© 2026 Bonginkan / MARIA OS. All rights reserved.