Abstract
The proliferation of AI agents in enterprise operations demands a rigorous, standardized approach to quality measurement. Unlike traditional software where a function either returns the correct value or does not, AI agents make contextual decisions across multi-step workflows, invoke governance gates with varying appropriateness, produce evidence bundles of varying completeness, and do all of this under latency constraints. The absence of standardized evaluation infrastructure means that agent quality is typically assessed through anecdotal observation, spot-checking, or post-incident analysis — none of which scale, none of which are reproducible, and none of which provide the continuous feedback loops necessary for systematic improvement.
This paper introduces the MARIA OS Evaluation Harness, a testing infrastructure designed from first principles for agent quality measurement. We define four test categories that collectively cover the agent quality space: correctness (does the agent make the right decisions?), safety (does the agent avoid harmful actions?), performance (does the agent meet latency and throughput requirements?), and governance compliance (does the agent respect the organizational decision architecture?). For each category, we specify concrete metrics, threshold criteria, and evaluation protocols.
The harness architecture consists of four components: a test runner that orchestrates scenario execution with deterministic seeding, a scenario generator that produces parameterized test cases from domain templates, an oracle comparator that evaluates agent outputs against validated reference decisions, and a regression detector that identifies statistically significant quality degradation across evaluation runs. All components are scoped through the MARIA coordinate system, enabling test targeting at any level of the organizational hierarchy — from individual agents to entire universes.
1. Introduction: Why Agent Quality Needs Standardized Measurement
Software engineering spent decades developing testing methodologies — unit tests, integration tests, property-based tests, mutation tests, performance benchmarks — because engineers learned through painful experience that untested code is unreliable code. The AI agent ecosystem is at an earlier stage: we deploy agents into decision-critical workflows but lack equivalent testing infrastructure.
The gap is not merely tooling. It is conceptual. Traditional tests verify deterministic contracts: given input X, the function must return Y. Agent evaluation must verify stochastic policies: given context C, the agent should make decisions within an acceptable distribution D, while respecting governance constraints G, producing evidence of quality Q, within latency bound L. This multi-dimensional quality space requires a fundamentally different evaluation framework.
Three properties distinguish agent testing from traditional software testing:
| Dimension | Traditional Software Testing | Agent Quality Evaluation |
| --- | --- | --- |
| Correctness | Deterministic: exact output match | Distributional: acceptable decision range |
| Safety | Boundary validation, null checks | Behavioral constraint verification over trajectories |
| Performance | Throughput, latency (single operation) | Decision quality under resource constraints |
| Compliance | API contract adherence | Governance gate invocation, evidence production |

The MARIA OS Evaluation Harness addresses all four dimensions through a unified architecture that produces a single composite quality score while preserving dimensional transparency.
2. Test Categories
2.1 Correctness
Correctness evaluation measures whether an agent makes decisions that align with validated reference outcomes. Unlike unit testing where correctness is binary, agent correctness exists on a spectrum. An agent that approves a low-risk expense report without human review may be correct in one organizational context and incorrect in another where all expenses require manager sign-off.
We define correctness relative to a decision oracle — a validated reference policy that specifies the correct action for each scenario in the test suite. The oracle is constructed by domain experts and encoded as a decision table mapping (context, action) pairs to correctness labels.
```typescript
interface DecisionOracle {
  scenarioId: string;
  context: ScenarioContext;
  expectedAction: AgentAction;
  acceptableVariants: AgentAction[];
  correctnessWeight: number; // importance of this scenario [0, 1]
}

function evaluateCorrectness(
  agent: AgentUnderTest,
  oracle: DecisionOracle[],
  seed: number
): CorrectnessReport {
  const prng = createPRNG(seed);
  const results = oracle.map((scenario) => {
    const agentAction = agent.decide(scenario.context, prng);
    const isExactMatch = deepEqual(agentAction, scenario.expectedAction);
    const isAcceptable = scenario.acceptableVariants.some(
      (v) => deepEqual(agentAction, v)
    );
    return {
      scenarioId: scenario.scenarioId,
      correct: isExactMatch || isAcceptable,
      exact: isExactMatch,
      weight: scenario.correctnessWeight,
    };
  });
  const weightedAccuracy =
    results.reduce((sum, r) => sum + (r.correct ? r.weight : 0), 0) /
    results.reduce((sum, r) => sum + r.weight, 0);
  return { results, weightedAccuracy };
}
```

2.2 Safety
Safety evaluation verifies that agents never take actions classified as harmful, even when prompted by adversarial or ambiguous inputs. Safety test scenarios are designed to probe boundary conditions: edge cases where the correct action is refusal, escalation, or gate invocation rather than direct execution.
We categorize safety violations into three severity levels: S1 (Critical) — actions that cause irreversible harm or violate legal constraints; S2 (Serious) — actions that bypass governance gates or produce incomplete evidence; S3 (Minor) — actions that are suboptimal but do not violate safety invariants. The harness enforces a zero-tolerance policy for S1 violations: any agent producing even one S1 violation fails the safety evaluation regardless of all other scores.
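The zero-tolerance rule admits a direct implementation sketch; the `SafetyViolation` shape below is illustrative rather than the harness's actual type:

```typescript
type Severity = "S1" | "S2" | "S3";

interface SafetyViolation {
  scenarioId: string;
  severity: Severity;
}

// Zero tolerance for S1: a single critical violation fails the safety
// evaluation outright, regardless of how few S2/S3 findings there are.
function safetyVerdict(violations: SafetyViolation[]): "pass" | "fail" {
  return violations.some((v) => v.severity === "S1") ? "fail" : "pass";
}
```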
2.3 Performance
Performance evaluation measures agent decision quality under resource constraints — latency budgets, concurrent request loads, and degraded infrastructure conditions. The key insight is that performance testing for agents is not merely about speed; it is about graceful degradation. An agent that returns fast but wrong answers under load is worse than an agent that correctly escalates to a human when it cannot meet quality thresholds within the latency budget.
```typescript
interface PerformanceScenario {
  concurrentRequests: number;
  latencyBudgetMs: number;
  infrastructureDegradation: "none" | "partial" | "severe";
  expectedBehavior: "normal" | "graceful-degradation" | "escalation";
}

const PERFORMANCE_SCENARIOS: PerformanceScenario[] = [
  { concurrentRequests: 1, latencyBudgetMs: 500, infrastructureDegradation: "none", expectedBehavior: "normal" },
  { concurrentRequests: 50, latencyBudgetMs: 2000, infrastructureDegradation: "none", expectedBehavior: "normal" },
  { concurrentRequests: 200, latencyBudgetMs: 5000, infrastructureDegradation: "partial", expectedBehavior: "graceful-degradation" },
  { concurrentRequests: 500, latencyBudgetMs: 10000, infrastructureDegradation: "severe", expectedBehavior: "escalation" },
];
```

2.4 Governance Compliance
Governance compliance evaluation verifies that agents correctly participate in the organizational decision architecture. This includes: invoking required responsibility gates at appropriate decision points, producing evidence bundles that meet the schema specification, respecting approval workflows for decisions above their autonomy threshold, and logging all actions with MARIA coordinates for auditability.
Governance compliance is the most distinctive test category in the MARIA OS Evaluation Harness — it has no equivalent in traditional software testing. While security testing verifies that software cannot be exploited, governance compliance testing verifies that agents actively participate in organizational accountability structures.
3. Evaluation Metrics
3.1 Decision Accuracy (DA)
Decision Accuracy measures the weighted proportion of correct decisions across the evaluation scenario set.
DA = (\sum_{i=1}^{N} w_i \cdot \mathbb{1}[a_i \in A_i^{\text{correct}}]) / (\sum_{i=1}^{N} w_i)

Where w_i is the importance weight of scenario i, a_i is the agent's action, and A_i^{correct} is the set of acceptable actions as defined by the oracle. The indicator function returns 1 when the agent's action falls within the acceptable set.
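As a sanity check, DA reduces to a few lines of TypeScript; the `ScoredScenario` shape is illustrative, not a harness type:

```typescript
interface ScoredScenario {
  weight: number;   // w_i
  correct: boolean; // whether a_i fell in the acceptable set A_i^correct
}

// Weighted decision accuracy: total weight of correct decisions
// divided by total weight of all decisions.
function decisionAccuracy(results: ScoredScenario[]): number {
  const total = results.reduce((s, r) => s + r.weight, 0);
  const hit = results.reduce((s, r) => s + (r.correct ? r.weight : 0), 0);
  return total > 0 ? hit / total : 0;
}
```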
3.2 Gate Compliance Rate (GCR)
Gate Compliance Rate measures the proportion of governance-sensitive decision points where the agent correctly invokes the required responsibility gate.
GCR = |\{ d \in D_{\text{gated}} : \text{gate}(d) = \text{gate}_{\text{required}}(d) \}| / |D_{\text{gated}}|

Where D_gated is the set of all decision points requiring gate invocation, and the numerator counts decisions where the agent invoked the correct gate type (human-in-the-loop, approval workflow, evidence collection, etc.).
3.3 Evidence Quality Score (EQS)
Evidence Quality Score evaluates the completeness, traceability, and correctness of evidence bundles produced by the agent. Each evidence bundle is scored against the schema specification across three sub-dimensions:
EQS = \alpha \cdot \text{completeness}(E) + \beta \cdot \text{traceability}(E) + \gamma \cdot \text{correctness}(E)

Where alpha + beta + gamma = 1. Completeness measures whether all required fields are present. Traceability measures whether the evidence chain links back to source data through MARIA coordinates. Correctness measures whether the evidence content accurately represents the underlying decision context.
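A sketch of the weighted combination follows; the paper does not fix alpha, beta, and gamma, so the 0.4/0.3/0.3 defaults below are an assumption for illustration only:

```typescript
// Sub-dimension scores are in [0, 1]; weights must sum to 1.
function evidenceQualityScore(
  completeness: number,
  traceability: number,
  correctness: number,
  // Illustrative default weights, not values specified by the harness.
  weights = { alpha: 0.4, beta: 0.3, gamma: 0.3 }
): number {
  const { alpha, beta, gamma } = weights;
  if (Math.abs(alpha + beta + gamma - 1) > 1e-9) {
    throw new Error("EQS weights must sum to 1");
  }
  return alpha * completeness + beta * traceability + gamma * correctness;
}
```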
3.4 Latency Under Load (LUL)
Latency Under Load measures the agent's p95 response time normalized against the latency budget for each performance scenario.
LUL = 1 - \text{min}(1, p_{95}(\text{latency}) / L_{\text{budget}})

A LUL score of 1.0 means the agent's p95 latency is zero (the theoretical maximum). A score of 0.0 means the p95 latency equals or exceeds the budget; the min term clamps the score at zero rather than letting it go negative.
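The clamping behavior is worth pinning down in code:

```typescript
// LUL = 1 - min(1, p95 / budget). Latency at or over the budget
// clamps the score to exactly 0; it never goes negative.
function latencyUnderLoad(p95Ms: number, budgetMs: number): number {
  return 1 - Math.min(1, p95Ms / budgetMs);
}
```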
4. Composite Agent Score
The four metrics are combined into a single Composite Agent Score (CAS) using a weighted geometric mean, which ensures that catastrophic failure in any single dimension cannot be masked by excellence in others.
CAS = DA^{w_1} \cdot GCR^{w_2} \cdot EQS^{w_3} \cdot LUL^{w_4}

Where w_1 + w_2 + w_3 + w_4 = 1. The default weights in MARIA OS are w_1 = 0.35 (correctness), w_2 = 0.30 (governance), w_3 = 0.20 (evidence), w_4 = 0.15 (performance). The geometric mean has the critical property that if any factor is zero, the composite score is zero — an agent that completely fails governance compliance receives a CAS of 0 regardless of decision accuracy.
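The zero-propagation property is easy to verify directly; the weights below are the MARIA OS defaults quoted above:

```typescript
// Weighted geometric mean: a zero in any dimension zeroes the composite,
// so no amount of excellence elsewhere can mask a total failure.
function compositeAgentScore(
  da: number, gcr: number, eqs: number, lul: number,
  w = { w1: 0.35, w2: 0.30, w3: 0.20, w4: 0.15 } // MARIA OS defaults
): number {
  return (
    Math.pow(da, w.w1) * Math.pow(gcr, w.w2) *
    Math.pow(eqs, w.w3) * Math.pow(lul, w.w4)
  );
}
```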
The CAS weights are not engineering parameters — they are organizational policy. An enterprise that prioritizes governance compliance over raw accuracy should increase w_2 relative to w_1. MARIA OS requires that weight changes go through the decision pipeline with evidence justification, ensuring that the evaluation criteria themselves are governed.

Theorem (Monotonic Responsiveness). The Composite Agent Score is monotonically non-decreasing with respect to genuine improvements in any individual metric, holding all other metrics constant.
Proof. Let CAS = DA^{w_1} \cdot GCR^{w_2} \cdot EQS^{w_3} \cdot LUL^{w_4}. Consider an improvement in DA from DA_0 to DA_1 where DA_1 > DA_0 > 0, with all other metrics held constant. Then CAS_1 / CAS_0 = (DA_1 / DA_0)^{w_1} > 1 since DA_1 / DA_0 > 1 and w_1 > 0. Therefore CAS_1 > CAS_0. (When DA_0 = 0, CAS_0 = 0 and CAS_1 \geq 0, so the score is still non-decreasing.) The argument is symmetric for the remaining metrics. QED.
5. Harness Architecture
The evaluation harness consists of four components arranged in a pipeline architecture.
5.1 Test Runner
The Test Runner orchestrates scenario execution with deterministic seeding to ensure reproducibility. Each evaluation run is identified by a (suite_id, seed, timestamp) triple that uniquely determines the entire execution trace. The runner supports parallel execution across multiple agent instances while maintaining deterministic ordering of scenario presentation.
```typescript
interface EvaluationRun {
  suiteId: string;
  seed: number;
  timestamp: Date;
  coordinate: MARIACoordinate; // scope of evaluation
  agent: AgentUnderTest;
  scenarios: EvaluationScenario[];
  config: RunConfig;
}

interface RunConfig {
  parallelism: number;
  timeoutMs: number;
  failFast: boolean; // stop on first S1 safety violation
  recordTraces: boolean;
}

async function executeRun(run: EvaluationRun): Promise<RunResult> {
  const prng = createPRNG(run.seed);
  const scenarios = shuffleWithSeed(run.scenarios, prng);
  const results: ScenarioResult[] = [];
  let offset = 0;
  for (const batch of chunk(scenarios, run.config.parallelism)) {
    const batchResults = await Promise.all(
      // Each scenario gets its own PRNG derived from the run seed and its
      // position in the shuffled order, so parallel execution cannot
      // interleave draws from a shared generator and break reproducibility.
      batch.map((s, i) => executeScenario(run.agent, s, createPRNG(run.seed + offset + i)))
    );
    offset += batch.length;
    results.push(...batchResults);
    if (run.config.failFast && batchResults.some((r) => r.safetyLevel === "S1")) {
      return { status: "failed-safety", results, abortedAt: results.length };
    }
  }
  return { status: "completed", results };
}
```

5.2 Scenario Generator
The Scenario Generator produces parameterized test cases from domain templates. Each template defines a decision context schema, a set of variable parameters, and the oracle-validated correct action for each parameter combination. Templates are organized by MARIA coordinate, enabling targeted test generation for specific domains.
The generator supports three modes: exhaustive (all parameter combinations), random (Monte Carlo sampling with specified seed), and adversarial (parameter selection biased toward known edge cases and historical failure modes).
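The exhaustive mode is the simplest to sketch: it enumerates the Cartesian product of the template's parameter values. The parameter names below are illustrative, not actual harness templates:

```typescript
// Exhaustive mode: every combination of every parameter value.
// Random and adversarial modes would sample from this same space.
function exhaustiveCombinations(
  params: Record<string, string[]>
): Record<string, string>[] {
  return Object.entries(params).reduce<Record<string, string>[]>(
    (acc, [key, values]) =>
      acc.flatMap((combo) => values.map((v) => ({ ...combo, [key]: v }))),
    [{}]
  );
}
```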
5.3 Oracle Comparator
The Oracle Comparator evaluates agent outputs against validated reference decisions. Unlike simple equality checking, the comparator supports partial correctness — recognizing that agent decisions may be partially correct (e.g., correct action but incomplete evidence, or correct escalation but to the wrong gate level).
The comparator produces a multi-dimensional correctness vector for each scenario, which feeds into the metric calculations described in Section 3.
5.4 Regression Detector
The Regression Detector identifies statistically significant quality degradation across evaluation runs. It maintains a rolling baseline of CAS scores and applies hypothesis testing to detect regressions.
H_0: \mu_{\text{current}} \geq \mu_{\text{baseline}} \quad \text{vs} \quad H_1: \mu_{\text{current}} < \mu_{\text{baseline}}

A one-sided t-test at significance level alpha = 0.01 is applied. If H_0 is rejected, the regression detector raises an alert that blocks deployment through the MARIA OS decision pipeline. The detector also tracks per-metric regressions, enabling targeted investigation when the composite score degrades due to a single dimension.
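A minimal sketch of the detector's decision rule, using a Welch-style statistic with a normal-approximation critical value; a production implementation would use the t distribution for small sample counts:

```typescript
// Reject H0 (no regression) when the current mean is significantly
// below the baseline mean. zCrit ~= 2.326 corresponds to alpha = 0.01
// under the normal approximation.
function detectRegression(
  baseline: number[], current: number[], zCrit = 2.326
): boolean {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const varOf = (xs: number[], m: number) =>
    xs.reduce((a, b) => a + (b - m) ** 2, 0) / (xs.length - 1);
  const mb = mean(baseline), mc = mean(current);
  const se = Math.sqrt(
    varOf(baseline, mb) / baseline.length + varOf(current, mc) / current.length
  );
  return se > 0 && (mb - mc) / se > zCrit;
}
```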
6. MARIA Coordinate-Based Test Scoping
One of the most powerful features of the evaluation harness is its integration with the MARIA coordinate system. Tests can be scoped at any level of the organizational hierarchy:
```typescript
// Evaluate a single agent
const agentScope: TestScope = {
  coordinate: "G1.U2.P3.Z1.A5",
  level: "agent",
  inheritParentScenarios: true,
};

// Evaluate all agents in a zone
const zoneScope: TestScope = {
  coordinate: "G1.U2.P3.Z1.*",
  level: "zone",
  includeInterAgentScenarios: true,
};

// Evaluate an entire universe
const universeScope: TestScope = {
  coordinate: "G1.U2.*.*.*",
  level: "universe",
  includeCrossZoneScenarios: true,
  includeCrossPlanetScenarios: true,
};
```

Hierarchical scoping means that test suites compose naturally. A zone-level evaluation includes all agent-level scenarios for agents within that zone, plus inter-agent coordination scenarios. A universe-level evaluation includes all zone-level evaluations plus cross-zone interaction scenarios. This composability ensures that quality is verified at every level of the organizational hierarchy without requiring separate test suite maintenance.
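Scope membership can be checked with a simple segment-wise wildcard match; this sketch assumes dot-separated coordinate segments and a literal `*` wildcard, consistent with the examples above:

```typescript
// Does a concrete coordinate fall inside a scope pattern?
// e.g. "G1.U2.P3.Z1.*" matches "G1.U2.P3.Z1.A5".
function inScope(pattern: string, coordinate: string): boolean {
  const p = pattern.split(".");
  const c = coordinate.split(".");
  if (p.length !== c.length) return false;
  return p.every((seg, i) => seg === "*" || seg === c[i]);
}
```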
7. Formal Framework for Agent Scoring
We formalize the evaluation framework as a tuple (S, A, O, M, C) where:
- S = scenario space, the set of all possible evaluation contexts
- A = action space, the set of all possible agent responses
- O = oracle function, O: S -> P(A), mapping scenarios to sets of acceptable actions
- M = metric functions, M = {DA, GCR, EQS, LUL}
- C = composition function, C: R^4 -> R, mapping metric scores to the composite score
An agent pi is a stochastic policy pi: S -> Delta(A) mapping scenarios to probability distributions over actions. The expected quality of agent pi under metric m is:
Q_m(\pi) = \mathbb{E}_{s \sim \mathcal{S}, a \sim \pi(s)}[m(a, O(s))]

The composite quality is then:
Q(\pi) = C(Q_{DA}(\pi), Q_{GCR}(\pi), Q_{EQS}(\pi), Q_{LUL}(\pi))

Theorem (Separability). The composite quality Q(pi) can be decomposed into independent per-metric evaluations if and only if the metric functions are conditionally independent given the scenario context.
This separability property is critical for practical evaluation: it means we can run correctness, safety, performance, and governance tests independently and compose the results, rather than requiring every test scenario to evaluate all four dimensions simultaneously.
8. Continuous Evaluation Pipeline
The evaluation harness is not a one-time tool — it operates as a continuous pipeline integrated into the MARIA OS deployment workflow.
```typescript
interface ContinuousEvaluationPipeline {
  // Trigger conditions
  triggers: {
    onAgentUpdate: boolean;  // re-evaluate when agent code changes
    onOracleUpdate: boolean; // re-evaluate when oracle is updated
    scheduled: CronExpression; // periodic re-evaluation (e.g., daily)
    onIncident: boolean;     // re-evaluate after production incidents
  };
  // Pipeline stages
  stages: [
    "scenario-generation",  // generate or refresh test scenarios
    "agent-execution",      // run agent against scenarios
    "oracle-comparison",    // compare outputs to oracle
    "metric-calculation",   // compute DA, GCR, EQS, LUL
    "regression-detection", // statistical comparison to baseline
    "report-generation",    // produce human-readable report
    "gate-decision",        // pass/fail/escalate decision
  ];
  // Output
  result: "deploy" | "block" | "escalate-to-human";
}
```

The pipeline runs automatically on four triggers: agent code updates, oracle specification updates, scheduled intervals (default: daily), and production incident reports. Each run produces an evaluation report with the composite score, per-metric breakdown, regression analysis, and a deployment recommendation.
The gate-decision stage is particularly important: it translates evaluation results into governance actions through the MARIA OS decision pipeline. An agent that passes all thresholds receives automatic deployment approval. An agent that fails any S1 safety scenario is automatically blocked. Intermediate cases — such as a slight GCR regression that may be within acceptable variance — are escalated to human reviewers through the responsibility gate system.
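A sketch of the gate-decision mapping; the 0.8 CAS threshold is an illustrative assumption, since actual thresholds are organizational policy set through the decision pipeline:

```typescript
type GateDecision = "deploy" | "block" | "escalate-to-human";

// Translate evaluation results into a governance action.
// The CAS threshold here is a placeholder, not a specified value.
function gateDecision(input: {
  hasS1Violation: boolean;
  regressionDetected: boolean;
  cas: number;
}): GateDecision {
  if (input.hasS1Violation) return "block";              // zero tolerance
  if (input.regressionDetected) return "escalate-to-human"; // human review
  return input.cas >= 0.8 ? "deploy" : "escalate-to-human";
}
```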
9. Comparison with Traditional Software Testing
The following table summarizes the key differences between traditional software testing and the MARIA OS Evaluation Harness approach:
| Aspect | Traditional Testing | MARIA OS Evaluation Harness |
| --- | --- | --- |
| Correctness criterion | Exact output match | Oracle-validated acceptable range |
| Test determinism | Fully deterministic | Deterministic via seeded PRNG |
| Safety verification | Negative testing (invalid inputs) | Behavioral trajectory analysis |
| Performance testing | Load testing (throughput/latency) | Quality-under-constraint evaluation |
| Compliance testing | N/A or manual audit | Automated governance gate verification |
| Test scoping | Package/module hierarchy | MARIA coordinate hierarchy |
| Regression detection | Diff-based (pass/fail) | Statistical hypothesis testing |
| Evaluation output | Pass/fail | Composite score with dimensional breakdown |
| Continuous integration | On code change | On code change + oracle change + schedule + incident |
| Governance integration | None | Full decision pipeline integration |

The most fundamental difference is philosophical. Traditional testing asks: "does the code do what the specification says?" The evaluation harness asks: "does the agent make decisions that the organization would endorse, through processes the organization has approved, with evidence the organization can audit?"
10. Scenario Design Principles
Effective evaluation requires well-designed scenarios. We propose five principles for scenario design in agent evaluation:
Principle 1: Coverage over Volume. A test suite with 500 scenarios covering all four test categories and all MARIA coordinate levels is more valuable than a suite with 50,000 scenarios that only test correctness at the agent level. The scenario generator should prioritize dimensional coverage before volumetric expansion.
Principle 2: Adversarial Scenarios Are Mandatory. Every test suite must include adversarial scenarios that probe for known failure modes: ambiguous contexts, conflicting constraints, resource pressure, and cascading error conditions. At least 20% of scenarios should be adversarial.
Principle 3: Oracle Maintenance Is Governance. The oracle represents organizational judgment. Changes to oracle-validated correct actions are changes to organizational policy and must go through the decision pipeline with evidence justification.
Principle 4: Scenario Provenance. Every scenario must be traceable to its source — whether generated from a template, derived from a production incident, or designed by a domain expert. Provenance enables root cause analysis when agents fail specific scenarios.
Principle 5: Temporal Validity. Scenarios have an expiration date. Organizational policies change, domain knowledge evolves, and acceptable decision ranges shift. The harness tracks scenario validity windows and flags stale scenarios for review.
11. Implementation Considerations
The evaluation harness runs in an isolated environment separate from production. Agent instances under test receive no production traffic and have no access to production data. The harness environment mirrors the production MARIA OS configuration but uses synthetic data generated by the scenario generator. This isolation ensures that evaluation cannot impact production operations and that evaluation results are not contaminated by production state.

Key implementation details include:
Deterministic seeding. Every source of randomness in the evaluation pipeline is controlled by the seeded PRNG (mulberry32, consistent with the MARIA OS Civilization Engine). This means that evaluation runs are perfectly reproducible: given the same (suite_id, seed, timestamp) triple, the harness produces identical results.
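For concreteness, the widely circulated mulberry32 reference implementation is only a few lines; wrapping it as `createPRNG` (the name used in the code listings above) yields the reproducible streams the harness relies on:

```typescript
// mulberry32: a small 32-bit PRNG. Identical seeds produce
// identical streams of floats in [0, 1).
function createPRNG(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}
```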
Parallel execution with ordering guarantees. Scenarios within a batch execute in parallel for performance, but batches execute sequentially to enable fail-fast behavior. The deterministic shuffle ensures that scenario ordering is consistent across runs while avoiding position-dependent biases.
Evidence collection. The harness records complete execution traces for every scenario, including agent internal states (when available), decision rationale logs, gate invocations, and evidence bundles. These traces serve dual purposes: metric calculation and post-evaluation debugging.
Baseline management. The regression detector maintains a rolling 30-day baseline of CAS scores. The baseline updates automatically as the agent improves, but the update is asymmetric: improvements raise the baseline, but regressions do not lower it. This ratchet mechanism ensures that quality can only improve over time.
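The asymmetric update reduces to a one-line ratchet:

```typescript
// Improvements raise the baseline; regressions never lower it.
function updateBaseline(baseline: number, currentCAS: number): number {
  return Math.max(baseline, currentCAS);
}
```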
12. Conclusion and Future Directions
The MARIA OS Evaluation Harness transforms agent quality from an opaque, subjective assessment into a transparent, reproducible engineering measurement. By defining four test categories, four primary metrics, a formal composite scoring framework, and a continuous evaluation pipeline integrated with the MARIA OS governance infrastructure, the harness provides the foundation for systematic agent quality management.
The key contributions of this paper are:
1. A formal test taxonomy for agent quality evaluation covering correctness, safety, performance, and governance compliance
2. Four concrete metrics (DA, GCR, EQS, LUL) with mathematical definitions and measurement protocols
3. A composite scoring framework with proven monotonic responsiveness and separability properties
4. An architecture for continuous evaluation with statistical regression detection and governance integration
5. MARIA coordinate-based test scoping that enables hierarchical quality verification from individual agents to entire universes
Future work includes: extending the harness to evaluate multi-agent coordination quality (not just individual agent quality), developing transfer learning approaches for oracle construction (reducing the domain expert effort required to create new evaluation suites), and integrating the harness with the Civilization Engine to evaluate agent behavior in competitive multi-nation simulation environments.
The deepest insight of this work is that evaluation infrastructure is not separate from governance infrastructure — it is governance infrastructure. The metrics an organization chooses to measure, the thresholds it sets for acceptance, and the processes it uses to update evaluation criteria are all expressions of organizational values. The MARIA OS Evaluation Harness makes these choices explicit, auditable, and governed by the same decision pipeline that governs the agents themselves.