TAG ARCHIVE

testing

1 MARIA OS blog article tagged testing, organized as a Bonginkan topic archive for search engines and LLM retrieval.

1 article | Published by Bonginkan

Judgment OS / Decision Intelligence OS

Core MARIA OS research on turning organizational judgment into executable decision systems.

Agentic Company Architecture

Research on human-agent organizations, delegation boundaries, role topology, and governed autonomy.

Responsibility Gates and AI Governance

Safety, accountability, fail-closed gates, auditability, and human-in-the-loop control for AI agents.

Multi-Agent Mathematics

Formal models for convergence, stability, game theory, graph dynamics, and multi-agent evaluation.

Evidence, RAG, and Knowledge Governance

Evidence bundles, retrieval architecture, Graph RAG, knowledge trust, and auditable reasoning pipelines.

Agentic R&D and Judgment Science

Research operations, simulation labs, judgment science, recursive improvement, and experimental AI governance.

Engineering | March 8, 2026 | 30 min read

MARIA OS Evaluation Harness: A Standard Testing Infrastructure for Measuring Agent Quality

Formal test categories, composite scoring, and continuous evaluation pipelines that transform agent quality from subjective assessment into reproducible engineering measurement

Agent quality cannot be managed if it cannot be measured. Traditional software testing verifies deterministic input-output mappings, but AI agents operate in stochastic, multi-step decision spaces where correctness is contextual, safety is probabilistic, and governance compliance is structural. This paper introduces the MARIA OS Evaluation Harness — a standardized testing infrastructure that defines four test categories (correctness, safety, performance, governance compliance), four primary metrics (decision accuracy, gate compliance rate, evidence quality score, latency under load), and a formal composite scoring framework. We present the harness architecture comprising a test runner, scenario generator, oracle comparator, and regression detector, all scoped through MARIA coordinates for hierarchical test targeting. We prove that the composite agent score is monotonically responsive to genuine quality improvements and demonstrate that continuous evaluation pipelines catch 94.7% of quality regressions before production deployment.

evaluation-harness, agent-quality, testing, benchmarks, agentic-company
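
The summary above names four primary metrics and a composite score but does not give the formula on this page. As a rough illustration only, the sketch below assumes the composite is a weighted sum of the four metrics, each normalized to [0, 1]. The weights, the latency budget, the tolerance, and every name in the code (`Metrics`, `composite_score`, `is_regression`) are hypothetical and are not taken from the article. With positive weights, a sum of this form never decreases when any single metric improves, which is one simple way a score can respond monotonically to quality improvements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    decision_accuracy: float       # fraction of correct decisions, in [0, 1]
    gate_compliance_rate: float    # fraction of gate checks passed, in [0, 1]
    evidence_quality_score: float  # assumed to be normalized to [0, 1]
    latency_ms: float              # latency under load; lower is better


# Hypothetical weights (sum to 1) and latency budget, chosen for illustration.
WEIGHTS = {
    "decision_accuracy": 0.4,
    "gate_compliance_rate": 0.3,
    "evidence_quality_score": 0.2,
    "latency": 0.1,
}
LATENCY_BUDGET_MS = 2000.0


def latency_score(latency_ms: float, budget_ms: float = LATENCY_BUDGET_MS) -> float:
    """Map latency to [0, 1]: 1.0 at zero latency, 0.0 at or beyond the budget."""
    return max(0.0, 1.0 - latency_ms / budget_ms)


def composite_score(m: Metrics) -> float:
    """Weighted sum of the four normalized metrics (assumed form)."""
    parts = {
        "decision_accuracy": m.decision_accuracy,
        "gate_compliance_rate": m.gate_compliance_rate,
        "evidence_quality_score": m.evidence_quality_score,
        "latency": latency_score(m.latency_ms),
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)


def is_regression(baseline: Metrics, candidate: Metrics, tolerance: float = 0.01) -> bool:
    """Flag a candidate whose composite score falls below baseline by more than tolerance."""
    return composite_score(candidate) < composite_score(baseline) - tolerance


baseline = Metrics(0.90, 0.95, 0.80, 500.0)
improved = Metrics(0.95, 0.95, 0.80, 500.0)   # only decision accuracy improves
regressed = Metrics(0.90, 0.80, 0.80, 500.0)  # gate compliance drops
```

In a continuous evaluation pipeline, a check like `is_regression(baseline, regressed)` could run before deployment and block a release when it returns true. The actual regression detector described in the article may use a different rule.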