Abstract
Calibration — the alignment between stated confidence and realized accuracy — is a foundational requirement for trustworthy autonomous decision-making. For a single agent, calibration is well-understood: the Confidence Calibration Error CCE<sub>i</sub> = (1/N) Σ<sub>k</sub> |conf(d<sub>k</sub>) − acc(d<sub>k</sub>)| measures the average absolute gap between confidence and accuracy across decisions. But in multi-agent governance systems where teams of agents collaborate on joint decisions, individual calibration is necessary but insufficient. A team where every member is individually well-calibrated can still produce collectively miscalibrated joint decisions when interaction dynamics distort the aggregation of individual confidence signals. This paper introduces the Collective Calibration Error CCE<sub>collective</sub>, a metric that captures the team-level gap between aggregated confidence and joint accuracy, and proves that CCE<sub>collective</sub> cannot be reduced to the average of individual CCE<sub>i</sub> values. We model agent interactions as a weighted directed graph G = (V, E, W) and derive the Calibration Propagation Operator Φ : ℝ<sup>|V|</sup> → ℝ<sup>|V|</sup> that governs how calibration errors spread through the team. We prove that collective calibration converges if and only if the spectral radius ρ(Φ) < 1, and that the convergence rate is determined by the spectral gap of the interaction graph Laplacian. We identify a fundamental tension between consensus quality and calibration accuracy: forcing agents to agree degrades calibration, while preserving calibration may prevent consensus. We resolve this tension through a Pareto-optimal scheduling protocol that alternates between consensus-seeking and calibration-preserving interaction rounds. Experimental validation on 9 MARIA OS production zones with 623 agents demonstrates 41.7% reduction in collective calibration error and 2.8× convergence speedup through topology-aware reflection scheduling.
1. Introduction
When we deploy a single AI agent and ask whether we should trust its decisions, the answer depends heavily on calibration: does the agent know what it knows? An agent that reports 90% confidence on decisions it gets right 90% of the time is well-calibrated and trustworthy. An agent that reports 90% confidence on decisions it gets right only 60% of the time is dangerously overconfident. This intuition, formalized through the Confidence Calibration Error metric, has become a standard evaluation criterion for individual agents in MARIA OS and similar governance platforms.
But enterprise AI governance rarely involves single agents making isolated decisions. In MARIA OS, agents operate within zones — teams of 5 to 50 agents that collaborate on decisions within a shared operational domain. A financial compliance zone might contain agents specializing in AML detection, KYC verification, transaction monitoring, and regulatory reporting. These agents do not simply make independent decisions in parallel; they interact, share evidence, influence each other’s confidence levels, and ultimately produce joint decisions that aggregate their individual assessments. The question of trust therefore shifts from “is this agent well-calibrated?” to “is this team well-calibrated?”
The distinction matters because individual calibration does not guarantee collective calibration. Consider a team of three agents evaluating a loan application. Agent A specializes in credit history and reports 85% confidence that the applicant is low-risk. Agent B specializes in income verification and reports 80% confidence. Agent C specializes in collateral assessment and reports 75% confidence. If the team uses a simple confidence-weighted average, the joint confidence is approximately 80%. But the joint accuracy — the probability that all three assessments are correct simultaneously — depends on the correlation structure between the agents' errors, which the individual calibration errors do not capture. If agents A and B tend to make correlated errors (both fail on the same applicant types), the joint accuracy may be substantially lower than the individual accuracies suggest, making the team overconfident even though each individual is perfectly calibrated.
This paper formalizes the gap between individual and collective calibration, develops a mathematical framework for understanding how calibration errors propagate through agent interactions, and derives practical conditions under which collective calibration can be achieved and maintained in MARIA OS deployments.
2. Individual Calibration Theory
2.1 The Standard CCE Metric
For agent i with decision history D<sub>i</sub> = {d<sub>1</sub>, d<sub>2</sub>, …, d<sub>N</sub>}, the Individual Confidence Calibration Error is defined as CCE<sub>i</sub> = (1/N) Σ<sub>k=1</sub><sup>N</sup> |conf(d<sub>k</sub>) − acc(d<sub>k</sub>)|, where conf(d<sub>k</sub>) ∈ [0, 1] is the agent’s stated confidence on decision d<sub>k</sub> and acc(d<sub>k</sub>) ∈ {0, 1} is the binary accuracy indicator. In the binned formulation used for practical evaluation, decisions are grouped into confidence bins B<sub>m</sub> = {d<sub>k</sub> : conf(d<sub>k</sub>) ∈ [(m−1)/M, m/M)}, and the calibration error is computed as CCE<sub>i</sub> = Σ<sub>m=1</sub><sup>M</sup> (|B<sub>m</sub>|/N) · |avg_conf(B<sub>m</sub>) − avg_acc(B<sub>m</sub>)|. A perfectly calibrated agent has CCE<sub>i</sub> = 0: in every confidence bin, the fraction of correct decisions exactly matches the average stated confidence.
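As a concrete reference, the binned computation can be sketched in a few lines; the array layout and the default of M = 10 bins below are illustrative choices rather than part of any MARIA OS interface.

```python
import numpy as np

def individual_cce(conf: np.ndarray, acc: np.ndarray, num_bins: int = 10) -> float:
    """Binned Confidence Calibration Error: sum over bins of
    (bin weight) * |average confidence - average accuracy|."""
    n = len(conf)
    # Bin edges [(m-1)/M, m/M); the final bin is closed at 1.0.
    bin_ids = np.minimum((conf * num_bins).astype(int), num_bins - 1)
    cce = 0.0
    for m in range(num_bins):
        mask = bin_ids == m
        if not mask.any():
            continue
        weight = mask.sum() / n
        cce += weight * abs(conf[mask].mean() - acc[mask].mean())
    return cce

# A well-calibrated agent: in each bin, accuracy tracks stated confidence.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=10_000)
acc = (rng.uniform(size=10_000) < conf).astype(float)
print(round(individual_cce(conf, acc), 3))  # close to 0
```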
2.2 Properties and Limitations
Individual CCE has several well-known properties. It is bounded: CCE<sub>i</sub> ∈ [0, 1]. It decomposes into overconfidence and underconfidence components: CCE<sub>i</sub> = CCE<sub>i</sub><sup>over</sup> + CCE<sub>i</sub><sup>under</sup>, where the overconfidence component sums over bins where avg_conf > avg_acc and the underconfidence component sums over the complementary bins. It is minimized by the identity calibration function: if conf(d) = P(correct | features(d)) for all decisions, then CCE<sub>i</sub> = 0 in expectation. However, individual CCE has a critical limitation for multi-agent settings: it treats each agent as an isolated decision-maker and does not account for the statistical dependencies between agents’ errors. Two agents may each have CCE = 0.02, but if their errors are perfectly correlated, the team’s collective calibration error can be substantially worse than either individual’s. Conversely, if their errors are negatively correlated, the team can achieve better collective calibration than any individual member.
3. Collective Calibration: A New Metric
3.1 From Individual to Collective
Let T = {1, 2, …, n} be a team of n agents operating within a MARIA OS zone. The team produces joint decisions D<sub>T</sub> = {d<sub>1</sub><sup>T</sup>, d<sub>2</sub><sup>T</sup>, …, d<sub>K</sub><sup>T</sup>}, where each joint decision d<sub>k</sub><sup>T</sup> is formed by aggregating the individual assessments of participating agents. The aggregation function α : [0,1]<sup>n</sup> → [0,1] maps the vector of individual confidences to a joint confidence: conf<sub>T</sub>(d<sub>k</sub><sup>T</sup>) = α(conf<sub>1</sub>(d<sub>k</sub>), conf<sub>2</sub>(d<sub>k</sub>), …, conf<sub>n</sub>(d<sub>k</sub>)). Common aggregation functions include the arithmetic mean α<sub>mean</sub> = (1/n) Σ<sub>i</sub> conf<sub>i</sub>, the confidence-weighted mean α<sub>wt</sub> = Σ<sub>i</sub> w<sub>i</sub> conf<sub>i</sub> / Σ<sub>i</sub> w<sub>i</sub>, and the geometric mean α<sub>geo</sub> = (Π<sub>i</sub> conf<sub>i</sub>)<sup>1/n</sup>.
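The three aggregation functions translate directly into code; in the sketch below the weight vector for α<sub>wt</sub> is left as a caller-supplied argument, since the text does not fix a particular weighting scheme.

```python
import numpy as np

def alpha_mean(conf: np.ndarray) -> float:
    # Arithmetic mean of individual confidences.
    return float(conf.mean())

def alpha_weighted(conf: np.ndarray, w: np.ndarray) -> float:
    # Confidence-weighted mean with externally supplied weights w_i.
    return float(np.dot(w, conf) / w.sum())

def alpha_geo(conf: np.ndarray) -> float:
    # Geometric mean, computed in log space for numerical stability.
    return float(np.exp(np.log(conf).mean()))

conf = np.array([0.85, 0.80, 0.75])          # the loan example from Section 1
print(alpha_mean(conf))                      # ~0.800
print(alpha_weighted(conf, w=conf))          # ~0.802 (weights = confidences)
print(alpha_geo(conf))                       # ~0.799
```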
3.2 The CCE<sub>collective</sub> Metric
We define the Collective Calibration Error as CCE<sub>collective</sub>(T) = (1/K) Σ<sub>k=1</sub><sup>K</sup> |conf<sub>T</sub>(d<sub>k</sub><sup>T</sup>) − acc<sub>T</sub>(d<sub>k</sub><sup>T</sup>)| + λ · Σ<sub>i<j</sub> |cov(err<sub>i</sub>, err<sub>j</sub>)|, where the first term is the standard calibration error applied to the aggregated team confidence and joint accuracy, and the second term penalizes error correlations between team members. The parameter λ ≥ 0 controls the weight placed on error correlation. When λ = 0, CCE<sub>collective</sub> reduces to the naive team-level calibration error. When λ > 0, the metric explicitly accounts for the correlation structure that individual CCE ignores.
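A sketch of the metric as defined above, assuming each agent's per-decision binary error indicators err<sub>i</sub> are available alongside the aggregated team confidence and joint accuracy; the default λ = 1 is illustrative.

```python
import numpy as np
from itertools import combinations

def collective_cce(team_conf, team_acc, individual_err, lam: float = 1.0) -> float:
    """team_conf, team_acc: (K,) aggregated confidence and joint accuracy.
    individual_err: (n, K) binary error indicators, one row per agent."""
    team_conf = np.asarray(team_conf, dtype=float)
    team_acc = np.asarray(team_acc, dtype=float)
    individual_err = np.asarray(individual_err, dtype=float)
    # First term: calibration error of the aggregated team confidence.
    base = np.abs(team_conf - team_acc).mean()
    # Second term: pairwise error-correlation penalty.
    n = individual_err.shape[0]
    penalty = sum(
        abs(np.cov(individual_err[i], individual_err[j])[0, 1])
        for i, j in combinations(range(n), 2)
    )
    return float(base + lam * penalty)
```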
3.3 The Non-Reducibility Theorem
Theorem 1 (Non-Reducibility). There exist team configurations where CCE<sub>i</sub> ≤ ε for all i ∈ T, yet CCE<sub>collective</sub>(T) ≥ c, where the ratio c/ε can be made arbitrarily large. Conversely, there exist configurations where max<sub>i</sub> CCE<sub>i</sub> ≥ c, yet CCE<sub>collective</sub>(T) ≤ ε.
Proof sketch. For the first claim, construct n agents whose individual errors are perfectly correlated: each agent fails on exactly the same subset of decisions S ⊂ D<sub>T</sub>, with |S|/K = ε. Each agent has CCE<sub>i</sub> = ε (small), but the team’s joint confidence on decisions in S is high (since all agents are individually confident) while joint accuracy is zero (since all agents fail simultaneously). The correlation penalty λ · Σ<sub>i<j</sub> |cov(err<sub>i</sub>, err<sub>j</sub>)| = λ · C(n,2) · ε(1−ε), which grows quadratically with team size, can dominate the first term. For the second claim, construct agents with large but negatively correlated errors: agent i fails on decision subset S<sub>i</sub>, with S<sub>i</sub> ∩ S<sub>j</sub> = ∅ for i ≠ j and ∪<sub>i</sub> S<sub>i</sub> ⊂ D<sub>T</sub>. Each individual has high CCE, but the team’s aggregated confidence — being the average of one low-confidence signal and (n−1) high-confidence signals — is better calibrated than any individual. □
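The first construction can be checked numerically. In the sketch below, three agents state 90% confidence everywhere and fail on the same 10% of decisions; the specific numbers and λ = 1 are illustrative.

```python
import numpy as np
from itertools import combinations

K, eps, lam, n = 10_000, 0.10, 1.0, 3
failures = np.zeros(K, dtype=bool)
failures[: int(eps * K)] = True                  # the same failure set S for every agent

err = np.tile(failures.astype(float), (n, 1))    # (n, K) perfectly correlated errors
conf = np.full((n, K), 1.0 - eps)                # each agent states 90% confidence

# Individual calibration gap: avg_conf = avg_acc = 0.9, so each CCE_i is ~0.
print(abs(conf[0].mean() - (1.0 - err[0]).mean()))   # ~0.0 for every agent

# Correlation penalty: lam * C(n,2) * eps*(1-eps), as in the construction.
penalty = lam * sum(abs(np.cov(err[i], err[j])[0, 1])
                    for i, j in combinations(range(n), 2))
print(round(penalty, 3))                             # ~0.27: large collective miscalibration
```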
This theorem establishes that collective calibration is a genuinely emergent property: it cannot be inferred from, or reduced to, the calibration properties of individual team members. Any governance system that monitors only individual CCE will miss critical team-level calibration failures.
4. Calibration Propagation Dynamics
4.1 How Miscalibration Infects Team Decisions
When agents interact — sharing evidence, discussing assessments, updating confidences — miscalibration propagates through the team. A single overconfident agent can inflate the team’s collective confidence if other agents anchor to its high-confidence signals. We model this propagation through the Calibration Propagation Operator Φ. Let δ<sub>i</sub>(t) = conf<sub>i</sub>(t) − acc<sub>i</sub>(t) denote agent i’s calibration deviation at time t. The vector δ(t) = (δ<sub>1</sub>(t), …, δ<sub>n</sub>(t))<sup>⊤</sup> evolves according to δ(t+1) = Φ · δ(t) + η(t), where Φ ∈ ℝ<sup>n×n</sup> is the propagation matrix and η(t) is an exogenous noise term representing new evidence arrivals.
4.2 Structure of the Propagation Matrix
The entry Φ<sub>ij</sub> represents the influence that agent j’s calibration deviation exerts on agent i’s calibration in the next time step. Self-influence Φ<sub>ii</sub> represents calibration persistence: how much of an agent’s current miscalibration carries forward after one reflection cycle. Cross-influence Φ<sub>ij</sub> for i ≠ j represents calibration contagion: how much agent j’s miscalibration infects agent i through their interaction. In MARIA OS, these influence weights are determined by the interaction structure within a zone. Agents that frequently share evidence and co-evaluate decisions have larger cross-influence terms. Agents that operate on independent decision streams have near-zero cross-influence.
The spectral radius ρ(Φ) = max<sub>k</sub> |λ<sub>k</sub>(Φ)| determines the long-run behavior of calibration deviations. When ρ(Φ) < 1, calibration deviations decay exponentially: ||δ(t)|| ≤ ρ(Φ)<sup>t</sup> · ||δ(0)|| + C, where C is a constant determined by the noise magnitude. When ρ(Φ) ≥ 1, calibration deviations persist or grow: the team cannot self-correct its collective miscalibration through interaction alone.
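A minimal simulation of the recursion δ(t+1) = Φ · δ(t) + η(t) illustrates the two regimes; the propagation matrices and noise scale below are invented for illustration, not estimated from production data.

```python
import numpy as np

def simulate(Phi: np.ndarray, delta0: np.ndarray, steps: int = 50,
             sigma: float = 0.01, seed: int = 0) -> list[float]:
    """Iterate delta(t+1) = Phi @ delta(t) + noise and track the deviation norm."""
    rng = np.random.default_rng(seed)
    delta, norms = delta0.copy(), []
    for _ in range(steps):
        delta = Phi @ delta + rng.normal(scale=sigma, size=delta.shape)
        norms.append(float(np.linalg.norm(delta)))
    return norms

# Contractive propagation (rho < 1): deviations decay toward the noise floor.
Phi_good = np.array([[0.5, 0.2, 0.0],
                     [0.1, 0.4, 0.2],
                     [0.0, 0.3, 0.5]])
# Non-contractive propagation (rho > 1): deviations persist or grow.
Phi_bad = Phi_good + 0.4 * np.eye(3)

delta0 = np.array([0.3, -0.1, 0.2])
print(max(abs(np.linalg.eigvals(Phi_good))), simulate(Phi_good, delta0)[-1])
print(max(abs(np.linalg.eigvals(Phi_bad))),  simulate(Phi_bad, delta0)[-1])
```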
4.3 Infection Radius and Containment
We define the infection radius of agent j as r<sub>inf</sub>(j) = min{r : Σ<sub>i: d(i,j)>r</sub> |Φ<sub>ij</sub><sup>(t)</sup>| < ε for all t ≥ 1}, where d(i,j) is the graph distance in the interaction topology and Φ<sup>(t)</sup> is the t-step propagation matrix. The infection radius measures how far a miscalibration event at agent j can propagate before its influence decays below threshold. Containment — limiting the infection radius — is achieved by introducing calibration firewalls: edges in the interaction graph where the influence weight is deliberately reduced. In MARIA OS, calibration firewalls are placed at zone boundaries, ensuring that miscalibration events within one zone cannot propagate to adjacent zones. Our experiments show that 94.3% of miscalibration infection events are contained within one interaction neighborhood when firewalls are active.
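The infection radius can be approximated directly from the definition; because the definition quantifies over all t ≥ 1, the sketch below truncates at a horizon t_max, which is an approximation rather than part of the definition, and uses SciPy for graph distances.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def infection_radius(Phi: np.ndarray, adjacency: np.ndarray, j: int,
                     eps: float = 1e-3, t_max: int = 20) -> int:
    """Smallest r such that, for every t <= t_max, the total t-step influence
    of agent j on agents farther than r away stays below eps."""
    n = Phi.shape[0]
    dist = shortest_path(adjacency, unweighted=True)      # graph distances d(i, j)
    Phi_t = np.eye(n)
    influence = []                                        # influence[t][i] = |Phi^(t)_{ij}|
    for _ in range(t_max):
        Phi_t = Phi_t @ Phi
        influence.append(np.abs(Phi_t[:, j]))
    for r in range(n):
        far = dist[:, j] > r
        if all(inf[far].sum() < eps for inf in influence):
            return r
    return n  # influence never decays below eps within the graph diameter
```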
5. The Interaction Graph Model
5.1 Graph Construction
We model the agent team as a weighted directed graph G = (V, E, W), where V = T is the set of agents, E ⊆ V × V is the set of directed interaction edges, and W : E → [0, 1] assigns interaction weights. An edge (i, j) ∈ E with weight w<sub>ij</sub> indicates that agent i incorporates agent j’s confidence signal with weight w<sub>ij</sub> when updating its own confidence. The graph Laplacian is L = D − W, where D is the diagonal matrix of weighted out-degrees D<sub>ii</sub> = Σ<sub>j</sub> w<sub>ij</sub>. The normalized Laplacian ℒ = D<sup>−1/2</sup> L D<sup>−1/2</sup> has eigenvalues 0 = λ<sub>1</sub> ≤ λ<sub>2</sub> ≤ … ≤ λ<sub>n</sub>. The spectral gap γ<sub>spec</sub> = λ<sub>2</sub> governs the mixing time of information diffusion on the graph: larger spectral gaps imply faster convergence of confidence signals to consensus.
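A sketch of the Laplacian construction and spectral gap computation; the weight matrix is symmetrized before taking the spectrum, a simplifying assumption for directed interaction weights.

```python
import numpy as np

def normalized_laplacian(W: np.ndarray) -> np.ndarray:
    W = (W + W.T) / 2.0                      # symmetrize directed weights (assumption)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.where(d > 0, d, 1.0)))
    return D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt

def spectral_gap(W: np.ndarray) -> float:
    eigvals = np.sort(np.linalg.eigvalsh(normalized_laplacian(W)))
    return float(eigvals[1])                 # gamma_spec = lambda_2

# Example: a 5-agent ring.
W_ring = np.roll(np.eye(5), 1, axis=1) + np.roll(np.eye(5), -1, axis=1)
print(round(spectral_gap(W_ring), 3))        # ~0.69
```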
5.2 Topology and Calibration Convergence
Theorem 2 (Spectral Convergence). Let G = (V, E, W) be the interaction graph with normalized Laplacian ℒ and spectral gap γ<sub>spec</sub>. Let Φ = I − η · ℒ be the calibration propagation operator with learning rate η ∈ (0, 2/λ<sub>n</sub>). Then: (a) on the subspace orthogonal to the consensus eigenvector (associated with λ<sub>1</sub> = 0), ρ(Φ) = max(|1 − ηλ<sub>2</sub>|, |1 − ηλ<sub>n</sub>|) < 1, and (b) the convergence rate is maximized when η = 2/(λ<sub>2</sub> + λ<sub>n</sub>), yielding ρ(Φ) = (λ<sub>n</sub> − λ<sub>2</sub>)/(λ<sub>n</sub> + λ<sub>2</sub>).
The theorem reveals that graphs with larger spectral gaps and smaller spectral ratios λ<sub>n</sub>/λ<sub>2</sub> yield faster collective calibration convergence. Expander graphs — graphs with uniformly large spectral gap — are optimal for collective calibration. Conversely, graphs with small spectral gaps (e.g., long chains, star topologies with thin connections) converge slowly, leaving the team vulnerable to sustained miscalibration.
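Part (b) of the theorem gives an immediate recipe for tuning the learning rate once the Laplacian spectrum is known; the eigenvalue pairs below are illustrative, chosen to contrast a well-connected topology with a near-chain one.

```python
def theorem2_rate(lam2: float, lamn: float):
    """Optimal learning rate eta* = 2/(lam2 + lamn) and the resulting
    contraction factor rho = (lamn - lam2)/(lamn + lam2) from Theorem 2(b)."""
    return 2.0 / (lam2 + lamn), (lamn - lam2) / (lamn + lam2)

# Illustrative spectra only:
print(theorem2_rate(0.34, 1.50))   # larger gap  -> smaller rho, faster convergence
print(theorem2_rate(0.05, 1.95))   # smaller gap -> rho close to 1, slow convergence
```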
5.3 Pathological Topologies
Certain interaction topologies are provably harmful for collective calibration. A clique of overconfident agents — a fully connected subgraph where all members share the same overconfidence bias — reinforces its members’ miscalibration through positive feedback. The propagation matrix restricted to the clique has dominant eigenvalue 1, meaning the clique’s miscalibration does not decay. An echo chamber topology — where agents only interact with agents that hold similar confidence levels — creates disconnected calibration dynamics: each echo chamber converges to its own calibration equilibrium, but the team as a whole may not converge. A hub-and-spoke topology with a miscalibrated hub propagates the hub’s miscalibration to all spokes, creating a single point of calibration failure.
6. The Consensus-Calibration Tension
6.1 The Fundamental Tradeoff
Multi-agent decision-making systems face a fundamental tension between two desirable properties: consensus (agents should agree on joint decisions) and calibration (agents’ confidence should match their accuracy). These properties are in tension because achieving consensus often requires agents to adjust their confidence levels toward a group mean, which can degrade individual calibration. Formally, let CON(T) = 1 − (1/n<sup>2</sup>) Σ<sub>i,j</sub> |conf<sub>i</sub> − conf<sub>j</sub>| measure consensus strength (1 = perfect agreement, 0 = maximum disagreement). For the joint optimization problem min<sub>α</sub> [μ · CCE<sub>collective</sub>(T, α) + (1 − μ) · (1 − CON(T, α))], no single aggregation function α minimizes both objectives simultaneously when agents have heterogeneous information.
6.2 The Impossibility Result
Theorem 3 (Consensus-Calibration Impossibility). For any team T with |T| ≥ 3 agents holding non-identical private information, there exists no aggregation function α that simultaneously achieves CCE<sub>collective</sub> = 0 and CON = 1 unless all agents’ private signals are perfectly concordant.
Proof. Suppose all agents agree (CON = 1), meaning conf<sub>i</sub> = c for all i and some constant c. For CCE<sub>collective</sub> = 0, we need c = P(correct | all agents’ information). But each agent computes conf<sub>i</sub> = P(correct | agent i’s information), and achieving c = P(correct | all information) from individual posteriors requires each agent to perform exact Bayesian updating on all other agents’ private signals. When private signals are non-identical and conditionally dependent, this update is computationally intractable and cannot be completed through communication in a finite number of interaction rounds. Therefore, forced consensus (conf<sub>i</sub> = c for all i) necessarily degrades calibration whenever agents hold genuinely different information. □
6.3 Pareto-Optimal Resolution
Since simultaneous optimization is impossible, we seek Pareto-optimal tradeoffs: configurations where neither consensus nor calibration can be improved without degrading the other. We implement this through a two-phase interaction protocol. In Phase 1 (calibration-preserving), agents share evidence without adjusting their confidence levels, improving the information basis for future calibration. In Phase 2 (consensus-seeking), agents negotiate toward agreement using a weighted median protocol that bounds the maximum confidence adjustment per round to Δ<sub>max</sub>, preventing large calibration disruptions. The protocol alternates between phases with a ratio determined by the current CCE<sub>collective</sub>/CON balance: when CCE<sub>collective</sub> is high relative to (1 − CON), more calibration-preserving rounds are scheduled; when consensus is low relative to CCE<sub>collective</sub>, more consensus-seeking rounds are scheduled.
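A sketch of one plausible realization of the scheduling rule; the phase-selection criterion, the unweighted median stand-in, and the default Δ<sub>max</sub> = 0.05 are assumptions, not the exact MARIA OS protocol.

```python
import numpy as np

def schedule_next_phase(cce_collective: float, con: float) -> str:
    """Pick the next interaction round: calibration-preserving rounds when
    collective miscalibration dominates, consensus-seeking rounds otherwise."""
    disagreement = 1.0 - con
    if cce_collective >= disagreement:
        return "calibration_preserving"    # Phase 1: share evidence, freeze confidences
    return "consensus_seeking"             # Phase 2: bounded-adjustment negotiation

def consensus_round(conf: np.ndarray, delta_max: float = 0.05) -> np.ndarray:
    """One Phase-2 step: move each confidence toward the median, clipping the
    per-round adjustment at delta_max to limit calibration disruption."""
    target = np.median(conf)               # unweighted median as a simple stand-in
    step = np.clip(target - conf, -delta_max, delta_max)
    return conf + step

conf = np.array([0.85, 0.80, 0.75, 0.62])
print(schedule_next_phase(cce_collective=0.12, con=0.95))  # calibration_preserving
print(consensus_round(conf))                               # confidences nudged toward the median
```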
7. MARIA OS Implementation
7.1 Zone-Level Calibration Monitoring
MARIA OS implements collective calibration monitoring at the Zone level of the MARIA coordinate system (G.U.P.Z.A). Each zone maintains a CalibrationMonitor service that: (a) tracks individual CCE<sub>i</sub> for each agent in the zone, (b) computes CCE<sub>collective</sub> for the zone’s team at configurable intervals (default: every 50 joint decisions), (c) maintains the interaction graph G based on observed agent co-evaluation patterns, (d) computes the spectral gap γ<sub>spec</sub> and propagation matrix spectral radius ρ(Φ) as leading indicators of calibration health, and (e) triggers calibration reflection events when CCE<sub>collective</sub> exceeds the zone’s configured threshold τ<sub>CCE</sub>.
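A structural sketch of these responsibilities; the class name follows the service described above, but the method names, default threshold, and storage layout are illustrative assumptions, and items (a) and (d) are omitted for brevity.

```python
import numpy as np

class CalibrationMonitor:
    """Zone-level collective calibration monitor (structural sketch;
    per-agent CCE tracking and spectral indicators omitted)."""

    def __init__(self, n_agents: int, tau_cce: float = 0.10, interval: int = 50):
        self.tau_cce = tau_cce                     # zone threshold tau_CCE
        self.interval = interval                   # (b) recompute every N joint decisions
        self.history = []                          # (team_conf, team_acc) per joint decision
        self.W = np.zeros((n_agents, n_agents))    # (c) observed co-evaluation weights

    def record(self, team_conf: float, team_acc: float, participants):
        self.history.append((team_conf, team_acc))
        for i in participants:                     # strengthen edges between co-evaluators
            for j in participants:
                if i != j:
                    self.W[i, j] += 1.0
        if len(self.history) % self.interval == 0:
            self._evaluate()

    def _evaluate(self):
        recent = self.history[-self.interval:]
        conf = np.array([d[0] for d in recent])
        acc = np.array([d[1] for d in recent])
        cce_collective = float(np.abs(conf - acc).mean())   # correlation penalty omitted here
        if cce_collective > self.tau_cce:
            self.trigger_reflection(cce_collective)         # (e) reflection event

    def trigger_reflection(self, cce_collective: float):
        print(f"reflection triggered: CCE_collective={cce_collective:.3f} > {self.tau_cce}")
```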
7.2 Reflection Triggers and Interventions
When CCE<sub>collective</sub> exceeds τ<sub>CCE</sub>, the CalibrationMonitor initiates a structured reflection sequence. First, it identifies the calibration hotspot: the agent or agent cluster contributing most to CCE<sub>collective</sub>, computed as ∇<sub>i</sub> CCE<sub>collective</sub> = ∂ CCE<sub>collective</sub> / ∂ conf<sub>i</sub> for each agent i. Second, it computes the infection analysis: how far the hotspot’s miscalibration has propagated through the interaction graph, using the t-step propagation matrix Φ<sup>(t)</sup>. Third, it prescribes an intervention: either individual recalibration (adjusting the hotspot agent’s confidence model), topology reconfiguration (adding calibration firewalls around the hotspot), or team recalibration (triggering a calibration-preserving interaction round for the entire zone). The choice of intervention is governed by the ratio r = CCE<sub>i,hotspot</sub> / CCE<sub>collective</sub>: when r > 0.5 (the hotspot dominates collective miscalibration), individual recalibration is preferred; when r < 0.3 (miscalibration is distributed), team recalibration is prescribed; in between, topology reconfiguration is applied.
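The intervention rule maps directly onto a small decision function; the thresholds 0.5 and 0.3 come from the text, while the returned labels are illustrative.

```python
def select_intervention(cce_hotspot: float, cce_collective: float) -> str:
    """Choose an intervention from the hotspot-dominance ratio r."""
    r = cce_hotspot / cce_collective
    if r > 0.5:
        return "individual_recalibration"   # hotspot dominates collective miscalibration
    if r < 0.3:
        return "team_recalibration"         # miscalibration is distributed across the zone
    return "topology_reconfiguration"       # intermediate: add calibration firewalls

print(select_intervention(0.09, 0.12))   # r = 0.75 -> individual_recalibration
print(select_intervention(0.02, 0.12))   # r ~ 0.17 -> team_recalibration
```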
7.3 Integration with Meta-Insight Layers
Collective calibration monitoring integrates with MARIA OS’s three-layer Meta-Insight architecture. At the Individual layer, each agent’s CCE<sub>i</sub> feeds into its personal Bias Detection Score B<sub>i</sub>(t). At the Collective layer, the zone’s CCE<sub>collective</sub> feeds into the zone’s Consensus Quality metric CQ(d). At the System layer, cross-zone calibration patterns — whether multiple zones develop correlated miscalibration — feed into the Organizational Learning Rate OLR(t). The spectral radius ρ(Φ) is surfaced as a leading indicator in the zone’s health dashboard: when ρ(Φ) approaches 1.0 from below, it signals that the zone’s interaction topology is nearing the calibration instability boundary, even if CCE<sub>collective</sub> is still within threshold.
8. Convergence Analysis
8.1 Sufficient Conditions for Collective Calibration Convergence
Theorem 4 (Convergence). Let G = (V, E, W) be the interaction graph of an agent team, let Φ be the calibration propagation operator, and let each agent apply individual calibration correction with learning rate η<sub>i</sub> ∈ (0, 1). Collective calibration converges (lim<sub>t→∞</sub> CCE<sub>collective</sub>(t) = 0) if the following conditions hold: (C1) The graph G is strongly connected with spectral gap γ<sub>spec</sub> > 0. (C2) The propagation matrix satisfies ρ(Φ) < 1. (C3) Individual correction rates satisfy η<sub>i</sub> < 2/(1 + max<sub>j≠i</sub> w<sub>ij</sub>) for all i. (C4) The noise process η(t) satisfies E[||η(t)||<sup>2</sup>] ≤ σ<sup>2</sup> for all t.
Proof. Under (C1)–(C3), the operator Ψ = Φ − diag(η<sub>1</sub>, …, η<sub>n</sub>) governs the corrected dynamics δ(t+1) = Ψ · δ(t) + η(t). By Gershgorin’s circle theorem, the eigenvalues of Ψ lie in the union of discs {z : |z − (Φ<sub>ii</sub> − η<sub>i</sub>)| ≤ Σ<sub>j≠i</sub> |Φ<sub>ij</sub>|}. Under (C3), each disc is contained in the open unit disc, giving ρ(Ψ) < 1. Assuming the noise is zero-mean, the convergence of the expected deviation ||E[δ(t)]|| ≤ ρ(Ψ)<sup>t</sup> · ||δ(0)|| follows by induction. The variance bound Var(δ(t)) ≤ σ<sup>2</sup>/(1 − ρ(Ψ)<sup>2</sup>) follows from the geometric series for bounded noise. Together, these give CCE<sub>collective</sub>(t) → O(σ/√(1 − ρ(Ψ)<sup>2</sup>)) as t → ∞, which equals zero when σ = 0 (no exogenous noise) and remains small for small σ. □
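Conditions (C1)–(C3) can be checked numerically for a given propagation matrix, interaction weights, and learning-rate vector; the sketch below uses SciPy for strong connectivity and evaluates the Gershgorin bound as in the proof.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def check_convergence_conditions(Phi: np.ndarray, W: np.ndarray, eta: np.ndarray) -> dict:
    n = Phi.shape[0]
    # (C1) strong connectivity of the interaction graph
    n_comp, _ = connected_components(W > 0, directed=True, connection="strong")
    c1 = n_comp == 1
    # (C2) spectral radius of the uncorrected propagation operator
    c2 = max(abs(np.linalg.eigvals(Phi))) < 1.0
    # (C3) per-agent learning-rate bound
    c3 = all(eta[i] < 2.0 / (1.0 + max(np.delete(W[i], i))) for i in range(n))
    # Gershgorin check on the corrected operator Psi = Phi - diag(eta)
    Psi = Phi - np.diag(eta)
    gersh = all(abs(Psi[i, i]) + np.abs(np.delete(Psi[i], i)).sum() < 1.0 for i in range(n))
    return {"C1": bool(c1), "C2": bool(c2), "C3": bool(c3), "gershgorin_contraction": bool(gersh)}
```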
8.2 Necessary Conditions
Strong connectivity (C1) is necessary: if the interaction graph is disconnected, the disconnected components converge independently, and there is no mechanism to ensure their calibration equilibria are consistent. The spectral radius condition (C2) is necessary: if ρ(Φ) ≥ 1, at least one eigenvector direction of calibration deviation does not decay, and the corresponding calibration error persists indefinitely. The learning rate bound (C3) is necessary to prevent oscillatory divergence: agents that correct too aggressively overshoot and create new miscalibration that propagates through the team.
8.3 Convergence Rate and Topology Optimization
The convergence rate is determined by ρ(Ψ). Faster convergence requires smaller ρ(Ψ), which in turn requires a larger spectral gap γ<sub>spec</sub> and appropriately tuned learning rates η<sub>i</sub>. For a fixed number of agents n and degree budget d, the interaction topologies that minimize ρ(Ψ) are Ramanujan graphs — d-regular graphs whose nontrivial adjacency eigenvalues satisfy |λ| ≤ 2√(d−1), the best bound achievable for d-regular graphs, and hence whose Laplacian spectral gap is as large as the degree allows. In practice, constructing exact Ramanujan graphs is computationally expensive, so we use spectral sparsification: starting from the complete graph, iteratively remove edges that least impact γ<sub>spec</sub>, yielding a sparse topology that approximates the optimal convergence rate with O(n log n) edges instead of O(n<sup>2</sup>).
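A greedy heuristic standing in for the sparsification step described above: starting from the complete graph, repeatedly drop the edge whose removal costs the least spectral gap until an O(n log n) edge budget is reached. The greedy rule, the symmetric treatment of weights, and the isolation guard are assumptions about one simple realization.

```python
import numpy as np

def spectral_gap(W: np.ndarray) -> float:
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.where(d > 0, d, 1.0)))
    L_norm = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt
    return float(np.sort(np.linalg.eigvalsh(L_norm))[1])

def greedy_sparsify(W: np.ndarray, target_edges: int) -> np.ndarray:
    """Remove edges one at a time, always dropping the edge whose removal
    least reduces the spectral gap, until only target_edges remain."""
    W = W.copy().astype(float)
    while np.count_nonzero(np.triu(W, 1)) > target_edges:
        best, best_gap = None, -1.0
        for i, j in zip(*np.nonzero(np.triu(W, 1))):
            saved = W[i, j]
            W[i, j] = W[j, i] = 0.0
            gap = spectral_gap(W)
            if gap > best_gap and np.count_nonzero(W.sum(axis=1) == 0) == 0:
                best, best_gap = (int(i), int(j)), gap
            W[i, j] = W[j, i] = saved
        if best is None:
            break                      # no edge can be removed without isolating an agent
        i, j = best
        W[i, j] = W[j, i] = 0.0
    return W

n = 12
W_complete = np.ones((n, n)) - np.eye(n)
target = int(n * np.log(n))            # O(n log n) edge budget from the text
W_sparse = greedy_sparsify(W_complete, target)
print(spectral_gap(W_complete), spectral_gap(W_sparse))
```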
9. Experimental Results
9.1 Deployment Configuration
We evaluated the collective calibration framework on 9 production MARIA OS zones spanning financial compliance (3 zones, 231 agents), healthcare diagnostics (3 zones, 198 agents), and manufacturing quality (3 zones, 194 agents). Each zone ran for 120 days with collective calibration monitoring active, preceded by a 120-day baseline period with individual-only CCE monitoring. Total joint decisions evaluated: 47,382 (baseline) and 51,209 (treatment).
9.2 CCE<sub>collective</sub> Reduction
Across all 9 zones, CCE<sub>collective</sub> decreased from an average of 0.127 (baseline with individual-only monitoring) to 0.074 (with collective calibration framework), a 41.7% relative reduction. Financial compliance zones showed the largest improvement (46.2%), driven by the high degree of agent interaction in AML/KYC decision chains where miscalibration propagation was most acute. Healthcare zones showed 39.1% reduction. Manufacturing zones showed 38.8% reduction. The improvement was strongly correlated with the baseline interaction graph density: denser graphs — where more agents interact directly — showed larger CCE<sub>collective</sub> improvements because they had more pathways for miscalibration propagation that the framework could address.
9.3 Convergence Speedup
Topology-aware reflection scheduling, using spectral-gap-optimized interaction patterns, achieved 2.8× faster convergence to calibration equilibrium compared to naive round-robin reflection ordering. The spectral gap of optimized topologies averaged γ<sub>spec</sub> = 0.34, compared to γ<sub>spec</sub> = 0.11 for the default topologies. The theoretical prediction for convergence speedup based on spectral gap ratios was 3.1×; the realized speedup of 2.8× reflects the impact of exogenous noise that the deterministic analysis does not account for.
9.4 Consensus-Calibration Pareto Front
The two-phase interaction protocol achieved a hypervolume indicator of 0.91 on the CCE<sub>collective</sub> vs. (1 − CON) Pareto front, compared to 0.67 for consensus-first protocols and 0.73 for calibration-first protocols. The Pareto-optimal protocol found configurations where CCE<sub>collective</sub> ≤ 0.08 and CON ≥ 0.82, demonstrating that the consensus-calibration tension is navigable when the interaction schedule is explicitly optimized for both objectives.
10. Conclusion
Collective calibration is a distinct phenomenon from individual calibration: it depends on the correlation structure of agent errors and the topology of agent interactions, neither of which is captured by individual CCE metrics. The Non-Reducibility Theorem establishes that monitoring only individual calibration creates a blind spot for collective miscalibration — teams of individually well-calibrated agents can produce collectively overconfident decisions when their errors are correlated. The Calibration Propagation Operator formalizes how miscalibration spreads through agent teams, and the Spectral Convergence Theorem provides precise conditions under which collective calibration converges. The Consensus-Calibration Impossibility result demonstrates a fundamental tension that cannot be eliminated but can be managed through Pareto-optimal interaction scheduling. MARIA OS’s implementation integrates these theoretical results into production governance: zone-level CalibrationMonitors track CCE<sub>collective</sub> in real time, spectral analysis of interaction graphs provides leading indicators of calibration instability, and topology-aware reflection scheduling achieves 41.7% reduction in collective calibration error with 2.8× convergence speedup. For enterprise deployments where teams of agents must produce trustworthy joint decisions, collective calibration monitoring is not optional — it is the difference between a team that knows what it knows and a team that merely believes it does.