Abstract
Calibration — the alignment between stated confidence and realized accuracy — is a foundational requirement for trustworthy autonomous decision-making. For a single agent, calibration is well-understood: the Confidence Calibration Error CCE<sub>i</sub> = (1/N) Σ<sub>k</sub> |conf(d<sub>k</sub>) − acc(d<sub>k</sub>)| measures the average absolute gap between confidence and accuracy across decisions. But in multi-agent governance systems where teams of agents collaborate on joint decisions, individual calibration is necessary but insufficient. A team where every member is individually well-calibrated can still produce collectively miscalibrated joint decisions when interaction dynamics distort the aggregation of individual confidence signals. This paper introduces the Collective Calibration Error CCE<sub>collective</sub>, a metric that captures the team-level gap between aggregated confidence and joint accuracy, and proves that CCE<sub>collective</sub> cannot be reduced to the average of individual CCE<sub>i</sub> values. We model agent interactions as a weighted directed graph G = (V, E, W) and derive the Calibration Propagation Operator Φ : ℝ<sup>|V|</sup> → ℝ<sup>|V|</sup> that governs how calibration errors spread through the team. We prove that collective calibration converges if and only if the spectral radius ρ(Φ) < 1, and that the convergence rate is determined by the spectral gap of the interaction graph Laplacian. We identify a fundamental tension between consensus quality and calibration accuracy: forcing agents to agree degrades calibration, while preserving calibration may prevent consensus. We resolve this tension through a Pareto-optimal scheduling protocol that alternates between consensus-seeking and calibration-preserving interaction rounds. Experimental validation on 9 MARIA OS production zones with 623 agents demonstrates 41.7% reduction in collective calibration error and 2.8× convergence speedup through topology-aware reflection scheduling.
1. Introduction
When we deploy a single AI agent and ask whether we should trust its decisions, the answer depends heavily on calibration: does the agent know what it knows? An agent that reports 90% confidence on decisions it gets right 90% of the time is well-calibrated and trustworthy. An agent that reports 90% confidence on decisions it gets right only 60% of the time is dangerously overconfident. This intuition, formalized through the Confidence Calibration Error metric, has become a standard evaluation criterion for individual agents in MARIA OS and similar governance platforms.
But enterprise AI governance rarely involves single agents making isolated decisions. In MARIA OS, agents operate within zones — teams of 5 to 50 agents that collaborate on decisions within a shared operational domain. A financial compliance zone might contain agents specializing in AML detection, KYC verification, transaction monitoring, and regulatory reporting. These agents do not simply make independent decisions in parallel; they interact, share evidence, influence each other’s confidence levels, and ultimately produce joint decisions that aggregate their individual assessments. The question of trust therefore shifts from “is this agent well-calibrated?” to “is this team well-calibrated?”
The distinction matters because individual calibration does not guarantee collective calibration. Consider a team of three agents evaluating a loan application. Agent A specializes in credit history and reports 85% confidence that the applicant is low-risk. Agent B specializes in income verification and reports 80% confidence. Agent C specializes in collateral assessment and reports 75% confidence. If the team uses a simple confidence-weighted average, the joint confidence is approximately 80%. But the joint accuracy — the probability that all three assessments are correct simultaneously — depends on the correlation structure between the agents' errors, which the individual calibration errors do not capture. If agents A and B tend to make correlated errors (both fail on the same applicant types), the joint accuracy may be substantially lower than the individual accuracies suggest, making the team overconfident even though each individual is perfectly calibrated.
This paper formalizes the gap between individual and collective calibration, develops a mathematical framework for understanding how calibration errors propagate through agent interactions, and derives practical conditions under which collective calibration can be achieved and maintained in MARIA OS deployments.
2. Individual Calibration Theory
2.1 The Standard CCE Metric
For agent i with decision history D<sub>i</sub> = {d<sub>1</sub>, d<sub>2</sub>, …, d<sub>N</sub>}, the Individual Confidence Calibration Error is defined as CCE<sub>i</sub> = (1/N) Σ<sub>k=1</sub><sup>N</sup> |conf(d<sub>k</sub>) − acc(d<sub>k</sub>)|, where conf(d<sub>k</sub>) ∈ [0, 1] is the agent’s stated confidence on decision d<sub>k</sub> and acc(d<sub>k</sub>) ∈ {0, 1} is the binary accuracy indicator. In the binned formulation used for practical evaluation, decisions are grouped into confidence bins B<sub>m</sub> = {d<sub>k</sub> : conf(d<sub>k</sub>) ∈ [(m−1)/M, m/M)}, and the calibration error is computed as CCE<sub>i</sub> = Σ<sub>m=1</sub><sup>M</sup> (|B<sub>m</sub>|/N) · |avg_conf(B<sub>m</sub>) − avg_acc(B<sub>m</sub>)|. A perfectly calibrated agent has CCE<sub>i</sub> = 0: in every confidence bin, the fraction of correct decisions exactly matches the average stated confidence.
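As a concrete reference, the binned computation can be sketched in a few lines; the array layout and the default of M = 10 bins below are illustrative choices rather than part of any MARIA OS interface.

```python
import numpy as np

def individual_cce(conf: np.ndarray, acc: np.ndarray, num_bins: int = 10) -> float:
    """Binned Confidence Calibration Error: sum over bins of
    (bin weight) * |average confidence - average accuracy|."""
    n = len(conf)
    # Bin edges [(m-1)/M, m/M); the final bin is closed at 1.0.
    bin_ids = np.minimum((conf * num_bins).astype(int), num_bins - 1)
    cce = 0.0
    for m in range(num_bins):
        mask = bin_ids == m
        if not mask.any():
            continue
        weight = mask.sum() / n
        cce += weight * abs(conf[mask].mean() - acc[mask].mean())
    return cce

# A well-calibrated agent: in each bin, accuracy tracks stated confidence.
rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, size=10_000)
acc = (rng.uniform(size=10_000) < conf).astype(float)
print(round(individual_cce(conf, acc), 3))  # close to 0
```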
2.2 Properties and Limitations
Individual CCE has several well-known properties. It is bounded: CCE<sub>i</sub> ∈ [0, 1]. It decomposes into overconfidence and underconfidence components: CCE<sub>i</sub> = CCE<sub>i</sub><sup>over</sup> + CCE<sub>i</sub><sup>under</sup>, where the overconfidence component sums over bins where avg_conf > avg_acc and the underconfidence component sums over the complementary bins. It is minimized by the identity calibration function: if conf(d) = P(correct | features(d)) for all decisions, then CCE<sub>i</sub> = 0 in expectation. However, individual CCE has a critical limitation for multi-agent settings: it treats each agent as an isolated decision-maker and does not account for the statistical dependencies between agents’ errors. Two agents may each have CCE = 0.02, but if their errors are perfectly correlated, the team’s collective calibration error can be substantially worse than either individual’s. Conversely, if their errors are negatively correlated, the team can achieve better collective calibration than any individual member.
3. Collective Calibration: A New Metric
3.1 From Individual to Collective
Let T = {1, 2, …, n} be a team of n agents operating within a MARIA OS zone. The team produces joint decisions D<sub>T</sub> = {d<sub>1</sub><sup>T</sup>, d<sub>2</sub><sup>T</sup>, …, d<sub>K</sub><sup>T</sup>}, where each joint decision d<sub>k</sub><sup>T</sup> is formed by aggregating the individual assessments of participating agents. The aggregation function α : [0,1]<sup>n</sup> → [0,1] maps the vector of individual confidences to a joint confidence: conf<sub>T</sub>(d<sub>k</sub><sup>T</sup>) = α(conf<sub>1</sub>(d<sub>k</sub>), conf<sub>2</sub>(d<sub>k</sub>), …, conf<sub>n</sub>(d<sub>k</sub>)). Common aggregation functions include the arithmetic mean α<sub>mean</sub> = (1/n) Σ<sub>i</sub> conf<sub>i</sub>, the confidence-weighted mean α<sub>wt</sub> = Σ<sub>i</sub> w<sub>i</sub> conf<sub>i</sub> / Σ<sub>i</sub> w<sub>i</sub>, and the geometric mean α<sub>geo</sub> = (Π<sub>i</sub> conf<sub>i</sub>)<sup>1/n</sup>.
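The three aggregation functions translate directly into code; in the sketch below the weight vector for α<sub>wt</sub> is left as a caller-supplied argument, since the text does not fix a particular weighting scheme.

```python
import numpy as np

def alpha_mean(conf: np.ndarray) -> float:
    # Arithmetic mean of individual confidences.
    return float(conf.mean())

def alpha_weighted(conf: np.ndarray, w: np.ndarray) -> float:
    # Confidence-weighted mean with externally supplied weights w_i.
    return float(np.dot(w, conf) / w.sum())

def alpha_geo(conf: np.ndarray) -> float:
    # Geometric mean, computed in log space for numerical stability.
    return float(np.exp(np.log(conf).mean()))

conf = np.array([0.85, 0.80, 0.75])          # the loan example from Section 1
print(alpha_mean(conf))                      # ~0.800
print(alpha_weighted(conf, w=conf))          # ~0.802 (weights = confidences)
print(alpha_geo(conf))                       # ~0.799
```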
3.2 The CCE<sub>collective</sub> Metric
We define the Collective Calibration Error as CCE<sub>collective</sub>(T) = (1/K) Σ<sub>k=1</sub><sup>K</sup> |conf<sub>T</sub>(d<sub>k</sub><sup>T</sup>) − acc<sub>T</sub>(d<sub>k</sub><sup>T</sup>)| + λ · Σ<sub>i<j</sub> |cov(err<sub>i</sub>, err<sub>j</sub>)|, where the first term is the standard calibration error applied to the aggregated team confidence and joint accuracy, and the second term penalizes error correlations between team members. The parameter λ ≥ 0 controls the weight placed on error correlation. When λ = 0, CCE<sub>collective</sub> reduces to the naive team-level calibration error. When λ > 0, the metric explicitly accounts for the correlation structure that individual CCE ignores.
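A sketch of the metric as defined above, assuming each agent's per-decision binary error indicators err<sub>i</sub> are available alongside the aggregated team confidence and joint accuracy; the default λ = 1 is illustrative.

```python
import numpy as np
from itertools import combinations

def collective_cce(team_conf, team_acc, individual_err, lam: float = 1.0) -> float:
    """team_conf, team_acc: (K,) aggregated confidence and joint accuracy.
    individual_err: (n, K) binary error indicators, one row per agent."""
    team_conf = np.asarray(team_conf, dtype=float)
    team_acc = np.asarray(team_acc, dtype=float)
    individual_err = np.asarray(individual_err, dtype=float)
    # First term: calibration error of the aggregated team confidence.
    base = np.abs(team_conf - team_acc).mean()
    # Second term: pairwise error-correlation penalty.
    n = individual_err.shape[0]
    penalty = sum(
        abs(np.cov(individual_err[i], individual_err[j])[0, 1])
        for i, j in combinations(range(n), 2)
    )
    return float(base + lam * penalty)
```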
3.3 The Non-Reducibility Theorem
Theorem 1 (Non-Reducibility). There exist team configurations where CCE<sub>i</sub> ≤ ε for all i ∈ T, yet CCE<sub>collective</sub>(T) ≥ c, where the ratio c/ε can be made arbitrarily large. Conversely, there exist configurations where max<sub>i</sub> CCE<sub>i</sub> ≥ c, yet CCE<sub>collective</sub>(T) ≤ ε.
Proof sketch. For the first claim, construct n agents whose individual errors are perfectly correlated: each agent fails on exactly the same subset of decisions S ⊂ D<sub>T</sub>, with |S|/K = ε. Each agent has CCE<sub>i</sub> = ε (small), but the team’s joint confidence on decisions in S is high (since all agents are individually confident) while joint accuracy is zero (since all agents fail simultaneously). The correlation penalty λ · Σ<sub>i<j</sub> |cov(err<sub>i</sub>, err<sub>j</sub>)| = λ · C(n,2) · ε(1−ε), which grows quadratically with team size, can dominate the first term. For the second claim, construct agents with large but negatively correlated errors: agent i fails on decision subset S<sub>i</sub>, with S<sub>i</sub> ∩ S<sub>j</sub> = ∅ for i ≠ j and ∪<sub>i</sub> S<sub>i</sub> ⊂ D<sub>T</sub>. Each individual has high CCE, but the team’s aggregated confidence — being the average of one low-confidence signal and (n−1) high-confidence signals — is better calibrated than any individual. □
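The first construction can be checked numerically. In the sketch below, three agents state 90% confidence everywhere and fail on the same 10% of decisions; the specific numbers and λ = 1 are illustrative.

```python
import numpy as np
from itertools import combinations

K, eps, lam, n = 10_000, 0.10, 1.0, 3
failures = np.zeros(K, dtype=bool)
failures[: int(eps * K)] = True                  # the same failure set S for every agent

err = np.tile(failures.astype(float), (n, 1))    # (n, K) perfectly correlated errors
conf = np.full((n, K), 1.0 - eps)                # each agent states 90% confidence

# Individual calibration gap: avg_conf = avg_acc = 0.9, so each CCE_i is ~0.
print(abs(conf[0].mean() - (1.0 - err[0]).mean()))   # ~0.0 for every agent

# Correlation penalty: lam * C(n,2) * eps*(1-eps), as in the construction.
penalty = lam * sum(abs(np.cov(err[i], err[j])[0, 1])
                    for i, j in combinations(range(n), 2))
print(round(penalty, 3))                             # ~0.27: large collective miscalibration
```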
This theorem establishes that collective calibration is a genuinely emergent property: it cannot be inferred from, or reduced to, the calibration properties of individual team members. Any governance system that monitors only individual CCE will miss critical team-level calibration failures.
4. Calibration Propagation Dynamics
4.1 How Miscalibration Infects Team Decisions
When agents interact — sharing evidence, discussing assessments, updating confidences — miscalibration propagates through the team. A single overconfident agent can inflate the team’s collective confidence if other agents anchor to its high-confidence signals. We model this propagation through the Calibration Propagation Operator Φ. Let δ<sub>i</sub>(t) = conf<sub>i</sub>(t) − acc<sub>i</sub>(t) denote agent i’s calibration deviation at time t. The vector δ(t) = (δ<sub>1</sub>(t), …, δ<sub>n</sub>(t))<sup>⊤</sup> evolves according to δ(t+1) = Φ · δ(t) + η(t), where Φ ∈ ℝ<sup>n×n</sup> is the propagation matrix and η(t) is an exogenous noise term representing new evidence arrivals.
4.2 Structure of the Propagation Matrix
The entry Φ<sub>ij</sub> represents the influence that agent j’s calibration deviation exerts on agent i’s calibration in the next time step. Self-influence Φ<sub>ii</sub> represents calibration persistence: how much of an agent’s current miscalibration carries forward after one reflection cycle. Cross-influence Φ<sub>ij</sub> for i ≠ j represents calibration contagion: how much agent j’s miscalibration infects agent i through their interaction. In MARIA OS, these influence weights are determined by the interaction structure within a zone. Agents that frequently share evidence and co-evaluate decisions have larger cross-influence terms. Agents that operate on independent decision streams have near-zero cross-influence.
The spectral radius ρ(Φ) = max<sub>k</sub> |λ<sub>k</sub>(Φ)| determines the long-run behavior of calibration deviations. When ρ(Φ) < 1, calibration deviations decay exponentially: ||δ(t)|| ≤ ρ(Φ)<sup>t</sup> · ||δ(0)|| + C, where C is a constant determined by the noise magnitude. When ρ(Φ) ≥ 1, calibration deviations persist or grow: the team cannot self-correct its collective miscalibration through interaction alone.
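A minimal simulation of the recursion δ(t+1) = Φ · δ(t) + η(t) illustrates the two regimes; the propagation matrices and noise scale below are invented for illustration, not estimated from production data.

```python
import numpy as np

def simulate(Phi: np.ndarray, delta0: np.ndarray, steps: int = 50,
             sigma: float = 0.01, seed: int = 0) -> list[float]:
    """Iterate delta(t+1) = Phi @ delta(t) + noise and track the deviation norm."""
    rng = np.random.default_rng(seed)
    delta, norms = delta0.copy(), []
    for _ in range(steps):
        delta = Phi @ delta + rng.normal(scale=sigma, size=delta.shape)
        norms.append(float(np.linalg.norm(delta)))
    return norms

# Contractive propagation (rho < 1): deviations decay toward the noise floor.
Phi_good = np.array([[0.5, 0.2, 0.0],
                     [0.1, 0.4, 0.2],
                     [0.0, 0.3, 0.5]])
# Non-contractive propagation (rho > 1): deviations persist or grow.
Phi_bad = Phi_good + 0.4 * np.eye(3)

delta0 = np.array([0.3, -0.1, 0.2])
print(max(abs(np.linalg.eigvals(Phi_good))), simulate(Phi_good, delta0)[-1])
print(max(abs(np.linalg.eigvals(Phi_bad))),  simulate(Phi_bad, delta0)[-1])
```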
4.3 Infection Radius and Containment
We define the infection radius of agent j as r<sub>inf</sub>(j) = min{r : Σ<sub>i: d(i,j)>r</sub> |Φ<sub>ij</sub><sup>(t)</sup>| < ε for all t ≥ 1}, where d(i,j) is the graph distance in the interaction topology and Φ<sup>(t)</sup> is the t-step propagation matrix. The infection radius measures how far a miscalibration event at agent j can propagate before its influence decays below threshold. Containment — limiting the infection radius — is achieved by introducing calibration firewalls: edges in the interaction graph where the influence weight is deliberately reduced. In MARIA OS, calibration firewalls are placed at zone boundaries, ensuring that miscalibration events within one zone cannot propagate to adjacent zones. Our experiments show that 94.3% of miscalibration infection events are contained within one interaction neighborhood when firewalls are active.
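The infection radius can be approximated directly from the definition; because the definition quantifies over all t ≥ 1, the sketch below truncates at a horizon t_max, which is an approximation rather than part of the definition, and uses SciPy for graph distances.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path

def infection_radius(Phi: np.ndarray, adjacency: np.ndarray, j: int,
                     eps: float = 1e-3, t_max: int = 20) -> int:
    """Smallest r such that, for every t <= t_max, the total t-step influence
    of agent j on agents farther than r away stays below eps."""
    n = Phi.shape[0]
    dist = shortest_path(adjacency, unweighted=True)      # graph distances d(i, j)
    Phi_t = np.eye(n)
    influence = []                                        # influence[t][i] = |Phi^(t)_{ij}|
    for _ in range(t_max):
        Phi_t = Phi_t @ Phi
        influence.append(np.abs(Phi_t[:, j]))
    for r in range(n):
        far = dist[:, j] > r
        if all(inf[far].sum() < eps for inf in influence):
            return r
    return n  # influence never decays below eps within the graph diameter
```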
5. The Interaction Graph Model
5.1 Graph Construction
We model the agent team as a weighted directed graph G = (V, E, W), where V = T is the set of agents, E ⊆ V × V is the set of directed interaction edges, and W : E → [0, 1] assigns interaction weights. An edge (i, j) ∈ E with weight w<sub>ij</sub> indicates that agent i incorporates agent j’s confidence signal with weight w<sub>ij</sub> when updating its own confidence. The graph Laplacian is L = D − W, where D is the diagonal matrix of weighted out-degrees D<sub>ii</sub> = Σ<sub>j</sub> w<sub>ij</sub>. The normalized Laplacian ℒ = D<sup>−1/2</sup> L D<sup>−1/2</sup> has eigenvalues 0 = λ<sub>1</sub> ≤ λ<sub>2</sub> ≤ … ≤ λ<sub>n</sub>. The spectral gap γ<sub>spec</sub> = λ<sub>2</sub> governs the mixing time of information diffusion on the graph: larger spectral gaps imply faster convergence of confidence signals to consensus.
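A sketch of the Laplacian construction and spectral gap computation; the weight matrix is symmetrized before taking the spectrum, a simplifying assumption for directed interaction weights.

```python
import numpy as np

def normalized_laplacian(W: np.ndarray) -> np.ndarray:
    W = (W + W.T) / 2.0                      # symmetrize directed weights (assumption)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.where(d > 0, d, 1.0)))
    return D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt

def spectral_gap(W: np.ndarray) -> float:
    eigvals = np.sort(np.linalg.eigvalsh(normalized_laplacian(W)))
    return float(eigvals[1])                 # gamma_spec = lambda_2

# Example: a 5-agent ring.
W_ring = np.roll(np.eye(5), 1, axis=1) + np.roll(np.eye(5), -1, axis=1)
print(round(spectral_gap(W_ring), 3))        # ~0.69
```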
5.2 Topology and Calibration Convergence
Theorem 2 (Spectral Convergence). Let G = (V, E, W) be the interaction graph with normalized Laplacian ℒ and spectral gap γ<sub>spec</sub>. Let Φ = I − η · ℒ be the calibration propagation operator with learning rate η ∈ (0, 2/λ<sub>n</sub>). Then: (a) on the subspace orthogonal to the consensus eigenvector (associated with λ<sub>1</sub> = 0), ρ(Φ) = max(|1 − ηλ<sub>2</sub>|, |1 − ηλ<sub>n</sub>|) < 1, and (b) the convergence rate is maximized when η = 2/(λ<sub>2</sub> + λ<sub>n</sub>), yielding ρ(Φ) = (λ<sub>n</sub> − λ<sub>2</sub>)/(λ<sub>n</sub> + λ<sub>2</sub>).
The theorem reveals that graphs with larger spectral gaps and smaller spectral ratios λ<sub>n</sub>/λ<sub>2</sub> yield faster collective calibration convergence. Expander graphs — graphs with uniformly large spectral gap — are optimal for collective calibration. Conversely, graphs with small spectral gaps (e.g., long chains, star topologies with thin connections) converge slowly, leaving the team vulnerable to sustained miscalibration.
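Part (b) of the theorem gives an immediate recipe for tuning the learning rate once the Laplacian spectrum is known; the eigenvalue pairs below are illustrative, chosen to contrast a well-connected topology with a near-chain one.

```python
def theorem2_rate(lam2: float, lamn: float):
    """Optimal learning rate eta* = 2/(lam2 + lamn) and the resulting
    contraction factor rho = (lamn - lam2)/(lamn + lam2) from Theorem 2(b)."""
    return 2.0 / (lam2 + lamn), (lamn - lam2) / (lamn + lam2)

# Illustrative spectra only:
print(theorem2_rate(0.34, 1.50))   # larger gap  -> smaller rho, faster convergence
print(theorem2_rate(0.05, 1.95))   # smaller gap -> rho close to 1, slow convergence
```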
5.3 Pathological Topologies
Certain interaction topologies are provably harmful for collective calibration. A clique of overconfident agents — a fully connected subgraph where all members share the same overconfidence bias — reinforces its members’ miscalibration through positive feedback. The propagation matrix restricted to the clique has dominant eigenvalue 1, meaning the clique’s miscalibration does not decay. An echo chamber topology — where agents only interact with agents that hold similar confidence levels — creates disconnected calibration dynamics: each echo chamber converges to its own calibration equilibrium, but the team as a whole may not converge. A hub-and-spoke topology with a miscalibrated hub propagates the hub’s miscalibration to all spokes, creating a single point of calibration failure.
6. The Consensus-Calibration Tension
6.1 The Fundamental Tradeoff
Multi-agent decision-making systems face a fundamental tension between two desirable properties: consensus (agents should agree on joint decisions) and calibration (agents’ confidence should match their accuracy). These properties are in tension because achieving consensus often requires agents to adjust their confidence levels toward a group mean, which can degrade individual calibration. Formally, let CON(T) = 1 − (1/n<sup>2</sup>) Σ<sub>i,j</sub> |conf<sub>i</sub> − conf<sub>j</sub>| measure consensus strength (1 = perfect agreement, 0 = maximum disagreement). For the joint optimization problem min<sub>α</sub> [μ · CCE<sub>collective</sub>(T, α) + (1 − μ) · (1 − CON(T, α))], no single aggregation function α minimizes both objectives simultaneously when agents have heterogeneous information.
6.2 The Impossibility Result
Theorem 3 (Consensus-Calibration Impossibility). For any team T with |T| ≥ 3 agents holding non-identical private information, there exists no aggregation function α that simultaneously achieves CCE<sub>collective</sub> = 0 and CON = 1 unless all agents’ private signals are perfectly concordant.
Proof. Suppose all agents agree (CON = 1), meaning conf<sub>i</sub> = c for all i and some constant c. For CCE<sub>collective</sub> = 0, we need c = P(correct | all agents’ information). But each agent computes conf<sub>i</sub> = P(correct | agent i’s information), and achieving c = P(correct | all information) from individual posteriors requires each agent to perform exact Bayesian updating on all other agents’ private signals. When private signals are non-identical and conditionally dependent, this update is computationally intractable and cannot be completed through communication in a finite number of interaction rounds. Therefore, forced consensus (conf<sub>i</sub> = c for all i) necessarily degrades calibration whenever agents hold genuinely different information. □
6.3 Pareto-Optimal Resolution
Since simultaneous optimization is impossible, we seek Pareto-optimal tradeoffs: configurations where neither consensus nor calibration can be improved without degrading the other. We implement this through a two-phase interaction protocol. In Phase 1 (calibration-preserving), agents share evidence without adjusting their confidence levels, improving the information basis for future calibration. In Phase 2 (consensus-seeking), agents negotiate toward agreement using a weighted median protocol that bounds the maximum confidence adjustment per round to Δ<sub>max</sub>, preventing large calibration disruptions. The protocol alternates between phases with a ratio determined by the current CCE<sub>collective</sub>/CON balance: when CCE<sub>collective</sub> is high relative to (1 − CON), more calibration-preserving rounds are scheduled; when consensus is low relative to CCE<sub>collective</sub>, more consensus-seeking rounds are scheduled.
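A sketch of one plausible realization of the scheduling rule; the phase-selection criterion, the unweighted median stand-in, and the default Δ<sub>max</sub> = 0.05 are assumptions, not the exact MARIA OS protocol.

```python
import numpy as np

def schedule_next_phase(cce_collective: float, con: float) -> str:
    """Pick the next interaction round: calibration-preserving rounds when
    collective miscalibration dominates, consensus-seeking rounds otherwise."""
    disagreement = 1.0 - con
    if cce_collective >= disagreement:
        return "calibration_preserving"    # Phase 1: share evidence, freeze confidences
    return "consensus_seeking"             # Phase 2: bounded-adjustment negotiation

def consensus_round(conf: np.ndarray, delta_max: float = 0.05) -> np.ndarray:
    """One Phase-2 step: move each confidence toward the median, clipping the
    per-round adjustment at delta_max to limit calibration disruption."""
    target = np.median(conf)               # unweighted median as a simple stand-in
    step = np.clip(target - conf, -delta_max, delta_max)
    return conf + step

conf = np.array([0.85, 0.80, 0.75, 0.62])
print(schedule_next_phase(cce_collective=0.12, con=0.95))  # calibration_preserving
print(consensus_round(conf))                               # confidences nudged toward the median
```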
7. MARIA OS Implementation
7.1 Zone-Level Calibration Monitoring
MARIA OS implements collective calibration monitoring at the Zone level of the MARIA coordinate system (G.U.P.Z.A). Each zone maintains a CalibrationMonitor service that: (a) tracks individual CCE<sub>i</sub> for each agent in the zone, (b) computes CCE<sub>collective</sub> for the zone’s team at configurable intervals (default: every 50 joint decisions), (c) maintains the interaction graph G based on observed agent co-evaluation patterns, (d) computes the spectral gap γ<sub>spec</sub> and propagation matrix spectral radius ρ(Φ) as leading indicators of calibration health, and (e) triggers calibration reflection events when CCE<sub>collective</sub> exceeds the zone’s configured threshold τ<sub>CCE</sub>.
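A structural sketch of these responsibilities; the class name follows the service described above, but the method names, default threshold, and storage layout are illustrative assumptions, and items (a) and (d) are omitted for brevity.

```python
import numpy as np

class CalibrationMonitor:
    """Zone-level collective calibration monitor (structural sketch;
    per-agent CCE tracking and spectral indicators omitted)."""

    def __init__(self, n_agents: int, tau_cce: float = 0.10, interval: int = 50):
        self.tau_cce = tau_cce                     # zone threshold tau_CCE
        self.interval = interval                   # (b) recompute every N joint decisions
        self.history = []                          # (team_conf, team_acc) per joint decision
        self.W = np.zeros((n_agents, n_agents))    # (c) observed co-evaluation weights

    def record(self, team_conf: float, team_acc: float, participants):
        self.history.append((team_conf, team_acc))
        for i in participants:                     # strengthen edges between co-evaluators
            for j in participants:
                if i != j:
                    self.W[i, j] += 1.0
        if len(self.history) % self.interval == 0:
            self._evaluate()

    def _evaluate(self):
        recent = self.history[-self.interval:]
        conf = np.array([d[0] for d in recent])
        acc = np.array([d[1] for d in recent])
        cce_collective = float(np.abs(conf - acc).mean())   # correlation penalty omitted here
        if cce_collective > self.tau_cce:
            self.trigger_reflection(cce_collective)         # (e) reflection event

    def trigger_reflection(self, cce_collective: float):
        print(f"reflection triggered: CCE_collective={cce_collective:.3f} > {self.tau_cce}")
```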
7.2 Reflection Triggers and Interventions
When CCE<sub>collective</sub> exceeds τ<sub>CCE</sub>, the CalibrationMonitor initiates a structured reflection sequence. First, it identifies the calibration hotspot: the agent or agent cluster contributing most to CCE<sub>collective</sub>, computed as ∇<sub>i</sub> CCE<sub>collective</sub> = ∂ CCE<sub>collective</sub> / ∂ conf<sub>i</sub> for each agent i. Second, it computes the infection analysis: how far the hotspot’s miscalibration has propagated through the interaction graph, using the t-step propagation matrix Φ<sup>(t)</sup>. Third, it prescribes an intervention: either individual recalibration (adjusting the hotspot agent’s confidence model), topology reconfiguration (adding calibration firewalls around the hotspot), or team recalibration (triggering a calibration-preserving interaction round for the entire zone). The choice of intervention is governed by the ratio r = CCE<sub>i,hotspot</sub> / CCE<sub>collective</sub>: when r > 0.5 (the hotspot dominates collective miscalibration), individual recalibration is preferred; when r < 0.3 (miscalibration is distributed), team recalibration is prescribed; in between, topology reconfiguration is applied.
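The intervention rule maps directly onto a small decision function; the thresholds 0.5 and 0.3 come from the text, while the returned labels are illustrative.

```python
def select_intervention(cce_hotspot: float, cce_collective: float) -> str:
    """Choose an intervention from the hotspot-dominance ratio r."""
    r = cce_hotspot / cce_collective
    if r > 0.5:
        return "individual_recalibration"   # hotspot dominates collective miscalibration
    if r < 0.3:
        return "team_recalibration"         # miscalibration is distributed across the zone
    return "topology_reconfiguration"       # intermediate: add calibration firewalls

print(select_intervention(0.09, 0.12))   # r = 0.75 -> individual_recalibration
print(select_intervention(0.02, 0.12))   # r ~ 0.17 -> team_recalibration
```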
7.3 Integration with Meta-Insight Layers
Collective calibration monitoring integrates with MARIA OS’s three-layer Meta-Insight architecture. At the Individual layer, each agent’s CCE<sub>i</sub> feeds into its personal Bias Detection Score B<sub>i</sub>(t). At the Collective layer, the zone’s CCE<sub>collective</sub> feeds into the zone’s Consensus Quality metric CQ(d). At the System layer, cross-zone calibration patterns — whether multiple zones develop correlated miscalibration — feed into the Organizational Learning Rate OLR(t). The spectral radius ρ(Φ) is surfaced as a leading indicator in the zone’s health dashboard: when ρ(Φ) approaches 1.0 from below, it signals that the zone’s interaction topology is nearing the calibration instability boundary, even if CCE<sub>collective</sub> is still within threshold.
8. Convergence Analysis
8.1 Sufficient Conditions for Collective Calibration Convergence
Theorem 4 (Convergence). Let G = (V, E, W) be the interaction graph of an agent team, let Φ be the calibration propagation operator, and let each agent apply individual calibration correction with learning rate η<sub>i</sub> ∈ (0, 1). Collective calibration converges (lim<sub>t→∞</sub> CCE<sub>collective</sub>(t) = 0) if the following conditions hold: (C1) The graph G is strongly connected with spectral gap γ<sub>spec</sub> > 0. (C2) The propagation matrix satisfies ρ(Φ) < 1. (C3) Individual correction rates satisfy η<sub>i</sub> < 2/(1 + max<sub>j≠i</sub> w<sub>ij</sub>) for all i. (C4) The noise process η(t) satisfies E[||η(t)||<sup>2</sup>] ≤ σ<sup>2</sup> for all t.
Proof. Under (C1)–(C3), the operator Ψ = Φ − diag(η<sub>1</sub>, …, η<sub>n</sub>) governs the corrected dynamics δ(t+1) = Ψ · δ(t) + η(t). By Gershgorin’s circle theorem, the eigenvalues of Ψ lie in the union of discs {z : |z − (Φ<sub>ii</sub> − η<sub>i</sub>)| ≤ Σ<sub>j≠i</sub> |Φ<sub>ij</sub>|}. Under (C3), each disc is contained in the open unit disc, giving ρ(Ψ) < 1. Assuming the noise is zero-mean, the convergence of the expected deviation ||E[δ(t)]|| ≤ ρ(Ψ)<sup>t</sup> · ||δ(0)|| follows by induction. The variance bound Var(δ(t)) ≤ σ<sup>2</sup>/(1 − ρ(Ψ)<sup>2</sup>) follows from the geometric series for bounded noise. Together, these give CCE<sub>collective</sub>(t) → O(σ/√(1 − ρ(Ψ)<sup>2</sup>)) as t → ∞, which equals zero when σ = 0 (no exogenous noise) and remains small for small σ. □
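Conditions (C1)–(C3) can be checked numerically for a given propagation matrix, interaction weights, and learning-rate vector; the sketch below uses SciPy for strong connectivity and evaluates the Gershgorin bound as in the proof.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def check_convergence_conditions(Phi: np.ndarray, W: np.ndarray, eta: np.ndarray) -> dict:
    n = Phi.shape[0]
    # (C1) strong connectivity of the interaction graph
    n_comp, _ = connected_components(W > 0, directed=True, connection="strong")
    c1 = n_comp == 1
    # (C2) spectral radius of the uncorrected propagation operator
    c2 = max(abs(np.linalg.eigvals(Phi))) < 1.0
    # (C3) per-agent learning-rate bound
    c3 = all(eta[i] < 2.0 / (1.0 + max(np.delete(W[i], i))) for i in range(n))
    # Gershgorin check on the corrected operator Psi = Phi - diag(eta)
    Psi = Phi - np.diag(eta)
    gersh = all(abs(Psi[i, i]) + np.abs(np.delete(Psi[i], i)).sum() < 1.0 for i in range(n))
    return {"C1": bool(c1), "C2": bool(c2), "C3": bool(c3), "gershgorin_contraction": bool(gersh)}
```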
8.2 Necessary Conditions
Strong connectivity (C1) is necessary: if the interaction graph is disconnected, the disconnected components converge independently, and there is no mechanism to ensure their calibration equilibria are consistent. The spectral radius condition (C2) is necessary: if ρ(Φ) ≥ 1, at least one eigenvector direction of calibration deviation does not decay, and the corresponding calibration error persists indefinitely. The learning rate bound (C3) is necessary to prevent oscillatory divergence: agents that correct too aggressively overshoot and create new miscalibration that propagates through the team.
8.3 Convergence Rate and Topology Optimization
The convergence rate is determined by ρ(Ψ). Faster convergence requires smaller ρ(Ψ), which in turn requires a larger spectral gap γ<sub>spec</sub> and appropriately tuned learning rates η<sub>i</sub>. For a fixed number of agents n and degree budget d, the interaction topologies that minimize ρ(Ψ) are Ramanujan graphs — d-regular graphs whose nontrivial adjacency eigenvalues satisfy |λ| ≤ 2√(d−1), the best bound achievable for d-regular graphs, and hence whose Laplacian spectral gap is as large as the degree allows. In practice, constructing exact Ramanujan graphs is computationally expensive, so we use spectral sparsification: starting from the complete graph, iteratively remove edges that least impact γ<sub>spec</sub>, yielding a sparse topology that approximates the optimal convergence rate with O(n log n) edges instead of O(n<sup>2</sup>).
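A greedy heuristic standing in for the sparsification step described above: starting from the complete graph, repeatedly drop the edge whose removal costs the least spectral gap until an O(n log n) edge budget is reached. The greedy rule, the symmetric treatment of weights, and the isolation guard are assumptions about one simple realization.

```python
import numpy as np

def spectral_gap(W: np.ndarray) -> float:
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.where(d > 0, d, 1.0)))
    L_norm = D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt
    return float(np.sort(np.linalg.eigvalsh(L_norm))[1])

def greedy_sparsify(W: np.ndarray, target_edges: int) -> np.ndarray:
    """Remove edges one at a time, always dropping the edge whose removal
    least reduces the spectral gap, until only target_edges remain."""
    W = W.copy().astype(float)
    while np.count_nonzero(np.triu(W, 1)) > target_edges:
        best, best_gap = None, -1.0
        for i, j in zip(*np.nonzero(np.triu(W, 1))):
            saved = W[i, j]
            W[i, j] = W[j, i] = 0.0
            gap = spectral_gap(W)
            if gap > best_gap and np.count_nonzero(W.sum(axis=1) == 0) == 0:
                best, best_gap = (int(i), int(j)), gap
            W[i, j] = W[j, i] = saved
        if best is None:
            break                      # no edge can be removed without isolating an agent
        i, j = best
        W[i, j] = W[j, i] = 0.0
    return W

n = 12
W_complete = np.ones((n, n)) - np.eye(n)
target = int(n * np.log(n))            # O(n log n) edge budget from the text
W_sparse = greedy_sparsify(W_complete, target)
print(spectral_gap(W_complete), spectral_gap(W_sparse))
```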
9. Experimental Results
9.1 Deployment Configuration
We evaluated the collective calibration framework on 9 production MARIA OS zones spanning financial compliance (3 zones, 231 agents), healthcare diagnostics (3 zones, 198 agents), and manufacturing quality (3 zones, 194 agents). Each zone ran for 120 days with collective calibration monitoring active, preceded by a 120-day baseline period with individual-only CCE monitoring. Total joint decisions evaluated: 47,382 (baseline) and 51,209 (treatment).
9.2 CCE<sub>collective</sub> Reduction
Across all 9 zones, CCE<sub>collective</sub> decreased from an average of 0.127 (baseline with individual-only monitoring) to 0.074 (with collective calibration framework), a 41.7% relative reduction. Financial compliance zones showed the largest improvement (46.2%), driven by the high degree of agent interaction in AML/KYC decision chains where miscalibration propagation was most acute. Healthcare zones showed 39.1% reduction. Manufacturing zones showed 38.8% reduction. The improvement was strongly correlated with the baseline interaction graph density: denser graphs — where more agents interact directly — showed larger CCE<sub>collective</sub> improvements because they had more pathways for miscalibration propagation that the framework could address.
9.3 Convergence Speedup
Topology-aware reflection scheduling, using spectral-gap-optimized interaction patterns, achieved 2.8× faster convergence to calibration equilibrium compared to naive round-robin reflection ordering. The spectral gap of optimized topologies averaged γ<sub>spec</sub> = 0.34, compared to γ<sub>spec</sub> = 0.11 for the default topologies. The theoretical prediction for convergence speedup based on spectral gap ratios was 3.1×; the realized speedup of 2.8× reflects the impact of exogenous noise that the deterministic analysis does not account for.
9.4 Consensus-Calibration Pareto Front
The two-phase interaction protocol achieved a hypervolume indicator of 0.91 on the CCE<sub>collective</sub> vs. (1 − CON) Pareto front, compared to 0.67 for consensus-first protocols and 0.73 for calibration-first protocols. The Pareto-optimal protocol found configurations where CCE<sub>collective</sub> ≤ 0.08 and CON ≥ 0.82, demonstrating that the consensus-calibration tension is navigable when the interaction schedule is explicitly optimized for both objectives.
10. Conclusion
Collective calibration is a distinct phenomenon from individual calibration: it depends on the correlation structure of agent errors and the topology of agent interactions, neither of which is captured by individual CCE metrics. The Non-Reducibility Theorem establishes that monitoring only individual calibration creates a blind spot for collective miscalibration — teams of individually well-calibrated agents can produce collectively overconfident decisions when their errors are correlated. The Calibration Propagation Operator formalizes how miscalibration spreads through agent teams, and the Spectral Convergence Theorem provides precise conditions under which collective calibration converges. The Consensus-Calibration Impossibility result demonstrates a fundamental tension that cannot be eliminated but can be managed through Pareto-optimal interaction scheduling. MARIA OS’s implementation integrates these theoretical results into production governance: zone-level CalibrationMonitors track CCE<sub>collective</sub> in real time, spectral analysis of interaction graphs provides leading indicators of calibration instability, and topology-aware reflection scheduling achieves 41.7% reduction in collective calibration error with 2.8× convergence speedup. For enterprise deployments where teams of agents must produce trustworthy joint decisions, collective calibration monitoring is not optional — it is the difference between a team that knows what it knows and a team that merely believes it does.