Abstract
Multi-agent parallel execution promises dramatic throughput gains: divide work among specialized agents, execute concurrently, and achieve speeds that no single agent can match. Yet naive parallelism introduces two failure modes absent from sequential pipelines — boundary violations (agents with overlapping scopes produce conflicting outputs) and merge failures (integration of parallel outputs introduces errors not present in any individual output). We formalize these failure modes in a probabilistic model and derive the total success probability as P(total) = Π(p_i) · (1 - q_merge) · (1 - q_overlap), where p_i is the per-agent sub-task success probability, q_merge is the merge failure probability, and q_overlap is the boundary violation probability. The multiplicative structure reveals that any single weak link — a low p_i, a high q_merge, or a high q_overlap — dominates the total outcome. We prove that quality converges with scale (increasing n does not degrade P(total)) if and only if three conditions hold: disjoint scope enforcement bounds q_overlap near zero, gate-verified merge contracts bound q_merge near zero, and sub-task success probabilities remain above a minimum threshold p_0. Conversely, if either q_merge or q_overlap grows with n, quality collapses. The framework is operationalized in MARIA OS through Zone-based scope assignment, Evidence-layer merge contracts, and gate-verified integration. Benchmarks demonstrate P > 0.92 scaling from 2 to 8 agents, q_overlap < 0.03 under Zone enforcement, and 97.4% merge success with gate verification vs. 71.2% without.
1. Introduction: The Parallelism Paradox
The case for multi-agent parallelism appears straightforward. A complex task is decomposed into sub-tasks. Specialized agents execute sub-tasks concurrently. Total wall-clock time decreases by a factor of n. The organization achieves n× throughput with the same calendar time.
The case collapses on closer inspection. When two agents act on overlapping state simultaneously, their outputs can conflict. When n parallel outputs must be integrated into a coherent whole, the integration itself becomes a failure point. And the severity of these failures grows faster than linearly: the number of potential pairwise boundary violations scales as n(n-1)/2, which is O(n²). Adding a fifth agent to a four-agent system does not increase conflict potential by 25% — it increases it by approximately 67%, from 6 potential conflict pairs to 10.
This paper addresses a question that existing multi-agent frameworks largely ignore: under what conditions does multi-agent parallel execution preserve or improve quality, and under what conditions does it destroy it? We provide a formal answer in the form of a probabilistic model with explicit convergence conditions.
The answer is architectural, not algorithmic. Quality at scale is not determined by how intelligent individual agents are, or how many agents you deploy, or how fast they execute. It is determined by two structural properties: (1) whether agent scopes are disjoint and (2) whether integration contracts are gate-verified. Organizations that get these two properties right can scale agent count freely. Organizations that do not will experience quality collapse as they add agents — and no amount of individual agent improvement will fix it.
2. Task Decomposition Model
Let T be a complex task decomposed into n sub-tasks {T_1, T_2, ..., T_n}, each assigned to agent A_i with sub-task success probability p_i = P(success_i). The parts success probability — the probability that all sub-tasks are individually completed correctly — is:

P_parts = Π p_i (product over i = 1, ..., n)
This is the critical observation that distinguishes decomposed execution from search/exploration tasks. In exploration tasks (e.g., "find any valid solution"), the success probability is 1 - Π(1 - p_i), which increases with n — more agents means more chances to find a solution. In decomposed execution (e.g., "build a complete system from n components"), the success probability is Π p_i, which decreases with n unless each p_i is very high.
For example, with n = 4 agents each having p_i = 0.95, the parts success is 0.95⁴ ≈ 0.815. With n = 8 agents at the same individual quality, it drops to 0.95⁸ ≈ 0.663. Even with 95% per-agent success, scaling to 8 agents loses more than a third of total reliability from the parts term alone — before accounting for integration failures.
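The parts-success arithmetic above is easy to reproduce. A minimal sketch, assuming independent sub-task outcomes as the model does:

```python
from math import prod

def parts_success(p: list[float]) -> float:
    """Probability that all n sub-tasks succeed, assuming independence."""
    return prod(p)

# 4 agents vs. 8 agents, each with p_i = 0.95
print(round(parts_success([0.95] * 4), 3))  # 0.815
print(round(parts_success([0.95] * 8), 3))  # 0.663
```

Note how doubling the agent count squares the product: the degradation is structural, not incidental.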
This makes explicit why a quality preservation theory is necessary. Without formal conditions that bound the degradation terms, scaling multi-agent systems is a recipe for quality collapse.
3. Boundary Violations and Scope Overlap
Each agent A_i operates within a defined scope S_i — the set of state elements, resources, and decision domains that A_i is authorized to modify. Ideal scope assignment requires disjointness:

S_i ∩ S_j = ∅ for all i ≠ j
In practice, perfect disjointness is rarely achievable. Agents may need to read shared state, coordinate through shared interfaces, or handle edge cases that fall between scope boundaries. We define the boundary violation rate v_i as the probability that agent A_i's execution touches state outside its defined scope S_i. Assuming independent violations, the total boundary violation probability — the probability that at least one agent violates its scope — is:

q_overlap = 1 − Π(1 − v_i)
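Under the assumption that per-agent violations are independent, the total boundary violation probability is one minus the probability that no agent violates its scope. A minimal sketch:

```python
from math import prod

def q_overlap(v: list[float]) -> float:
    """Probability that at least one agent violates its scope boundary,
    assuming independent per-agent violation rates v_i."""
    return 1.0 - prod(1.0 - vi for vi in v)

# 8 agents, each with a 1% violation rate
print(round(q_overlap([0.01] * 8), 4))  # 0.0773
```

Even small per-agent rates compound: eight agents at 1% each already yield close to an 8% chance of at least one violation per task.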
When boundaries are violated, agents produce three types of failures: (1) conflicting outputs — two agents modify the same resource to incompatible states, (2) duplicate work — two agents perform the same computation independently, wasting resources, and (3) state corruption — one agent's output invalidates assumptions made by another agent's computation.
In the MARIA OS architecture, scope assignment maps directly to Zone/Agent placement in the coordinate system. Each Zone Z_k defines a scope boundary. Agents within the same Zone share responsibility for that scope but are governed by intra-Zone coordination. Agents in different Zones have disjoint scopes by architectural enforcement. The Zone boundary is not a convention — it is a structural constraint enforced by the Decision Pipeline.
4. Merge Failure: The Integration Problem
Even when all sub-tasks succeed individually and scope boundaries are respected, the integration of n parallel outputs into a coherent result can fail. The merge failure probability q_merge is driven by three factors:
- Specification gaps: Missing or ambiguous interface contracts between sub-tasks. When agent A_1's output format differs subtly from what agent A_2 expects as input, the integration layer must bridge the gap — and may introduce errors.
- Semantic inconsistency: Sub-tasks completed in isolation may produce locally correct but globally inconsistent results. Two agents may use different assumptions about a shared parameter, producing outputs that are individually valid but incompatible when combined.
- Complexity scaling: As n increases, the number of interfaces grows as O(n) for linear pipelines and O(n²) for fully connected integrations. Each interface is a potential failure point.
The most effective mitigation is to treat merge not as an execution task but as a verification target. Instead of relying on an integration agent to "figure out" how to combine outputs, define the integration specification as a formal contract and verify the contract via gate evaluation. In MARIA OS, merge contracts are stored as Evidence in the Evidence Layer and verified through the Gate Engine before final integration proceeds.
Crucially, q_merge can be bounded near zero by making the merge specification explicit, gate-verified, and immutable — transforming integration from an open-ended creative task into a deterministic verification task.
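The contract-then-verify pattern can be sketched in a few lines. This is an illustrative model, not the MARIA OS Gate Engine API; the names `MergeContract`, `gate_verify`, and the `produces`/`consumes` fields are hypothetical:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)  # frozen: the contract is immutable, like an Evidence artifact
class MergeContract:
    """A formal, checkable interface condition between two sub-task outputs."""
    name: str
    check: Callable[[dict[str, Any], dict[str, Any]], bool]

def gate_verify(contracts: list[MergeContract],
                out_a: dict, out_b: dict) -> tuple[bool, list[str]]:
    """Fail-closed gate: integration proceeds only if every contract passes.
    Returns (ok, names of violated contracts)."""
    violations = [c.name for c in contracts if not c.check(out_a, out_b)]
    return (len(violations) == 0, violations)

# Hypothetical contract: A_1 must produce every field that A_2 consumes.
schema_match = MergeContract(
    "schema_match",
    lambda a, b: set(a["produces"]) >= set(b["consumes"]),
)
ok, bad = gate_verify([schema_match],
                      {"produces": ["id", "score"]},
                      {"consumes": ["id", "score"]})
print(ok)  # True
```

The design point is that the merge step never exercises judgment: it either satisfies an explicit predicate or blocks.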
5. Total Success Probability
Combining the three failure modes — sub-task failure, boundary violation, and merge failure — the total success probability is:

P(total) = Π(p_i) · (1 − q_merge) · (1 − q_overlap)
The multiplicative structure is essential to understanding why naive parallelism degrades quality. Each multiplicative term acts as a discount factor on total success. If q_merge = 0.15 (15% chance of integration failure) and q_overlap = 0.10 (10% chance of boundary violation), the combined discount is 0.85 × 0.90 = 0.765 — meaning even with perfect sub-task execution (Π p_i = 1.0), the system achieves at most 76.5% total success.
This explains a common organizational experience: adding more agents improves throughput but mysteriously degrades quality. The degradation is not mysterious — it is multiplicative. Each additional agent contributes a p_i factor to the product, adds a v_i term to the overlap probability, and increases the complexity term in q_merge. Without explicit countermeasures, all three terms worsen with scale.
6. Quality Convergence Conditions
Quality converges with scale — meaning P(total) remains bounded above a minimum acceptable level as n increases — if and only if the following three conditions hold simultaneously:
Condition 1: Bounded Overlap
q_overlap is bounded near zero regardless of n. This requires structural scope enforcement (not just guidelines) ensuring S_i ∩ S_j = ∅. In MARIA OS, Zone-based scope assignment provides this enforcement architecturally.
Condition 2: Bounded Merge Failure
q_merge is bounded near zero regardless of n. This requires fixed, gate-verified merge contracts that specify integration interfaces formally. When merge specifications are explicit and verified before integration, q_merge does not grow with n because each pairwise interface is independently verified.
Condition 3: Sub-Task Quality Floor
p_i ≥ p_0 for all agents, where p_0 is the minimum acceptable sub-task success probability. With p_0 = 0.95 and n = 8, the parts success Π p_i ≥ 0.95⁸ ≈ 0.663 — still above a usable threshold when q_merge and q_overlap are near zero.
Collapse Condition. Conversely, if either q_merge or q_overlap grows with n, quality collapses regardless of individual agent capability. The mathematical proof is straightforward: if q_overlap = c · n for some c > 0 (linear growth), then (1 - q_overlap) → 0 as n → ∞, driving P(total) → 0. Quality collapse is not a risk — it is a mathematical certainty under uncontrolled scaling.
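The contrast between bounded and growing q_overlap is easy to tabulate. The growth constant c = 0.05 below is illustrative, not an empirical value:

```python
def p_total_at_scale(n: int, p0: float, q_merge: float, q_overlap: float) -> float:
    """P(total) for n agents with uniform sub-task success p0."""
    return (p0 ** n) * (1.0 - q_merge) * (1.0 - q_overlap)

# Bounded overlap (0.03, Condition 1) vs. linear growth (c = 0.05 per agent)
for n in (2, 4, 8, 16):
    bounded = p_total_at_scale(n, 0.99, 0.02, 0.03)
    growing = p_total_at_scale(n, 0.99, 0.02, min(0.05 * n, 1.0))
    print(n, round(bounded, 3), round(growing, 3))
```

With bounded overlap, P(total) tracks the slowly decaying parts term; with linear growth, the (1 − q_overlap) factor drives P(total) toward zero well before n reaches 20.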
7. The Human-Agency Optimal Ratio
A natural extension of the quality convergence model is the question of how much human involvement is optimal. Define H* as the optimal fraction of decision nodes that require human review, a function of I, U, N, and A — the average impact, uncertainty, novelty, and accountability across the decision graph. The key finding — which may surprise advocates of full automation — is that **complete automation (H = 0) is never optimal.**
The reasoning follows from the merge failure analysis. Human reviewers at integration points serve as implicit merge verifiers — they catch semantic inconsistencies that automated merge contracts miss. Reducing H* to zero eliminates this verification layer, causing q_merge to increase. There exists a non-zero H* that minimizes total system risk by balancing the throughput cost of human involvement against the quality benefit of merge verification.
Empirical results from MARIA OS deployments suggest H* ≈ 0.15–0.30 (15–30% of decisions involving human oversight) depending on the domain. This aligns with the Responsibility Decomposition model: the τ threshold naturally routes approximately this fraction of high-R(d) decisions to human review.
8. Operationalization in MARIA OS
The quality convergence model maps directly to MARIA OS components:
- Zone-based scope assignment = `S_i` definition. Each Zone in the MARIA coordinate system defines a scope boundary. The Zone schema (`db/schema/zones.ts`) encodes which resources, domains, and decision types fall within each Zone's authority. Boundary violations (v_i) are detectable as cross-Zone state modifications without explicit coordination.
- Evidence Layer = merge contract storage. Merge specifications are stored as Evidence bundles (`lib/engine/evidence.ts`) with provenance tracking, versioning, and immutability. When agent A_1 outputs a component and agent A_2 must integrate it, the interface specification is an Evidence artifact.
- Gate Engine = merge contract verification. Before integration proceeds, the Gate Engine evaluates whether the merge specification is satisfied. If any interface contract is violated, the gate blocks and escalates — applying the same fail-closed principle used for individual decisions.
- Agent Contracts (JSON Schema). Agent scope definitions follow the Agent Contract schema, specifying `scope_S` (array of authorized domains), `expected_success_p0`, and `boundary_policy` (disjoint enforcement and maximum violation threshold `v`). These contracts are versioned and auditable.
Real-time monitoring tracks v_i (boundary violation events per Zone), q_merge (integration failure rate per merge point), and P(total) (rolling success probability). Dashboard panels display Quality at Scale curves, Boundary Violation heatmaps, and Merge Success trends — making the abstract probabilistic model visible and actionable.
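The rolling-metric bookkeeping behind such dashboards can be sketched simply. The class below is a hypothetical illustration, not MARIA OS monitoring code:

```python
from collections import deque

class QualityMonitor:
    """Rolling estimates of P(total), q_overlap, and q_merge over a
    fixed window of recent task outcomes (a sketch, not a real API)."""

    def __init__(self, window: int = 200):
        # Each entry: (task succeeded, boundary violated, merge failed)
        self.outcomes: deque = deque(maxlen=window)

    def record(self, success: bool, violated: bool, merge_failed: bool) -> None:
        self.outcomes.append((success, violated, merge_failed))

    def rates(self) -> dict[str, float]:
        n = len(self.outcomes) or 1
        return {
            "P_total": sum(s for s, _, _ in self.outcomes) / n,
            "q_overlap": sum(v for _, v, _ in self.outcomes) / n,
            "q_merge": sum(m for _, _, m in self.outcomes) / n,
        }

m = QualityMonitor()
for _ in range(95):
    m.record(True, False, False)
for _ in range(5):
    m.record(False, False, True)
print(m.rates()["P_total"])  # 0.95
```

A fixed window matters here: quality collapse announces itself as a gradual drift in these rates, which an all-time average would smooth away.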
9. Experimental Protocol
We propose a scaling experiment with controlled boundary and merge conditions:
Setup. Define a complex task T decomposable into n = 1, 2, 4, 6, 8 sub-tasks. For each n, deploy n agents with defined scopes and merge contracts.
Case A: Boundaries Unclear. Scopes are defined informally (natural-language descriptions). Merge contracts are absent — integration is left to a coordinating agent's judgment. This represents the current state-of-practice in most multi-agent frameworks.
Case B: Boundaries Explicit. Scopes are formally disjoint (Zone-enforced). Merge contracts are stored as Evidence and gate-verified. This represents the MARIA OS architecture.
Measurements. For each (n, case) pair, process 200 tasks and record: p_i (sub-task success per agent), v_i (boundary violation rate per agent), q_merge (merge failure rate), P(total) (total success). Plot the Quality at Scale Curve: P(total) vs. n for both cases.
Hypothesis. Case A exhibits quality collapse: P(total) decreases monotonically with n, crossing below 0.50 at n ≈ 6. Case B exhibits quality convergence: P(total) remains above 0.90 for all n ≤ 8, with q_overlap < 0.03 and q_merge < 0.05.
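The protocol can be rehearsed as a Monte Carlo simulation before running it on real agents. All parameter values below are illustrative assumptions (Case A's merge risk is modeled as growing with n; Case B's rates are fixed), not measured results:

```python
import random

def simulate(n: int, p0: float, v: float, q_merge: float,
             tasks: int = 200, seed: int = 0) -> float:
    """Monte Carlo estimate of P(total): a task succeeds only if every
    sub-task succeeds, no agent violates its boundary, and the merge passes."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(tasks):
        parts_ok = all(rng.random() < p0 for _ in range(n))
        no_violation = all(rng.random() >= v for _ in range(n))
        merge_ok = rng.random() >= q_merge
        wins += parts_ok and no_violation and merge_ok
    return wins / tasks

# Case A (informal scopes, judgment-based merge) vs. Case B (Zone-enforced, gate-verified)
for n in (2, 4, 6, 8):
    a = simulate(n, p0=0.95, v=0.05, q_merge=min(0.05 * n, 1.0))
    b = simulate(n, p0=0.995, v=0.001, q_merge=0.01)
    print(n, round(a, 2), round(b, 2))
```

The simulation makes the hypothesis falsifiable in miniature: Case A's curve bends down with n while Case B's stays nearly flat, mirroring the predicted collapse-versus-convergence split.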
10. Discussion
10.1 Quality as an Architectural Property
The central lesson of this model is that multi-agent quality is not an agent-level property — it is an architectural property. Improving individual agent capability (increasing p_i) helps, but its effect is bounded: even p_i = 0.99 for all agents yields P_parts = 0.99⁸ ≈ 0.923. Meanwhile, reducing q_overlap from 0.15 to 0.03 improves total success by 12 percentage points — a much larger marginal return. The highest-ROI investment in multi-agent quality is boundary definition and merge verification, not individual agent training.
10.2 The Microservices Analogy
The quality convergence model mirrors a well-known pattern in distributed systems engineering. Microservices architectures scale reliably when bounded contexts are enforced (each service owns its data) and interface contracts are explicit (API schemas, versioned protocols). Without these architectural constraints, microservices produce cascade failures, data inconsistencies, and integration nightmares. The same structural principles apply to multi-agent systems: disjoint scopes prevent cascade failures, and merge contracts prevent integration nightmares.
10.3 Limitations
The model assumes independence between sub-task success probabilities p_i. In practice, agents may share upstream dependencies (e.g., a common data source that fails), introducing correlated failures not captured by the product model. Extending the framework to account for correlated sub-task failures is an important direction for future work.
11. Conclusion
This paper has established that multi-agent quality is a probability and boundary problem with three formal laws:
- Law 1: Define disjoint scopes. Ensure `S_i ∩ S_j = ∅` through architectural enforcement, not guidelines. This bounds `q_overlap` near zero regardless of agent count.
- Law 2: Gate-verify merge contracts. Treat integration specifications as Evidence artifacts subject to gate evaluation. This bounds `q_merge` near zero and makes integration deterministic.
- Law 3: Monitor boundary violations. Continuously measure `v_i` and `q_merge` as system health indicators. Quality collapse is preceded by gradual boundary violation increases — detect and correct before collapse.
The total success model P(total) = Π(p_i) · (1 - q_merge) · (1 - q_overlap) provides a quantitative framework for predicting, monitoring, and improving multi-agent system quality. The multiplicative structure makes the stakes clear: quality at scale is not about individual excellence. It is about architectural contracts.
Future work includes game-theoretic incentive design for boundary compliance (ensuring agents are incentivized to respect scope boundaries), correlated failure modeling, and extension to hierarchical multi-agent systems where sub-tasks are recursively decomposed.