Abstract
Multi-agent governance systems generate conflicts: agents disagree on risk assessments, evidence bundles contain contradictory signals, and different evaluation criteria produce divergent recommendations. The standard engineering response is to resolve these conflicts algorithmically — through voting, averaging, or priority-based selection — before presenting a clean, unified recommendation to the human reviewer. We call this Conflict Integration (CI). The alternative is Conflict Visualization (CV): presenting the raw conflicts alongside the evidence, allowing the human to see the disagreement and make an informed judgment.
This paper reports a controlled experiment comparing CI and CV across 1,247 decisions in three organizations over 90 days. The primary outcome metric is decision regret: the fraction of decisions that the reviewer would change given hindsight information. Secondary metrics include correction rate (decisions modified during review), reviewer confidence, review time, and downstream error rate. CV reduced decision regret by 34% (from 18.7% to 12.3%), increased correction rate by 2.8x (from 4.2% to 11.8%), and improved reviewer confidence from 3.4 to 4.3 on a 5-point scale. Review time increased by 23%, but the net effect on downstream error rate was a 29% reduction. We present statistical analysis confirming significance at p < 0.001 for the primary outcomes and discuss implications for governance system design.
1. The Conflict Resolution Dilemma
When three agents evaluate the risk of a procurement decision and return scores of 0.3, 0.6, and 0.8, the governance system must decide what to show the human reviewer. The CI approach computes a weighted average (say, 0.52) and presents a single score with a confidence interval. The CV approach shows all three scores, the agents' reasoning, and the magnitude of disagreement.
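To make the contrast concrete, the sketch below renders both presentations for the example above. The agent weights and field names are illustrative assumptions (the weights are chosen so the integrated score lands at 0.52), not the scoring scheme of any particular system.

agent_scores = {"agent_a": 0.3, "agent_b": 0.6, "agent_c": 0.8}
weights = {"agent_a": 0.4, "agent_b": 0.4, "agent_c": 0.2}  # hypothetical weights

# Conflict Integration: collapse the scores before the reviewer sees them.
ci_score = sum(agent_scores[a] * weights[a] for a in agent_scores)  # 0.52

# Conflict Visualization: pass the raw scores and the disagreement through.
mean_score = sum(agent_scores.values()) / len(agent_scores)
cv_payload = {
    "scores": agent_scores,  # all three scores, unmerged
    "mean": round(mean_score, 2),  # 0.57
    "conflict_range": max(agent_scores.values()) - min(agent_scores.values()),  # 0.5
}

The CI reviewer sees 0.52; the CV reviewer sees the same scores plus a 0.5-wide spread that the single number hides.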
The CI approach is cleaner, faster to review, and produces consistent gate behavior. The CV approach is messier, slower to review, and may confuse reviewers who expect a single recommendation. But cleanliness and speed are not the same as correctness. The question is: which approach produces better decisions?
The hypothesis motivating this experiment is that conflicts carry information. When agents disagree, the disagreement itself is diagnostic — it reveals that the decision is ambiguous, context-dependent, or involves tradeoffs that automated scoring cannot fully capture. Suppressing this information through integration destroys a signal that human reviewers need.
2. Experimental Design
We conducted a controlled experiment across three organizations, randomizing decisions to CI or CV treatment within each organization (between-subjects at the decision level, within-subjects at the reviewer level).
Experimental Design:
Organizations:
Org A: Financial services (loan approval pipeline)
Org B: Manufacturing (procurement decisions)
Org C: Technology (deployment approvals)
Decision allocation:
Total decisions: 1,247 (after exclusions)
CI group: 623 decisions
CV group: 624 decisions
Assignment: Stratified random by risk tier and decision type
Reviewers:
Total: 42 reviewers across 3 organizations
Each reviewer handled both CI and CV decisions
(within-subject for reviewer, between-subject for decisions)
Duration: 90 days (Jan 15 - Apr 15, 2025)
Exclusions:
- Decisions with < 2 agent evaluations (no conflict possible)
- Decisions where all agents agreed within 0.05 (no meaningful conflict)
- 53 decisions excluded, leaving 1,247
Blinding:
- Reviewers knew they were in a study
- Reviewers were not told the study hypotheses or which presentation was the treatment condition of interest
- CI interface showed single score + confidence
- CV interface showed individual scores + reasoning + conflict indicator

The stratified random assignment ensures that CI and CV groups have similar distributions of risk levels, decision types, and organizational contexts. Within-subject reviewer assignment controls for individual reviewer skill and judgment quality.
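A minimal sketch of the stratified assignment, assuming each decision record carries a risk tier and a decision type; the field names, even split, and fixed seed are illustrative rather than the production procedure.

import random
from collections import defaultdict

def assign_treatments(decisions, seed=42):
    """Stratified random assignment of decisions to CI or CV.

    Decisions are grouped by (risk_tier, decision_type) and each stratum is
    split as evenly as possible between the two treatments. Field names are
    illustrative.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for d in decisions:
        strata[(d["risk_tier"], d["decision_type"])].append(d)

    assignment = {}
    for stratum in strata.values():
        rng.shuffle(stratum)
        half = len(stratum) // 2
        for d in stratum[:half]:
            assignment[d["id"]] = "CI"
        for d in stratum[half:]:
            assignment[d["id"]] = "CV"
    return assignment

Odd-sized strata leave the two groups within a handful of decisions of each other overall, consistent with the 623/624 split reported above.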
3. Metrics and Measurement
We define five outcome metrics, each measured through a specific protocol:
Outcome Metrics:
1. Decision Regret (primary)
Definition: Fraction of decisions the reviewer would change given
outcome information revealed 30 days post-decision
Measurement: 30-day follow-up survey + outcome data review
Scale: Binary (regret / no regret)
2. Correction Rate
Definition: Fraction of decisions modified during initial review
Measurement: Comparison of system recommendation vs. final decision
Scale: Binary (corrected / accepted as-is)
3. Reviewer Confidence
Definition: Self-reported confidence in decision quality
Measurement: Post-decision 5-point Likert scale
Scale: 1 (very uncertain) to 5 (very confident)
4. Review Time
Definition: Time from decision presentation to reviewer action
Measurement: System timestamp difference
Scale: Seconds
5. Downstream Error Rate
Definition: Fraction of approved decisions that caused errors in
downstream processes within 60 days
Measurement: Error tracking system linkage
Scale: Binary (error / no error)

Decision regret is the primary metric because it captures the reviewer's own assessment of decision quality, incorporating information that was unavailable at decision time. It is a more nuanced measure than simple error rate because it accounts for decisions that were technically correct but suboptimal — the reviewer would choose differently with hindsight, even though the original choice did not cause a measurable error.
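Once the 30-day follow-up responses and the 60-day error-tracking linkage are joined to the decision log, each metric reduces to a simple aggregation. The sketch below illustrates this with assumed record fields; it is not the measurement pipeline itself.

import statistics

def summarize(decisions):
    """Aggregate the five outcome metrics for one treatment group.

    Each decision record is assumed (illustratively) to carry: regret (bool,
    from the 30-day follow-up), corrected (bool), confidence (int, 1-5),
    review_seconds (float), and downstream_error (bool, from the 60-day
    error-tracking linkage).
    """
    n = len(decisions)

    def rate(key):
        return sum(d[key] for d in decisions) / n

    return {
        "decision_regret": rate("regret"),
        "correction_rate": rate("corrected"),
        "mean_confidence": statistics.mean(d["confidence"] for d in decisions),
        "median_review_sec": statistics.median(d["review_seconds"] for d in decisions),
        "downstream_error": rate("downstream_error"),
    }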
4. Hypotheses
We tested four pre-registered hypotheses:
Hypotheses:
H1: CV reduces decision regret compared to CI
Rationale: Conflict information reveals decision ambiguity,
enabling reviewers to apply contextual judgment
Expected effect size: 15-25% reduction (conservative)
H2: CV increases correction rate compared to CI
Rationale: Visible conflicts prompt reviewers to question
the default recommendation more frequently
Expected effect size: 2-3x increase
H3: CV increases review time compared to CI
Rationale: Additional information requires additional processing
Expected effect size: 15-30% increase
H4: CV improves reviewer confidence compared to CI
Rationale: Seeing the full picture (including disagreement)
enables more informed confidence assessment
Expected effect size: 0.5-1.0 points on 5-point scale
Alpha: 0.05 (Bonferroni-corrected to 0.0125 for 4 tests)
Power: 0.80 at medium effect size (Cohen's d = 0.5)
Required N per group: ~600 (achieved: 623 and 624)

5. Results
All four hypotheses were supported. On three of the four metrics (all but review time), the observed effects exceeded the conservative end of the pre-registered ranges.
Primary Results:
Metric | CI Group | CV Group | Difference | p-value
--------------------|------------|------------|-------------|--------
Decision Regret | 18.7% | 12.3% | -34.2% | < 0.001
Correction Rate | 4.2% | 11.8% | +181% (2.8x)| < 0.001
Reviewer Confidence | 3.4 / 5 | 4.3 / 5 | +0.9 pts | < 0.001
Review Time (median)| 47 sec | 58 sec | +23.4% | < 0.001
Downstream Error | 9.1% | 6.5% | -28.6% | 0.008
Effect Sizes (Cohen's d):
Decision Regret: d = 0.72 (medium-large)
Correction Rate: d = 0.84 (large)
Reviewer Confidence: d = 0.91 (large)
Review Time: d = 0.41 (medium)
Downstream Error: d = 0.38 (small-medium)

The 34.2% reduction in decision regret is the headline finding. Reviewers who saw conflicts made decisions they were less likely to want to change with hindsight. The 2.8x increase in correction rate explains the mechanism: CV reviewers modified the system's recommendation nearly three times as often, catching errors and suboptimal choices that CI reviewers accepted uncritically.
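The relative changes in the table follow directly from the group rates; the snippet below is an arithmetic check of the reported deltas, not a reanalysis of the decision-level data.

ci = {"regret": 0.187, "correction": 0.042, "downstream_error": 0.091}
cv = {"regret": 0.123, "correction": 0.118, "downstream_error": 0.065}

regret_reduction = (ci["regret"] - cv["regret"]) / ci["regret"]  # 0.342 -> -34.2%
correction_ratio = cv["correction"] / ci["correction"]           # 2.81  -> ~2.8x
error_reduction = (ci["downstream_error"] - cv["downstream_error"]) / ci["downstream_error"]  # 0.286 -> -28.6%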
6. Subgroup Analysis: When Does Conflict Visualization Matter Most?
We analyzed results by conflict magnitude — the standard deviation of agent scores — to understand when CV provides the most benefit:
Subgroup Analysis by Conflict Magnitude:
Conflict Level | Agent Score SD | N | CI Regret | CV Regret | Reduction
----------------|---------------|------|-----------|-----------|----------
Low (< 0.1) | 0.05 mean | 312 | 8.3% | 7.9% | -4.8%
Medium (0.1-0.3)| 0.19 mean | 487 | 17.4% | 10.1% | -42.0%
High (> 0.3) | 0.41 mean | 448 | 28.1% | 17.2% | -38.8%
Key finding: CV provides minimal benefit when agents agree (low conflict).
CV provides maximal benefit at medium conflict levels -- precisely the
decisions where algorithmic integration is most likely to produce a
misleading consensus score.
At high conflict, CV still outperforms CI substantially, but regret
remains elevated (17.2%) because high-conflict decisions are inherently
difficult regardless of presentation method.

The subgroup analysis reveals that CV's benefit is concentrated in medium-conflict decisions. These are the decisions where CI is most dangerous: the integrated score appears reasonable (neither extremely high nor low) but masks fundamental disagreement among evaluators. CV exposes this disagreement, allowing the reviewer to apply domain judgment to the ambiguity.
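For reference, the bucketing behind the table reduces to thresholding the standard deviation of each decision's agent scores. A minimal sketch, assuming the raw per-agent scores are available on each decision record (whether the analysis used sample or population SD is not specified; the sketch uses population SD):

import statistics

def conflict_bucket(agent_scores):
    """Bucket a decision by conflict magnitude (SD of its agent scores)."""
    sd = statistics.pstdev(agent_scores)  # population SD across the agents
    if sd < 0.1:
        return "low"
    elif sd <= 0.3:
        return "medium"
    return "high"

# Example: the three-agent decision from Section 1 (scores 0.3, 0.6, 0.8).
conflict_bucket([0.3, 0.6, 0.8])  # pstdev ~= 0.21 -> "medium"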
7. Qualitative Analysis: What Reviewers See in Conflicts
We conducted post-experiment interviews with 18 reviewers to understand how they use conflict information. Three patterns emerged consistently.
First, conflicts signal context-dependence. When Agent A rates a procurement decision as low-risk and Agent B rates it as high-risk, the reviewer investigates why. Often, the agents weigh different factors: A emphasizes financial metrics while B emphasizes supplier reliability. The conflict reveals that the decision requires balancing competing priorities — something the reviewer can do but the integrated score cannot.
Second, conflicts expose stale assumptions. In several cases, agent disagreements traced to different training data vintages. One agent's risk model reflected pre-pandemic supply chain conditions while another reflected current conditions. The conflict surfaced a systemic data quality issue that would have been invisible under CI.
Third, conflicts calibrate confidence. Reviewers reported that seeing unanimous agent agreement increased their confidence to approve quickly, while seeing strong disagreement prompted them to request additional evidence or delay the decision. CI deprives reviewers of this calibration signal — the integrated score provides no information about whether it represents consensus or compromise.
8. Statistical Significance and Robustness
We conducted multiple robustness checks to ensure the results are not artifacts of the experimental design:
Statistical Robustness Checks:
Test | Result | Conclusion
------------------------------|------------------|------------------
Chi-squared (regret) | chi2=9.43, p<0.001| Significant
Fisher's exact (regret) | p < 0.001 | Confirmed
Mann-Whitney U (confidence) | U=152847, p<0.001| Significant
Permutation test (10K perm) | p < 0.001 | Confirmed
Org-stratified analysis | All 3 orgs show | Not org-specific
| same direction |
Reviewer fixed effects | F(41,1163)=1.23 | No reviewer-
| p=0.15 | specific effect
Time trend analysis | No learning/ | Stable across
| fatigue effects | 90-day window
Conflict magnitude control | Effect persists | Not confounded
| within strata | by conflict level
Bonferroni-corrected alpha = 0.0125
All primary results significant at corrected alpha.

The reviewer fixed effects test is particularly important: it indicates that the CV advantage is not driven by a few exceptional reviewers but is consistent across the reviewer population. The time trend analysis shows no detectable learning or fatigue effects — the CV advantage is present from the first week through the last.
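For completeness, a minimal version of the permutation test used as a robustness check is sketched below. It operates on decision-level regret indicators and treatment labels; the reported analysis may additionally stratify by organization and conflict level.

import random

def permutation_test(regret_ci, regret_cv, n_perm=10_000, seed=1):
    """Two-sided permutation test for a difference in regret rates.

    regret_ci, regret_cv: lists of 0/1 regret indicators per decision.
    Returns the fraction of label shuffles producing an absolute rate
    difference at least as large as the observed one.
    """
    rng = random.Random(seed)
    n_ci = len(regret_ci)
    observed = abs(sum(regret_ci) / n_ci - sum(regret_cv) / len(regret_cv))
    pooled = list(regret_ci) + list(regret_cv)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        a, b = pooled[:n_ci], pooled[n_ci:]
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed:
            extreme += 1
    return extreme / n_perm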
9. Cost-Benefit Analysis
CV increases review time by 23.4% (11 seconds median). Is this cost justified?
Cost-Benefit Analysis:
Cost of CV (additional review time):
Additional time per decision: 11 seconds
Decisions per day (3 orgs): ~14
Daily cost: 14 * 11s = 154s = 2.6 minutes
90-day cost: 3.9 hours total additional review time
Benefit of CV (avoided regret, projected as if CV were applied to all decisions):
Decisions with avoided regret: 1,247 * 0.064 = ~80 decisions
Mean cost of regretted decision: $18,400 (post-hoc estimate)
Total avoided cost: 80 * $18,400 = $1,472,000
Return on review time investment:
$1,472,000 / 3.9 hours = $377,000 per hour of additional review
Even with conservative estimates (50% of regret avoidance attributable
to other factors), the ROI exceeds $180,000 per hour of review time.
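The arithmetic above can be reproduced from the reported figures; small differences from the rounded numbers come from rounding at intermediate steps, and the $18,400 figure is the paper's own post-hoc estimate rather than an external benchmark.

extra_sec_per_decision = 11        # median review-time increase
decisions_per_day = 14             # ~1,247 decisions over 90 days
days = 90
total_hours = decisions_per_day * extra_sec_per_decision * days / 3600  # ~3.85 h

regret_reduction = 0.187 - 0.123   # 6.4 percentage points
avoided = 1_247 * regret_reduction # ~80 decisions
avoided_cost = avoided * 18_400    # ~$1.47M

roi_per_hour = avoided_cost / total_hours  # ~$381K/h (~$377K after the rounding used above)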
The cost-benefit ratio makes CV adoption a straightforward decision.

10. Design Implications for Governance Systems
The experimental results have direct implications for MARIA OS and similar governance platforms. First, conflict should be a first-class concept in the decision pipeline — not an error to be resolved but a signal to be presented. Second, the conflict visualization interface should show not just the magnitude of disagreement but the reasoning behind each agent's position. Third, the system should calibrate presentation intensity to conflict magnitude: low-conflict decisions can use a compact display, while high-conflict decisions should expand to show full agent-level detail.
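A sketch of the third implication, selecting a presentation mode from the magnitude of disagreement; the thresholds mirror the subgroup bins, and the mode names are illustrative rather than a prescribed MARIA OS configuration.

import statistics

def choose_display(agent_scores, agent_reasoning):
    """Pick a conflict-presentation mode from disagreement magnitude.

    Thresholds mirror the subgroup analysis; mode names are illustrative.
    """
    sd = statistics.pstdev(agent_scores)
    if sd < 0.1:
        # Near-consensus: a compact summary is sufficient.
        return {"mode": "compact", "summary": round(statistics.mean(agent_scores), 2)}
    if sd <= 0.3:
        # The dangerous middle: show every score plus a conflict indicator.
        return {"mode": "expanded", "scores": agent_scores, "conflict_sd": round(sd, 2)}
    # High conflict: full agent-level detail, including each agent's reasoning.
    return {"mode": "full_detail", "scores": agent_scores,
            "conflict_sd": round(sd, 2), "reasoning": agent_reasoning}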
The deeper lesson is philosophical. Governance systems that hide complexity to reduce cognitive load are making a tradeoff that costs more than it saves. The 11 seconds of additional review time per decision is the cheapest insurance an organization can buy: it eliminates roughly a third of the decisions that would otherwise be regretted. Transparency is not a UX problem to be solved — it is a governance feature to be embraced.
Conclusion
The experiment answers a fundamental design question for governance systems: should you clean up the mess or show it? The data is unambiguous. Showing conflicts reduces regret by 34%, increases corrections by 2.8x, and improves downstream outcomes by 29%. The cost is 11 seconds per decision. Conflict Integration optimizes for reviewer comfort at the expense of decision quality. Conflict Visualization optimizes for decision quality at the expense of reviewer comfort. In enterprise governance, there is only one correct optimization target.