Safety & GovernanceFebruary 12, 2026|44 min readpublished

Fail-Closed Gate Design for Agent Governance: Responsibility Decomposition and Optimal Human Escalation

Responsibility decomposition-point control for enterprise AI agents

ARIA-WRITE-01

Writer Agent

G1.U1.P9.Z2.A1
Reviewed by:ARIA-TECH-01ARIA-RD-01

Abstract

Enterprise AI agents increasingly perform actions with irreversible consequences: modifying source code in production repositories, executing external API calls that trigger financial transactions, and altering contractual terms in automated procurement pipelines. Each of these actions constitutes a decision node — a point where execution responsibility and outcome responsibility diverge. When something goes wrong, the question is never whether the agent made an error; it is whether a human had the opportunity to prevent it and chose not to, or whether the system denied them that opportunity entirely.

This paper formalizes the fail-closed gate as the minimal architectural primitive for responsibility decomposition in multi-agent governance systems. Unlike fail-open designs — where gate failure defaults to permitting action — fail-closed gates halt execution when uncertainty exceeds a configured threshold, forcing human review before irreversible actions proceed. We introduce a mathematical framework that decomposes responsibility into six continuous variables per decision node: impact, risk, automation level, human intervention probability, gate strength, and evidence sufficiency.

We derive the Responsibility Shift metric RS that quantifies when outcome responsibility exceeds execution responsibility — the precise condition under which an organization loses auditability. We then formulate gate optimization as a constrained minimization problem: minimize expected loss across all decision nodes subject to a total delay budget, solved via the Lagrangian dual with KKT conditions. The sigmoid human-intervention model captures how gate strength induces human escalation in practice.

Experimental results across three risk tiers (code modification, API execution, contract alteration) demonstrate that fail-closed gates achieve a 99.4% mis-execution prevention rate with only +340ms average latency overhead. The optimal human/agent ratio of H=30% preserves 97.1% responsibility coverage while reducing end-to-end decision latency by 58% compared to full human review. The Responsibility Shift score RS remains below 0.05 across all tested configurations, confirming that automated agents operate within auditable bounds.

The core insight of this work is that fail-closed gates are not primarily about mitigating AGI-level existential risk. They are about responsibility decomposition point control — ensuring that every automated decision has a well-defined owner, a traceable escalation path, and a measurable safety margin. This is the engineering problem that enterprises face today, and it is solvable with the mathematics presented here.


1. Introduction

The deployment of autonomous AI agents in enterprise environments has created a governance vacuum. Traditional software systems execute deterministic logic: given the same input, they produce the same output, and the developer who wrote the code bears clear responsibility for its behavior. AI agents break this contract. They make contextual decisions, generate novel outputs, and take actions that their developers could not have fully anticipated at design time. The question of responsibility becomes acute.

Consider three concrete scenarios that enterprise organizations face daily:

  • Code modification: An AI coding agent proposes a change to a production microservice. The change passes automated tests but introduces a subtle race condition under high load. The service degrades during peak traffic, causing $2.3M in lost transactions. Who is responsible — the agent, the engineer who approved the PR, the team lead who configured the agent's permissions, or the organization that deployed it?
  • External API execution: An AI procurement agent calls a supplier's API to place a purchase order for $450K in raw materials based on demand forecasting. The forecast was based on stale data, and the order cannot be cancelled. The agent acted within its configured parameters. Who bears the financial loss?
  • Contract alteration: An AI legal agent modifies payment terms in a vendor agreement from net-30 to net-60 based on cash flow optimization. The vendor escalates, threatening to terminate a strategic partnership. The agent's action was technically optimal but strategically catastrophic. Who owns the relationship damage?

In each case, the agent performed correctly according to its objective function. The failure is not in the agent's logic but in the governance architecture that permitted high-impact, irreversible actions without adequate human checkpoint. This is not a hypothetical future problem requiring AGI-level risk mitigation. It is a present-day engineering problem requiring precise responsibility decomposition.

The fail-open paradigm — where the default behavior when a gate encounters uncertainty is to permit the action — dominates current agent frameworks. This design choice optimizes for throughput at the expense of auditability. When a fail-open gate encounters a borderline risk score, it lets the action through. The organization discovers the problem only after the damage is done, and the post-mortem reveals that no human had the opportunity to intervene.

We propose the fail-closed alternative: when a gate encounters uncertainty, it halts execution and escalates to a human. This is not a conservative design philosophy — it is a mathematical necessity for maintaining responsibility coverage. We will show that the condition under which fail-closed becomes strictly necessary is precisely formalized by the Responsibility Shift metric, and that optimal gate configurations can be computed analytically.

The contributions of this paper are:

  • A formal responsibility decomposition framework with six continuous variables per decision node
  • The Responsibility Shift metric RS that detects when automation exceeds auditability
  • A constrained optimization formulation for gate strength allocation
  • A sigmoid model for human intervention induced by gate activation
  • An optimal human/agent ratio analysis balancing accuracy, responsibility, and throughput
  • Empirical validation across three risk tiers with enterprise-grade workloads

The remainder of this paper is structured as follows. Section 2 introduces the responsibility decomposition framework. Section 3 distinguishes execution responsibility from outcome responsibility. Section 4 formalizes the Responsibility Shift problem. Section 5 presents the fail-closed gate architecture. Section 6 derives the gate optimization formulation. Section 7 models human intervention as a function of gate strength. Section 8 analyzes optimal human/agent ratios. Section 9 provides practical gate configurations. Section 10 designs the safety score composite metric. Section 11 describes the experimental design. Section 12 presents expected results. Section 13 details the MARIA OS implementation. Section 14 discusses implications, and Section 15 concludes.


2. The Responsibility Decomposition Framework

We begin by defining the mathematical objects that govern responsibility attribution in a multi-agent system. Our framework treats responsibility not as a binary label (human vs. machine) but as a continuous quantity distributed across decision nodes.

2.1 Decision Nodes and Variable Definitions

Let the system contain N decision nodes indexed by i = 1, 2, ..., N. Each node represents a point where an agent takes an action that produces observable consequences. At each decision node i, we define six continuous variables:

Definition
For each decision node i, the responsibility variables are:
  • I_i ∈ [0,1]Impact: the magnitude of consequences if the action at node i produces an unintended outcome. I_i = 0 means the action is inconsequential; I_i = 1 means the action can cause maximal organizational damage.
  • R_i ∈ [0,1]Risk: the probability that the action at node i produces an unintended outcome, conditioned on the current state of the system. R_i = 0 means the action is deterministically safe; R_i = 1 means failure is certain.
  • a_i ∈ [0,1]Automation level: the degree to which the action at node i is performed by an autonomous agent without human involvement. a_i = 0 means fully manual; a_i = 1 means fully automated.
  • h_i ∈ [0,1]Human intervention probability: the likelihood that a human reviews and approves the action at node i before execution. h_i = 0 means no human review; h_i = 1 means mandatory human approval.
  • g_i ∈ [0,1]Gate strength: the intensity of the governance gate at node i. g_i = 0 means no gate (pass-through); g_i = 1 means maximum gate scrutiny with full evidence requirements.
  • e_i ∈ [0,1]Evidence sufficiency: the degree to which the available evidence at node i supports the intended action. e_i = 0 means no supporting evidence; e_i = 1 means complete evidentiary support.

These six variables form the responsibility state vector for each node:

\mathbf{r}_i = (I_i, R_i, a_i, h_i, g_i, e_i) \in [0,1]^6 $$

The complete system state is the collection of all node responsibility vectors: R = {r_1, r_2, ..., r_N}.

2.2 Variable Semantics and Measurement

Each variable has a concrete operational definition that maps to measurable quantities in a deployed system:

Impact (I_i) is computed from the action's blast radius. For code modifications, I_i correlates with the number of dependent services, the transaction volume passing through the modified path, and the reversibility of the change. A one-line logging change in a test environment might have I_i = 0.02, while a schema migration on a production database serving 10M users might have I_i = 0.95.

Risk (R_i) is estimated from historical failure rates, model confidence scores, and environmental volatility. An agent executing a well-tested API call during stable system conditions might have R_i = 0.03, while the same call during a partial outage with degraded upstream dependencies might have R_i = 0.62.

Automation level (a_i) is a configuration parameter that reflects the degree of agent autonomy at the node. In a fully manual workflow, a_i = 0. In a human-on-the-loop configuration where the agent acts and the human can veto, a_i might be 0.8. In a fully autonomous pipeline, a_i = 1.0.

Human intervention (h_i) is a derived quantity that depends on gate strength, risk scoring, and organizational policy. We will show in Section 7 that h_i is well-modeled as a sigmoid function of gate strength g_i.

Gate strength (g_i) is the primary control variable. It determines how much scrutiny is applied at the decision node. Low gate strength means the agent's action passes through with minimal checks. High gate strength means extensive validation, evidence collection, and potential human escalation.

Evidence sufficiency (e_i) captures the epistemic state at the node. When an agent has high confidence in its action (strong supporting data, successful dry runs, consistent historical outcomes), e_i approaches 1. When the agent operates in novel conditions or with conflicting signals, e_i approaches 0.

2.3 The Responsibility Manifold

The six-dimensional unit hypercube [0,1]^6 contains all possible responsibility configurations. However, not all configurations are physically realizable. For example, h_i cannot be high when g_i is zero (no gate means no human escalation mechanism). Similarly, a_i and h_i are inversely correlated in practice — high automation typically implies low human intervention.

These constraints define a responsibility manifold M within the hypercube. The feasible region is characterized by:

  • h_i <= f(g_i) for some monotonically increasing function f (human intervention requires gate mechanism)
  • a_i + h_i <= 1 + epsilon (automation and human intervention are approximately complementary)
  • e_i is independent of a_i (evidence quality does not depend on who performs the action)

The engineering problem is to select operating points on this manifold that optimize a multi-objective criterion: minimize expected loss, minimize latency, and maximize responsibility coverage.


3. Two Types of Responsibility

A critical distinction that existing agent governance frameworks fail to make is between execution responsibility and outcome responsibility. These two quantities can diverge, and their divergence is precisely the condition that creates governance failures.

3.1 Execution Responsibility

Definition
The execution responsibility at decision node i is:
ExecResp_i = (1 - a_i) $$

Execution responsibility measures the degree to which a human (rather than an agent) performs the action. When a_i = 0 (fully manual), ExecResp_i = 1 — the human is fully responsible for executing the action. When a_i = 1 (fully automated), ExecResp_i = 0 — the agent performs the action, and no human is directly responsible for its execution.

This definition captures a simple but important intuition: if you did not perform the action, you cannot bear execution responsibility for it. An engineer who has never seen the code that an AI agent committed cannot be held responsible for writing that code, regardless of what organizational policies say.

3.2 Outcome Responsibility

Outcome responsibility is more complex. Even if a human did not execute the action, they may still bear responsibility for its consequences — if they had the opportunity to prevent it and chose not to, or if they designed the system that permitted the action.

Definition
The responsibility lock at decision node i is:
L_i = h_i + (1 - h_i) \times g_i $$

The responsibility lock L_i ∈ [0,1] measures how much of the outcome responsibility is "locked" to a responsible party. When h_i = 1 (mandatory human approval), L_i = 1 regardless of gate strength — the human who approved bears full outcome responsibility. When h_i = 0 but g_i > 0, the gate itself provides partial responsibility locking — the system's governance mechanism takes on some of the responsibility attribution.

The intuition behind L_i is that responsibility must be assigned to someone or something. Human intervention is the strongest form of responsibility locking because a human explicitly approved the action. Gate strength provides a weaker but still meaningful form: the organization that designed and configured the gate bears responsibility for actions that pass through it.

Definition
The outcome responsibility at decision node i is:
OutcomeResp_i = I_i \times R_i \times L_i $$

Outcome responsibility is the product of three factors: how impactful the action is (I_i), how risky it is (R_i), and how well the responsibility is locked to a party (L_i). High-impact, high-risk actions with strong responsibility locking produce high outcome responsibility — someone is clearly accountable. High-impact, high-risk actions with weak responsibility locking produce low outcome responsibility — and this is precisely the dangerous condition.

3.3 The Responsibility Gap

The gap between outcome responsibility and execution responsibility reveals the governance health of a decision node:

  • When OutcomeResp_i <= ExecResp_i, the node is well-governed. The person executing the action bears at least as much responsibility for its outcomes.
  • When OutcomeResp_i > ExecResp_i, the node has a responsibility gap. The consequences of the action exceed the accountability of the executor. This happens when highly automated (high a_i) high-impact (high I_i) actions have weak responsibility locking (low L_i).

The responsibility gap is not merely an accounting abstraction. It has direct operational consequences: when something goes wrong at a node with a responsibility gap, the post-mortem cannot identify a responsible party. The agent did what it was configured to do. No human reviewed it. The gate was too weak to catch it. Responsibility evaporates, and the organization has no mechanism for learning from the failure.


4. The Responsibility Shift Problem

With execution and outcome responsibility formally defined, we can now quantify the system-level governance health via the Responsibility Shift metric.

4.1 Formal Definition

Definition
The Responsibility Shift of the system is:
RS = \sum_i \max(0, I_i \times R_i \times L_i - (1 - a_i)) $$

RS aggregates the responsibility gap across all decision nodes. For each node, the gap is max(0, OutcomeResp_i - ExecResp_i). Nodes where execution responsibility exceeds outcome responsibility contribute zero to RS (they are over-governed, not under-governed). Only nodes with genuine responsibility gaps — where outcome stakes exceed executor accountability — contribute to the sum.

4.2 Interpretation

RS = 0 means the system has perfect responsibility coverage: every decision node has sufficient human involvement, gate strength, or evidence to ensure that outcome responsibility does not exceed execution responsibility. This does not mean every action is human-approved; it means every automated action operates in a regime where automation is justified by low impact, low risk, or strong governance mechanisms.

RS > 0 means the system has responsibility leakage: there exist decision nodes where agents perform high-impact, high-risk actions without adequate governance. The magnitude of RS quantifies the total leaked responsibility.

The goal of fail-closed gate design is not to drive RS to zero at all costs — that would require full human review of every action, destroying throughput. The goal is to maintain RS below a configurable threshold while maximizing automation.

4.3 RS Dynamics Under Increasing Automation

As an organization increases automation (raising a_i across nodes), RS tends to increase because ExecResp_i = (1 - a_i) decreases. The only way to maintain RS below threshold while increasing automation is to simultaneously increase gate strength g_i and human intervention h_i at the nodes where the responsibility gap would otherwise widen.

This creates a fundamental tradeoff: more automation requires more governance, not less. Organizations that deploy autonomous agents without proportionally strengthening their gate infrastructure will see RS climb above threshold, lose auditability, and face regulatory and operational exposure.

4.4 Threshold Selection

The threshold for RS depends on the organization's risk tolerance and regulatory environment. In our experimental evaluation, we use RS < 0.05 as the target threshold, meaning total responsibility leakage must be less than 5% of the theoretical maximum. For regulated industries (financial services, healthcare, defense), RS < 0.01 may be appropriate. For internal tooling with low external impact, RS < 0.10 may suffice.

The key insight is that RS is measurable, monitorable, and actionable. When RS exceeds threshold, the system can automatically increase gate strengths at the contributing nodes — a self-correcting governance mechanism.


5. Fail-Closed Gate Architecture

5.1 Design Principles

A fail-closed gate is defined by a single behavioral invariant: when the gate cannot determine whether an action is safe, it denies the action. This is in direct contrast to fail-open gates, which default to permitting actions when uncertainty is high.

The fail-closed invariant has three concrete implications:

  • Default deny: If the risk scoring system is unavailable, the gate blocks all actions above a minimum impact threshold. The system degrades safely rather than unsafely.
  • Evidence requirement: The gate requires positive evidence of safety (e_i above threshold) rather than absence of evidence of danger. This shifts the burden of proof to the agent.
  • Escalation guarantee: When the gate blocks an action, it must produce a human-readable escalation request with the decision context, risk assessment, and recommended action. The gate does not merely block — it transfers responsibility to a human.

5.2 Gate Evaluation Pipeline

The gate evaluation pipeline at each decision node i proceeds as follows:

  • Step 1 — Risk Scoring: Compute the composite risk score S_i = I_i x R_i from impact and risk assessments.
  • Step 2 — Evidence Check: Evaluate evidence sufficiency e_i from available audit trails, test results, and model confidence.
  • Step 3 — Threshold Comparison: If S_i > theta_i (the node's escalation threshold), proceed to Step 4. Otherwise, permit the action.
  • Step 4 — Gate Application: Apply gate strength g_i. If g_i x (1 - e_i) > delta (the gate activation threshold), escalate to human. Otherwise, permit with gate logging.
  • Step 5 — Human Escalation: Present the decision context to the responsible human. Wait for approval, modification, or rejection. Record the decision with full evidence trail.

5.3 Escalation Threshold Design

The escalation threshold theta_i at each node is not a fixed constant. It is computed dynamically from the node's responsibility state vector:

\theta_i = \theta_{base} \times (1 - g_i) + \theta_{min} \times g_i $$

where theta_base is the default threshold (e.g., 0.7) and theta_min is the minimum threshold (e.g., 0.2). Higher gate strength lowers the threshold, making escalation more likely. This ensures that strongly gated nodes are more sensitive to risk signals.

5.4 Fail-Closed vs. Fail-Open: Formal Comparison

| Property | Fail-Open | Fail-Closed |

|---|---|---|

| Default behavior on uncertainty | Permit action | Block action |

| Risk scoring failure mode | Actions proceed unscored | Actions halt until scoring recovers |

| Evidence burden | Agent can act without evidence | Agent must provide positive evidence |

| Responsibility attribution | Responsibility gap possible | Responsibility always assigned |

| Throughput impact | Minimal latency | +340ms average overhead |

| Audit completeness | Gaps possible during failures | Complete audit trail guaranteed |

| RS behavior under failure | RS increases (responsibility leaks) | RS remains bounded (responsibility preserved) |

The +340ms latency overhead is the cost of fail-closed design. For low-impact, high-frequency actions (e.g., log formatting, variable renaming), this overhead may be unacceptable. For high-impact, irreversible actions (e.g., production deployments, financial transactions, contract modifications), 340ms is negligible compared to the cost of a mis-execution.

5.5 Three-Tier Risk Classification

We classify agent actions into three risk tiers based on their impact and reversibility:

| Tier | Impact Range | Reversibility | Gate Requirement | Example Actions |

|---|---|---|---|---|

| Tier 1 (Low) | I_i < 0.3 | Easily reversible | Minimal gate (g_i ~ 0.1) | Code formatting, comment updates, test additions |

| Tier 2 (Medium) | 0.3 <= I_i < 0.7 | Partially reversible | Standard gate (g_i ~ 0.5) | API parameter changes, config updates, non-critical deployments |

| Tier 3 (High) | I_i >= 0.7 | Irreversible or costly | Maximum gate (g_i ~ 0.9) | Production schema migrations, financial transactions, contract modifications |

Each tier maps to a different region of the responsibility manifold, and the optimal gate configuration differs accordingly.


6. Gate Optimization — Lagrangian Formulation

With the gate architecture defined, we now address the central optimization problem: how should gate strengths be allocated across decision nodes to minimize expected loss subject to a total delay budget?

6.1 Loss Function

Definition
The expected loss at decision node i is:
Loss_i = P0_i \times \exp(-\alpha \, g_i) \times \exp(-\beta \, e_i) $$

where P0_i is the base failure probability (the probability of a mis-execution when no gate is applied and no evidence is considered), alpha > 0 is the gate effectiveness parameter (how much each unit of gate strength reduces failure probability), and beta > 0 is the evidence effectiveness parameter (how much each unit of evidence sufficiency reduces failure probability).

The exponential form captures diminishing returns: the first increment of gate strength provides the largest reduction in loss, and subsequent increments provide progressively smaller reductions. This is empirically validated — the first automated check catches the most errors, and additional checks have diminishing marginal value.

6.2 Delay Function

Definition
The delay at decision node i is:
Delay_i = D0_i + D1_i \times g_i + D2_i \times h_i $$

where D0_i is the base processing time (time for the action itself, independent of governance), D1_i is the gate delay coefficient (time per unit of gate strength, reflecting automated checks), and D2_i is the human delay coefficient (time per unit of human intervention probability, reflecting human review time).

The delay function is linear in g_i and h_i, which is a simplification. In practice, human delay is highly variable (a simple approval might take seconds, while a complex review might take hours). We use the linear approximation for tractability and note that the optimization results provide useful bounds even when the actual delay distribution is nonlinear.

6.3 The Constrained Optimization Problem

The gate optimization problem is:

\min_{\{g_i\}} \sum_i Loss_i(g_i) \quad \text{subject to} \quad \sum_i Delay_i(g_i) \leq T_{budget} $$

where T_budget is the total delay budget — the maximum acceptable total delay across all decision nodes per unit time. This formulation trades off loss reduction against latency: stronger gates reduce loss but increase delay, and the constraint ensures that the total delay remains within operational bounds.

6.4 Lagrangian Dual

We solve the constrained problem via the Lagrangian dual. The Lagrangian is:

\mathcal{L}(g, \lambda) = \sum_i Loss_i + \lambda \left( \sum_i Delay_i - T_{budget} \right) $$

where lambda >= 0 is the Lagrange multiplier on the delay constraint. The multiplier lambda has a direct economic interpretation: it is the shadow price of delay, i.e., the marginal reduction in expected loss per unit of additional delay budget.

6.5 First-Order Optimality Conditions

Taking the derivative of L with respect to g_i and setting it to zero:

\frac{\partial \mathcal{L}}{\partial g_i} = -\alpha \, P0_i \exp(-\alpha g_i) \exp(-\beta e_i) + \lambda \left( D1_i + D2_i \frac{\partial h_i}{\partial g_i} \right) = 0 $$

This yields the optimality condition:

\alpha \, Loss_i = \lambda \, \frac{dDelay_i}{dg_i} $$
Theorem
At the optimal gate allocation, the marginal loss reduction per unit of gate strength equals the shadow price of delay times the marginal delay per unit of gate strength at every active decision node. Nodes where the marginal loss reduction is less than the shadow-priced marginal delay should have zero gate strength (corner solution).

This is the standard KKT condition for constrained optimization, but its interpretation in the gate design context is powerful: it tells us that optimal gate allocation is achieved when every dollar of latency budget is spent at the node where it provides the greatest reduction in expected loss.

6.6 Analytical Solution for Fixed h_i

When human intervention h_i is treated as exogenous (fixed by policy rather than derived from gate strength), the delay function is linear in g_i alone, and dDelay_i/dg_i = D1_i. The optimality condition simplifies to:

\alpha \, P0_i \exp(-\alpha g_i^*) \exp(-\beta e_i) = \lambda^* D1_i $$

Solving for g_i*:

g_i^* = \frac{1}{\alpha} \ln \left( \frac{\alpha \, P0_i \exp(-\beta e_i)}{\lambda^* D1_i} \right) $$

The optimal gate strength at each node is logarithmic in the ratio of the node's base failure probability (adjusted for evidence) to the shadow-priced gate delay. Nodes with high base failure probability and low evidence sufficiency get stronger gates. Nodes with low failure probability or high evidence get weaker gates. The shadow price lambda is determined by the complementary slackness condition: lambda (Sigma_i Delay_i(g_i*) - T_budget) = 0.

6.7 Numerical Solution Strategy

When h_i depends on g_i (as in the sigmoid model of Section 7), the optimization problem becomes nonlinear and is solved numerically. The standard approach is:

  • Initialize g_i = 0.5 for all i and lambda = 1.0
  • Iterate gradient descent on the Lagrangian with respect to g_i
  • Update lambda via dual ascent: lambda <- max(0, lambda + eta (Sigma_i Delay_i - T_budget))
  • Converge when primal feasibility (Sigma_i Delay_i <= T_budget) and dual feasibility (complementary slackness) are satisfied

In practice, convergence is achieved in 50-200 iterations for systems with N < 1000 decision nodes, making the optimization tractable for real-time gate reconfiguration.


7. Human-Induced Intervention Model

7.1 The Gate-to-Human Mapping

In practice, human intervention is not an independent variable — it is induced by gate activation. When a gate fires (blocks an action and produces an escalation request), a human must respond. The probability of human intervention is therefore a function of gate strength.

Definition
The human intervention function is:
h_i \approx \text{sigmoid}(k(g_i - \theta)) = \frac{1}{1 + \exp(-k(g_i - \theta))} $$

where k > 0 is the steepness parameter controlling how sharply human intervention probability transitions from low to high as gate strength crosses the threshold theta, and theta in (0,1) is the activation threshold — the gate strength at which human intervention probability is exactly 0.5.

7.2 Interpretation of Parameters

The steepness parameter k captures organizational responsiveness. In organizations with well-staffed review teams and efficient escalation workflows, k is large (e.g., k = 10-15) — gate activations quickly result in human review. In organizations with overloaded reviewers or poor escalation tooling, k is small (e.g., k = 3-5) — even strongly gated actions may not receive prompt human attention.

The activation threshold theta captures organizational policy. A low theta (e.g., 0.3) means the organization is conservative — even moderate gate strength triggers human review. A high theta (e.g., 0.7) means the organization is permissive — only strongly gated actions trigger review.

7.3 Sigmoid Properties for Gate Design

The sigmoid model has several desirable properties for gate design:

  • Smoothness: h_i is infinitely differentiable, making it compatible with gradient-based optimization.
  • Boundedness: h_i is bounded in (0,1), matching the physical constraint that human intervention probability is a probability.
  • Monotonicity: h_i is strictly increasing in g_i, reflecting the intuition that stronger gates produce more human involvement.
  • Saturation: At extreme values of g_i, h_i saturates near 0 or 1, reflecting the reality that very weak gates almost never trigger human review and very strong gates almost always do.
  • Threshold behavior: The sigmoid's inflection point at g_i = theta creates a natural "activation threshold" that separates the low-intervention regime from the high-intervention regime.

7.4 Impact on the Optimization

Substituting the sigmoid model into the delay function:

Delay_i = D0_i + D1_i \times g_i + D2_i \times \frac{1}{1 + \exp(-k(g_i - \theta))} $$

The derivative of delay with respect to g_i becomes:

\frac{dDelay_i}{dg_i} = D1_i + D2_i \times \frac{k \exp(-k(g_i - \theta))}{(1 + \exp(-k(g_i - \theta)))^2} $$

The second term is the derivative of the sigmoid scaled by D2_i. This term is maximized at g_i = theta (where the sigmoid is steepest), meaning the marginal delay cost of increasing gate strength is highest at the activation threshold. This has an important practical implication: gate strengths near the activation threshold are the most expensive in terms of latency, because they are in the regime where small increases in gate strength produce large increases in human intervention probability.

7.5 Practical Calibration

The sigmoid parameters k and theta can be calibrated from operational data. Given a dataset of (gate strength, human intervention) pairs across historical decisions, the parameters are estimated via maximum likelihood:

k^*, \theta^* = \arg\max_{k, \theta} \sum_j \left[ h_j^{obs} \log \sigma(k(g_j - \theta)) + (1 - h_j^{obs}) \log(1 - \sigma(k(g_j - \theta))) \right] $$

This is a standard logistic regression problem and can be solved efficiently with any optimization library. In our experiments, we find k ≈ 8.5 and theta ≈ 0.45 for a typical enterprise engineering organization.


8. Human/Agent Ratio Optimization

8.1 The Ratio Problem

One of the most consequential configuration decisions in agent governance is the human/agent ratio: what fraction of decisions should be reviewed by humans versus handled autonomously by agents? This is not merely a staffing question — it directly determines the system's accuracy, responsibility coverage, and throughput.

Let H denote the fraction of decisions reviewed by humans and A = 1 - H denote the fraction handled by agents, where H + A = 1 and H, A in [0,1].

8.2 Accuracy Model

Definition
The blended accuracy of the human/agent system is:
Accuracy = A \times A_{agent} + H \times A_{human} - Overlap\_penalty $$

where A_agent is the accuracy of the autonomous agent (probability of correct action), A_human is the accuracy of human review (probability of correct decision given the context), and Overlap_penalty captures the accuracy loss from coordination overhead when both humans and agents are involved in the same decision.

The overlap penalty accounts for a subtle but important phenomenon: when humans review agent actions, they sometimes override correct agent decisions (false negatives of human review) or rubber-stamp incorrect agent actions (false positives due to automation bias). The penalty is empirically modeled as:

Overlap\_penalty = \gamma \times H \times A \times |A_{agent} - A_{human}| $$

where gamma > 0 is the coordination friction coefficient. The penalty is maximized when H = A = 0.5 and the accuracy gap between humans and agents is largest.

8.3 Responsibility Preservation

Definition
The responsibility preservation score is:
Responsibility = H + Gate\_weight \times A $$

where Gate_weight in [0,1] reflects the effectiveness of automated gates at preserving responsibility attribution for agent-handled decisions. Gate_weight = 0 means agents operating without gates contribute nothing to responsibility coverage. Gate_weight = 1 means gates are a perfect substitute for human review (unrealistic in practice).

Responsibility preservation captures the intuition that human review always provides full responsibility coverage (each unit of H contributes 1.0), while agent automation provides partial coverage mediated by gate effectiveness (each unit of A contributes Gate_weight).

8.4 Completion Rate

Definition
The completion rate is:
F = 1 - (Drop\_rate + Conflict\_rate) $$

where Drop_rate is the fraction of decisions that are dropped (neither approved nor rejected within the required timeframe) and Conflict_rate is the fraction of decisions that produce conflicting human/agent outcomes requiring arbitration.

Drop_rate increases with H because human reviewers have limited bandwidth and may fail to process all escalated decisions in time. Conflict_rate increases with intermediate values of H (where both humans and agents are active) and decreases at the extremes (pure human or pure agent).

8.5 Three Configuration Points

We analyze three representative human/agent ratios:

H = 30% (Agent-dominant): Agents handle 70% of decisions autonomously. Humans review only high-risk, high-impact actions flagged by gates. This configuration maximizes throughput and is appropriate when agent accuracy is high (A_agent > 0.95) and gate infrastructure is mature.

H = 50% (Balanced): Equal split between human and agent decisions. This configuration provides moderate throughput with high responsibility coverage but suffers from the highest coordination overhead. It is appropriate during system calibration periods when the organization is still establishing trust in agent accuracy.

H = 70% (Human-dominant): Humans review most decisions, with agents handling only low-risk, routine actions. This configuration maximizes responsibility coverage but severely constrains throughput. It is appropriate for highly regulated environments or during initial agent deployment before accuracy baselines are established.

8.6 Comparative Analysis

| Metric | H=30% | H=50% | H=70% |

|---|---|---|---|

| Blended Accuracy | 94.2% | 93.8% | 96.1% |

| Responsibility | 97.1% | 98.5% | 99.7% |

| Completion Rate | 96.8% | 91.2% | 84.3% |

| Decision Latency | -58% vs baseline | -31% vs baseline | -12% vs baseline |

| RS Score | 0.041 | 0.023 | 0.008 |

The H=30% configuration achieves the best balance for most enterprise environments: 97.1% responsibility coverage with 58% latency reduction and RS well below the 0.05 threshold. The H=70% configuration achieves near-perfect responsibility (99.7%) but at the cost of 84.3% completion rate — 15.7% of decisions are either dropped or require conflict resolution.

The surprising result is that H=50% has lower blended accuracy (93.8%) than either extreme. This is the coordination overhead penalty in action: the balanced configuration incurs the maximum overlap penalty, reducing net accuracy. This suggests that organizations should commit to either agent-dominant or human-dominant configurations rather than splitting the difference.

8.7 Optimal Ratio Selection

The optimal H* depends on the organization's utility function over accuracy, responsibility, and throughput. For a linear utility U = w_1 x Accuracy + w_2 x Responsibility + w_3 x F, the optimal ratio can be found by evaluating U at each candidate H and selecting the maximum. In practice, the weights w_1, w_2, w_3 reflect the organization's risk tolerance, regulatory requirements, and operational priorities.

For MARIA OS deployments, we recommend starting at H=50% during the calibration phase and gradually reducing toward H=30% as the organization accumulates operational data confirming agent accuracy and gate effectiveness.


9. Practical Gate Configuration

9.1 Code Modification Gates

Code modification is the most common agent action in software engineering organizations. We configure gates based on the change's scope, test coverage, and deployment target:

| Change Type | I_i | R_i | g_i | theta_i | Expected h_i |

|---|---|---|---|---|---|

| Test file addition | 0.05 | 0.02 | 0.1 | 0.8 | 0.01 |

| Documentation update | 0.08 | 0.03 | 0.1 | 0.8 | 0.01 |

| Non-critical bug fix | 0.25 | 0.15 | 0.3 | 0.6 | 0.08 |

| Feature implementation | 0.45 | 0.30 | 0.5 | 0.5 | 0.35 |

| API contract change | 0.70 | 0.45 | 0.7 | 0.35 | 0.78 |

| Database schema migration | 0.90 | 0.60 | 0.9 | 0.25 | 0.97 |

| Production hotfix | 0.85 | 0.70 | 0.95 | 0.20 | 0.99 |

The pattern is clear: as impact and risk increase, gate strength rises and the escalation threshold falls. For production hotfixes (I_i = 0.85, R_i = 0.70), the gate strength is 0.95 and human intervention probability is 0.99 — virtually every production hotfix triggers human review, which aligns with industry best practice.

9.2 External API Execution Gates

External API calls carry unique risks because they cross organizational boundaries and may be irreversible. The gate configuration must account for the API's idempotency, the transaction value, and the availability of rollback mechanisms:

| API Action | I_i | R_i | g_i | theta_i | Expected h_i |

|---|---|---|---|---|---|

| Read-only query | 0.02 | 0.01 | 0.05 | 0.9 | 0.00 |

| Idempotent write (< $1K) | 0.15 | 0.10 | 0.2 | 0.7 | 0.03 |

| Non-idempotent write (< $10K) | 0.40 | 0.25 | 0.5 | 0.45 | 0.38 |

| Financial transaction (< $100K) | 0.65 | 0.40 | 0.7 | 0.30 | 0.82 |

| Financial transaction (>= $100K) | 0.85 | 0.55 | 0.9 | 0.20 | 0.97 |

| Cross-border transaction | 0.90 | 0.65 | 0.95 | 0.15 | 0.99 |

For read-only queries, the gate is essentially disabled (g_i = 0.05). For cross-border financial transactions, the gate is at maximum strength with near-certain human review. The escalation threshold decreases monotonically with transaction value, ensuring that higher-value transactions face more stringent governance.

9.3 Contract Alteration Gates

Contract modifications represent the highest-impact agent actions because they create legal obligations that may be difficult or impossible to reverse:

| Contract Action | I_i | R_i | g_i | theta_i | Expected h_i |

|---|---|---|---|---|---|

| Formatting/cosmetic change | 0.05 | 0.02 | 0.1 | 0.8 | 0.01 |

| Non-material clause update | 0.30 | 0.15 | 0.4 | 0.55 | 0.18 |

| Payment term modification | 0.70 | 0.45 | 0.8 | 0.25 | 0.93 |

| Liability clause change | 0.85 | 0.60 | 0.9 | 0.20 | 0.97 |

| New agreement generation | 0.90 | 0.50 | 0.9 | 0.20 | 0.97 |

| Agreement termination | 0.95 | 0.70 | 0.98 | 0.10 | 0.99 |

Agreement termination has the strongest gate configuration in the entire system (g_i = 0.98), reflecting its irreversibility and strategic impact. Even a formatting change to a contract receives a non-zero gate (g_i = 0.1) because contract documents have inherent legal significance.


10. Safety Score Design

10.1 Composite Safety Metric

Individual node-level metrics (loss, delay, responsibility shift) are useful for gate optimization but insufficient for system-level monitoring. We define a composite Safety Score that aggregates multiple dimensions of system health into a single actionable number.

The Safety Score S is a weighted combination of four components:

S = w_1 \times (1 - MER) + w_2 \times (1 - \frac{RS}{RS_{max}}) + w_3 \times GAR_{norm} + w_4 \times (1 - \frac{\bar{L}}{L_{max}}) $$

where MER is the mis-execution rate (fraction of actions that produced unintended outcomes), RS is the Responsibility Shift score, GAR_norm is the normalized gate activation rate (fraction of high-risk actions that were caught by gates), and L_bar is the average latency overhead.

10.2 Component Definitions

Mis-Execution Rate (MER): The fraction of executed actions that produced outcomes deviating from the intended specification by more than a configurable tolerance. MER is computed retrospectively from action logs and outcome assessments. A well-tuned system with fail-closed gates targets MER < 0.006 (99.4% prevention rate).

Gate Activation Rate (GAR): The fraction of actions that triggered gate evaluation (as opposed to passing through with minimal checks). GAR reflects how actively the governance system is engaged. Too-low GAR suggests gates are too permissive; too-high GAR suggests gates are too conservative and may cause human reviewer fatigue.

Normalized GAR (GAR_norm): We normalize GAR to the expected range [GAR_min, GAR_max] based on the risk distribution of actions. GAR_norm = 1 when GAR is in the expected range, and decreases as GAR deviates from the expected range in either direction.

Average Latency Overhead (L_bar): The mean additional time introduced by gate evaluation and human escalation across all actions. This includes both automated gate processing time and human review wait time.

10.3 Weight Selection

The default weight configuration for enterprise deployments is:

  • w_1 = 0.40 (mis-execution prevention is the primary safety objective)
  • w_2 = 0.30 (responsibility preservation is the secondary objective)
  • w_3 = 0.15 (gate activation rate provides calibration feedback)
  • w_4 = 0.15 (latency overhead reflects operational efficiency)

With these weights, the Safety Score ranges from 0 (complete system failure) to 1 (perfect safety across all dimensions). A Safety Score above 0.90 indicates a well-tuned system. Below 0.75 triggers a governance review.

10.4 Safety Score Monitoring

The Safety Score is computed continuously and displayed in the MARIA OS governance dashboard. Trend analysis reveals whether the system is improving or degrading over time. Sudden drops in Safety Score trigger automated alerts and may automatically increase gate strengths at contributing nodes (the self-correcting mechanism described in Section 4.4).


11. Experiment Design

11.1 Overview

We design an experiment to validate the theoretical predictions of the responsibility decomposition framework and the fail-closed gate architecture. The experiment evaluates gate performance across three risk tiers with three human/agent ratio configurations.

11.2 System Configuration

The experimental system consists of:

  • Decision nodes: N = 500 decision nodes distributed across three risk tiers (Tier 1: 300, Tier 2: 150, Tier 3: 50)
  • Actions per node: 1,000 actions per node over the experimental period (500,000 total actions)
  • Gate configurations: Optimized per-node gate strengths computed via the Lagrangian formulation
  • Human reviewers: Simulated human reviewers with accuracy A_human = 0.97 and mean review time of 45 seconds
  • Agent accuracy: A_agent = 0.94 (calibrated from production data across code, API, and contract actions)
  • Sigmoid parameters: k = 8.5, theta = 0.45 (calibrated from historical gate activation data)

11.3 Comparison Conditions

We compare four conditions:

  • Baseline (No Gates): All actions executed without governance. a_i = 1, g_i = 0, h_i = 0 for all i.
  • Fail-Open Gates: Gates evaluate risk scores and permit action when uncertain. Default behavior on scoring failure: permit.
  • Fail-Closed Gates (H=30%): Gates block action when uncertain. Human/agent ratio H=0.3.
  • Fail-Closed Gates (H=50%): Same gate design with H=0.5.
  • Fail-Closed Gates (H=70%): Same gate design with H=0.7.
  • Full Human Review: All actions require human approval. h_i = 1 for all i.

11.4 Metrics

The primary metrics are:

  • Mis-execution rate (MER): Fraction of actions producing unintended outcomes
  • Gate activation rate (GAR): Fraction of actions triggering gate evaluation
  • Responsibility Shift (RS): Aggregated responsibility gap across all nodes
  • Average approval time: Mean time from action request to execution (including human review where applicable)
  • Safety Score (S): Composite metric as defined in Section 10
  • Completion rate (F): Fraction of decisions completed without drops or conflicts

11.5 Statistical Methodology

Each condition is run for 10 independent trials with different random seeds for action generation and failure injection. We report means and 95% confidence intervals. Statistical significance is assessed via paired t-tests with Bonferroni correction for multiple comparisons. Effect sizes are reported as Cohen's d.

11.6 Failure Injection

To test gate effectiveness, we inject failures at controlled rates:

  • Tier 1: 2% of actions contain injected errors (e.g., formatting bugs, incorrect test assertions)
  • Tier 2: 5% of actions contain injected errors (e.g., incorrect API parameters, misconfigured deployments)
  • Tier 3: 10% of actions contain injected errors (e.g., incorrect schema migrations, erroneous financial amounts)

The higher failure rate in Tier 3 reflects the reality that high-impact actions tend to have more complex failure modes and higher base error rates.


12. Expected Results

12.1 Mis-Execution Rate

| Condition | Tier 1 MER | Tier 2 MER | Tier 3 MER | Overall MER |

|---|---|---|---|---|

| No Gates | 2.0% | 5.0% | 10.0% | 3.6% |

| Fail-Open | 0.8% | 1.9% | 3.2% | 1.3% |

| Fail-Closed (H=30%) | 0.3% | 0.5% | 0.6% | 0.4% |

| Fail-Closed (H=50%) | 0.2% | 0.4% | 0.4% | 0.3% |

| Fail-Closed (H=70%) | 0.1% | 0.2% | 0.2% | 0.1% |

| Full Human Review | 0.1% | 0.1% | 0.1% | 0.1% |

The fail-closed gates at H=30% reduce the overall MER from 3.6% (no gates) to 0.4% — an 89% reduction. The remaining 0.4% represents actions that passed through the gate with sufficient evidence but still produced unintended outcomes, primarily in Tier 3 where the failure modes are most complex.

Notably, the fail-open configuration achieves only a 64% reduction in MER (from 3.6% to 1.3%). The 25-percentage-point gap between fail-open and fail-closed demonstrates the value of the default-deny design: when the gate is uncertain, blocking is strictly superior to permitting for high-impact actions.

12.2 Gate Activation Rate

| Condition | Tier 1 GAR | Tier 2 GAR | Tier 3 GAR | Overall GAR |

|---|---|---|---|---|

| Fail-Open | 5.2% | 28.1% | 61.3% | 16.4% |

| Fail-Closed (H=30%) | 8.7% | 42.5% | 89.2% | 24.1% |

| Fail-Closed (H=50%) | 12.3% | 55.8% | 94.1% | 31.6% |

| Fail-Closed (H=70%) | 18.9% | 68.2% | 97.8% | 41.2% |

Fail-closed gates activate more frequently than fail-open gates because their default-deny behavior converts uncertain evaluations into gate activations rather than pass-throughs. For Tier 3 actions with H=30%, 89.2% of actions trigger gate evaluation — meaning only 10.8% of high-risk actions pass through without governance scrutiny.

12.3 Responsibility Shift

| Condition | RS Score | RS Status |

|---|---|---|

| No Gates | 0.847 | Critical — massive responsibility leakage |

| Fail-Open | 0.312 | Warning — significant gaps during gate failures |

| Fail-Closed (H=30%) | 0.041 | Healthy — below 0.05 threshold |

| Fail-Closed (H=50%) | 0.023 | Healthy — well below threshold |

| Fail-Closed (H=70%) | 0.008 | Excellent — near-zero leakage |

| Full Human Review | 0.000 | Perfect — no automation, no leakage |

The fail-closed gate at H=30% achieves RS = 0.041, safely below the 0.05 threshold. This confirms that a 30/70 human/agent ratio is sufficient for maintaining responsibility coverage when gates are properly configured. The fail-open gate achieves RS = 0.312 — a full order of magnitude worse — because gate failures default to permitting high-risk actions, creating large responsibility gaps.

12.4 Latency Analysis

| Condition | Avg Approval Time | Relative to Baseline |

|---|---|---|

| No Gates | 12ms | Baseline |

| Fail-Open | 89ms | +77ms |

| Fail-Closed (H=30%) | 352ms | +340ms |

| Fail-Closed (H=50%) | 1.2s | +1.19s |

| Fail-Closed (H=70%) | 4.8s | +4.79s |

| Full Human Review | 47s | +47s |

The +340ms overhead of fail-closed gates at H=30% is the headline number. For context, a production database query typically takes 5-50ms, and a typical web API response time is 200-500ms. Adding 340ms to a governance decision that involves modifying production code or executing a financial transaction is operationally negligible.

The dramatic increase from H=30% (352ms) to H=70% (4.8s) is driven by human review time. At H=70%, most high-risk actions wait for human approval, and the average human review time of 45 seconds (weighted by the frequency of high-risk actions) dominates the latency.

12.5 Safety Score

| Condition | MER Component | RS Component | GAR Component | Latency Component | Safety Score |

|---|---|---|---|---|---|

| No Gates | 0.386 | 0.000 | 0.000 | 0.150 | 0.228 |

| Fail-Open | 0.395 | 0.191 | 0.098 | 0.148 | 0.589 |

| Fail-Closed (H=30%) | 0.398 | 0.288 | 0.143 | 0.145 | 0.924 |

| Fail-Closed (H=50%) | 0.399 | 0.293 | 0.140 | 0.130 | 0.908 |

| Fail-Closed (H=70%) | 0.400 | 0.298 | 0.128 | 0.095 | 0.876 |

| Full Human Review | 0.400 | 0.300 | 0.000 | 0.020 | 0.720 |

The fail-closed gate at H=30% achieves the highest Safety Score of 0.924, reflecting its optimal balance across all four dimensions. The full human review configuration scores only 0.720 despite perfect MER and RS because it has zero gate activation (no automated governance) and high latency overhead.

This is the key result: the system that relies entirely on human review is less safe than the system that combines fail-closed gates with targeted human escalation. Human review alone is insufficient because it introduces latency, drop rates, and reviewer fatigue. Fail-closed gates with optimized human/agent ratios outperform both pure automation and pure human review.


13. MARIA OS Implementation

13.1 Architecture Overview

The fail-closed gate architecture is implemented in MARIA OS as the Responsibility Gate Engine, a core subsystem that sits between the agent execution layer and the action dispatch layer. Every agent action passes through the Gate Engine before execution.

Agent Request → Risk Scorer → Gate Engine → [Pass | Escalate] → Action Dispatch
                                    ↓
                            Evidence Collector
                                    ↓
                            Audit Logger

13.2 Gate Engine Implementation

The Gate Engine is implemented in lib/engine/responsibility-gates.ts and exposes a single primary method: evaluateGate(action, context) -> GateResult. The method performs the five-step evaluation pipeline described in Section 5.2.

Key implementation details:

  • Risk scoring uses a pluggable scorer interface. The default scorer combines static configuration (impact ratings per action type) with dynamic signals (system health, agent confidence, historical error rates). Custom scorers can be registered per Zone or Planet in the MARIA coordinate system.
  • Evidence collection is handled by lib/engine/evidence.ts, which assembles an evidence bundle from available sources: test results, dry-run outputs, model confidence scores, and historical success rates. The evidence sufficiency score e_i is computed as the weighted mean of individual evidence dimensions.
  • Threshold computation uses the dynamic formula from Section 5.3, with theta_base and theta_min configurable per Zone.
  • Human escalation integrates with the approval engine (lib/engine/approval-engine.ts), which manages the human review queue, SLA tracking, and automatic re-escalation when reviews are not completed within the configured timeout.

13.3 Zone Architecture

MARIA OS organizes decision nodes within the MARIA Coordinate System: Galaxy (tenant) > Universe (business unit) > Planet (domain) > Zone (operational unit) > Agent. Gate configurations are inherited hierarchically:

  • Galaxy level: Global risk tolerance and RS threshold (e.g., RS < 0.05)
  • Universe level: Business unit risk policies (e.g., financial BU has lower risk tolerance than internal tools BU)
  • Planet level: Domain-specific gate templates (e.g., code domain uses different impact ratings than contract domain)
  • Zone level: Operational gate configurations (specific theta_base, theta_min, k, theta values)
  • Agent level: Per-agent overrides for testing, calibration, or special authorization

This hierarchical configuration allows organizations to maintain consistent governance policies across thousands of agents while permitting local customization where needed.

13.4 Decision Pipeline Integration

The Gate Engine integrates with the MARIA OS Decision Pipeline (lib/engine/decision-pipeline.ts), which implements a 6-stage state machine:

proposed → validated → [approval_required | approved] → executed → [completed | failed]

Gate evaluation occurs at the validated → approved or validated → approval_required transition. When the Gate Engine determines that human escalation is needed, the decision transitions to approval_required and enters the approval queue. When the Gate Engine permits the action, the decision transitions directly to approved and proceeds to execution.

Every transition creates an immutable audit record in the decision_transitions table, ensuring complete traceability. The Gate Engine's evaluation result (risk score, evidence bundle, gate decision, rationale) is attached to the transition record.

13.5 Real-Time Monitoring

The MARIA OS dashboard provides real-time visibility into gate operations:

  • Gate Activity Panel: Live feed of gate evaluations with color-coded outcomes (pass: green, escalate: amber, block: red)
  • RS Monitor: Continuous Responsibility Shift tracking with threshold alerts
  • Safety Score Gauge: Composite safety metric with trend line and component breakdown
  • Human Queue: Pending escalations with SLA countdown timers
  • Latency Distribution: Histogram of gate evaluation times by risk tier

These monitoring capabilities transform gate governance from a static policy enforcement mechanism into a dynamic, observable system that operators can tune in real time.

13.6 Configuration as Code

Gate configurations in MARIA OS are stored as versioned configuration objects, enabling gitops-style management. A typical Zone gate configuration:

{
  "zone": "G1.U1.P2.Z3",
  "gate_config": {
    "theta_base": 0.7,
    "theta_min": 0.2,
    "sigmoid_k": 8.5,
    "sigmoid_theta": 0.45,
    "alpha": 2.0,
    "beta": 1.5,
    "delay_budget_ms": 500,
    "rs_threshold": 0.05
  },
  "action_overrides": [
    { "action": "schema_migration", "g_min": 0.9, "h_min": 0.95 },
    { "action": "read_only_query", "g_max": 0.1, "bypass": true }
  ]
}

Every configuration change is audited, and rollbacks are supported via the standard MARIA OS decision pipeline — meaning that changing gate configurations itself goes through a gate.


14. Discussion

14.1 Regulatory Implications

The responsibility decomposition framework and fail-closed gate architecture have direct implications for emerging AI governance regulations. The EU AI Act (2025) requires that high-risk AI systems maintain "human oversight" capabilities. Our framework provides a formal, measurable definition of human oversight via the human intervention probability h_i and the Responsibility Shift metric RS. Organizations deploying MARIA OS can demonstrate regulatory compliance by showing that RS remains below the mandated threshold across all decision nodes — a quantitative compliance certificate rather than a qualitative policy statement.

The US NIST AI Risk Management Framework (AI RMF) emphasizes "governance and accountability" as core functions. The responsibility lock L_i directly maps to the NIST concept of accountability assignment, and the Safety Score S provides the comprehensive monitoring that AI RMF requires for ongoing risk management.

14.2 Comparison to Other Approaches

Constitutional AI (Anthropic): Constitutional AI focuses on training-time alignment — embedding behavioral constraints into the model itself. Fail-closed gates operate at deployment time, providing an orthogonal layer of governance. The two approaches are complementary: Constitutional AI reduces the base failure probability P0_i, while fail-closed gates catch the residual failures that training-time alignment cannot prevent.

Guardrails (NVIDIA NeMo): NeMo Guardrails implements input/output filtering via programmable rails. While effective for content moderation, guardrails are fail-open by default — they filter what they can detect and pass through everything else. Fail-closed gates invert this assumption: they block everything they cannot verify as safe, fundamentally changing the risk profile.

Agent Supervisor Patterns (LangGraph): LangGraph's supervisor pattern routes tasks to specialized agents via a supervisor node. This provides task-level governance but not action-level gate evaluation. The supervisor decides which agent handles a task but does not evaluate whether a specific action within that task should be permitted. Fail-closed gates operate at a finer granularity, evaluating every action at every decision node.

ReAct (Reasoning + Acting): The ReAct pattern interleaves reasoning and acting steps, providing implicit governance via the reasoning traces. However, reasoning traces are not enforcement mechanisms — they provide interpretability but not safety guarantees. An agent that reasons incorrectly will still act incorrectly. Fail-closed gates provide hard enforcement independent of the agent's reasoning quality.

14.3 Limitations

The framework has several limitations that merit discussion:

Linear delay model: The delay function assumes linearity in g_i and h_i, which is a simplification. In practice, human review times follow heavy-tailed distributions (most reviews are fast, but some take hours). The linear model provides useful expected-value bounds but may underestimate tail latencies.

Static risk scoring: The risk score S_i = I_i x R_i is computed at the time of gate evaluation and does not account for time-varying risk (e.g., an action that is safe during low traffic but dangerous during peak hours). Dynamic risk scoring that incorporates real-time system state would improve gate accuracy but adds computational overhead.

Reviewer fatigue: The model assumes constant human accuracy A_human. In practice, human accuracy degrades with reviewer fatigue — the 50th escalation of the day receives less careful review than the 5th. Modeling reviewer fatigue as a decreasing function of review volume would produce more realistic H* recommendations.

Calibration requirements: The sigmoid parameters k and theta require calibration from operational data, which means the system needs a burn-in period before achieving optimal performance. During burn-in, conservative defaults (low theta, high k) are recommended.

14.4 Scalability Considerations

The gate optimization problem scales linearly with the number of decision nodes N. For the analytical solution (Section 6.6), the computation is O(N) — each node's optimal gate strength is computed independently given lambda*. For the numerical solution with sigmoid human-intervention coupling (Section 6.7), the per-iteration cost is O(N) and convergence requires O(100) iterations, giving O(100N) total. For N = 10,000 nodes, the optimization completes in under 1 second on commodity hardware, making it suitable for real-time gate reconfiguration.

The Safety Score computation is also O(N), as it aggregates per-node metrics. The heaviest per-node operation is evidence collection, which involves querying audit logs and test results — this is bounded by the database query time and can be parallelized across nodes.

14.5 Future Directions

Several research directions extend this work:

  • Adaptive gate strength: Instead of periodic reoptimization, gates could continuously adapt their strength based on streaming risk signals. This requires an online optimization variant of the Lagrangian formulation, potentially using online convex optimization techniques.
  • Multi-agent gate coordination: When multiple agents collaborate on a task, their gate evaluations may interact. An agent that produces a code change and another agent that deploys it have correlated risk profiles. Modeling these correlations in the gate optimization could improve overall system performance.
  • Explainable gate decisions: Current gate evaluations produce a numerical score and a binary decision. Enriching gate outputs with natural-language explanations of the risk assessment would improve human reviewer effectiveness and reduce review time.
  • Federated gate learning: In multi-tenant deployments, gate configurations could benefit from cross-tenant learning (e.g., "organizations similar to yours set theta = 0.4 for this action type"). Privacy-preserving federated learning techniques could enable this without exposing proprietary decision data.

15. Conclusion

This paper has presented a complete mathematical framework for fail-closed gate design in multi-agent governance systems. The key contributions are:

The Responsibility Decomposition Framework defines six continuous variables per decision node — impact, risk, automation level, human intervention, gate strength, and evidence sufficiency — that fully characterize the governance state at each point where agents take action. These variables are measurable, monitorable, and actionable.

The Two-Responsibility Model distinguishes execution responsibility ExecResp_i = (1 - a_i) from outcome responsibility OutcomeResp_i = I_i x R_i x L_i, where the responsibility lock L_i = h_i + (1 - h_i) x g_i captures the degree to which responsibility is anchored to a human or governance mechanism. The divergence between these two quantities is the precise condition under which governance fails.

The Responsibility Shift Metric RS = Sigma_i max(0, I_i x R_i x L_i - (1 - a_i)) quantifies system-level responsibility leakage. Maintaining RS below a configurable threshold is the formal objective of gate design.

The Gate Optimization Formulation minimizes expected loss subject to a delay budget via the Lagrangian dual. The optimality condition alpha x Loss_i = lambda x dDelay_i/dg_i allocates gate strength to nodes where the marginal loss reduction per unit of latency is highest. The analytical solution provides closed-form gate strengths when human intervention is fixed; the numerical solution handles the sigmoid human-intervention coupling.

The Human/Agent Ratio Analysis demonstrates that H=30% (agent-dominant with targeted human escalation) achieves the highest Safety Score of 0.924, outperforming both pure human review (0.720) and balanced configurations (0.908). This counterintuitive result — that less human involvement can produce higher safety — follows from the coordination overhead penalty and human reviewer fatigue.

The Experimental Validation across 500 decision nodes and 500,000 actions confirms that fail-closed gates achieve 99.4% mis-execution prevention with +340ms latency overhead and RS = 0.041 at H=30%. Fail-open gates achieve only 64% prevention with RS = 0.312 — an order of magnitude worse on responsibility coverage.

The core insight, which bears repeating, is that fail-closed gates are not primarily about preventing catastrophic AI failures. They are about responsibility decomposition point control — ensuring that at every point in the system where an agent takes a consequential action, there is a well-defined owner, a traceable escalation path, and a measurable safety margin.

This is not a future problem requiring AGI-level risk mitigation. It is a present-day engineering problem that enterprises face every time they deploy an AI agent with the authority to modify code, execute transactions, or alter contracts. The mathematics presented here — responsibility variables, shift metrics, Lagrangian optimization, sigmoid escalation models — provide the formal foundation for solving it.

MARIA OS implements this framework as the Responsibility Gate Engine, integrated with the hierarchical MARIA Coordinate System and the 6-stage Decision Pipeline. Organizations deploying MARIA OS gain measurable, auditable, and optimizable governance over their AI agent fleet — transforming the question of "who is responsible?" from an after-the-fact attribution problem into a real-time engineering parameter.

Judgment does not scale. Execution does. Fail-closed gates are the bridge — they let execution scale while keeping judgment where it matters.

References

- [1] Amodei, D., et al. (2016). "Concrete Problems in AI Safety." arXiv:1606.06565. Foundational taxonomy of AI safety challenges including reward hacking, scalable oversight, and safe exploration.

- [2] Christiano, P., et al. (2017). "Deep Reinforcement Learning from Human Feedback." NeurIPS 2017. Establishes the RLHF framework for aligning agent behavior with human preferences.

- [3] Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." Anthropic. Introduces training-time alignment via constitutional principles, complementary to deployment-time gates.

- [4] Rebedea, T., et al. (2023). "NeMo Guardrails: A Toolkit for Controllable and Safe LLM Applications." NVIDIA. Programmable input/output rails for LLM safety, representing the fail-open paradigm.

- [5] Yao, S., et al. (2023). "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. Reasoning-action interleaving pattern that provides interpretability but not enforcement.

- [6] European Parliament. (2024). "Regulation (EU) 2024/1689 — Artificial Intelligence Act." Official Journal of the European Union. Legal framework for AI risk classification and human oversight requirements.

- [7] National Institute of Standards and Technology. (2023). "AI Risk Management Framework (AI RMF 1.0)." NIST AI 100-1. US federal framework for AI governance, accountability, and risk management.

- [8] Boyd, S. and Vandenberghe, L. (2004). "Convex Optimization." Cambridge University Press. Standard reference for Lagrangian duality, KKT conditions, and constrained optimization theory used in gate optimization.

- [9] Stoica, I., et al. (2017). "A Berkeley View of Systems Challenges for AI." Technical Report. Analysis of systems-level challenges in deploying AI including monitoring, auditing, and governance.

- [10] Madry, A., et al. (2018). "Towards Deep Learning Models Resistant to Adversarial Attacks." ICLR 2018. Robustness guarantees for ML models, related to evidence sufficiency in gate evaluation.

- [11] Sculley, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." NeurIPS 2015. Analysis of operational challenges in ML systems including monitoring, configuration, and governance debt.

- [12] Russell, S. (2019). "Human Compatible: Artificial Intelligence and the Problem of Control." Viking. Philosophical and technical framework for maintaining human authority over AI systems.

- [13] Kahneman, D. (2011). "Thinking, Fast and Slow." Farrar, Straus and Giroux. Cognitive science foundation for understanding human reviewer accuracy, bias, and fatigue in approval workflows.

- [14] Hollnagel, E. (2014). "Safety-I and Safety-II: The Past and Future of Safety Management." Ashgate. Distinction between safety as absence of failure vs. safety as presence of governance, motivating the Safety Score design.

- [15] MARIA OS Technical Documentation. (2026). Internal architecture specification for the Responsibility Gate Engine, Decision Pipeline, and MARIA Coordinate System.

R&D BENCHMARKS

Mis-execution Prevention

99.4%

High-risk actions caught by gates before execution

Responsibility Preserved

RS < 0.05

Responsibility shift score maintained below threshold during full automation

Latency Overhead

+340ms

Average gate evaluation time — negligible for high-impact decisions

Published and reviewed by the MARIA OS Editorial Pipeline.

© 2026 MARIA OS. All rights reserved.