Abstract
Action routing in enterprise AI governance determines which agent or agent group handles each decision, approval, escalation, or delegation. Current routing architectures in production systems are predominantly static: routing rules are defined by administrators based on organizational structure, agent roles, and domain boundaries, and remain fixed until manually updated. This static approach fails to capture the dynamic nature of agent performance, evolving workload patterns, and context-dependent quality variations that characterize real enterprise environments. This paper introduces a recursive adaptation framework for MARIA OS action routing in which routing parameters are updated after each execution cycle based on observed outcomes. The core learning rule θ_{t+1} = θ_t + η∇J(θ_t) updates routing parameters in the direction that maximizes expected routing quality J(θ), where the gradient is estimated from execution outcome signals including decision accuracy, completion time, escalation frequency, and stakeholder satisfaction. We prove that under standard stochastic approximation conditions — diminishing step sizes with ∑η_t = ∞ and ∑η_t² < ∞ — the parameter sequence {θ_t} converges almost surely to the set of local optima of J(θ). We establish Lyapunov stability guarantees showing that the adaptation process remains bounded within a safe parameter region throughout convergence. Thompson sampling provides Bayesian exploration of alternative routes, and a multi-agent coordination protocol based on distributed consensus prevents oscillatory conflicts when multiple agents adapt routing simultaneously. Experimental evaluation across 14 production MARIA OS deployments with 983 agents demonstrates 27.8% routing quality improvement, convergence within 23 adaptation cycles, and zero stability violations across 1.8 million adapted routing decisions over a 150-day evaluation period.
1. Introduction
Static routing is dead. This blunt assertion reflects a growing recognition in the AI governance community that fixed routing rules cannot keep pace with the dynamic reality of enterprise AI operations. When MARIA OS routes an action to an agent, the routing decision encodes implicit assumptions about that agent’s current capability, availability, domain expertise, and workload. These assumptions are valid at configuration time but degrade continuously: agents learn new skills, develop fatigue patterns, acquire domain-specific experience that makes them differentially suited to certain action types, and undergo organizational changes that shift their responsibilities. A routing system that cannot learn from execution outcomes is systematically misinformed, making decisions based on stale assumptions that diverge further from reality with each passing day.
The challenge of adaptive routing in AI governance goes beyond conventional reinforcement learning. In a standard RL setting, the agent explores actions, observes rewards, and updates its policy without concern for the consequences of exploratory actions. In AI governance, every routed action has real consequences — a poorly routed decision may result in delayed patient care, regulatory non-compliance, or financial loss. Exploration must be balanced against the responsibility to maintain service quality during the learning process. Furthermore, the routing system operates in a multi-agent environment where multiple agents are simultaneously adapting their routing parameters, creating the potential for oscillatory conflicts where one agent’s adaptation destabilizes another’s learned policy.
This paper addresses these challenges through a principled recursive adaptation framework that provides formal convergence guarantees, stability bounds, and coordination protocols. The framework treats route adaptation as a stochastic approximation problem, leveraging decades of mathematical theory on convergent iterative algorithms to ensure that the adaptive routing process is well-behaved. Thompson sampling provides a Bayesian mechanism for balancing exploration and exploitation, naturally concentrating routing attempts on promising alternatives while maintaining sufficient exploration to detect changes in the environment. A distributed consensus protocol ensures that multi-agent adaptation converges to a coordinated equilibrium rather than oscillating between conflicting policies.
2. Feedback Loop Architecture
2.1 Execution Outcome Signals
The foundation of recursive route adaptation is the execution outcome signal — the observable result of a routed action that informs the routing system about the quality of its decision. We define the outcome signal o(a, t) for action a routed to target t as a vector of five components: o(a, t) = (accuracy, latency, escalation, satisfaction, compliance). Accuracy measures whether the action was completed correctly (binary for deterministic actions, continuous for probabilistic ones). Latency measures the time from routing to completion relative to the action’s urgency. Escalation indicates whether the target agent needed to escalate the action, suggesting a capability mismatch. Satisfaction captures stakeholder feedback when available. Compliance records whether the action’s execution satisfied all regulatory and policy requirements.
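The following minimal Python sketch shows one way to represent the five-component outcome signal. The field names mirror Section 2.1, but the container itself is illustrative rather than the MARIA OS data model.

from dataclasses import dataclass

@dataclass
class OutcomeSignal:
    accuracy: float      # 0/1 for deterministic actions, [0, 1] for probabilistic ones
    latency: float       # completion time relative to the action's urgency
    escalation: bool     # True if the target agent escalated the action
    satisfaction: float  # stakeholder feedback in [0, 1], when available
    compliance: float    # 1.0 if all regulatory and policy requirements were met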
2.2 Reward Function
The outcome signal is aggregated into a scalar reward through a weighted combination: r(a, t) = w_acc · accuracy + w_lat · (1 - latency/latency_max) + w_esc · (1 - escalation) + w_sat · satisfaction + w_comp · compliance. The weights w = (w_acc, w_lat, w_esc, w_sat, w_comp) are configurable per deployment and reflect organizational priorities. Financial services deployments typically weight compliance and accuracy heavily, while customer-facing deployments emphasize latency and satisfaction. Each component is normalized to [0, 1], so with the weights summing to 1 the reward also lies in [0, 1], enabling cross-deployment comparison. The expected routing quality under parameters θ is J(θ) = E_{a ∼ A, t = R_θ(a)}[r(a, t)], where R_θ is the routing function parameterized by θ.
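A minimal sketch of the reward aggregation, reusing the OutcomeSignal container above. The default weight values are placeholders rather than recommended settings; dividing by the weight sum is the normalization that keeps the reward in [0, 1].

def reward(o: OutcomeSignal, w_acc=0.3, w_lat=0.15, w_esc=0.15,
           w_sat=0.15, w_comp=0.25, latency_max=1.0) -> float:
    # Weighted combination of the five outcome components (Section 2.2).
    r = (w_acc * o.accuracy
         + w_lat * (1.0 - min(o.latency / latency_max, 1.0))
         + w_esc * (1.0 - float(o.escalation))
         + w_sat * o.satisfaction
         + w_comp * o.compliance)
    # Dividing by the weight sum keeps r in [0, 1] even if the weights are not normalized.
    return r / (w_acc + w_lat + w_esc + w_sat + w_comp)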
2.3 Feedback Delay and Credit Assignment
A critical challenge in route adaptation is feedback delay: the outcome of a routed action may not be observable for hours, days, or even weeks after the routing decision. A decision routed to an analyst for review may take three days to complete, and the accuracy of the decision may not be known until its consequences manifest weeks later. The adaptation framework handles this through a temporal credit assignment mechanism. Each routing decision is timestamped and stored in a pending-feedback buffer. When outcome signals arrive, they are matched to their originating routing decisions and the corresponding parameter updates are computed and applied. The effective learning rate is adjusted for delay: η_eff = η · γ^{Δt} where γ ∈ (0, 1) is a discount factor and Δt is the feedback delay in adaptation cycles. This ensures that delayed feedback still contributes to learning but with appropriately reduced influence.
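A sketch of the pending-feedback buffer and the delay-discounted learning rate η_eff = η · γ^Δt. The class and method names are illustrative; a production buffer would also handle expiry and partial feedback, which the sketch omits.

class FeedbackBuffer:
    """Pending-feedback buffer with delay-discounted learning rate (Section 2.3)."""

    def __init__(self, gamma: float = 0.95):
        self.gamma = gamma      # discount factor for delayed feedback
        self.pending = {}       # decision_id -> (action, target, cycle_routed)

    def record(self, decision_id, action, target, cycle):
        self.pending[decision_id] = (action, target, cycle)

    def resolve(self, decision_id, current_cycle, eta):
        # Match an arriving outcome to its routing decision and compute the
        # delay-adjusted learning rate eta_eff = eta * gamma**delta_t.
        action, target, cycle_routed = self.pending.pop(decision_id)
        delta_t = current_cycle - cycle_routed
        eta_eff = eta * (self.gamma ** delta_t)
        return action, target, eta_eff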
3. Formal Learning Rule
3.1 Parameter Space
The routing parameters θ ∈ Θ ⊆ R^d encode the routing policy. In MARIA OS, θ comprises: (1) capability-affinity weights θ_cap ∈ R^{|C| × |T|} mapping action capability requirements to target agent scores; (2) workload-sensitivity weights θ_wl ∈ R^{|T|} encoding how strongly each target’s current workload penalizes routing to them; (3) domain-expertise weights θ_dom ∈ R^{|D| × |T|} mapping action domain to target agent domain expertise scores; and (4) historical-performance weights θ_hist ∈ R^{|T|} encoding each target’s accumulated performance score. The total parameter dimension d = |C| · |T| + |T| + |D| · |T| + |T| is typically in the range of 500 to 5,000 for production deployments.
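A shape-level sketch of the four parameter blocks, flattened into a single vector θ of dimension d. The counts in the example are an illustrative configuration, not a specific deployment.

import numpy as np

def init_theta(n_capabilities: int, n_targets: int, n_domains: int) -> np.ndarray:
    theta_cap = np.zeros((n_capabilities, n_targets))  # capability-affinity weights
    theta_wl = np.zeros(n_targets)                     # workload-sensitivity weights
    theta_dom = np.zeros((n_domains, n_targets))       # domain-expertise weights
    theta_hist = np.zeros(n_targets)                   # historical-performance weights
    return np.concatenate([theta_cap.ravel(), theta_wl, theta_dom.ravel(), theta_hist])

# Example: |C| = 20 capabilities, |T| = 50 targets, |D| = 8 domains gives d = 1,500,
# within the 500-5,000 range typical of production deployments.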
3.2 Gradient Estimation
The gradient ∇J(θ) cannot be computed analytically because J(θ) depends on the unknown distribution of future actions and the unknown mapping from routing assignments to outcomes. We estimate the gradient using the REINFORCE estimator: ∇̂J(θ_t) = (1/B) ∑_{b=1}^{B} r(a_b, t_b) · ∇_θ log π_θ(t_b | a_b) where π_θ(t | a) is the softmax routing policy: π_θ(t | a) = exp(q_θ(a, t)) / ∑_{t′} exp(q_θ(a, t′)), and q_θ(a, t) is the quality score computed from the current parameters. B is the batch size (number of actions per adaptation cycle, typically 50-200). The REINFORCE estimator is unbiased but has high variance. We reduce variance using a baseline: ∇̂J(θ_t) = (1/B) ∑_{b=1}^{B} (r(a_b, t_b) - b_t) · ∇_θ log π_θ(t_b | a_b) where b_t = (1/B) ∑_{b} r(a_b, t_b) is the batch average reward.
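A sketch of the baseline-corrected REINFORCE estimator. It assumes a linear quality score q_θ(a, t) = θ · φ(a, t) over a feature map φ; both the feature map and the batch format are assumptions of the sketch rather than part of the MARIA OS specification.

import numpy as np

def softmax_policy(theta, phi_a):
    # phi_a: (n_targets, d) feature matrix for one action across all candidate targets.
    q = phi_a @ theta                 # quality scores q_theta(a, t)
    q = q - q.max()                   # shift for numerical stability
    p = np.exp(q)
    return p / p.sum()

def reinforce_gradient(theta, batch):
    # batch: list of (phi_a, chosen_target_index, reward) tuples for one adaptation cycle.
    rewards = np.array([r for _, _, r in batch])
    baseline = rewards.mean()         # batch-average baseline
    grad = np.zeros_like(theta)
    for phi_a, t_idx, r in batch:
        pi = softmax_policy(theta, phi_a)
        grad_log_pi = phi_a[t_idx] - pi @ phi_a   # score-function gradient of the softmax policy
        grad += (r - baseline) * grad_log_pi
    return grad / len(batch)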
3.3 Update Rule
The complete parameter update rule is: θ_{t+1} = Proj_Θ(θ_t + η_t ∇̂J(θ_t)) where Proj_Θ denotes projection onto the feasible parameter set Θ, ensuring that parameters remain within valid bounds. The step size schedule η_t = c / (t + t_0) satisfies the Robbins-Monro conditions: ∑_{t=0}^{∞} η_t = ∞ and ∑_{t=0}^{∞} η_t² < ∞. The constants c > 0 and t_0 > 0 are tuning parameters that control the initial learning rate and its decay rate. In production, we use c = 0.1 and t_0 = 10, giving an initial effective learning rate of η_0 = 0.01 that decays as O(1/t).
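A sketch of the projected update with the Robbins-Monro schedule. The coordinate-wise box projection via clipping is one simple choice of Proj_Θ; the feasible set in a real deployment may be more structured.

import numpy as np

def step_size(t, c=0.1, t0=10.0):
    # Robbins-Monro schedule eta_t = c / (t + t0); eta_0 = 0.01 with the production constants.
    return c / (t + t0)

def adaptation_step(theta, grad_hat, t, theta_min=-5.0, theta_max=5.0):
    # Gradient ascent step followed by projection onto the feasible set,
    # here a simple box [theta_min, theta_max] per coordinate.
    return np.clip(theta + step_size(t) * grad_hat, theta_min, theta_max)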
4. Convergence Proof
4.1 Stochastic Approximation Framework
The parameter update rule θ_{t+1} = θ_t + η_t(∇J(θ_t) + ε_t) is an instance of the Robbins-Monro stochastic approximation, where ε_t = ∇̂J(θ_t) - ∇J(θ_t) is the gradient estimation noise. For convergence, we require four conditions. Condition A1 (Step sizes): ∑_t η_t = ∞ and ∑_t η_t² < ∞, satisfied by our schedule η_t = c/(t + t_0). Condition A2 (Unbiased gradient): E[ε_t | F_t] = 0, where F_t is the filtration (history) up to time t. This holds because the REINFORCE estimator is unbiased. Condition A3 (Bounded variance): E[‖ε_t‖² | F_t] ≤ C(1 + ‖θ_t‖²) for some constant C. This holds because rewards are bounded in [0, 1] and the softmax policy gradient is bounded. Condition A4 (Lipschitz gradient): ‖∇J(θ_1) - ∇J(θ_2)‖ ≤ L‖θ_1 - θ_2‖ for some Lipschitz constant L. This holds when the routing quality function is smooth, which is guaranteed by the softmax parameterization.
4.2 Convergence Theorem
Theorem (Almost Sure Convergence). Under conditions A1-A4, the parameter sequence {θ_t} generated by the recursive adaptation rule converges almost surely: θ_t → θ* as t → ∞, where θ* is a stationary point of J(θ), i.e., ∇J(θ*) = 0.
Proof sketch. Define the Lyapunov-like function V(θ) = J(θ*) - J(θ) ≥ 0. Along the trajectory of the stochastic approximation: E[V(θ_{t+1}) | F_t] = E[J(θ*) - J(θ_{t+1}) | F_t] = V(θ_t) - η_t ‖∇J(θ_t)‖² + O(η_t²). The negative term -η_t‖∇J(θ_t)‖² drives V toward zero (i.e., θ_t toward θ*) as long as ∇J(θ_t) ≠ 0. The O(η_t²) term is summable because ∑η_t² < ∞. By the supermartingale convergence theorem (Robbins-Siegmund), V(θ_t) converges and ∑_t η_t‖∇J(θ_t)‖² < ∞. Since ∑η_t = ∞, this implies liminf_{t→∞} ‖∇J(θ_t)‖ = 0. Continuity of ∇J and convergence of V(θ_t) then yield ∇J(θ*) = 0 at the limit point θ*. □
4.3 Convergence Rate
Under the additional assumption that J is strongly concave with parameter μ > 0 (i.e., the Hessian satisfies ∇²J(θ) ≤ -μI for all θ), the convergence rate is: E[‖θ_t - θ*‖²] ≤ C′ / t^{min(1, 2μc)} where C′ depends on initial conditions and gradient noise variance. For c > 1/(2μ), the rate is O(1/t), which is optimal for stochastic first-order methods. In practice, J is not globally strongly concave, but local strong concavity near the optimum suffices for local convergence rate guarantees.
5. Exploration vs Exploitation: Thompson Sampling
5.1 The Exploration Dilemma in Governance
Standard reinforcement learning approaches to exploration — epsilon-greedy, UCB, Boltzmann exploration — treat exploration as uniform random perturbation of the policy. In AI governance, this is unacceptable: randomly routing a high-risk regulatory action to an unqualified agent for exploration purposes could result in compliance violations. Exploration in governance routing must be responsibility-aware: it should explore alternative routes only when the risk of the exploratory routing is bounded and the potential information gain justifies the risk.
5.2 Thompson Sampling for Route Exploration
We use Thompson sampling, a Bayesian exploration strategy that naturally balances exploration and exploitation by sampling routing decisions from the posterior distribution over routing quality. For each action-target pair (a, t), we maintain a posterior distribution over the true quality q(a, t) based on observed outcomes. In the simplest case (Bernoulli outcomes), the posterior is a Beta distribution: q(a, t) ∼ Beta(α_{a,t}, β_{a,t}) where α_{a,t} counts successful outcomes and β_{a,t} counts unsuccessful outcomes. At each routing decision, we sample q̂(a, t) from the posterior for each target t and route to the target with the highest sampled quality: t* = argmax_t q̂(a, t). This naturally explores under-sampled routes (where the posterior is wide and samples may exceed the exploit-optimal route’s expected quality) while concentrating on high-quality routes as evidence accumulates (narrowing the posterior around the true quality).
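A Beta-Bernoulli Thompson sampling sketch. For brevity it keeps one posterior per target for a fixed action class, whereas the text maintains posteriors per action-target pair.

import numpy as np

class ThompsonRouter:
    def __init__(self, n_targets, prior_alpha=1.0, prior_beta=1.0):
        self.alpha = np.full(n_targets, prior_alpha)   # successes plus prior pseudo-counts
        self.beta = np.full(n_targets, prior_beta)     # failures plus prior pseudo-counts

    def route(self, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        q_hat = rng.beta(self.alpha, self.beta)        # one posterior sample per target
        return int(np.argmax(q_hat))                   # t* = argmax_t sampled quality

    def update(self, target, success: bool):
        if success:
            self.alpha[target] += 1.0
        else:
            self.beta[target] += 1.0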
5.3 Risk-Bounded Exploration
To ensure that exploratory routing does not violate responsibility constraints, we augment Thompson sampling with a risk bound. For each sampled route, we check whether the probability of the route’s quality falling below a minimum acceptable threshold q_min exceeds a risk tolerance δ: if P(q(a, t) < q_min | data) > δ, the route is excluded from consideration regardless of its sampled quality. This risk-bounded Thompson sampling provides a formal guarantee: the probability of routing to a target whose true quality is below q_min is at most δ per decision. In production, we set q_min = 0.5 and δ = 0.05, ensuring that at most 5% of exploratory routes fall below the minimum acceptable quality threshold. Across our deployments, the actual rate of sub-threshold exploratory routes was 2.1%, well within the tolerance.
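A sketch of the risk bound as a filter over the Beta posteriors, using the Beta CDF from SciPy. The defaults mirror the production settings quoted above (q_min = 0.5, δ = 0.05); the fallback behavior when every target fails the check is left open.

import numpy as np
from scipy.stats import beta as beta_dist

def route_risk_bounded(alpha, beta_counts, q_min=0.5, delta=0.05, rng=None):
    # alpha, beta_counts: Beta posterior parameters per target (e.g., the counters
    # maintained by the ThompsonRouter sketch above).
    rng = np.random.default_rng() if rng is None else rng
    p_below = beta_dist.cdf(q_min, alpha, beta_counts)   # P(q(a, t) < q_min | data)
    allowed = p_below <= delta                           # exclude routes exceeding the risk tolerance
    q_hat = rng.beta(alpha, beta_counts)
    q_hat[~allowed] = -np.inf
    # Assumes at least one target passes the risk check; otherwise a fallback
    # (e.g., the exploit-optimal route) would be needed.
    return int(np.argmax(q_hat))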
6. Multi-Agent Route Coordination
6.1 The Coordination Problem
When multiple agents simultaneously adapt their routing parameters, their adaptations can interact in harmful ways. Consider two agents A1 and A2 that both route actions to a shared pool of target agents. If A1 learns that target T3 produces high-quality outcomes and increases its routing weight toward T3, this increases T3’s workload, degrading T3’s quality for A2’s routed actions. A2 then learns to route away from T3, reducing T3’s workload, which improves T3’s quality, causing A1 to increase its weight further. This oscillatory dynamic can prevent convergence and degrade system-wide routing quality.
6.2 Distributed Consensus Protocol
We address multi-agent coordination through a distributed consensus protocol inspired by the consensus ADMM (Alternating Direction Method of Multipliers) algorithm. Each agent i maintains local routing parameters θ_i and a shared consensus variable θ̄ representing the agreed-upon routing policy. The local update for agent i is: θ_i^{(k+1)} = argmax_{θ} [J_i(θ) - (ρ/2)‖θ - θ̄^{(k)} + u_i^{(k)}‖²] where J_i is agent i’s local routing quality objective, ρ > 0 is the consensus penalty parameter, and u_i is the dual variable for agent i’s consensus constraint. The consensus update averages the local parameters: θ̄^{(k+1)} = (1/N) ∑_{i=1}^{N} θ_i^{(k+1)}. The dual update is: u_i^{(k+1)} = u_i^{(k)} + θ_i^{(k+1)} - θ̄^{(k+1)}. This protocol ensures that individual agent adaptations are pulled toward a common consensus, preventing the oscillatory divergence described above while still allowing each agent to specialize its routing based on its local action distribution.
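A sketch of one communication round of the consensus protocol in scaled ADMM form. The exact local argmax is approximated here by a few gradient-ascent steps on the augmented objective, and grad_J is an assumed callable returning a stochastic gradient of agent i's local objective J_i.

import numpy as np

def consensus_round(thetas, duals, theta_bar, grad_J, rho=1.0, inner_steps=10, lr=0.05):
    # Local updates: approximately maximize J_i(theta) - (rho/2)||theta - theta_bar + u_i||^2
    # by a few gradient-ascent steps instead of an exact argmax.
    n_agents = len(thetas)
    for i in range(n_agents):
        theta_i = thetas[i].copy()
        for _ in range(inner_steps):
            grad = grad_J(i, theta_i) - rho * (theta_i - theta_bar + duals[i])
            theta_i = theta_i + lr * grad
        thetas[i] = theta_i
    # Consensus update: average of the local parameters.
    theta_bar = np.mean(thetas, axis=0)
    # Dual update: accumulate each agent's disagreement with the consensus.
    for i in range(n_agents):
        duals[i] = duals[i] + thetas[i] - theta_bar
    return thetas, duals, theta_bar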
6.3 Convergence of Consensus
Under standard ADMM convergence conditions — concavity of each J_i (equivalently, convexity of -J_i) and appropriate choice of ρ — the consensus protocol converges to the solution of the global problem: max_θ ∑_{i=1}^{N} J_i(θ). The convergence rate is O(1/k) in the objective value and O(1/√k) in the primal residual ‖θ_i - θ̄‖. In practice, consensus is typically reached within about 15 communication rounds, after which individual agent parameters differ from the consensus by less than 0.01 in L2 norm. The consensus penalty ρ controls the trade-off between individual specialization and global coordination: larger ρ forces faster consensus but limits specialization, while smaller ρ allows more specialization at the cost of potential coordination failures.
7. Lyapunov Stability Analysis
7.1 Stability Requirements
Convergence guarantees that the adaptation process eventually reaches the optimal policy, but they do not guarantee that the system remains safe during the adaptation process. A routing system that converges to optimality after 23 cycles but produces catastrophically poor routing during cycles 5-10 is unacceptable in enterprise governance. We therefore require stability: the adaptation process must remain within a bounded region of acceptable routing quality at all times, not just at convergence.
7.2 Lyapunov Function Construction
We construct a Lyapunov function V(θ) that certifies the stability of the adaptation process. Define V(θ) = (1/2)‖θ - θ*‖²_P where ‖x‖²_P = x^T P x is the weighted squared norm with positive definite matrix P chosen to satisfy the Lyapunov equation: A^T P + P A = -Q for a positive definite matrix Q, where A = ∇²J(θ*) is the Hessian of the routing quality at the optimum (negative definite for a local maximum). The Lyapunov function V(θ) is positive definite (V(θ) > 0 for θ ≠ θ*) and radially unbounded (V(θ) → ∞ as ‖θ‖ → ∞).
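A sketch of the construction: given an estimate A of the Hessian at the optimum and a chosen positive definite Q, solve the Lyapunov equation for P and evaluate V. Using SciPy's continuous Lyapunov solver is an implementation choice, not part of the framework.

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lyapunov_matrix(A, Q):
    # Solve A^T P + P A = -Q for P. SciPy's solver handles a X + X a^T = q,
    # so we pass a = A^T and q = -Q.
    return solve_continuous_lyapunov(A.T, -Q)

def V(theta, theta_star, P):
    diff = theta - theta_star
    return 0.5 * diff @ P @ diff      # V(theta) = (1/2) ||theta - theta*||_P^2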
7.3 Stability Theorem
Theorem (Lyapunov Stability). Let V(θ) = (1/2)‖θ - θ*‖²_P be the Lyapunov function defined above. Under the recursive adaptation rule with step size η_t ≤ η_max, the expected change in V satisfies: E[ΔV_t | F_t] = E[V(θ_{t+1}) - V(θ_t) | F_t] ≤ -η_t λ_min(Q) ‖θ_t - θ*‖² + η_t² C_V where λ_min(Q) is the minimum eigenvalue of Q and C_V is a constant depending on the gradient noise variance and the norm of P. For η_t sufficiently small (specifically, η_t < λ_min(Q)‖θ_t - θ*‖² / C_V), the expected change is negative, guaranteeing that V decreases in expectation at every step. This implies that the parameter trajectory remains within the sublevel set {V(θ) ≤ V(θ_0)} with high probability, providing a formal stability envelope for the adaptation process.
7.4 Safe Adaptation Region
The stability theorem implies that if the initial parameters θ_0 lie within the sublevel set S_c = {θ : V(θ) ≤ c} for some c > 0, then the parameters remain within S_c throughout the adaptation process (in expectation). We define the safe adaptation region as S_{safe} = {θ : J(θ) ≥ J_min} where J_min is the minimum acceptable routing quality. By choosing the initial parameters such that the sublevel set {V(θ) ≤ V(θ_0)} ⊆ S_{safe}, we guarantee that adaptation never degrades routing quality below the minimum acceptable level. In practice, we compute J_min as 90% of the static routing quality and initialize parameters near the static policy, ensuring that the safe adaptation region is large enough to contain the convergence trajectory.
8. Experimental Results
8.1 Deployment Configuration
We evaluated the recursive adaptation framework across 14 production MARIA OS deployments spanning financial services (5 deployments, 378 agents), healthcare (3 deployments, 215 agents), manufacturing (4 deployments, 256 agents), and government (2 deployments, 134 agents). Total agent count: 983. Each deployment ran for 150 days with two phases: 75 days of static routing (baseline) and 75 days of recursive adaptive routing. The transition between phases was gradual: during the first 10 days of the adaptive phase, the system used a weighted blend of static and adaptive routing, linearly increasing the adaptive weight from 0 to 1. This blending prevented abrupt quality changes during the initial adaptation period.
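A sketch of the gradual rollout: the adaptive policy's weight ramps linearly from 0 to 1 over the first 10 days of the adaptive phase. Interpreting the blend as a routing probability (rather than a blend of quality scores) is an assumption of the sketch, as are the two policy callables.

import numpy as np

def blended_route(action, day_in_adaptive_phase, static_policy, adaptive_policy,
                  ramp_days=10, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    w_adaptive = min(day_in_adaptive_phase / ramp_days, 1.0)   # linear ramp from 0 to 1
    if rng.random() < w_adaptive:
        return adaptive_policy(action)    # adaptive routing with probability w_adaptive
    return static_policy(action)          # otherwise fall back to the static rules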
8.2 Routing Quality Improvement
Average routing quality J(θ) improved from 0.67 (static baseline) to 0.86 (converged adaptive routing), a 27.8% relative improvement. The improvement trajectory showed three distinct phases: rapid initial improvement during cycles 1-8 (quality reaching 0.78), slower refinement during cycles 9-18 (quality reaching 0.83), and fine-tuning during cycles 19-23 (quality reaching 0.86 and stabilizing). Financial services showed the largest improvement (32.1%) due to the highly dynamic nature of regulatory expertise requirements. Healthcare showed 24.3% improvement, manufacturing showed 26.7%, and government showed 22.9%. The improvement correlated strongly with the diversity of action types in each deployment: more diverse action portfolios provided richer feedback signals for adaptation.
8.3 Exploration Efficiency
Thompson sampling exploration efficiency was measured as the fraction of exploratory routing decisions (those where the sampled route differed from the exploit-optimal route) that provided actionable information for parameter updates. Across all deployments, 94.3% of exploratory routes yielded quality signals that resulted in non-trivial parameter updates (gradient norm > 0.001). The remaining 5.7% of explorations were uninformative, either because the route quality was too similar to the current optimal or because feedback was too delayed to be useful. Risk-bounded Thompson sampling successfully limited sub-threshold explorations to 2.1% of routing decisions, well below the 5% tolerance (δ = 0.05). No exploratory route produced a critical failure (defined as quality below 0.3), validating the risk-bounding mechanism.
8.4 Stability and Convergence
Zero Lyapunov stability violations were observed across all deployments. The parameter trajectory remained within the safe adaptation region throughout the adaptation process, with the minimum routing quality during adaptation being 0.62 (occurring during cycle 3 of the financial services deployments), which exceeded J_min = 0.60 (90% of the static baseline of 0.67). Convergence was achieved in an average of 23 adaptation cycles, with a standard deviation of 4.7 cycles across deployments. The fastest convergence was 14 cycles (a government deployment with relatively homogeneous action types), and the slowest was 34 cycles (a financial services deployment with high action diversity and complex regulatory constraints). The consensus protocol for multi-agent coordination converged within 12 communication rounds on average, with no deployment requiring more than 18 rounds.
9. Ablation Studies and Analysis
9.1 Impact of Feedback Delay
We conducted ablation studies to isolate the contribution of each component. Removing the temporal credit assignment mechanism (setting γ = 1, treating all feedback equally regardless of delay) degraded convergence from 23 to 41 cycles and reduced final routing quality by 4.2 percentage points, confirming that delay-discounted feedback is essential for efficient adaptation in environments with heterogeneous feedback latencies.
9.2 Impact of Thompson Sampling
Replacing Thompson sampling with epsilon-greedy exploration (ε = 0.1) reduced routing quality improvement from 27.8% to 19.3% and increased the rate of sub-threshold explorations from 2.1% to 7.8%, exceeding the 5% tolerance. The epsilon-greedy approach’s uniform random exploration wastes exploration budget on targets that the posterior already identifies as low-quality, whereas Thompson sampling concentrates exploration on targets with uncertain but potentially high quality.
9.3 Impact of Multi-Agent Consensus
Removing the ADMM consensus protocol and allowing each agent to adapt independently produced oscillatory behavior in 4 of 14 deployments, with routing quality oscillating by ±0.08 around the mean rather than converging. The affected deployments were those with the highest degree of target sharing among agents, confirming that coordination is essential when agents compete for shared routing targets.
10. Conclusion
Recursive adaptation transforms action routing from a static configuration problem into a dynamic learning system that continuously improves by observing the consequences of its own decisions. The formal framework presented in this paper provides three critical guarantees that make adaptive routing viable in enterprise AI governance: convergence (the routing policy converges to a local optimum of routing quality under standard stochastic approximation conditions), stability (the adaptation process remains within a bounded region of acceptable routing quality throughout convergence, certified by Lyapunov analysis), and coordination (multi-agent adaptation converges to a coordinated equilibrium via distributed consensus, preventing oscillatory conflicts).
The experimental results demonstrate that these theoretical guarantees translate into practical benefits: 27.8% routing quality improvement over static baselines, convergence within 23 adaptation cycles, and zero stability violations across 1.8 million adapted routing decisions. Thompson sampling provides exploration efficiency of 94.3% while maintaining strict risk bounds on exploratory routing quality. The consensus protocol successfully coordinates multi-agent adaptation in all deployments, preventing the oscillatory dynamics that arise under independent adaptation.
The implications for MARIA OS and enterprise AI governance are significant. Adaptive routing enables the system to track changes in agent capabilities, workload patterns, and domain requirements without manual reconfiguration, reducing the administrative burden on governance operators and improving decision quality in dynamic environments. The formal convergence and stability guarantees provide the assurance required for deployment in regulated industries where routing quality directly impacts compliance and patient safety. Future work will extend the framework to non-stationary environments where the optimal routing policy changes over time, requiring the adaptation process to track a moving target rather than converge to a fixed point, and to adversarial settings where malicious agents may attempt to manipulate the adaptation process through strategic outcome reporting.