1. Introduction: The Exploration-Exploitation Dilemma in the Enterprise
Consider a procurement department that has traditionally used three approved vendors for office supplies. Vendor A delivers in 3 days with 2% defect rate at $100/unit. Vendor B delivers in 5 days with 1% defect rate at $95/unit. Vendor C delivers in 2 days with 4% defect rate at $105/unit. A new Vendor D appears, offering unverified claims of 1-day delivery, 0.5% defect rate, at $90/unit. Should the department continue ordering from the known-good vendors, or divert some orders to test Vendor D's claims? How many orders should be diverted? When should testing stop? This is the multi-armed bandit problem, and enterprises face it thousands of times daily across every operational domain.
The traditional enterprise approach to this problem is either pure exploitation (stick with what works, never test alternatives) or structured A/B testing (allocate fixed percentages to each variant for a predetermined period, then analyze results). Both approaches are mathematically suboptimal. Pure exploitation has linear regret growing without bound when a superior alternative exists — the organization pays the opportunity cost forever. A/B testing is statistically inefficient: it continues allocating equal traffic to clearly inferior variants long after the evidence distinguishes them, and it requires predetermined sample sizes that may be too large (wasting resources) or too small (producing inconclusive results).
Multi-Armed Bandit algorithms solve both problems by dynamically adjusting allocation based on accumulating evidence. They pull promising arms more often, reducing the cost of exploration, while maintaining sufficient exploration to detect superior alternatives. The mathematical guarantee is sublinear regret: the cumulative cost of exploration grows slower than linearly with time, meaning the per-period cost of exploration approaches zero.
1.1 The Exploration Layer in the Algorithm Stack
In the 7-layer agentic company architecture, the Exploration Layer (Layer 5) sits above the Control Layer (Layer 4, actor-critic RL) and below the Abstraction Layer (Layer 6, PCA). While the Control Layer handles sequential decision-making with state transitions, the Exploration Layer handles stateless or weakly-stateful selection problems: choosing among discrete strategies, allocating resources among competing options, or tuning continuous parameters. The distinction is temporal depth: the Control Layer optimizes over multi-step trajectories with delayed rewards, while the Exploration Layer optimizes single-step selections with immediate (or near-immediate) feedback.
The Exploration Layer serves several critical enterprise functions:

| Function | Bandit Formulation | Arms | Reward |
|---|---|---|---|
| Strategy A/B testing | Beta-Bernoulli Thompson sampling | Strategy variants | Conversion / success rate |
| Dynamic pricing | Contextual bandits | Price points | Revenue per transaction |
| Resource allocation | Combinatorial bandits | Resource configurations | Throughput / efficiency |
| Priority scheduling | Restless bandits | Task priority rules | SLA compliance rate |
| Vendor selection | MAB with switching costs | Approved vendors | Quality-adjusted cost |
| Gate threshold tuning | Bayesian optimization | Threshold values | Governance quality score |
1.2 Paper Organization
Section 2 formalizes the multi-armed bandit problem. Section 3 develops Thompson sampling for enterprise strategy optimization. Section 4 presents UCB algorithms and their enterprise interpretations. Section 5 introduces contextual bandits for personalized decisions. Section 6 extends to Bayesian optimization for continuous strategy spaces. Section 7 derives regret bounds with enterprise interpretations. Section 8 addresses practical deployment considerations. Section 9 describes MARIA OS strategy engine integration. Section 10 presents experimental results. Section 11 discusses limitations and extensions. Section 12 concludes.
2. The Multi-Armed Bandit Problem: Formal Foundations
The canonical multi-armed bandit problem is defined as follows. An agent faces K arms (strategies, options, alternatives). At each round t = 1, 2, ..., T, the agent selects an arm I_t in {1, ..., K} and receives a reward X_{I_t, t} drawn from a distribution nu_{I_t} with mean mu_{I_t}. The agent's goal is to maximize cumulative reward, or equivalently, to minimize cumulative regret.
2.1 Regret Definitions
Let mu^* = max_k mu_k be the mean reward of the optimal arm, and let Delta_k = mu^* - mu_k be the suboptimality gap of arm k. The cumulative pseudo-regret is: $$ R_T = T \mu^* - \sum_{t=1}^{T} \mu_{I_t} = \sum_{k=1}^{K} \Delta_k \cdot \mathbb{E}[N_k(T)] $$ where N_k(T) is the number of times arm k is pulled in T rounds. This decomposition is fundamental: regret is the sum over all suboptimal arms of their gap times their expected pull count. Minimizing regret requires pulling suboptimal arms as few times as possible while still gathering enough information to identify them as suboptimal.
2.2 Information-Theoretic Lower Bound
The Lai-Robbins lower bound establishes that no algorithm can achieve regret better than: $$ \liminf_{T \to \infty} \frac{R_T}{\log T} \geq \sum_{k: \Delta_k > 0} \frac{\Delta_k}{\text{KL}(\nu_k, \nu^*)} $$ where KL(nu_k, nu^*) is the Kullback-Leibler divergence between the reward distributions of arm k and the optimal arm. This bound implies that logarithmic regret is the best any algorithm can achieve — and algorithms that achieve it are said to be asymptotically optimal. For Bernoulli rewards, KL(Ber(p), Ber(q)) = p log(p/q) + (1-p) log((1-p)/(1-q)).
2.3 Enterprise Interpretation of Regret
In enterprise contexts, regret has a direct monetary or operational interpretation:

- Pricing regret: Revenue lost by not immediately charging the optimal price. If the optimal price yields $50/transaction and the current price yields $45/transaction, each transaction at the suboptimal price costs $5 in regret.
- Vendor selection regret: Quality-adjusted cost premium paid by not using the best vendor. Each order placed with a suboptimal vendor adds Delta_k to cumulative regret.
- Approval workflow regret: Efficiency lost by not using the optimal gate threshold. Each decision processed under a suboptimal threshold adds processing-time regret.

The business case for bandit algorithms is the difference between linear regret (static strategies) and logarithmic regret (optimal bandits). Over T = 100,000 decisions with average gap Delta = $5, static strategies accumulate $500K in regret, while Thompson sampling accumulates approximately $5K * ln(100,000) ≈ $57.5K — roughly an order of magnitude improvement.
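A minimal arithmetic sketch reproducing the comparison above; the $5K constant and the use of the natural log are taken from the text, not derived:

```python
import math

T = 100_000        # number of decisions
gap = 5.0          # average per-decision gap in dollars

# Static strategy: every decision on a suboptimal arm pays the full gap.
static_regret = T * gap                     # $500,000

# Logarithmic regret with the constant used in the text (~$5K per log unit).
C = 5_000.0
bandit_regret = C * math.log(T)             # ~$57,565

print(f"static: ${static_regret:,.0f}  bandit: ${bandit_regret:,.0f}  "
      f"ratio: {static_regret / bandit_regret:.1f}x")
```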
3. Thompson Sampling for Enterprise Strategy Optimization
Thompson sampling (TS), also known as posterior sampling or probability matching, is a Bayesian bandit algorithm that selects arms according to their posterior probability of being optimal. It maintains a posterior distribution over each arm's reward parameter and, at each round, samples from each posterior and plays the arm with the highest sample.
3.1 The Beta-Bernoulli Model
For binary outcomes (success/failure), the natural conjugate model is Beta-Bernoulli. Each arm k has a true success probability theta_k. The prior is Beta(alpha_k^0, beta_k^0), and after observing S_k successes and F_k failures, the posterior is: $$ \theta_k | \text{data} \sim \text{Beta}(\alpha_k^0 + S_k, \; \beta_k^0 + F_k) $$ At each round, Thompson sampling draws theta_hat_k ~ Beta(alpha_k^0 + S_k, beta_k^0 + F_k) for each arm and selects I_t = argmax_k theta_hat_k. The elegance of this approach is that arms with higher posterior uncertainty (fewer observations) are more likely to produce high samples, naturally driving exploration, while arms with high posterior means are more likely to produce the highest sample, naturally driving exploitation.
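As an illustration, a minimal Python sketch of Beta-Bernoulli Thompson sampling as just described; the arm count, priors, and toy success rates are arbitrary choices for the example, not prescriptions:

```python
import numpy as np

rng = np.random.default_rng(0)

class BetaBernoulliTS:
    """Minimal Beta-Bernoulli Thompson sampling over K arms."""

    def __init__(self, n_arms: int, alpha0: float = 1.0, beta0: float = 1.0):
        self.alpha = np.full(n_arms, alpha0)   # prior alpha plus observed successes
        self.beta = np.full(n_arms, beta0)     # prior beta plus observed failures

    def select(self) -> int:
        # Draw one sample per arm from its posterior and play the argmax.
        samples = rng.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, arm: int, success: bool) -> None:
        if success:
            self.alpha[arm] += 1
        else:
            self.beta[arm] += 1

# Toy run against three arms with true success rates 0.60, 0.55, 0.50.
true_p = np.array([0.60, 0.55, 0.50])
ts = BetaBernoulliTS(n_arms=3)
for _ in range(5_000):
    k = ts.select()
    ts.update(k, rng.random() < true_p[k])
print(ts.alpha / (ts.alpha + ts.beta))   # posterior means concentrate on arm 0
```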
3.2 Enterprise Applications of Beta-Bernoulli Thompson Sampling
The Beta-Bernoulli model applies to any enterprise decision with a binary outcome:

1. Approval rate optimization: Each approval workflow variant (different evidence requirements, different reviewer assignments) is an arm. Success = approved within SLA. Thompson sampling identifies the workflow that maximizes approval rate.
2. Conversion optimization: Each sales script, pricing offer, or outreach template is an arm. Success = customer converts. Thompson sampling allocates more traffic to high-converting variants while maintaining exploration.
3. Compliance pass rate: Each compliance check procedure is an arm. Success = decision passes audit. Thompson sampling identifies the procedure that maximizes first-pass audit rates.
4. Agent task assignment: Each agent-task assignment policy is an arm. Success = task completed within quality and time thresholds. Thompson sampling learns the optimal assignment policy.
3.3 Prior Selection for Enterprise Settings
The choice of prior hyperparameters alpha_k^0 and beta_k^0 encodes domain knowledge. In enterprise settings, we recommend three prior strategies:

- Uninformative prior: Beta(1, 1) = Uniform(0, 1). Used when no historical data exists. Requires the most exploration.
- Historical prior: Beta(alpha_H, beta_H) where alpha_H and beta_H are computed from historical success/failure counts. Used when migrating from a rule-based system with performance records.
- Conservative prior: Beta(1, c) where c > 1 encodes a pessimistic prior (assuming low success probability). Used for high-risk experiments where the cost of false optimism exceeds the cost of delayed discovery.

In MARIA OS, the prior is automatically initialized from historical decision data in the evidence system. When a new strategy variant is introduced, its prior is set to the population average, providing a reasonable starting point without requiring manual specification.
3.4 Thompson Sampling with Gaussian Rewards
For continuous rewards (revenue, processing time, quality scores), the conjugate model is Normal-Normal. Each arm k has reward X_{k,t} ~ N(mu_k, sigma_k^2) with known variance. The posterior for mu_k after n_k observations with sample mean x_bar_k is: $$ \mu_k | \text{data} \sim N\left( \frac{\sigma_k^{-2} n_k \bar{x}_k + \sigma_0^{-2} \mu_0}{\sigma_k^{-2} n_k + \sigma_0^{-2}}, \; \frac{1}{\sigma_k^{-2} n_k + \sigma_0^{-2}} \right) $$ Thompson sampling draws from these posteriors and selects the arm with the highest sample. The posterior variance shrinks as 1/n_k, ensuring that well-explored arms are selected based on their mean while under-explored arms retain high variance and occasional high draws that drive exploration.
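A minimal sketch of the corresponding selection step, assuming per-arm observation counts, sample means, and known variances are tracked externally; the prior values mu0 and sigma0_2 are illustrative defaults:

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_ts_select(n, xbar, sigma2, mu0=0.0, sigma0_2=100.0):
    """One Thompson sampling step for K arms with known observation variance.

    n      : array of observation counts per arm
    xbar   : array of sample means per arm (use 0.0 for unpulled arms; they get zero weight)
    sigma2 : array of known reward variances per arm
    """
    prec = n / sigma2 + 1.0 / sigma0_2                 # posterior precision
    post_mean = (n * xbar / sigma2 + mu0 / sigma0_2) / prec
    post_var = 1.0 / prec
    draws = rng.normal(post_mean, np.sqrt(post_var))   # one sample per posterior
    return int(np.argmax(draws))
```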
4. Upper Confidence Bound Algorithms
Where Thompson sampling is Bayesian and stochastic, Upper Confidence Bound (UCB) algorithms are frequentist and deterministic. UCB selects arms based on an optimistic estimate: the empirical mean plus a confidence bonus that reflects uncertainty.
4.1 UCB1 Algorithm
The UCB1 algorithm selects the arm that maximizes: $$ I_t = \arg\max_k \left[ \bar{X}_k + \sqrt{\frac{2 \ln t}{N_k(t)}} \right] $$ where X_bar_k is the empirical mean reward of arm k and N_k(t) is the number of pulls. The confidence bonus sqrt(2 ln(t) / N_k(t)) decreases with more pulls of arm k (exploitation) and increases with total rounds t (ensuring continued exploration). UCB1 achieves O(K log(T) / Delta) regret, matching the Lai-Robbins lower bound up to constants.
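A compact sketch of the UCB1 index computation; pulling each arm once before applying the index is the usual initialization and is assumed here:

```python
import math

def ucb1_select(counts, means, t):
    """UCB1: empirical mean plus sqrt(2 ln t / N_k); unpulled arms are tried first."""
    for k, n in enumerate(counts):
        if n == 0:
            return k
    scores = [m + math.sqrt(2.0 * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)
```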
4.2 KL-UCB for Tighter Bounds
KL-UCB replaces the Hoeffding-based confidence bonus with a KL divergence-based index, achieving the Lai-Robbins lower bound exactly: $$ U_k(t) = \max\left\{ q \in [0, 1] : N_k(t) \cdot \text{KL}(\bar{X}_k, q) \leq \ln t + c \ln \ln t \right\}, \qquad I_t = \arg\max_k U_k(t) $$ The selected arm is the one whose maximum plausible mean (as determined by the KL divergence constraint) is highest. KL-UCB is particularly effective for Bernoulli rewards, where Hoeffding bounds are loose. In enterprise approval rate optimization, KL-UCB converges to the optimal approval workflow 20-30% faster than UCB1.
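The KL-UCB index has no closed form, but because KL(p, q) is increasing in q for q >= p it can be computed by bisection. A sketch, with the constant c treated as a tunable parameter:

```python
import math

def bernoulli_kl(p: float, q: float) -> float:
    """KL(Ber(p) || Ber(q)) with the usual 0 log 0 = 0 convention."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean: float, pulls: int, t: int, c: float = 3.0) -> float:
    """Largest q >= mean with N_k * KL(mean, q) <= ln t + c ln ln t, via bisection."""
    budget = math.log(t) + c * math.log(max(math.log(t), 1.0 + 1e-12))
    lo, hi = mean, 1.0
    for _ in range(50):                 # KL(mean, q) is monotone in q on [mean, 1]
        mid = (lo + hi) / 2.0
        if pulls * bernoulli_kl(mean, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo
```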
4.3 Enterprise UCB Extensions
Several UCB variants address enterprise-specific requirements:

UCB with switching costs. Changing strategies incurs operational costs (retraining, communication, process reconfiguration). UCB-S adds a switching penalty to the index: $$ I_t^{\text{UCB-S}} = \arg\max_k \left[ \bar{X}_k + \sqrt{\frac{2 \ln t}{N_k(t)}} - c_{\text{switch}} \cdot \mathbb{1}[k \neq I_{t-1}] \right] $$ The switching cost c_switch prevents rapid alternation between arms, producing more stable strategy selections.

UCB with delayed feedback. Enterprise outcomes often arrive with delay (audit results take days, customer retention is measured over months). UCB-D accounts for pending outcomes by adjusting the confidence bonus based on the number of pending observations.

UCB with fairness constraints. Some enterprise settings require a minimum allocation to each arm (e.g., regulatory requirements to test all vendor options). Constrained UCB ensures each arm receives at least a p_min fraction of pulls while maximizing the UCB index otherwise.
5. Contextual Bandits for Personalized Enterprise Decisions
Standard bandits assume the reward distribution is stationary — every pull of arm k draws from the same distribution. In enterprise settings, context matters: the best pricing strategy depends on the customer segment, the best escalation routing depends on the issue type, the best gate threshold depends on the agent's recent performance. Contextual bandits extend the MAB framework to incorporate observable features.
5.1 The Contextual Bandit Formulation
At each round t, the agent observes a context vector x_t in R^d, selects an arm I_t in {1, ..., K}, and receives reward r_t = f_{I_t}(x_t) + epsilon_t where f_k is the unknown reward function for arm k and epsilon_t is zero-mean noise. The goal is to learn the context-dependent optimal arm: $$ \pi^*(x) = \arg\max_k \mathbb{E}[r | x, k] = \arg\max_k f_k(x) $$
5.2 Linear Contextual Bandits (LinUCB)
LinUCB assumes the reward is a linear function of the context: f_k(x) = theta_k^T x where theta_k in R^d is arm k's unknown parameter vector. The algorithm maintains a regularized least-squares estimate and confidence ellipsoid for each arm: $$ \hat{\theta}_k = (X_k^T X_k + \lambda I)^{-1} X_k^T r_k $$ and selects the arm maximizing the UCB: $$ I_t = \arg\max_k \left[ \hat{\theta}_k^T x_t + \alpha \sqrt{x_t^T (X_k^T X_k + \lambda I)^{-1} x_t} \right] $$ The confidence bonus is context-dependent: it is larger in directions of the feature space that arm k has not explored. LinUCB achieves O(d sqrt(T K * log(T))) regret.
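A minimal sketch of the disjoint-model LinUCB update and index from the formulas above; the exploration weight alpha and ridge parameter lambda are illustrative defaults, and a production implementation would maintain the inverse incrementally rather than inverting on every call:

```python
import numpy as np

class LinUCBArm:
    """Per-arm ridge regression state for disjoint-model LinUCB."""

    def __init__(self, dim: int, lam: float = 1.0):
        self.A = lam * np.eye(dim)      # X_k^T X_k + lambda I
        self.b = np.zeros(dim)          # X_k^T r_k

    def ucb(self, x: np.ndarray, alpha: float = 1.0) -> float:
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b
        return float(theta @ x + alpha * np.sqrt(x @ A_inv @ x))

    def update(self, x: np.ndarray, reward: float) -> None:
        self.A += np.outer(x, x)
        self.b += reward * x

def linucb_select(arms, x, alpha=1.0):
    """Pick the arm with the highest context-dependent UCB for context x."""
    return int(np.argmax([arm.ucb(x, alpha) for arm in arms]))
```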
5.3 Enterprise Contextual Bandit Applications
Dynamic pricing with customer context. Context features include customer segment, purchase history, time of day, and competitor pricing. Arms are price points. The bandit learns which price maximizes revenue for each customer context. In MARIA OS, the pricing bandit operates within the Sales Universe, with gate constraints ensuring prices remain within approved ranges. Escalation routing with issue context. Context features include issue category, customer sentiment score, agent availability, and historical resolution rates. Arms are routing destinations (tier-1 support, specialist team, manager escalation). The bandit learns the routing that maximizes first-contact resolution for each issue type. Gate threshold adaptation with agent context. Context features include agent trust score, recent error rate, task complexity, and time-of-day. Arms are threshold configurations. The bandit learns the gate configuration that optimizes the tradeoff between autonomy and oversight for each agent context.
5.4 Neural Contextual Bandits
When the reward function is non-linear in the context, we use neural contextual bandits where f_k(x) is parameterized by a neural network. The key challenge is uncertainty quantification: neural networks provide point estimates but not confidence intervals. We address this through two approaches:

Ensemble disagreement. Train an ensemble of M networks with different random initializations. The prediction is the ensemble mean, and the confidence bonus is proportional to the ensemble standard deviation. This provides a computationally tractable approximation to posterior uncertainty.

Neural Thompson sampling. Maintain a distribution over the last-layer weights of the neural network using a Bayesian linear regression on the learned features. The network body provides a feature representation phi(x), and the last-layer weights are sampled from their posterior: $$ w_k | \text{data} \sim N(\hat{w}_k, \sigma^2 (\Phi_k^T \Phi_k + \lambda I)^{-1}) $$ This combines the representational power of neural networks with the principled exploration of Thompson sampling.
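A sketch of the last-layer sampling step for neural Thompson sampling, assuming the network body has already mapped each observation to features phi(x); Phi is the feature matrix for one arm and r its observed rewards, with sigma2 and lam as illustrative hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_last_layer(Phi: np.ndarray, r: np.ndarray,
                      sigma2: float = 1.0, lam: float = 1.0) -> np.ndarray:
    """Draw w ~ N(w_hat, sigma^2 (Phi^T Phi + lam I)^{-1}) for one arm."""
    p = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(p)
    A_inv = np.linalg.inv(A)
    w_hat = A_inv @ Phi.T @ r             # ridge estimate of the last-layer weights
    return rng.multivariate_normal(w_hat, sigma2 * A_inv)

# Arm selection: score the shared features phi(x_t) under each arm's sampled weights.
# scores = [phi_x @ sample_last_layer(Phi_k, r_k) for Phi_k, r_k in per_arm_data]
```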
6. Bayesian Optimization for Continuous Strategy Spaces
When the strategy space is continuous rather than discrete — tuning a gate threshold in [0, 1], optimizing a pricing curve parameterized by elasticity and base price, or configuring a resource allocation ratio — the MAB framework extends to Bayesian optimization.
6.1 Gaussian Process Surrogate
Bayesian optimization models the unknown objective function f(x) as a Gaussian process: f ~ GP(mu, k) where mu is the prior mean function and k is the kernel function. After observing n data points D_n = {(x_i, y_i)}_{i=1}^n where y_i = f(x_i) + epsilon_i, the posterior is: $$ f | D_n \sim \text{GP}(\mu_n, k_n) $$ with posterior mean mu_n(x) = k(x, X)(K + sigma^2 I)^{-1} y and posterior variance k_n(x, x') = k(x, x') - k(x, X)(K + sigma^2 I)^{-1} k(X, x'). The posterior provides both a prediction (mean) and an uncertainty estimate (variance) at any point in the strategy space.
6.2 Acquisition Functions
The acquisition function determines the next point to evaluate, balancing exploration (high uncertainty) and exploitation (high predicted value). Three common choices:

Expected Improvement (EI): $$ \text{EI}(x) = \mathbb{E}[\max(f(x) - f^+, 0)] = (\mu_n(x) - f^+) \Phi(z) + \sigma_n(x) \phi(z) $$ where f^+ is the best observed value, z = (mu_n(x) - f^+) / sigma_n(x), and Phi, phi are the standard normal CDF and PDF.

Upper Confidence Bound (GP-UCB): $$ \text{UCB}(x) = \mu_n(x) + \beta_t \sigma_n(x) $$ where beta_t = 2 log(t^{d/2+2} pi^2 / (3 delta)) provides a frequentist confidence guarantee.

Knowledge Gradient (KG): $$ \text{KG}(x) = \mathbb{E}[\mu_{n+1}^* - \mu_n^* \mid x_{n+1} = x] $$ which measures the expected improvement in the best predicted value after evaluating x. KG is particularly effective in enterprise settings where each evaluation is expensive (e.g., running a 7-day A/B test).
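A minimal sketch of the GP posterior from Section 6.1 together with the EI acquisition above, using an RBF kernel with unit signal variance for brevity (the experiments in Section 10 use a Matern 5/2 kernel); in practice a Cholesky solve would replace the explicit inverse:

```python
import numpy as np
from scipy.stats import norm

def rbf_kernel(A, B, lengthscale=1.0):
    """Squared-exponential kernel between two sets of points (rows)."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / lengthscale**2)

def gp_posterior(X, y, X_star, noise=1e-2, lengthscale=1.0):
    """Posterior mean and std of a zero-mean GP at candidate points X_star."""
    K = rbf_kernel(X, X, lengthscale) + noise * np.eye(len(X))
    K_s = rbf_kernel(X_star, X, lengthscale)
    K_inv = np.linalg.inv(K)
    mu = K_s @ K_inv @ y
    var = 1.0 - np.sum(K_s @ K_inv * K_s, axis=1)   # k(x, x) = 1 for this kernel
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sigma, f_best):
    """EI(x) = (mu - f^+) Phi(z) + sigma phi(z), evaluated elementwise."""
    z = (mu - f_best) / sigma
    return (mu - f_best) * norm.cdf(z) + sigma * norm.pdf(z)
```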
6.3 Enterprise Bayesian Optimization Applications
Gate threshold optimization. The objective is the governance quality score (a composite of false acceptance rate, false rejection rate, and processing latency) as a function of the gate threshold tau in [0, 1]. Bayesian optimization finds the threshold that maximizes quality using far fewer evaluations than grid search. With a Matern 5/2 kernel, the GP captures the typical non-monotonic relationship between threshold and quality: too low allows harmful actions, too high blocks productive ones.

Pricing curve optimization. The objective is revenue as a function of price elasticity epsilon and base markup m. The search space is a 2D rectangle [0.5, 3.0] x [1.1, 2.5]. Bayesian optimization discovers the revenue-maximizing pricing curve in approximately 30 evaluations, compared to 900+ for a 30x30 grid search.

Resource allocation ratio. The objective is throughput as a function of the fraction of compute allocated to each of N task types, subject to the simplex constraint. Bayesian optimization on the simplex finds the allocation that maximizes throughput while respecting the constraint that allocations sum to 1.
7. Regret Bounds and Enterprise Interpretations
Understanding regret bounds is essential for enterprise deployment because they determine the cost of learning. A CEO who deploys a bandit algorithm needs to know: how much revenue will I lose during the exploration phase before the algorithm converges?
7.1 Regret Bounds Summary
| Algorithm | Regret Bound | Order | Enterprise Interpretation |
|---|---|---|---|
| Thompson Sampling (Bernoulli) | sum_k Delta_k^{-1} log(T) | O(K log T / Delta) | Revenue loss scales logarithmically with time |
| UCB1 | sum_k (8 log T) / Delta_k | O(K log T / Delta) | Deterministic upper bound on exploration cost |
| KL-UCB | sum_k Delta_k / KL(mu_k, mu*) * log T | O(K log T / KL) | Matches Lai-Robbins lower bound |
| LinUCB (contextual) | d sqrt(T K log T) | O(d sqrt(TK log T)) | Personalization cost: grows with feature dimension |
| GP-UCB (Bayesian opt) | sqrt(T gamma_T log T) | O(sqrt(T gamma_T log T)) | gamma_T is information capacity of kernel |
7.2 The Logarithmic Regret Guarantee
The most important result for enterprise deployment is that all well-designed bandit algorithms achieve logarithmic regret: R_T = O(log T). With R_T ≈ C ln T for an algorithm-dependent constant C, this means:

- After 100 rounds: regret approximately 4.6 C
- After 1,000 rounds: regret approximately 6.9 C
- After 10,000 rounds: regret approximately 9.2 C
- After 100,000 rounds: regret approximately 11.5 C

The regret from round 1,000 to round 100,000 (99,000 additional rounds) adds only about 4.6 C to the cumulative regret — the same as the first 100 rounds. This radically front-loads the cost of learning: most exploration happens early, and the system increasingly exploits the best strategy as evidence accumulates.
7.3 Finite-Time Bayesian Regret for Thompson Sampling
For the Beta-Bernoulli model, the Bayesian regret of Thompson sampling satisfies: $$ \mathbb{E}[R_T] \leq \sum_{k: \Delta_k > 0} \left( \frac{\Delta_k}{\text{KL}(\mu_k, \mu^*)} + C_k \right) \ln T $$ where C_k is a problem-dependent constant that captures the prior's influence. With an uninformative Beta(1,1) prior, C_k is small. With a strongly informative prior that is correct, C_k can be negative (the prior accelerates convergence). With a strongly informative prior that is wrong, C_k can be large (the prior delays convergence). This motivates the careful prior selection strategies described in Section 3.3.
8. Practical Deployment Considerations
Deploying bandit algorithms in enterprise environments introduces challenges not addressed by the theoretical framework.
8.1 Non-Stationarity and Drift
Enterprise reward distributions change over time: customer preferences evolve, market conditions shift, regulatory requirements update. Stationary bandit algorithms will converge to a previously optimal arm that is no longer best. We address non-stationarity through three mechanisms:

Sliding-window Thompson sampling. Instead of using all historical data, maintain a window of the most recent W observations. The posterior is Beta(alpha_0 + S_k^W, beta_0 + F_k^W) where S_k^W and F_k^W are successes and failures within the window. This forgets outdated information and adapts to distribution shifts.

Discounted UCB. Weight observations by their recency: recent observations contribute more to the empirical mean and confidence bonus. The discounted empirical mean is X_bar_k^gamma = sum_t gamma^{T-t} X_{k,t} / sum_t gamma^{T-t} where gamma in (0,1) is the discount factor.

Change-point detection. Monitor the posterior distribution for sudden shifts using CUSUM or Bayesian change-point detection. When a change point is detected, reset the arm's posterior to the prior, triggering a burst of re-exploration. This is more efficient than sliding windows when changes are infrequent but abrupt.
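A sketch of the sliding-window variant, assuming a fixed window W per arm and the same Beta-Bernoulli posterior as Section 3.1:

```python
from collections import deque
import numpy as np

rng = np.random.default_rng(3)

class SlidingWindowTS:
    """Beta-Bernoulli Thompson sampling over the most recent W outcomes per arm."""

    def __init__(self, n_arms: int, window: int,
                 alpha0: float = 1.0, beta0: float = 1.0):
        self.history = [deque(maxlen=window) for _ in range(n_arms)]
        self.alpha0, self.beta0 = alpha0, beta0

    def select(self) -> int:
        samples = []
        for h in self.history:
            s = sum(h)                      # successes inside the window
            f = len(h) - s                  # failures inside the window
            samples.append(rng.beta(self.alpha0 + s, self.beta0 + f))
        return int(np.argmax(samples))

    def update(self, arm: int, success: bool) -> None:
        self.history[arm].append(1 if success else 0)   # oldest outcome drops out
```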
8.2 Batched Updates
Enterprise systems often process decisions in batches rather than one at a time. In batched Thompson sampling, the algorithm selects B actions simultaneously (one per batch element) and observes all B rewards before updating posteriors. The challenge is that all B selections use the same posterior, potentially under-exploring. We mitigate this by adding posterior perturbation: for each element in the batch, we add a small random perturbation to the posterior parameters before sampling, ensuring diversity in arm selection within the batch. $$ \theta_{k,b} \sim \text{Beta}(\alpha_k + \epsilon_b, \; \beta_k + \epsilon_b') \quad \text{where } \epsilon_b, \epsilon_b' \sim \text{Uniform}(0, \sigma_{\text{perturb}}) $$
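A sketch of the perturbed batched selection described above; sigma_perturb controls how much within-batch diversity is injected and its default here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)

def batched_ts_select(alpha, beta, batch_size, sigma_perturb=0.5):
    """Select one arm per batch element from the same posterior, jittering the
    Beta parameters so the batch does not collapse onto a single arm."""
    alpha = np.asarray(alpha, dtype=float)
    beta = np.asarray(beta, dtype=float)
    choices = []
    for _ in range(batch_size):
        eps_a = rng.uniform(0.0, sigma_perturb, size=alpha.shape)
        eps_b = rng.uniform(0.0, sigma_perturb, size=beta.shape)
        samples = rng.beta(alpha + eps_a, beta + eps_b)
        choices.append(int(np.argmax(samples)))
    return choices
```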
8.3 Safety Constraints During Exploration
Enterprise exploration must respect safety boundaries. A pricing bandit cannot explore a price of $0 (revenue destruction) or $10,000 (customer alienation). We impose safety constraints by restricting the arm set to safe options: $$ A_{\text{safe}} = \{ k \in \{1, ..., K\} : P(r_k < r_{\min}) < \delta \} $$ where r_min is the minimum acceptable reward and delta is the safety tolerance. Under the Beta-Bernoulli model, this constraint is computed analytically using the posterior CDF. Arms whose posterior places too much probability below the safety threshold are excluded from selection until more evidence accumulates.
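Under the Beta-Bernoulli model the safe set can be computed directly from the posterior CDF, as in this sketch (the r_min and delta values in the comment are illustrative):

```python
from scipy.stats import beta as beta_dist

def safe_arms(alpha, beta, r_min, delta):
    """Arms whose posterior puts less than delta mass below r_min:
    P(theta_k < r_min) = BetaCDF(r_min; alpha_k, beta_k)."""
    return [k for k, (a, b) in enumerate(zip(alpha, beta))
            if beta_dist.cdf(r_min, a, b) < delta]

# Example: exclude arms with more than 5% posterior probability of success < 0.3.
# eligible = safe_arms(alpha=[12, 2], beta=[4, 9], r_min=0.3, delta=0.05)
```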
9. MARIA OS Strategy Engine Integration
The MARIA OS strategy engine implements the Exploration Layer through a modular bandit service that supports multiple algorithm backends and integrates with the platform's governance infrastructure.
9.1 Strategy Engine Architecture
The strategy engine consists of four components:

1. Experiment Registry: Defines active experiments with their arm sets, context features, reward definitions, and safety constraints. Experiments are scoped to MARIA coordinates — a pricing experiment in G1.U1 (Sales Universe) is independent of a routing experiment in G1.U3 (FAQ Universe).
2. Algorithm Backend: Pluggable bandit algorithms (Thompson sampling, UCB1, KL-UCB, LinUCB, GP-UCB) selected per experiment based on the reward type (binary vs continuous), context availability, and convergence requirements.
3. Decision Gateway: Intercepts strategy decisions at gate boundaries and routes them through the active experiment's bandit algorithm. When an agent needs to select a strategy, the gateway queries the algorithm for a recommendation and records the selection.
4. Outcome Tracker: Monitors decision outcomes and feeds rewards back to the algorithm. Handles delayed feedback by maintaining a pending outcomes buffer and updating posteriors when outcomes arrive.
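For concreteness, a hypothetical experiment definition in the spirit of the registry described above; all field names and values are illustrative, not the actual MARIA OS schema:

```python
# Hypothetical registry entry; field names are illustrative only.
pricing_experiment = {
    "id": "exp-pricing-001",
    "scope": "G1.U1",                          # Sales Universe coordinate
    "algorithm": "thompson_sampling",          # backend chosen for binary rewards
    "arms": ["price_95", "price_100", "price_105"],
    "context_features": ["customer_segment", "time_of_day"],
    "reward": {"type": "binary", "definition": "transaction_completed"},
    "safety": {"r_min": 0.2, "delta": 0.05},   # exclusion rule from Section 8.3
    "exploration_budget": 0.20,                # at most 20% of decisions experimental
}
```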
9.2 Coordinate-Scoped Experiments
Experiments in MARIA OS are scoped to specific MARIA coordinates, enabling hierarchical strategy optimization: $$ \text{Experiment}(G_i.U_j.P_k) \implies \text{all agents in } Z_l.A_m \text{ under } P_k \text{ participate} $$ A galaxy-level experiment tests strategies across all business units. A universe-level experiment tests strategies within a single business unit. A planet-level experiment tests strategies within a domain. This scoping ensures that experiment results are relevant to the operational context and that gate constraints from ancestor levels are respected.
9.3 Governance Integration
The strategy engine respects MARIA OS governance at every level:

- Arm approval: New strategy variants must pass a gate review before being added to an experiment's arm set. This prevents untested or high-risk strategies from entering the exploration pool.
- Exploration budget: Each experiment has a maximum exploration allocation (e.g., 'no more than 20% of procurement decisions can be experimental'). The bandit algorithm operates within this budget, concentrating exploration in the allocated fraction.
- Outcome audit: Every arm selection and reward observation is recorded in the evidence system, creating a complete audit trail of the organization's exploration behavior.
- Escalation on anomaly: If a bandit algorithm selects an arm that produces consistently poor outcomes (reward below the safety threshold for n consecutive rounds), the strategy engine escalates to human review.
10. Experimental Results
We evaluate the bandit framework across three enterprise optimization scenarios within MARIA OS.
10.1 Experiment 1: Approval Workflow Optimization
Setup. 5 approval workflow variants (differing in evidence requirements, reviewer assignment rules, and SLA targets) tested across 8,000 approval requests over 60 days. Binary reward: 1 if approved within SLA, 0 otherwise. Results. Thompson sampling identified the optimal workflow (Variant C, with targeted evidence requirements and workload-balanced reviewer assignment) within 14 days, achieving 87.3% approval rate vs 79.1% for the worst variant. Cumulative regret: 412 approval-days lost during exploration. Epsilon-greedy (epsilon = 0.1) required 31 days and accumulated 1,537 approval-days of regret — 3.7x worse. UCB1 performed comparably to Thompson sampling (16 days, 489 regret) but with less intuitive arm selection behavior.
10.2 Experiment 2: Dynamic Pricing
Setup. Contextual bandit (LinUCB) for 3 product categories with 12 context features (customer segment, purchase history, time of day, competitor price, inventory level, etc.) and 8 price points per category. Continuous reward: revenue per transaction. Results. Over 90 days and 45,000 transactions, LinUCB achieved 8.7% revenue uplift over static pricing rules. The contextual model learned meaningful pricing patterns: enterprise customers in high-urgency contexts had lower price sensitivity (optimal price 15% above baseline), while price-sensitive segments responded to dynamic discounting during off-peak hours (optimal price 8% below baseline during 2-6 AM). Revenue uplift was front-loaded: 5.2% in the first 30 days, 7.9% by day 60, 8.7% by day 90.
10.3 Experiment 3: Gate Threshold Tuning
Setup. Bayesian optimization (GP-UCB with Matern 5/2 kernel) for gate threshold tuning across 4 gate dimensions (financial risk, compliance risk, operational risk, evidence quality). Continuous search space in [0, 1]^4. Objective: composite governance quality score (weighted combination of false acceptance rate, false rejection rate, and average processing latency). Results. Bayesian optimization found a threshold configuration with governance quality score 0.847 in 28 evaluations (each evaluation = 500 decisions). Grid search with 5 points per dimension required 625 evaluations to find a configuration with score 0.831. The GP-UCB-optimal thresholds were: financial risk = 0.72, compliance risk = 0.89, operational risk = 0.58, evidence quality = 0.41 — notably, compliance risk required a much higher threshold than other dimensions, confirming the domain intuition that compliance violations are more costly than operational inefficiencies.
11. Limitations and Extensions
11.1 Limitations
The bandit framework has several limitations in enterprise contexts. First, the independence assumption: standard MABs assume arm rewards are independent, but enterprise strategies often interact (e.g., changing pricing in one product category affects demand in adjacent categories). Combinatorial bandits address this partially but with higher computational cost. Second, the stationarity assumption: despite our non-stationarity mitigations, rapid environmental changes can cause significant regret before the algorithm adapts. Third, the reward definition problem: enterprise 'reward' is often a multi-dimensional construct that must be compressed into a scalar, and the compression function itself affects which strategies appear optimal.
11.2 Extensions
Several extensions are natural directions for enterprise bandit research:

Cascading bandits for multi-stage decisions where each stage is a bandit problem and the reward depends on the entire cascade. Application: multi-step approval workflows where each step's configuration is an arm.

Sleeping bandits for settings where arm availability varies over time. Application: vendor selection when vendors have capacity constraints and are periodically unavailable.

Dueling bandits for settings where only relative feedback is available (preference A > B) rather than absolute rewards. Application: strategy comparison through human expert evaluation where absolute quality scores are unreliable.

Federated bandits for multi-tenant MARIA OS deployments where each tenant runs a bandit instance and wants to benefit from cross-tenant learning without sharing raw data. Federated Thompson sampling enables this by sharing sufficient statistics (posterior parameters) rather than individual observations.
12. Conclusion
Multi-Armed Bandits form the Exploration Layer of the agentic company architecture, providing a principled mathematical framework for the exploration-exploitation tradeoff that pervades every enterprise decision domain. Thompson sampling with Beta-Bernoulli priors delivers asymptotically optimal exploration for binary outcomes — approval rates, conversion rates, compliance pass rates — with the computational simplicity of sampling from Beta distributions. UCB algorithms provide deterministic, auditable exploration strategies with provable confidence bounds that enterprise governance teams can inspect and verify. Contextual bandits extend the framework to personalized decisions, learning which strategies work best for which contexts. Bayesian optimization handles continuous strategy spaces with sample efficiency that makes it practical for expensive enterprise experiments.
The integration with MARIA OS transforms these algorithms from mathematical curiosities into operational infrastructure. Coordinate-scoped experiments ensure that exploration respects organizational boundaries. Gate-governed arm approval prevents untested strategies from entering production. Evidence-system integration provides complete audit trails of exploration behavior. Safety constraints prevent catastrophic exploration outcomes.
The experimental results validate the framework's practical value: 73% regret reduction over epsilon-greedy baselines means less revenue lost during learning, 14-day convergence means faster strategy optimization, and 8.7% pricing uplift demonstrates that the theoretical advantages translate to measurable business outcomes. As agentic companies scale the number of autonomous decisions, the Exploration Layer ensures that organizational learning is mathematically optimal — not left to the ad-hoc intuitions of A/B test designers.
Future work will integrate the Exploration Layer with the Control Layer (Layer 4, actor-critic RL) through a hierarchical architecture where bandits select high-level strategies and RL optimizes execution within each strategy. This two-level optimization — bandits for what to do, RL for how to do it — mirrors the distinction between strategic and operational decision-making in human organizations.