Name: MARIA OS
Author: MARIA OS

Abstract

Abstract. The agentic company — an enterprise in which autonomous AI agents self-organize into specialized roles under explicit governance constraints — presents a novel challenge for metacognitive systems design. Unlike traditional organizations where human judgment provides implicit self-monitoring, agentic companies require formal metacognitive architecture: the system must know what it knows, what it does not know, and when its own behavior deviates from stable operating regimes. This paper formalizes the agentic company as a dynamic graph-augmented constrained Markov decision process Gt = (At, Et, St, Πt, Rt, Dt), defines an operational governance density Dt over router-generated Top-K candidate actions, and derives the practical stability condition λmax(Weff,t) < 1 with Weff,t = (1 − κ(Dt))Wt. This links measured governance coverage to influence amplification through an explicit damping function κ rather than a conceptual ratio. We demonstrate that governance constraints function as organizational metacognition: each constraint is a point where the system observes its own behavior. We characterize phase transitions between stagnation, stable specialization, and chaotic divergence, and validate the theory through Planet-100 simulations showing that stable role specialization emerges precisely in the intermediate governance regime predicted by the stability boundary.

1. Introduction

The concept of an agentic company represents a fundamental departure from traditional enterprise architecture. In a conventional organization, decisions flow through human-mediated hierarchies where metacognition — the awareness of one's own cognitive processes and limitations — is an implicit byproduct of human judgment. A manager reviewing a subordinate's proposal is simultaneously performing metacognitive assessment: evaluating the quality of reasoning, checking for blind spots, and comparing the proposal against organizational experience. When AI agents replace or augment human decision-makers at scale, this implicit metacognitive layer vanishes unless it is explicitly designed and formally guaranteed.

The stakes are considerable. An agentic company operating without metacognition is analogous to a pilot flying without instruments — the system has no way to detect when it has crossed from stable operation into a dangerous regime. Role assignments may drift without correction. Influence propagation between agents may amplify errors exponentially. Decision quality may degrade gradually in ways that no individual agent can detect because the degradation is a system-level phenomenon invisible from any single agent's perspective. The fundamental question this paper addresses is: what mathematical structures must an agentic company possess to ensure it can observe, evaluate, and correct its own organizational dynamics?

Our central thesis is that governance density — measured operationally on the router's candidate action set at each step — serves as the primary metacognitive parameter. This is not a metaphor. We show formally that each governance constraint creates a point of organizational self-observation: gates force decision review, evidence requirements force outcome documentation, and approval workflows force cross-agent validation. The density of these observation points determines whether the organization has sufficient self-awareness to maintain stability or whether it is operating blind.

The paper makes five contributions. First, we define the agentic company as a graph-augmented constrained MDP, providing a complete mathematical framework for reasoning about organizational dynamics. Second, we provide a computable governance density definition using Top-K candidate actions and derive the practical stability rule (1 − κ(D))λmax(W) < 1, linking influence propagation to governance damping. Third, we characterize the role specialization dynamics that emerge from agent utility maximization under constraints. Fourth, we identify the phase diagram governing organizational behavior — stagnation, stable specialization, and chaos — as functions of task complexity, communication bandwidth, and governance density. Fifth, we show how MARIA OS implements these theoretical requirements through its Decision Graph, Gate Engine, Evidence Layer, and Doctor anomaly detection layer.

2. The Mathematical Model

2.1 Agentic Company as Graph-Constrained MDP

We model the agentic company at time step t as a tuple Gt = (At, Et, St, Πt, Rt, Dt) where At is the set of agents, Et is the edge matrix encoding inter-agent dependencies and communication channels, St is the organizational state vector, Πt is the collection of agent policies, Rt is the reward function mapping state-action pairs to organizational value, and Dt is the governance density parameter. This is not a standard MDP — it extends the framework in three critical ways. The state space includes organizational structure (who reports to whom, which agents communicate), the policy set is heterogeneous (each agent may have a distinct policy), and the constraint set Dt is itself a dynamic variable that can be adjusted in response to organizational performance.

2.2 State Vector

The organizational state St is a composite vector capturing five dimensions of enterprise health: St = [Ft, Kt, Ht, Lt, Ct] where Ft represents financial state (revenue, costs, margins, cash flow), Kt represents key performance indicators (completion rates, quality scores, customer satisfaction), Ht represents human capacity (available expertise, decision bandwidth, approval queue depth), Lt represents risk state (pending risk exposures, compliance gaps, audit findings), and Ct represents communication structure (information flow topology, bottleneck identification, latency metrics). Each dimension evolves according to the joint actions of all agents, mediated by the operational influence matrix Wt estimated from interaction logs.

The state space is continuous and high-dimensional, reflecting the reality that enterprise health cannot be reduced to a single metric or a discrete set of states. The dynamics St+1 = f(St, a1, ..., an, Wt, Dt) are determined by the joint actions of all agents operating under the current governance constraints. This formulation captures the essential challenge: no single agent controls the state transition, yet the system must converge to a stable operating point.

2.3 Influence Propagation

The operational influence matrix Wt = [wij,t] captures the sensitivity of agent j's policy and KPI trajectory to agent i's actions. Entry wij,t represents the degree to which agent i's decisions affect agent j's decision-making context — through shared resources, information flow, approval chains, or operational dependencies. We intentionally reserve At in Gt for the agent set to avoid notation collision. The matrix Wt is generally asymmetric (agent i may strongly influence agent j without the reverse being true) and time-varying (organizational restructuring, new projects, and changing priorities shift influence patterns).

At each step, we estimate Wt from decision logs using a local linear response model: ΔKPIj,t+1 = Σi wij,tui,t + βjTxt + εj,t, where ui,t is agent i's action intensity and xt are control covariates. In this form, wij,t approximates ∂(KPIj)/∂(ui) and can be estimated by regularized regression, causal variants, or Granger-style lag models.

The critical property of Wt is its spectral radius λmax(Wt), which determines whether influence propagation amplifies or attenuates over time. When λmax is high, small perturbations can grow as they propagate through the network; when it is low, perturbations decay. Governance density D determines the damping applied to this propagation.

3. Governance Density as Self-Awareness

3.1 Definition and Properties

To make governance density fully computable, we define it on a finite candidate set at each decision step. Let ActionSpacetK = {at(1), ..., at(K)} be the Top-K candidate actions generated by the router. Let vt(k) ∈ {0,1} indicate whether candidate k triggers at least one active gate constraint. The baseline operational governance density is Dt = (1/K)Σk=1..K vt(k). This is the effective constrained-candidate ratio and is logged per step, making Dt auditable in production.

To account for heterogeneous constraint burden, we additionally define a weighted variant Dt(w) = (Σk ωt(k)vt(k)) / (Σk ωt(k)), where ωt(k) can be configured from constraint type weight w(typek), expected gate latency, or risk-tier severity. In operations, we log both Dt and Dt(w). Typed densities Dt(c) can be tracked by constraint family (approval, compliance, risk, authority) when finer diagnostics are needed.

3.2 Why Constraints Equal Self-Observation

The metacognitive interpretation of governance density rests on a structural observation: every governance constraint creates a mandatory point of organizational self-observation. Consider the mechanism. An approval gate forces at least two agents to examine a decision before it executes. An evidence requirement forces the acting agent to document the rationale and expected outcome, creating an artifact that can later be compared against actual results. A risk threshold triggers escalation, forcing higher-authority agents to examine decisions that individual agents might process automatically. A compliance check forces comparison of proposed actions against organizational policies, essentially asking: 'Does this action align with who we say we are?'

Each of these mechanisms is a form of metacognition — the organization examining its own decision processes. The density D therefore measures metacognitive coverage over what the organization is actually about to do: the router's executable candidate set at time t. This is not a metaphor but a formal correspondence with direct telemetry. In biological terms, D is the density of proprioceptive sensors in the organizational body. A human with no proprioception cannot maintain posture or coordinate movement because they have no awareness of their body's state. An agentic company with D = 0 cannot maintain organizational coherence for the same reason.

3.3 Dynamic Governance Density

In practice, D should not be a fixed parameter but a dynamically adjusted control variable. The target governance density responds to organizational conditions: Dtarget = clamp(base + w1 · λmax(Wt) + w2 · anomaly_rate + w3 · Ctask − w4 · Bcomm, 0.1, 0.9). When the spectral radius of the influence matrix increases (agents becoming more interdependent), D should increase to compensate. When anomaly rates rise, D should increase to provide more self-observation. When task complexity increases, more governance is needed to manage the additional risk. When communication bandwidth is high (agents can coordinate effectively), less formal governance is needed because informal coordination provides metacognitive coverage.

4. The Stability Law

4.1 Main Theorem

The central result of this paper is the stability condition for agentic companies. We state it here in its practical form before providing the derivation. Theorem (Practical Stability Law, sufficient form). Let Weff,t = (I − κ(Dt)I)Wt with κ:[0,1]→[0,1] monotone nondecreasing. A practical sufficient condition for stable self-organization is λmax(Weff,t) < 1, equivalently (1 − κ(Dt))λmax(Wt) < 1 under scalar damping. When this condition holds, influence propagation between agents is bounded, perturbations decay over time, and the system converges to a stable role specialization equilibrium. Persistent violation raises divergence risk — either because influence is too strong (λmax too high) or governance damping is too weak (effective D too low).

4.2 Intuition

The stability law captures a fundamental tradeoff. Agents in an agentic company influence each other through shared resources, information flow, and decision dependencies. If these influence chains are too strong — if agent A's actions strongly affect agent B, whose reactions strongly affect agent C, and so on — then any perturbation can cascade through the network and grow without bound. This is the spectral radius effect: λmax(W) measures the worst-case amplification factor of influence propagation per step.

Governance constraints interrupt these influence chains. An approval gate between agent A and agent B means that A's influence on B is mediated by a review process that can dampen, redirect, or block the propagation. An evidence requirement forces the initiating agent to justify its action, introducing a natural braking mechanism. The governance density D feeds a damping map κ(D), and the effective amplification becomes (1 − κ(D))λmax(W). The stability rule is therefore transparent: governance must reduce effective amplification below 1.

4.3 Derivation Sketch

Consider the state evolution equation St+1 = Weff,tSt + εt with Weff,t = (I − κ(Dt)I)Wt, where εt represents exogenous perturbations and κ(Dt) is the governance damping coefficient. The expected deviation from equilibrium evolves as E[||St+1 − S||] ≤ λmax(Weff,t) · E[||St − S||] + ||εt||. For contraction, we require λmax(Weff,t) < 1. Under scalar damping this yields (1 − κ(Dt))λmax(Wt) < 1. The full proof requires handling the time-varying nature of both Wt and Dt, which we address through Lyapunov arguments showing this boundary is a practical sufficient condition under moderate parameter drift.

5. Role Specialization Dynamics

5.1 Utility-Driven Role Assignment

In an agentic company, roles are not assigned top-down but emerge from utility maximization by individual agents operating within governance constraints. Agent i's role at time t+1 is determined by: ri(t+1) = argmaxr Ui(r | Ctask, Bcomm, Dt) where Ui is the agent's utility function decomposed as Ui = α · Eff(r) + β · Impact(r) − γ · Cost(r, Dt). Here Eff(r) measures the agent's efficiency in role r (how well its capabilities match the role's requirements), Impact(r) measures the organizational influence the role provides, and Cost(r, Dt) measures the constraint cost — the degree to which governance requirements limit the agent's autonomy in that role.

5.2 Equilibrium Analysis

The role distribution p(r) = |{i : ri = r}| / |A| converges to a stationary distribution when the system is in the stable regime. At equilibrium, no agent can improve its utility by unilaterally changing roles — the standard Nash equilibrium condition. The shape of this distribution depends critically on the three parameters: task complexity Ctask, communication bandwidth Bcomm, and governance density D. High task complexity drives specialization (agents find it more efficient to focus on narrow roles). High communication bandwidth enables coordination (agents can maintain broader roles because they can coordinate with others). High governance density penalizes high-impact roles (because constraint costs are higher for roles with greater organizational influence).

5.3 Role Entropy as Organizational Health

The role entropy H(r) = −Σr p(r) log p(r) serves as a diagnostic metric for organizational health. Very low entropy means extreme specialization — a few roles dominate, most are empty. This indicates a stagnation regime where governance is too tight and agents have collapsed into a minimal set of permitted behaviors. Very high entropy means no specialization — agents are randomly distributed across roles with no coherent structure. This indicates a chaos regime where governance is too weak to enable coordinated specialization. Moderate entropy, with a clear distribution showing some roles more populated than others but no extreme concentration, indicates the stable specialization regime.

6. Convergence Conditions

6.1 Formal Statement

The agentic company converges to a stable operating point when limt→∞ E[||St+1 − St||] = 0. This requires three conditions to hold simultaneously: (1) Policy gradients are bounded — no agent's policy update can produce arbitrarily large changes in its behavior. This is ensured by the gate-constrained reinforcement learning framework where policy updates are gated: Πt+1 = Πt + η · ∇J(Πt) subject to risk-tiered approval. (2) Governance constraints are stable — the governance density Dt does not oscillate or drift unboundedly. This is ensured by the dynamic D adjustment algorithm which includes momentum terms and rate limiters. (3) Anomaly detection intervenes immediately — the Doctor system catches runaway agents before the effective gain (1 − κ(Dt))λmax(Wt) exceeds the stability boundary. The soft throttle at 0.85 reduces influence while the hard freeze at 0.92 eliminates it entirely.

6.2 Speed of Convergence

The convergence rate depends on the effective gain gt = (1 − κ(Dt))λmax(Wt). Define the time-varying stability margin δt = 1 − gt. Over a finite horizon [0, T], let δmin(T) = inf0≤t≤T δt. Larger margins produce faster convergence: the settling time scales as O(1/δmin) on that horizon. In quasi-stationary operation where δt varies slowly, this reduces to the familiar O(1/δ) shorthand with δ interpreted as a conservative lower bound. This has practical implications — organizations operating near the phase boundary (small δmin) converge slowly and are vulnerable to perturbations that push them over the edge. Organizations with large stability margins converge quickly and robustly resist perturbations. The governance density should be tuned not just to satisfy the stability condition but to provide a comfortable margin.

7. MARIA OS Implementation

7.1 Architecture Mapping

The theoretical framework maps directly to MARIA OS components. The organizational graph G corresponds to the Decision Graph — the network of agents, teams, and departments encoded in the MARIA coordinate system (G.U.P.Z.A). The governance density D corresponds to the Gate Engine — the collection of approval gates, risk thresholds, and responsibility boundaries that constrain agent actions. The reward function R corresponds to the Evidence Layer — the evidence bundles, audit trails, and outcome measurements that provide feedback on decision quality. The anomaly detection layer corresponds to the Doctor system — the Isolation Forest + Autoencoder dual detection mechanism that identifies deviant agent behavior.

7.2 Gate-Constrained Policy Updates

MARIA OS implements gated reinforcement learning through its risk-tiered gate system. Low-risk decisions (risk score ≤ 0.30, D ≤ 0.35) execute automatically — the agent acts and the system logs the outcome. Mid-risk decisions (risk score ≤ 0.60, D ≤ 0.65) require agent review — a peer agent validates the decision before execution. High-risk decisions (risk score > 0.60 or elevated spectral radius or anomaly rate) require human approval — a human decision-maker reviews and authorizes the action. This tiered structure ensures that governance intensity matches risk level, providing dense metacognitive coverage where it matters most while preserving throughput for routine operations.

7.3 Doctor as Metacognitive Safety Net

The Doctor system implements the anomaly detection component of organizational metacognition. Its dual architecture — Isolation Forest for tree-based anomaly detection and Autoencoder for reconstruction-based deviation measurement — ensures that both discrete behavioral anomalies (sudden changes in decision patterns) and continuous drift (gradual deviation from normal behavior) are detected. The combined anomaly score Acombined = α · s(x) + (1−α) · σ(ε(x)) triggers escalating responses: normal operation below 0.85, soft throttle (50% autonomy reduction) between 0.85 and 0.92, and hard freeze (complete halt with mandatory human review) above 0.92.

8. Experimental Validation

8.1 Planet-100 Simulation Design

We validate the theoretical framework through Planet-100 simulations — a simulation environment with 111 agents, 10 available roles, and tunable parameters for task complexity Ctask ∈ [1, 10], communication bandwidth Bcomm ∈ {low, medium, high}, and governance density D ∈ [0.1, 0.9]. Agents are initialized with flat role assignments, uniform policies, and random network connections. The simulation runs for 1000 epochs, and we observe role entropy, hierarchy depth, convergence time, anomaly rate, and reward maximization rate.

For the Planet-100 validation reported in this paper, we use the identity damping map κ(D) = D as the baseline specification (not jointly estimated). Learning a parametric κ from logs is deferred to future work.

8.2 Phase Diagram Reproduction

The simulation results confirm the predicted three-phase structure. In the stagnation phase (D > 0.7, Bcomm = low), role entropy drops to near zero within 50 epochs — agents collapse into a minimal set of safe behaviors, decision throughput falls to 15% of maximum, and the organization effectively stops functioning. In the chaos phase (D < 0.2, Bcomm = high), role entropy remains near maximum, the effective gain gt = (1 − κ(Dt))λmax(Wt) exceeds 1 within 20 epochs, and the system diverges with runaway agents producing cascading failures. In the stable specialization phase (D ∈ [0.3, 0.7], Bcomm = medium to high), role entropy converges to a moderate value, gt remains below 1 with margin, and meaningful role specialization emerges: optimizer roles, auditor roles, strategist roles, and coordinator roles form naturally.

8.3 Stability Law Validation

Across 500 simulation runs with different parameter configurations, we observe that 97.2% of runs where gt = (1 − κ(Dt))λmax(Wt) < 1 converge to stable equilibria, while 94.8% of runs where gt ≥ 1 exhibit divergent behavior. The small error rates (2.8% false stability, 5.2% false instability) are attributable to finite-time effects and stochastic perturbations, and diminish with longer simulation horizons. These results provide strong empirical support for gt < 1 as a practical sufficient condition and a near-necessary empirical boundary in this simulation setting.

9. Discussion

9.1 Governance as Phase Transition Controller

The most important conceptual contribution of this work is the reframing of governance from overhead to phase transition controller. Traditional enterprise thinking views governance as a cost — every approval gate slows decision-making, every evidence requirement adds work, every compliance check diverts attention from productive tasks. Our framework reveals that this view is structurally wrong. Governance does not merely slow the system down; it controls which phase the system occupies. Without governance, the system will inevitably drift toward chaos as influence propagation goes unbounded. With excessive governance, the system will stagnate. Governance density D is the control knob that positions the organization in the stable specialization regime.

9.2 Implications for AI Safety

The stability law has direct implications for AI safety in enterprise settings. Any system deploying multiple autonomous agents must monitor and control gt = (1 − κ(Dt))λmax(Wt) and keep it below 1. This provides a concrete, measurable safety criterion — estimate Wt from logs, compute its spectral radius, measure Dt from Top-K gate outcomes, and monitor the resulting gain continuously. When κ(D)=D, the threshold becomes Dt > 1 − 1/λmax(Wt) for λmax(Wt) > 1. The Doctor system provides the real-time monitoring capability, and the dynamic D adjustment algorithm provides the automatic correction mechanism. Together, these ensure that the system self-corrects toward stability rather than requiring external intervention.

9.3 Limitations and Future Work

Several limitations merit acknowledgment. The influence matrix Wt must be estimated from observed agent interactions, which introduces measurement error. The stability law assumes the influence matrix changes slowly relative to the convergence dynamics, which may not hold during organizational restructuring. The phase diagram is derived for homogeneous agent populations and may require modification for highly heterogeneous agent teams. Future work should jointly learn κ from data and extend the theory to multi-tier governance (company + market + regulation) as formalized in the civilization extension model.

10. Conclusion

Agentic company dynamics obey a stability law coupling influence propagation and governance density. The practical criterion (1 − κ(Dt))λmax(Wt) < 1 provides a concrete, measurable condition for organizational health, with Dt computed directly from router Top-K gate outcomes. Governance constraints are not overhead — they are the metacognitive layer that allows the organization to observe itself. MARIA OS provides a concrete systems architecture to enforce these conditions through its Decision Graph, Gate Engine, Evidence Layer, and Doctor anomaly detection system. The stable specialization regime, where meaningful role differentiation emerges under moderate governance, represents the target operating state for any agentic enterprise. The mathematics are clear: self-awareness is the price of self-organization.

References

1. Vaswani, A., et al. (2017). Attention is all you need. NeurIPS.

2. Sutton, R.S. & Barto, A.G. (2018). Reinforcement learning: An introduction. MIT Press.

3. Newman, M.E.J. (2010). Networks: An introduction. Oxford University Press.

4. Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly detection: A survey. ACM Computing Surveys, 41(3), 1-58.

5. Hofbauer, J. & Sigmund, K. (1998). Evolutionary games and population dynamics. Cambridge University Press.

6. Bernstein, D.S., et al. (2002). The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4), 819-840.

7. Li, J., et al. (2024). Multi-agent reinforcement learning: A selective overview of theories and algorithms. arXiv:1911.10635.

8. MARIA OS Documentation. (2026). Decision Pipeline Architecture. os.maria-code.ai/docs.

Metacognition in Agentic Companies: Why AI Systems Must Know What They Don't Know