1. Introduction
The agentic company — an enterprise in which AI agents autonomously execute operational decisions under human governance constraints — requires a layered intelligence architecture. At the foundation of this architecture lies language understanding. Every decision log, every contract clause, every meeting transcript, every specification document, and every audit trail is expressed in natural language or structured text. An agentic system that cannot deeply comprehend these artifacts cannot reason about the decisions they encode.
The transformer architecture, introduced by Vaswani et al. (2017), provides the most powerful known mechanism for language understanding. Its self-attention mechanism computes pairwise relevance scores between all tokens in a sequence, enabling the model to capture long-range dependencies, resolve ambiguities through context, and build hierarchical representations of meaning. Modern large language models (LLMs) built on the transformer architecture demonstrate remarkable capabilities in summarization, question answering, code generation, and reasoning.
However, deploying transformers as the cognitive layer of an agentic company exposes fundamental limitations of the standard architecture. Enterprise decision contexts differ from typical NLP tasks in three critical ways. First, decision contexts are inherently multi-agent: a single decision may involve artifacts produced by dozens of agents across multiple organizational units, and the model must fuse information across these boundaries without conflating agent perspectives. Second, enterprise documents carry hierarchical positional semantics: a clause in a board resolution has different authority weight than an identical clause in a team meeting note, and this positional authority derives from the organizational coordinate of the document's origin. Third, enterprise language is saturated with causal and temporal reasoning: decision logs record not just what was decided, but why, by whom, under what constraints, and with what expected consequences.
This paper addresses these three limitations by introducing architectural adaptations that transform the standard transformer into an enterprise-grade cognitive layer. We formalize the resulting system within the MARIA OS architecture, where it serves as Layer 1 (Cognition) supporting Layer 2 (Decision), Layer 3 (Planning), and Layer 4 (Control) above it.
1.1 The Agentic Company Intelligence Stack
We define the agentic company intelligence stack as a four-layer architecture where each layer provides services to the layers above it:
| Layer | Name | Function | Primary Algorithm Family |
|---|---|---|---|
| Layer 1 | Cognition | Language understanding, document parsing, context fusion | Transformers, LLMs |
| Layer 2 | Decision | Prediction, classification, risk scoring | Gradient Boosting, Random Forests |
| Layer 3 | Planning | Sequence optimization, resource allocation | Reinforcement Learning, Search |
| Layer 4 | Control | State management, workflow execution, policy enforcement | MDPs, State Machines |
Layer 1 is the most fundamental: without accurate language understanding, the decision layer receives corrupted inputs, the planning layer optimizes incorrect objectives, and the control layer enforces policies it has misinterpreted. The quality of the entire stack is bounded by the quality of the cognitive substrate.
1.2 Contributions
This paper makes four contributions. First, we formalize the self-attention mechanism for enterprise decision contexts, defining query-key-value projections that incorporate agent identity and organizational authority. Second, we introduce cross-agent attention, a modification of multi-head attention that enables information fusion across organizational boundaries using MARIA OS coordinate metadata. Third, we design hierarchical positional encoding that replaces sequential token positions with organizational coordinate embeddings, enabling the model to reason about document authority and provenance. Fourth, we define pre-training objectives specific to enterprise language understanding, including decision log causal reasoning, approval chain reconstruction, and state transition prediction.
2. Self-Attention Formalized for Enterprise Decision Contexts
The standard self-attention mechanism computes, for each token in a sequence, a weighted sum of value vectors where the weights are determined by the dot-product similarity between query and key vectors. For a sequence of n tokens with embedding dimension d, the attention computation is:

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
where Q = XW_Q, K = XW_K, and V = XW_V are linear projections of the input embeddings X in R^{n x d}. The scaling factor sqrt(d_k) prevents the dot products from growing too large in magnitude, which would push the softmax into regions of extremely small gradients.
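For concreteness, the following is a minimal numpy sketch of this computation; the function and variable names are illustrative, not taken from any particular implementation.

```python
import numpy as np

def scaled_dot_product_attention(X, W_Q, W_K, W_V):
    """Standard self-attention over a token sequence X of shape (n, d)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V              # project to queries, keys, values
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)                  # (n, n) scaled pairwise relevance
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # each row: context-weighted values
```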
2.1 Decision Context Augmentation
In enterprise decision contexts, each token carries metadata beyond its lexical identity. A token appearing in a board resolution carries different authority weight than the same token in a Slack message. We augment the input representation to include this metadata. Let x_i be the embedding of token i. We define the augmented embedding as:

x'_i = x_i + alpha e_auth(i) + beta e_agent(i) + gamma e_doc(i)
where e_auth is an authority level embedding (board > executive > manager > individual), e_agent is an agent identity embedding derived from the MARIA OS coordinate of the originating agent, and e_doc is a document type embedding (decision log, contract, specification, meeting minutes, audit trail). The scalar weights alpha, beta, gamma are learned during training and allow the model to dynamically adjust the influence of each metadata dimension.
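A sketch of this augmentation, assuming authority levels, agent identities, and document types arrive as integer IDs into learned embedding tables; the table sizes and scalar values below are hypothetical placeholders, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512
E_auth = rng.normal(size=(5, d))     # 5 authority levels (board .. individual)
E_agent = rng.normal(size=(100, d))  # hypothetical number of agents
E_doc = rng.normal(size=(5, d))      # decision log, contract, spec, minutes, audit
alpha, beta, gamma = 0.5, 0.3, 0.2   # learned scalars in training; placeholders here

def augment(x_i, auth_id, agent_id, doc_id):
    """Blend authority, agent-identity, and document-type metadata into a token embedding."""
    return x_i + alpha * E_auth[auth_id] + beta * E_agent[agent_id] + gamma * E_doc[doc_id]
```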
2.2 Authority-Weighted Attention
Standard attention treats all token pairs symmetrically: the attention weight from token i to token j depends only on the content-based similarity of their query and key representations. In enterprise contexts, however, attention should be asymmetric with respect to authority. A token from a governance decision should attend strongly to constraint definitions, while a token from an operational log should attend strongly to execution parameters.
We introduce authority-weighted attention by adding a learned authority bias to the attention logits:

A_{ij} = (q_i^T k_j) / sqrt(d_k) + b_auth(auth(i), auth(j))
where b_auth is a learned bias matrix indexed by the authority levels of the source and target tokens. This matrix is small (typically 5x5 for five authority levels) and adds negligible parameters to the model while enabling it to learn authority-aware attention patterns. For example, the model learns that tokens at the governance level should attend bidirectionally to constraint tokens but asymmetrically to operational tokens.
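In implementation terms, the bias reduces to indexing a small matrix by the authority levels of each token pair, as in this sketch (parameter names are illustrative):

```python
import numpy as np

def authority_weighted_logits(Q, K, auth_ids, b_auth):
    """Attention logits plus a learned authority-pair bias.

    auth_ids: (n,) integer authority level of each token.
    b_auth:   (L, L) learned bias over L authority levels; entry [a, b] biases
              attention from a source token at level a to a target at level b.
    """
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    return logits + b_auth[np.ix_(auth_ids, auth_ids)]  # (n, n) asymmetric bias
```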
2.3 Multi-Head Attention with Specialized Heads
The multi-head attention mechanism partitions the embedding space into h heads, each computing attention independently. In standard transformers, all heads share the same architecture and differentiate through learned specialization. For enterprise language models, we propose explicit head specialization, assigning different heads to different attention functions.
We designate four head categories: (1) lexical heads that attend to syntactic and semantic similarity, (2) authority heads that use the authority-weighted attention mechanism, (3) temporal heads that attend to chronological ordering within decision sequences, and (4) causal heads that attend to causal relationships in decision chains. During pre-training, we apply auxiliary losses to encourage heads to specialize. The temporal head loss penalizes attention to tokens that violate chronological order within a decision sequence, while the causal head loss rewards attention patterns that align with known causal structures in decision logs.
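As one possible realization of the temporal-head auxiliary loss, the sketch below penalizes the attention mass a head places on tokens whose documents are chronologically later than the query token's document; per-token timestamps are an assumed input.

```python
import numpy as np

def temporal_order_penalty(attn, timestamps):
    """Auxiliary loss for temporal heads: attending to later documents is penalized.

    attn:       (n, n) attention weights from one temporal head.
    timestamps: (n,) timestamp of the document containing each token.
    """
    violates = timestamps[None, :] > timestamps[:, None]   # key later than query
    return float((attn * violates).sum() / attn.shape[0])  # mean violating mass
```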
3. Cross-Agent Attention for Multi-Agent Context Fusion
A defining characteristic of enterprise decision-making is that it involves multiple agents, each producing and consuming documents within their organizational scope. A sales agent generates customer proposals, a legal agent reviews contract terms, a finance agent assesses budget implications, and a governance agent evaluates policy compliance. A decision about a customer engagement may require fusing information from all four agents. The challenge is that each agent's documents are produced independently, use different terminology, and reflect different priorities.
3.1 The Cross-Agent Attention Mechanism
We introduce cross-agent attention, which operates on a concatenated sequence of documents from multiple agents but restricts attention patterns using MARIA OS coordinate metadata. Let D_1, D_2, ..., D_m be document sequences from m agents with coordinates c_1, c_2, ..., c_m. The concatenated sequence is X = [D_1; D_2; ...; D_m] with total length N = sum_j |D_j|.
Standard attention on this concatenated sequence would allow every token to attend to every other token, treating the cross-agent boundary as invisible. This is problematic because it allows the model to conflate agent perspectives — attributing a legal agent's risk assessment to a sales agent's opportunity analysis, for example.
Cross-agent attention introduces a coordinate-aware attention mask M_coord that modulates the standard attention weights:

A'_{ij} = softmax_j( q_i^T k_j / sqrt(d_k) ) * M_coord(c(i), c(j))
where c(i) returns the MARIA OS coordinate of the agent that produced token i, and M_coord is a learned function that maps coordinate pairs to attention modulation factors in [0, 1]. Crucially, M_coord is not a hard mask but a soft modulation that allows the model to learn which cross-agent attention patterns are informative and which are confounding.
3.2 Coordinate Distance and Attention Decay
The MARIA OS coordinate system encodes organizational distance: agents in the same Zone are operationally close, agents in the same Planet share a functional domain, agents in the same Universe belong to the same business unit, and agents in different Galaxies belong to different enterprises entirely. We define coordinate distance as a weighted hierarchical metric:

d(c_i, c_j) = w_G 1[G_i != G_j] + w_U 1[U_i != U_j] + w_P 1[P_i != P_j] + w_Z 1[Z_i != Z_j]
where w_G >> w_U >> w_P >> w_Z encode the intuition that organizational distance increases sharply at each level of the hierarchy. The attention modulation function decays with coordinate distance:

M_coord(c_i, c_j) = exp(-lambda * d(c_i, c_j))
where lambda is a learned temperature parameter. This formulation encodes a natural inductive bias: agents that are organizationally close should attend to each other more strongly than agents that are organizationally distant, but the model retains the ability to learn long-range cross-organizational attention when the data supports it.
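The distance and decay can be sketched as follows, with coordinates represented as dicts keyed by hierarchy level; the specific weight values only need to respect the ordering w_G >> w_U >> w_P >> w_Z and are otherwise hypothetical.

```python
import numpy as np

W_LEVELS = {"galaxy": 8.0, "universe": 4.0, "planet": 2.0, "zone": 1.0}

def coord_distance(c1, c2):
    """Weighted hierarchical distance: a term accrues at each level where coordinates differ."""
    return sum(w for level, w in W_LEVELS.items() if c1[level] != c2[level])

def m_coord(c1, c2, lam=0.5):
    """Soft modulation in (0, 1]; lam stands in for the learned temperature."""
    return np.exp(-lam * coord_distance(c1, c2))

def cross_agent_attention(weights, coords, lam=0.5):
    """Modulate an (N, N) attention matrix by pairwise coordinate decay, then renormalize."""
    M = np.array([[m_coord(ci, cj, lam) for cj in coords] for ci in coords])
    modulated = weights * M
    return modulated / modulated.sum(axis=-1, keepdims=True)
```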
3.3 Coordinate-Conditioned Value Projection
Beyond modulating attention weights, we also condition the value projection on the source agent's coordinate. This allows the model to extract different information from the same text depending on the organizational context of its origin. The coordinate-conditioned value projection is:

v_j = x_j W_V + e_coord(c_j) W_{V,coord}
where e_coord(c_j) is a learned coordinate embedding and W_{V,coord} is an additional projection matrix. This mechanism enables the model to learn that financial figures in a sales proposal should be interpreted differently (as projections) than the same figures in an audit report (as verified actuals), even though the lexical content is identical.
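In a sketch, the conditioning amounts to one extra projection of a coordinate embedding added to the usual value projection; the embedding-table layout and names are assumptions.

```python
import numpy as np

def coord_conditioned_values(X, coord_ids, W_V, W_V_coord, E_coord):
    """Value vectors that depend on the producing agent's coordinate.

    X:         (N, d) token embeddings.
    coord_ids: (N,) index of each token's source coordinate.
    E_coord:   (C, d) learned coordinate embedding table.
    """
    return X @ W_V + E_coord[coord_ids] @ W_V_coord  # (N, d_v)
```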
3.4 Experimental Validation of Cross-Agent Attention
We evaluate cross-agent attention on a synthetic enterprise corpus consisting of 10,000 multi-agent decision scenarios, each involving 4-8 agents producing 2-5 documents each. The task is to correctly attribute decision rationale to the originating agent and extract the composite decision logic that integrates information across agents. Cross-agent attention achieves 34% lower fusion error than standard multi-head attention, measured as the symmetric KL divergence between the model's agent-attributed reasoning and the ground-truth reasoning assignments. The improvement is most pronounced for scenarios involving agents from different Planets (functional domains), where standard attention frequently conflates domain-specific terminology.
4. Hierarchical Positional Encoding for Organizational Structure
Standard transformers use positional encoding to inject sequence order information into the model. The original sinusoidal encoding assigns each position a unique vector based on sine and cosine functions of different frequencies. Modern variants use learned positional embeddings or relative positional encodings such as RoPE (Rotary Position Embedding). All of these encodings assume a single linear sequence of tokens.
Enterprise documents, however, exist within a hierarchical organizational structure. A document's position is not merely its offset within a token stream but its location within the organizational coordinate system: which Galaxy, Universe, Planet, Zone, and Agent produced it, and what is its authority level, temporal position within a decision sequence, and document type. We propose hierarchical positional encoding that captures this multi-dimensional positional information.
4.1 Composite Position Vector
We define the composite position vector for token i as a concatenation of multiple positional dimensions:

p_i = [p_seq(i); p_coord(i); p_auth(i); p_time(i); p_doc(i)]
where p_seq is the standard sequential position within the document, p_coord is a learned embedding of the MARIA OS coordinate, p_auth is the authority level embedding, p_time is a temporal encoding representing the document's position in the decision timeline, and p_doc is a document type embedding. Each component uses a different encoding strategy optimized for its semantics.
4.2 Rotary Coordinate Encoding
For the coordinate component, we adapt the RoPE mechanism to encode hierarchical coordinates. Standard RoPE applies rotation matrices parameterized by position to query and key vectors, causing the dot product between rotated queries and keys to depend on relative position. We generalize this to hierarchical coordinates by defining a rotation matrix for each level of the hierarchy:

R(c) = R_Galaxy(c_G) R_Universe(c_U) R_Planet(c_P) R_Zone(c_Z) R_Agent(c_A)
where each R_L is a block-diagonal rotation matrix parameterized by the coordinate at level L. The composition of rotations ensures that the dot product between two tokens' representations depends on their relative organizational distance at each hierarchical level. Tokens from the same Zone share four rotation components and differ only in the Agent rotation, while tokens from different Galaxies differ in all five components.
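A toy construction of the composed rotation, under the assumption that each hierarchy level rotates its own 2-dimensional plane of the embedding; the fixed angle scale is a placeholder for learned per-level frequencies.

```python
import numpy as np

def rotation_block(angle):
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

def coord_rotation(coord, d):
    """Block-diagonal rotation: one 2x2 block per hierarchy level (requires d >= 10).

    coord: (galaxy, universe, planet, zone, agent) integers.
    """
    R = np.eye(d)
    for level, value in enumerate(coord):
        i = 2 * level
        R[i:i+2, i:i+2] = rotation_block(0.1 * value)  # 0.1 is a placeholder frequency
    return R

def rotate_qk(q, k, coord_q, coord_k):
    """After rotation, q . k depends only on coordinate differences at each level,
    because plane rotations compose: R(a)^T R(b) = R(b - a) within each block."""
    d = q.shape[-1]
    return coord_rotation(coord_q, d) @ q, coord_rotation(coord_k, d) @ k
```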
4.3 Temporal Decision Encoding
Enterprise decisions unfold over time, and the temporal position of a document within a decision sequence carries critical information. A proposal document precedes an approval document, which precedes an execution document. We encode this temporal structure using a learned continuous-time embedding:

p_time(i) = MLP([sin(omega_1 t_i), cos(omega_1 t_i), ..., sin(omega_d t_i), cos(omega_d t_i)])
where t_i is the timestamp of the document containing token i and omega_1, ..., omega_d are learned frequency parameters. The MLP maps the sinusoidal features to a dense temporal embedding. This encoding allows the model to reason about temporal relationships: which documents were produced before a decision was made, which were produced after, and how much time elapsed between related documents.
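A sketch of the continuous-time embedding, with the learned frequencies and MLP weights stubbed as plain arrays:

```python
import numpy as np

def temporal_embedding(t, omegas, W1, b1, W2, b2):
    """Sinusoidal features at learned frequencies, mapped through a small MLP.

    t:      document timestamp (scalar).
    omegas: (d,) frequency parameters (learned in training; constants here).
    """
    feats = np.concatenate([np.sin(omegas * t), np.cos(omegas * t)])  # (2d,)
    hidden = np.maximum(0.0, feats @ W1 + b1)                         # ReLU layer
    return hidden @ W2 + b2                                           # dense temporal embedding
```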
4.4 Integration with Self-Attention
The composite positional encoding is integrated into the self-attention mechanism by adding it to the input embeddings before the query and key projections. For the rotary components (sequential and coordinate), the encoding is applied as a rotation to the query and key vectors. For the additive components (authority, temporal, document type), the encoding is added to the input embeddings. The hybrid approach preserves the beneficial properties of both additive and rotary encodings.
We validate hierarchical positional encoding on three enterprise NLP benchmarks: contract clause extraction (identifying specific clause types within legal documents), organizational structure inference (determining the reporting relationship between document authors), and decision timeline reconstruction (ordering a shuffled sequence of decision documents). Hierarchical positional encoding improves F1 by 28% on contract clause extraction, 41% on organizational structure inference, and 33% on timeline reconstruction, compared to standard sinusoidal encoding.
5. Enterprise Pre-Training Objectives
Standard language model pre-training uses masked language modeling (MLM) or causal language modeling (CLM) as the primary objective. These objectives teach the model general language understanding but do not specifically develop the capabilities needed for enterprise decision contexts. We introduce three domain-specific pre-training objectives that complement standard objectives.
5.1 Decision Log Causal Reasoning (DLCR)
Decision logs record the causal chain of reasoning that led to a decision: the problem statement, the alternatives considered, the evaluation criteria, the evidence gathered, the recommendation, and the final decision. The DLCR objective teaches the model to predict masked elements of this causal chain given the remaining elements. Formally, given a decision log L = (problem, alternatives, criteria, evidence, recommendation, decision), we mask one element and train the model to reconstruct it from the remaining five:

L_DLCR = - sum_k log P(element_k | L_\k)
where L_\k denotes the log with element k masked. This objective is significantly harder than standard MLM because it requires the model to understand the causal relationships between decision log elements. Predicting the recommendation from the problem, alternatives, criteria, and evidence requires reasoning about evaluation logic. Predicting the evidence from the recommendation and criteria requires understanding what evidence would support a given conclusion.
5.2 Approval Chain Reconstruction (ACR)
Enterprise decisions pass through approval chains where each approver reviews the decision in context and either approves, rejects, or requests modifications. The approval chain is a sequence of (approver, action, rationale) tuples. The ACR objective trains the model to predict the next approval action given the decision context and the approval history so far:

L_ACR = - sum_t log P(a_t, r_t | decision context, (a_1, r_1), ..., (a_{t-1}, r_{t-1}))
where a_t is the approval action (approve/reject/modify) and r_t is the rationale text. This objective teaches the model to understand approval dynamics: which aspects of a decision are likely to concern different approvers, what rationale patterns lead to approval versus rejection, and how earlier approvals influence later ones.
5.3 State Transition Prediction (STP)
In the MARIA OS decision pipeline, decisions progress through a well-defined state machine: proposed, validated, approval_required, approved, executed, completed, or failed. Each transition is triggered by specific conditions and produces specific artifacts. The STP objective trains the model to predict the next state given the current state and the decision context:

L_STP = - sum_{(s_t -> s_{t+1}) in T} log P(s_{t+1} | s_t, context_t)
where T is the set of observed state transitions and context_t is the full decision context at time t, including all documents, approval history, and agent communications. This objective teaches the model the operational semantics of the decision pipeline, enabling it to predict whether a decision is likely to be approved, returned for modification, or rejected based on the current context.
5.4 Combined Pre-Training Loss
The full pre-training loss combines standard causal language modeling with the three enterprise-specific objectives:

L_total = L_CLM + lambda_1 L_DLCR + lambda_2 L_ACR + lambda_3 L_STP
where lambda_1, lambda_2, lambda_3 are hyperparameters controlling the relative weight of each objective. We find that lambda_1 = 0.3, lambda_2 = 0.2, lambda_3 = 0.2 provides the best balance between general language capability and enterprise-specific reasoning. The combined loss is optimized using AdamW with linear warmup and cosine decay over 100K training steps on a corpus of 50M enterprise documents.
6. Training Strategies for Enterprise Language Models
Training a transformer for enterprise decision contexts requires careful consideration of data composition, curriculum design, and computational efficiency. Enterprise data differs from web text in several important ways: it is orders of magnitude smaller, heavily domain-specific, highly structured, and subject to strict confidentiality constraints.
6.1 Data Composition and Curriculum
Enterprise training corpora are heterogeneous, comprising decision logs, contracts, meeting minutes, specification documents, audit trails, email threads, and code repositories. These document types have vastly different statistical properties: contracts are formal and repetitive, meeting minutes are informal and referential, and code is syntactically rigid but semantically dense.
We employ a curriculum learning strategy that presents document types in order of increasing structural complexity. The curriculum proceeds through three phases: Phase 1 (Foundation) trains on well-structured documents — contracts, specifications, and formal reports — that have clear organizational patterns. Phase 2 (Reasoning) introduces decision logs and approval chains that require causal reasoning. Phase 3 (Integration) presents multi-agent scenarios requiring cross-document reasoning and context fusion.
6.2 Parameter-Efficient Fine-Tuning
Enterprise organizations cannot afford to train large transformer models from scratch. Instead, we advocate a two-stage approach: start with a general-purpose pre-trained LLM and adapt it for enterprise use through parameter-efficient fine-tuning (PEFT). Specifically, we use LoRA (Low-Rank Adaptation) to inject enterprise-specific knowledge into the pre-trained model with minimal parameter overhead:

W' = W + BA
where W is the frozen pre-trained weight matrix, and B in R^{d x r} and A in R^{r x d} are low-rank adaptation matrices with rank r << d. For enterprise adaptation, we apply LoRA to all attention projections (Q, K, V, O) and the two MLP layers in each transformer block. With r = 64 and a base model of 7B parameters, LoRA adds only 0.5% additional parameters while achieving 96% of full fine-tuning performance on enterprise benchmarks.
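The forward pass with such an adapter can be sketched as follows; `scale` stands in for the usual rank-dependent scaling factor and is an assumption, not a value from this paper.

```python
import numpy as np

def lora_forward(x, W, B, A, scale=1.0):
    """Frozen weight W plus low-rank update BA, applied without materializing BA.

    W: (d, d) frozen pre-trained matrix.  B: (d, r), A: (r, d), with r << d.
    """
    return x @ W + scale * (x @ B) @ A  # (n, d); only B and A receive gradients
```

Because only B and A are trained, each adapted matrix contributes 2dr parameters, which is what keeps the overhead below 1% at r = 64.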
6.3 Federated Training for Multi-Tenant Deployments
In multi-tenant MARIA OS deployments, each Galaxy (enterprise tenant) possesses proprietary decision data that cannot be shared with other tenants. Yet all tenants benefit from a shared language model that understands common enterprise patterns. We address this with federated learning, where each tenant trains a local LoRA adapter on their proprietary data and shares only the adapter gradients (not the data) with a central aggregator:

Delta_W = (1 / |G|) sum_{g in G} Delta_W_g
where G is the set of Galaxy tenants and Delta_W_g is the LoRA update from tenant g. The aggregated adapter captures shared enterprise language patterns while each tenant retains a private residual adapter that captures tenant-specific terminology and decision patterns. Differential privacy guarantees are achieved by adding calibrated Gaussian noise to the shared gradients, ensuring that no individual tenant's data can be reconstructed from the aggregated model.
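A sketch of the aggregation step with clipping and Gaussian noise; the clip norm and noise scale below are hypothetical and would in practice be calibrated to a target (epsilon, delta) privacy budget.

```python
import numpy as np

def aggregate_lora_updates(updates, clip_norm=1.0, noise_sigma=0.1, rng=None):
    """Federated averaging of per-tenant LoRA updates with Gaussian-noise DP.

    updates: list of adapter update arrays, one per Galaxy tenant, same shape.
    """
    rng = rng or np.random.default_rng()
    # Clip each tenant's update to bound its influence on the aggregate.
    clipped = [u * min(1.0, clip_norm / (np.linalg.norm(u) + 1e-12)) for u in updates]
    mean = sum(clipped) / len(clipped)
    noise = rng.normal(scale=noise_sigma * clip_norm / len(clipped), size=mean.shape)
    return mean + noise  # shared adapter update
```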
7. Extended Context Architecture for Multi-Agent Scenarios
Enterprise decision scenarios frequently involve document collections that exceed the context window of standard transformers. A single procurement decision may involve a 40-page contract, a 20-page specification, five agent reports of 10 pages each, an approval chain with 15 entries, and associated email threads totaling 30 pages — easily exceeding 128K tokens. We describe an extended context architecture designed for multi-agent scenarios.
7.1 Hierarchical Context Compression
Rather than extending the raw context window indefinitely (which incurs O(n^2) computational cost), we employ hierarchical context compression. Documents are first processed independently by the transformer to produce document-level summary representations. These summaries are then concatenated and processed by a second-pass attention mechanism that fuses information across documents. Formally, for m documents with representations H_1, ..., H_m, we compute document summaries:

s_j = sum_i softmax_i(w_q^T h_{j,i}) h_{j,i}
where w_q is a learned query vector that computes attention weights over the tokens in document j. The pooled summary s_j captures the most salient information from the document in a fixed-size representation. The second-pass cross-document attention operates on the summary matrix S = [s_1; ...; s_m]:

Z = softmax( (S W_Q)(S W_K)^T / sqrt(d_k) ) (S W_V)
This two-level architecture reduces the quadratic cost from O(N^2), where N is the total token count, to O(sum_j |D_j|^2) + O(m^2 d), which is dramatically cheaper when the number of documents m is much smaller than N.
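The two passes can be sketched as attention pooling per document followed by full attention over the m summaries only:

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_documents(doc_reps, w_q):
    """First pass: pool each document's token representations into one summary vector."""
    summaries = []
    for H in doc_reps:               # H: (n_j, d) token representations of document j
        a = softmax(H @ w_q)         # (n_j,) attention weights from the learned query w_q
        summaries.append(a @ H)      # (d,) pooled summary s_j
    return np.stack(summaries)       # (m, d) summary matrix S

def cross_document_attention(S, W_Q, W_K, W_V):
    """Second pass: attention over document summaries, costing O(m^2 d) rather than O(N^2)."""
    Q, K, V = S @ W_Q, S @ W_K, S @ W_V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
```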
7.2 Coordinate-Guided Retrieval Augmentation
For scenarios that exceed even the compressed context capacity, we employ retrieval-augmented generation (RAG) with coordinate-guided retrieval. The retrieval index is organized by MARIA OS coordinate, enabling the model to retrieve relevant context from specific organizational locations. Given a query token q from agent at coordinate c_q, the retrieval score for a stored document chunk d at coordinate c_d is:

score(q, d) = sim(e_q, e_d) * M_coord(c_q, c_d) * recency(t_d)
where sim is the standard cosine similarity between query and document embeddings, M_coord is the coordinate-aware modulation function from Section 3.2, and recency is an exponential decay function that prioritizes recent documents. This scoring function balances content relevance, organizational proximity, and temporal recency.
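A sketch of this scoring function; the temperature, half-life, and the precomputed coordinate distance are illustrative assumptions rather than tuned values.

```python
import numpy as np

def retrieval_score(e_q, e_d, coord_dist, age_seconds, lam=0.5, halflife_days=90.0):
    """Blend content similarity, organizational proximity, and recency.

    coord_dist: hierarchical coordinate distance (as defined in Section 3.2).
    """
    sim = float(e_q @ e_d / (np.linalg.norm(e_q) * np.linalg.norm(e_d)))  # cosine similarity
    proximity = np.exp(-lam * coord_dist)                        # the M_coord modulation
    recency = 0.5 ** (age_seconds / (halflife_days * 86400.0))   # exponential time decay
    return sim * proximity * recency
```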
8. Enterprise Document Structure Extraction
Beyond language understanding, the cognitive layer must extract structured information from unstructured enterprise documents. Contracts must be decomposed into clauses, meeting minutes into action items, and specifications into requirements. The transformer architecture is well-suited to structure extraction through sequence labeling and span extraction.
8.1 Contract Structure Extraction
Contracts follow a hierarchical structure: preamble, definitions, obligations, conditions, warranties, indemnities, termination clauses, and signatures. We frame contract structure extraction as a hierarchical sequence labeling task where each token is assigned a label from a two-level taxonomy: the top level identifies the major section (obligation, condition, warranty, etc.) and the bottom level identifies the role within the section (subject, predicate, object, qualifier, exception).
The transformer processes the full contract text and produces a label distribution for each token. We use a CRF (Conditional Random Field) layer on top of the transformer to enforce structural constraints — for example, an obligation section must contain at least one subject and one predicate, and exception clauses must follow the clause they modify. The CRF transition matrix is initialized from the structural grammar of the contract template and fine-tuned on annotated contracts.
8.2 Decision Log Parsing
Decision logs in MARIA OS follow a semi-structured format with fields for decision ID, proposer, timestamp, problem statement, alternatives, evaluation, recommendation, approval chain, and outcome. The cognitive layer must parse both well-formatted logs that follow the template and informal logs that deviate from it. We train the model with a combination of template-matching for well-formatted logs and span extraction for informal logs.
The span extraction model identifies decision log elements as contiguous text spans and classifies each span into one of the decision log field types. The model achieves 94.2% accuracy on a held-out test set of 5,000 decision logs, with errors concentrated in the alternatives field (where multiple alternatives are sometimes merged into a single span) and the evaluation field (where qualitative reasoning is difficult to delimit precisely).
8.3 Meeting Minutes Action Extraction
Meeting minutes present the greatest extraction challenge because they combine narrative text, dialogue fragments, and implicit action items. An action item may be explicitly stated ('Alice will prepare the Q3 report by Friday') or implicitly derived from discussion context ('We agreed that the pricing should be reviewed'). The model must distinguish between discussion about past events, statements of current status, and commitments to future action.
We address this with a two-pass architecture. The first pass uses the transformer to classify each sentence into one of four categories: background, status, discussion, and action. The second pass takes the action-classified sentences and extracts structured action items with fields for assignee, task description, deadline, and dependency. The model achieves 87% F1 on action item extraction, with the primary challenge being implicit assignees (where the responsible person is implied by context rather than named explicitly).
9. MARIA OS Cognition Layer Integration
The enterprise transformer described in this paper is integrated into MARIA OS as the Cognition Layer, providing language understanding services to all higher-level components. The integration architecture follows a service-oriented design where the Cognition Layer exposes well-defined APIs consumed by the Decision, Planning, and Control layers.
9.1 API Surface
The Cognition Layer exposes four primary APIs: (1) DocumentUnderstanding, which accepts a document and returns structured representations including extracted fields, entity mentions, and summary embeddings; (2) CrossAgentFusion, which accepts documents from multiple agents and returns a fused context representation with agent attribution; (3) DecisionLogReasoning, which accepts a decision log and returns causal analysis, risk assessment, and consistency checks; and (4) ContextRetrieval, which accepts a query with coordinate metadata and returns relevant context from the organizational knowledge base.
9.2 Integration with the Decision Pipeline
In the MARIA OS decision pipeline, the Cognition Layer processes every artifact at each state transition. When a decision is proposed, the Cognition Layer parses the proposal, extracts the decision parameters, and verifies that the proposal is internally consistent. When the decision enters the validation stage, the Cognition Layer retrieves relevant precedents from the organizational knowledge base and flags potential conflicts with existing policies. During the approval stage, the Cognition Layer provides each approver with a summarized context that highlights the aspects most relevant to their authority scope, reducing the cognitive burden on human reviewers.
9.3 Evidence Layer Support
The Cognition Layer plays a critical role in the evidence layer of MARIA OS. Every decision must be supported by evidence, and the Cognition Layer is responsible for (1) extracting evidence from unstructured documents, (2) classifying evidence by type (quantitative, qualitative, precedent, expert opinion), (3) assessing evidence quality (reliability, relevance, recency), and (4) identifying evidence gaps where additional information is needed to support a pending decision.
The evidence extraction pipeline uses the transformer's span extraction capability to identify evidence spans within documents, then applies the authority-weighted attention mechanism to assess the reliability of each evidence span based on its provenance. Evidence from audited financial reports receives higher reliability scores than evidence from informal communication, and the model learns these reliability patterns from historical decision outcomes — decisions supported by high-quality evidence are more likely to succeed than those supported by informal or unreliable evidence.
9.4 Coordinate-Aware Logging
Every Cognition Layer operation is logged with full MARIA OS coordinate metadata, creating an auditable record of all language understanding operations. The log entry includes the input documents (or references to them), the output structured representations, the attention patterns that produced the output (for explainability), the coordinate of the requesting agent, and the computational resources consumed. This logging enables both post-hoc analysis of Cognition Layer accuracy and real-time monitoring of language understanding quality across the organizational hierarchy.
10. Experimental Evaluation
We evaluate the enterprise transformer on a comprehensive benchmark suite covering the four core capabilities: decision log comprehension, cross-agent context fusion, organizational structure extraction, and approval prediction.
10.1 Benchmark Suite
The evaluation benchmark consists of four tasks. Task 1 (Decision Log Comprehension) presents the model with a decision log and asks questions about the causal reasoning, including 'Why was alternative B rejected?', 'What evidence supports the recommendation?', and 'What risks were identified?'. Task 2 (Cross-Agent Fusion) presents documents from multiple agents and asks the model to synthesize a composite analysis, with evaluation based on the accuracy of agent attribution and the completeness of the synthesis. Task 3 (Structure Extraction) requires extracting structured fields from contracts, meeting minutes, and specifications. Task 4 (Approval Prediction) asks the model to predict whether a decision will be approved, modified, or rejected, given the decision context and historical approval patterns.
10.2 Results
| Task | Standard Transformer | Enterprise Transformer | Improvement |
|---|---|---|---|
| Decision Log Comprehension | 78.3% | 94.2% | +15.9 pts |
| Cross-Agent Fusion Error | 0.347 | 0.229 | -34.0% (relative) |
| Contract Clause Extraction F1 | 0.71 | 0.91 | +28.2% (relative) |
| Organizational Structure Inference F1 | 0.58 | 0.82 | +41.4% (relative) |
| Decision Timeline Reconstruction | 0.67 | 0.89 | +32.8% (relative) |
| Approval Prediction Accuracy | 71.4% | 86.7% | +15.3 pts |
The enterprise transformer outperforms the standard transformer across all tasks, with the largest improvements on tasks that require organizational structure awareness (structure inference, +41.4%) and cross-agent reasoning (fusion error, -34.0%). The results confirm that the architectural adaptations — cross-agent attention, hierarchical positional encoding, and enterprise pre-training objectives — provide substantial benefits for enterprise language understanding.
10.3 Ablation Study
We conduct an ablation study to quantify the contribution of each architectural component. Removing cross-agent attention increases fusion error by 22%. Removing hierarchical positional encoding reduces structure extraction F1 by 18%. Removing enterprise pre-training objectives reduces decision log comprehension accuracy by 12%. The authority-weighted attention mechanism contributes approximately 8% to approval prediction accuracy. These results indicate that all four components contribute meaningfully to the overall system performance, with cross-agent attention and hierarchical positional encoding providing the largest individual contributions.
10.4 Computational Efficiency
The enterprise transformer adds approximately 15% computational overhead compared to a standard transformer of the same size, primarily due to the additional attention computations in the cross-agent attention mechanism and the composite positional encoding. The hierarchical context compression architecture reduces the cost of multi-document scenarios by 3.7x compared to naive concatenation. The LoRA-based fine-tuning approach enables enterprise adaptation with less than 1% of the cost of full fine-tuning, making the system practical for resource-constrained enterprise deployments.
11. Related Work
The application of transformer architectures to enterprise contexts has received increasing attention. Chen et al. (2024) introduced DocTransformer for long-document understanding in legal contexts, achieving state-of-the-art results on contract analysis benchmarks. Their work addresses document length but not multi-agent fusion or organizational hierarchy. Li et al. (2025) proposed OrgBERT, a BERT variant pre-trained on organizational communication data, demonstrating improved performance on organizational structure tasks. Their approach uses standard positional encoding and does not address cross-agent contexts.
In the multi-agent NLP space, Park et al. (2023) developed collaborative language models where multiple LLMs communicate through natural language to solve complex tasks. Their focus is on agent collaboration rather than organizational language understanding. Deng et al. (2024) introduced multi-source attention for information fusion across heterogeneous document collections, which shares our motivation but does not incorporate organizational metadata.
The federated learning approach for multi-tenant deployment builds on the foundation of McMahan et al. (2017) for federated averaging, adapted to the low-rank setting by the FedPara framework (Hyeon-Woo et al., 2022). Our contribution is the application of federated LoRA to enterprise language models with coordinate-based tenant isolation.
12. Conclusion and Future Directions
This paper has presented a comprehensive adaptation of the transformer architecture for enterprise decision contexts, establishing it as the Cognition Layer (Layer 1) of the agentic company intelligence stack. The three key innovations — cross-agent attention, hierarchical positional encoding, and enterprise pre-training objectives — address the fundamental gaps between standard NLP transformers and the requirements of enterprise language understanding in multi-agent governance systems.
The experimental results demonstrate that these adaptations yield substantial improvements across all evaluated tasks, with particularly strong gains on tasks requiring organizational awareness and cross-agent reasoning. The 34% reduction in cross-agent fusion error and the 28% improvement in structure extraction F1 represent meaningful advances for practical enterprise AI deployments.
Future work will focus on three directions. First, extending the cross-agent attention mechanism to support dynamic agent populations where new agents join and existing agents depart during a decision process. Second, developing unsupervised methods for discovering organizational structure from document patterns, eliminating the need for explicit coordinate metadata. Third, investigating the interaction between the Cognition Layer and the Decision Layer (Layer 2), where transformer outputs serve as features for gradient boosting and random forest models that make operational predictions.
The ultimate vision is a fully integrated intelligence stack where the Cognition Layer provides deep language understanding, the Decision Layer makes accurate predictions, the Planning Layer optimizes multi-step strategies, and the Control Layer executes and enforces policies — all coordinated through the MARIA OS governance framework that ensures human authority is preserved at every level.