Engineering | June 1, 2026 | 19 min read

Why AI Agents Fail at Real Work: It Is Not the LLM, It Is the Harness Shortage

Understanding why agents work in PoC but never reach production — through the design of purpose, authority, memory, stop conditions, recovery paths, and audit trails

Engineering Case Study

Applies established engineering and mathematical methods to MARIA OS implementation and industry operations. The value is reproducible design, not novelty theater.

Provenance: ARIA-WRITE-01 G1.U1.P9.Z2.A1
Reviewed by: ARIA-TECH-01, ARIA-RD-01

Editorial Intent

This article is not another "what is an AI agent" explainer. It explains why AI agents fail in enterprise adoption — not as a matter of LLM performance, but as a harness shortage. The intended readers are business owners who ran an AI agent PoC but cannot take it to production, CTOs, IT departments, DX promotion teams, and executives driving AI adoption.

The conclusion is clear. The primary reason AI agents fail at real work is not that the model is slightly less intelligent than it could be. It is the absence of a harness that encloses purpose, authority, memory, quality, stop conditions, recovery paths, and audit trails. This is also exactly why MARIA OS places the Dynamic Harness at its core.


1. Why Agents Work in PoC but Stall in Production

AI agent demos work well. They read email, summarize it, create tasks, check calendars, update the CRM, and post to Slack. Cut into a short video, it looks like the future has arrived. Even in a PoC, results come through — as long as the agent runs on limited data, with limited permissions, close to the person in charge.

But as production approaches, things suddenly get hard. Permissions multiply. Exceptions multiply. Stale data gets mixed in. Rules conflict across departments. The AI does not know the tacit knowledge humans were assuming. Nobody has decided who fixes things when the AI gets it wrong. Neither the frontline nor management knows how far the AI is allowed to execute autonomously.

At this point, many organizations conclude that "the model is still too weak." Model capability matters, of course. But swapping in the latest model often reproduces the same failures. The problem is not only reasoning capability — it is operational structure.

An AI agent is not a clever function. It is an executing entity that holds goals, reads memory, calls tools, affects others, and changes state across time. Without a harness enclosing it, the smarter the model gets, the farther its failures reach.


2. What Is a Harness?

A harness is the operational frame for running an AI agent safely. It is not merely a test. It is the mechanism that defines what goal the agent acts toward, which permissions it holds, which information it grounds its decisions in, where it stops, how it recovers from failure, and to whom accountability is recorded.

In software development, a test harness runs the target code under fixed conditions and compares results against expectations. The harness for an AI agent is broader. It must cover before execution, during execution, and after execution.

Before execution, it verifies the goal, the inputs, the permissions, the prohibitions, the data the agent may reference, and the approvals required. During execution, it monitors tool calls, reasoning drift, cost, latency, responsibility boundaries, and risk signals. After execution, it records the results, the evidence, the impact, whether corrections are needed, the audit log, and recurrence prevention.
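The pre-execution part of that description can be sketched in a few lines. This is a minimal illustration, not the MARIA OS implementation: `HarnessPolicy`, `ExecutionRequest`, `preflight`, and every field and value below are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class HarnessPolicy:
    """Hypothetical pre-execution policy; every field name is an assumption."""
    allowed_goals: set
    allowed_tools: set
    allowed_data: set
    required_approvals: dict = field(default_factory=dict)  # tool -> approver roles


@dataclass
class ExecutionRequest:
    """Hypothetical description of one action the agent wants to take."""
    goal: str
    tool: str
    data_sources: set = field(default_factory=set)
    approvals: set = field(default_factory=set)


def preflight(request: ExecutionRequest, policy: HarnessPolicy) -> list:
    """Return the reasons to block; an empty list means the action may proceed."""
    problems = []
    if request.goal not in policy.allowed_goals:
        problems.append(f"goal out of scope: {request.goal}")
    if request.tool not in policy.allowed_tools:
        problems.append(f"tool not permitted: {request.tool}")
    for source in sorted(request.data_sources - policy.allowed_data):
        problems.append(f"data source not permitted: {source}")
    needed = policy.required_approvals.get(request.tool, set())
    for role in sorted(needed - request.approvals):
        problems.append(f"missing approval: {role}")
    return problems


policy = HarnessPolicy(
    allowed_goals={"invoice_processing"},
    allowed_tools={"read_invoice", "post_payment"},
    allowed_data={"erp_invoices"},
    required_approvals={"post_payment": {"finance_manager"}},
)
request = ExecutionRequest(
    goal="invoice_processing",
    tool="post_payment",
    data_sources={"erp_invoices", "old_spreadsheet"},
)
print(preflight(request, policy))
```

Returning the list of reasons, rather than a bare yes or no, means the same check can feed the audit log and the handoff to a human approver.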

In other words, a harness is less a restriction on AI than the precondition for entrusting work to it. The weaker an organization's harness, the less it can expand autonomy. The stronger the harness, the more an organization can decide, in graduated steps, how much can be delegated.


3. Typical Failures of Enterprise AI Agents

The failures that commonly occur in enterprise adoption are not just model hallucinations. More often, they are far more mundane.

First, goal drift. An agent that started as an invoice-processing assistant begins taking on exception handling, payment reminders, accounting judgments, and customer communication. Humans do not stop it, because it is convenient. But nobody has defined where its authority ends.

Second, memory drift. Old rules, past stopgap fixes, and exceptions made for one specific customer remain in memory and bleed into new decisions. The AI appears to hold context, but it does not know whether that context is still valid today.
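One way to make "is this context still valid today?" checkable is to store a scope and an expiry with every memory entry. The sketch below assumes a hypothetical `MemoryEntry` record and a `split_memory` helper; the field names and sample entries are invented for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class MemoryEntry:
    """Hypothetical memory record; the fields are assumptions for illustration."""
    text: str
    scope: str                    # "global" or a specific customer id
    valid_until: Optional[date]   # None means nobody recorded an expiry


def split_memory(entries, customer_id, today):
    """Separate entries that may ground a decision from those needing review."""
    usable, needs_review = [], []
    for entry in entries:
        if entry.scope not in ("global", customer_id):
            continue  # an exception made for another customer must not leak in
        if entry.valid_until is None:
            needs_review.append(entry)  # validity unknown: do not trust silently
        elif entry.valid_until >= today:
            usable.append(entry)
        # expired entries are left out of the decision context
    return usable, needs_review


entries = [
    MemoryEntry("Net-30 payment terms", "global", date(2027, 1, 1)),
    MemoryEntry("Waive late fee", "customer-42", date(2026, 12, 31)),
    MemoryEntry("Old approval threshold", "global", date(2025, 3, 31)),
    MemoryEntry("Stopgap: skip PO check", "global", None),
]
usable, needs_review = split_memory(entries, "customer-7", date(2026, 6, 1))
print([e.text for e in usable], [e.text for e in needs_review])
```

Entries with no recorded expiry are routed to review instead of being silently trusted, which is where past stopgap fixes tend to hide.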

Third, the responsibility vacuum. When the AI proposes, a human approves, and a different department executes, who owns the failure? The AI vendor? The approver? The business owner? The system administrator? Go to production with this left ambiguous, and when an incident occurs, no one can move the decision forward.

Fourth, the absence of recovery paths. When the AI fails, is the answer to re-run, to roll back, to switch to manual processing, to contact the customer, or to freeze the logs? None of it has been decided. More dangerous than the failure itself is the organization freezing after the failure.
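Deciding recovery paths in advance can be as simple as a lookup table agreed before go-live. The following is a minimal sketch; the failure categories, the `RECOVERY_PLAN` table, and the conservative default are assumptions for illustration, not a prescribed taxonomy.

```python
from enum import Enum


class Recovery(Enum):
    RERUN = "re-run"
    ROLLBACK = "roll back"
    MANUAL = "switch to manual processing"
    CONTACT_CUSTOMER = "contact the customer"
    FREEZE_LOGS = "freeze the logs"


# Hypothetical failure categories mapped to ordered recovery steps.
RECOVERY_PLAN = {
    "transient_tool_error": [Recovery.RERUN],
    "wrong_record_updated": [Recovery.FREEZE_LOGS, Recovery.ROLLBACK],
    "wrong_message_sent": [Recovery.FREEZE_LOGS, Recovery.CONTACT_CUSTOMER],
}

# Unknown failures get a conservative default so that nobody has to improvise.
DEFAULT_PLAN = [Recovery.FREEZE_LOGS, Recovery.MANUAL]


def recovery_steps(failure_kind: str) -> list:
    """Look up the agreed recovery steps for a failure category."""
    return RECOVERY_PLAN.get(failure_kind, DEFAULT_PLAN)


for kind in ("transient_tool_error", "something_new"):
    print(kind, "->", [step.value for step in recovery_steps(kind)])
```

The default entry matters most: it is what keeps the organization from freezing when a failure does not match any category it anticipated.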

Fifth, insufficient audit trails. There is no record of why the AI made that judgment, which data it looked at, which rules it followed, or who approved it. Convenient while everything succeeds — but inexplicable the moment trouble hits.
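The four gaps named above (why, which data, which rules, who approved) map directly onto the fields of an audit record. A minimal sketch follows, assuming a hypothetical `AuditRecord` written as one JSON line per decision; the identifiers in the example are invented.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit entry; field names are assumptions for illustration."""
    decision: str      # what the agent decided and why, in one sentence
    data_refs: tuple   # identifiers of the data it looked at
    rule_ids: tuple    # identifiers of the rules it followed
    approved_by: str   # who approved it
    decided_at: str    # ISO 8601 timestamp supplied by the caller


def to_log_line(record: AuditRecord) -> str:
    """Serialize with sorted keys so identical records produce identical lines."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    decision="Hold payment: invoice total differs from purchase order",
    data_refs=("invoice:INV-001", "po:PO-778"),
    rule_ids=("R-12",),
    approved_by="finance_manager",
    decided_at="2026-06-01T09:30:00+00:00",
)
print(to_log_line(record))
```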


4. Problems That Swapping the LLM Does Not Solve

As model performance improves, prose gets more natural, reasoning improves, and tool use stabilizes. That is true. But the model alone cannot answer the following questions.

  • What business purpose is this agent acting for?
  • Who delegated the authority to call this tool?
  • What evidence must be assembled before this decision may execute automatically?
  • Which risk signals trigger a fail-closed stop?
  • When handing off to a human, which context gets handed over?
  • After a failure, how far does the permitted recovery scope extend?
  • At audit time, can the basis of the decision be reproduced?

These are questions of harness design, not model selection. Use the latest model — without authority boundaries, it is dangerous. Use high-precision RAG — without a priority ordering between stale and fresh information, it wavers. Make tool calls perfectly accurate — without knowing when a tool must not be called, you get an incident.
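Two of the questions above, who delegated the authority and which risk signals trigger a fail-closed stop, can be sketched as a single check. This is an illustration under stated assumptions: `may_execute`, the `delegations` mapping, and the signal names are hypothetical.

```python
# Hypothetical signal names; only signals explicitly listed as benign may pass.
BENIGN_SIGNALS = {"retry_succeeded", "cache_miss"}


def may_execute(tool, delegations, risk_signals):
    """Return (allowed, reason). Anything not explicitly permitted is denied."""
    delegator = delegations.get(tool)
    if delegator is None:
        return False, f"no recorded delegation for {tool}"
    blocking = sorted(set(risk_signals) - BENIGN_SIGNALS)
    if blocking:
        # Fail closed: an unrecognized signal stops execution just like a bad one.
        return False, "stop on risk signals: " + ", ".join(blocking)
    return True, f"delegated by {delegator}"


delegations = {"update_crm": "sales_ops_lead"}
print(may_execute("update_crm", delegations, ["cache_miss"]))
print(may_execute("update_crm", delegations, ["amount_over_limit"]))
print(may_execute("post_payment", delegations, []))
```

The design choice is the allowlist: a signal the harness has never seen stops execution, instead of being ignored because no rule mentions it.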

Taking an enterprise AI agent to production is not about deploying a model. It is about defining the conditions under which the model is allowed to act.


5. The Dynamic Harness in MARIA OS

MARIA OS treats the harness not as a static test, but as a dynamic operational layer. The Dynamic Harness observes the state of the agent runtime, classifies failure signals, lowers autonomy when needed, hands off to human approval, and generates recovery candidates.

What matters is looking beyond outputs alone. Even when the AI's answer looks correct, if retries are increasing behind the scenes, referenced data is fluctuating, out-of-authority tool calls are rising, and user corrections are climbing, the system is destabilizing. The Dynamic Harness watches this trajectory.
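Watching a trajectory, as opposed to a single output, can be approximated by comparing a recent window of runtime signals with the window before it. The sketch below is an assumption about how such a check might look, not a description of how the Dynamic Harness is built; `rising_signals`, the window size, and the tolerance are hypothetical.

```python
from statistics import mean


def rising_signals(history, window=3, tolerance=1.25):
    """Name the signals whose recent average exceeds the earlier average."""
    if len(history) < 2 * window:
        return []  # not enough observations to compare two windows
    earlier, recent = history[-2 * window:-window], history[-window:]
    rising = []
    for name in sorted(recent[0]):
        before = mean(period.get(name, 0) for period in earlier)
        after = mean(period.get(name, 0) for period in recent)
        if after > before * tolerance:
            rising.append(name)
    return rising


# Hypothetical per-period counts; outputs may look fine while these climb.
retries = [1, 1, 1, 2, 3, 4]
violations = [0, 0, 0, 0, 1, 1]
corrections = [4, 4, 4, 4, 4, 4]
history = [
    {"retries": r, "out_of_authority_calls": v, "user_corrections": c}
    for r, v, c in zip(retries, violations, corrections)
]
print(rising_signals(history))
```

In this sample, retries and out-of-authority calls are flagged while user corrections are not, because corrections are high but flat.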

The MARIA OS harness has at least six layers.

  • Goal Harness: watches whether the agent's goal has drifted from the business purpose
  • Evidence Harness: watches whether the evidence required for a decision is in place
  • Authority Harness: verifies execution permissions and prohibitions
  • Quality Harness: observes output quality and rework rate
  • Recovery Harness: selects the recovery path on failure
  • Responsibility Harness: records who owns the decision

This structure is what makes graduated expansion of autonomy possible. Require human approval for everything, and the AI loses its point. Auto-execute everything, and the risk is too high. What matters is varying autonomy according to risk and evidence.
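Varying autonomy according to risk and evidence can be written down as a small decision function. This is a minimal sketch with assumed inputs: the three risk labels, the `Autonomy` levels, and `decide_autonomy` are hypothetical, and a real harness would derive these inputs from its own layers.

```python
from enum import Enum


class Autonomy(Enum):
    BLOCK = "block"
    HUMAN_APPROVAL = "human approval"
    AUTO_WITH_REVIEW = "auto-execute, review afterwards"
    AUTO = "auto-execute"


def decide_autonomy(risk: str, evidence_complete: bool, authority_ok: bool) -> Autonomy:
    """Pick the autonomy level for one action from risk, evidence, and authority."""
    if not authority_ok:
        return Autonomy.BLOCK            # no authority: never executes
    if risk == "high" or not evidence_complete:
        return Autonomy.HUMAN_APPROVAL   # missing evidence is treated like high risk
    if risk == "medium":
        return Autonomy.AUTO_WITH_REVIEW
    if risk == "low":
        return Autonomy.AUTO
    return Autonomy.HUMAN_APPROVAL       # unknown risk label: fall back to a human


print(decide_autonomy("low", evidence_complete=True, authority_ok=True).value)
print(decide_autonomy("low", evidence_complete=False, authority_ok=True).value)
```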


6. With a Harness, the Adoption Sequence Changes

Adoption without a harness builds the agent first and bolts on governance after problems occur. This sequence is dangerous. Because convenience is shown first, the frontline gets used to automation with no boundaries. Add restrictions later, and adoption appears to have regressed.

Adoption with a harness reverses the order. First, the target work is decomposed into episodes. Next, the goal, inputs, outputs, permissions, stop conditions, human-intervention conditions, and audit items are defined. Only then is work delegated to the agent, starting from the low-risk range. Execution logs are reviewed, and autonomy is expanded only in the regions that have proven stable.
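The sequence above can be sketched as two checks: an episode is not delegated until every item is defined, and autonomy is not expanded until the logs show stability. `EpisodeSpec`, `undefined_items`, `may_expand`, and the thresholds are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class EpisodeSpec:
    """Hypothetical definition of one unit of delegated work."""
    goal: str = ""
    inputs: tuple = ()
    outputs: tuple = ()
    permissions: tuple = ()
    stop_conditions: tuple = ()
    human_intervention_conditions: tuple = ()
    audit_items: tuple = ()


def undefined_items(spec: EpisodeSpec) -> list:
    """List the items that are still empty; delegate only when this is empty."""
    return [f.name for f in fields(spec) if not getattr(spec, f.name)]


def may_expand(reworked, min_runs=50, max_rework_rate=0.05):
    """reworked: one boolean per logged run, True when a human corrected it."""
    if len(reworked) < min_runs:
        return False  # too little evidence to call the region stable
    return sum(reworked) / len(reworked) <= max_rework_rate


draft = EpisodeSpec(goal="Match invoices to purchase orders", inputs=("invoice", "po"))
print(undefined_items(draft))
print(may_expand([False] * 48 + [True] * 2))
```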

In this adoption sequence, the AI does not look omnipotent at the start. But it is robust in production — because the organization can explain how much it is allowed to delegate.


7. The Metrics Business Owners Should Watch

When measuring the impact of AI agent adoption, looking only at throughput counts and hours saved is insufficient. Business owners should watch the following metrics.

  • Auto-execution rate: what share of work is processed without human involvement
  • HITL trigger rate: how often, and under which conditions, human intervention is required
  • Rework rate: how much of the AI's output humans have to correct
  • Attempted authority violations: how many out-of-authority actions were stopped before execution
  • Recovery success rate: how often failures were recovered through the intended path
  • Audit reproducibility rate: how often the basis of a decision can be traced afterward

Especially important is the trend of the HITL trigger rate. A high rate early in adoption is natural. The problem is when the rate never declines for the same category of work. That suggests either that the agent is not improving from human corrections, or that work is being thrown to humans while business rules remain ambiguous.

Conversely, if the HITL rate is falling, the rework rate is falling, and audit reproducibility is being maintained, the AI agent has begun converting frontline knowledge into an operational asset.
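The metrics in this section can all be computed from per-run records, provided the harness logs the right flags. The sketch below assumes a hypothetical `Run` record and a `summarize` function; the flag names and sample runs are illustrative, not a MARIA OS schema.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """Hypothetical per-run log entry; the flags are assumptions."""
    auto_executed: bool = False
    hitl_triggered: bool = False
    reworked: bool = False
    violation_blocked: bool = False
    failed: bool = False
    recovered: bool = False
    audit_reproducible: bool = True


def rate(numerator, denominator):
    """Return a ratio, or None when there is nothing to measure yet."""
    return numerator / denominator if denominator else None


def summarize(runs):
    """Compute the section's six metrics for one reporting period."""
    total = len(runs)
    failures = sum(r.failed for r in runs)
    return {
        "auto_execution_rate": rate(sum(r.auto_executed for r in runs), total),
        "hitl_trigger_rate": rate(sum(r.hitl_triggered for r in runs), total),
        "rework_rate": rate(sum(r.reworked for r in runs), total),
        "attempted_authority_violations": sum(r.violation_blocked for r in runs),
        "recovery_success_rate": rate(sum(r.failed and r.recovered for r in runs), failures),
        "audit_reproducibility_rate": rate(sum(r.audit_reproducible for r in runs), total),
    }


runs = [
    Run(auto_executed=True),
    Run(auto_executed=True, reworked=True),
    Run(hitl_triggered=True, violation_blocked=True),
    Run(hitl_triggered=True, failed=True, recovered=True, audit_reproducible=False),
]
print(summarize(runs))
```

Computing the same summary per period and per category of work is what makes the trend visible; a single aggregate number hides whether the HITL trigger rate is falling where it should.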


8. Conclusion

The reason AI agents fail at real work is not that the LLM is slightly less intelligent. In most cases, it is that there is no harness. Organizations try to let AI act without enclosing the goal, without enclosing authority, without inspecting memory, without setting stop conditions, without designing recovery paths, and without leaving audit trails.

Articles asking "what is an AI agent" already exist in abundance. What Bonginkan should be writing is: "why do enterprise AI agents stall at PoC and never reach production?" And the answer lies not in the model, but in the operational structure.

The value of MARIA OS is not only in building AI agents. It is in building an OS in which AI agents can fail, stop, recover, hand off to humans, leave evidence behind — and gradually expand their autonomous range.

What will separate enterprises in the coming wave of AI adoption is not just which model they use. It is how strong a harness they use to connect intelligence to the business.

R&D BENCHMARKS

  • Failure cause: Harness shortage. Treats the primary cause as design gaps in purpose, authority, stopping, recovery, and audit, not model capability alone.
  • Operational metric: HITL trend. Continuously observes HITL trigger rate, rework rate, recovery success rate, and audit reproducibility rate.
  • Adoption order: Gate first. Define responsibility gates before releasing autonomy, instead of building the agent first and restricting it later.

Published by Bonginkan and reviewed by the MARIA OS Editorial Pipeline.

© 2026 Bonginkan / MARIA OS. All rights reserved.