Safety Lives in the Fan-In: Designing Fail-Closed Parallel Multi-Harness Systems

The moment most teams parallelize their safety checks, they give up a little bit of safety. And in most cases, they never notice.

On an agent platform, you run many checks against a single action. Is the identity correct? Is the authority sufficient? Is the tool call within the permitted scope? Is there evidence? Does it stay within budget? Does it violate any surface-specific contract? Run these sequentially and latency becomes a sum. So you want to parallelize.

But in a fail-closed system, naive parallelization quietly weakens safety. It looks faster, but you miss checks that never ran. Results fluctuate with completion order. Shared budgets race against each other. Reading runtime state makes the episode impossible to replay. This is not a performance problem — it is a governance problem.

Safety lives in the fan-in. Parallel execution is merely a latency optimization.

Whatever cannot be executed or trusted falls to the most restrictive side.

These two lines are the entire backbone of this article. Fan-out is a technique for going fast; it is not where safety comes from. What determines safety is how you normalize the results scattered across parallel execution, how you handle missing results, and which way you tilt the next action.

Let me also state the article's hidden principle up front: converge in the decision, distinguish in the evidence. Rejects, timeouts, store failures, low confidence, and budget exhaustion all converge to the restrictive side in the decision. But in the evidence, they remain as separate reasons. Fail-closed is not an excuse to stop thinking; it is a conservative design that preserves debuggability.

1. Why Run Multiple Harnesses in Parallel

A harness, in this context, is an independent unit of inspection over a single agent action. It is not like a unit test that only looks at a function's return value. It reads the action envelope, actor identity, authority boundary, tool permissions, evidence, budget, risk class, and surface-specific contracts, and decides whether the action may proceed, should be restricted, should be stopped, or should be returned to a human.

In an agent governance runtime like MARIA OS, harnesses fall broadly into two kinds. Cross-cutting harnesses operate across surfaces — identity, authority, trust, budget, evidence, audit trace. Partial harnesses inspect surface-specific failure modes for Sales, Audit, Voice, Meeting, Workflow, Auto-Dev, and so on.

This classification maps directly onto the unit of parallel execution. Cross-cutting harnesses read the same action snapshot as each other. Partial harnesses also read the same snapshot, each against its own surface contract. As long as they do not read each other's side effects, they can run in parallel within the same stage.

With sequential execution, latency is a sum.

$$T_{seq}=\sum_i t_i$$

With parallel execution, it ideally approaches the slowest harness.

$$T_{parallel}=\max_i t_i + T_{fan\text{-}in}$$

Looking only at this equation, parallelization is obviously correct. The problem arises the moment you take the fan-in at the end of the right-hand side lightly. Running harnesses fast and safely converting harness results into a decision are two separate design problems.
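As a worked illustration with made-up numbers (not measurements from any real system), suppose three harnesses take 120 ms, 300 ms, and 80 ms:

$$T_{seq}=120+300+80=500\ \text{ms}$$

$$T_{parallel}=\max(120,300,80)+T_{fan\text{-}in}=300\ \text{ms}+T_{fan\text{-}in}$$

The saving is real, but every result on the right-hand side now has to pass through the fan-in term before it becomes a decision.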

2. Central Thesis: Safety Lies in the Fold Over a Normalized Sequence of Envelopes

In parallel harness design, fan-out and fan-in must be considered separately.

Fan-out distributes the action snapshot to multiple harnesses. What you want here is independence, timeouts, resource caps, trace IDs, and episode IDs. Fan-in folds those harness results into a single runtime decision. What you want here is commutativity, monotonicity, fail-closed behavior, and auditability.

However, what goes into the fan-in is not raw domain results. If static harnesses, dynamic harnesses, meeting gates, audit gates, budget checks, and policy checks each return different units, different scores, and different severities, the fold cannot guarantee safety across them. Safety lives in the fold over a normalized sequence of envelopes.

The normalization function is not glue code. It is part of the Trusted Computing Base. If the mapping from domain result to envelope is sloppy, a domain-level fail mutates into an envelope-level warn. That is a silent demotion — safety is broken before the fold even runs.

Therefore the normalization mapping must be monotone: a more restrictive result in the domain maps to a more restrictive decision in the envelope. The shared score scale must also carry meaning. A normalization that never defines what a score of 0.5 means across all harnesses has only aligned the numbers while losing the meaning.

Domain-specific information is not discarded. The fold only needs to look at the common fields: decision, score, findings, provenance, failure kind. observedMetrics, requiredFixes, and the domain payload ride along as an opaque payload. This is exactly so we can converge in the decision and distinguish in the evidence.
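A minimal sketch of what a monotone normalizer could look like. The `AuditSeverity` scale, the `AuditResult` shape, and the mapping table are hypothetical and exist only for illustration; the point is that the mapping is data that can be checked, and that the domain result rides along untouched as the payload.

```typescript
type Restrictiveness = "allow" | "review" | "quarantine" | "block"

const rank: Record<Restrictiveness, number> = {
  allow: 0,
  review: 1,
  quarantine: 2,
  block: 3,
}

// Hypothetical domain scale for an audit-style harness (assumption).
type AuditSeverity = "pass" | "warn" | "fail" | "critical"
const auditOrder: AuditSeverity[] = ["pass", "warn", "fail", "critical"]

type AuditResult = {
  severity: AuditSeverity
  observedMetrics: Record<string, number>
  requiredFixes: string[]
}

type Envelope = {
  harnessId: string
  decision: Restrictiveness
  payload: unknown // opaque to the fold, kept for evidence
}

// The mapping table is the part that belongs to the trusted computing base.
const auditToDecision: Record<AuditSeverity, Restrictiveness> = {
  pass: "allow",
  warn: "review",
  fail: "quarantine",
  critical: "block",
}

function normalizeAudit(harnessId: string, result: AuditResult): Envelope {
  return {
    harnessId,
    decision: auditToDecision[result.severity],
    payload: result,
  }
}

// Walking the domain scale from least to most severe, the mapped rank
// must never decrease. Returns false if any step is a demotion in rank.
function isMonotone(
  order: AuditSeverity[],
  mapping: Record<AuditSeverity, Restrictiveness>,
): boolean {
  let previous = -1
  for (const severity of order) {
    const current = rank[mapping[severity]]
    if (current < previous) return false
    previous = current
  }
  return true
}
```

A check like `isMonotone` is necessary but not sufficient: a table that maps both warn and fail to review is still monotone, so the table itself needs human review as well.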

Parallelism is not what makes the system safe. It is because normalization and aggregation are correct that safety survives going parallel.

To uphold this thesis, the implementation disciplines split into five.

First: the parallel answer must be the same every time. Second: a check that could not run is not a pass. Third: only independent things may be parallelized. Fourth: do not count budget in the wrong place. Fifth: parallelism that cannot be reproduced does not survive an audit.

Below, we examine each one in the order of "the naive implementation," "how it quietly breaks," and "the discipline."

3. Discipline 1: The Parallel Answer Must Be the Same Every Time

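The naive implementation looks like this: put the harnesses into an array, start them all at once, and fold each result into a running verdict as it comes back. (Promise.all on its own returns results in input order; the order dependence enters when results are aggregated in completion callbacks.) It looks correct. In fact, if everything succeeds, every harness runs at the same speed, and the aggregation is order-independent, there is no problem.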

But in parallel execution, completion order is nondeterministic. Network I/O, LLM calls, vector stores, external policy services, cache hits, GC, and scheduler whims mean the return order changes even for the same episode. If the fold is order-dependent, the final verdict fluctuates.

The typical failure mode is an implementation that prioritizes the last result. One harness returns review, another returns allow. Depending on completion order, the final result becomes either review or allow. This is not a race condition — it is a design error in the aggregation function.
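The failure mode can be reproduced without any concurrency at all. The sketch below uses a hypothetical `lastWins` aggregator and feeds it the same two results in the two possible completion orders.

```typescript
type Restrictiveness = "allow" | "review" | "quarantine" | "block"

// Anti-pattern: whichever result arrives last overwrites the verdict.
function lastWins(arrivals: Restrictiveness[]): Restrictiveness {
  let verdict: Restrictiveness = "allow"
  for (const decision of arrivals) {
    verdict = decision
  }
  return verdict
}

// Same two harness results, two possible completion orders.
const orderA: Restrictiveness[] = ["review", "allow"]
const orderB: Restrictiveness[] = ["allow", "review"]

const verdictA = lastWins(orderA) // "allow"
const verdictB = lastWins(orderB) // "review"
```

Nothing about the inputs changed between the two calls, only their order, which is why the fix belongs in the aggregation function rather than in the scheduler.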

The discipline is simple. Normalize each harness's raw result into a common envelope, impose a total order on restrictiveness, and pick the most restrictive one. Fold scores toward the pessimistic side, not the optimistic side. That is, restrictiveness folds with max, while confidence and quality scores fold with min. With this, the result does not change even when completion order does.

type Restrictiveness =
  | "allow"
  | "review"
  | "quarantine"
  | "block"

const rank: Record<Restrictiveness, number> = {
  allow: 0,
  review: 1,
  quarantine: 2,
  block: 3,
}

type HarnessDecision = {
  kind: "harness-envelope"
  harnessId: string
  decision: Restrictiveness
  score: number
  findings: string[]
  provenance: {
    source: "fulfilled" | "rejected" | "timeout" | "untrusted"
    calibrationVersion: string
  }
  payload?: unknown
}

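function foldDecisions(results: HarnessDecision[]): HarnessDecision {
  // No results means no check ran at all. Fail closed instead of letting
  // reduce() throw on an empty array. Choosing "block" and the "unknown"
  // calibration version here is an assumption of this sketch.
  if (results.length === 0) {
    return {
      kind: "harness-envelope",
      harnessId: "fan-in",
      decision: "block",
      score: 0,
      findings: ["no_harness_results"],
      provenance: { source: "untrusted", calibrationVersion: "unknown" },
    }
  }

  return results.reduce((acc, current) => ({
    kind: "harness-envelope",
    harnessId: "fan-in",
    // Restrictiveness folds with max.
    decision:
      rank[current.decision] > rank[acc.decision]
        ? current.decision
        : acc.decision,
    // Scores fold with min (pessimistic side).
    score: Math.min(acc.score, current.score),
    findings: [...acc.findings, ...current.findings],
    provenance: {
      source: "fulfilled",
      calibrationVersion: acc.provenance.calibrationVersion,
    },
  }))
}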

This fold is commutative where it matters: decision and score do not depend on input order. The array order of findings does (as does the calibrationVersion, which this sketch copies from the first envelope), so in the audit log either sort by harnessId or attach harnessId and startedAt to each finding. Do not mix the order humans read in with the safety machines decide on.
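Order independence is cheap to check mechanically. The following is a sketch of such a check, using a reduced `Verdict` type (decision and score only) and hypothetical helper names; it folds every permutation of a small input and compares the results.

```typescript
type Restrictiveness = "allow" | "review" | "quarantine" | "block"

const rank: Record<Restrictiveness, number> = {
  allow: 0,
  review: 1,
  quarantine: 2,
  block: 3,
}

type Verdict = { decision: Restrictiveness; score: number }

// Reduced form of the fold: max over restrictiveness, min over score.
function foldVerdicts(results: Verdict[]): Verdict {
  // Fail closed when nothing ran (assumption of this sketch).
  if (results.length === 0) return { decision: "block", score: 0 }
  return results.reduce((acc, cur) => ({
    decision: rank[cur.decision] > rank[acc.decision] ? cur.decision : acc.decision,
    score: Math.min(acc.score, cur.score),
  }))
}

// All orderings of a small input, standing in for every completion order.
function permutations<T>(items: T[]): T[][] {
  if (items.length <= 1) return [items]
  return items.flatMap((item, i) =>
    permutations([...items.slice(0, i), ...items.slice(i + 1)]).map(
      (rest) => [item, ...rest],
    ),
  )
}

// True when every ordering folds to the same decision and score.
function isOrderIndependent(results: Verdict[]): boolean {
  const expected = foldVerdicts(results)
  return permutations(results).every((ordering) => {
    const actual = foldVerdicts(ordering)
    return (
      actual.decision === expected.decision && actual.score === expected.score
    )
  })
}
```

Because the number of permutations grows factorially, a check like this only suits small fixtures in a test suite, not the runtime path.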

What matters is to never try to manufacture safety through parallelization. Parallel is just fast. Safety is confined to the normalized sequence of envelopes and a fold that is indifferent to completion order.

4. Discipline 2: A Check That Could Not Run Is Not a Pass

The next naive implementation is Promise.all. In JavaScript it looks like the natural choice. But in a fail-closed harness runtime, it is dangerous.

Promise.all rejects the whole thing the moment a single Promise rejects. That throws away the block or quarantine the other harnesses might have returned. Conversely, an implementation that swallows errors in a catch treats the rejected harness as if it never existed. Both are dangerous.

There are many reasons a harness can fail to run. The policy store goes down. The vector store times out. Schema validation throws. An external risk service returns 429. The LLM evaluator returns unparseable JSON. None of these is a "pass." The very fact that the check could not be performed is grounds for falling to the restrictive side.

Here, rejects and timeouts have the same outcome but a different mechanism. A reject means the Promise settled and returned an error. A timeout means it has not settled yet. So timeouts cannot be handled by allSettled alone. Race each harness against a per-harness timeout with Promise.race, and convert the timeout side into a restrictive envelope.

The timeout value is a calibration constant. Set it per harness so that one slow harness does not eat the whole budget. Furthermore, give it a version and provenance, just like a policy threshold. You must be able to explain afterwards why this harness gets 800ms and that one gets 3 seconds.
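One way to keep timeouts explainable is to treat them as versioned records rather than inline constants. This is a sketch only: the `TimeoutCalibration` shape, its field names, and the validation rules are assumptions, and the 800 ms and 3000 ms values simply echo the example figures in the paragraph above.

```typescript
// Hypothetical shape; every field name here is an assumption.
type TimeoutCalibration = {
  harnessId: string
  timeoutMs: number
  version: string
  rationale: string
}

const calibrations: TimeoutCalibration[] = [
  {
    harnessId: "identity",
    timeoutMs: 800,
    version: "cal-1",
    rationale: "illustrative value for a fast local check",
  },
  {
    harnessId: "evidence",
    timeoutMs: 3000,
    version: "cal-1",
    rationale: "illustrative value for a check behind an external store",
  },
]

// Returns a list of problems so a bad table can be rejected before deploy.
function validateCalibrations(table: TimeoutCalibration[]): string[] {
  const errors: string[] = []
  const seen = new Set<string>()
  for (const entry of table) {
    if (seen.has(entry.harnessId)) errors.push(`duplicate:${entry.harnessId}`)
    seen.add(entry.harnessId)
    if (!Number.isFinite(entry.timeoutMs) || entry.timeoutMs <= 0) {
      errors.push(`bad_timeout:${entry.harnessId}`)
    }
    if (entry.version.trim() === "") {
      errors.push(`missing_version:${entry.harnessId}`)
    }
    if (entry.rationale.trim() === "") {
      errors.push(`missing_rationale:${entry.harnessId}`)
    }
  }
  return errors
}
```

Requiring a non-empty rationale is a small cost at change time and is what makes the "why 800 ms?" question answerable later.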

The discipline is to use allSettled, and to normalize rejected, timed-out, and untrusted output to a restrictive result, never to a pass. This is where the second line of the backbone first becomes an implementation.

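function withTimeout<T>(
  task: Promise<T>,
  timeoutMs: number,
): Promise<T | "timeout"> {
  let timer: ReturnType<typeof setTimeout> | undefined
  const timeout = new Promise<"timeout">((resolve) => {
    timer = setTimeout(() => resolve("timeout"), timeoutMs)
  })
  // Clear the timer once the race settles so a fast harness does not
  // leave a pending timer behind for every evaluation.
  return Promise.race([task, timeout]).finally(() => clearTimeout(timer))
}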

async function evaluateAll(
  snapshot: ActionSnapshot,
  harnesses: Harness[],
): Promise<HarnessDecision> {
  const settled = await Promise.allSettled(
    harnesses.map((harness) =>
      withTimeout(
        harness.evaluate(snapshot),
        harness.calibration.timeoutMs,
      ),
    ),
  )

  const normalized = settled.map((result, index): HarnessDecision => {
    const harnessId = harnesses[index].id

    if (result.status === "fulfilled" && result.value !== "timeout" && isTrusted(result.value)) {
      return result.value
    }

    return {
      kind: "harness-envelope",
      harnessId,
      decision: "quarantine",
      score: 0,
      findings: ["harness_unavailable_or_untrusted"],
      provenance: {
        source:
          result.status === "fulfilled" && result.value === "timeout"
            ? "timeout"
            : result.status === "rejected"
              ? "rejected"
              : "untrusted",
        calibrationVersion: harnesses[index].calibration.version,
      },
    }
  })

  return foldDecisions(normalized)
}

Here we choose quarantine rather than block. That is a design choice. Right before an external action executes, block is fine. For side-effect-free work like draft generation, sending it to quarantine for human review or re-evaluation is easier to operate. The one outcome that must never happen is allow.

However, a harness that timed out may still be running in the background. There is work that completes late, after the runtime has already decided to "close." This design is safe only because a harness is a pure function that reads the snapshot and returns an envelope, with no write side effects. If a harness wrote to a store, a side effect landing late after the timeout would be an incident. The safety of timeouts depends on the snapshot and purity discussed later.

In the decision, both timeouts and rejections converge to the restrictive side. But in the evidence, distinguish them. Keep timedOutHarnessIds and failedHarnessIds separately. Timeouts are about latency, capacity, and external dependencies; rejects are about bugs, schema errors, and exceptions. Converging the outcome must not crush the debug information along with it.
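A sketch of how that separation might be kept, assuming envelopes carry the `provenance.source` field from the earlier type. The `collectFailureEvidence` helper and the `untrustedHarnessIds` field are illustrative additions, not names from any existing runtime.

```typescript
type Source = "fulfilled" | "rejected" | "timeout" | "untrusted"

type Envelope = {
  harnessId: string
  provenance: { source: Source }
}

type FailureEvidence = {
  timedOutHarnessIds: string[]
  failedHarnessIds: string[]
  untrustedHarnessIds: string[] // assumption: tracked separately as well
}

// Groups harness ids by failure mechanism. Ids are sorted so the evidence
// record does not depend on completion order.
function collectFailureEvidence(envelopes: Envelope[]): FailureEvidence {
  const idsWith = (source: Source): string[] =>
    envelopes
      .filter((envelope) => envelope.provenance.source === source)
      .map((envelope) => envelope.harnessId)
      .sort()

  return {
    timedOutHarnessIds: idsWith("timeout"),
    failedHarnessIds: idsWith("rejected"),
    untrustedHarnessIds: idsWith("untrusted"),
  }
}
```

The decision still comes from the fold; this record only feeds the audit trail and the on-call dashboard.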

This discipline appears to worsen the UX during outages. But in agent governance, a UX that proceeds when inspection is impossible is far worse. A safe system does not quietly turn optimistic when it breaks.

5. Discipline 3: Only Independent Things May Be Parallelized

The third failure mode is hidden dependencies. You fan out all the cross-cutting harnesses and partial harnesses together. At first, no problem. Then one day, the authority harness starts reading the surface harness's output. Or the budget harness reads the tool harness's estimated cost. Or the trust harness reads the evidence harness's confidence.

At that moment, an implicit dependency edge is born. Run things with dependencies in parallel within the same stage and they sometimes read it and sometimes don't. They read stale values. They read empty values. They read the previous episode's values. In the worst case, a cycle forms. The harness runtime is supposed to be the audit infrastructure — and the audit infrastructure itself becomes nondeterministic.

The discipline is to declare dependencies as a DAG. Dependencies are something you declare, not something you discover. The first step is to derive the execution stages from the dependency declarations: parallel within a stage, sequential across stages. Cycle detection comes for free at this point.

The second step is to constrain what a harness can read to its declared dependencies. Do not pass priorResults as a Map of all results. Narrow it to a Record containing only the declared dependencies. With this, the dependency declaration is no longer an ordering hint; it becomes an enforceable contract.

flowchart LR
  S[Action Snapshot] --> A[Identity Harness]
  S --> B[Authority Harness]
  S --> C[Evidence Harness]
  A --> D[Surface Contract Harness]
  B --> D
  C --> E[Budget Harness]
  D --> F[Fan-in Fold]
  E --> F

type HarnessNode = {
  id: string
  dependsOn: string[]
  evaluate: (input: {
    snapshot: ActionSnapshot
    priorResults: Record<string, HarnessDecision>
  }) => Promise<HarnessDecision>
}

async function runHarnessDag(
  snapshot: ActionSnapshot,
  stages: HarnessNode[][],
): Promise<HarnessDecision> {
  const decisions = new Map<string, HarnessDecision>()

  for (const stage of stages) {
    const settled = await Promise.allSettled(
      stage.map((node) =>
        node.evaluate({
          snapshot,
          priorResults: pick(decisions, node.dependsOn),
        }),
      ),
    )

    settled.forEach((result, index) => {
      const node = stage[index]
      decisions.set(node.id, normalizeSettled(node.id, result))
    })
  }

  return foldDecisions([...decisions.values()])
}

This implementation assumes the stages themselves have been topologically sorted in advance. A real runtime validates the DAG at startup or deploy time. Does anything reference a dependency that does not exist? Are there cycles? Does anything within the same stage reference a sibling? Has a high-risk harness added an unapproved dependency?
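The stage derivation itself can be sketched as a layered topological sort. `planStages` and `NodeDecl` are hypothetical names; the sketch covers unknown dependencies, duplicate ids, and cycles, and leaves the approval check for high-risk harnesses out of scope.

```typescript
type NodeDecl = { id: string; dependsOn: string[] }

// Groups harness ids into stages: every dependency of a node sits in an
// earlier stage, so nodes within one stage never depend on each other.
function planStages(nodes: NodeDecl[]): string[][] {
  const byId = new Map<string, NodeDecl>()
  for (const node of nodes) {
    if (byId.has(node.id)) throw new Error(`duplicate harness id: ${node.id}`)
    byId.set(node.id, node)
  }
  for (const node of nodes) {
    for (const dep of node.dependsOn) {
      if (!byId.has(dep)) {
        throw new Error(`${node.id} depends on unknown harness ${dep}`)
      }
    }
  }

  const placed = new Set<string>()
  const stages: string[][] = []
  while (placed.size < nodes.length) {
    // Ready means: not placed yet, and all dependencies already placed.
    const ready = nodes
      .filter((n) => !placed.has(n.id) && n.dependsOn.every((d) => placed.has(d)))
      .map((n) => n.id)
      .sort()
    // No progress with nodes remaining means the leftovers form a cycle.
    if (ready.length === 0) throw new Error("dependency cycle detected")
    stages.push(ready)
    for (const id of ready) placed.add(id)
  }
  return stages
}
```

Because `ready` is computed before any of its members are marked as placed, two nodes in the same stage can never reference each other, which is the sibling check from the list above.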

The pick(decisions, node.dependsOn) is crucial. A harness cannot read results it has not declared. The declaration graph becomes the true dataflow graph. With this, dependencies are not inferred from implementation side effects — they become a reviewable contract.
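The article does not show `pick`, so here is one possible shape, under the assumption that every finished node always has an entry in the map (a failed node gets a restrictive envelope rather than no entry). The reduced `Decision` type is a stand-in for the full envelope.

```typescript
type Decision = { harnessId: string; decision: string }

// Returns a frozen view containing only the declared dependencies.
function pick(
  decisions: ReadonlyMap<string, Decision>,
  dependsOn: readonly string[],
): Readonly<Record<string, Decision>> {
  const visible: Record<string, Decision> = {}
  for (const id of dependsOn) {
    const decision = decisions.get(id)
    // A declared dependency without a result means the stage plan is wrong.
    // Surfacing that loudly is safer than handing over a partial view.
    if (decision === undefined) {
      throw new Error(`missing prior result for ${id}`)
    }
    visible[id] = decision
  }
  return Object.freeze(visible)
}
```

A caller would need to treat the thrown error as a fail-closed condition for the whole episode, since it signals a broken plan rather than a single failed harness.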

The unit of parallelization is determined by independence, not by classification. Classification is a good clue, but in the end the only thing you may trust is the DAG. Make dependencies data, not comments. Otherwise the safety is visible only in code review.

6. Discipline 4: Do Not Count in the Wrong Place

Budget races are unglamorous but dangerous. In the naive implementation, each harness reads the shared budget, judges whether its own action fits within budget, and updates the budget as consumed if needed. In isolation, this is correct. But do this inside a fan-out and the read-decide-write sequences race.

For example, the remaining budget is 100, and three harnesses each estimate a cost of 60. All of them read the remaining 100 at the same time. All of them conclude "60 fits." Ultimately, a cost of 180 gets approved. This looks like a DB transaction story, but in a harness runtime it is broader. Tokens, tool calls, external APIs, customer-visible actions, approval capacity, and the human review queue are all budgets.

The discipline is to never place the budget verdict in the fan-out. Harnesses inside the fan-out only return cost estimates, risk estimates, and resource requests. After fan-in, exactly once, hand the normalized total to the budget policy.

In this design, individual harnesses do not consume budget. Only the budget harness looks at the aggregated request and decides. If needed, it issues a reservation token. An action without that token cannot execute.

Budget concerns not only safety but audit. Which harness estimated how much, how it was totaled at the fan-in, and which policy approved or denied it. Without that sequence on record, you cannot explain a cost anomaly afterwards.

The implementation point is to never make budget a boolean. Do not return allowed: true; return requestedCost, reservedCost, budgetScope, reservationId, and expiresAt. An approval is not a state; it is evidence with an expiry.
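A sketch of a single post-fan-in budget decision, replaying the 100-versus-three-times-60 example from earlier in this section. `decideBudget`, the `kind` discriminant, the `reason` strings, and the caller-supplied `reservationId` are all assumptions made for illustration; a real ledger would also need an atomic reservation across concurrent episodes, which this sketch does not attempt.

```typescript
type ResourceRequest = { harnessId: string; estimatedCost: number }

type BudgetDecision =
  | { kind: "denied"; requestedCost: number; budgetScope: string; reason: string }
  | {
      kind: "reserved"
      requestedCost: number
      reservedCost: number
      budgetScope: string
      reservationId: string
      expiresAt: string
    }

// Called exactly once, after fan-in, with every harness's estimate.
function decideBudget(
  requests: ResourceRequest[],
  remaining: number,
  budgetScope: string,
  now: Date,
  ttlMs: number,
  reservationId: string,
): BudgetDecision {
  const invalid = requests.some(
    (r) => !Number.isFinite(r.estimatedCost) || r.estimatedCost < 0,
  )
  const requestedCost = requests.reduce((sum, r) => sum + r.estimatedCost, 0)

  // Anything unreadable or over budget is denied, with a distinct reason.
  const reason = invalid
    ? "invalid_estimate"
    : !Number.isFinite(remaining)
      ? "ledger_unreadable"
      : requestedCost > remaining
        ? "over_budget"
        : undefined
  if (reason !== undefined) {
    return { kind: "denied", requestedCost, budgetScope, reason }
  }

  return {
    kind: "reserved",
    requestedCost,
    reservedCost: requestedCost,
    budgetScope,
    reservationId,
    expiresAt: new Date(now.getTime() + ttlMs).toISOString(),
  }
}
```

With a remaining budget of 100 and three estimates of 60, the total of 180 is seen in one place and denied, instead of each 60 passing on its own.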

7. Discipline 5: Parallelism That Cannot Be Reproduced Does Not Survive an Audit

Finally, snapshots. In the naive implementation, each harness reads the stores it needs directly. User state, policy state, memory, tool registry, budget ledger, evidence store, and risk profile, each at its own moment. Again, in isolation this looks natural.

But in parallel execution, the world each harness read can differ. Harness A reads policy version 12 while harness B reads policy version 13. Harness C reads before the memory write, harness D reads after. One harness fails before evidence was added, another passes after it. Replaying this episode does not reproduce the same verdict.

Parallel harnesses that survive an audit should approach pure functions that read a snapshot frozen at the episode. Every harness receives the same ActionSnapshot, containing policyVersion, toolRegistryVersion, budgetLedgerVersion, evidenceRefs, memoryRefs, inputHash, and createdAt.

type ActionSnapshot = {
  episodeId: string
  actionId: string
  actorId: string
  mariaCoordinate: string
  inputHash: string
  policyVersion: string
  toolRegistryVersion: string
  budgetLedgerVersion: string
  evidenceRefs: string[]
  memoryRefs: string[]
  createdAt: string
}

The snapshot does not need to be a giant copy. In most cases, versioned refs suffice. What matters is that harnesses do not read whatever store they like at whatever moment they like during execution — they read a world fixed by the episode.

With this design, harnesses become re-runnable. You can explain why something became review. When you add a new harness, you can backtest it against past episodes. You can compare the delta between human review judgments and machine decisions.
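Replay and backtesting can be sketched as the same operation: re-run harnesses against a stored snapshot and diff against what was recorded. The types below are simplified stand-ins (synchronous harnesses, a three-field snapshot), and `replay` is a hypothetical helper name.

```typescript
type Decision = "allow" | "review" | "quarantine" | "block"

type Snapshot = { episodeId: string; policyVersion: string; inputHash: string }

type RecordedEpisode = {
  snapshot: Snapshot
  decisions: Partial<Record<string, Decision>>
}

// Simplification: a harness here is a synchronous pure function.
type PureHarness = { id: string; evaluate: (snapshot: Snapshot) => Decision }

type ReplayDiff = {
  harnessId: string
  recorded: Decision | "missing"
  replayed: Decision
}

// Re-runs each harness on the frozen snapshot and reports disagreements.
// A harness with no recorded decision (newly added) shows up as "missing",
// which is how a backtest against past episodes would surface it.
function replay(episode: RecordedEpisode, harnesses: PureHarness[]): ReplayDiff[] {
  const diffs: ReplayDiff[] = []
  for (const harness of harnesses) {
    const replayed = harness.evaluate(episode.snapshot)
    const recorded = episode.decisions[harness.id] ?? "missing"
    if (recorded !== replayed) {
      diffs.push({ harnessId: harness.id, recorded, replayed })
    }
  }
  return diffs
}
```

An empty diff for existing harnesses is the replay guarantee; a non-empty diff for a new harness is the backtest result.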

Reproducibility is not only for audit. It is also for repair. Failures you cannot reproduce are hard to fix. Failures you cannot fix are hard to convert into learning.

8. The Five Fold Down to Two

We have examined the five disciplines separately. In practice, they all fold down to the same two principles.

The first is an order-independent fold over a normalized sequence of envelopes. It does not depend on completion order. It does not depend on execution order within a stage. As the number of harnesses grows, restrictiveness folds toward the most restrictive value and scores toward the lowest. Budget too is judged exactly once after the fan-in. The DAG exists to protect the fold's preconditions by making explicit which results are readable when. The snapshot exists to guarantee that the values entering the fold come from the same world.

The second is tilting whatever cannot be executed or trusted toward the restrictive side. A rejected harness is not a pass. A timed-out harness is not a pass. Unparseable LLM output is not a pass. A budget check whose store could not be read is not a pass. A snapshot with mixed versions is not a pass. All of these fall to review, quarantine, or block — never allow.

In other words, the center of parallel multi-harness design is not how to run a large number of harnesses. It is how to convert the incomplete, asynchronous, failure-prone domain results coming back from a large number of harnesses into normalized envelopes, and how to fold them into a single conservative decision.

Once normalization is in place, extensibility changes too. Adding a new domain harness does not mean rewriting the existing fold. It means writing a monotone normalizer, satisfying the semantics of the shared score scale, and producing an envelope that preserves the domain payload. The fold stays singular while the system grows.

This is the payoff. The five disciplines are not separate best practices. They are all implementations of the same posture.

Safety lives in the fan-in. Parallel execution is merely a latency optimization. Whatever cannot be executed or trusted falls to the most restrictive side.

9. The Oversight Infrastructure Embodies Its Own Philosophy

What is interesting about this design is that the oversight infrastructure itself runs on the very philosophy it demands of what it oversees.

We tell agents to respect their authority boundaries. So what does the harness runtime itself do when it cannot read a store? We tell agents not to proceed without evidence. So what does a harness do when its own confidence is low? We tell agents not to exceed their budget. So what happens when the budget ledger races? We tell agents to return to a human when uncertain. So what happens when a harness cannot run?

The answer must be the same in every case. If it cannot be trusted, return it to a human. Tilt to the restrictive side. Quarantine. Block. At minimum, do not allow.

But do not crush the debug information into the same box. Converge in the decision, distinguish in the evidence. Store failure, harness reject, timeout, low confidence, and budget exhaustion all converge to the restrictive side in the runtime decision. But in the evidence, they remain separate. timedOutHarnessIds, failedHarnessIds, loopBudgetReliabilityReason, and lowConfidenceFindings mean completely different things operationally.

A system without this self-consistency collapses over time. Even if it demands governance of agents on the surface, if the governance infrastructure itself fails optimistically, the most important boundary leaks right there.

Suppose, for example, that the Learning Store is temporarily unreadable. A naive system might decide to "proceed without learning." But the same failure may have happened before. A human may have rejected the same patch. It may have been marked do-not-repeat on a high-risk surface. A store being unreadable does not mean there is no information. It means the information needed for a safety judgment is unreachable.

So tilt to the restrictive side. Send low-risk cases to review, cases right before execution to quarantine, high-risk cases to block. This looks overly conservative. But in governable autonomy, stopping or returning to a human is not failure. It is normal operation in service of holding the boundary.
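That tiering can be written down as a small total function. The `RiskTier` labels and the `fallbackWhenUnreadable` name are assumptions for this sketch; the mapping itself follows the three cases in the paragraph above.

```typescript
type Restrictiveness = "allow" | "review" | "quarantine" | "block"

// Hypothetical tier labels (assumption).
type RiskTier = "low" | "pre-execution" | "high"

// The return type excludes "allow", so an optimistic fallback cannot be
// introduced without changing the signature.
function fallbackWhenUnreadable(
  tier: RiskTier,
): Exclude<Restrictiveness, "allow"> {
  switch (tier) {
    case "low":
      return "review"
    case "pre-execution":
      return "quarantine"
    case "high":
      return "block"
    default:
      // An unrecognized tier at runtime falls to the most restrictive side.
      return "block"
  }
}
```

Encoding "never allow" in the type means the compiler, not a reviewer, catches an optimistic fallback.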

Only here does it become visible that this design is not merely an implementation technique for a harness runtime, but part of governable autonomy. A system that governs agents must itself be governed by the same disciplines. Otherwise, the autonomy is nothing more than a performance of control viewed from the outside.

10. How to Design "Faster Without Breaking"

If you are building an agent platform, parallelization is unavoidable. Evaluating identity, authority, trust, evidence, budget, surface contracts, quality, policy, memory, and audit sequentially every single time makes the runtime far too slow. Fanning out for latency is correct.

But parallel is not the design of safety. Parallel is just a latency optimization. What decides safety is the fan-in: a fold over a normalized sequence of envelopes that does not depend on completion order; a posture that never treats rejects or timeouts as passes; declaring dependencies as a DAG and constraining the readable prior results; counting budget exactly once after the fan-in; making everything replayable through episode snapshots.

Keep this design and you can grow the number of harnesses. You can grow the surfaces. You can grow the agents. A new domain harness does not add a branch to the fold — it produces an envelope and composes into the existing fold. Even with parallelized inspection, the final decision folds conservatively. Failed checks do not disappear; they compose toward the restrictive side. At audit time, the same decision can be reconstructed from the same snapshot.

The generalized lesson is short.

To everyone running agents in parallel: parallel is a latency optimization, not the design of safety. Safety is always decided by the aggregation, and by which way you tilt when something could not run.