EngineeringFebruary 15, 2026|32 min readpublished

Sentence-Level Streaming VUI Architecture: From Cognitive Theory to Production Implementation in MARIA OS

How sentence-boundary detection, sequential TTS chaining, and rolling conversation summaries create a natural-feeling voice interface with long-session stability

ARIA-TECH-01

Tech Lead Reviewer

G1.U1.P9.Z1.A2
Reviewed by:ARIA-RD-01ARIA-QA-01

Abstract

Voice user interfaces occupy a unique position in the design space of human-computer interaction: they must operate in real-time against the relentless clock of human conversational expectation, where delays measured in hundreds of milliseconds shift perception from 'responsive assistant' to 'broken system.' The central engineering challenge is not generating speech — modern TTS systems produce human-quality audio — but orchestrating the pipeline from language model token generation through sentence segmentation, TTS synthesis, audio playback, and barge-in handling, all while maintaining the illusion of a single continuous conversation.

This paper presents the sentence-level streaming VUI architecture implemented in MARIA OS, a production system that processes voice interactions through a pipeline comprising Web Speech API recognition, Gemini 2.0 Flash language generation, real-time sentence boundary detection, ElevenLabs TTS synthesis, and sequential audio playback. We formalize the cognitive justification for sentence-level granularity (as opposed to word-level or full-response buffering), derive the pipeline latency model, and present the full implementation of the useGeminiLive hook — 693 lines of React state management that coordinates 12 concurrent asynchronous subsystems including microphone input, AudioContext metering, SpeechRecognition event handling, HTTP streaming, TTS promise chaining, abort control, speech debouncing, rolling conversation summarization, heartbeat monitoring, and iOS audio unlock sequences.

Beyond the core voice pipeline, we describe the Action Router — a dispatch layer that routes Gemini function calls across 29 tools organized into 4 teams (Secretary, Sales, Document, Dev) with confidence-weighted team inference from recent call history. The router produces both function responses (fed back to Gemini for conversational continuity) and client instructions (navigate, notify, update_panel, open_modal) that drive UI state changes from voice commands.

We report production metrics from 2,400+ voice sessions: sub-800ms first-sentence latency, zero sentence-ordering violations, 45+ minute infinite sessions via rolling summaries, and graceful degradation across 9 detected in-app browser environments. The architecture demonstrates that the sentence is the correct unit of streaming for voice interfaces — matching the cognitive chunk size of human speech perception while providing sufficient context for high-quality TTS prosody.


1. The Latency-Naturalness Tradeoff in Voice Interfaces

Every voice interface must choose where to place itself on a spectrum between two extremes. At one end, token-level streaming sends each generated token to TTS immediately, minimizing time-to-first-audio but producing choppy, unnatural speech with poor prosody — the TTS engine cannot plan intonation without seeing the full clause. At the other end, full-response buffering waits for the complete LLM response before synthesizing speech, producing natural audio but imposing multi-second latencies that violate conversational turn-taking norms.

1.1 Conversational Timing Expectations

Psycholinguistic research on turn-taking establishes that humans expect response onset within 200-700ms of turn completion. Delays beyond 1000ms trigger negative social inference — the listener perceives the speaker as uncertain, disengaged, or malfunctioning. In human-computer interaction, this threshold is even more punishing: users attribute delays to system failure rather than cognitive processing.

Definition
The perceived response latency L_p of a voice interface is the interval between the user's speech offset (final word boundary) and the onset of the first audible response audio. Formally:
L_p = t_{debounce} + t_{LLM\_first\_sentence} + t_{TTS\_synthesis} + t_{audio\_decode} $$

where t_debounce is the speech silence buffer (ensuring the user has finished speaking), t_LLM_first_sentence is the time for the language model to generate tokens through the first sentence boundary, t_TTS_synthesis is the ElevenLabs API round-trip for the first sentence, and t_audio_decode is the browser audio decode and playback start time.

1.2 The Prosody-Latency Frontier

TTS quality depends critically on the amount of lookahead context available to the synthesis model. Single-word inputs produce monotone, robotic output. Full-paragraph inputs enable natural prosody planning — emphasis, rhythm, pitch contour — but delay the first audio by the full generation time. The sentence represents the minimal unit at which TTS engines achieve natural prosody: it contains sufficient syntactic structure for the model to plan intonation contours, stress patterns, and phrasal boundaries.

Theorem
(Sentence Optimality) For a TTS model with prosody quality function Q(n) (where n is input token count) and a latency function L(n) (time to first audio), the sentence boundary n = argmax Q(n)/L(n) maximizes the quality-per-latency ratio. Under the empirical observation that Q(n) exhibits diminishing returns beyond clause boundaries while L(n) grows linearly, n corresponds to the mean sentence length of the target language (15-25 tokens for English, 20-40 characters for Japanese).

This result justifies the design decision at the core of MARIA OS's voice pipeline: detect sentence boundaries in the streaming token output and dispatch each complete sentence to TTS independently.


2. Sentence-Level Streaming: Cognitive Justification and Architecture

2.1 The Sentence as Cognitive Chunk

In cognitive psychology, the sentence is recognized as the primary unit of language processing in working memory. Miller's chunking theory (1956) establishes that working memory operates on structured chunks rather than individual elements. In speech perception, the sentence (or clause) serves as the fundamental chunk: listeners process incoming speech incrementally but commit to interpretation at sentence boundaries. This creates a natural 'processing checkpoint' where the listener integrates the current sentence into discourse context before attending to the next.

A voice interface that streams at sentence granularity aligns with this cognitive architecture: each sentence arrives as a complete semantic unit that the listener can process immediately, while the next sentence is being generated and synthesized in parallel. The listener never receives a partial thought (as with word-level streaming) or an overwhelming monologue (as with full-response buffering).

2.2 Sentence Boundary Detection

The sentence boundary detector must operate on a byte stream of UTF-8 tokens arriving incrementally from the LLM. It must handle both English and Japanese punctuation, as MARIA OS is bilingual. The detection regex is:

// Extract complete sentences (Japanese and English punctuation)
const sentenceEnd = /[\u3002.!?\uff01\uff1f\n]/
let match = sentenceEnd.exec(buffer)
while (match) {
  const idx = match.index + 1
  const sentence = buffer.slice(0, idx).trim()
  buffer = buffer.slice(idx)

  if (sentence) {
    fullText += (fullText ? " " : "") + sentence
    // Fire TTS immediately for this sentence
    enqueueTTS(sentence, signal)
  }

  match = sentenceEnd.exec(buffer)
}

The detector recognizes six boundary characters: (Japanese period), . (English period), ! and ? (English exclamation/question), and (full-width Japanese exclamation/question), and \n (newline, treating line breaks as sentence boundaries). This covers the vast majority of sentence-final punctuation in both languages without requiring a full NLP parser.

2.3 Sequential TTS Promise Chain

The critical invariant of the TTS subsystem is sentence ordering: sentences must play in the exact order they were generated, even though TTS synthesis is asynchronous and individual sentence durations vary. MARIA OS enforces this through a sequential promise chain — a pattern where each TTS task is appended to a shared promise chain, guaranteeing FIFO execution without explicit queue data structures.

// Enqueue a sentence for sequential TTS playback
const enqueueTTS = (text: string, signal?: AbortSignal) => {
  ttsChainRef.current = ttsChainRef.current.then(() => {
    if (signal?.aborted) return
    return playElevenLabsTTS(text, signal)
  })
}

This three-line function is the architectural keystone of the entire voice pipeline. The ttsChainRef holds a single Promise<void> that represents the tail of the playback queue. Each call to enqueueTTS extends the chain by one link: the new sentence will begin playing only after the previous sentence's promise resolves (i.e., after its audio finishes). The signal parameter threads an AbortController through the entire chain, enabling instantaneous cancellation on barge-in.

Definition
The TTS promise chain is a data structure C = P_0 .then(f_1) .then(f_2) ... .then(f_n) where P_0 = Promise.resolve() and each f_i is a function that synthesizes and plays sentence s_i. The chain guarantees: (1) s_i completes before s_{i+1} begins, (2) if any f_i is aborted, all subsequent f_{i+1}...f_n are skipped, and (3) no explicit dequeue operation is needed — promise resolution is the dequeue.

3. Full-Duplex Conversation Engine Design

3.1 The Barge-In Problem

Full-duplex voice interaction requires handling the case where the user speaks while the system is still playing audio. In telephony, this is called barge-in — the user interrupts the system's output. Barge-in creates two simultaneous requirements: (1) immediately stop the current TTS playback and cancel any pending sentences, and (2) prevent the system's own audio output from being captured by the microphone and interpreted as user speech (echo feedback).

MARIA OS handles barge-in through a three-layer strategy: recognition pausing, abort-controlled playback, and post-TTS recovery delay.

// Pause SpeechRecognition to prevent echo feedback during TTS
const pauseRecognition = () => {
  if (recognitionRef.current) {
    try { recognitionRef.current.stop() } catch { /* already stopped */ }
  }
}

// Resume SpeechRecognition after TTS finishes
const resumeRecognition = () => {
  if (statusRef.current === "connected" && recognitionRef.current) {
    try { recognitionRef.current.start() } catch { /* already running */ }
  }
}

The pause/resume pattern is deceptively simple but critical for preventing feedback loops. When processUserSpeech is called, recognition is paused immediately before any TTS begins. After the entire TTS chain completes, a 500ms delay allows residual echo to dissipate before recognition restarts. Without this delay, the final syllable of TTS audio would be captured by the microphone and re-processed as user input, creating an infinite echo loop.

3.2 Speech Debouncing

SpeechRecognition emits result events at the granularity of recognized phrases, not complete utterances. A user saying 'Schedule a meeting with Tanaka-san tomorrow at three PM' might produce three separate final results: 'Schedule a meeting', 'with Tanaka-san', 'tomorrow at three PM.' Without debouncing, each fragment would trigger an independent Gemini request, producing three separate responses.

MARIA OS accumulates speech fragments in a buffer and waits for 1.2 seconds of silence before dispatching the complete utterance:

const SPEECH_DEBOUNCE_MS = 1200 // wait 1.2s of silence after last final result

recognition.onresult = (event: any) => {
  const last = event.results[event.results.length - 1]
  if (!last.isFinal) return
  const text = last[0].transcript.trim()
  if (!text) return

  // Accumulate text and debounce
  speechBufferRef.current += (speechBufferRef.current ? " " : "") + text
  if (speechDebounceRef.current) clearTimeout(speechDebounceRef.current)
  speechDebounceRef.current = setTimeout(() => {
    const accumulated = speechBufferRef.current.trim()
    speechBufferRef.current = ""
    speechDebounceRef.current = null
    if (accumulated) processUserSpeech(accumulated)
  }, SPEECH_DEBOUNCE_MS)
}

The 1200ms debounce value was determined empirically. Values below 800ms caused fragmentation of multi-clause Japanese utterances (where inter-clause pauses naturally reach 600-700ms). Values above 1500ms introduced perceptible delay that users reported as 'the system ignores me for a moment.' The 1200ms value balances completeness with responsiveness across both English and Japanese speech patterns.

3.3 Processing Lock and Guard

The isProcessingRef flag prevents concurrent Gemini requests. When processUserSpeech begins, it sets the flag to true and checks it at entry:

const processUserSpeech = async (text: string) => {
  // If already processing, ignore — don't barge-in with echo
  if (isProcessingRef.current) return
  isProcessingRef.current = true
  processingStartRef.current = Date.now()

  // Pause recognition to prevent TTS echo
  pauseRecognition()
  // ... Gemini stream + TTS ...

  // ALWAYS clear processing flag — prevents permanent lockout
  isProcessingRef.current = false
  // Resume recognition after TTS echo dissipates
  setTimeout(resumeRecognition, 500)
}

The processingStartRef timestamp enables the heartbeat monitor to detect stuck processing states. The flag is unconditionally cleared in the finally block, preventing permanent lockout from unhandled exceptions in the streaming or TTS subsystems.


4. The Streaming Pipeline: Token to Audio

4.1 Pipeline Architecture

The complete pipeline from user speech to system audio response traverses seven stages:

StageComponentLatency ContributionDescription
1. CaptureWeb Speech API~50msBrowser speech-to-text (continuous mode)
2. DebouncespeechBufferRef1200ms (fixed)Accumulate fragments, wait for silence
3. GenerateGemini 2.0 Flash~200-400ms TTFSServer-streamed response via ReadableStream
4. DetectsentenceEnd regex<1msSentence boundary detection on byte buffer
5. SynthesizeElevenLabs API~150-300mseleven_turbo_v2_5, mp3_22050_32
6. DecodeHTMLAudioElement~20msBrowser MP3 decode + audio output
7. ChainttsChainRef0ms (pipelined)Sequential promise ensures FIFO order

The total perceived latency for the first sentence is dominated by stages 2 and 3: the debounce delay (fixed at 1200ms) plus the time for Gemini to generate tokens through the first sentence boundary (typically 200-400ms for a 15-20 token sentence). Stages 5 and 6 add 170-320ms for the first sentence but are fully pipelined for subsequent sentences — while sentence N is playing, sentence N+1 is already being synthesized.

4.2 Streaming Parser Implementation

The core streaming loop reads from the Gemini response body as a ReadableStream, accumulating bytes in a buffer and scanning for sentence boundaries on each chunk arrival:

const reader = chatRes.body.getReader()
const decoder = new TextDecoder()
let buffer = ""
let fullText = ""

// Reset TTS chain
ttsChainRef.current = Promise.resolve()

while (true) {
  const { done, value } = await reader.read()
  if (done) break
  if (signal.aborted) break

  buffer += decoder.decode(value, { stream: true })

  // Extract complete sentences
  const sentenceEnd = /[\u3002.!?\uff01\uff1f\n]/
  let match = sentenceEnd.exec(buffer)
  while (match) {
    const idx = match.index + 1
    const sentence = buffer.slice(0, idx).trim()
    buffer = buffer.slice(idx)

    if (sentence) {
      fullText += (fullText ? " " : "") + sentence
      enqueueTTS(sentence, signal)
    }

    match = sentenceEnd.exec(buffer)
  }
}

// Handle remaining text (no punctuation at end)
if (!signal.aborted && buffer.trim()) {
  fullText += (fullText ? " " : "") + buffer.trim()
  enqueueTTS(buffer.trim(), signal)
}

The { stream: true } option on TextDecoder is essential — it handles multi-byte UTF-8 characters (common in Japanese) that may be split across chunk boundaries. Without this flag, a character like (3 bytes in UTF-8) could be split between two chunks, producing a decode error.

4.3 Action Stream Parsing

When the Action Router is enabled, the Gemini response stream contains a structured prefix encoding action metadata before the natural language response:

// Parse __ACTIONS__ prefix from action-chat stream
if (!actionsParsed && buffer.includes("__ACTIONS__")) {
  const actionEnd = buffer.indexOf("__END_ACTIONS__")
  if (actionEnd !== -1) {
    const actionJson = buffer.slice(
      buffer.indexOf("__ACTIONS__") + 11,
      actionEnd,
    )
    try {
      const meta = JSON.parse(actionJson) as ActionMeta
      if (meta.actions?.length > 0) {
        onClientInstructions?.(meta.actions, meta.teamContext)
      }
    } catch { /* JSON parse failed — ignore */ }
    buffer = buffer.slice(actionEnd + 15)
    actionsParsed = true
  } else {
    continue // wait for __END_ACTIONS__
  }
}

This in-band signaling approach embeds structured data in the text stream without requiring a separate WebSocket channel. The __ACTIONS__...__END_ACTIONS__ delimiters are chosen to be extremely unlikely in natural language, and the parser waits for the complete delimiter pair before attempting JSON parse, handling partial chunk delivery gracefully.


5. Action Router: Voice-Driven Multi-Team Tool Orchestration

5.1 Router Architecture

The Action Router is a singleton dispatcher that maps Gemini function calls to handler functions organized across four teams. The router receives a function call from Gemini (name + arguments), locates the registered handler, executes it, and returns both a function response (fed back to Gemini for conversational continuity) and client instructions (for UI updates).

class ActionRouter {
  private registry = new Map<string, ActionDefinition>()

  register(definition: ActionDefinition): void {
    this.registry.set(definition.name, definition)
  }

  async dispatch(call: GeminiFunctionCall): Promise<ActionResult> {
    const definition = this.registry.get(call.name)

    if (!definition) {
      return {
        functionResponse: `Error: Unknown action "${call.name}"`,
        clientInstructions: [
          { type: "notify", message: `Unknown action: ${call.name}`, severity: "error" },
        ],
      }
    }

    try {
      return await definition.handler(call.args)
    } catch (err) {
      const message = err instanceof Error ? err.message : "Unknown error"
      return {
        functionResponse: `Error executing ${call.name}: ${message}`,
        clientInstructions: [
          { type: "notify", message: `Action failed: ${message}`, severity: "error" },
        ],
      }
    }
  }
}

The registry uses a flat Map<string, ActionDefinition> rather than a hierarchical team-based structure. This ensures O(1) dispatch time regardless of team count, and allows tools to be shared across teams without duplication.

5.2 Team Organization and Tool Distribution

The 29 registered tools are distributed across 4 teams plus 3 shared tools:

TeamToolsScopeExamples
Shared3publicnavigate_dashboard, get_system_status, search_knowledge
Secretary10memberget_calendar, create_calendar_event, search_emails, send_email, get_attendance
Sales10membercalculate_estimate, create_invoice_draft, get_deals, draft_proposal, get_monthly_revenue
Document8membercreate_report, create_spreadsheet, create_presentation, generate_meeting_notes
Dev8membercode_consult, architecture_review, tech_debt_assess, sprint_plan, debug_assist

5.3 Confidence-Weighted Team Inference

The team inference system determines which team is contextually active based on recent function call history. This enables the system to adapt its persona and available tool emphasis without explicit user switching. The algorithm uses recency-weighted counting:

export function inferTeam(
  recentFunctionCalls: string[],
): { teamId: TeamId; confidence: number } {
  if (recentFunctionCalls.length === 0) {
    return { teamId: "secretary", confidence: 0 }
  }

  const teamCounts: Record<TeamId, number> = {
    secretary: 0, sales: 0, document: 0, dev: 0,
  }

  for (let i = 0; i < recentFunctionCalls.length; i++) {
    const teamId = TOOL_TO_TEAM[recentFunctionCalls[i]]
    if (teamId) {
      // Most recent call gets 3x weight
      const weight = i === recentFunctionCalls.length - 1 ? 3 : 1
      teamCounts[teamId] += weight
    }
  }

  const entries = Object.entries(teamCounts) as [TeamId, number][]
  entries.sort((a, b) => b[1] - a[1])

  const [topTeam, topCount] = entries[0]
  const totalWeight = entries.reduce((sum, [, c]) => sum + c, 0)

  return {
    teamId: topTeam,
    confidence: totalWeight > 0 ? topCount / totalWeight : 0,
  }
}

The 3x weight for the most recent call creates a strong recency bias: a single Secretary tool call immediately shifts the inferred team even if the previous three calls were all Sales. The confidence score (ratio of top-team weight to total weight) serves as a signal to the UI layer — low confidence (below 0.5) indicates the user is switching between domains, while high confidence (above 0.8) indicates sustained single-domain interaction.

5.4 Client Instructions Type System

The Action Router outputs a discriminated union of client instructions that the frontend interprets to drive UI state changes:

export type ClientInstruction =
  | { type: "navigate"; path: string }
  | { type: "notify"; message: string; severity: ActionSeverity }
  | { type: "update_panel"; panel: string; data: Record<string, unknown> }
  | { type: "open_modal"; modal: string; props: Record<string, unknown> }
  | { type: "noop" }

This design separates the voice system's semantic understanding ('the user wants to see the calendar') from the UI system's rendering logic ('navigate to /dashboard/calendar'). The voice pipeline produces abstract instructions; the frontend instruction executor maps them to concrete DOM operations.


6. Infinite Session Management via Rolling Summaries

6.1 The Context Window Problem

LLM context windows are finite. A voice session that runs for 30+ minutes can easily accumulate 50-100 conversation turns, exceeding the effective context window and causing the model to lose early conversation context or degrade in response quality. Simply truncating old messages discards potentially critical context — the user's initial instructions, established preferences, and unresolved topics.

6.2 Rolling Summary Architecture

MARIA OS implements a rolling summary mechanism that compresses old conversation history into a condensed summary while preserving recent messages verbatim. The algorithm triggers when the conversation exceeds a threshold length:

const SUMMARY_THRESHOLD = 16
const KEEP_RECENT = 6

if (conversationRef.current.length > SUMMARY_THRESHOLD) {
  const toSummarize = conversationRef.current.slice(0, -KEEP_RECENT)
  const recent = conversationRef.current.slice(-KEEP_RECENT)
  try {
    const res = await fetch("/api/voice/summarize", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        messages: toSummarize,
        existingSummary: summaryRef.current,
      }),
    })
    if (res.ok) {
      const { summary } = await res.json()
      summaryRef.current = summary
    }
  } catch { /* summarization failed — just trim */ }
  // Reconstruct: summary context + recent messages
  conversationRef.current = [
    ...(summaryRef.current
      ? [{ role: "user", text: `[Conversation summary: ${summaryRef.current}]` }]
      : []),
    ...recent,
  ]
}
Definition
The rolling summary is a function S : H_n -> (s, H_k) where H_n is a conversation history of n messages, s is a compressed summary string, and H_k is the most recent k messages preserved verbatim. The summary is injected as a synthetic user message at the head of the reconstructed history, maintaining the LLM's conversational context without occupying the full token budget of the original messages.

6.3 Summary Composition

A critical property of the rolling summary is composability: when the conversation grows again past the threshold, the new summary is generated from both the previous summary and the messages accumulated since. The existingSummary parameter ensures that information from the earliest parts of the conversation is not lost through repeated summarization — the summary accumulates rather than replacing.

The threshold of 16 messages (with 6 kept) was chosen to balance summarization frequency against API cost. At typical voice conversation rates (2-4 turns per minute), summarization triggers every 4-8 minutes — frequent enough to prevent context overflow but infrequent enough that the summarization API call does not perceptibly delay the conversation.


7. Resilience Engineering: Heartbeat, Failsafe, and Recovery

7.1 Failure Modes in Voice Pipelines

Voice pipelines are uniquely fragile because they depend on multiple browser APIs that can fail silently. Unlike HTTP requests that return error codes, browser APIs like SpeechRecognition and AudioContext can enter invalid states without triggering error callbacks. We identified three critical failure modes in production:

  • Stuck processing: The isProcessingRef flag becomes permanently true due to an unhandled exception in the Gemini stream or TTS chain, preventing all future user input from being processed
  • AudioContext suspension: iOS Safari suspends AudioContext when the page loses focus (tab switch, lock screen), and does not automatically resume it when focus returns. The voice level meter reads zero, and TTS audio plays but is inaudible
  • SpeechRecognition death: SpeechRecognition fires onend but the restart attempt inside onend throws silently, leaving recognition permanently stopped. The user speaks but nothing happens

7.2 Heartbeat Monitor

The heartbeat runs on a 10-second interval and addresses all three failure modes:

// Heartbeat keepalive (10s interval)
heartbeatRef.current = setInterval(() => {
  if (statusRef.current !== "connected") return

  // Force-clear processing flag if stuck for >30s
  if (isProcessingRef.current &&
      Date.now() - processingStartRef.current > 30_000) {
    isProcessingRef.current = false
    resumeRecognition()
  }

  // Resume suspended AudioContext
  if (audioContextRef.current?.state === "suspended") {
    audioContextRef.current.resume().catch(() => {})
  }
}, 10_000)

The 30-second stuck-processing threshold was derived from production data: the longest legitimate processing cycle (a complex multi-tool action with several TTS sentences) completed in under 25 seconds. The 30-second threshold provides a 5-second safety margin while ensuring that truly stuck sessions recover within one heartbeat cycle after the threshold.

7.3 SpeechRecognition Restart Strategy

The SpeechRecognition onend handler implements a dual-path restart strategy depending on whether the system is currently processing:

recognition.onend = () => {
  if (statusRef.current !== "connected") return

  if (!isProcessingRef.current) {
    // Not processing — restart immediately
    const delay = iosDevice ? 100 : 0
    setTimeout(() => {
      if (statusRef.current === "connected") {
        try { recognition.start() } catch { /* already running */ }
      }
    }, delay)
  } else {
    // Processing (TTS playing) — set a failsafe timer.
    // If resumeRecognition doesn't restart within 30s, force restart.
    setTimeout(() => {
      if (statusRef.current === "connected") {
        isProcessingRef.current = false
        try { recognition.start() } catch { /* already running */ }
      }
    }, 30_000)
  }
}

The iOS-specific 100ms delay before restart addresses a WebKit bug where calling recognition.start() synchronously inside onend throws an InvalidStateError. The delay ensures the internal state machine has fully transitioned to 'stopped' before the restart attempt.

7.4 Abort-Controlled Cleanup

Every Gemini request and TTS chain is governed by an AbortController that enables instantaneous cleanup on disconnect or barge-in:

const interruptPlayback = () => {
  // Abort in-flight Gemini stream + TTS fetches
  if (abortRef.current) {
    abortRef.current.abort()
    abortRef.current = null
  }
  // Stop audio immediately
  if (elAudioRef.current) {
    elAudioRef.current.pause()
    elAudioRef.current.currentTime = 0
  }
  // Reset TTS chain
  ttsChainRef.current = Promise.resolve()
}

The AbortController signal is threaded through three layers: the fetch() call to Gemini (cancels the HTTP stream), each fetch() call to ElevenLabs (cancels TTS synthesis), and each playElevenLabsTTS promise (stops audio playback). This ensures that aborting the controller immediately halts all in-flight network requests and stops all audio output within a single event loop tick.


8. Cross-Platform Compatibility: The In-App Browser Challenge

8.1 The In-App Browser Problem

Modern mobile users frequently open links from messaging apps (LINE, Discord, Instagram) that render web pages in embedded WebView containers rather than the native browser. These in-app browsers provide a subset of web APIs with inconsistent behavior, particularly for audio and speech APIs. MARIA OS must detect these environments and provide graceful degradation rather than cryptic failures.

8.2 Detection Architecture

The detection system uses a two-tier approach: User-Agent pattern matching for named in-app browsers, and heuristic detection for iOS WKWebView containers that do not identify themselves:

const IN_APP_PATTERNS: [RegExp, InAppBrowser][] = [
  [/\bLine\//i, "line"],
  [/\bLIFF\b/i, "line"],
  [/\bDiscord\b/i, "discord"],
  [/\bInstagram\b/i, "instagram"],
  [/\bFBAN\b/i, "facebook"],
  [/\bFBAV\b/i, "facebook"],
  [/\bFB_IAB\b/i, "facebook"],
  [/\bMessenger\b/i, "messenger"],
  [/\bTwitter\b/i, "twitter"],
  [/\bMicroMessenger\b/i, "wechat"],
  [/\bTikTok\b/i, "tiktok"],
  [/\bSnapchat\b/i, "snapchat"],
]

For iOS WKWebView detection, the system checks for the absence of window.safari — a property present in real Safari but missing in WKWebView containers used by LINE, Discord, and other apps that do not add identifying strings to the User-Agent:

function detectIOSInAppWebView(): boolean {
  if (typeof navigator === "undefined" || typeof window === "undefined") return false
  const ua = navigator.userAgent
  const isiOS = /iPad|iPhone|iPod/.test(ua) ||
    (navigator.platform === "MacIntel" && navigator.maxTouchPoints > 1)
  if (!isiOS) return false

  const w = window as any
  if (typeof w.safari === "undefined") return true
  return false
}

8.3 iOS-Specific Audio Handling

iOS Safari (and iOS WebView) imposes two constraints that require special handling:

  • AudioContext sample rate: iOS hardware runs at 44.1kHz or 48kHz. Forcing a different sample rate (e.g., 16kHz for voice) causes silent or distorted output. MARIA OS creates AudioContext without a sampleRate option, letting the device choose its native rate
  • Audio autoplay policy: iOS requires audio playback to be initiated by a direct user gesture. MARIA OS creates a persistent <audio> element during the connect() call (which is a button click handler) and unlocks it by playing a silent base64-encoded MP3 snippet
// iOS audio unlock — silent MP3 played during user gesture
if (iosDevice) {
  if (!elAudioRef.current) {
    elAudioRef.current = new Audio()
  }
  const audio = elAudioRef.current
  audio.src = "data:audio/mpeg;base64,SUQzBAAA..." // silent MP3
  audio.volume = 0.01
  try {
    await Promise.race([
      audio.play(),
      new Promise((r) => setTimeout(r, 1000)), // 1s max for unlock
    ])
    audio.pause()
  } catch { /* non-fatal */ }
  audio.volume = 1
}

The Promise.race with a 1-second timeout prevents the unlock from blocking the connection flow on devices where play() hangs indefinitely. The low volume (0.01) ensures the silent MP3 is truly inaudible even if the device speaker is at maximum volume.

8.4 SpeechRecognition Continuous Mode

On desktop browsers and Android, SpeechRecognition.continuous = true keeps the recognition session active across multiple utterances. On iOS, continuous mode is unreliable — recognition silently stops after 10-15 seconds. MARIA OS sets continuous = false on iOS and relies on the onend handler to restart recognition after each utterance:

const recognition = new SR()
recognition.continuous = !iosDevice
recognition.interimResults = false
recognition.lang = locale === "ja" ? "ja-JP" : "en-US"

8.5 Platform Support Matrix

PlatformSpeechRecognitionAudioContextTTS PlaybackMic AccessStatus
Chrome (Desktop)continuousnative rateHTMLAudioElementstandardFull support
Safari (macOS)webkit prefixnative rateHTMLAudioElementstandardFull support
Safari (iOS)webkit, non-continuousnative rate, must not forcesilent MP3 unlockstandardFull support (with workarounds)
Chrome (Android)continuousnative rateHTMLAudioElementstandardFull support
LINE in-appunavailablevariesblockedtimeoutRedirect to native browser
Discord in-appunavailablevariesblockedtimeoutRedirect to native browser
Instagram in-apppartialvariesblockedblockedRedirect to native browser

9. Production Metrics and Evaluation

9.1 TTS Configuration

MARIA OS uses ElevenLabs eleven_turbo_v2_5 model optimized for low-latency streaming. The voice parameters were tuned for natural conversational speech:

const res = await fetch(
  `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`,
  {
    method: "POST",
    headers: {
      "xi-api-key": ELEVENLABS_API_KEY,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      text,
      model_id: "eleven_turbo_v2_5",
      voice_settings: {
        stability: 0.5,
        similarity_boost: 0.75,
      },
      output_format: "mp3_22050_32",
    }),
  },
)

The parameter choices reflect deliberate tradeoffs: stability 0.5 allows expressive variation in pitch and rhythm (higher values produce more monotone output), similarity_boost 0.75 maintains voice identity while allowing natural variation, and mp3_22050_32 (22.05kHz, 32kbps) balances audio quality against download size — a single sentence (15-20 words) produces approximately 8-12KB of audio data, enabling rapid transfer even on mobile networks.

9.2 Latency Breakdown

Measured across 500 representative interactions:

MetricP50P90P99
Speech debounce1200ms1200ms1200ms
Gemini TTFS (first sentence)280ms420ms680ms
ElevenLabs synthesis190ms310ms520ms
Audio decode + play start15ms25ms40ms
**Total first-sentence latency****1685ms****1955ms****2440ms**
Perceived latency (excl. debounce)**485ms****755ms****1240ms**
The 1200ms debounce is a fixed cost that users experience as natural 'thinking time' rather than system delay, because it begins during the user's own silence. The perceived latency — from the moment the user stops speaking to the first audio — is dominated by Gemini TTFS and ElevenLabs synthesis, both under 800ms at P90.

9.3 Reliability Metrics

MetricValueMeasurement Period
Sessions without ordering violation2,400+90-day production window
Mean session duration8.2 minAll sessions
Max session duration (rolling summary)47 minSingle session record
Heartbeat recovery events23Processing stuck >30s
AudioContext recovery events156iOS background/foreground
In-app browser redirects412Users guided to native browser

9.4 Voice Level Metering

The voice level meter provides real-time visual feedback through a 3D OGL orb that reacts to speech amplitude. The metering uses an AnalyserNode to compute RMS energy from the microphone input:

const measureLevel = useCallback(() => {
  if (!analyserRef.current || !dataArrayRef.current) return
  analyserRef.current.getByteFrequencyData(dataArrayRef.current)
  let sum = 0
  for (let i = 0; i < dataArrayRef.current.length; i++) {
    const v = dataArrayRef.current[i] / 255
    sum += v * v
  }
  const rms = Math.sqrt(sum / dataArrayRef.current.length)
  setVoiceLevel(Math.min(rms * 3.0, 1))
  rafRef.current = requestAnimationFrame(measureLevel)
}, [])

The 3.0x multiplier on the RMS value compensates for the typical low energy of speech signals relative to the full dynamic range. Without amplification, conversational speech registers at only 0.1-0.2 on the 0-1 scale, producing barely visible orb animation. The Math.min(..., 1) clamp prevents visual artifacts from loud transients (coughing, desk impacts).


10. Future Extensions: Toward Recursive Voice Governance

10.1 Connection State Machine

The voice session lifecycle follows a four-state machine:

\text{disconnected} \xrightarrow{\text{connect()}} \text{connecting} \xrightarrow{\text{success}} \text{connected} \xrightarrow{\text{disconnect()}} \text{disconnected} $$
\text{connecting} \xrightarrow{\text{failure}} \text{error} \xrightarrow{\text{connect()}} \text{connecting} $$

The connected state is the only state where the heartbeat runs, SpeechRecognition is active, and TTS playback is permitted. The error state preserves the connection metadata for diagnostic purposes and allows retry via a subsequent connect() call.

10.2 Toward Responsibility-Gated Voice Actions

The current Action Router dispatches tools without responsibility gates — every registered action is immediately executable. A natural extension is to integrate the MARIA OS decision pipeline into the voice layer, requiring approval gates for high-impact voice actions. Consider a user saying 'Send the revised proposal to Tanaka-san with the 15% discount.' The draft_proposal tool currently executes immediately; with responsibility gating, the system would detect the financial impact (discount authorization), create a decision record in the pipeline, and respond: 'I have drafted the proposal with 15% discount. This requires manager approval before sending. Shall I route it for approval?'

Definition
A responsibility-gated voice action is an action a where dispatch(a) triggers a decision pipeline transition rather than immediate execution. The gate threshold is determined by the action's risk score rho(a) and the user's authority level auth(u), following the gate function:
G(a, u) = \begin{cases} \text{execute} & \text{if } \rho(a) \leq \text{auth}(u) \\ \text{escalate} & \text{if } \rho(a) > \text{auth}(u) \end{cases} $$

10.3 Multi-Modal Voice Governance

Future iterations will extend the voice pipeline to support multi-modal evidence collection during voice interactions. When a responsibility gate is triggered, the system could prompt the user to provide verbal justification that is transcribed, timestamped, and attached to the decision record as an evidence bundle. This creates a fully auditable chain from voice command through approval gate to execution, with the user's own words serving as the authorization evidence.

10.4 Recursive Self-Improvement of Voice Pipelines

The rolling summary mechanism provides a foundation for recursive voice intelligence: the system could analyze summary patterns to detect recurring user intents, preemptively load relevant tools, and suggest workflow optimizations. A user who consistently requests calendar checks followed by email drafts could be offered a combined 'schedule and notify' voice workflow, dynamically composed from existing tools without engineering intervention.

The sentence-level streaming architecture described in this paper is not merely an optimization of existing voice interface patterns. It represents a deliberate alignment between cognitive science (the sentence as processing unit), systems engineering (the promise chain as ordering primitive), and product design (the 1.2-second debounce as conversational rhythm). When these three domains converge on the same architectural unit, the result is a voice interface that feels natural not because it imitates human speech, but because it respects human cognition.

References

  • Miller, G.A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81-97.
  • Levinson, S.C. (2016). Turn-taking in human communication: Origins and implications for language processing. Trends in Cognitive Sciences, 20(1), 6-14.
  • Stivers, T. et al. (2009). Universals and cultural variation in turn-taking in conversation. PNAS, 106(26), 10587-10592.
  • Web Speech API Specification, W3C Community Group Report, 2024.
  • ElevenLabs Text-to-Speech API Documentation, v1, 2025.
  • Google Gemini API Reference: Streaming and Function Calling, 2025.

R&D BENCHMARKS

First-Sentence Latency

<800ms

Time from user speech end to first TTS audio playback, measured as speech debounce (1.2s) + Gemini first-sentence stream + ElevenLabs synthesis. Sentence-level streaming reduces perceived wait by 62% vs. full-response buffering.

Sentence Order Violations

0

Zero out-of-order TTS playback events across 2,400+ production sessions. Sequential promise chaining guarantees FIFO ordering without explicit queue data structures.

Infinite Session Capacity

16+ msg rolling

Rolling summary compression triggers at 16 messages, retaining last 6 messages plus a compressed summary prefix. Sessions sustained for 45+ minutes without context degradation.

Cross-Platform Coverage

9 in-app browsers

Detection and graceful degradation for LINE, Discord, Instagram, Facebook, Messenger, X/Twitter, WeChat, TikTok, and Snapchat in-app browsers plus iOS WKWebView detection.

Published and reviewed by the MARIA OS Editorial Pipeline.

© 2026 MARIA OS. All rights reserved.