Abstract
Voice user interfaces occupy a unique position in the design space of human-computer interaction: they must operate in real-time against the relentless clock of human conversational expectation, where delays measured in hundreds of milliseconds shift perception from 'responsive assistant' to 'broken system.' The central engineering challenge is not generating speech — modern TTS systems produce human-quality audio — but orchestrating the pipeline from language model token generation through sentence segmentation, TTS synthesis, audio playback, and barge-in handling, all while maintaining the illusion of a single continuous conversation.
This paper presents the sentence-level streaming VUI architecture implemented in MARIA OS, a production system that processes voice interactions through a pipeline comprising Web Speech API recognition, Gemini 2.0 Flash language generation, real-time sentence boundary detection, ElevenLabs TTS synthesis, and sequential audio playback. We formalize the cognitive justification for sentence-level granularity (as opposed to word-level or full-response buffering), derive the pipeline latency model, and present the full implementation of the useGeminiLive hook — 693 lines of React state management that coordinates 12 concurrent asynchronous subsystems including microphone input, AudioContext metering, SpeechRecognition event handling, HTTP streaming, TTS promise chaining, abort control, speech debouncing, rolling conversation summarization, heartbeat monitoring, and iOS audio unlock sequences.
Beyond the core voice pipeline, we describe the Action Router — a dispatch layer that routes Gemini function calls across 29 tools organized into 4 teams (Secretary, Sales, Document, Dev) with confidence-weighted team inference from recent call history. The router produces both function responses (fed back to Gemini for conversational continuity) and client instructions (navigate, notify, update_panel, open_modal) that drive UI state changes from voice commands.
We report production metrics from 2,400+ voice sessions: sub-800ms first-sentence latency, zero sentence-ordering violations, 45+ minute infinite sessions via rolling summaries, and graceful degradation across 9 detected in-app browser environments. The architecture demonstrates that the sentence is the correct unit of streaming for voice interfaces — matching the cognitive chunk size of human speech perception while providing sufficient context for high-quality TTS prosody.
1. The Latency-Naturalness Tradeoff in Voice Interfaces
Every voice interface must choose where to place itself on a spectrum between two extremes. At one end, token-level streaming sends each generated token to TTS immediately, minimizing time-to-first-audio but producing choppy, unnatural speech with poor prosody — the TTS engine cannot plan intonation without seeing the full clause. At the other end, full-response buffering waits for the complete LLM response before synthesizing speech, producing natural audio but imposing multi-second latencies that violate conversational turn-taking norms.
1.1 Conversational Timing Expectations
Psycholinguistic research on turn-taking establishes that humans expect response onset within 200-700ms of turn completion. Delays beyond 1000ms trigger negative social inference — the listener perceives the speaker as uncertain, disengaged, or malfunctioning. In human-computer interaction, this threshold is even more punishing: users attribute delays to system failure rather than cognitive processing.
The first-sentence latency decomposes as
t_first_audio = t_debounce + t_LLM_first_sentence + t_TTS_synthesis + t_audio_decode
where t_debounce is the speech silence buffer (ensuring the user has finished speaking), t_LLM_first_sentence is the time for the language model to generate tokens through the first sentence boundary, t_TTS_synthesis is the ElevenLabs API round-trip for the first sentence, and t_audio_decode is the browser audio decode and playback start time.
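Plugging in the median values reported in Section 9.2 (t_debounce = 1200ms, t_LLM_first_sentence ≈ 280ms, t_TTS_synthesis ≈ 190ms, t_audio_decode ≈ 15ms) gives t_first_audio ≈ 1685ms end-to-end, or roughly 485ms of perceived latency once the fixed 1200ms debounce window is excluded, matching the P50 figures in the latency table.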
1.2 The Prosody-Latency Frontier
TTS quality depends critically on the amount of lookahead context available to the synthesis model. Single-word inputs produce monotone, robotic output. Full-paragraph inputs enable natural prosody planning — emphasis, rhythm, pitch contour — but delay the first audio by the full generation time. The sentence represents the minimal unit at which TTS engines achieve natural prosody: it contains sufficient syntactic structure for the model to plan intonation contours, stress patterns, and phrasal boundaries.
This observation justifies the design decision at the core of MARIA OS's voice pipeline: detect sentence boundaries in the streaming token output and dispatch each complete sentence to TTS independently.
2. Sentence-Level Streaming: Cognitive Justification and Architecture
2.1 The Sentence as Cognitive Chunk
In cognitive psychology, the sentence is recognized as the primary unit of language processing in working memory. Miller's chunking theory (1956) establishes that working memory operates on structured chunks rather than individual elements. In speech perception, the sentence (or clause) serves as the fundamental chunk: listeners process incoming speech incrementally but commit to interpretation at sentence boundaries. This creates a natural 'processing checkpoint' where the listener integrates the current sentence into discourse context before attending to the next.
A voice interface that streams at sentence granularity aligns with this cognitive architecture: each sentence arrives as a complete semantic unit that the listener can process immediately, while the next sentence is being generated and synthesized in parallel. The listener never receives a partial thought (as with word-level streaming) or an overwhelming monologue (as with full-response buffering).
2.2 Sentence Boundary Detection
The sentence boundary detector must operate on a byte stream of UTF-8 tokens arriving incrementally from the LLM. It must handle both English and Japanese punctuation, as MARIA OS is bilingual. The detection regex is:
// Extract complete sentences (Japanese and English punctuation)
const sentenceEnd = /[\u3002.!?\uff01\uff1f\n]/
let match = sentenceEnd.exec(buffer)
while (match) {
const idx = match.index + 1
const sentence = buffer.slice(0, idx).trim()
buffer = buffer.slice(idx)
if (sentence) {
fullText += (fullText ? " " : "") + sentence
// Fire TTS immediately for this sentence
enqueueTTS(sentence, signal)
}
match = sentenceEnd.exec(buffer)
}
The detector recognizes seven boundary characters: 。 (Japanese period), . (English period), ! and ? (English exclamation/question), ！ and ？ (full-width Japanese exclamation/question), and \n (newline, treating line breaks as sentence boundaries). This covers the vast majority of sentence-final punctuation in both languages without requiring a full NLP parser.
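As a worked illustration (the input string below is invented for the example, not taken from production logs), restating the same scan as a standalone function shows how a bilingual reply is carved into TTS units:
// Standalone restatement of the boundary scan above, for illustration only
function splitSentences(input: string): string[] {
  const sentenceEnd = /[\u3002.!?\uff01\uff1f\n]/
  const sentences: string[] = []
  let buffer = input
  let match = sentenceEnd.exec(buffer)
  while (match) {
    const idx = match.index + 1
    const sentence = buffer.slice(0, idx).trim()
    buffer = buffer.slice(idx)
    if (sentence) sentences.push(sentence)
    match = sentenceEnd.exec(buffer)
  }
  if (buffer.trim()) sentences.push(buffer.trim())
  return sentences
}

splitSentences("かしこまりました。Meeting is set for 3 PM. ご確認ください！")
// -> ["かしこまりました。", "Meeting is set for 3 PM.", "ご確認ください！"]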
2.3 Sequential TTS Promise Chain
The critical invariant of the TTS subsystem is sentence ordering: sentences must play in the exact order they were generated, even though TTS synthesis is asynchronous and individual sentence durations vary. MARIA OS enforces this through a sequential promise chain — a pattern where each TTS task is appended to a shared promise chain, guaranteeing FIFO execution without explicit queue data structures.
// Enqueue a sentence for sequential TTS playback
const enqueueTTS = (text: string, signal?: AbortSignal) => {
ttsChainRef.current = ttsChainRef.current.then(() => {
if (signal?.aborted) return
return playElevenLabsTTS(text, signal)
})
}
This three-line function is the architectural keystone of the entire voice pipeline. The ttsChainRef holds a single Promise<void> that represents the tail of the playback queue. Each call to enqueueTTS extends the chain by one link: the new sentence will begin playing only after the previous sentence's promise resolves (i.e., after its audio finishes). The signal parameter threads an AbortController through the entire chain, enabling instantaneous cancellation on barge-in.
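Each link must resolve only when its sentence has finished playing and must bail out quickly when aborted. The hook's playElevenLabsTTS is not reproduced in this paper; the following is a minimal sketch under those constraints, reusing the ElevenLabs request shape from Section 9.1 (voiceId and ELEVENLABS_API_KEY come from the surrounding configuration, and the fresh Audio element is a simplification, since the production hook reuses the persistent elAudioRef element described in Sections 7.4 and 8.3):
// Hypothetical sketch of one chain link: synthesize one sentence, play it, resolve when playback ends
async function playElevenLabsTTS(text: string, signal?: AbortSignal): Promise<void> {
  if (signal?.aborted) return
  // Synthesis request mirrors the configuration in Section 9.1 (voice_settings omitted for brevity)
  const res = await fetch(
    `https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`,
    {
      method: "POST",
      headers: { "xi-api-key": ELEVENLABS_API_KEY, "Content-Type": "application/json" },
      body: JSON.stringify({ text, model_id: "eleven_turbo_v2_5", output_format: "mp3_22050_32" }),
      signal, // aborting cancels synthesis mid-flight
    },
  )
  if (!res.ok || signal?.aborted) return
  const url = URL.createObjectURL(await res.blob())
  await new Promise<void>((resolve) => {
    const audio = new Audio(url)
    const finish = () => { URL.revokeObjectURL(url); resolve() }
    audio.onended = finish
    audio.onerror = finish
    signal?.addEventListener("abort", () => { audio.pause(); finish() }, { once: true })
    audio.play().catch(finish) // an autoplay rejection must not wedge the chain
  })
}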
3. Full-Duplex Conversation Engine Design
3.1 The Barge-In Problem
Full-duplex voice interaction requires handling the case where the user speaks while the system is still playing audio. In telephony, this is called barge-in — the user interrupts the system's output. Barge-in creates two simultaneous requirements: (1) immediately stop the current TTS playback and cancel any pending sentences, and (2) prevent the system's own audio output from being captured by the microphone and interpreted as user speech (echo feedback).
MARIA OS handles barge-in through a three-layer strategy: recognition pausing, abort-controlled playback, and post-TTS recovery delay.
// Pause SpeechRecognition to prevent echo feedback during TTS
const pauseRecognition = () => {
if (recognitionRef.current) {
try { recognitionRef.current.stop() } catch { /* already stopped */ }
}
}
// Resume SpeechRecognition after TTS finishes
const resumeRecognition = () => {
if (statusRef.current === "connected" && recognitionRef.current) {
try { recognitionRef.current.start() } catch { /* already running */ }
}
}
The pause/resume pattern is deceptively simple but critical for preventing feedback loops. When processUserSpeech is called, recognition is paused immediately before any TTS begins. After the entire TTS chain completes, a 500ms delay allows residual echo to dissipate before recognition restarts. Without this delay, the final syllable of TTS audio would be captured by the microphone and re-processed as user input, creating an infinite echo loop.
3.2 Speech Debouncing
SpeechRecognition emits result events at the granularity of recognized phrases, not complete utterances. A user saying 'Schedule a meeting with Tanaka-san tomorrow at three PM' might produce three separate final results: 'Schedule a meeting', 'with Tanaka-san', 'tomorrow at three PM.' Without debouncing, each fragment would trigger an independent Gemini request, producing three separate responses.
MARIA OS accumulates speech fragments in a buffer and waits for 1.2 seconds of silence before dispatching the complete utterance:
const SPEECH_DEBOUNCE_MS = 1200 // wait 1.2s of silence after last final result
recognition.onresult = (event: any) => {
const last = event.results[event.results.length - 1]
if (!last.isFinal) return
const text = last[0].transcript.trim()
if (!text) return
// Accumulate text and debounce
speechBufferRef.current += (speechBufferRef.current ? " " : "") + text
if (speechDebounceRef.current) clearTimeout(speechDebounceRef.current)
speechDebounceRef.current = setTimeout(() => {
const accumulated = speechBufferRef.current.trim()
speechBufferRef.current = ""
speechDebounceRef.current = null
if (accumulated) processUserSpeech(accumulated)
}, SPEECH_DEBOUNCE_MS)
}
The 1200ms debounce value was determined empirically. Values below 800ms caused fragmentation of multi-clause Japanese utterances (where inter-clause pauses naturally reach 600-700ms). Values above 1500ms introduced perceptible delay that users reported as 'the system ignores me for a moment.' The 1200ms value balances completeness with responsiveness across both English and Japanese speech patterns.
3.3 Processing Lock and Guard
The isProcessingRef flag prevents concurrent Gemini requests. When processUserSpeech begins, it sets the flag to true and checks it at entry:
const processUserSpeech = async (text: string) => {
  // If already processing, ignore — don't barge-in with echo
  if (isProcessingRef.current) return
  isProcessingRef.current = true
  processingStartRef.current = Date.now()
  // Pause recognition to prevent TTS echo
  pauseRecognition()
  try {
    // ... Gemini stream + TTS ...
  } finally {
    // ALWAYS clear processing flag — prevents permanent lockout
    isProcessingRef.current = false
    // Resume recognition after TTS echo dissipates
    setTimeout(resumeRecognition, 500)
  }
}
The processingStartRef timestamp enables the heartbeat monitor to detect stuck processing states. The flag is unconditionally cleared in the finally block, preventing permanent lockout from unhandled exceptions in the streaming or TTS subsystems.
4. The Streaming Pipeline: Token to Audio
4.1 Pipeline Architecture
The complete pipeline from user speech to system audio response traverses seven stages:
| Stage | Component | Latency Contribution | Description |
|---|---|---|---|
| 1. Capture | Web Speech API | ~50ms | Browser speech-to-text (continuous mode) |
| 2. Debounce | speechBufferRef | 1200ms (fixed) | Accumulate fragments, wait for silence |
| 3. Generate | Gemini 2.0 Flash | ~200-400ms TTFS | Server-streamed response via ReadableStream |
| 4. Detect | sentenceEnd regex | <1ms | Sentence boundary detection on byte buffer |
| 5. Synthesize | ElevenLabs API | ~150-300ms | eleven_turbo_v2_5, mp3_22050_32 |
| 6. Decode | HTMLAudioElement | ~20ms | Browser MP3 decode + audio output |
| 7. Chain | ttsChainRef | 0ms (pipelined) | Sequential promise ensures FIFO order |
The total perceived latency for the first sentence is dominated by stages 2 and 3: the debounce delay (fixed at 1200ms) plus the time for Gemini to generate tokens through the first sentence boundary (typically 200-400ms for a 15-20 token sentence). Stages 5 and 6 add 170-320ms for the first sentence but are fully pipelined for subsequent sentences — while sentence N is playing, sentence N+1 is already being synthesized.
4.2 Streaming Parser Implementation
The core streaming loop reads from the Gemini response body as a ReadableStream, accumulating bytes in a buffer and scanning for sentence boundaries on each chunk arrival:
const reader = chatRes.body.getReader()
const decoder = new TextDecoder()
let buffer = ""
let fullText = ""
// Reset TTS chain
ttsChainRef.current = Promise.resolve()
while (true) {
const { done, value } = await reader.read()
if (done) break
if (signal.aborted) break
buffer += decoder.decode(value, { stream: true })
// Extract complete sentences
const sentenceEnd = /[\u3002.!?\uff01\uff1f\n]/
let match = sentenceEnd.exec(buffer)
while (match) {
const idx = match.index + 1
const sentence = buffer.slice(0, idx).trim()
buffer = buffer.slice(idx)
if (sentence) {
fullText += (fullText ? " " : "") + sentence
enqueueTTS(sentence, signal)
}
match = sentenceEnd.exec(buffer)
}
}
// Handle remaining text (no punctuation at end)
if (!signal.aborted && buffer.trim()) {
fullText += (fullText ? " " : "") + buffer.trim()
enqueueTTS(buffer.trim(), signal)
}
The { stream: true } option on TextDecoder is essential — it handles multi-byte UTF-8 characters (common in Japanese) that may be split across chunk boundaries. Without this flag, a character like 。 (3 bytes in UTF-8) could be split between two chunks, and each partial byte sequence would decode to U+FFFD replacement characters, corrupting the sentence text fed to TTS.
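A small standalone demonstration of the failure mode (not part of the hook) makes the difference concrete:
// "。" encodes to the 3 bytes E3 80 82; split it across two chunks
const bytes = new TextEncoder().encode("了解。") // 9 bytes: 了 解 。 (3 bytes each)
const chunk1 = bytes.slice(0, 7)                 // ends mid-character inside 。
const chunk2 = bytes.slice(7)

// Streaming decoder buffers the dangling bytes until the next chunk arrives
const streaming = new TextDecoder()
const ok = streaming.decode(chunk1, { stream: true }) + streaming.decode(chunk2, { stream: true })
// ok === "了解。"

// One-shot decodes flush immediately, so the split character becomes U+FFFD debris
const broken = new TextDecoder().decode(chunk1) + new TextDecoder().decode(chunk2)
// broken: "了解" followed by U+FFFD replacement characters; the boundary 。 is gone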
4.3 Action Stream Parsing
When the Action Router is enabled, the Gemini response stream contains a structured prefix encoding action metadata before the natural language response:
// Parse __ACTIONS__ prefix from action-chat stream
if (!actionsParsed && buffer.includes("__ACTIONS__")) {
const actionEnd = buffer.indexOf("__END_ACTIONS__")
if (actionEnd !== -1) {
const actionJson = buffer.slice(
buffer.indexOf("__ACTIONS__") + 11,
actionEnd,
)
try {
const meta = JSON.parse(actionJson) as ActionMeta
if (meta.actions?.length > 0) {
onClientInstructions?.(meta.actions, meta.teamContext)
}
} catch { /* JSON parse failed — ignore */ }
buffer = buffer.slice(actionEnd + 15)
actionsParsed = true
} else {
continue // wait for __END_ACTIONS__
  }
}
}This in-band signaling approach embeds structured data in the text stream without requiring a separate WebSocket channel. The __ACTIONS__...__END_ACTIONS__ delimiters are chosen to be extremely unlikely in natural language, and the parser waits for the complete delimiter pair before attempting JSON parse, handling partial chunk delivery gracefully.
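The shape of the parsed metadata is not reproduced above. A plausible minimal form, inferred from the two fields the parser reads and therefore partly an assumption, is:
// Assumed shape of ActionMeta; only the two fields read by the parser are known from the excerpt
interface ActionMeta {
  actions?: ClientInstruction[]                          // see Section 5.4 for ClientInstruction
  teamContext?: { teamId: TeamId; confidence: number }   // plausibly produced by inferTeam (Section 5.3)
}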
5. Action Router: Voice-Driven Multi-Team Tool Orchestration
5.1 Router Architecture
The Action Router is a singleton dispatcher that maps Gemini function calls to handler functions organized across four teams. The router receives a function call from Gemini (name + arguments), locates the registered handler, executes it, and returns both a function response (fed back to Gemini for conversational continuity) and client instructions (for UI updates).
class ActionRouter {
private registry = new Map<string, ActionDefinition>()
register(definition: ActionDefinition): void {
this.registry.set(definition.name, definition)
}
async dispatch(call: GeminiFunctionCall): Promise<ActionResult> {
const definition = this.registry.get(call.name)
if (!definition) {
return {
functionResponse: `Error: Unknown action "${call.name}"`,
clientInstructions: [
{ type: "notify", message: `Unknown action: ${call.name}`, severity: "error" },
],
}
}
try {
return await definition.handler(call.args)
} catch (err) {
const message = err instanceof Error ? err.message : "Unknown error"
return {
functionResponse: `Error executing ${call.name}: ${message}`,
clientInstructions: [
{ type: "notify", message: `Action failed: ${message}`, severity: "error" },
],
}
}
}
}
The registry uses a flat Map<string, ActionDefinition> rather than a hierarchical team-based structure. This ensures O(1) dispatch time regardless of team count, and allows tools to be shared across teams without duplication.
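For concreteness, a hypothetical registration and dispatch of one shared tool from Section 5.2 might look like the sketch below (the handler body, the args field names, and any ActionDefinition fields beyond name and handler are assumptions):
// Hypothetical registration of the shared navigate_dashboard tool
const router = new ActionRouter()
router.register({
  name: "navigate_dashboard",
  handler: async (args: Record<string, unknown>) => ({
    functionResponse: `Opened ${String(args.section ?? "dashboard")} for the user.`,
    clientInstructions: [
      { type: "navigate", path: `/dashboard/${String(args.section ?? "")}` },
    ],
  }),
})

// Dispatching a Gemini function call returns both halves of the result
const result = await router.dispatch({
  name: "navigate_dashboard",
  args: { section: "calendar" },
})
// result.functionResponse  -> fed back to Gemini for conversational continuity
// result.clientInstructions -> executed by the frontend (Section 5.4)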
5.2 Team Organization and Tool Distribution
The 29 registered tools are distributed across 4 teams plus 3 shared tools:
| Team | Tools | Scope | Examples |
|---|---|---|---|
| Shared | 3 | public | navigate_dashboard, get_system_status, search_knowledge |
| Secretary | 10 | member | get_calendar, create_calendar_event, search_emails, send_email, get_attendance |
| Sales | 10 | member | calculate_estimate, create_invoice_draft, get_deals, draft_proposal, get_monthly_revenue |
| Document | 8 | member | create_report, create_spreadsheet, create_presentation, generate_meeting_notes |
| Dev | 8 | member | code_consult, architecture_review, tech_debt_assess, sprint_plan, debug_assist |
5.3 Confidence-Weighted Team Inference
The team inference system determines which team is contextually active based on recent function call history. This enables the system to adapt its persona and available tool emphasis without explicit user switching. The algorithm uses recency-weighted counting:
export function inferTeam(
recentFunctionCalls: string[],
): { teamId: TeamId; confidence: number } {
if (recentFunctionCalls.length === 0) {
return { teamId: "secretary", confidence: 0 }
}
const teamCounts: Record<TeamId, number> = {
secretary: 0, sales: 0, document: 0, dev: 0,
}
for (let i = 0; i < recentFunctionCalls.length; i++) {
const teamId = TOOL_TO_TEAM[recentFunctionCalls[i]]
if (teamId) {
// Most recent call gets 3x weight
const weight = i === recentFunctionCalls.length - 1 ? 3 : 1
teamCounts[teamId] += weight
}
}
const entries = Object.entries(teamCounts) as [TeamId, number][]
entries.sort((a, b) => b[1] - a[1])
const [topTeam, topCount] = entries[0]
const totalWeight = entries.reduce((sum, [, c]) => sum + c, 0)
return {
teamId: topTeam,
confidence: totalWeight > 0 ? topCount / totalWeight : 0,
}
}
The 3x weight for the most recent call creates a strong recency bias: a single Secretary tool call immediately shifts the inferred team even if the previous three calls were all Sales (a 3-vs-3 tie, which resolves in Secretary's favor because the stable sort preserves the declaration order of teamCounts). The confidence score (ratio of top-team weight to total weight) serves as a signal to the UI layer — low confidence (below 0.5) indicates the user is switching between domains, while high confidence (above 0.8) indicates sustained single-domain interaction.
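inferTeam relies on a TOOL_TO_TEAM lookup that the excerpt does not define. A minimal sketch of how it could be derived from the rosters in Section 5.2 (the TEAM_TOOLS literal below is abbreviated and its exact form is an assumption), plus a worked call, follows:
type TeamId = "secretary" | "sales" | "document" | "dev"

// Abbreviated rosters: tool names from Section 5.2, grouping assumed
const TEAM_TOOLS: Record<TeamId, string[]> = {
  secretary: ["get_calendar", "create_calendar_event", "search_emails", "send_email"],
  sales: ["calculate_estimate", "create_invoice_draft", "get_deals", "draft_proposal"],
  document: ["create_report", "create_spreadsheet", "create_presentation"],
  dev: ["code_consult", "architecture_review", "debug_assist"],
}

// Invert the rosters into the flat lookup that inferTeam consumes
const TOOL_TO_TEAM: Record<string, TeamId> = Object.fromEntries(
  (Object.entries(TEAM_TOOLS) as [TeamId, string[]][]).flatMap(([team, tools]) =>
    tools.map((tool) => [tool, team] as [string, TeamId]),
  ),
)

// Worked example: three Sales calls followed by one Secretary call
inferTeam(["get_deals", "calculate_estimate", "draft_proposal", "get_calendar"])
// -> { teamId: "secretary", confidence: 0.5 }  (secretary 3 vs sales 3; tie broken by declaration order)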
5.4 Client Instructions Type System
The Action Router outputs a discriminated union of client instructions that the frontend interprets to drive UI state changes:
export type ClientInstruction =
| { type: "navigate"; path: string }
| { type: "notify"; message: string; severity: ActionSeverity }
| { type: "update_panel"; panel: string; data: Record<string, unknown> }
| { type: "open_modal"; modal: string; props: Record<string, unknown> }
| { type: "noop" }This design separates the voice system's semantic understanding ('the user wants to see the calendar') from the UI system's rendering logic ('navigate to /dashboard/calendar'). The voice pipeline produces abstract instructions; the frontend instruction executor maps them to concrete DOM operations.
6. Infinite Session Management via Rolling Summaries
6.1 The Context Window Problem
LLM context windows are finite. A voice session that runs for 30+ minutes can easily accumulate 50-100 conversation turns, exceeding the effective context window and causing the model to lose early conversation context or degrade in response quality. Simply truncating old messages discards potentially critical context — the user's initial instructions, established preferences, and unresolved topics.
6.2 Rolling Summary Architecture
MARIA OS implements a rolling summary mechanism that compresses old conversation history into a condensed summary while preserving recent messages verbatim. The algorithm triggers when the conversation exceeds a threshold length:
const SUMMARY_THRESHOLD = 16
const KEEP_RECENT = 6
if (conversationRef.current.length > SUMMARY_THRESHOLD) {
const toSummarize = conversationRef.current.slice(0, -KEEP_RECENT)
const recent = conversationRef.current.slice(-KEEP_RECENT)
try {
const res = await fetch("/api/voice/summarize", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
messages: toSummarize,
existingSummary: summaryRef.current,
}),
})
if (res.ok) {
const { summary } = await res.json()
summaryRef.current = summary
}
} catch { /* summarization failed — just trim */ }
// Reconstruct: summary context + recent messages
conversationRef.current = [
...(summaryRef.current
? [{ role: "user", text: `[Conversation summary: ${summaryRef.current}]` }]
: []),
...recent,
]
}
6.3 Summary Composition
A critical property of the rolling summary is composability: when the conversation grows again past the threshold, the new summary is generated from both the previous summary and the messages accumulated since. The existingSummary parameter ensures that information from the earliest parts of the conversation is not lost through repeated summarization — the summary accumulates rather than replacing.
The threshold of 16 messages (with 6 kept) was chosen to balance summarization frequency against API cost. At typical voice conversation rates (2-4 turns per minute), summarization triggers every 4-8 minutes — frequent enough to prevent context overflow but infrequent enough that the summarization API call does not perceptibly delay the conversation.
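Concretely, with SUMMARY_THRESHOLD = 16 and KEEP_RECENT = 6, the mechanism fires once the history reaches 17 messages: the oldest 11 are folded into the summary and the conversation is rebuilt as 7 messages (one synthetic summary message plus the 6 most recent), so roughly 10 further turns accumulate before the next summarization pass.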
7. Resilience Engineering: Heartbeat, Failsafe, and Recovery
7.1 Failure Modes in Voice Pipelines
Voice pipelines are uniquely fragile because they depend on multiple browser APIs that can fail silently. Unlike HTTP requests that return error codes, browser APIs like SpeechRecognition and AudioContext can enter invalid states without triggering error callbacks. We identified three critical failure modes in production:
- Stuck processing: The isProcessingRef flag becomes permanently true due to an unhandled exception in the Gemini stream or TTS chain, preventing all future user input from being processed
- AudioContext suspension: iOS Safari suspends AudioContext when the page loses focus (tab switch, lock screen), and does not automatically resume it when focus returns. The voice level meter reads zero, and TTS audio plays but is inaudible
- SpeechRecognition death: SpeechRecognition fires onend but the restart attempt inside onend throws silently, leaving recognition permanently stopped. The user speaks but nothing happens
7.2 Heartbeat Monitor
The heartbeat runs on a 10-second interval and directly addresses the first two failure modes (stuck processing and AudioContext suspension); SpeechRecognition death is handled separately by the restart strategy in Section 7.3:
// Heartbeat keepalive (10s interval)
heartbeatRef.current = setInterval(() => {
if (statusRef.current !== "connected") return
// Force-clear processing flag if stuck for >30s
if (isProcessingRef.current &&
Date.now() - processingStartRef.current > 30_000) {
isProcessingRef.current = false
resumeRecognition()
}
// Resume suspended AudioContext
if (audioContextRef.current?.state === "suspended") {
audioContextRef.current.resume().catch(() => {})
}
}, 10_000)
The 30-second stuck-processing threshold was derived from production data: the longest legitimate processing cycle (a complex multi-tool action with several TTS sentences) completed in under 25 seconds. The 30-second threshold provides a 5-second safety margin while ensuring that truly stuck sessions recover within one heartbeat cycle after the threshold.
7.3 SpeechRecognition Restart Strategy
The SpeechRecognition onend handler implements a dual-path restart strategy depending on whether the system is currently processing:
recognition.onend = () => {
if (statusRef.current !== "connected") return
if (!isProcessingRef.current) {
// Not processing — restart immediately
const delay = iosDevice ? 100 : 0
setTimeout(() => {
if (statusRef.current === "connected") {
try { recognition.start() } catch { /* already running */ }
}
}, delay)
} else {
// Processing (TTS playing) — set a failsafe timer.
// If resumeRecognition doesn't restart within 30s, force restart.
setTimeout(() => {
if (statusRef.current === "connected") {
isProcessingRef.current = false
try { recognition.start() } catch { /* already running */ }
}
}, 30_000)
}
}
The iOS-specific 100ms delay before restart addresses a WebKit bug where calling recognition.start() synchronously inside onend throws an InvalidStateError. The delay ensures the internal state machine has fully transitioned to 'stopped' before the restart attempt.
7.4 Abort-Controlled Cleanup
Every Gemini request and TTS chain is governed by an AbortController that enables instantaneous cleanup on disconnect or barge-in:
const interruptPlayback = () => {
// Abort in-flight Gemini stream + TTS fetches
if (abortRef.current) {
abortRef.current.abort()
abortRef.current = null
}
// Stop audio immediately
if (elAudioRef.current) {
elAudioRef.current.pause()
elAudioRef.current.currentTime = 0
}
// Reset TTS chain
ttsChainRef.current = Promise.resolve()
}
The AbortController signal is threaded through three layers: the fetch() call to Gemini (cancels the HTTP stream), each fetch() call to ElevenLabs (cancels TTS synthesis), and each playElevenLabsTTS promise (stops audio playback). This ensures that aborting the controller immediately halts all in-flight network requests and stops all audio output within a single event loop tick.
8. Cross-Platform Compatibility: The In-App Browser Challenge
8.1 The In-App Browser Problem
Modern mobile users frequently open links from messaging apps (LINE, Discord, Instagram) that render web pages in embedded WebView containers rather than the native browser. These in-app browsers provide a subset of web APIs with inconsistent behavior, particularly for audio and speech APIs. MARIA OS must detect these environments and provide graceful degradation rather than cryptic failures.
8.2 Detection Architecture
The detection system uses a two-tier approach: User-Agent pattern matching for named in-app browsers, and heuristic detection for iOS WKWebView containers that do not identify themselves:
const IN_APP_PATTERNS: [RegExp, InAppBrowser][] = [
[/\bLine\//i, "line"],
[/\bLIFF\b/i, "line"],
[/\bDiscord\b/i, "discord"],
[/\bInstagram\b/i, "instagram"],
[/\bFBAN\b/i, "facebook"],
[/\bFBAV\b/i, "facebook"],
[/\bFB_IAB\b/i, "facebook"],
[/\bMessenger\b/i, "messenger"],
[/\bTwitter\b/i, "twitter"],
[/\bMicroMessenger\b/i, "wechat"],
[/\bTikTok\b/i, "tiktok"],
[/\bSnapchat\b/i, "snapchat"],
]
For iOS WKWebView detection, the system checks for the absence of window.safari — a property present in real Safari but missing in WKWebView containers used by LINE, Discord, and other apps that do not add identifying strings to the User-Agent:
function detectIOSInAppWebView(): boolean {
if (typeof navigator === "undefined" || typeof window === "undefined") return false
const ua = navigator.userAgent
const isiOS = /iPad|iPhone|iPod/.test(ua) ||
(navigator.platform === "MacIntel" && navigator.maxTouchPoints > 1)
if (!isiOS) return false
const w = window as any
if (typeof w.safari === "undefined") return true
return false
}
8.3 iOS-Specific Audio Handling
iOS Safari (and iOS WebView) imposes two constraints that require special handling:
- AudioContext sample rate: iOS hardware runs at 44.1kHz or 48kHz. Forcing a different sample rate (e.g., 16kHz for voice) causes silent or distorted output. MARIA OS creates AudioContext without a sampleRate option, letting the device choose its native rate
- Audio autoplay policy: iOS requires audio playback to be initiated by a direct user gesture. MARIA OS creates a persistent <audio> element during the connect() call (which is a button click handler) and unlocks it by playing a silent base64-encoded MP3 snippet
// iOS audio unlock — silent MP3 played during user gesture
if (iosDevice) {
if (!elAudioRef.current) {
elAudioRef.current = new Audio()
}
const audio = elAudioRef.current
audio.src = "data:audio/mpeg;base64,SUQzBAAA..." // silent MP3
audio.volume = 0.01
try {
await Promise.race([
audio.play(),
new Promise((r) => setTimeout(r, 1000)), // 1s max for unlock
])
audio.pause()
} catch { /* non-fatal */ }
audio.volume = 1
}
The Promise.race with a 1-second timeout prevents the unlock from blocking the connection flow on devices where play() hangs indefinitely. The low volume (0.01) ensures the silent MP3 is truly inaudible even if the device speaker is at maximum volume.
8.4 SpeechRecognition Continuous Mode
On desktop browsers and Android, SpeechRecognition.continuous = true keeps the recognition session active across multiple utterances. On iOS, continuous mode is unreliable — recognition silently stops after 10-15 seconds. MARIA OS sets continuous = false on iOS and relies on the onend handler to restart recognition after each utterance:
const recognition = new SR()
recognition.continuous = !iosDevice
recognition.interimResults = false
recognition.lang = locale === "ja" ? "ja-JP" : "en-US"
8.5 Platform Support Matrix
| Platform | SpeechRecognition | AudioContext | TTS Playback | Mic Access | Status |
|---|---|---|---|---|---|
| Chrome (Desktop) | continuous | native rate | HTMLAudioElement | standard | Full support |
| Safari (macOS) | webkit prefix | native rate | HTMLAudioElement | standard | Full support |
| Safari (iOS) | webkit, non-continuous | native rate, must not force | silent MP3 unlock | standard | Full support (with workarounds) |
| Chrome (Android) | continuous | native rate | HTMLAudioElement | standard | Full support |
| LINE in-app | unavailable | varies | blocked | timeout | Redirect to native browser |
| Discord in-app | unavailable | varies | blocked | timeout | Redirect to native browser |
| Instagram in-app | partial | varies | blocked | blocked | Redirect to native browser |
9. Production Metrics and Evaluation
9.1 TTS Configuration
MARIA OS uses the ElevenLabs eleven_turbo_v2_5 model, which is optimized for low-latency streaming. The voice parameters were tuned for natural conversational speech:
const res = await fetch(
`https://api.elevenlabs.io/v1/text-to-speech/${voiceId}/stream`,
{
method: "POST",
headers: {
"xi-api-key": ELEVENLABS_API_KEY,
"Content-Type": "application/json",
},
body: JSON.stringify({
text,
model_id: "eleven_turbo_v2_5",
voice_settings: {
stability: 0.5,
similarity_boost: 0.75,
},
output_format: "mp3_22050_32",
}),
},
)
The parameter choices reflect deliberate tradeoffs: stability 0.5 allows expressive variation in pitch and rhythm (higher values produce more monotone output), similarity_boost 0.75 maintains voice identity while allowing natural variation, and mp3_22050_32 (22.05kHz, 32kbps) balances audio quality against download size. At 32kbps (roughly 4KB per second of audio), a single sentence (15-20 words, about 2-3 seconds of speech) produces approximately 8-12KB of audio data, enabling rapid transfer even on mobile networks.
9.2 Latency Breakdown
Measured across 500 representative interactions:
| Metric | P50 | P90 | P99 |
|---|---|---|---|
| Speech debounce | 1200ms | 1200ms | 1200ms |
| Gemini TTFS (first sentence) | 280ms | 420ms | 680ms |
| ElevenLabs synthesis | 190ms | 310ms | 520ms |
| Audio decode + play start | 15ms | 25ms | 40ms |
| **Total first-sentence latency** | **1685ms** | **1955ms** | **2440ms** |
| Perceived latency (excl. debounce) | **485ms** | **755ms** | **1240ms** |
9.3 Reliability Metrics
| Metric | Value | Measurement Period |
|---|---|---|
| Sessions without ordering violation | 2,400+ | 90-day production window |
| Mean session duration | 8.2 min | All sessions |
| Max session duration (rolling summary) | 47 min | Single session record |
| Heartbeat recovery events | 23 | Processing stuck >30s |
| AudioContext recovery events | 156 | iOS background/foreground |
| In-app browser redirects | 412 | Users guided to native browser |
9.4 Voice Level Metering
The voice level meter provides real-time visual feedback through a 3D OGL orb that reacts to speech amplitude. The metering uses an AnalyserNode to compute RMS energy from the microphone input:
const measureLevel = useCallback(() => {
if (!analyserRef.current || !dataArrayRef.current) return
analyserRef.current.getByteFrequencyData(dataArrayRef.current)
let sum = 0
for (let i = 0; i < dataArrayRef.current.length; i++) {
const v = dataArrayRef.current[i] / 255
sum += v * v
}
const rms = Math.sqrt(sum / dataArrayRef.current.length)
setVoiceLevel(Math.min(rms * 3.0, 1))
rafRef.current = requestAnimationFrame(measureLevel)
}, [])
The 3.0x multiplier on the RMS value compensates for the typical low energy of speech signals relative to the full dynamic range. Without amplification, conversational speech registers at only 0.1-0.2 on the 0-1 scale, producing barely visible orb animation. The Math.min(..., 1) clamp prevents visual artifacts from loud transients (coughing, desk impacts).
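measureLevel assumes that analyserRef and dataArrayRef were wired up when the microphone stream was acquired. That setup is not reproduced in the paper; a minimal sketch, assuming a micStream obtained from getUserMedia and following the hook's ref naming, would be:
// Hypothetical analyser wiring that feeds measureLevel (ref names follow the hook's convention)
const audioContext = new AudioContext()                // no sampleRate option (see Section 8.3)
const source = audioContext.createMediaStreamSource(micStream)
const analyser = audioContext.createAnalyser()
analyser.fftSize = 256                                 // 128 frequency bins per snapshot
source.connect(analyser)                               // metering tap only; never routed to the speakers

audioContextRef.current = audioContext
analyserRef.current = analyser
dataArrayRef.current = new Uint8Array(analyser.frequencyBinCount)
rafRef.current = requestAnimationFrame(measureLevel)   // start the per-frame RMS loop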
10. Future Extensions: Toward Recursive Voice Governance
10.1 Connection State Machine
The voice session lifecycle follows a four-state machine.
The connected state is the only state where the heartbeat runs, SpeechRecognition is active, and TTS playback is permitted. The error state preserves the connection metadata for diagnostic purposes and allows retry via a subsequent connect() call.
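The paper names only the connected and error states explicitly; a plausible shape for the full status union is sketched below, where the idle and connecting state names are assumptions rather than values confirmed by the source:
// Hypothetical status union for the session lifecycle; "idle" and "connecting" are assumed names
type VoiceSessionStatus = "idle" | "connecting" | "connected" | "error"
// Assumed transitions: connect() drives idle -> connecting -> connected,
// failures land in error, and a later connect() call retries from error.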
10.2 Toward Responsibility-Gated Voice Actions
The current Action Router dispatches tools without responsibility gates — every registered action is immediately executable. A natural extension is to integrate the MARIA OS decision pipeline into the voice layer, requiring approval gates for high-impact voice actions. Consider a user saying 'Send the revised proposal to Tanaka-san with the 15% discount.' The draft_proposal tool currently executes immediately; with responsibility gating, the system would detect the financial impact (discount authorization), create a decision record in the pipeline, and respond: 'I have drafted the proposal with 15% discount. This requires manager approval before sending. Shall I route it for approval?'
10.3 Multi-Modal Voice Governance
Future iterations will extend the voice pipeline to support multi-modal evidence collection during voice interactions. When a responsibility gate is triggered, the system could prompt the user to provide verbal justification that is transcribed, timestamped, and attached to the decision record as an evidence bundle. This creates a fully auditable chain from voice command through approval gate to execution, with the user's own words serving as the authorization evidence.
10.4 Recursive Self-Improvement of Voice Pipelines
The rolling summary mechanism provides a foundation for recursive voice intelligence: the system could analyze summary patterns to detect recurring user intents, preemptively load relevant tools, and suggest workflow optimizations. A user who consistently requests calendar checks followed by email drafts could be offered a combined 'schedule and notify' voice workflow, dynamically composed from existing tools without engineering intervention.
References
- Miller, G.A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81-97.
- Levinson, S.C. (2016). Turn-taking in human communication: Origins and implications for language processing. Trends in Cognitive Sciences, 20(1), 6-14.
- Stivers, T. et al. (2009). Universals and cultural variation in turn-taking in conversation. PNAS, 106(26), 10587-10592.
- Web Speech API Specification, W3C Community Group Report, 2024.
- ElevenLabs Text-to-Speech API Documentation, v1, 2025.
- Google Gemini API Reference: Streaming and Function Calling, 2025.