Engineering · February 15, 2026 · 32 min read

Sentence-Level Streaming VUI Architecture: From Cognitive Theory to Production Implementation in MARIA OS

How sentence-boundary detection, sequential TTS chaining, and rolling conversation summaries create a natural-feeling voice interface with long-session stability

Voice user interfaces face a core tradeoff: stream tokens immediately for low latency, or wait for larger semantic units to improve naturalness. MARIA OS resolves this with sentence-level streaming: detect sentence boundaries from Gemini token streams in real time, queue each sentence for sequential ElevenLabs TTS playback, and coordinate full-duplex interaction through barge-in control, speech debouncing, and heartbeat-based recovery. This paper presents the cognitive basis for sentence-level granularity, the production `useGeminiLive` architecture, a 29-tool action router across 4 teams with confidence-weighted team inference, and the rolling-summary mechanism for long voice sessions. In 2,400+ production sessions, the system achieved sub-800ms first-sentence latency with zero sentence-ordering violations, including compatibility handling for 9 in-app browser environments.
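The pipeline described above (detect sentence boundaries in a token stream, then play sentences strictly in order) can be sketched as follows. This is a minimal illustration, not the `useGeminiLive` implementation: `SentenceBuffer`, `SequentialSpeaker`, and the `speak` callback are hypothetical names, the boundary rule is a simple punctuation heuristic, and no real Gemini or ElevenLabs API calls are shown.

```typescript
// Hypothetical helper (an assumption, not MARIA OS code): buffers streamed
// text chunks and emits complete sentences as soon as a boundary is seen.
class SentenceBuffer {
  private pending = "";

  // ASCII terminators must be followed by whitespace, so "3.14" or a chunk
  // that ends in "." mid-stream is not split early. Japanese terminators
  // count as boundaries on their own.
  private static readonly BOUNDARY = /[.!?]+["')\]]*\s+|[。！？]+/;

  // Append a chunk and return every sentence completed so far.
  push(chunk: string): string[] {
    this.pending += chunk;
    const sentences: string[] = [];
    let match = SentenceBuffer.BOUNDARY.exec(this.pending);
    while (match !== null) {
      const end = match.index + match[0].length;
      const sentence = this.pending.slice(0, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      this.pending = this.pending.slice(end);
      match = SentenceBuffer.BOUNDARY.exec(this.pending);
    }
    return sentences;
  }

  // Call when the stream ends to release any trailing partial sentence.
  flush(): string[] {
    const rest = this.pending.trim();
    this.pending = "";
    return rest.length > 0 ? [rest] : [];
  }
}

// Hypothetical playback queue: each sentence starts only after the previous
// one has finished, which is what keeps sentence ordering intact.
class SequentialSpeaker {
  private tail: Promise<void> = Promise.resolve();
  private readonly speak: (sentence: string) => Promise<void>;

  // `speak` is supplied by the caller (for example, a function that requests
  // TTS audio and resolves when playback ends).
  constructor(speak: (sentence: string) => Promise<void>) {
    this.speak = speak;
  }

  enqueue(sentence: string): Promise<void> {
    const next = this.tail.then(() => this.speak(sentence));
    // A failed sentence must not block the ones queued after it.
    this.tail = next.catch(() => {});
    return next;
  }
}
```

In use, each streamed chunk would be passed to `push`, and every returned sentence handed to `enqueue`; `flush` covers a final sentence that arrives without trailing whitespace. Chaining on a single promise tail is one simple way to guarantee ordering; barge-in would additionally need a way to cancel the queue, which this sketch omits.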

voice-ui, streaming, TTS, speech-recognition, real-time, Gemini, ElevenLabs, action-router, MARIA-OS, cognitive-science
Intelligence · February 15, 2026 · 35 min read

Cognitive-Science Foundations of Voice User Interface Design: An Attention Resource Allocation Model for Multimodal Dialogue

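Integrating Wickens' multiple resource theory, Baddeley's working memory model, and information theory to formalize VUI design principles and validate them against the MARIA VOICE implementation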

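Voice user interface (VUI) design tends to rely on rules of thumb that do not adequately account for the characteristics of auditory cognitive processing. This paper integrates Wickens' multiple resource theory, Baddeley's working memory model, and Shannon's information theory to present a mathematical model of attention resource allocation in multimodal dialogue. It shows the cognitive optimality of sentence-level streaming TTS, the theoretical basis for the 1.2-second debounce threshold, and the conditions under which barge-in suppression avoids resource contention, giving a theoretical account of the design decisions in MARIA VOICE.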

voice-ui, cognitive-science, information-theory, working-memory, attention-resources, multimodal-interaction, speech-processing, maria-voice, formal-methods, human-computer-interaction