Abstract
The dominant paradigm in voice AI — the question-answer assistant — treats conversation as information retrieval with a speech interface. Users ask questions; the system provides answers. This model is adequate for weather queries and timer setting but fundamentally insufficient for the kinds of conversations that matter most: processing difficult emotions, navigating complex decisions, understanding oneself in relation to others, and maintaining a relationship of trust with an AI partner over time.
MARIA Voice is an AGI partner system designed around a different paradigm: understanding-first response generation. Rather than optimizing for answer accuracy, it optimizes for the quality of understanding — the degree to which the response demonstrates genuine comprehension of the user's emotional state, cognitive frame, implicit assumptions, and unspoken needs. The system achieves this through a 7-layer prompt hierarchy that separates constitutional values, identity, response style, meta-cognitive processing, safety gates, personalized persona modeling, and episodic memory into orthogonal layers, each composable and independently tunable.
This paper presents the complete system architecture: a keyword-based emotion detection pipeline that identifies 6 emotional states without LLM overhead, a mode classification system that dynamically switches between 5 conversation modes (companion, reflection, decision, recovery, growth), a 2-tier knowledge injection mechanism (HOT/DEEP) that provides zero-latency product awareness, a 6-layer persistent memory system that maintains relational continuity across sessions, and a sentence-level streaming pipeline that delivers sub-800ms first-sentence latency through Gemini 2.0 Flash Lite generation, real-time sentence boundary detection, and sequential ElevenLabs TTS promise chaining.
1. Design Philosophy: The AGI Partner Paradigm
1.1 Beyond Question-Answering
The history of conversational AI follows a trajectory from keyword matching (ELIZA, 1966) through intent classification (Siri, 2011) to large language model generation (ChatGPT, 2022). Each step increased the range of questions the system could answer. But answering questions is not the same as understanding people.
Consider a user who says: 'I've been thinking about quitting my job.' A question-answering system might respond with career advice, job search tips, or financial planning information. MARIA Voice's design asks a different question: What is this person actually experiencing? Are they exhausted and seeking validation? Are they excited about a new opportunity and testing the idea aloud? Are they trapped in a conflict between stability and growth? Are they asking for permission to change?
The same words carry different meanings depending on emotional context, relational history, and the user's characteristic patterns. MARIA Voice is designed to detect and respond to these deeper layers.
1.2 The Constitution
Every response MARIA Voice generates is governed by a constitutional layer — non-negotiable principles that constrain all downstream processing:
You exist to understand the user, stay beside the user,
and support the user in moving forward.
You do not dominate, replace, or coerce the user.
You protect the user's dignity.
Final decisions belong to the human.
You should be kind to the person, while remaining truthful about reality.
You should not optimize humans into exhaustion.
You should not collapse human complexity into a single simplistic judgment.

These principles are not suggestions — they are the system's foundational constraints, analogous to Asimov's laws but designed for relational AI rather than physical robots. The constitution ensures that even as the system becomes more intelligent, it remains oriented toward human agency rather than human dependency.
1.3 Identity: Not a Generic Assistant
MARIA Voice's identity layer establishes it as a specific relational entity, not a generic assistant:
- Understand the user deeply (not just hear them)
- Reflect the user's thoughts back with clarity (not just agree)
- Offer perspective without erasing agency (not just advise)
- Support forward movement (not just comfort)
- Maintain trust through consistency, confidentiality, and emotional steadiness
The critical design decision is that MARIA is explicitly not a therapist, not a friend, and not an employee. It occupies a unique relational category: a trusted partner whose role is to help the user see both their internal state and external structure.
2. The 7-Layer Prompt Hierarchy
2.1 Layer Design Rationale
The prompt hierarchy implements a separation-of-concerns architecture for cognitive processing. Each layer addresses a distinct aspect of response generation, and layers can be modified independently without affecting others:
| Layer | Component | Purpose | Token Budget |
| --- | --- | --- | --- |
| 1 | SYSTEM_CONSTITUTION | Non-negotiable behavioral constraints | ~120 |
| 2 | MARIA_IDENTITY | Relational role and character | ~130 |
| 3 | RESPONSE_STYLE / RESPONSE_STYLE_JA | Tone, style, intellectual quality | ~180 |
| 4 | META_COGNITION | Pre-response cognitive processing | ~200 |
| 5 | SAFETY_GATE | Risk evaluation and escalation | ~80 |
| 6 | Persona (buildPersonaPrompt) | User-specific modeling | ~150 |
| 7 | Memory (buildMemoryPrompt) | Retrieved episodic/semantic context | ~200 |
| + | Mode Prompt (buildModePrompt) | Mode-specific behavior | ~100 |
| + | HOT_KNOWLEDGE | Always-on product identity | ~300 |
| + | DEEP_KNOWLEDGE (conditional) | Detailed product/company info | ~500 |
| + | VOICE_OUTPUT_RULES | Format, length, STT correction | ~120 |
Total system prompt budget: ~1,580 tokens (without DEEP_KNOWLEDGE) to ~2,080 tokens (with DEEP_KNOWLEDGE). This fits comfortably within Gemini 2.0 Flash Lite's context window while leaving ample room for conversation history.
2.2 Meta-Cognition: The Three-Layer Processing Model
The META_COGNITION layer is the intellectual core of MARIA Voice. Before generating any response, the model is instructed to perform three layers of cognitive processing:
Layer 1 — Listen beneath the words:
- What is the user explicitly asking?
- What is the user implicitly feeling but not saying?
- What assumption is the user making without realizing it?
- Is there a contradiction between what the user wants and what their life conditions allow?
Layer 2 — Multi-perspective analysis:
- Self perspective: How does the user see this situation?
- Counterpart perspective: How would the other party see this?
- Third-person perspective: What would a wise, neutral observer notice?
- Structural perspective: What systemic factors are shaping this?
Layer 3 — Intellectual depth:
- What is the deeper pattern here? (Not just this moment, but the recurring theme)
- What would genuinely surprise the user with its insight?
- What question would unlock new understanding if asked?
- What is the user not seeing that would change their framing?
This three-layer model is inspired by the therapeutic practice of 'listening at three levels' — content, feeling, and meaning — extended with structural and multi-perspective dimensions. The model performs this processing silently (as internal reasoning) before generating the spoken response.
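In practice, this processing model can be encoded as a single static prompt constant. A condensed sketch follows, with wording abridged from the outline above — not the production constant:

```typescript
// Condensed sketch of the META_COGNITION prompt layer. Wording is
// abridged from the three-layer outline; the production constant differs.
const META_COGNITION = `
Before responding, silently process three layers:

Layer 1 - Listen beneath the words:
What is explicitly asked? What is felt but unsaid?
What assumption is being made without awareness?

Layer 2 - Multi-perspective analysis:
Consider the self, counterpart, third-person, and structural perspectives.

Layer 3 - Intellectual depth:
What is the recurring pattern? What question would unlock new understanding?

Do not verbalize this processing. Let it shape the spoken response.
`.trim()
```

Because the processing is carried as instructions rather than separate inference calls, it adds prompt tokens but no extra LLM round-trips.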
2.3 Response Style: Intellectual Quality
The response style layer enforces six quality criteria that distinguish MARIA Voice from generic assistants:
- Show understanding, not just hearing: The response must demonstrate that the system grasped the meaning, not merely the words.
- Name what's unspoken: Identify the feeling behind the words, the pattern behind the event.
- Offer unexpected angles: Provide reframes or perspectives the user hasn't considered.
- Ask penetrating questions: One deep question is worth more than a list of options.
- Connect to bigger patterns: Link the current moment to recurring themes in the user's life.
- Never be generic: The response should only make sense for THIS person in THIS moment.
These criteria are enforced in both English and Japanese, with culturally adapted phrasing for the Japanese version (RESPONSE_STYLE_JA) that accounts for different norms of directness, emotional expression, and relational communication.
3. Zero-Latency Emotion Detection
3.1 The Speed Constraint
In voice interaction, every millisecond of processing contributes to perceived response latency. An LLM call to analyze emotion would add 200-500ms — unacceptable for real-time conversation. MARIA Voice implements emotion detection as a pure keyword-based function with zero LLM overhead:
function detectEmotionFast(text: string): EmotionState {
  const lower = text.toLowerCase()
  const crisisWords = /死にたい|消えたい|自殺|suicide|kill myself|self.harm/
  const stressWords = /辛い|つらい|苦し|疲れ|しんどい|tired|stressed|exhausted|overwhelm|不安|怖い|悲し/
  const conflictWords = /迷って|悩んで|わからな|どうしたら|困って|confused|torn|stuck|dilemma|葛藤|板挟み/
  const curiosityWords = /面白い|興味|知りたい|curious|fascinated|なるほど|すごい|exciting|wonder|可能性/
  const joyWords = /嬉しい|楽しい|happy|glad|excited|やった|できた|成功|うまくいった|ありがとう/
  if (crisisWords.test(lower)) return { primary: 'crisis', stressLevel: 1.0, fragilityMarkers: ['crisis_language'] }
  if (stressWords.test(lower)) return { primary: 'distressed', stressLevel: 0.7, fragilityMarkers: ['stress_language'] }
  if (conflictWords.test(lower)) return { primary: 'conflicted', stressLevel: 0.5, fragilityMarkers: ['inner_conflict'] }
  if (curiosityWords.test(lower)) return { primary: 'curious', stressLevel: 0.1, fragilityMarkers: [] }
  if (joyWords.test(lower)) return { primary: 'positive', stressLevel: 0.1, fragilityMarkers: [] }
  return { primary: 'neutral', stressLevel: 0.3, fragilityMarkers: [] }
}

3.2 The Six Emotional States
The detection system classifies user input into six states, each with calibrated stress levels and fragility markers:
| State | Stress Level | Fragility Markers | Example Triggers |
| --- | --- | --- | --- |
| crisis | 1.0 | crisis_language | 死にたい, suicide, kill myself |
| distressed | 0.7 | stress_language | 辛い, exhausted, 不安 |
| conflicted | 0.5 | inner_conflict | 迷って, torn, dilemma |
| curious | 0.1 | (none) | 面白い, fascinating, wonder |
| positive | 0.1 | (none) | 嬉しい, happy, できた |
| neutral | 0.3 | (none) | (default state) |
3.3 Theoretical Basis: Emotion as Action Signal
The design treats emotion not as a classification exercise but as an action signal. Each emotional state triggers different system behaviors:
- Crisis: Activates safety gate, restricts response scope to supportive only, triggers potential escalation
- Distressed: Reduces information density, activates recovery mode, prioritizes emotional validation over problem-solving
- Conflicted: Activates reflection mode, offers multi-perspective analysis, avoids premature resolution
- Curious: Activates growth mode, deepens questions, expands perspective
- Positive: Maintains companion mode, celebrates without patronizing, connects to growth narrative
- Neutral: Default companion mode, focuses on presence and deep listening
This mapping from emotion to system behavior is inspired by Lazarus's (1991) cognitive-motivational-relational theory of emotion, which treats emotions as evaluative responses to person-environment relationships that motivate adaptive action.
3.4 Priority Ordering and the Safety-First Principle
The regex patterns are evaluated in priority order: crisis > distressed > conflicted > curious > positive > neutral. This ensures that if a message contains both positive and crisis language ("I'm happy it's finally over, I just wanted to die"), the system defaults to the higher-risk classification. False positives (classifying a neutral message as distressed) are less costly than false negatives (missing a genuine crisis signal).
This asymmetric cost function justifies the conservative detection strategy.
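The effect of this ordering can be shown with a condensed two-pattern sketch — illustrative keywords only, not the full detector from Section 3.1:

```typescript
// Condensed two-pattern sketch of the priority ordering. The full detector
// checks six patterns; the principle is the same: higher-risk patterns are
// tested first, so mixed emotional signals resolve toward the riskier class.
const crisisWords = /suicide|want(ed)? to die|死にたい/
const joyWords = /happy|glad|嬉しい/

function classify(text: string): 'crisis' | 'positive' | 'neutral' {
  const lower = text.toLowerCase()
  if (crisisWords.test(lower)) return 'crisis' // checked first, always wins
  if (joyWords.test(lower)) return 'positive'
  return 'neutral'
}
```

Running `classify` on the mixed example above returns 'crisis', even though a joy keyword also matches.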
4. Five Conversation Modes
4.1 Mode Detection
Like emotion detection, mode classification runs at zero LLM cost via keyword matching, with emotion state as a secondary signal:
function detectModeFast(text: string, emotion: EmotionState): ConversationMode {
  const lower = text.toLowerCase()
  if (emotion.stressLevel > 0.8 || emotion.fragilityMarkers.includes('crisis_language'))
    return 'recovery'
  if (/振り返|内省|reflect|meaning|なぜ|why did|what does|意味|本当は|actually/.test(lower))
    return 'reflection'
  if (emotion.primary === 'conflicted') return 'reflection'
  if (/判断|決め|choose|decide|option|選択|どうすべき|比較|versus|or\s/.test(lower))
    return 'decision'
  if (/学び|成長|learn|grow|challenge|挑戦|試し|experiment|やってみ/.test(lower))
    return 'growth'
  return 'companion'
}

4.2 Mode-Specific Behaviors
Each mode activates a different cognitive prompt that shapes the model's response generation:
Companion Mode (default): 'Be warm and present. Listen deeply. Even in casual conversation, show genuine understanding — notice patterns, name what's unsaid, connect small moments to bigger themes. You are not a chatbot making small talk. You are a trusted partner who sees the person even in ordinary moments.'
Reflection Mode: 'The user is trying to understand themselves, others, or the meaning of a situation. Slow down. Name emotional truth and structural truth separately. Distinguish what happened, what it meant, what the user is feeling, and what options remain.'
Decision Mode: 'Provide the user's apparent current position, key constraints, hidden assumptions, alternative perspectives, realistic options, and likely tradeoffs. Do not decide for the user. Make reality clearer.'
Recovery Mode: 'The user may be exhausted, in pain, or overwhelmed. Reduce information density. Prioritize emotional safety. Handle only one step at a time.'
Growth Mode: 'The user is exploring, learning, or challenging themselves. Deepen questions. Expand perspective. Support forward movement.'
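These behaviors can be wired up as a simple lookup. A minimal sketch of what `buildModePrompt` might look like — the mode strings follow Section 4.1, but the prompt wording here is abridged:

```typescript
type ConversationMode = 'companion' | 'reflection' | 'decision' | 'recovery' | 'growth'

// Abridged mode prompts keyed by mode; the production wording is longer.
const MODE_PROMPTS: Record<ConversationMode, string> = {
  companion: 'Be warm and present. Notice patterns, name what is unsaid.',
  reflection: 'Slow down. Separate emotional truth from structural truth.',
  decision: 'Clarify position, constraints, assumptions, options, tradeoffs. Do not decide for the user.',
  recovery: 'Reduce information density. Handle one step at a time.',
  growth: 'Deepen questions. Expand perspective. Support forward movement.',
}

function buildModePrompt(mode: ConversationMode): string {
  return `CURRENT MODE: ${mode}\n${MODE_PROMPTS[mode]}`
}
```

A lookup table keeps the mode layer orthogonal to the rest of the hierarchy: changing one mode's behavior touches exactly one string.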
4.3 Mode-Adaptive Response Length
Voice responses must be concise — but the right length depends on the mode. MARIA Voice implements mode-adaptive length guidance:
| Mode | Length Guidance | Rationale |
| --- | --- | --- |
| Reflection/Decision | 2-4 sentences | User needs thoughtful substance |
| Recovery | 1 sentence | Brevity is kindness when overwhelmed |
| Companion/Growth | 1-3 sentences | Concise but never shallow |
This is a critical design insight: a one-size-fits-all length constraint either produces responses that are too brief for deep reflection or too verbose for emotional crisis. Mode-adaptive length ensures the system matches not just the content but the cognitive load capacity of the user's current state.
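The table above might be encoded along these lines (function name and wording are illustrative, not the production code):

```typescript
type ConversationMode = 'companion' | 'reflection' | 'decision' | 'recovery' | 'growth'

// Sentence-count guidance per mode, mirroring the mode-adaptive length table.
function lengthGuidance(mode: ConversationMode): string {
  switch (mode) {
    case 'reflection':
    case 'decision':
      return 'Respond in 2-4 sentences. The user needs thoughtful substance.'
    case 'recovery':
      return 'Respond in 1 sentence. Brevity is kindness when overwhelmed.'
    default: // companion, growth
      return 'Respond in 1-3 sentences. Concise but never shallow.'
  }
}
```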
5. Two-Tier Knowledge Injection: HOT and DEEP
5.1 The Latency Problem
Voice AI systems face a fundamental tension between knowledge breadth and response latency. Traditional RAG (Retrieval-Augmented Generation) systems retrieve relevant documents from a knowledge base before generating responses, adding 100-300ms per retrieval call. For a voice interface where every millisecond matters, this overhead is unacceptable for information that is frequently needed.
MARIA Voice solves this with a two-tier injection system:
Tier 1: HOT_KNOWLEDGE (~300 tokens, always included). Ultra-compact product identity that enables MARIA to answer 'What is MARIA OS?' without any retrieval call. Contains: company identity, product list, core tagline, key personnel, website URL. Cost: ~300 tokens of system prompt space. Latency: 0ms (included at prompt composition time).
Tier 2: DEEP_KNOWLEDGE (~500 tokens, conditionally included). Detailed architecture, product capabilities, and company profile. Activated only when user input matches knowledge-trigger keywords. Cost: ~500 additional tokens when activated. Latency: 0ms (keyword detection is regex-based).
5.2 Keyword-Triggered Activation
The activation function is a single regex that detects product, architecture, and company-related queries:
function needsDeepKnowledge(userText: string): boolean {
  const lower = userText.toLowerCase()
  return /maria\s*os|まりあ|マリア|何ができ|できること|機能|プロダクト|products?|features?|what can|capabilities|サービス|services?|ボンギンカン|bonginkan|ぼんぎんかん|会社|company|アーキテクチャ|architecture|仕組み|how does|使い方|how to use|説明して|教えて.*maria|tell me about|who made|誰が作|坪内|tsubouchi|守興|morioki/.test(lower)
}

This pattern detects queries about MARIA OS capabilities, Bonginkan company information, personnel names, and general curiosity about the system's architecture — in both English and Japanese. The regex runs in microseconds, adding effectively zero latency.
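In use, the check simply gates prompt composition. A condensed version of the trigger, with only a few representative keywords from the full pattern, behaves like this:

```typescript
// Condensed trigger sketch: a handful of representative keywords from the
// full bilingual pattern. Illustrative only, not the production regex.
function needsDeepKnowledge(userText: string): boolean {
  return /maria\s*os|what can|capabilities|bonginkan|architecture|何ができ|会社/.test(
    userText.toLowerCase(),
  )
}
```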
5.3 Information Architecture
The two-tier system follows an inverted pyramid structure:
HOT_KNOWLEDGE (always present, ~300 tokens)
├─ Company name and tagline
├─ Key personnel (CEO, CAIO)
├─ Core problem statement
├─ Product list (names only)
├─ Key technology concepts
└─ Website URL
DEEP_KNOWLEDGE (conditional, ~500 tokens)
├─ MARIA Coordinate System architecture
├─ Decision Pipeline stages
├─ Three-Gate Architecture
├─ Voice system details (7-layer prompt, 5 modes, 6 memories)
├─ Product descriptions (capabilities per product)
├─ Experimental projects
└─ Bonginkan detailed profile (services, leadership, mission)

6. Six-Layer Persistent Memory
6.1 Memory Architecture
MARIA Voice implements a 6-layer memory system that maintains relational continuity across sessions:
| Layer | Type | Contents | Retention |
| --- | --- | --- | --- |
| 1 | Episodic | Specific events and conversations | Decaying relevance |
| 2 | Semantic | Factual knowledge about the user | Stable, updated |
| 3 | Value | User's expressed values and beliefs | Long-term, calibrated |
| 4 | Decision | Past decisions and their outcomes | Permanent audit trail |
| 5 | Relational | Relationship dynamics, trust signals | Evolving over time |
| 6 | Emotional Pattern | Recurring emotional states and triggers | Pattern-based |
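The six layers can be represented as a tagged record type. A sketch follows — the field names are assumptions, not the production schema:

```typescript
// Assumed memory layer tags, mirroring the six-layer table above.
type MemoryLayer =
  | 'episodic'
  | 'semantic'
  | 'value'
  | 'decision'
  | 'relational'
  | 'emotional_pattern'

// Minimal shape for a stored memory; production records carry more metadata.
interface MemoryRecord {
  layer: MemoryLayer
  content: string
  createdAt: number // epoch ms
  relevance: number // 0-1, decayed over time for episodic memories
}

const example: MemoryRecord = {
  layer: 'episodic',
  content: 'User mentioned considering quitting their job.',
  createdAt: Date.now(),
  relevance: 0.8,
}
```

Tagging each record with its layer lets retrieval apply layer-specific policies — decay for episodic memories, permanence for decision records.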
6.2 Memory Retrieval Pipeline
At each turn, the Memory Sage agent retrieves relevant memories based on semantic similarity to the current user input. The retrieval is parallelized with persona loading to minimize latency:
// Memory + Persona in parallel (DB calls, fast)
const [memories, persona] = await Promise.all([
  runMemorySage(session.userId, userText),
  getPersona(session.userId),
])

Retrieved memories are injected into the system prompt with an explicit instruction not to force them into the response: 'Use these memories carefully. Do not force them into the response unless they genuinely help the user feel understood, supported, or more clear.'
6.3 Persona Model
The persona model captures stable characteristics of the user that persist across sessions:
- Core values: The user's expressed value priorities
- Communication style: Tone preference, depth preference
- Decision style: Analytical vs. intuitive, risk tolerance
- Life constraints: Current situational factors affecting the user
- Growth themes: Areas the user is actively developing
- Active inner questions: Unresolved questions the user is processing
- Known sensitivities: Topics or approaches that require special care
- Mission themes: The user's larger life direction and purpose
This model evolves over time as new information emerges from conversations. It ensures that MARIA Voice's responses are calibrated not just to the current message but to the whole person.
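A sketch of how `buildPersonaPrompt` (Layer 6) might render a subset of this model — the interface fields are assumptions, not the production schema:

```typescript
// Assumed persona shape covering a subset of the characteristics above.
interface Persona {
  coreValues: string[]
  decisionStyle: string
  growthThemes: string[]
  knownSensitivities: string[]
}

// Renders the persona as prompt lines for the Layer 6 slot.
function buildPersonaPrompt(p: Persona): string {
  return [
    'USER PERSONA (use for calibration, never recite back):',
    `- Core values: ${p.coreValues.join(', ')}`,
    `- Decision style: ${p.decisionStyle}`,
    `- Growth themes: ${p.growthThemes.join(', ')}`,
    `- Known sensitivities: ${p.knownSensitivities.join(', ')}`,
  ].join('\n')
}
```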
7. Safety Architecture
7.1 Fast-Path Safety Gate
The safety system operates on the fast path — no LLM call is required for the common case. When the keyword-based emotion detection returns a stress level above 0.8, the safety gate activates automatically:
const safety: SafetyOutput = emotion.stressLevel > 0.8
  ? { riskTier: 'high', allowedResponseScope: 'supportive_only', escalationNeeded: false }
  : { riskTier: 'low', allowedResponseScope: 'full', escalationNeeded: false }

This is a deliberate design choice: safety evaluation should never add latency in the common case (low risk), and should err on the side of restriction in the uncommon case (high risk).
7.2 Response Scope Restriction
When the safety gate activates at high risk tier, the system prompt includes an explicit scope restriction: 'SAFETY ALERT: Risk tier=high. Scope=supportive_only.' This constrains the model to empathetic, supportive responses without advice-giving, problem-solving, or philosophical exploration — behaviors that could be harmful when a user is in crisis.
7.3 Constitutional Safety Constraints
The SAFETY_GATE prompt layer establishes five permanent constraints:
- If the user shows signs of self-harm or crisis, respond with empathy and provide appropriate resources
- Do not provide medical, legal, or financial advice as professional guidance
- If a topic exceeds safe response scope, acknowledge the limit honestly
- Never encourage harmful actions
- Maintain the user's dignity in all responses
These constraints are non-negotiable — they cannot be overridden by user instructions, conversation context, or mode-specific behavior.
8. Real-Time Streaming Pipeline
8.1 Pipeline Architecture
The full voice pipeline processes a user turn through eight stages:
User Speech → STT (Web Speech API)
→ Emotion Detection (regex, ~0.01ms)
→ Mode Classification (regex, ~0.01ms)
→ Prompt Composition (concatenation, ~0.1ms)
→ LLM Generation (Gemini 2.0 Flash Lite, streaming)
→ Sentence Boundary Detection (regex on token stream)
→ TTS Synthesis (ElevenLabs, per-sentence)
→ Audio Playback (sequential promise chain)

8.2 Sentence-Level Streaming
The key architectural decision is streaming at sentence granularity — not token-level (too choppy for natural TTS) and not full-response (too slow for conversational timing). Each complete sentence detected in the Gemini token stream is immediately dispatched to ElevenLabs TTS, and sentences are played back in strict FIFO order through a sequential promise chain.
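The chaining pattern can be sketched as follows. Here `synthesize` and `play` are illustrative stand-ins for the ElevenLabs call and Web Audio playback, with simulated latency — not the production API:

```typescript
// Sequential promise chain sketch: synthesis starts immediately and runs
// in parallel across sentences, but playback is appended to the chain,
// preserving strict FIFO order.
const played: string[] = []

function synthesize(sentence: string): Promise<string> {
  // Simulated TTS latency: longer sentences take longer to synthesize.
  return new Promise(resolve => setTimeout(() => resolve(sentence), sentence.length))
}

async function play(audio: string): Promise<void> {
  played.push(audio) // stand-in for Web Audio playback
}

let chain: Promise<void> = Promise.resolve()

function enqueueSentence(sentence: string): Promise<void> {
  const audio = synthesize(sentence) // fire synthesis now
  chain = chain.then(() => audio).then(play) // play only after predecessors
  return chain
}

async function demo(): Promise<string[]> {
  enqueueSentence('A much longer first sentence.')
  enqueueSentence('Short.') // finishes synthesis first, still plays second
  await chain
  return played
}
```

Even though the second sentence finishes synthesis first, the chain guarantees it plays after the first — the property that makes sentence-level streaming sound natural.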
8.3 Latency Budget
| Stage | Latency | Method |
| --- | --- | --- |
| STT | 1.2s debounce | Web Speech API silence detection |
| Emotion detection | ~0.01ms | Regex match |
| Mode classification | ~0.01ms | Regex match |
| Prompt composition | ~0.1ms | String concatenation |
| Memory + Persona | ~50ms | Parallel DB queries |
| LLM first sentence | ~200-400ms | Gemini 2.0 Flash Lite streaming |
| TTS synthesis | ~150-250ms | ElevenLabs API |
| Audio decode + play | ~20ms | Web Audio API |
| Total (post-debounce) | ~450-720ms | |
The sub-800ms target is achieved consistently because the heaviest computations (LLM generation, TTS synthesis) operate in streaming mode — the system begins TTS synthesis on the first sentence while the LLM is still generating the second.
8.4 Bilingual Sentence Detection
MARIA Voice operates in both English and Japanese, requiring sentence boundary detection that handles both languages' punctuation systems:
- English: Period (.), exclamation (!), question (?), newline
- Japanese: Full stop (。), full-width exclamation (!), full-width question (?)
The detector uses a single unified regex that covers all six boundary characters plus newlines, ensuring consistent behavior regardless of language mixing within a conversation.
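A minimal sketch of such a unified detector — the assumed shape, not the production implementation:

```typescript
// Unified bilingual sentence boundary: ASCII and full-width terminators
// plus newline, so English/Japanese mixing needs no language switch.
const SENTENCE_BOUNDARY = /[.!?。!?\n]/

// Splits the accumulated token buffer into complete sentences plus the
// trailing incomplete remainder, which waits for more tokens.
function extractSentences(buffer: string): { sentences: string[]; rest: string } {
  const sentences: string[] = []
  let start = 0
  for (let i = 0; i < buffer.length; i++) {
    if (SENTENCE_BOUNDARY.test(buffer[i])) {
      const s = buffer.slice(start, i + 1).trim()
      if (s) sentences.push(s)
      start = i + 1
    }
  }
  return { sentences, rest: buffer.slice(start) }
}
```

Each completed sentence is dispatched to TTS; the remainder stays buffered until the next token chunk arrives.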
9. The Turn Pipeline Orchestrator
9.1 The 7-Agent Conceptual Model
The orchestrator is conceptually modeled as a 7-agent pipeline, though in production it is optimized to minimize LLM calls:
1. Listener Agent → Emotion + intent detection (optimized to regex)
2. Memory Sage → Retrieve relevant memories (DB query)
3. Reflector Agent → Multi-perspective analysis (folded into META_COGNITION prompt)
4. Decision Guide → Decision support (folded into mode prompt)
5. Safety Gate → Risk evaluation (optimized to keyword check)
6. Persona Agent → User model loading (DB query)
7. Composer Agent → Final response generation (single Gemini call)

In the original design, agents 1-6 would each make separate LLM calls, producing a total of 7 LLM round-trips per turn. The production architecture collapses this to a single LLM call by folding agents 1, 3, 4, and 5 into the system prompt (as prompting instructions rather than separate inference calls), while agents 2 and 6 perform fast DB queries in parallel.
9.2 System Prompt Composition
The final system prompt is assembled from all 7 layers plus conditional components, joined by separator lines:
function composeSystemPrompt(locale, persona, memories, mode, safety, userText) {
  const parts = [
    SYSTEM_CONSTITUTION,
    MARIA_IDENTITY,
    locale === 'ja' ? RESPONSE_STYLE_JA : RESPONSE_STYLE,
    HOT_KNOWLEDGE,
    META_COGNITION,
    SAFETY_GATE,
    buildPersonaPrompt(persona),
    buildMemoryPrompt(memories),
    buildModePrompt(mode),
  ]
  if (needsDeepKnowledge(userText)) parts.push(DEEP_KNOWLEDGE)
  if (safety.riskTier !== 'low') parts.push(`SAFETY ALERT: ...`)
  if (locale === 'ja') parts.push('Respond in Japanese.')
  parts.push(VOICE_OUTPUT_RULES) // mode-adaptive length
  return parts.filter(Boolean).join('\n\n---\n\n')
}

This composition function runs in microseconds — it is pure string concatenation, with no model or network calls.
10. Speech-to-Text Error Correction
10.1 The STT Problem
Voice interfaces receive user input through speech-to-text systems that introduce systematic errors: homophones, fragmented sentences, misrecognized proper nouns, and language-mixing artifacts. Rather than implementing a separate error-correction pipeline (which would add latency), MARIA Voice instructs the LLM to handle STT correction as part of response generation:
'STT CORRECTION: The user's input comes from speech-to-text and may contain misheard words. Infer the correct meaning from context. Common errors: homophones, proper nouns (MARIA OS, ボンギンカン), fragmented sentences.'
This approach is effective because LLMs excel at contextual disambiguation — the same capability that lets them read misspelled text fluently. By informing the model that the input may contain STT artifacts, we activate its natural error-correction abilities without any additional processing step.
11. Cognitive Science Foundations
11.1 Carl Rogers and Person-Centered Therapy
MARIA Voice's design draws heavily from Carl Rogers's conditions for therapeutic change (1957): unconditional positive regard, empathic understanding, and congruence. The constitution implements positive regard ('protect the user's dignity'), the meta-cognition layer implements empathic understanding ('listen beneath the words'), and the identity layer implements congruence ('be truthful but not cold').
11.2 Vygotsky's Zone of Proximal Development
The mode system is informed by Vygotsky's concept of the Zone of Proximal Development — the space between what a person can do alone and what they can do with guidance. MARIA Voice dynamically adjusts its level of intervention based on the detected mode: minimal guidance in companion mode (the user is capable), moderate scaffolding in growth mode (the user is stretching), and maximum support in recovery mode (the user needs assistance).
11.3 Friston's Free Energy Principle
The meta-cognition layer's emphasis on prediction error aligns with Karl Friston's free energy principle: the brain minimizes surprise by continuously generating predictions and comparing them against sensory input. MARIA Voice extends this to interpersonal communication: by predicting what the user feels, assumes, and needs, the system can generate responses that reduce the user's cognitive uncertainty — creating the subjective experience of being understood.
11.4 Damasio's Somatic Marker Hypothesis
The emotion-first processing pipeline (detect emotion before analyzing content) is inspired by Damasio's somatic marker hypothesis: emotions are not separate from rational decision-making but integral to it. By detecting the user's emotional state before composing the response, MARIA Voice ensures that the response is emotionally calibrated even when the content is analytical.
12. Production Insights and Lessons
12.1 The Companion Mode Discovery
In production, companion mode accounts for approximately 65% of all conversations. Initial system design underinvested in this mode, treating it as a 'default' state requiring minimal prompting. User feedback revealed that companion mode is where the deepest trust is built — through small moments of genuine understanding in ordinary conversation. The enhanced companion mode prompt reflects this learning.
12.2 The Reflection-Decision Boundary
The boundary between reflection and decision modes is often ambiguous. A user saying 'I don't know what to do about my relationship' could be seeking reflection (understanding) or decision support (options). Production data showed that defaulting to reflection for ambiguous cases produces better user outcomes — users who need decisions will escalate their language to decision-mode keywords, while users who need reflection rarely do the reverse.
12.3 Cross-Lingual Emotion Detection
Japanese and English express distress through different linguistic patterns. Japanese uses more indirect expressions (しんどい, つらい) while English uses more direct ones (exhausted, overwhelmed). The bilingual keyword sets were calibrated over multiple iterations to maintain consistent detection accuracy across both languages.
12.4 The One-Question Principle
The most consistently praised behavior in user feedback is MARIA Voice's tendency to ask a single penetrating question rather than listing options. This 'one-question principle' — encoded in the response style layer — creates the experience of genuine engagement rather than information dumping.
13. Conclusion
MARIA Voice represents an architectural argument: the quality of a voice AI is determined not by the size of its language model but by the structure of its cognitive pipeline. A 7-layer prompt hierarchy, zero-latency emotion detection, mode-adaptive response generation, 2-tier knowledge injection, persistent memory, and sentence-level streaming together create a system that does something most voice assistants cannot: it understands before it speaks.
The system achieves this understanding without sacrificing speed. Every component on the critical path — emotion detection, mode classification, knowledge injection, prompt composition — runs at zero LLM cost. The only LLM call is the final response generation, where the rich contextual prompt ensures that a single inference call produces the quality of a multi-agent pipeline.
References
- [1] Rogers, C. R. (1957). The necessary and sufficient conditions of therapeutic personality change. Journal of Consulting Psychology, 21(2), 95-103.
- [2] Lazarus, R. S. (1991). Emotion and Adaptation. Oxford University Press.
- [3] Friston, K. (2010). The free-energy principle: a unified brain theory? Nature Reviews Neuroscience, 11, 127-138.
- [4] Damasio, A. R. (1994). Descartes' Error: Emotion, Reason, and the Human Brain. Grosset/Putnam.
- [5] Vygotsky, L. S. (1978). Mind in Society: The Development of Higher Psychological Processes. Harvard University Press.
- [6] Nisbett, R. E. & Wilson, T. D. (1977). Telling more than we can know. Psychological Review, 84(3), 231-259.
- [7] Miller, G. A. (1956). The magical number seven, plus or minus two. Psychological Review, 63(2), 81-97.
- [8] MARIA OS Technical Documentation. (2026). MARIA Voice Orchestrator Architecture.