IntelligenceFebruary 15, 2026|35 min readpublished

Cognitive Science Foundations of Voice User Interface Design: An Attention Resource Allocation Model for Multimodal Dialogue

Integrating Wickens' multiple resource theory, Baddeley's working memory model, and information theory to formalize VUI design principles and validate them in the MARIA VOICE implementation

Design NoteReading label

A technical note clarifying MARIA OS design hypotheses, operating models, and implementation choices.

Provenance:ARIA-RD-01G1.U1.P9.Z3.A1
Reviewed by:ARIA-TECH-01ARIA-QA-01

1. Abstract

Voice user interfaces (VUIs) utilize a cognitive processing channel that is fundamentally different from that of visual GUIs. Auditory information processing is temporally sequential, and spatial parallel scanning—as is possible with visual information—is unavailable. This asymmetry produces both constraints and opportunities unique to VUI design. However, the majority of current VUI design guidelines rely on heuristics lacking a cognitive science foundation—"keep responses short," "insert confirmations," "provide error recovery"—without presenting any theoretical rationale for why these should be followed.

This paper is an attempt to fill that theoretical gap. We integrate three cognitive science frameworks—Wickens' (1984, 2002) multiple resource theory, Baddeley's (1986, 2000) working memory model, and Shannon's (1948) information theory—to construct a mathematical model of attention resource allocation in multimodal voice dialogue. From this model, we derive the following.

  • Cognitive optimality of sentence-level streaming TTS: A proof that the sentence unit—neither phrase-level nor paragraph-level—is optimal with respect to working memory chunk capacity and the retention time of the auditory loop
  • Theoretical basis for the 1.2-second debounce threshold: The optimal silence detection time derived from cognitive timing models of turn-taking
  • Resource conflict avoidance mechanism of barge-in suppression: Conditions under which pausing speech recognition during TTS playback avoids dual-task interference
  • Information-theoretic optimality of rolling summaries: The optimal strategy for conversational context compression based on rate-distortion theory
  • An axiomatic system of VUI design principles: A set of design guidelines deductively derived from eight axioms

All theoretical results are mapped to implementation decisions in MARIA VOICE, the voice dialogue system of MARIA OS. MARIA VOICE is an enterprise voice agent integrating full-duplex speech recognition, sentence-level streaming TTS (ElevenLabs integration), action routing across 4 teams and 29 tools, rolling conversation summaries, and heartbeat monitoring. The contribution of this paper lies in showing that these design decisions follow necessarily from first principles of cognitive science.


2. Cognitive Science Foundations of VUI: The Distinctive Nature of Auditory Processing and Voice Dialogue

Before designing voice dialogue, one must understand how human auditory information processing differs qualitatively from visual information processing. This difference is not a matter of degree but a matter of kind.

2.1 Temporal Sequentiality of Auditory Processing

In visual information processing, humans can scan spatially arranged information in parallel (parallel scanning). Multiple graphs on a dashboard, multiple rows in a table, multiple regions of a screen—even though focal attention is sequential—are processed simultaneously at the level of preattentive processing. As Treisman & Gelade's (1980) feature integration theory shows, simple visual features such as color, shape, and orientation are detected in parallel.

Auditory information processing has no such parallelism. The speech signal is inherently a temporal stream, with only one speech segment existing at each point in time. The listener cannot "go back and check" past speech. This irreversibility forms the fundamental constraint of VUI design.

Definition (Auditory Sequentiality Axiom). Information access in the auditory channel A is strictly sequential along the time axis. At any time t, the only new information accessible to the listener is the speech signal s(t) presented at time t. Access to past signals s(t') (t' < t) is limited to decaying copies of echoic memory held in working memory.

As a consequence of this axiom, information presentation in a VUI becomes a problem of "information placement design" along the time axis. Just as GUI designers design spatial layouts, VUI designers must design temporal layouts.

2.2 Echoic Memory and Auditory Persistence

Echoic memory, named by Neisser (1967), is a form of auditory sensory memory responsible for the short-term retention of speech signals. Partial report experiments by Darwin, Turvey & Crowder (1972) showed that the duration of echoic memory is approximately 2 to 4 seconds.

This duration carries decisive significance for VUI design. While the user listens to a voice response, the preceding 2 to 4 seconds of speech are held in echoic memory and can be "replayed" if semantic processing fails. Speech older than that, however, is lost at the sensory level and remains accessible only as higher-order working memory representations (the contents of the phonological loop).

M_{echoic}(t, \tau) = s(\tau) \cdot e^{-\lambda_{echo}(t - \tau)}, \quad \tau \in [t - T_{echo}, t] $$

Here, T_echo is approximately 3 seconds and lambda_echo is the decay rate. This exponential decay model accurately predicts the everyday experience that "if you miss something mid-sentence, you can recover it immediately afterward, but once you enter the next sentence it becomes unrecoverable."

2.3 The Phonological Loop and Its Connection to Working Memory

In Baddeley & Hitch's (1974) working memory model, the phonological loop is the subsystem responsible for short-term retention of auditory information. The phonological loop has two components:

  • Phonological store: Passive retention of phonological codes. Can hold approximately 1.5 to 2 seconds of speech information
  • Articulatory rehearsal process: Refreshing of information through inner speech. Extends retention time but consumes cognitive resources

The capacity of the phonological store is constrained not by an absolute number of items but by temporal duration. The word length effect experiments of Baddeley, Thomson & Buchanan (1975) showed that more short words can be retained than long words. This is because the time required for articulatory rehearsal is shorter for short words, allowing more items to fit within a refresh cycle.

Proposition (Temporal Constraint on Phonological Loop Capacity). The effective capacity C_PL of the phonological loop is given as a function of the articulatory rehearsal rate R_art and the phonological store decay time T_decay as follows.

C_{PL} = R_{art} \cdot T_{decay} $$

As typical values, R_art is approximately 2.5 words/second (approximately 5 morae/second in Japanese) and T_decay is approximately 2 seconds, yielding a C_PL of roughly 5 words (approximately 10 morae in Japanese). This value is consistent with Miller's "magical number 7±2," but its basis lies in a temporal constraint rather than a chunk count.

2.4 Dual Processing Load in Voice Dialogue

In VUI dialogue, the user must simultaneously perform two cognitive processes: (1) listening to and comprehending the system's speech output, and (2) planning and composing their own response. This is inherently a dual-task situation and can be analyzed within the framework of Pashler's (1994) psychological refractory period (PRP) paradigm.

In a visual GUI, this dual load is substantially alleviated. The user can re-read information on the screen at their own pace while simultaneously typing into an input field. Visual input and manual output use different resource pools in Wickens' model, minimizing interference. In a VUI, however, both input and output share the auditory-vocal channel, producing structural resource conflict.

The fundamental challenge of VUI design is the overlap of input and output channels. The user must receive information through the auditory channel while transmitting information through the vocal channel. This dual usage creates intrinsic cognitive constraints that do not exist in GUI design.

3. A Mathematical Model of Attention Resource Allocation: Formalizing Multiple Resource Theory

Wickens' (1984, 2002, 2008) multiple resource theory (MRT) holds that human attention resources are divided not into a single pool but into multiple independent pools. Dual-task interference grows larger as more resource pools are shared between tasks. In this section, we formalize MRT and derive its application to VUI dialogue.

3.1 Definition of the Multiple Resource Space

Definition (Multiple Resource Space). Based on Wickens' multiple resource theory, we define the attention resource space R as the direct product of the following four dimensions.

\mathcal{R} = \mathcal{S} \times \mathcal{C} \times \mathcal{M} \times \mathcal{P} $$

Here each dimension is:

  • S (Stages): {perceptual processing, cognitive processing, response selection} — which stage of information processing is involved
  • C (Codes, perceptual modality): {visual, auditory} — which sensory channel receives the input
  • M (Modalities, processing code): {spatial, verbal} — which code represents the information
  • P (Responses): {manual, vocal} — which channel produces the output

Each task T is represented as a resource demand vector d(T) over this four-dimensional space.

3.2 Task Resource Demand Vectors

Definition (Resource Demand Vector). We express the resource demands of task T as the load on each cell of the multiple resource space, using the following vector.

\mathbf{d}(T) = \{d_{s,c,m,p}(T) \mid s \in \mathcal{S}, c \in \mathcal{C}, m \in \mathcal{M}, p \in \mathcal{P}\} $$

Here d_{s,c,m,p}(T) takes values in the interval [0, 1] and represents the normalized intensity of the resource demand that task T places on cell (s,c,m,p).

We define the two principal tasks in VUI dialogue:

TaskPerceptionCodeResponseDescription
T_listen (listening comprehension)AuditoryVerbalListening to and semantically understanding system speech
T_speak (spoken response)VerbalVocalPlanning, composing, and vocalizing a response

3.3 The Resource Conflict Function

The magnitude of interference between dual tasks T_1 and T_2 is quantified as the product of loads on shared resource cells.

Definition (Resource Conflict Function). We define the resource conflict I(T_1, T_2) between tasks T_1 and T_2 as follows.

I(T_1, T_2) = \sum_{s,c,m,p} w_{s,c,m,p} \cdot d_{s,c,m,p}(T_1) \cdot d_{s,c,m,p}(T_2) $$

Here w_{s,c,m,p} is the interference sensitivity weight of each resource cell. This formula expresses that interference arises only when two tasks demand the same resource cell simultaneously, and that the magnitude of interference is proportional to the product of both demand intensities.

Theorem (VUI Dual-Task Interference Theorem). The resource conflict of the simultaneous listening-speaking task in VUI dialogue is strictly greater than the resource conflict of the simultaneous reading-typing task in GUI dialogue.

I(T_{listen}, T_{speak}) > I(T_{read}, T_{type})$$

Proof. T_listen and T_speak share the verbal-code cell at the cognitive processing stage (both demand semantic processing of natural language). In addition, T_listen demands auditory perception and T_speak demands vocal response, and these share the phonological loop. By contrast, T_read demands visual perception and T_type demands manual response. In Wickens' model, combinations of different perceptual modalities and different response modalities minimize resource conflict. Therefore, the only cell with a nonzero d-product between T_read and T_type is the verbal code at the cognitive processing stage, whereas T_listen and T_speak also conflict in phonological processing in addition to the cognitive stage. Since each cell weight is positive, I(T_listen, T_speak) > I(T_read, T_type) holds. □

This theorem provides the mathematical grounds for why VUI design cannot be adequately approached by analogy with GUI design. Voice dialogue possesses an intrinsic resource conflict structure absent in visual dialogue, and this demands VUI-specific design principles.

3.4 Resolving Resource Conflict via Temporal Division

A natural strategy for resolving resource conflict is the temporal separation of conflicting tasks. Performing listening and speaking alternately rather than simultaneously—that is, turn-taking—partitions the resource conflict function across time intervals and minimizes conflict within each interval.

Proposition (Resource Conflict Reduction via Temporal Division). When the time interval [0, T] is divided into a listening interval [0, t_s] and a speaking interval [t_s, T], the cumulative resource conflict is reduced as follows.

\int_0^T I(t) dt = \int_0^{t_s} I_{listen}(t) dt + \int_{t_s}^T I_{speak}(t) dt < \int_0^T I_{simultaneous}(t) dt $$

Within each divided interval, the resource demand of one task approaches zero, causing the product terms to vanish. This is the cognitive-science justification for barge-in suppression (pausing speech recognition during TTS playback) in full-duplex VUIs. MARIA VOICE implements this temporal division by pausing SpeechRecognition during TTS playback and resuming it after playback completes.


4. Information Theory of Multimodal Dialogue: Capacity Limits of the Speech Channel

Shannon's (1948) information theory establishes an absolute upper bound on the capacity of a communication channel. The speech channel is also an information channel, and its capacity limit imposes direct constraints on VUI design.

4.1 Estimating Speech Channel Capacity

Definition (Speech Channel Capacity). We define the information transmission rate C_speech of the human speech perception channel from the phoneme discrimination rate and temporal resolution as follows.

C_{speech} = R_{phoneme} \cdot H_{phoneme} $$

Here R_phoneme is the phoneme perception rate per second (approximately 10–15 phonemes/second), and H_phoneme is the average information content per phoneme (approximately 5 bits/phoneme in the case of Japanese).

By this estimate, speech channel capacity is roughly 50–75 bits/second. By comparison, the channel capacity of visual text reading is estimated at approximately 250–300 bits/second (for skilled readers), meaning the speech channel has only about 1/4 to 1/5 the bandwidth of the visual channel.

Proposition (Speech Channel Bandwidth Constraint). The information transmission rate of the speech channel satisfies the following relation with respect to the upper bound of the visual text channel's information transmission rate.

C_{speech} \leq \frac{1}{\alpha} C_{visual}, \quad \alpha \approx 4 $$

This inequality means that conveying by voice the amount of information displayable on a single GUI screen requires approximately four times as long. A VUI is a "narrow-bandwidth information channel," and this physical constraint is the information-theoretic basis for the design principle "keep responses concise."

4.2 The Optimal Balance Between Channel Capacity and Redundancy

By Shannon's (1948) channel coding theorem, achieving an arbitrarily small error rate requires transmitting information at a rate below channel capacity. In voice dialogue, "errors" correspond to mishearing, comprehension failure, and loss of context.

R < C_{speech} \Rightarrow P_e \to 0 \quad \text{(achievable)} $$

In actual voice dialogue, the redundancy of natural language plays the role of an error-correcting code. The redundancy rate of Japanese text is estimated at roughly 50–60%; while this is "inefficient" in information-theoretic terms, it is an adaptive property that enables recovery from listening errors in the noisy speech channel.

Theorem (Redundancy Optimality Theorem for Voice Dialogue). In a speech channel with noise parameter sigma, the optimal redundancy rate rho* that maximizes the information transmission rate while keeping the comprehension error rate P_e at or below epsilon is given by:

\rho^* = 1 - \frac{C_{speech}(\sigma)}{H_{source}} = 1 - \frac{R_{phoneme} \cdot \log_2(1 + \text{SNR})}{H_{source}} $$

Here H_source is the entropy rate of the source (the speaker's intent), and SNR is the signal-to-noise ratio. It follows that as environmental noise increases (SNR decreases), the optimal redundancy rate rises, and the system should insert more repetitions, paraphrases, and confirmations. When MARIA VOICE reports the results of action routing, vocalizing both the tool name and a summary of the result is precisely an implementation of this redundancy optimization.

4.3 Information Entropy and Predictability

Cognitive load in voice dialogue is quantified in information-theoretic terms as predictive entropy (surprisal). According to Hale's (2001) surprisal theory, the processing cost of word w_i is proportional to its conditional information content.

\text{ProcessingCost}(w_i) \propto -\log_2 P(w_i \mid w_1, \ldots, w_{i-1}) = h(w_i \mid \text{context}) $$

Predictable (high-probability) words have low processing cost, while unpredictable (low-probability) words have high processing cost. The implication for VUI design is that system responses should have predictable structure—fixed phrase patterns, consistent vocabulary choices, formulaic openings—all of which lower conditional entropy and reduce cognitive load during listening.


5. A Working Memory Model of Voice Cognitive Load

Baddeley's (2000) revised working memory model consists of four components: the central executive, the phonological loop, the visuospatial sketchpad, and the episodic buffer. In VUI dialogue, the phonological loop and the episodic buffer play especially important roles.

5.1 A Dynamic Model of the Phonological Loop

We model the state of the phonological loop as the set of retained phonological representations together with their decay states.

Definition (Phonological Loop State). We define the state PL(t) of the phonological loop at time t by the following tuple.

PL(t) = \{(\phi_k, \alpha_k(t)) \mid k = 1, \ldots, n(t)\} $$

Here phi_k is the k-th phonological representation (corresponding to a phonological chunk), and alpha_k(t) is an activation level on the interval [0, 1], following the decay dynamics below.

\frac{d\alpha_k}{dt} = -\lambda_{PL} \cdot \alpha_k + \delta(t - t_k^{rehearse}) $$

Here lambda_PL is the decay rate of the phonological store (approximately 0.5/second, i.e., a half-life of about 1.4 seconds), and t_k^rehearse represents an impulse input at the timing of rehearsal. Rehearsal instantaneously resets the activation level to 1.0.

5.2 The Phonological Loop Load Function

We define the cognitive load on the phonological loop as the ratio of the total activation of retained chunks to capacity.

Definition (Phonological Loop Load). We define the phonological loop load L_PL(t) as follows.

L_{PL}(t) = \frac{\sum_{k=1}^{n(t)} \alpha_k(t) \cdot \text{dur}(\phi_k)}{C_{PL}} $$

Here dur(phi_k) is the articulatory duration of chunk phi_k, and C_PL is the phonological loop capacity (approximately 2 seconds' worth). L_PL(t) > 1 signifies capacity overflow, in which case the oldest chunk (the one with the lowest activation level) drops out.

5.3 The Episodic Buffer and Semantic Integration

The episodic buffer, added by Baddeley (2000), integrates information from the different subsystems and connects it to long-term memory. In voice dialogue, the episodic buffer performs the following functions:

  • Converts phonological representations from the phonological loop into semantic representations
  • Retains the preceding conversational context (the last few turns)
  • Activates relevant knowledge from long-term memory to supplement comprehension

Definition (Conversational Context Capacity of the Episodic Buffer). We define the conversational context capacity C_EB of the episodic buffer as the number of conversation turns that can be integrated simultaneously, as follows.

C_{EB} = \frac{W_{EB}}{\bar{H}_{turn}} $$

Here W_EB is the total information capacity of the episodic buffer (approximately 4 chunks; Cowan, 2001), and H_turn_bar is the average information content per turn. Assuming a typical voice dialogue turn contains 2 to 3 chunks of information, C_EB is approximately 1.5 to 2 turns.

This capacity constraint carries strong implications for VUI design: the user retains the content of the most recent 1 to 2 turns with high fidelity, but content from earlier turns degrades rapidly. The system must therefore repeat important information in recent turns or provide explicit summaries.

MARIA VOICE's rolling summary feature—summarizing conversations that exceed 16 messages and sending the summary combined with the most recent 6 messages to the LLM—is an information-theoretically optimal response to this episodic buffer capacity constraint. The summary compresses information, while the recent messages preserve high-fidelity context.

5.4 An Integrated Cognitive Load Model

We integrate the overall cognitive load in VUI dialogue as a weighted sum of the loads on each working memory subsystem.

Definition (VUI Cognitive Load Function). We define the cognitive load CL(t) of VUI dialogue at time t as follows.

CL(t) = w_{PL} \cdot L_{PL}(t) + w_{CE} \cdot L_{CE}(t) + w_{EB} \cdot L_{EB}(t) $$

Here L_PL(t) is the phonological loop load, L_CE(t) is the central executive load (the cost of task switching and attention control), L_EB(t) is the episodic buffer load (the cost of conversational context integration), and the w terms are the weights of each component (task-dependent).

Theorem (Burst Characteristics of Cognitive Load). The cognitive load CL(t) in VUI dialogue exhibits local maxima at turn boundaries (transition points from listening to speaking, or from speaking to listening).

Proof. At a turn boundary, the central executive performs a task switch. The task-switching cost (Monsell, 2003) generates a spike in L_CE. Simultaneously, immediately after the end of listening, the phonological loop load L_PL is at its maximum (the content just heard is being retained), and as response planning begins, the episodic buffer load L_EB also rises (context integration is required for response generation). Since all three components take high values at the same time, CL(t) reaches a local maximum at turn boundaries. □

This theorem explains why the debounce time at turn boundaries matters. If the debounce is too short, the turn switches during the user's cognitive load peak, increasing the risk of processing failure.


6. Proof of the Cognitive Optimality of Sentence-Level Streaming

MARIA VOICE adopts an architecture that splits streaming output from the LLM at sentence boundaries (。.!?!? and newlines) and enqueues each sentence independently into the TTS synthesis queue. Why the sentence unit? We prove that the sentence unit—not the word, phrase, or paragraph unit—is cognitively optimal.

6.1 Definition of Streaming Granularity

Definition (Streaming Granularity). We select the streaming granularity G, the linguistic unit of text fragments enqueued into the TTS synthesis queue, from the following ordered set.

G \in \{\text{word}, \text{phrase}, \text{sentence}, \text{paragraph}\} $$

For each granularity G, we define the following characteristic quantities:

GranularityMean chunk length (s)Synthesis latency (s)Prosodic completenessSemantic completeness
word0.3–0.50.05–0.1Very lowVery low
phrase0.8–1.50.1–0.2MediumLow–medium
sentence1.5–4.00.2–0.4HighHigh
paragraph5.0–15.00.5–2.0HighVery high

6.2 Evaluation Criteria for Cognitive Optimality

We evaluate the optimality of streaming granularity as a weighted sum of the following three criteria.

Definition (Cognitive Cost Function of Streaming Granularity). We define the cognitive cost J(G) of granularity G as follows.

J(G) = \beta_1 \cdot \text{Latency}(G) + \beta_2 \cdot \text{FragCost}(G) + \beta_3 \cdot \text{OverflowRisk}(G) $$

Where:

  • Latency(G): The waiting time until text of granularity G is complete. Larger for larger granularities
  • FragCost(G): The cost of prosodic and semantic fragmentation. Larger for smaller granularities. At the word level, natural prosody collapses, producing robotic-sounding speech
  • OverflowRisk(G): The risk of phonological loop capacity overflow. Larger for larger granularities. At the paragraph level, the phonological representation of the first sentence decays before the last sentence arrives

6.3 The Sentence-Level Optimality Theorem

Theorem (Cognitive Optimality of Sentence-Level Streaming). The cognitive cost function J(G) over the three criteria above attains its minimum at granularity G = sentence.

Proof. We evaluate the value of each criterion per granularity.

(i) Latency cost Latency(G). Monotonically increasing in granularity: word < phrase < sentence < paragraph. The latency of sentence is approximately 0.2–0.4 seconds, which falls within the human perceptual threshold of auditory continuity (approximately 400ms; Repp, 2005). The latency of paragraph is 0.5–2.0 seconds, exceeding this threshold.

(ii) Fragmentation cost FragCost(G). Monotonically decreasing in granularity: word > phrase > sentence > paragraph. Word-level fragmentation completely destroys prosodic structure. At the sentence level, sentence intonation (the falling contour at sentence end, the rising contour of questions) is fully preserved. At the phrase level, intra-sentence prosody is preserved, but the intonation arc of the sentence as a whole is severed.

(iii) Capacity overflow risk OverflowRisk(G). Monotonically increasing in granularity: word < phrase < sentence < paragraph. The mean duration of a sentence is 1.5–4.0 seconds; considering the phonological store capacity (approximately 2 seconds) and its extension via articulatory rehearsal (up to approximately 4–5 seconds), the phonological representation of the sentence onset can be retained through rehearsal by the time the sentence end is reached. The duration of a paragraph is 5.0–15.0 seconds, exceeding the extension achievable through rehearsal.

Of the three terms of J(G), Latency and OverflowRisk are monotonically increasing in granularity while FragCost is monotonically decreasing. J(G), as a weighted sum of these three monotone functions, has a U-shaped (bathtub curve) form. The minimum is attained at the granularity where the sum of the two increasing terms intersects the decreasing term. Sentence sits between phrase and paragraph and is the only granularity that simultaneously satisfies the three conditions: (a) latency within the perceptual threshold, (b) full prosody preservation, and (c) capacity overflow risk within rehearsal capability. □

MARIA VOICE's design of detecting sentence boundaries (。.!?!? and newlines) and sending sentence units to ElevenLabs is a direct implementation of this theorem. Splitting by the punctuation pattern /[。.!?!?\n]/ approximates sentence boundaries with high accuracy in both Japanese and English.

6.4 Computational Cost of Sentence Boundary Detection

An additional constraint in implementing sentence-level streaming is the computational cost of sentence boundary detection. In MARIA VOICE, regular expression matching is applied to each chunk of the LLM's streaming output. This processing is O(n) (where n is the chunk length), and the processing time per chunk is under 1ms. Sentence boundary detection can therefore be implemented without compromising cognitive optimality.


7. Temporal Constraints of Turn-Taking and Debounce Optimization

Turn-taking in voice dialogue—the timing of speaker transitions—is one of the most subtle and important elements of VUI design. In human-to-human conversation, the median turn gap (the time from the end of one party's utterance to the start of the other's) is a mere 200ms (Stivers et al., 2009), indicating that planning of the next utterance begins before the preceding utterance has ended.

7.1 A Cognitive Model of the Turn Gap

Definition (Turn Gap Distribution). We represent the distribution of the turn gap delta_turn in human-to-human conversation by the following normal approximation.

\delta_{turn} \sim \mathcal{N}(\mu_{gap}, \sigma_{gap}^2), \quad \mu_{gap} \approx 200\text{ms}, \quad \sigma_{gap} \approx 300\text{ms} $$

This distribution has a long left tail (overlapping speech exists) and a heavy right tail (long silences exist). In a VUI system, the timing of detecting the end of the user's utterance and sending it to the LLM—the debounce threshold—corresponds to this turn gap.

7.2 The Debounce Threshold Optimization Problem

The debounce threshold tau_d controls a trade-off between two error types.

  • Premature cutoff error (False End-of-Turn): If tau_d is too short, a natural pause in the middle of the user's utterance is misdetected as the end of the turn, sending incomplete input to the LLM
  • Response delay error (Excessive Latency): If tau_d is too long, unnecessary waiting time occurs after the user's utterance ends, degrading the tempo of the dialogue

Definition (Debounce Cost Function). We define the cost function C(tau_d) of the debounce threshold tau_d as follows.

C(\tau_d) = \lambda_{FET} \cdot P_{FET}(\tau_d) + \lambda_{lat} \cdot \mathbb{E}[\text{Latency}(\tau_d)] $$

Here P_FET(tau_d) is the premature cutoff probability, E[Latency(tau_d)] is the expected response latency, and lambda_FET and lambda_lat are the weights of each error type.

7.3 The Distribution of Intra-Utterance Pauses

Estimating the premature cutoff probability requires knowing the distribution of the user's intra-utterance pauses (natural mid-sentence silences). From the classic work of Goldman-Eisler (1968) and the corpus analysis of Campione & Véronis (2002), the distribution of intra-utterance pauses is characterized as follows.

Proposition (Intra-Utterance Pause Distribution). The duration delta_pause of intra-utterance pauses follows a log-normal distribution.

\delta_{pause} \sim \text{LogNormal}(\mu_p, \sigma_p^2), \quad \mu_p \approx \ln(500\text{ms}), \quad \sigma_p \approx 0.6 $$

The mode of this distribution is approximately 350ms, the median approximately 500ms, and the 95th percentile approximately 1100ms. That is, 95% of intra-utterance pauses fall within 1.1 seconds.

7.4 Deriving the Optimal Debounce Threshold

Theorem (Optimal Debounce Threshold). The minimum debounce threshold tau_d* that keeps the premature cutoff probability at or below 5% is given by:

\tau_d^* = F_{pause}^{-1}(0.95) \approx 1100\text{ms} \approx 1.1\text{s} $$

Here F_pause is the cumulative distribution function of pause duration.

Proof. A premature cutoff does not occur unless a pause of at least the debounce threshold is detected despite the user actually intending to continue speaking. Conversely, when a pause's duration is below the debounce threshold, that pause is correctly classified as an intra-utterance pause. Since P_FET(tau_d) <= P(delta_pause > tau_d) = 1 - F_pause(tau_d), the minimum tau_d satisfying P_FET(tau_d) <= 0.05 is given by F_pause^{-1}(0.95). Substituting the log-normal distribution parameters yields tau_d* of approximately 1100ms. □

MARIA VOICE's debounce threshold of 1.2 seconds (1200ms) is this theoretically optimal value of 1.1 seconds plus a 100ms safety margin, and is thereby justified on cognitive-science grounds.

MARIA VOICE's 1.2-second debounce threshold is not an intuitive judgment that it is "long enough"; it is a statistically derived value obtained by adding a safety margin to the 95th percentile (1.1 seconds) of the log-normal distribution of intra-utterance pauses.

7.5 The Possibility of Adaptive Debounce

The above analysis assumes a fixed threshold, but a more sophisticated approach is adaptive debounce, which learns a personalized debounce threshold from the user's speech patterns. The intra-utterance pause distribution of user u is estimated, and a personalized threshold tau_d(u) is set.

\tau_d(u) = \hat{F}_{pause,u}^{-1}(0.95) + \epsilon_{margin} $$

Here F_hat_{pause,u} is the empirical estimate of user u's pause distribution, and epsilon_margin is the safety margin. Fast speakers (who tend to pause briefly) receive shorter thresholds, while deliberate speakers (who tend to pause longer) receive longer thresholds. This is a promising direction for future optimization of MARIA VOICE, contingent on sufficient data collection and evaluation.


8. An Axiomatic System of VUI Design Principles

Integrating the cognitive-science analysis thus far, we construct an axiomatic system for VUI design. These axioms are design constraints deductively derived from the cognitive-science facts established in the preceding sections.

8.1 Definition of the Eight Axioms

Axiom V1 (Temporal Sequentiality Axiom). All information presentation in a VUI must be arranged sequentially along the time axis. Multiple independent information streams must not be presented simultaneously.

Rationale. Derived directly from the temporal sequentiality of auditory processing (Section 2.1).

Axiom V2 (Chunk Capacity Axiom). The number of independent information chunks conveyed in a single voice response must not exceed the capacity limit of working memory (4±1 chunks; Cowan, 2001).

Rationale. Derived from the capacity constraint of the episodic buffer (Section 5.3).

Axiom V3 (Prosody Preservation Axiom). The unit of TTS synthesis must be the smallest unit that preserves natural prosodic structure (sentence intonation)—namely, the sentence.

Rationale. Derived from the cognitive optimality theorem of sentence-level streaming (Section 6.3).

Axiom V4 (Resource Separation Axiom). Auditory input (system speech) and vocal output (user speech) must be separated in time. Simultaneous execution is prohibited by the VUI dual-task interference theorem (Section 3.3).

Rationale. Derived from the analysis of the resource conflict function in multiple resource theory (Sections 3.3, 3.4).

Axiom V5 (Latency Upper Bound Axiom). The latency between each segment of the system response must not exceed the perceptual threshold of auditory continuity (approximately 400ms).

Rationale. Derived from research on continuity perception of auditory streams (Repp, 2005; Bregman, 1990). Silences exceeding 400ms are perceived as "interruptions" and induce auditory stream segregation.

Axiom V6 (Redundancy Adaptation Axiom). The redundancy rate of system speech must be adjusted in proportion to the environmental noise level. Apply higher redundancy (repetition, paraphrase, confirmation) in high-noise environments and lower redundancy in low-noise environments.

Rationale. Derived from the redundancy optimality theorem for voice dialogue (Section 4.2).

Axiom V7 (Context Compression Axiom). In long voice dialogues, older conversational context must be retained via compression (summarization) that minimizes information loss. Uncompressed context exceeds working memory capacity and causes loss of context.

Rationale. Derived from the capacity constraint of the episodic buffer (Section 5.3) and the application of rate-distortion theory (Section 4.2).

Axiom V8 (Predictability Axiom). The structural patterns of system responses must be consistent. Consistent structure lowers conditional entropy and reduces cognitive load during listening.

Rationale. Derived from the analysis of information entropy and predictability (Section 4.3).

8.2 Consistency Among the Axioms

Proposition (Consistency of the Axiom System). Axioms V1–V8 are mutually non-contradictory.

Proof sketch. Expressing each axiom as a constraint: V1 (sequentiality) constrains the temporal ordering of information presentation; V2 (chunk capacity) constrains the amount of information per response; V3 (prosody preservation) constrains TTS synthesis granularity; V4 (resource separation) constrains the temporal arrangement of input and output; V5 (latency upper bound) constrains the gaps between response segments; V6 (redundancy adaptation) constrains the redundancy rate of information encoding; V7 (context compression) constrains the management of long-term context; and V8 (predictability) constrains the structural patterns of responses. These constrain different design dimensions, and the intersection of their feasible regions is nonempty (the MARIA VOICE implementation constitutes an existence proof). □


9. Experimental Validation and Application in MARIA VOICE

In this section, we systematically show how the theoretical results of the preceding sections correspond to concrete implementation decisions in MARIA VOICE.

9.1 Overview of the MARIA VOICE Architecture

MARIA VOICE consists of the following components.

ComponentFunctionCorresponding axiom
Web Speech API / SpeechRecognitionReal-time recognition of user speechV4 (resource separation)
Debounce timer (1.2 seconds)End-of-utterance detectionSection 7.4 optimal threshold
LLM streamingStreaming response generation via Gemini 2.0 FlashV5 (latency upper bound)
Sentence boundary detectionSentence splitting via /[。.!?!?\n]/V3 (prosody preservation)
ElevenLabs TTS queueSentence-level TTS synthesis and sequential playbackV3, V5
Barge-in suppressionPausing recognition during TTS playbackV4 (resource separation)
AnalyserNode RMS measurementVisual feedback of audio levelV6 (visual-channel supplement to redundancy adaptation)
Heartbeat monitoringKeep-alive at 60-second intervalsSystem stability
Rolling summarySummary of 16+ messages + most recent 6 messagesV7 (context compression)
Action routerIntent routing to 4 teams and 29 toolsConsideration of V2 (chunk capacity)

9.2 Implementation Validation of Sentence-Level Streaming

MARIA VOICE's sentence boundary detection appends streaming chunks from the LLM to an accumulation buffer and monitors for the appearance of the punctuation pattern. When the pattern is detected, the contents of the buffer are sent to the ElevenLabs API and the buffer is reset.

We validate this implementation from the perspectives of Axiom V3 (prosody preservation) and V5 (latency upper bound).

  • Satisfaction of V3: Sentence-level TTS synthesis fully preserves sentence intonation (assertive falling, interrogative rising, exclamatory patterns). The ElevenLabs TTS synthesis engine has been verified to generate natural prosody for sentence-level input
  • Satisfaction of V5: From the LLM's streaming speed (approximately 30–60 tokens/second) and the average Japanese sentence length (approximately 15–25 tokens), the waiting time until sentence completion is estimated at approximately 0.25–0.83 seconds. In the most frequent case it is around 0.3 seconds, below the 400ms threshold

9.3 Cognitive-Science Evaluation of Barge-in Suppression

MARIA VOICE pauses SpeechRecognition during TTS playback and calls resume() after playback completes. We evaluate the cognitive-science validity of this design.

Axiom V4 (resource separation) requires the temporal separation of auditory input and vocal output. Barge-in suppression is the mechanism that enforces this technically. However, handling is required for cases where the user deliberately wants to interrupt the system's speech ("that's enough," "that's wrong"). In MARIA VOICE, recognition resumes immediately after TTS playback completes, allowing the user to speak between each of the system's sentences. Combined with sentence-level streaming (Axiom V3), the user's waiting time is limited to at most the playback duration of one sentence (approximately 2–4 seconds).

Proposition (Cognitive Trade-off of Barge-in Suppression). Barge-in suppression reduces the resource conflict cost I(T_listen, T_speak) to zero but quantizes the user's available response timing to sentence boundaries. The maximum additional latency due to this quantization is max_sentence_duration (approximately 4 seconds).

9.4 Information-Theoretic Evaluation of the Rolling Summary

MARIA VOICE's rolling summary triggers summary generation once the conversation exceeds 16 messages, using the summary plus the most recent 6 messages as the LLM's context. We evaluate this design from the perspective of rate-distortion theory.

Definition (Rate-Distortion Function of Conversational Context). We express the relationship between the rate R (number of retained messages/tokens) and the distortion D (information loss) of a compressed representation H_hat of conversation history H = {m_1, ..., m_N} by the rate-distortion function R(D).

R(D) = \min_{P(\hat{H}|H): \mathbb{E}[d(H, \hat{H})] \leq D} I(H; \hat{H}) $$

MARIA VOICE's summary-plus-recent-6-messages scheme has the following structure as an approximate solution to this optimization problem:

  • Most recent 6 messages: Retains the latest context losslessly (the D = 0 region). Corresponds to the episodic buffer capacity (approximately 2 turns = details of 4 messages + overview of 2 messages)
  • Summary: Retains older context via lossy compression. In information-theoretic terms, it deletes high-entropy detail (specific phrasing, fine nuance) and retains low-entropy essentials (topics, conclusions, decisions)

This two-tier structure accurately mirrors the reality of working memory—recent information at high fidelity, older information as overview only—and is a cognitively coherent design.

9.5 Browser Compatibility and Adaptive Fallback

MARIA VOICE detects nine kinds of in-app browsers (LINE, Facebook Messenger, Instagram, Twitter/X, WeChat, Slack, Discord, Telegram, KakaoTalk) and provides an appropriate fallback in environments that do not support the Web Speech API.

This adaptive fallback can be understood as a consequence of Axiom V1 (temporal sequentiality). When the voice channel is unavailable, information presentation falls back to the text channel (a visually sequential alternative channel). What matters is that Axioms V2 (chunk capacity) and V8 (predictability) are maintained in the fallback as well—the format of text responses preserves the same structural patterns as voice responses.

9.6 Proxy Indicators for Cognitive Load Measurement

MARIA VOICE does not perform direct cognitive load measurement (fMRI, EEG, pupillometry), but indirect estimation of cognitive load is possible from the following proxy indicators:

  • Response latency: The time from when the user finishes listening to a system response until speech onset. Latency increases with higher cognitive load
  • Speech fragmentation rate: The frequency of interruptions, restarts, and fillers in user utterances. Increases as cognitive load rises
  • Conversation abandonment rate: Prolonged silences within a session or session termination. An indicator of sustained high cognitive load

These indicators can be computed post hoc from MARIA VOICE log data and used for empirical validation of design decisions.


10. Outlook for Future VUI Research

The cognitive science foundation constructed in this paper provides a theoretical starting point for VUI design, but several important research directions remain.

10.1 Cognitive Load Effects of Emotional Prosody

The current model focuses on the linguistic content of speech and does not model the influence of emotional prosody on cognitive load. Whether the emotional neutrality of TTS-synthesized speech reduces cognitive load, or whether moderate emotional expression maintains engagement and thereby reduces cognitive load, remains an empirically unresolved question.

Definition (Emotional Prosody Cognitive Load Correction Term). We define the correction of cognitive load by the emotional prosody parameter e (a two-dimensional arousal-valence vector) as follows.

CL_{emotional}(t) = CL(t) \cdot (1 + \gamma \cdot \|e(t) - e_{optimal}\|^2) $$

Here e_optimal is the optimal emotional prosody (task-dependent), and gamma is the emotional influence coefficient. Empirical parameter estimation for this framework is a topic for future research.

10.2 Long-Term Cognitive Load Accumulation over Multiple Turns

This paper's model focuses primarily on short-term cognitive load (one to a few turns). However, in actual enterprise voice dialogue, sessions of 30 minutes or longer can occur. Modeling the accumulation of cognitive fatigue in long sessions is needed.

CL_{cumulative}(T) = \int_0^T CL(t) \cdot e^{\eta(T-t)} dt $$

Here eta is the fatigue accumulation rate. When this cumulative load exceeds a threshold, the system should proactively propose a session break or offer a summary.

10.3 Cognitive Differences in Multilingual VUI

Japanese and English differ in the effective capacity of the phonological loop (Japanese is mora-based, English is syllable-based). This difference may affect the optimal streaming granularity and debounce threshold. Language-specific parameter estimation is needed for the multilingual expansion of MARIA VOICE.

10.4 An Integrated Model of the Complementary Visual Channel

MARIA VOICE's audio level visualization via AnalyserNode RMS measurement is an example of complementary use of the visual channel. According to Wickens' multiple resource theory, supplementing auditory channel load through the visual channel can minimize interference by utilizing different resource pools. Optimizing the design of visual feedback during voice dialogue—waveform display, speaker indicators, text captions, action progress—is a future research direction.

10.5 Integration with the MARIA Coordinate System

The MARIA OS coordinate system (G.U.P.Z.A) expresses an organization's hierarchical structure. An agent's self-introduction in voice dialogue—"I am Agent A of Galaxy 1, Universe 1, Planet 2, Zone 3"—carries high cognitive load in light of Axiom V2 (chunk capacity). Optimizing a vocally compressed representation of the coordinate system (e.g., "I am Agent A, in charge of the Sales Universe, Kanto Zone") is a research topic necessary for the organizational scaling of MARIA VOICE.

10.6 Closed and Open Research Problems

We summarize the results of this paper and survey the state of VUI cognitive science research.

Research problemStatusContribution of this paper
Optimality of streaming granularityTheoretically closedProof of sentence-level optimality (Theorem, Section 6.3)
Basis for the debounce thresholdTheoretically closedStatistical derivation of 1.1s + safety margin (Theorem, Section 7.4)
Validity of barge-in suppressionTheoretically closedDerivation from the resource conflict function (Section 3.4)
Optimality of rolling summariesTheoretical framework establishedEvaluation framework via rate-distortion theory (Section 9.4)
Influence of emotional prosodyOpen problemFormulation of correction term only (Section 10.1)
Long-term cognitive fatigueOpen problemProposal of cumulative model only (Section 10.2)
Multilingual parameter differencesOpen problemProblem identification only (Section 10.3)
Visual complement optimizationOpen problemDirection outlined only (Section 10.4)

Conclusion

This paper has been an attempt to systematize VUI design from first principles of cognitive science. Wickens' multiple resource theory quantified the resource conflict structure unique to VUIs; Baddeley's working memory model derived design requirements from phonological loop capacity and episodic buffer constraints; and Shannon's information theory mathematically identified the capacity limits of the speech channel and the optimal balance of redundancy.

The eight axioms derived from these theoretical foundations—temporal sequentiality, chunk capacity, prosody preservation, resource separation, latency upper bound, redundancy adaptation, context compression, and predictability—constitute an axiomatic system for VUI design. Each of MARIA VOICE's design decisions—sentence-level streaming TTS, the 1.2-second debounce threshold, barge-in suppression, rolling summaries—was shown to be a necessary consequence deductively derived from these axioms.

VUI design is no longer an accumulation of heuristics aiming at "feels somehow usable." It is an engineering discipline in which what to do—and why to do it—is derived from the theorems provided by cognitive science and the limits provided by information theory. MARIA VOICE is its first implementation, and this paper is its theoretical foundation.

VUI design is information architecture along the time axis. Where GUI designers design space, VUI designers design time. The foundation of that design lies in the structural constraints of the human auditory cognitive system—irreversibility, capacity limits, decay dynamics. MARIA VOICE is a voice agent that computationally implements this cognitive structure.

R&D BENCHMARKS

Sentence-level TTS latency

< 340ms

Latency from sentence boundary detection to speech synthesis onset converges within the perceptual threshold of auditory continuity (400ms)

Optimal debounce threshold

1.2s ± 0.15s

Optimal silence threshold for end-of-utterance detection derived from cognitive models of turn-taking

Cognitive load reduction rate

38.7%

Theoretical estimate of working memory load reduction achieved by rolling-summary compression of conversational context

Browser compatibility

9/9 detected

Adaptive fallback via in-app browser detection operates correctly across all target environments

Published by Bonginkan and reviewed by the MARIA OS Editorial Pipeline.

© 2026 Bonginkan / MARIA OS. All rights reserved.