Abstract
Building a meeting AI bot that joins Google Meet, transcribes audio in real-time, generates live minutes, and streams results to a dashboard involves coordinating at least five independent subsystems: a headless browser (Playwright), an audio capture pipeline (Chrome DevTools Protocol), a speech recognition service (Gemini Live Audio), a summarization engine (Gemini Flash), and a streaming delivery mechanism (Server-Sent Events). Each subsystem has its own lifecycle, failure modes, and resource management requirements.
Coordinating these subsystems without a formal orchestration model leads to fragile, race-condition-prone code where startup sequences are implicit, shutdown is incomplete, and error recovery is ad hoc. This paper presents the MeetingSessionManager — a state machine that governs the complete lifecycle of a meeting session from creation through finalization. The state machine has seven states with strictly defined transitions, and uses the EventEmitter pattern to decouple component coordination from event delivery to external consumers (dashboard UI, API clients, audit logs).
1. The Orchestration Problem
1.1 Component Dependencies
The meeting AI pipeline has a strict dependency ordering: browser → Meet page → audio capture → ASR → minutes engine. Each component depends on its predecessor being initialized before it can start.
If the browser fails to launch, nothing else can start. If the Meet page fails to load, audio capture cannot begin. If audio capture fails, ASR receives no input. If ASR produces no segments, the minutes engine has nothing to summarize.
This chain of dependencies means that the system must handle failures at any point in the chain and propagate them appropriately. A failure in audio capture should not crash the browser. A failure in ASR should trigger reconnection, not session termination. A failure in minutes generation should be logged but should not interrupt transcription.
1.2 Concurrency Challenges
The system has multiple concurrent processes:
- Audio capture: Continuous stream of PCM16 chunks from the browser page, arriving every 100ms.
- ASR processing: WebSocket communication with Gemini Live API, producing partial and final transcription results asynchronously.
- Minutes generation: Periodic timer (every 15 seconds) that batches accumulated segments and calls the summarization model.
- Participant monitoring: DOM polling that detects when participants join or leave the meeting.
- Dashboard streaming: SSE connections from dashboard clients that need to receive real-time updates.
These processes must coexist without blocking each other, and the state machine must remain consistent regardless of the order in which asynchronous events arrive.
2. The Seven-State Machine
2.1 State Definitions
The session lifecycle is modeled as a deterministic finite automaton with seven states: created, joining, active, leaving, finalizing, completed, and failed.
Each state has a precise semantic meaning:
- created: Session record exists. No bot or ASR resources have been allocated. The session is waiting for a join command.
- joining: The Playwright browser is launching, navigating to Google Meet, and attempting to join the meeting. Audio capture and ASR are not yet active.
- active: The bot has joined the meeting. Audio capture is running. ASR is producing transcript segments. The minutes timer is active. This is the steady-state operating mode.
- leaving: A leave command has been issued. The bot is leaving the meeting. ASR is shutting down. The minutes timer is stopping. In-flight data is being flushed.
- finalizing: The bot has left. The system is generating the final comprehensive minutes from the complete transcript. This is a brief but important state that ensures minutes quality.
- completed: The session is done. Final minutes are available. All resources have been released. The session record is immutable.
- failed: An unrecoverable error occurred. Resources have been cleaned up. The session record includes the error details.
2.2 Transition Function
The valid transitions form a directed graph:
created → joining → active → leaving → finalizing → completed
created → failed
joining → failed
active → failed
Invalid transitions are rejected. The state machine enforces a strictly forward progression: once a session enters leaving, it cannot return to active. Once it reaches completed or failed, no further transitions are possible.
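The transition graph can be encoded as a lookup table of allowed successor states. The following is a minimal sketch, not the original implementation: the names VALID_TRANSITIONS, canTransition, and transition are illustrative, and only the transitions listed explicitly in the graph are included.

```typescript
type SessionStatus =
  | "created" | "joining" | "active" | "leaving"
  | "finalizing" | "completed" | "failed";

// Allowed successor states, taken from the transition graph in the text.
const VALID_TRANSITIONS: Record<SessionStatus, readonly SessionStatus[]> = {
  created: ["joining", "failed"],
  joining: ["active", "failed"],
  active: ["leaving", "failed"],
  leaving: ["finalizing"],
  finalizing: ["completed"],
  completed: [], // terminal
  failed: [],    // terminal
};

function canTransition(from: SessionStatus, to: SessionStatus): boolean {
  return VALID_TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new state, or throws so that an invalid transition is rejected.
function transition(from: SessionStatus, to: SessionStatus): SessionStatus {
  if (!canTransition(from, to)) {
    throw new Error(`Invalid transition: ${from} -> ${to}`);
  }
  return to;
}
```

Keeping the table as data makes the forward-only property easy to audit: no entry lists a state that appears earlier in the main chain, and the two terminal states have no successors.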
Formally, the transition function $\delta: Q \times \Sigma \rightarrow Q$ is defined over the event alphabet $\Sigma = \{\text{join}, \text{bot\_active}, \text{leave}, \text{finalize\_done}, \text{error}\}$.
2.3 State Invariants
Each state maintains invariants that the session manager enforces:
| State | Bot | ASR | Minutes Timer | Data Mutable |
|-------|-----|-----|---------------|--------------|
| created | null | null | stopped | yes |
| joining | initializing | null | stopped | yes |
| active | running | running | running | yes |
| leaving | stopping | stopping | stopping | yes (flush) |
| finalizing | null | null | stopped | yes (final gen) |
| completed | null | null | stopped | no |
| failed | null (cleaned) | null (cleaned) | stopped | no |
These invariants ensure that resources are allocated and released in the correct order. In the active state, all three subsystems (bot, ASR, minutes timer) must be running. In the completed state, all three must be null/stopped.
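The invariant table lends itself to a runtime check. The sketch below encodes each row as data and compares it with an observed snapshot; the type names, INVARIANTS, and findViolations are hypothetical and only illustrate how such a check might look.

```typescript
type SessionStatus =
  | "created" | "joining" | "active" | "leaving"
  | "finalizing" | "completed" | "failed";
type ResourceState = "null" | "initializing" | "running" | "stopping";
type TimerState = "stopped" | "running" | "stopping";

interface ResourceSnapshot {
  bot: ResourceState;
  asr: ResourceState;
  minutesTimer: TimerState;
}

interface StateInvariant extends ResourceSnapshot {
  dataMutable: boolean;
}

// One entry per row of the invariant table.
const INVARIANTS: Record<SessionStatus, StateInvariant> = {
  created:    { bot: "null",         asr: "null",     minutesTimer: "stopped",  dataMutable: true },
  joining:    { bot: "initializing", asr: "null",     minutesTimer: "stopped",  dataMutable: true },
  active:     { bot: "running",      asr: "running",  minutesTimer: "running",  dataMutable: true },
  leaving:    { bot: "stopping",     asr: "stopping", minutesTimer: "stopping", dataMutable: true },
  finalizing: { bot: "null",         asr: "null",     minutesTimer: "stopped",  dataMutable: true },
  completed:  { bot: "null",         asr: "null",     minutesTimer: "stopped",  dataMutable: false },
  failed:     { bot: "null",         asr: "null",     minutesTimer: "stopped",  dataMutable: false },
};

// Returns the names of the resources whose observed state breaks the invariant.
function findViolations(status: SessionStatus, actual: ResourceSnapshot): string[] {
  const expected = INVARIANTS[status];
  const violations: string[] = [];
  if (actual.bot !== expected.bot) violations.push("bot");
  if (actual.asr !== expected.asr) violations.push("asr");
  if (actual.minutesTimer !== expected.minutesTimer) violations.push("minutesTimer");
  return violations;
}
```

A check like this could run after every transition in development builds, so that a leaked browser or a still-running timer is reported at the transition that caused it.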
3. Component Coordination
3.1 The Join Sequence
When the join() method is called, the session manager executes the following sequence:
1. Transition to joining state and emit status event.
2. Construct the MeetingBot with configuration (cookies path, meeting URI, consent message).
3. Call bot.launch() — this launches Playwright, navigates to Meet, and joins.
4. The bot reports status changes via onStatusChange callback.
5. When the bot reports active, the session manager:
a. Transitions to active state.
b. Starts the Gemini ASR stream.
c. Starts the minutes update timer (15-second interval).
d. Begins forwarding audio chunks from the bot's audio capture to the ASR.
6. If the bot reports an error or launch() throws, the session manager transitions to failed.
This sequence is critical. If steps are reordered — for example, if ASR starts before the bot has joined — the system will attempt to process audio that does not exist, consuming API quota unnecessarily.
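The ordering of the join sequence can be sketched as follows. This is an illustration under assumptions, not the actual MeetingSessionManager: BotLike and JoinSketch are hypothetical, and starting ASR, the timer, and audio forwarding is reduced to log entries so that the order of side effects is visible.

```typescript
type SessionStatus = "created" | "joining" | "active" | "failed";
type BotStatus = "active" | "error";

// Hypothetical minimal bot interface; the real MeetingBot API is not shown in the text.
interface BotLike {
  launch(onStatusChange: (status: BotStatus) => void): Promise<void>;
}

class JoinSketch {
  status: SessionStatus = "created";
  readonly log: string[] = []; // records the order of side effects
  private readonly bot: BotLike;

  constructor(bot: BotLike) {
    this.bot = bot;
  }

  async join(): Promise<void> {
    this.setStatus("joining"); // step 1
    try {
      // Steps 3-4: launch the bot and listen for its status reports.
      await this.bot.launch((status) => {
        if (status === "active") this.onBotActive(); // step 5
        else this.setStatus("failed");               // step 6: bot reported an error
      });
    } catch (err) {
      this.setStatus("failed");                      // step 6: launch() threw
    }
  }

  private onBotActive(): void {
    if (this.status !== "joining") return; // ignore late or duplicate reports
    this.setStatus("active");              // 5a
    this.log.push("asr:start");            // 5b
    this.log.push("timer:start:15000");    // 5c: 15-second interval
    this.log.push("audio:forward");        // 5d
  }

  private setStatus(next: SessionStatus): void {
    if (this.status === "failed") return;  // failed is terminal
    this.status = next;
    this.log.push(`status:${next}`);
  }
}
```

Because ASR and the timer are started only inside onBotActive, they cannot start before the bot has joined, which is the property the sequence is meant to protect.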
3.2 The Leave Sequence
The leave sequence is the reverse of join, with an additional finalization step:
1. Transition to leaving state.
2. Stop the minutes update timer.
3. Close the ASR stream (flush any pending audio).
4. Tell the bot to leave the meeting.
5. Transition to finalizing state.
6. Run the final minutes generation over the complete transcript.
7. Emit the final minutes artifact.
8. Clean up the bot (close browser).
9. Record leftAt timestamp on the session.
10. Transition to completed state, after which the session record is immutable.
The finalization step (step 6) is what distinguishes a graceful shutdown from an abrupt one. By generating final minutes from the complete transcript, the system produces a higher-quality document than the last incremental update. The live minutes may have structural artifacts from incremental generation; the final minutes are coherent and complete.
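The happy path of the leave sequence can be sketched as a single async function. The LeaveDeps interface and its method names are hypothetical stand-ins for the bot, ASR stream, and minutes engine. The sketch records leftAt just before entering completed, so that nothing is written to the session record once it is immutable.

```typescript
type SessionStatus = "leaving" | "finalizing" | "completed";

// Hypothetical dependencies; the real component APIs are not shown in the text.
interface LeaveDeps {
  setStatus(status: SessionStatus): void;
  stopMinutesTimer(): void;
  closeAsr(): Promise<void>;
  leaveMeeting(): Promise<void>;
  generateFinalMinutes(): Promise<string>;
  emitMinutes(minutes: string): void;
  closeBrowser(): Promise<void>;
  recordLeftAt(at: Date): void;
}

async function leave(deps: LeaveDeps): Promise<void> {
  deps.setStatus("leaving");
  deps.stopMinutesTimer();               // no more incremental updates
  await deps.closeAsr();                 // flush pending audio
  await deps.leaveMeeting();
  deps.setStatus("finalizing");
  const finalMinutes = await deps.generateFinalMinutes(); // full-transcript pass
  deps.emitMinutes(finalMinutes);
  await deps.closeBrowser();
  deps.recordLeftAt(new Date());         // written before the record becomes immutable
  deps.setStatus("completed");
}
```

Error handling is deliberately omitted here to keep the ordering readable; a production version would need to guarantee cleanup when any step throws.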
3.3 Error Recovery and Reconnection
Transient failures in the ASR stream (WebSocket disconnections, API timeouts) should not terminate the session. The session manager implements a reconnection strategy:
1. When the bot or ASR reports an error, the handleBotError() method is invoked.
2. A reconnection counter is incremented.
3. If the counter does not exceed MAX_RECONNECT_ATTEMPTS (default: 3), an error event is emitted but the session remains active.
4. If the counter exceeds MAX_RECONNECT_ATTEMPTS, the session transitions to failed.
This bounded retry ensures that transient network issues do not terminate meetings while preventing the system from endlessly retrying against a permanent failure.
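A bounded retry counter of this kind might look like the sketch below. ReconnectPolicy is a hypothetical class, and the emitted array stands in for the error event channel. The sketch treats a counter equal to the limit as still within budget and fails only once the limit is exceeded.

```typescript
const MAX_RECONNECT_ATTEMPTS = 3;

class ReconnectPolicy {
  status: "active" | "failed" = "active";
  readonly emitted: Error[] = []; // stand-in for emitting `error` events
  private attempts = 0;

  handleBotError(err: Error): void {
    if (this.status === "failed") return; // terminal: ignore further errors
    this.attempts += 1;
    this.emitted.push(err);               // consumers are always told about the error
    if (this.attempts > MAX_RECONNECT_ATTEMPTS) {
      this.status = "failed";             // budget exhausted: give up
    }
  }
}
```

Whether the counter should reset after a period of healthy operation is a separate design decision that the text does not specify; this sketch never resets it.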
4. Event-Driven Architecture
4.1 The EventEmitter Pattern
The session manager extends EventEmitter<SessionManagerEvents> with five typed event channels:
- segment: Emitted when ASR produces a finalized transcript segment. Payload: TranscriptSegment.
- minutes: Emitted when the minutes engine produces an updated or final artifact. Payload: MinutesArtifact.
- status: Emitted on every state transition. Payload: { status: SessionStatus, message?: string }.
- participant: Emitted when a meeting participant joins or leaves. Payload: { type: 'joined' | 'left', participant }.
- error: Emitted on any error. Payload: Error.
This event-driven design decouples the orchestration logic from the delivery mechanism. The session manager does not know or care whether its events are consumed by an SSE endpoint, a WebSocket server, a database writer, or a test harness. It simply emits events when state changes occur.
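The typed event map can be illustrated with a small self-contained emitter. TypedEmitter below is a hand-written stand-in for the EventEmitter<SessionManagerEvents> base class, used so the example has no dependencies. The field lists of TranscriptSegment and MinutesArtifact and the shape of participant are assumptions based on the SSE payloads shown in this paper.

```typescript
type SessionStatus =
  | "created" | "joining" | "active" | "leaving"
  | "finalizing" | "completed" | "failed";

// Assumed payload shapes; the real interfaces may carry more fields.
interface TranscriptSegment { id: string; speakerLabel: string; textFinal: string; }
interface MinutesArtifact { state: "live" | "final"; version: number; }

// Event name -> payload type.
interface SessionManagerEvents {
  segment: TranscriptSegment;
  minutes: MinutesArtifact;
  status: { status: SessionStatus; message?: string };
  participant: { type: "joined" | "left"; participant: { name: string } };
  error: Error;
}

// Minimal typed emitter standing in for the EventEmitter base class.
class TypedEmitter<Events> {
  private listeners = new Map<keyof Events, Array<(payload: unknown) => void>>();

  on<K extends keyof Events>(event: K, listener: (payload: Events[K]) => void): void {
    const list = this.listeners.get(event) ?? [];
    list.push(listener as (payload: unknown) => void);
    this.listeners.set(event, list);
  }

  emit<K extends keyof Events>(event: K, payload: Events[K]): void {
    for (const listener of this.listeners.get(event) ?? []) {
      listener(payload);
    }
  }
}

class SessionManagerSketch extends TypedEmitter<SessionManagerEvents> {}
```

With this shape, a consumer that subscribes to status receives a payload typed as { status, message? }, and emitting a segment payload on the status channel is a compile-time error.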
4.2 SSE Streaming to Dashboard
The API layer subscribes to session manager events and forwards them to connected dashboard clients via Server-Sent Events (SSE):
GET /api/meeting/sessions/{id}/transcript?stream=true
event: segment
data: {"id":"seg-001","speakerLabel":"坪内","textFinal":"..."}
event: minutes
data: {"state":"live","version":3,"sections":[...]}
event: status
data: {"status":"active","message":"Transcribing..."}
SSE is chosen over WebSocket for the dashboard connection because the data flow is unidirectional (server to client), SSE handles automatic reconnection natively, and it works through standard HTTP infrastructure without special proxy configuration.
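Serializing an event into the SSE wire format is a one-line operation. The helper below is a sketch with a hypothetical name (formatSseEvent); it relies on the fact that JSON.stringify never emits raw newlines, so each payload fits on a single data: line.

```typescript
// Formats one Server-Sent Event frame: an `event:` line, a `data:` line,
// and the blank line that terminates the frame.
function formatSseEvent(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// Example frame for a status change.
const statusFrame = formatSseEvent("status", {
  status: "active",
  message: "Transcribing...",
});
```

An API route would write such frames to a response opened with the Content-Type text/event-stream, one frame per session manager event.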
4.3 Active Session Registry
The system maintains a global registry of active session managers, keyed by session ID.
This registry enables API routes to find the active manager for a given session ID, subscribe to its events, and forward commands (join, leave, consent). The registry is a simple in-memory Map, which is appropriate for a single-server deployment. For multi-server deployments, the registry would be replaced with a distributed coordination service.
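A registry of this kind can be sketched in a few lines. SessionManagerLike and the three helper functions are hypothetical names; the only detail taken from the text is that the registry is an in-memory Map looked up by session ID.

```typescript
// Hypothetical minimal view of a session manager for registry purposes.
interface SessionManagerLike {
  readonly sessionId: string;
}

// Process-wide registry: session ID -> active manager.
const activeSessions = new Map<string, SessionManagerLike>();

function registerSession(manager: SessionManagerLike): void {
  if (activeSessions.has(manager.sessionId)) {
    throw new Error(`Session ${manager.sessionId} is already registered`);
  }
  activeSessions.set(manager.sessionId, manager);
}

function getActiveSession(sessionId: string): SessionManagerLike | undefined {
  return activeSessions.get(sessionId);
}

// Returns true if an entry was removed. Call this when a session reaches
// completed or failed so the manager can be garbage-collected.
function unregisterSession(sessionId: string): boolean {
  return activeSessions.delete(sessionId);
}
```

Removing entries on terminal states matters as much as adding them: a Map holds strong references, so a forgotten entry keeps the whole manager alive.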
5. Browser Automation Architecture
5.1 Playwright and Google Meet
The bot uses Playwright's Chromium browser with specific flags for media handling:
- --use-fake-ui-for-media-stream: Automatically grants microphone/camera permissions without user interaction.
- --auto-accept-camera-and-microphone-capture: Suppresses the media capture permission dialog.
- --autoplay-policy=no-user-gesture-required: Allows audio playback without user interaction (required for receiving meeting audio).
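The flags above can be collected into an options object whose shape matches the headless and args options accepted by Playwright's chromium.launch(). The constant and function names are hypothetical, and whether the bot actually runs headless is an assumption left to the caller.

```typescript
// Chromium flags for media handling, as listed in the text.
const MEET_CHROMIUM_ARGS: readonly string[] = [
  "--use-fake-ui-for-media-stream",
  "--auto-accept-camera-and-microphone-capture",
  "--autoplay-policy=no-user-gesture-required",
];

interface LaunchOptionsSketch {
  headless: boolean;
  args: string[];
}

// Builds a fresh options object so callers can append flags without
// mutating the shared constant.
function buildLaunchOptions(headless: boolean): LaunchOptionsSketch {
  return { headless, args: MEET_CHROMIUM_ARGS.slice() };
}

// Usage (requires the `playwright` package, not imported here):
//   import { chromium } from "playwright";
//   const browser = await chromium.launch(buildLaunchOptions(true));
```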
The bot authenticates using pre-saved cookies for the i@bonginkan.ai Google account. This avoids the need for interactive OAuth flows and enables headless operation.
5.2 Audio Capture via Chrome DevTools Protocol
Audio is captured by injecting a script into the Meet page that creates an AudioContext pipeline:
1. A MutationObserver watches for dynamically added <audio> and <video> elements (Google Meet creates these for each participant's audio stream).
2. Each media element is wrapped in a MediaElementAudioSourceNode, created with AudioContext.createMediaElementSource().
3. All sources feed into a ChannelMerger node.
4. A ScriptProcessorNode extracts PCM16 samples from the merged stream.
5. The PCM16 data is base64-encoded and sent to Node.js via page.exposeFunction().
This architecture captures all meeting audio (all participants) as a single mixed stream. Speaker diarization is handled by the ASR engine, not the audio capture layer.
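The PCM16 extraction step converts the Web Audio API's 32-bit float samples, which lie in the range [-1, 1], to signed 16-bit integers. The function below is a sketch of that conversion only: the name floatToPcm16 is hypothetical, the scaling convention is one common choice rather than the one confirmed by the source, and the base64 encoding and page.exposeFunction() bridge are left out.

```typescript
// Converts float samples in [-1, 1] to signed 16-bit PCM.
// Out-of-range input is clamped rather than allowed to wrap around.
function floatToPcm16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i] ?? 0));
    // Negative values scale to -32768, positive values to 32767.
    out[i] = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
  }
  return out;
}
```

Clamping before scaling matters because mixing several participants can push the merged signal outside [-1, 1], and an unclamped value would wrap to the opposite sign in the Int16Array.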
5.3 Consent Notification
When the bot joins, it sends a chat message notifying participants of AI presence. This message serves as both a legal notification and a consent invitation:
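[MARIA AI] この会議はAIアシスタントMARIAが参加しています。議事録の記録・保存に同意される場合は「同意」とチャットに入力してください。
English translation: [MARIA AI] The AI assistant MARIA is participating in this meeting. If you consent to the minutes being recorded and stored, please type 「同意」 ("I agree") in the chat.
This notification is sent regardless of the consent gate's state. The message itself is informational and does not record consent; actual consent is recorded through the gate system when the host responds.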
6. Reliability and Performance
6.1 Resource Cleanup Guarantees
The session manager ensures that all allocated resources are released, even in error paths. The leave() method uses try/finally blocks to guarantee that:
- The ASR stream is closed even if minutes finalization fails.
- The bot browser is cleaned up even if the ASR close fails.
- The session status is updated even if cleanup partially fails.
This defense-in-depth approach to resource management prevents the accumulation of orphaned browser processes — a critical concern for a system that runs headless Chromium instances.
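The three guarantees can be expressed with nested try/finally blocks, one level per guarantee. This is a sketch of the pattern only: the Cleanup interface and its method names are hypothetical, and the real leave() method may structure its blocks differently.

```typescript
// Hypothetical cleanup hooks; the real component APIs are not shown in the text.
interface Cleanup {
  finalizeMinutes(): Promise<void>;
  closeAsr(): Promise<void>;
  closeBrowser(): Promise<void>;
  updateStatus(): void;
}

async function leaveWithGuaranteedCleanup(c: Cleanup): Promise<void> {
  try {
    try {
      try {
        await c.finalizeMinutes();
      } finally {
        await c.closeAsr();     // runs even if minutes finalization failed
      }
    } finally {
      await c.closeBrowser();   // runs even if closing the ASR stream failed
    }
  } finally {
    c.updateStatus();           // runs even if browser cleanup failed
  }
}
```

If more than one step throws, the caller sees only the most recent error, because an exception raised inside a finally block replaces the one in flight. A variant that collects every error would trade brevity for better diagnostics.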
6.2 Memory Management
Transcript segments accumulate in memory during the session. For a typical 60-minute meeting producing one segment per 5-10 seconds, the segment array grows to approximately 360-720 segments. Each segment is a lightweight object (~500 bytes), so the total in-memory footprint is under 400KB — well within acceptable limits.
Minutes artifacts are replaced on each update (not accumulated), so the minutes memory footprint is constant regardless of meeting duration.
6.3 Gemini API Budget
The minutes engine calls Gemini Flash every 15 seconds during an active meeting. For a 60-minute meeting, this produces approximately 240 API calls. Each call sends the new segments plus the existing minutes context, typically 2,000-5,000 tokens of input and 500-2,000 tokens of output, or 2,500-7,000 tokens per call. The total API consumption for a 60-minute meeting is therefore approximately 600,000-1,680,000 tokens, which is within the free tier limits for development and manageable in production.
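The budget arithmetic is easy to parameterize. The sketch below uses hypothetical names (TokenRange, estimateMinutesBudget) and assumes one call per timer tick with no retries. With the per-call ranges quoted above, it yields 240 calls and a total of 600,000 to 1,680,000 tokens for a 60-minute meeting.

```typescript
interface TokenRange {
  min: number;
  max: number;
}

interface BudgetEstimate {
  calls: number;
  minTokens: number;
  maxTokens: number;
}

// Estimates total token consumption, assuming exactly one summarization
// call per timer tick and no retries.
function estimateMinutesBudget(
  meetingMinutes: number,
  intervalSeconds: number,
  input: TokenRange,
  output: TokenRange,
): BudgetEstimate {
  const calls = Math.floor((meetingMinutes * 60) / intervalSeconds);
  return {
    calls,
    minTokens: calls * (input.min + output.min),
    maxTokens: calls * (input.max + output.max),
  };
}

const hourBudget = estimateMinutesBudget(
  60, 15,
  { min: 2000, max: 5000 },
  { min: 500, max: 2000 },
);
```

Doubling the interval to 30 seconds halves the call count, which is the simplest lever if the budget becomes a constraint.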
7. Conclusion
The MeetingSessionManager demonstrates that complex multi-component systems can be tamed through formal state machine design. By defining seven states with strict transition rules and maintaining invariants about resource allocation at each state, the system achieves predictable behavior even when individual components fail. The EventEmitter pattern provides clean separation between orchestration (what happens when) and delivery (who needs to know), enabling the same session manager to serve both real-time dashboards and post-meeting API consumers.
The key insight is that meeting AI orchestration is fundamentally a coordination problem, not a data processing problem. The difficult part is not transcribing audio or generating summaries — those are solved by external services (Gemini Live, Gemini Flash). The difficult part is starting these services in the right order, keeping them running in the face of failures, and shutting them down without losing data. A state machine with formally defined transitions provides the rigorous foundation that this coordination requires.