Hey folks! I want to share some updates on Hashbrown’s next modality: voice. I’m leading the architecture for it, and I’m excited to walk you through how we’re thinking about it, what problems we’re solving, and where we’re headed.
The Three Pillars of Voice
In AI engineering, voice typically falls into three categories:
Text-to-Speech (TTS) – Given text, a system prompt, and a voice, generate audio that “speaks” the provided text.
Speech-to-Text (STT) – Given an audio file and a system prompt, produce a text transcript of what was said.
Speech-to-Speech (STS) – The most advanced form, where the LLM can natively ingest and produce speech without any intermediary text.
These three building blocks form the foundation of two common voice agent designs:
Speech-to-Speech agents – Use STS models directly.
Chained agents – Simulate speech-to-speech by chaining text-based interactions through TTS and STT.
Only a few providers support true STS today, most notably OpenAI and Google’s experimental APIs. Most developers rely on the chained approach, which is exactly what we’ll do for Hashbrown.
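To make the chained design concrete, here’s a minimal sketch of a single turn. The transcribe, complete, and synthesize functions are hypothetical placeholders standing in for whichever STT, text, and TTS providers get wired in; none of them are real Hashbrown APIs.

// A minimal sketch of one turn of a chained voice agent. The three
// provider functions are hypothetical placeholders, not Hashbrown APIs.
type Transcribe = (audio: Blob) => Promise<string>;
type Complete = (prompt: string) => Promise<string>;
type Synthesize = (text: string) => Promise<Blob>;

async function chainedTurn(
  userAudio: Blob,
  transcribe: Transcribe,
  complete: Complete,
  synthesize: Synthesize,
): Promise<Blob> {
  const transcript = await transcribe(userAudio); // STT
  const reply = await complete(transcript);       // text-based LLM turn
  return synthesize(reply);                       // TTS
}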
Why We’re Pursuing Chained Voice Agents
There are a few reasons this direction makes sense for us:
Smaller, practical use cases: TTS and STT unlock valuable interactions on their own, like filling out forms by voice or reading reports aloud, without needing a full conversational agent.
Cross-provider parity: Most APIs for TTS and STT look similar. We can build lightweight adapters using our existing pattern and let Hashbrown call into any provider.
WebRTC complexity: True speech-to-speech would drag us deep into WebRTC land. Building generic adapters over WebRTC is ultra-hard mode, and there are entire teams dedicated to that problem. We don’t need to reinvent that wheel.
So, we’re building chained agents that compose our text hooks/resources with speech on either end.
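For illustration only, the cross-provider adapters could standardize on a surface as small as this; the names and shapes below are hypothetical, not part of Hashbrown today.

// Hypothetical adapter shapes for cross-provider TTS/STT parity. This is
// only a sketch of the kind of lightweight surface we could standardize on.
interface SpeechToTextAdapter {
  transcribe(audio: Blob, options?: { language?: string }): Promise<string>;
}

interface TextToSpeechAdapter {
  synthesize(
    text: string,
    options?: { voice?: string; format?: 'mp3' | 'wav' | 'opus' },
  ): Promise<Blob>;
}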
The Hard Problems Ahead
Even with the simpler path, there’s a ton of hard engineering ahead. Speech-to-text in particular will test us in three crunchy areas:
Voice Activity Detection (VAD) – Detecting when a user stops talking is surprisingly complex. I learned this firsthand when I built a voice assistant called Sunny as a technical preview last winter. You can run open-source ML models for VAD, but they’re huge and require ONNX. This would add megabytes to bundle sizes. A more lightweight approach is compiling Google’s BSD-licensed C-based VAD (from their WebRTC stack) to WASM. Intimidating, but doable.
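To give a feel for the worklet side, here’s a rough sketch of an AudioWorkletProcessor delegating to a WASM-compiled VAD. It assumes AudioWorklet global-scope typings are available, and it glosses over resampling and the fixed frame sizes the real WebRTC VAD expects; vadIsSpeech is a hypothetical binding, not an actual export.

// Rough sketch: an AudioWorkletProcessor that hands each render quantum to
// a WASM-compiled VAD. `vadIsSpeech` is a hypothetical binding; the real
// export names and frame handling depend on how we compile Google's C code.
declare function vadIsSpeech(frame: Float32Array, sampleRate: number): boolean;
declare const sampleRate: number; // AudioWorkletGlobalScope global

class VadProcessor extends AudioWorkletProcessor {
  process(inputs: Float32Array[][]): boolean {
    const channel = inputs[0]?.[0];
    if (channel) {
      // Report to the main thread whether this frame contains speech.
      this.port.postMessage({ speaking: vadIsSpeech(channel, sampleRate) });
    }
    return true; // keep the processor alive
  }
}

registerProcessor('vad-processor', VadProcessor);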
Encoding and Bit Rate – We’ll need to decide whether to stream raw WAV or compress audio. This is a classic latency vs. payload-size tradeoff. At the library level, it’s even trickier.
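As a rough sketch of the compressed path: speech at ~32 kbps Opus via MediaRecorder is roughly an order of magnitude smaller on the wire than raw 16-bit PCM at 16 kHz (~256 kbps), at the cost of encode latency and a container format the adapter has to understand.

// Sketch of the compressed path using the browser's MediaRecorder.
function streamCompressedAudio(
  stream: MediaStream,
  onChunk: (chunk: Blob) => void,
): MediaRecorder {
  // Opus in a WebM container; ~32 kbps is usually plenty for speech, versus
  // ~256 kbps for raw 16-bit PCM at 16 kHz.
  const recorder = new MediaRecorder(stream, {
    mimeType: 'audio/webm;codecs=opus',
    audioBitsPerSecond: 32_000,
  });
  recorder.ondataavailable = (event) => onChunk(event.data);
  recorder.start(250); // emit a chunk roughly every 250 ms
  return recorder;
}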
Streaming – How do we stream audio efficiently using only web standards? My best guess for an MVP, sketched after this list, is:
Use the browser’s microphone.
Create an AudioWorklet that runs Google’s WebRTC VAD (via WASM).
Stream binary-encoded audio chunks to an adapter.
Have the adapter split the audio into 25 MB chunks (the limit for OpenAI and Gemini), transcribe them, and stream the transcription back.
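Here’s how that MVP wiring might look on the main thread. The '/transcribe' endpoint and message shape are placeholders, and it assumes the worklet buffers captured audio and posts it along with the speech flag once the user stops talking.

// Sketch of the MVP wiring: microphone -> AudioWorklet (VAD + capture) ->
// binary audio posted to an adapter endpoint. Everything here is a
// placeholder shape, not a real Hashbrown API.
async function startVoiceCapture(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const context = new AudioContext();
  await context.audioWorklet.addModule('vad-processor.js');

  const source = context.createMediaStreamSource(stream);
  const vadNode = new AudioWorkletNode(context, 'vad-processor');
  source.connect(vadNode);

  vadNode.port.onmessage = async (event: MessageEvent) => {
    const { speaking, audio } = event.data as {
      speaking: boolean;
      audio?: ArrayBuffer;
    };
    // Once the user stops talking, ship the captured audio for transcription.
    if (!speaking && audio) {
      const response = await fetch('/transcribe', { method: 'POST', body: audio });
      console.log(await response.text()); // the transcript (or a stream of it)
    }
  };
}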
Once STT is working, TTS will be an order of magnitude easier.
TTS: The Challenge of Chunking
At our morning standup, I was talking to Ben and Brian about one of the subtler challenges of text-to-speech: chunking.
Imagine a chained voice agent mid-conversation:
User: Hello, how are you?
Assistant: I’m doing well, thanks for asking! I’ve been in a pretty steady rhythm today, diving into projects and switching between technical deep dives and more creative writing tasks…
The assistant’s text is streaming in real time, token by token. But to turn that text into audio, we can’t just wait for the entire response to finish. That would add huge latency. Instead, we need to break the text into chunks as it’s generated, ideally at natural sentence boundaries.
That’s where it gets tricky. Knowing where one sentence ends and another begins is a hard problem. You can’t just split on a period; “Dr. Smith” shouldn’t trigger a pause. And ideally, this works across languages. Enter machine learning again. There’s a whole family of models for sentence boundary detection, but like VAD, they’re big and computationally expensive. We have to be careful about what we ship to the browser and how we impact battery life, especially for mobile web apps. Audio might be the killer use case for generative UI on the mobile web, but it comes with unique constraints.
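To make the problem concrete, here’s a deliberately naive, English-only heuristic that handles a short abbreviation list and nothing else. It’s only meant to show why “split on periods” falls apart, not how we’d ship it; as described below, our current plan is to let the model do the chunking.

// A deliberately naive sentence chunker for streaming text. It skips a few
// common abbreviations but will still fail on plenty of real input, which
// is exactly why heavier boundary-detection models exist.
const ABBREVIATIONS = new Set(['dr', 'mr', 'mrs', 'ms', 'vs', 'etc']);

function popCompleteSentences(buffer: string): { sentences: string[]; rest: string } {
  const sentences: string[] = [];
  let start = 0;

  for (let i = 0; i < buffer.length; i++) {
    if (!'.!?'.includes(buffer[i])) continue;
    const lastWord = buffer.slice(start, i).split(/\s+/).pop()?.toLowerCase() ?? '';
    if (ABBREVIATIONS.has(lastWord)) continue; // "Dr." should not end a sentence
    sentences.push(buffer.slice(start, i + 1).trim());
    start = i + 1;
  }

  // Anything after the last boundary stays buffered until more tokens arrive.
  return { sentences, rest: buffer.slice(start) };
}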
Our Current Plan
Here’s the approach I’m leaning toward:
We’ll build on top of Skillet’s JSON streaming and ask the model to output an ordered list of chunks like this:
interface Chunk {
  seq: number;                              // playback order
  text: string;                             // the text to speak
  breakMs?: number;                         // optional pause after this chunk, in milliseconds
  voiceHint?: "casual" | "formal" | string; // optional delivery hint
  lang?: string;                            // optional language tag
  emotion?: "neutral" | "happy" | string;   // optional emotion hint
}
As each chunk is emitted, we can immediately start generating the corresponding audio and queue it for playback. This should make voice interactions feel snappy, expressive, and alive.
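As a sketch of that playback side, a tiny promise-chained queue is enough to keep chunks in order while later audio is still being synthesized. The synthesize call in the usage note is hypothetical, returning an audio Blob.

// Sketch of sequential playback: play each synthesized chunk in order
// without overlapping, while later chunks are still being generated.
class PlaybackQueue {
  private queue: Promise<void> = Promise.resolve();

  enqueue(audio: Blob): void {
    this.queue = this.queue.then(() => this.play(audio));
  }

  private play(audio: Blob): Promise<void> {
    return new Promise((resolve) => {
      const element = new Audio(URL.createObjectURL(audio));
      element.onended = () => resolve();
      element.onerror = () => resolve(); // don't stall the queue on a bad chunk
      void element.play();
    });
  }
}

// Usage (hypothetical synthesize call), enqueuing chunks in seq order:
// const playback = new PlaybackQueue();
// playback.enqueue(await synthesize(chunk.text));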
Wrapping Up
We’re still early, but the path is clear. Hashbrown’s approach to voice will start with chained agents, powered by Speech-to-Text and Text-to-Speech hooks/resources that developers can mix and match. From there, we’ll layer in streaming, chunking, and VAD to make it feel real-time and natural.
This will be a tough engineering challenge, but that’s what makes it fun. If any of this sparks ideas - or if you’re interested in helping out on a specific piece - please reach out!
Workshops!
There’s still time to join our Angular workshop on Hashbrown this Tuesday. Use code THEGRAVY2025 for a discount on your order. We’ll break down AI engineering and walk you through multiple exercises on building real AI-powered user experiences. We’d love to see you there!