Most transcription services work the same way: you upload your audio, it goes to a server, gets processed by some AI model, and comes back as text. Your audio—meetings, interviews, voice notes—passes through someone else's infrastructure.
We built something different.
Our Private Transcription tool runs entirely in your browser. Your audio never leaves your device. No uploads, no servers, no "trust us with your data."
Here's how we did it, what it costs, and the tradeoffs we made.
The Core Idea: AI in the Browser
Modern browsers can run machine learning models directly. Not toy demos—actual production-quality speech recognition.
The technology stack:
- Transformers.js — Hugging Face's library for running ML models in JavaScript
- WebGPU — Hardware acceleration (GPU compute in the browser)
- WASM — Fallback for browsers without WebGPU
- Web Audio API — Decoding and processing audio files
The key insight: speech-to-text models have gotten small enough to download once and run locally. A 50-75MB model can transcribe audio with surprising accuracy.
Why Not Just Use an API?
We could have integrated Whisper API, AssemblyAI, or Deepgram. Easier to build. Better accuracy. So why didn't we?
Privacy is the feature
Some audio is sensitive:
- Medical appointments
- Legal consultations
- HR conversations
- Personal voice journals
- Confidential interviews
For these, "we promise not to look" isn't good enough. The only real privacy is never sending the data at all.
No usage costs
API transcription costs $0.006-0.01 per minute of audio. That adds up fast:
- 10,000 minutes/month = $60-100/month
- With no way to rate-limit without accounts
- And abuse potential from bots
Browser-based: $0 per transcription, forever.
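The numbers above fall out of simple per-minute math. A quick sketch (the rates are the $0.006-0.01 range quoted above, not any specific provider's current pricing):

```typescript
// Hypothetical cost model for API-based transcription.
// ratePerMinute spans the $0.006-0.01 range mentioned above.
function apiCostUSD(minutesPerMonth: number, ratePerMinute: number): number {
  return minutesPerMonth * ratePerMinute;
}

// 10,000 minutes/month at the low and high ends of the range:
apiCostUSD(10_000, 0.006); // ≈ $60/month
apiCostUSD(10_000, 0.01);  // ≈ $100/month
```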
No server infrastructure
No API keys to manage. No queue systems. No scaling concerns. No compliance headaches.
The user's device is the infrastructure.
The Model Decision: Moonshine vs Whisper
We offer two models:
| Model | Size | Speed | Best For |
|---|---|---|---|
| Moonshine Tiny | ~50MB | 5x faster | Clear audio, quick results |
| Whisper Tiny | ~75MB | Baseline | Accented speech, noisy audio |
Moonshine is a newer model optimized specifically for real-time transcription. It sacrifices some accuracy for dramatically better speed.
Whisper Tiny is OpenAI's smallest Whisper variant. More robust to accents and background noise, but slower.
We deliberately excluded larger models (Whisper Base at ~142MB, Whisper Small at ~244MB). The accuracy gains don't justify 2-3x the download size for a browser tool.
The Real Cost: Bandwidth
"Free to run" isn't quite true. There's one cost: model downloads.
When a user first visits and clicks "Download Model," they pull 50-75MB from a CDN. That's our cost.
The math
- CDN egress: ~$0.02-0.05 per GB
- Model size: ~0.1 GB
- Cost per new user: $0.002-0.005
At 10,000 new users: $20-50. At 100,000 new users: $200-500.
Why it's manageable
- Browser caching — After first download, the model is cached in your browser. Returning users cost $0.
- CDN edge caching — Popular models get cached at edge nodes. Most requests never hit origin.
- User-initiated downloads — We don't auto-load the model. Users explicitly click "Download Model." No wasted bandwidth on bounced visitors.
- High return rate — Tools like this have repeat users. Someone who transcribes once usually transcribes again.
The effective cost per session drops dramatically over time.
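The amortization is a one-line formula: the per-user download cost divided by how many sessions that user comes back for. A sketch using the illustrative figures from above:

```typescript
// Effective bandwidth cost per session, assuming the model is paid for
// once per new user (CDN egress) and served from browser cache afterwards.
function costPerSessionUSD(
  downloadCostUSD: number, // ~$0.002-0.005 per new user, per the math above
  sessionsPerUser: number
): number {
  return downloadCostUSD / sessionsPerUser;
}

// A user who returns for 10 sessions costs roughly $0.0003 per session.
costPerSessionUSD(0.003, 10);
```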
Technical Deep Dive: Audio Processing Pipeline
Raw audio files can't go directly into a speech model. Here's the pipeline:
```
Audio file (MP3/WAV/M4A/OGG/WebM/MP4)
        ↓
Web Audio API decode
        ↓
Convert stereo → mono
        ↓
Resample to 16kHz
        ↓
ML model inference (30s chunks)
        ↓
Text + timestamps
        ↓
Plain text or SRT subtitles
```
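The stereo-to-mono step in this pipeline is just an average of the two channels. A minimal sketch, operating on plain Float32Arrays like those returned by the Web Audio API's AudioBuffer.getChannelData():

```typescript
// Downmix stereo to mono by averaging the left and right channels.
// Assumes both channels have the same length, as decoded audio does.
function stereoToMono(left: Float32Array, right: Float32Array): Float32Array {
  const mono = new Float32Array(left.length);
  for (let i = 0; i < left.length; i++) {
    mono[i] = (left[i] + right[i]) / 2;
  }
  return mono;
}
```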
Why 16kHz?
Speech recognition models are trained on 16kHz audio. Higher sample rates waste compute without improving accuracy. We resample everything down.
Why mono?
Same reason. Stereo doesn't help speech recognition, and it doubles the data.
The resampling code
```typescript
function resampleAudio(
  audioData: Float32Array,
  sourceSampleRate: number,
  targetSampleRate: number = 16000
): Float32Array {
  const ratio = sourceSampleRate / targetSampleRate;
  const newLength = Math.round(audioData.length / ratio);
  const result = new Float32Array(newLength);

  // Linear interpolation between the two nearest source samples
  for (let i = 0; i < newLength; i++) {
    const srcIndex = i * ratio;
    const floor = Math.floor(srcIndex);
    const ceil = Math.min(floor + 1, audioData.length - 1);
    const t = srcIndex - floor;
    result[i] = audioData[floor] * (1 - t) + audioData[ceil] * t;
  }

  return result;
}
```
Simple linear interpolation. Not audiophile quality, but perfectly adequate for speech recognition.
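The 30-second chunking from the pipeline can be sketched the same way. The chunk length matches the one mentioned above, though in practice the inference library handles chunking (including any overlap between chunks) itself:

```typescript
// Split a 16kHz mono buffer into fixed-length chunks for inference.
// 30 s at 16,000 samples/s = 480,000 samples per chunk; the last chunk
// may be shorter. subarray() creates views, not copies.
function chunkAudio(
  audio: Float32Array,
  chunkSeconds = 30,
  sampleRate = 16_000
): Float32Array[] {
  const chunkLen = chunkSeconds * sampleRate;
  const chunks: Float32Array[] = [];
  for (let start = 0; start < audio.length; start += chunkLen) {
    chunks.push(audio.subarray(start, Math.min(start + chunkLen, audio.length)));
  }
  return chunks;
}
```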
Constraints We Chose
10-minute limit
We cap audio at 10 minutes. Why?
- Memory — Longer audio requires more RAM. Browser tabs have limits.
- UX — Processing 30+ minutes of audio with a progress bar isn't great.
- Scope — For longer content, desktop tools like Whisper.cpp are better suited.
10 minutes handles most real use cases: voice notes, meeting snippets, interview clips.
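The cap itself is a one-line check against the decoded buffer's duration. A sketch (the function and its message are illustrative, not our exact implementation):

```typescript
// Enforce the 10-minute cap before any decoding or inference work.
// durationSeconds would come from the decoded AudioBuffer's duration.
const MAX_SECONDS = 10 * 60;

function checkDuration(durationSeconds: number): { ok: boolean; message?: string } {
  if (durationSeconds > MAX_SECONDS) {
    return {
      ok: false,
      message: `Audio is ${Math.ceil(durationSeconds / 60)} minutes; the limit is 10.`,
    };
  }
  return { ok: true };
}
```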
No real-time streaming
We process complete files, not live microphone input. Real-time adds complexity:
- Chunking strategy
- Partial result handling
- UI for "listening" state
- Permission flows
Processing a complete file is simpler and covers 90% of use cases.
English-focused models
Our default models are English-only. Multilingual Whisper exists but:
- 2-3x larger
- Slower
- Requires language detection or selection
We may add multilingual support later, but started focused.
What We Learned
WebGPU is ready (mostly)
Chrome and Edge support WebGPU well. Firefox is catching up. Safari is... Safari.
We fall back to WASM automatically. It's slower (3-5x) but works everywhere.
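The fallback decision is plain feature detection: navigator.gpu is only defined in browsers that support WebGPU. A sketch, typed loosely so the logic works outside a browser too:

```typescript
// Pick an execution backend by feature detection. In the browser you would
// pass the global `navigator`; the parameter is typed structurally here.
type Backend = 'webgpu' | 'wasm';

function pickBackend(nav: { gpu?: unknown }): Backend {
  // navigator.gpu exists only where WebGPU is available.
  return nav.gpu !== undefined ? 'webgpu' : 'wasm';
}
```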
Model loading UX matters
A 75MB download with no feedback feels broken. We show:
- Download progress percentage
- "First-time only" messaging
- Clear model size upfront
Setting expectations prevents frustration.
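The progress display is simple formatting over whatever byte counts the model loader reports. A sketch (the function name and field choices are illustrative, not a specific library's API):

```typescript
// Turn raw download progress into the label shown in the UI.
function progressLabel(loadedBytes: number, totalBytes: number): string {
  const pct = Math.floor((loadedBytes / totalBytes) * 100);
  const mb = (n: number) => (n / 1024 / 1024).toFixed(1);
  return `Downloading model: ${pct}% (${mb(loadedBytes)} / ${mb(totalBytes)} MB)`;
}

// Halfway through a 75MB model:
progressLabel(37.5 * 1024 * 1024, 75 * 1024 * 1024);
// → "Downloading model: 50% (37.5 / 75.0 MB)"
```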
Mobile doesn't work (yet)
We had to disable mobile entirely. The models crash mobile browsers due to memory limits:
- Mobile Safari/Chrome cap tabs at ~256-512MB
- Model loading alone needs 50-75MB
- Audio decoding creates large Float32Arrays
- Inference needs additional working memory for tensors
All together, this exceeds what mobile browsers allow before killing the tab. We tried — it crashes.
Rather than a broken experience, we show a clear message explaining why and suggest alternatives (native apps like Google Recorder on Android, or transferring files to a computer).
The honest answer: browser-based ML transcription on mobile isn't viable with current model sizes. Maybe someday with smaller models or WebNN support.
Results
The tool works. Users can:
- Download a 50-75MB model (once)
- Drop in an audio file
- Get text + SRT subtitles
- Copy or download results
Zero data leaves their device. Zero ongoing cost per transcription.
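The SRT output mentioned above is just indexed blocks with HH:MM:SS,mmm time codes. A sketch of that formatting, from the per-segment timestamps the model produces:

```typescript
// Format a timestamp in seconds as an SRT time code (HH:MM:SS,mmm).
function srtTimestamp(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const pad = (n: number, w: number) => String(n).padStart(w, '0');
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor(ms / 60_000) % 60;
  const s = Math.floor(ms / 1000) % 60;
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// One subtitle block: index, time range, then the text.
function srtBlock(index: number, start: number, end: number, text: string): string {
  return `${index}\n${srtTimestamp(start)} --> ${srtTimestamp(end)}\n${text}\n`;
}
```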
What's Next
We're exploring:
- Translation — Same approach, different model
- Summarization — Process the transcript with a small LLM
- Speaker diarization — Who said what
All browser-based. All private.
The era of "upload your sensitive data to our servers" is ending. Good riddance.
Try Private Transcription