New Feature

Private Audio Transcription: AI That Never Sees Your Audio

We built a browser-based transcription tool using on-device AI. Your audio never leaves your device—here's how it works.

Most transcription services work the same way: you upload your audio, it goes to a server, gets processed by some AI model, and comes back as text. Your audio—meetings, interviews, voice notes—passes through someone else's infrastructure.

We built something different.

Our Private Transcription tool runs entirely in your browser. Your audio never leaves your device. No uploads, no servers, no "trust us with your data."

Here's how we did it, what it costs, and the tradeoffs we made.

The Core Idea: AI in the Browser

Modern browsers can run machine learning models directly. Not toy demos—actual production-quality speech recognition.

The technology stack:

  • Transformers.js — Hugging Face's library for running ML models in JavaScript
  • WebGPU — Hardware acceleration (GPU compute in the browser)
  • WASM — Fallback for browsers without WebGPU
  • Web Audio API — Decoding and processing audio files

The key insight: speech-to-text models have gotten small enough to download once and run locally. A 50-75MB model can transcribe audio with surprising accuracy.
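
To make this concrete, here is a minimal sketch of loading such a model with Transformers.js. The model id and function name are illustrative, not our exact code; any ONNX-exported Whisper or Moonshine checkpoint on the Hugging Face Hub loads the same way:

```typescript
// Sketch: create a speech-recognition pipeline with Transformers.js.
async function createTranscriber(
  modelId: string = "onnx-community/whisper-tiny.en" // illustrative id
) {
  // Dynamic import keeps the heavy ML code out of the initial page load;
  // the 50-75MB model itself only downloads when this runs.
  // @ts-ignore -- package is resolved at runtime in the browser bundle
  const { pipeline } = await import("@huggingface/transformers");
  return pipeline("automatic-speech-recognition", modelId, {
    device: "webgpu", // use "wasm" instead where WebGPU is unavailable
  });
}

// Usage (in the browser, after the user clicks "Download Model"):
//   const transcriber = await createTranscriber();
//   const { text } = await transcriber(float32MonoAudioAt16kHz);
```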

Why Not Just Use an API?

We could have integrated Whisper API, AssemblyAI, or Deepgram. Easier to build. Better accuracy. So why didn't we?

Privacy is the feature

Some audio is sensitive:

  • Medical appointments
  • Legal consultations
  • HR conversations
  • Personal voice journals
  • Confidential interviews

For these, "we promise not to look" isn't good enough. The only real privacy is never sending the data at all.

No usage costs

API transcription costs $0.006-0.01 per minute of audio. That adds up fast:

  • 10,000 minutes/month = $60-100/month
  • With no way to rate-limit without accounts
  • And abuse potential from bots

Browser-based: $0 per transcription, forever.

No server infrastructure

No API keys to manage. No queue systems. No scaling concerns. No compliance headaches.

The user's device is the infrastructure.

The Model Decision: Moonshine vs Whisper

We offer two models:

  Model            Size     Speed       Best For
  Moonshine Tiny   ~50MB    5x faster   Clear audio, quick results
  Whisper Tiny     ~75MB    Baseline    Accented speech, noisy audio

Moonshine is a newer model optimized specifically for real-time transcription. It sacrifices some accuracy for dramatically better speed.

Whisper Tiny is OpenAI's smallest Whisper variant. More robust to accents and background noise, but slower.

We deliberately excluded larger models (Whisper Small at 244MB, Base at 142MB). The accuracy gains don't justify 3-5x download sizes for a browser tool.

The Real Cost: Bandwidth

"Free to run" isn't quite true. There's one cost: model downloads.

When a user first visits and clicks "Download Model," they pull 50-75MB from a CDN. That's our cost.

The math

  • CDN egress: ~$0.02-0.05 per GB
  • Model size: ~0.1 GB
  • Cost per new user: $0.002-0.005

At 10,000 new users: $20-50. At 100,000 new users: $200-500.
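
That arithmetic as a sketch (the rates are the estimates above, not a billing API):

```typescript
// Back-of-the-envelope CDN cost for a first-time user:
// model size in GB times egress rate per GB.
function egressCostPerUser(modelSizeGB: number, ratePerGB: number): number {
  return modelSizeGB * ratePerGB;
}

// A ~0.1 GB model at $0.02-0.05 per GB egress:
egressCostPerUser(0.1, 0.02); // low end, about $0.002
egressCostPerUser(0.1, 0.05); // high end, about $0.005
```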

Why it's manageable

  1. Browser caching — After first download, the model is cached in your browser. Returning users cost $0.
  2. CDN edge caching — Popular models get cached at edge nodes. Most requests never hit origin.
  3. User-initiated downloads — We don't auto-load the model. Users explicitly click "Download Model." No wasted bandwidth on bounced visitors.
  4. High return rate — Dev tools have repeat users. Someone who transcribes once usually transcribes again.

The effective cost per session drops dramatically over time.

Technical Deep Dive: Audio Processing Pipeline

Raw audio files can't go directly into a speech model. Here's the pipeline:

Audio File (MP3/WAV/M4A/OGG/WebM/MP4)
    ↓
Web Audio API decode
    ↓
Convert stereo → mono
    ↓
Resample to 16kHz
    ↓
ML Model inference (30s chunks)
    ↓
Text + timestamps
    ↓
Plain text or SRT subtitles

Why 16kHz?

Speech recognition models are trained on 16kHz audio. Higher sample rates waste compute without improving accuracy. We resample everything down.

Why mono?

Same reason. Stereo doesn't help speech recognition, and it doubles the data.
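
The downmix itself is just channel averaging. A sketch (the pure function below runs anywhere; in the browser the channel data would come from the Web Audio API, as shown in the trailing comment):

```typescript
// Downmix: average all channels into a single mono track.
// Stereo carries no extra information for speech models, and mono halves the data.
function mixToMono(channels: Float32Array[]): Float32Array {
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    let sum = 0;
    for (const channel of channels) sum += channel[i];
    mono[i] = sum / channels.length;
  }
  return mono;
}

// In the browser, roughly:
//   const buffer = await new AudioContext().decodeAudioData(await file.arrayBuffer());
//   const channels = [...Array(buffer.numberOfChannels)].map((_, c) => buffer.getChannelData(c));
//   const mono = mixToMono(channels); // then resample buffer.sampleRate -> 16kHz
```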

The resampling code

```typescript
function resampleAudio(
  audioData: Float32Array,
  sourceSampleRate: number,
  targetSampleRate: number = 16000
): Float32Array {
  const ratio = sourceSampleRate / targetSampleRate;
  const newLength = Math.round(audioData.length / ratio);
  const result = new Float32Array(newLength);

  // Linear interpolation between the two nearest source samples
  for (let i = 0; i < newLength; i++) {
    const srcIndex = i * ratio;
    const floor = Math.floor(srcIndex);
    const ceil = Math.min(floor + 1, audioData.length - 1);
    const t = srcIndex - floor;
    result[i] = audioData[floor] * (1 - t) + audioData[ceil] * t;
  }

  return result;
}
```

Simple linear interpolation. Not audiophile quality, but perfectly adequate for speech recognition.

Constraints We Chose

10-minute limit

We cap audio at 10 minutes. Why?

  1. Memory — Longer audio requires more RAM. Browser tabs have limits.
  2. UX — Processing 30+ minutes of audio with a progress bar isn't great.
  3. Scope — For longer content, desktop tools like Whisper.cpp are better suited.

10 minutes handles most real use cases: voice notes, meeting snippets, interview clips.
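
Both the cap and the 30-second inference chunking mentioned in the pipeline above are easy to express. A sketch, assuming 16kHz mono samples (the constant and function names are illustrative):

```typescript
const MAX_SECONDS = 600;  // the 10-minute cap
const CHUNK_SECONDS = 30; // inference window per model call
const SAMPLE_RATE = 16000;

// Reject over-long audio, then split into fixed-size windows for inference.
function splitIntoChunks(samples: Float32Array): Float32Array[] {
  if (samples.length > MAX_SECONDS * SAMPLE_RATE) {
    throw new Error("Audio exceeds the 10-minute limit");
  }
  const chunkSize = CHUNK_SECONDS * SAMPLE_RATE;
  const chunks: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += chunkSize) {
    // subarray clamps past-the-end, so the last chunk may be shorter
    chunks.push(samples.subarray(start, start + chunkSize));
  }
  return chunks;
}
```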

No real-time streaming

We process complete files, not live microphone input. Real-time adds complexity:

  • Chunking strategy
  • Partial result handling
  • UI for "listening" state
  • Permission flows

Processing a complete file is simpler and covers 90% of use cases.

English-focused models

Our default models are English-only. Multilingual Whisper exists but:

  • 2-3x larger
  • Slower
  • Requires language detection or selection

We may add multilingual support later, but started focused.

What We Learned

WebGPU is ready (mostly)

Chrome and Edge support WebGPU well. Firefox is catching up. Safari is... Safari.

We fall back to WASM automatically. It's slower (3-5x) but works everywhere.
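
The fallback check is straightforward: WebGPU-capable browsers expose `navigator.gpu`. A sketch (the helper name is ours, not a library API):

```typescript
type Device = "webgpu" | "wasm";

// WebGPU-capable browsers define navigator.gpu; everything else gets WASM.
function pickDevice(nav: { gpu?: unknown }): Device {
  return nav.gpu ? "webgpu" : "wasm";
}

// In the browser: pickDevice(navigator), then pass the result as the
// `device` option when creating the Transformers.js pipeline.
```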

Model loading UX matters

A 75MB download with no feedback feels broken. We show:

  • Download progress percentage
  • "First-time only" messaging
  • Clear model size upfront

Setting expectations prevents frustration.
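
A sketch of the percentage math behind that progress bar. The wiring in the trailing comment assumes Transformers.js's `progress_callback` option reports loaded/total byte counts, which is how we understand it to behave:

```typescript
// Convert byte-level download progress into a whole-number percentage.
function progressPercent(loadedBytes: number, totalBytes: number): number {
  if (totalBytes <= 0) return 0; // total unknown: show nothing rather than NaN
  return Math.min(100, Math.round((loadedBytes / totalBytes) * 100));
}

// Wiring it up when creating the pipeline (browser):
//   pipeline("automatic-speech-recognition", modelId, {
//     progress_callback: (e) => updateBar(progressPercent(e.loaded, e.total)),
//   });
```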

Mobile doesn't work (yet)

We had to disable mobile entirely. The models crash mobile browsers due to memory limits:

  • Mobile Safari/Chrome cap tabs at ~256-512MB
  • Model loading alone needs 50-75MB
  • Audio decoding creates large Float32Arrays
  • Inference needs additional working memory for tensors

All together, this exceeds what mobile browsers allow before killing the tab. We tried — it crashes.

Rather than a broken experience, we show a clear message explaining why and suggest alternatives (native apps like Google Recorder on Android, or transferring files to a computer).

The honest answer: browser-based ML transcription on mobile isn't viable with current model sizes. Maybe someday with smaller models or WebNN support.

Results

The tool works. Users can:

  1. Download a 50-75MB model (once)
  2. Drop in an audio file
  3. Get text + SRT subtitles
  4. Copy or download results

Zero data leaves their device. Zero ongoing cost per transcription.
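
For the curious, the SRT output step is plain string formatting over the model's timestamps. A sketch (the `Segment` shape is illustrative, not the library's exact output type):

```typescript
interface Segment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// SRT timestamps look like 00:00:03,500 (comma before milliseconds).
function toSrtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const rem = ms % 1000;
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(rem, 3)}`;
}

// Each segment becomes a numbered subtitle block.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text}\n`)
    .join("\n");
}
```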

What's Next

We're exploring:

  • Translation — Same approach, different model
  • Summarization — Process the transcript with a small LLM
  • Speaker diarization — Who said what

All browser-based. All private.

The era of "upload your sensitive data to our servers" is ending. Good riddance.

Try Private Transcription