New Feature

Private Audio Transcription: AI That Never Sees Your Audio

We built a browser-based transcription tool using on-device AI. Your audio never leaves your device—here's how it works.

Most transcription services work the same way: you upload your audio, it goes to a server, gets processed by some AI model, and comes back as text. Your audio—meetings, interviews, voice notes—passes through someone else's infrastructure.

We built something different.

Our Private Transcription tool runs entirely in your browser. Your audio never leaves your device. No uploads, no servers, no "trust us with your data."

Here's how we did it, what it costs, and the tradeoffs we made.

The Core Idea: AI in the Browser

Modern browsers can run machine learning models directly. Not toy demos—actual production-quality speech recognition.

The technology stack:

  • Transformers.js — Hugging Face's library for running ML models in JavaScript
  • WebGPU — Hardware acceleration (GPU compute in the browser)
  • WASM — Fallback for browsers without WebGPU
  • Web Audio API — Decoding and processing audio files

The key insight: speech-to-text models have gotten small enough to download once and run locally. A 50-75MB model can transcribe audio with surprising accuracy.
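
To make this concrete, here is a minimal sketch of loading such a model with Transformers.js. The model id and function name are illustrative, not our exact code; any ONNX-exported Whisper or Moonshine checkpoint on the Hugging Face Hub loads the same way:

```typescript
// Sketch: create a speech-recognition pipeline with Transformers.js.
async function createTranscriber(
  modelId: string = "onnx-community/whisper-tiny.en" // illustrative id
) {
  // Dynamic import keeps the heavy ML code out of the initial page load;
  // the 50-75MB model itself only downloads when this runs.
  // @ts-ignore -- package is resolved at runtime in the browser bundle
  const { pipeline } = await import("@huggingface/transformers");
  return pipeline("automatic-speech-recognition", modelId, {
    device: "webgpu", // use "wasm" instead where WebGPU is unavailable
  });
}

// Usage (in the browser, after the user clicks "Download Model"):
//   const transcriber = await createTranscriber();
//   const { text } = await transcriber(float32MonoAudioAt16kHz);
```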

Why Not Just Use an API?

We could have integrated Whisper API, AssemblyAI, or Deepgram. Easier to build. Better accuracy. So why didn't we?

Privacy is the feature

Some audio is sensitive:

  • Medical appointments
  • Legal consultations
  • HR conversations
  • Personal voice journals
  • Confidential interviews

For these, "we promise not to look" isn't good enough. The only real privacy is never sending the data at all.

No usage costs

API transcription costs $0.006-0.01 per minute of audio. That adds up fast:

  • 10,000 minutes/month = $60-100/month
  • With no way to rate-limit without accounts
  • And abuse potential from bots

Browser-based: $0 per transcription, forever.

No server infrastructure

No API keys to manage. No queue systems. No scaling concerns. No compliance headaches.

The user's device is the infrastructure.

The Model Decision: Moonshine vs Whisper

We offer two models:

  Model            Size     Speed       Best For
  Moonshine Tiny   ~50MB    5x faster   Clear audio, quick results
  Whisper Tiny     ~75MB    Baseline    Accented speech, noisy audio

Moonshine is a newer model optimized specifically for real-time transcription. It sacrifices some accuracy for dramatically better speed.

Whisper Tiny is OpenAI's smallest Whisper variant. More robust to accents and background noise, but slower.

We deliberately excluded larger models (Whisper Small at 244MB, Base at 142MB). The accuracy gains don't justify 3-5x download sizes for a browser tool.

The Real Cost: Bandwidth

"Free to run" isn't quite true. There's one cost: model downloads.

When a user first visits and clicks "Download Model," they pull 50-75MB from a CDN. That's our cost.

The math

  • CDN egress: ~$0.02-0.05 per GB
  • Model size: ~0.1 GB
  • Cost per new user: $0.002-0.005

At 10,000 new users: $20-50. At 100,000 new users: $200-500.
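
That arithmetic as a sketch (the rates are the estimates above, not a billing API):

```typescript
// Back-of-the-envelope CDN cost for a first-time user:
// model size in GB times egress rate per GB.
function egressCostPerUser(modelSizeGB: number, ratePerGB: number): number {
  return modelSizeGB * ratePerGB;
}

// A ~0.1 GB model at $0.02-0.05 per GB egress:
egressCostPerUser(0.1, 0.02); // low end, about $0.002
egressCostPerUser(0.1, 0.05); // high end, about $0.005
```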

Why it's manageable

  1. Browser caching — After first download, the model is cached in your browser. Returning users cost $0.
  2. CDN edge caching — Popular models get cached at edge nodes. Most requests never hit origin.
  3. User-initiated downloads — We don't auto-load the model. Users explicitly click "Download Model." No wasted bandwidth on bounced visitors.
  4. High return rate — Dev tools have repeat users. Someone who transcribes once usually transcribes again.

The effective cost per session drops dramatically over time.

Technical Deep Dive: Audio Processing Pipeline

Raw audio files can't go directly into a speech model. Here's the pipeline:

Audio File (MP3/WAV/M4A/OGG/WebM/MP4)
    ↓
Web Audio API decode
    ↓
Convert stereo → mono
    ↓
Resample to 16kHz
    ↓
ML Model inference (30s chunks)
    ↓
Text + timestamps
    ↓
Plain text or SRT subtitles

Why 16kHz?

Speech recognition models are trained on 16kHz audio. Higher sample rates waste compute without improving accuracy. We resample everything down.

Why mono?

Same reason. Stereo doesn't help speech recognition, and it doubles the data.
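
The downmix itself is just channel averaging. A sketch (the pure function below runs anywhere; in the browser the channel data would come from the Web Audio API, as shown in the trailing comment):

```typescript
// Downmix: average all channels into a single mono track.
// Stereo carries no extra information for speech models, and mono halves the data.
function mixToMono(channels: Float32Array[]): Float32Array {
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    let sum = 0;
    for (const channel of channels) sum += channel[i];
    mono[i] = sum / channels.length;
  }
  return mono;
}

// In the browser, roughly:
//   const buffer = await new AudioContext().decodeAudioData(await file.arrayBuffer());
//   const channels = [...Array(buffer.numberOfChannels)].map((_, c) => buffer.getChannelData(c));
//   const mono = mixToMono(channels); // then resample buffer.sampleRate -> 16kHz
```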

The resampling code

```typescript
function resampleAudio(
  audioData: Float32Array,
  sourceSampleRate: number,
  targetSampleRate: number = 16000
): Float32Array {
  const ratio = sourceSampleRate / targetSampleRate;
  const newLength = Math.round(audioData.length / ratio);
  const result = new Float32Array(newLength);

  // Linear interpolation between the two nearest source samples
  for (let i = 0; i < newLength; i++) {
    const srcIndex = i * ratio;
    const floor = Math.floor(srcIndex);
    const ceil = Math.min(floor + 1, audioData.length - 1);
    const t = srcIndex - floor;
    result[i] = audioData[floor] * (1 - t) + audioData[ceil] * t;
  }

  return result;
}
```

Simple linear interpolation. Not audiophile quality, but perfectly adequate for speech recognition.

Constraints We Chose

10-minute limit

We cap audio at 10 minutes. Why?

  1. Memory — Longer audio requires more RAM. Browser tabs have limits.
  2. UX — Processing 30+ minutes of audio with a progress bar isn't great.
  3. Scope — For longer content, desktop tools like Whisper.cpp are better suited.

10 minutes handles most real use cases: voice notes, meeting snippets, interview clips.
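
Both the cap and the 30-second inference chunking mentioned in the pipeline above are easy to express. A sketch, assuming 16kHz mono samples (the constant and function names are illustrative):

```typescript
const MAX_SECONDS = 600;  // the 10-minute cap
const CHUNK_SECONDS = 30; // inference window per model call
const SAMPLE_RATE = 16000;

// Reject over-long audio, then split into fixed-size windows for inference.
function splitIntoChunks(samples: Float32Array): Float32Array[] {
  if (samples.length > MAX_SECONDS * SAMPLE_RATE) {
    throw new Error("Audio exceeds the 10-minute limit");
  }
  const chunkSize = CHUNK_SECONDS * SAMPLE_RATE;
  const chunks: Float32Array[] = [];
  for (let start = 0; start < samples.length; start += chunkSize) {
    // subarray clamps past-the-end, so the last chunk may be shorter
    chunks.push(samples.subarray(start, start + chunkSize));
  }
  return chunks;
}
```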

No real-time streaming

We process complete files, not live microphone input. Real-time adds complexity:

  • Chunking strategy
  • Partial result handling
  • UI for "listening" state
  • Permission flows

Processing a complete file is simpler and covers 90% of use cases.

English-focused models

Our default models are English-only. Multilingual Whisper exists but:

  • 2-3x larger
  • Slower
  • Requires language detection or selection

We may add multilingual support later, but started focused.

What We Learned

WebGPU is ready (mostly)

Chrome and Edge support WebGPU well. Firefox is catching up. Safari is... Safari.

We fall back to WASM automatically. It's slower (3-5x) but works everywhere.
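
The fallback check is straightforward: WebGPU-capable browsers expose `navigator.gpu`. A sketch (the helper name is ours, not a library API):

```typescript
type Device = "webgpu" | "wasm";

// WebGPU-capable browsers define navigator.gpu; everything else gets WASM.
function pickDevice(nav: { gpu?: unknown }): Device {
  return nav.gpu ? "webgpu" : "wasm";
}

// In the browser: pickDevice(navigator), then pass the result as the
// `device` option when creating the Transformers.js pipeline.
```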

Model loading UX matters

A 75MB download with no feedback feels broken. We show:

  • Download progress percentage
  • "First-time only" messaging
  • Clear model size upfront

Setting expectations prevents frustration.
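
A sketch of the percentage math behind that progress bar. The wiring in the trailing comment assumes Transformers.js's `progress_callback` option reports loaded/total byte counts, which is how we understand it to behave:

```typescript
// Convert byte-level download progress into a whole-number percentage.
function progressPercent(loadedBytes: number, totalBytes: number): number {
  if (totalBytes <= 0) return 0; // total unknown: show nothing rather than NaN
  return Math.min(100, Math.round((loadedBytes / totalBytes) * 100));
}

// Wiring it up when creating the pipeline (browser):
//   pipeline("automatic-speech-recognition", modelId, {
//     progress_callback: (e) => updateBar(progressPercent(e.loaded, e.total)),
//   });
```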

Mobile doesn't work (yet)

We had to disable mobile entirely. The models crash mobile browsers due to memory limits:

  • Mobile Safari/Chrome cap tabs at ~256-512MB
  • Model loading alone needs 50-75MB
  • Audio decoding creates large Float32Arrays
  • Inference needs additional working memory for tensors

All together, this exceeds what mobile browsers allow before killing the tab. We tried — it crashes.

Rather than a broken experience, we show a clear message explaining why and suggest alternatives (native apps like Google Recorder on Android, or transferring files to a computer).

The honest answer: browser-based ML transcription on mobile isn't viable with current model sizes. Maybe someday with smaller models or WebNN support.

Results

The tool works. Users can:

  1. Download a 50-75MB model (once)
  2. Drop in an audio file
  3. Get text + SRT subtitles
  4. Copy or download results

Zero data leaves their device. Zero ongoing cost per transcription.
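
For the curious, the SRT output step is plain string formatting over the model's timestamps. A sketch (the `Segment` shape is illustrative, not the library's exact output type):

```typescript
interface Segment {
  start: number; // seconds
  end: number;   // seconds
  text: string;
}

// SRT timestamps look like 00:00:03,500 (comma before milliseconds).
function toSrtTime(seconds: number): string {
  const ms = Math.round(seconds * 1000);
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  const rem = ms % 1000;
  const pad = (n: number, w = 2) => String(n).padStart(w, "0");
  return `${pad(h)}:${pad(m)}:${pad(s)},${pad(rem, 3)}`;
}

// Each segment becomes a numbered subtitle block.
function toSrt(segments: Segment[]): string {
  return segments
    .map((seg, i) => `${i + 1}\n${toSrtTime(seg.start)} --> ${toSrtTime(seg.end)}\n${seg.text}\n`)
    .join("\n");
}
```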

What's Next

We're exploring:

  • Translation — Same approach, different model
  • Summarization — Process the transcript with a small LLM
  • Speaker diarization — Who said what

All browser-based. All private.

The era of "upload your sensitive data to our servers" is ending. Good riddance.

Try Private Transcription