Free Voice to Text — Open-Source STT Models (No Credit Card) 2025-2026

⚠️ Data current as of June 2026, gathered from public model cards, papers & leaderboards (see footer). This is the fastest-moving area in AI — params/quotas/ranks drift weekly. Re-verify before committing. All models below are open-source/open-weights and run free with no credit card when self-hosted.


Model	Type	Params / Size	Languages (CN/JP)	Audio window	Speed	Latency / rate limits	Accuracy	Key limitations	License	Free run path (no card)	Stack fit (A: Py·Flutter / B: Java·RN / Web)
VibeVoice-ASR ★ Microsoft · Jan 2026 · frontier · long-form	ASR (long-form, structured)	~1.5B class; 7.5 Hz tokenizer	50+ langs CN JP	Up to 60 min single-pass 🏆 (no chunking needed)	Efficient (low frame-rate)	Batch long-form (not sub-second live)	High; outputs Who+When+What (speaker + timestamps)	New/early tooling; GPU for speed; long-form batch focus	Open (MIT)	Self-host; HF Transformers (since Mar 2026); HF Spaces	🐍 A: native · ☕ B: via API/Py-sidecar · 🌐 Web: via backend
Qwen3-ASR ★ Alibaba Qwen · Jan 2026 · frontier · CN dialects · streaming	ASR + ForcedAligner	1.7B & 0.6B (1.7B = Qwen3-1.7B + 300M AuT encoder)	52 langs + 22 Chinese dialects CN★ JP	Dynamic 1–8 s streaming window + offline long-form	🚀 0.6B ≈ 2000 s of audio per 1 s (concurrency 128)	TTFT ≈ 92 ms (excellent live)	High multilingual; LID + timestamps	Best speed needs GPU; newer ecosystem	Open source (Apache-2.0)	Self-host; HF; ModelScope	🐍 A: native · ☕ B: API/sidecar · 📱 0.6B edge-capable · 🌐 ONNX possible
Voxtral (Mini / Small / Realtime) ★ Mistral AI · Jul 2025 → Transcribe 2, Feb 2026 · frontier · most accurate open	ASR + translate + QA + summarize	Mini ~3B (Ministral 3B); Small 24B (Mistral Small 3.1)	Many; Transcribe 2 = 13 langs w/ diarization; CN JP (less CN-tuned)	Long-form context	Fast; Mini = edge-friendly	Realtime: sub-200 ms live; word timestamps	🏆 Voxtral Small = most accurate open-weights (AA-WER 2.9%)	Small (24B) heavy; CN/JP not its focus	Open weights (Apache-2.0)	Self-host; HF; `voxtral.c` (pure C, no Python)	🐍 A: native · ☕ B: C/JNI or API · 📱 Mini/C edge · 🌐 via backend
Kimi-Audio-7B-Instruct Moonshot AI · Apr 2025 · frontier · CN strong	Audio foundation (ASR+AQA+TTS+dialogue)	7B (Qwen2.5-7B + Whisper encoder)	Multilingual; CN★ JP	Standard (chunked)	Moderate (7B → GPU)	Conversational; not sub-second	Leading on open-source ASR benchmarks (trained on 13M hrs)	Heavy (7B GPU); broad model, not edge	Open weights + training code	Self-host; HF	🐍 A: native · ☕ B: API/sidecar · 📱 no (too heavy)
GLM-4-Voice-9B Zhipu AI (z.ai) · Oct 2024 · frontier · CN+EN	End-to-end speech LLM (voice chat), ASR-capable	9B; 12.5 Hz tokenizer (from Whisper)	CN★ + EN only	Conversational	Moderate (9B GPU); streaming decoder	Low-latency dialogue (starts at ~10 tokens)	Good (chatbot-oriented, not pure transcription)	Speech-dialogue focus, not best for plain transcripts; CN+EN only	Code Apache-2.0; weights = GLM Model License (free)	Self-host; HF	🐍 A: native · ☕ B: API · 📱 no
Qwen3-Omni-30B-A3B Alibaba Qwen · Sep 2025 · frontier · omni-modal	Omni LLM (text+audio+image+video, real-time speech)	30B MoE (~3B active)	Multilingual; CN★ JP	Long multimodal context	Real-time speech generation	Real-time capable (high-end GPU)	Very high (frontier omni)	Overkill for pure STT; needs strong GPU/VRAM	Open source (Apache-2.0)	Self-host; HF	🐍 A: native · ☕ B: API · 📱 no
Whisper (tiny→large-v3 / turbo) ★ OpenAI · 2022–2024 · 99+ langs · free on Groq/HF	ASR + translate	39M / 74M / 244M / 769M / 1.55B; turbo 809M	99+; CN JP (very good)	30 s sliding window (chunk longer audio)	tiny ≈10× faster than large; turbo ≈8× faster than large-v3	Batch (live via WhisperX/streaming wrappers)	High (large-v3 best multilingual)	30 s window → chunking; hallucinates on silence (use VAD)	Open (MIT)	Self-host (`faster-whisper`/`whisper.cpp`); Groq free; HF	🐍 A: native · ☕ B: Groq API / whisper.cpp-JNI · 📱 ggml on-device · 🌐 transformers.js/WASM
NVIDIA Canary-Qwen 2.5B NVIDIA · 2025 top EN accuracy	ASR (LLM decoder)	2.5B	English (Canary-1B: EN/DE/ES/FR + translate)	~40 s	Slower (LLM decoder)	Batch	🏆 ~5.63% WER (near top Open ASR Leaderboard)	English-centric; heavy; NeMo toolkit	Open (NVIDIA OpenModel / CC-BY)	Self-host (NeMo); HF	🐍 A: native · ☕ B: API/sidecar · 📱 no
NVIDIA Parakeet TDT 0.6B v2 NVIDIA · 2024–25 fastest	ASR (streaming-capable)	0.6B / 1.1B	English (some multiling variants)	Streaming-capable	🚀 RTFx > 2000 (≈6.5× Canary)	Real-time	High (EN); 10th on Open ASR Leaderboard	Mostly English; needs NeMo/GPU	Open (CC-BY)	Self-host (NeMo); HF	🐍 A: native · ☕ B: sidecar · 📱 limited
IBM Granite Speech 4.1 2B / 3.3 8B IBM · 2025 SOTA accuracy	ASR	2B / 8B	English-focused (+ some)	Standard	Slower (8B)	Batch	🏆 5.33% WER (4.1 2B = best open on Open ASR Leaderboard)	English-centric; heavy	Open (Apache-2.0)	Self-host; HF	🐍 A: native · ☕ B: sidecar · 📱 no
Meta MMS Meta · 2023 1000+ langs	ASR (massively multilingual)	~1B	1,000+ languages; CN JP	Standard	Medium	Batch	Good (esp. rare/low-resource langs)	Setup complexity; per-lang adapters	Open (CC-BY-NC for some / mixed)	Self-host; HF	🐍 A: native · ☕ B: sidecar · 📱 no
Meta SeamlessM4T v2 Meta · 2023–24 ASR+translate	ASR + speech↔text translation	2.3B	~100; CN JP	Standard	Medium	Batch	Good; strong for translation	Large; GPU recommended; license restrictions	Open (CC-BY-NC — non-commercial)	Self-host; HF	🐍 A: native · ☕ B: sidecar · 📱 no
Wav2Vec 2.0 / HuBERT Meta · 2020–21 fine-tunable	ASR (self-supervised)	95M – 1B	EN + any (fine-tune per lang)	Standard	Fast	Batch / streaming variants	Good when fine-tuned	No punctuation/casing by default; needs fine-tuning	Open (MIT / Apache)	Self-host; HF	🐍 A: native · ☕ B: sidecar · 🌐 transformers.js
Vosk ★ Alpha Cephei on-device	ASR (streaming, offline)	~50 MB small → 1.8 GB big	20+; EN, CN, RU, ES, JP…	Streaming (zero-latency)	Real-time on CPU/phone	Live partials, fully offline	Small = medium; big = good	Per-language model; small models less accurate	Open (Apache-2.0)	Self-host / fully on-device (no server)	🐍 A: native · ☕ B: native Java API · 📱 Android+iOS SDK · 🌐 WASM
Moonshine (v2, 2026) ★ Moonshine AI · 2024→2026 edge real-time	ASR (streaming, edge)	Tiny / Base (small)	EN, ES, Mandarin CN, JP, KO, VI, UK, AR	Streaming	🚀 Beats Whisper tiny/base; 5× faster on edge	< 200 ms latency	High for its size	Fewer languages than Whisper	Open (MIT)	Self-host / on-device (runs on MCU→phone→server)	🐍 A: native · ☕ B: C/bindings · 📱 RN+Flutter on-device · 🌐 WASM
Silero STT Silero lightweight	ASR (streaming)	Small	EN, DE, ES, others	Streaming	Fast on CPU	Live	Medium	Fewer languages/updates	Open (CC-BY-NC / mixed)	Self-host / on-device	🐍 A: native · 📱 mobile · 🌐 ONNX
SenseVoice (Small) ★ Alibaba FunAudioLLM CN+JP fast	ASR + emotion + audio-event (non-autoregressive)	Small	CN★ Cantonese EN JP KO (50+)	Utterance-level (≤~30 s best)	🚀 ~15× faster than Whisper-large	Low latency (batch + real-time)	Excellent CN, strong JP	Best on short clips; via FunASR	Open source	Self-host; HF; ModelScope	🐍 A: native · ☕ B: sidecar · 🌐 ONNX
Paraformer Alibaba DAMO (FunASR) Mandarin	ASR (non-autoregressive)	Medium	CN★ Mandarin (EN variants)	Utterance / streaming	🚀 Fast (parallel decode)	Streaming + batch	🏆 Excellent Mandarin (60k hrs)	Mandarin-first; via FunASR toolkit	Open source	Self-host; ModelScope; HF	🐍 A: native · ☕ B: sidecar · 🌐 ONNX
FunASR (toolkit) Alibaba DAMO pipeline	Framework: ASR + VAD + punctuation + diarization	—	CN / EN (hosts SenseVoice, Paraformer…)	Streaming + batch	Fast	Streaming	Excellent (CN)	It's a toolkit — pick a model inside	Open source	Self-host	🐍 A: native · ☕ B: sidecar
Kotoba-Whisper (v2 / bilingual) ★ Kotoba-tech JP best	ASR (distilled Whisper large-v3)	distilled (~small)	JP★ + EN	30 s	~6× faster than large-v3	Batch	🏆 Excellent Japanese	JP/EN focus only	Open (MIT-style)	Self-host; HF	🐍 A: native · ☕ B: sidecar · 📱 ggml possible
ReazonSpeech Reazon Human Interaction Lab JP	ASR (JP; JP→EN translate)	Medium	JP★	Standard	Fast	Batch	Excellent Japanese	Japanese-centric	Open source	Self-host; HF	🐍 A: native · ☕ B: sidecar
Groq — Whisper large-v3 / turbo ★ Groq (hosted OSS model) no card to start	Hosted ASR API (runs OSS Whisper)	n/a (you call API)	99+; CN JP	File-based	🚀 Very fast (LPU)	Free: ~2000 req/day, 7200 audio-sec/hr	High (Whisper large-v3)	Batch only; ~25 MB file cap; per-day quota	Model = MIT; service free tier	✔ Free API, no credit card to start	🐍 A: HTTP · ☕ B: HTTP (ideal — no AI libs) · 🌐 via backend
Hugging Face Inference (Serverless) Hugging Face no card	Hosted endpoints for OSS models (Whisper, etc.)	model-dependent	model-dependent	model-dependent	Cold starts; variable	Small free monthly credits / rate-limited	model-dependent	Tight free limits; cold-start latency	varies by model	✔ Free tier, no card to start	🐍 A: HTTP · ☕ B: HTTP · 🌐 via backend
Web Speech API Browser (Chrome/Edge) zero setup	Browser-native ASR (not OSS, but free)	n/a	many (Chrome→Google)	Streaming	Real-time	Live partials, no key	Medium-good (varies)	Chrome sends audio to Google; inconsistent across browsers; not for private audio	Browser API (not open-source)	✔ Free, no key, no card	🌐 Web only (great Phase-0 demo); 📱 limited
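
Several rows above (Whisper, Kotoba-Whisper) list a 30 s window, which means longer recordings have to be split before transcription. A minimal sketch of the window arithmetic follows; the function name `plan_chunks` and the 2 s overlap are illustrative assumptions, not values taken from any model card.

```python
from typing import List, Tuple


def plan_chunks(
    duration_s: float, window_s: float = 30.0, overlap_s: float = 2.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds that cover the whole recording.

    Consecutive windows overlap by `overlap_s` (an assumed value) so that a
    word cut at one boundary appears whole in the next window.
    """
    if window_s <= 0 or not 0 <= overlap_s < window_s:
        raise ValueError("need window_s > 0 and 0 <= overlap_s < window_s")
    chunks: List[Tuple[float, float]] = []
    step = window_s - overlap_s
    start = 0.0
    while start < duration_s:
        end = min(start + window_s, duration_s)
        chunks.append((start, end))
        if end >= duration_s:
            break  # the last window already reaches the end of the audio
        start += step
    return chunks
```

For a 70 s clip this yields three windows: 0–30 s, 28–58 s, and 56–70 s. Deduplicating the overlapping text and running a VAD pass first (the table's suggested guard against hallucination on silence) are left out of the sketch.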

How to read "Stack fit"

🐍 Option A — Python backend (+ Flutter)

Every self-hostable model in the table runs natively in Python, either through Hugging Face transformers or through the model's own repo, which makes this the smoothest path for self-hosting. Flutter calls your Python API over HTTPS.
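
As a rough illustration of that backend shape, here is a minimal sketch using only the Python standard library. The `/transcribe` path, the `make_server` helper, and the injected `transcribe` callable are all hypothetical names; a real service would plug in an actual model, and TLS would typically be handled by a reverse proxy in front of it.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable


def make_handler(transcribe: Callable[[bytes], str]):
    """Build a request handler that delegates audio bytes to `transcribe`."""

    class TranscribeHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            if self.path != "/transcribe":  # hypothetical route
                self.send_error(404, "unknown path")
                return
            length = int(self.headers.get("Content-Length", 0))
            audio = self.rfile.read(length)  # raw audio bytes from the client
            body = json.dumps({"text": transcribe(audio)}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # silence per-request logging in this sketch

    return TranscribeHandler


def make_server(transcribe: Callable[[bytes], str],
                host: str = "127.0.0.1", port: int = 0) -> ThreadingHTTPServer:
    """Create the server; port 0 lets the OS pick a free port."""
    return ThreadingHTTPServer((host, port), make_handler(transcribe))
```

Calling `make_server(my_transcribe, port=8000).serve_forever()` would start it, where `my_transcribe` is whatever function wraps the chosen model. The same sidecar also fits the Java option, since Spring Boot can call it over HTTP.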

☕ Option B — Java backend (+ React Native)

Java has few native libraries for running these models. Three integration paths: (1) call a free hosted API like Groq over HTTP (easiest), (2) run a tiny Python ASR microservice beside Spring Boot, or (3) use native bindings: Vosk has a Java API, and whisper.cpp or voxtral.c can be called via JNI.

🌐 React Web

The web client does not run heavy models itself; it records audio and sends it to the backend. Exceptions that run in the browser: Web Speech API, and small models via transformers.js/ONNX/WASM (Whisper-tiny, Moonshine, Vosk-WASM).

📱 On-device mobile (offline)

For true offline mobile, use Vosk (Android+iOS SDK), Moonshine (RN/Flutter), or whisper.cpp ggml. Larger models (7B+) are too heavy for phones; call the backend instead.

Recommended picks

MVP (free, no card)

Groq Whisper large-v3: fast, with a free tier of roughly 2000 requests/day and 7200 audio-seconds/hour, 99+ languages incl. CN/JP, and just an HTTP call (works the same from Python or Java).
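
To make "just an HTTP call" concrete, the sketch below builds (but does not send) a multipart upload to Groq's OpenAI-compatible transcription endpoint using only the standard library. The helper name `build_request`, the `User-Agent` string, and the exact byte value used for the ~25 MB cap are assumptions; the request has not been exercised against the live service here, so check Groq's current docs before relying on it.

```python
import urllib.request
import uuid

GROQ_URL = "https://api.groq.com/openai/v1/audio/transcriptions"
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # approximates the ~25 MB cap in the table


def build_request(audio: bytes, filename: str, api_key: str,
                  model: str = "whisper-large-v3") -> urllib.request.Request:
    """Assemble a multipart/form-data POST carrying the model name and file."""
    if len(audio) > MAX_UPLOAD_BYTES:
        raise ValueError("file exceeds the upload cap; split it first")
    boundary = uuid.uuid4().hex
    parts = [
        # form field: which hosted model to run
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode("utf-8"),
        # form field: the audio file itself
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode("utf-8"),
        audio,
        f"\r\n--{boundary}--\r\n".encode("utf-8"),
    ]
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "User-Agent": "stt-demo/0.1",  # hypothetical client identifier
    }
    return urllib.request.Request(
        GROQ_URL, data=b"".join(parts), headers=headers, method="POST"
    )
```

Sending it would look like `json.load(urllib.request.urlopen(req))["text"]`, assuming the response follows the OpenAI-style JSON shape. A Java backend can issue the same multipart POST with its own HTTP client, which is why this path needs no AI libraries.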

Self-host fallback / privacy

faster-whisper (large-v3-turbo on GPU, INT8 small on CPU): no quota and no usage fees beyond your own hardware.
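
The GPU/CPU split above can be sketched as a small configuration helper. `whisper_config` and `load_model` are hypothetical names, `float16` on GPU is an assumption (the text only specifies INT8 for CPU), and the `large-v3-turbo` model name assumes a recent faster-whisper release.

```python
from typing import Dict


def whisper_config(has_gpu: bool) -> Dict[str, str]:
    """Pick faster-whisper settings following the recommendation above."""
    if has_gpu:
        return {
            "model_size_or_path": "large-v3-turbo",
            "device": "cuda",
            "compute_type": "float16",  # assumed; not stated in the text
        }
    return {
        "model_size_or_path": "small",
        "device": "cpu",
        "compute_type": "int8",
    }


def load_model(has_gpu: bool):
    """Load the model lazily so this module imports without faster-whisper."""
    from faster_whisper import WhisperModel  # requires `pip install faster-whisper`

    return WhisperModel(**whisper_config(has_gpu))
```

Typical use would be `segments, info = load_model(False).transcribe("audio.wav", vad_filter=True)`, then iterating over `segments` for `start`, `end`, and `text`. The `vad_filter=True` flag addresses the hallucination-on-silence limitation noted in the Whisper row.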

Best new multilingual + CN/JP

Qwen3-ASR (52 langs + 22 CN dialects, ~92 ms TTFT) or VibeVoice-ASR (60-min single-pass + diarization).

CN / JP specialists

SenseVoice (CN+JP, ~15× faster than Whisper-large) · Paraformer (Mandarin) · Kotoba-Whisper (JP).
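
If a backend serves several languages, these picks could be wired into a simple routing table. This is only a sketch: the `PICKS` mapping, the `pick_model` helper, and the choice of Qwen3-ASR as the default are assumptions drawn from the picks on this page, and the values are display names rather than repository IDs.

```python
from typing import Dict

# Language code -> model name as used on this page (not a repo ID).
PICKS: Dict[str, str] = {
    "zh": "SenseVoice",       # Paraformer is the Mandarin-only alternative
    "yue": "SenseVoice",      # Cantonese is listed in the SenseVoice row
    "ja": "Kotoba-Whisper",
}
DEFAULT_PICK = "Qwen3-ASR"    # assumed general multilingual fallback


def pick_model(lang_code: str) -> str:
    """Map a tag like 'ja-JP' to a model name, ignoring the region part."""
    primary = lang_code.lower().split("-")[0]
    return PICKS.get(primary, DEFAULT_PICK)
```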

Offline mobile

Vosk (~50 MB small model, Java+Android+iOS) or Moonshine (<200 ms latency on edge devices).

Most accurate open-weights

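Voxtral Small (2.9% AA-WER) · IBM Granite Speech 4.1 2B (5.33% WER, Open ASR Leaderboard) · Canary-Qwen 2.5B (5.63% WER, Open ASR Leaderboard). AA-WER and Open ASR Leaderboard WER come from different benchmarks, so the figures are not directly comparable.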
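
For readers unfamiliar with the metric behind these figures, word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a plain implementation for illustration; leaderboards typically normalize casing and punctuation before scoring, which this version does not do.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (row-by-row Levenshtein).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(
                prev[j] + 1,         # deletion: reference word missing
                cur[j - 1] + 1,      # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = cur
    return prev[-1] / len(ref)
```

As a worked case, the reference "the cat sat on the mat" against the hypothesis "the cat sit on mat" has one substitution and one deletion over six reference words, so the WER is 2/6, about 33%.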