Made with ❀ by Muhammad Waqar

🎙️ Voice to Text: Free & Open-Source STT Models (No Credit Card) 2025-2026

Curated catalog of free, open-source / open-weights voice-to-text (ASR) models that can be run with no credit card, either self-hosted (your CPU/GPU) or via a free no-card host (Hugging Face, Groq). Includes the latest 2025–2026 frontier audio models (VibeVoice-ASR, Qwen3-ASR, Voxtral, Kimi-Audio, GLM-4-Voice). Columns: parameters, window size, speed, response rate (latency), accuracy, limitations, license, free run path, stack compatibility.
Legend: ★ = top pick for this project · CN = strong Chinese · JP = strong Japanese · ✔ = open / no card · Window = max audio per pass · RTFx = × faster than real time · TTFT = time-to-first-token
⚠️ Data current as of June 2026, gathered from public model cards, papers & leaderboards (see footer). This is the fastest-moving area in AI: parameter counts, quotas, and ranks drift weekly. Re-verify before committing. All models below are open-source/open-weights and run free with no credit card when self-hosted.
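The RTFx figures quoted in this catalog can be read as "seconds of audio processed per second of compute". The sketch below shows that arithmetic. The function names and the 10-hour workload are illustrative assumptions, not taken from any model card. Note that a figure measured at high concurrency (such as the Qwen3-ASR number) describes aggregate throughput, not the speed of a single stream.

```python
def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio handled per second of compute."""
    if processing_seconds <= 0:
        raise ValueError("processing_seconds must be positive")
    return audio_seconds / processing_seconds


def hours_to_transcribe(audio_hours: float, rtfx_value: float) -> float:
    """Wall-clock hours needed for a backlog at a given RTFx."""
    if rtfx_value <= 0:
        raise ValueError("rtfx_value must be positive")
    return audio_hours / rtfx_value


# Figure quoted for Qwen3-ASR 0.6B: about 2000 s of audio per 1 s of wall-clock time.
print(rtfx(2000, 1))               # 2000.0

# Hypothetical workload: 10 hours of audio on a setup that reaches RTFx = 8.
print(hours_to_transcribe(10, 8))  # 1.25
```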
Model | Type | Params / Size | Languages (CN/JP) | Window size | Speed | Response rate / latency | Accuracy | Key limitations | License | Free run path (no card) | Stack fit (A: Py·Flutter / B: Java·RN / Web)
VibeVoice-ASR ★ (Microsoft · Jan 2026) [frontier, long-form] | ASR (long-form, structured) | ~1.5B class; 7.5 Hz tokenizer | 50+ langs; CN | Up to 60 min single-pass 🏆 (no chunking needed) | Efficient (low frame-rate) | Batch long-form (not sub-second live) | High; outputs Who+When+What (speaker + timestamps) | New/early tooling; GPU for speed; long-form batch focus | Open (MIT) | Self-host; HF Transformers (since Mar 2026); HF Spaces | 🐍 A: native · ☕ B: via API/Py-sidecar · 🌐 Web: via backend
Qwen3-ASR ★ (Alibaba Qwen · Jan 2026) [frontier, CN dialects, streaming] | ASR + ForcedAligner | 1.7B & 0.6B; 1.7B = Qwen3-1.7B + 300M AuT enc | 52 langs + 22 Chinese dialects; CN★ | Dynamic 1–8 s window; streaming + offline long | 🚀 0.6B ≈ 2000 s audio / 1 s (concurrency 128) | TTFT ≈ 92 ms (excellent live) | High multilingual; LID + timestamps | Best speed needs GPU; newer ecosystem | Open source (Apache-2.0) | Self-host; HF; ModelScope | 🐍 A: native · ☕ B: API/sidecar · 📱 0.6B edge-capable · 🌐 ONNX possible
Voxtral (Mini / Small / Realtime) ★ (Mistral AI · Jul 2025 → Transcribe 2 Feb 2026) [frontier, most accurate open] | ASR + translate + QA + summarize | Mini ~3B (Ministral 3B); Small 24B (Mistral Small 3.1) | Many; Transcribe V2 = 13 langs w/ diarization; CN (less CN-tuned) | Long-form context | Fast; Mini = edge-friendly | Realtime: sub-200 ms live; word timestamps | 🏆 Voxtral Small = most accurate open-weights (AA-WER 2.9%) | Small (24B) heavy; CN/JP not its focus | Open weights (Apache-2.0) | Self-host; HF; voxtral.c (pure C, no Python) | 🐍 A: native · ☕ B: C/JNI or API · 📱 Mini/C edge · 🌐 via backend
Kimi-Audio-7B-Instruct (Moonshot AI · Apr 2025) [frontier, CN strong] | Audio foundation (ASR+AQA+TTS+dialogue) | 7B; Qwen2.5-7B + Whisper enc | Multilingual; CN★ | Standard (chunked) | Moderate (7B → GPU) | Conversational; not sub-second | Leading open-source ASR benchmarks (13M hrs trained) | Heavy (7B GPU); broad model, not edge | Open weights + training code | Self-host; HF | 🐍 A: native · ☕ B: API/sidecar · 📱 no (too heavy)
GLM-4-Voice-9B (Zhipu AI (z.ai) · Oct 2024) [frontier, CN+EN] | End-to-end speech LLM (voice chat), ASR-capable | 9B; 12.5 Hz tokenizer (from Whisper) | CN★ + EN only | Conversational | Moderate (9B GPU); streaming decoder | Low-latency dialogue (starts at ~10 tokens) | Good (chatbot-oriented, not pure transcription) | Speech-dialogue focus, not best for plain transcripts; CN+EN only | Code Apache-2.0; weights = GLM Model License (free) | Self-host; HF | 🐍 A: native · ☕ B: API · 📱 no
Qwen3-Omni-30B-A3B (Alibaba Qwen · 2026) [frontier, omni-modal] | Omni LLM (text+audio+image+video, real-time speech) | 30B MoE (~3B active) | Multilingual; CN★ | Long multimodal context | Real-time speech generation | Real-time capable (high-end GPU) | Very high (frontier omni) | Overkill for pure STT; needs strong GPU/VRAM | Open source (Apache-2.0) | Self-host; HF | 🐍 A: native · ☕ B: API · 📱 no
Whisper (tiny→large-v3 / turbo) ★ (OpenAI · 2022–2024) [99+ langs, free on Groq/HF] | ASR + translate | 39M / 74M / 244M / 769M / 1.55B; turbo 809M | 99+; CN (very good) | 30 s sliding (chunk longer) | tiny ≈10× large; turbo ≈8× large-v3 | Batch (live via WhisperX/streaming wrappers) | High (large-v3 best multilingual) | 30 s window → chunking; hallucinates on silence (use VAD) | Open (MIT) | Self-host (faster-whisper/whisper.cpp); Groq free; HF | 🐍 A: native · ☕ B: Groq API / whisper.cpp-JNI · 📱 ggml on-device · 🌐 transformers.js/WASM
NVIDIA Canary-Qwen 2.5B (NVIDIA · 2025) [top EN accuracy] | ASR (LLM decoder) | 2.5B | English (Canary-1B: EN/DE/ES/FR + translate) | ~40 s | Slower (LLM decoder) | Batch | 🏆 ~5.63% WER (near top Open ASR Leaderboard) | English-centric; heavy; NeMo toolkit | Open (NVIDIA OpenModel / CC-BY) | Self-host (NeMo); HF | 🐍 A: native · ☕ B: API/sidecar · 📱 no
NVIDIA Parakeet TDT 0.6B v2 (NVIDIA · 2024–25) [fastest] | ASR (streaming-capable) | 0.6B / 1.1B | English (some multilingual variants) | Streaming-capable | 🚀 RTFx > 2000 (≈6.5× Canary) | Real-time | High (EN); 10th on Open ASR Leaderboard | Mostly English; needs NeMo/GPU | Open (CC-BY) | Self-host (NeMo); HF | 🐍 A: native · ☕ B: sidecar · 📱 limited
IBM Granite Speech 4.1 2B / 3.3 8B (IBM · 2025) [SOTA accuracy] | ASR | 2B / 8B | English-focused (+ some) | Standard | Slower (8B) | Batch | 🏆 5.33% WER (4.1 2B = best open on Open ASR Leaderboard) | English-centric; heavy | Open (Apache-2.0) | Self-host; HF | 🐍 A: native · ☕ B: sidecar · 📱 no
Meta MMS (Meta · 2023) [1000+ langs] | ASR (massively multilingual) | ~1B | 1,000+ languages; CN | Standard | Medium | Batch | Good (esp. rare/low-resource langs) | Setup complexity; per-lang adapters | Open (CC-BY-NC for some / mixed) | Self-host; HF | 🐍 A: native · ☕ B: sidecar · 📱 no
Meta SeamlessM4T v2 (Meta · 2023–24) [ASR+translate] | ASR + speech↔text translation | 2.3B | ~100; CN | Standard | Medium | Batch | Good; strong for translation | Large; GPU recommended; license restrictions | Open (CC-BY-NC, non-commercial) | Self-host; HF | 🐍 A: native · ☕ B: sidecar · 📱 no
Wav2Vec 2.0 / HuBERT (Meta · 2020–21) [fine-tunable] | ASR (self-supervised) | 95M – 1B | EN + any (fine-tune per lang) | Standard | Fast | Batch / streaming variants | Good when fine-tuned | No punctuation/casing by default; needs fine-tuning | Open (MIT / Apache) | Self-host; HF | 🐍 A: native · ☕ B: sidecar · 🌐 transformers.js
Vosk ★ (Alpha Cephei) [on-device] | ASR (streaming, offline) | ~50 MB small → 1.8 GB big | 20+; EN, CN, RU, ES, … | Streaming (zero-latency) | Real-time on CPU/phone | Live partials, fully offline | Small = medium; big = good | Per-language model; small models less accurate | Open (Apache-2.0) | Self-host / fully on-device (no server) | 🐍 A: native · ☕ B: native Java API · 📱 Android+iOS SDK · 🌐 WASM
Moonshine (v2, 2026) ★ (Moonshine AI · 2024→2026) [edge real-time] | ASR (streaming, edge) | Tiny / Base (small) | EN, ES, Mandarin CN, JP, KO, VI, UK, AR | Streaming | 🚀 Beats Whisper tiny/base; 5× faster on edge | < 200 ms latency | High for its size | Fewer languages than Whisper | Open (MIT) | Self-host / on-device (runs on MCU→phone→server) | 🐍 A: native · ☕ B: C/bindings · 📱 RN+Flutter on-device · 🌐 WASM
Silero STT (Silero) [lightweight] | ASR (streaming) | Small | EN, DE, ES, others | Streaming | Fast on CPU | Live | Medium | Fewer languages/updates | Open (CC-BY-NC / mixed) | Self-host / on-device | 🐍 A: native · 📱 mobile · 🌐 ONNX
SenseVoice (Small) ★ (Alibaba FunAudioLLM) [CN+JP fast] | ASR + emotion + audio-event (non-autoregressive) | Small | CN★, JP, Cantonese, EN, KO (50+) | Utterance-level (≤~30 s best) | 🚀 ~15× faster than Whisper-large | Low latency (batch + real-time) | Excellent CN, strong JP | Best on short clips; via FunASR | Open source | Self-host; HF; ModelScope | 🐍 A: native · ☕ B: sidecar · 🌐 ONNX
Paraformer (Alibaba DAMO (FunASR)) [Mandarin] | ASR (non-autoregressive) | Medium | CN★ Mandarin (EN variants) | Utterance / streaming | 🚀 Fast (parallel decode) | Streaming + batch | 🏆 Excellent Mandarin (60k hrs) | Mandarin-first; via FunASR toolkit | Open source | Self-host; ModelScope; HF | 🐍 A: native · ☕ B: sidecar · 🌐 ONNX
FunASR (toolkit) (Alibaba DAMO) [pipeline] | Framework: ASR + VAD + punctuation + diarization | n/a | CN / EN (hosts SenseVoice, Paraformer…) | Streaming + batch | Fast | Streaming | Excellent (CN) | It's a toolkit; pick a model inside | Open source | Self-host | 🐍 A: native · ☕ B: sidecar
Kotoba-Whisper (v2 / bilingual) ★ (Kotoba-tech) [JP best] | ASR (distilled Whisper large-v3) | distilled (~small) | JP + EN | 30 s | ~6× faster than large-v3 | Batch | 🏆 Excellent Japanese | JP/EN focus only | Open (MIT-style) | Self-host; HF | 🐍 A: native · ☕ B: sidecar · 📱 ggml possible
ReazonSpeech (Reazon Human Interaction Lab) [JP] | ASR (JP; JP→EN translate) | Medium | JP | Standard | Fast | Batch | Excellent Japanese | Japanese-centric | Open source | Self-host; HF | 🐍 A: native · ☕ B: sidecar
Groq Whisper large-v3 / turbo ★ (Groq, hosted OSS model) [no card to start] | Hosted ASR API (runs OSS Whisper) | n/a (you call API) | 99+; CN | File-based | 🚀 Very fast (LPU) | Free: ~2000 req/day, 7200 audio-sec/hr | High (Whisper large-v3) | Batch only; ~25 MB file cap; per-day quota | Model = MIT; service free tier | ✔ Free API, no credit card to start | 🐍 A: HTTP · ☕ B: HTTP (ideal, no AI libs) · 🌐 via backend
Hugging Face Inference (Serverless) (Hugging Face) [no card] | Hosted endpoints for OSS models (Whisper, etc.) | model-dependent | model-dependent | model-dependent | Cold starts; variable | Small free monthly credits / rate-limited | model-dependent | Tight free limits; cold-start latency | varies by model | ✔ Free tier, no card to start | 🐍 A: HTTP · ☕ B: HTTP · 🌐 via backend
Web Speech API (Browser: Chrome/Edge) [zero setup] | Browser-native ASR (not OSS, but free) | n/a | many (Chrome→Google) | Streaming | Real-time | Live partials, no key | Medium-good (varies) | Chrome sends audio to Google; inconsistent across browsers; not for private audio | Browser API (not open-source) | ✔ Free, no key, no card | 🌐 Web only (great Phase-0 demo); 📱 limited
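
Several rows above (Whisper, Kotoba-Whisper) note a fixed 30 s window, so longer audio has to be split before transcription. The sketch below plans overlapping windows over a recording. The 5 s overlap and the function name are illustrative assumptions. Production tools typically split on silence with a VAD rather than at fixed offsets.

```python
def plan_chunks(total_s, window_s=30.0, overlap_s=5.0):
    """Return (start, end) times in seconds covering [0, total_s] with overlapping windows."""
    if window_s <= 0 or not 0 <= overlap_s < window_s:
        raise ValueError("need window_s > 0 and 0 <= overlap_s < window_s")
    total_s = float(total_s)
    step = window_s - overlap_s          # how far each new window advances
    chunks = []
    start = 0.0
    while start < total_s:
        end = min(start + window_s, total_s)
        chunks.append((start, end))
        if end >= total_s:               # last window reached the end of the audio
            break
        start += step
    return chunks


# A 70 s recording with the assumed defaults (30 s window, 5 s overlap).
print(plan_chunks(70))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```

The overlap gives the merge step a few seconds of shared audio in which to reconcile words cut at a boundary. How to merge the overlapping transcripts is left out of this sketch.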

How to read "Stack fit"

🐍 Option A: Python backend (+ Flutter)

Every self-hostable model here runs natively in Python (Hugging Face transformers or the model's own repo). This is the smoothest path for self-hosting. Flutter calls your Python API over HTTPS.

☕ Option B: Java backend (+ React Native)

Java has weak native AI libs. Three integration paths: (1) call a free hosted API like Groq over HTTP (easiest), (2) run a tiny Python ASR microservice beside Spring Boot, or (3) use native bindings: Vosk has a Java API, and whisper.cpp/voxtral.c can be reached via JNI.
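
Path (2), the Python microservice beside the Java backend, might look like the standard-library sketch below. The `/transcribe` path, port 8008, the JSON response shape, and the `transcribe` stub are all hypothetical. Replace the stub with a real model call, and put a production-grade server in front for anything beyond a prototype.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def transcribe(audio_bytes: bytes) -> str:
    """Hypothetical stub: replace the body with a real ASR model call."""
    return f"<{len(audio_bytes)} bytes received; plug a real ASR model in here>"


class TranscribeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/transcribe":   # assumed endpoint name
            self.send_error(404, "unknown path")
            return
        # The caller (e.g. the Java backend) sends raw audio bytes as the request body.
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        body = json.dumps({"text": transcribe(audio)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; log properly in real use


def make_server(host="127.0.0.1", port=8008):
    """Build the sidecar server; call .serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), TranscribeHandler)
```

Binding to 127.0.0.1 keeps the sidecar reachable only from the same machine, which suits a service that sits next to the Java process. Start it with `make_server().serve_forever()`.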

🌐 React Web

The web client never runs heavy models directly; it records audio and calls the backend. Exceptions that run in-browser: Web Speech API, and small models via transformers.js/ONNX/WASM (Whisper-tiny, Moonshine, Vosk-WASM).

📱 On-device mobile (offline)

For true offline mobile, use Vosk (Android+iOS SDK), Moonshine (RN/Flutter), or whisper.cpp ggml. Larger models (7B+) are impractical on phones; call the backend instead.

Recommended model picks

MVP (free, no card)

Groq Whisper large-v3: fast, generous free tier, 99+ langs incl. CN/JP, and just an HTTP call (equally easy from Python or Java).
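
The "just an HTTP call" can be sketched with the standard library alone. This assumes Groq's OpenAI-compatible transcription endpoint, the `whisper-large-v3` model id, and an API key in a `GROQ_API_KEY` environment variable. The 25 MB check mirrors the approximate cap listed in the table, and the custom User-Agent is a precaution in case the default urllib one is rejected. Verify all of these against Groq's current documentation.

```python
import json
import os
import urllib.request
import uuid

GROQ_URL = "https://api.groq.com/openai/v1/audio/transcriptions"
MAX_BYTES = 25 * 1024 * 1024  # assumed cap, taken from the "~25 MB" note in the table


def build_request(audio, filename, api_key, model="whisper-large-v3", language=None):
    """Build (but do not send) a multipart/form-data transcription request."""
    if len(audio) > MAX_BYTES:
        raise ValueError("audio exceeds the assumed 25 MB upload cap; split it first")
    boundary = uuid.uuid4().hex
    fields = {"model": model, "response_format": "json"}
    if language:
        fields["language"] = language
    parts = []
    for name, value in fields.items():
        part = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'
        parts.append(part.encode("utf-8"))
    file_header = (
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'
    )
    parts.append(file_header.encode("utf-8") + audio + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": f"multipart/form-data; boundary={boundary}",
        "User-Agent": "stt-catalog-example/0.1",  # hypothetical client name
    }
    return urllib.request.Request(GROQ_URL, data=b"".join(parts), headers=headers, method="POST")


def transcribe_file(path, api_key=None):
    """Send one audio file and return the transcript text (performs a network call)."""
    api_key = api_key or os.environ["GROQ_API_KEY"]
    with open(path, "rb") as f:
        audio = f.read()
    req = build_request(audio, os.path.basename(path), api_key)
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.load(resp)["text"]
```

The same request translates directly to Java's `java.net.http.HttpClient`, which is why this path suits Option B: the backend needs no ML libraries at all.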

Self-host fallback / privacy

faster-whisper (large-v3-turbo on GPU, INT8 small on CPU): no quota and no usage fees.
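
A minimal sketch of that GPU/CPU split with the faster-whisper package (third-party, installed separately). `pick_config` is a hypothetical helper. The `float16` compute type for GPU is an assumption, and the `large-v3-turbo` size name requires a recent faster-whisper release.

```python
def pick_config(has_gpu: bool) -> dict:
    """Map the recommendation above onto WhisperModel constructor arguments."""
    if has_gpu:
        return {"model_size_or_path": "large-v3-turbo", "device": "cuda", "compute_type": "float16"}
    return {"model_size_or_path": "small", "device": "cpu", "compute_type": "int8"}


def transcribe(path: str, has_gpu: bool = False) -> str:
    """Transcribe one file; downloads model weights on first use."""
    # Imported lazily so this module loads even where faster-whisper is not installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(**pick_config(has_gpu))
    # vad_filter=True skips silent stretches, which reduces hallucinated text on silence.
    segments, _info = model.transcribe(path, vad_filter=True)
    return " ".join(segment.text.strip() for segment in segments)
```

Usage would be `transcribe("meeting.wav", has_gpu=True)`, where the file name is a placeholder.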

Best new multilingual + CN/JP

Qwen3-ASR (52 langs + 22 CN dialects, ~92 ms TTFT) or VibeVoice-ASR (60-min single-pass + diarization).

CN / JP specialists

SenseVoice (CN+JP, ~15× faster than Whisper-large) · Paraformer (Mandarin) · Kotoba-Whisper (JP).

Offline mobile

Vosk (~50 MB, Java+Android+iOS) or Moonshine (<200 ms edge).

Most accurate open-weights

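Voxtral Small (2.9% AA-WER) · IBM Granite 4.1 2B (5.33% WER, Open ASR Leaderboard) · Canary-Qwen 2.5B (5.63% WER, Open ASR Leaderboard). AA-WER and the Open ASR Leaderboard are different benchmarks, so the first figure is not directly comparable with the other two.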
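
For readers checking these accuracy figures on their own audio, word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a simplified version: it only lowercases and splits on whitespace, whereas leaderboards apply their own text normalization, so its numbers will not match published scores exactly.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(wer("a b c d", "a x c d"))  # 0.25
```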