| Model | Type | Params / Size | Languages (CN/JP) | Window size | Speed | Response rate / latency | Accuracy | Key limitations | License | Free run path (no card) | Stack fit (A: Py·Flutter / B: Java·RN / Web) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VibeVoice-ASR ★ (frontier · long-form) | ASR (long-form, structured) | ~1.5B class; 7.5 Hz tokenizer | 50+ langs; CN, JP | Up to 60 min single-pass; no chunking needed | Efficient (low frame-rate) | Batch long-form (not sub-second live) | High; outputs Who+When+What (speaker + timestamps) | New/early tooling; GPU for speed; long-form batch focus | Open (MIT) | Self-host; HF Transformers (since Mar 2026); HF Spaces | A: native · B: via API/Py-sidecar · Web: via backend |
| Qwen3-ASR ★ (frontier · CN dialects · streaming) | ASR + ForcedAligner | 1.7B and 0.6B; 1.7B = Qwen3-1.7B + 300M AuT encoder | 52 langs + 22 Chinese dialects; CN★, JP | Dynamic 1–8 s window; streaming + offline long | 0.6B ≈ 2000 s audio / 1 s (concurrency 128) | TTFT ≈ 92 ms (excellent live) | High multilingual; LID + timestamps | Best speed needs GPU; newer ecosystem | Open source (Apache-2.0) | Self-host; HF; ModelScope | A: native · B: API/sidecar · Mobile: 0.6B edge-capable · Web: ONNX possible |
| Voxtral (Mini / Small / Realtime) ★ (frontier · most accurate open) | ASR + translate + QA + summarize | Mini ~3B (Ministral 3B); Small 24B (Mistral Small 3.1) | Many; Transcribe V2 = 13 langs w/ diarization; CN, JP (less CN-tuned) | Long-form context | Fast; Mini = edge-friendly | Realtime: sub-200 ms live; word timestamps | Voxtral Small = most accurate open-weights (AA-WER 2.9%) | Small (24B) heavy; CN/JP not its focus | Open weights (Apache-2.0) | Self-host; HF; voxtral.c (pure C, no Python) | A: native · B: C/JNI or API · Mobile: Mini/C edge · Web: via backend |
| Kimi-Audio-7B-Instruct (frontier · CN strong) | Audio foundation (ASR+AQA+TTS+dialogue) | 7B; Qwen2.5-7B + Whisper encoder | Multilingual; CN★, JP | Standard (chunked) | Moderate (7B → GPU) | Conversational; not sub-second | Leading open-source ASR benchmarks (13M hrs trained) | Heavy (7B GPU); broad model, not edge | Open weights + training code | Self-host; HF | A: native · B: API/sidecar · Mobile: no (too heavy) |
| GLM-4-Voice-9B (frontier · CN+EN) | End-to-end speech LLM (voice chat); ASR-capable | 9B; 12.5 Hz tokenizer (from Whisper) | CN★ + EN only | Conversational | Moderate (9B GPU); streaming decoder | Low-latency dialogue (starts at ~10 tokens) | Good (chatbot-oriented, not pure transcription) | Speech-dialogue focus, not best for plain transcripts; CN+EN only | Code Apache-2.0; weights = GLM Model License (free) | Self-host; HF | A: native · B: API · Mobile: no |
| Qwen3-Omni-30B-A3B (frontier · omni-modal) | Omni LLM (text+audio+image+video, real-time speech) | 30B MoE (~3B active) | Multilingual; CN★, JP | Long multimodal context | Real-time speech generation | Real-time capable (high-end GPU) | Very high (frontier omni) | Overkill for pure STT; needs strong GPU/VRAM | Open source (Apache-2.0) | Self-host; HF | A: native · B: API · Mobile: no |
| Whisper (tiny to large-v3 / turbo) ★ (99+ langs · free on Groq/HF) | ASR + translate | 39M / 74M / 244M / 769M / 1.55B; turbo 809M | 99+; CN, JP (very good) | 30 s sliding (chunk longer) | tiny ≈10× large; turbo ≈8× large-v3 | Batch (live via WhisperX/streaming wrappers) | High (large-v3 best multilingual) | 30 s window → chunking; hallucinates on silence (use VAD) | Open (MIT) | Self-host (faster-whisper/whisper.cpp); Groq free; HF | A: native · B: Groq API / whisper.cpp-JNI · Mobile: ggml on-device · Web: transformers.js/WASM |
| NVIDIA Canary-Qwen 2.5B (top EN accuracy) | ASR (LLM decoder) | 2.5B | English (Canary-1B: EN/DE/ES/FR + translate) | ~40 s | Slower (LLM decoder) | Batch | ~5.63% WER (near top Open ASR Leaderboard) | English-centric; heavy; NeMo toolkit | Open (NVIDIA OpenModel / CC-BY) | Self-host (NeMo); HF | A: native · B: API/sidecar · Mobile: no |
| NVIDIA Parakeet TDT 0.6B v2 (fastest) | ASR (streaming-capable) | 0.6B / 1.1B | English (some multilingual variants) | Streaming-capable | RTFx > 2000 (≈6.5× Canary) | Real-time | High (EN); 10th on Open ASR Leaderboard | Mostly English; needs NeMo/GPU | Open (CC-BY) | Self-host (NeMo); HF | A: native · B: sidecar · Mobile: limited |
| IBM Granite Speech 4.1 2B / 3.3 8B (SOTA accuracy) | ASR | 2B / 8B | English-focused (+ some) | Standard | Slower (8B) | Batch | 5.33% WER (4.1 2B = best open on Open ASR Leaderboard) | English-centric; heavy | Open (Apache-2.0) | Self-host; HF | A: native · B: sidecar · Mobile: no |
| Meta MMS (1000+ langs) | ASR (massively multilingual) | ~1B | 1,000+ languages; CN, JP | Standard | Medium | Batch | Good (esp. rare/low-resource langs) | Setup complexity; per-lang adapters | Open (CC-BY-NC for some / mixed) | Self-host; HF | A: native · B: sidecar · Mobile: no |
| Meta SeamlessM4T v2 (ASR+translate) | ASR + speech-to-text translation | 2.3B | ~100; CN, JP | Standard | Medium | Batch | Good; strong for translation | Large; GPU recommended; license restrictions | Open (CC-BY-NC, non-commercial) | Self-host; HF | A: native · B: sidecar · Mobile: no |
| Wav2Vec 2.0 / HuBERT (fine-tunable) | ASR (self-supervised) | 95M–1B | EN + any (fine-tune per lang) | Standard | Fast | Batch / streaming variants | Good when fine-tuned | No punctuation/casing by default; needs fine-tuning | Open (MIT / Apache) | Self-host; HF | A: native · B: sidecar · Web: transformers.js |
| Vosk ★ (on-device) | ASR (streaming, offline) | ~50 MB small to 1.8 GB big | 20+; EN, CN, RU, ES, JP… | Streaming (zero-latency) | Real-time on CPU/phone | Live partials, fully offline | Small = medium; big = good | Per-language model; small models less accurate | Open (Apache-2.0) | Self-host / fully on-device (no server) | A: native · B: native Java API · Mobile: Android+iOS SDK · Web: WASM |
| Moonshine (v2, 2026) ★ (edge real-time) | ASR (streaming, edge) | Tiny / Base (small) | EN, ES, Mandarin CN, JP, KO, VI, UK, AR | Streaming | Beats Whisper tiny/base; 5× faster on edge | < 200 ms latency | High for its size | Fewer languages than Whisper | Open (MIT) | Self-host / on-device (runs on MCU, phone, or server) | A: native · B: C/bindings · Mobile: RN+Flutter on-device · Web: WASM |
| Silero STT (lightweight) | ASR (streaming) | Small | EN, DE, ES, others | Streaming | Fast on CPU | Live | Medium | Fewer languages/updates | Open (CC-BY-NC / mixed) | Self-host / on-device | A: native · Mobile: yes · Web: ONNX |
| SenseVoice (Small) ★ (CN+JP fast) | ASR + emotion + audio-event (non-autoregressive) | Small | CN★, Cantonese, EN, JP, KO (50+) | Utterance-level (≤~30 s best) | ~15× faster than Whisper-large | Low latency (batch + real-time) | Excellent CN, strong JP | Best on short clips; via FunASR | Open source | Self-host; HF; ModelScope | A: native · B: sidecar · Web: ONNX |
| Paraformer (Mandarin) | ASR (non-autoregressive) | Medium | CN★ Mandarin (EN variants) | Utterance / streaming | Fast (parallel decode) | Streaming + batch | Excellent Mandarin (60k hrs) | Mandarin-first; via FunASR toolkit | Open source | Self-host; ModelScope; HF | A: native · B: sidecar · Web: ONNX |
| FunASR (toolkit) (pipeline) | Framework: ASR + VAD + punctuation + diarization | n/a | CN / EN (hosts SenseVoice, Paraformer…) | Streaming + batch | Fast | Streaming | Excellent (CN) | It's a toolkit; pick a model inside | Open source | Self-host | A: native · B: sidecar |
| Kotoba-Whisper (v2 / bilingual) ★ (JP best) | ASR (distilled Whisper large-v3) | distilled (~small) | JP★ + EN | 30 s | ~6× faster than large-v3 | Batch | Excellent Japanese | JP/EN focus only | Open (MIT-style) | Self-host; HF | A: native · B: sidecar · Mobile: ggml possible |
| ReazonSpeech (JP) | ASR (JP; JP→EN translate) | Medium | JP★ | Standard | Fast | Batch | Excellent Japanese | Japanese-centric | Open source | Self-host; HF | A: native · B: sidecar |
| Groq: Whisper large-v3 / turbo ★ (no card to start) | Hosted ASR API (runs OSS Whisper) | n/a (you call API) | 99+; CN, JP | File-based | Very fast (LPU) | Free: ~2000 req/day, 7200 audio-sec/hr | High (Whisper large-v3) | Batch only; ~25 MB file cap; per-day quota | Model = MIT; service free tier | Free API, no credit card to start | A: HTTP · B: HTTP (ideal: no AI libs) · Web: via backend |
| Hugging Face Inference (Serverless) (no card) | Hosted endpoints for OSS models (Whisper, etc.) | model-dependent | model-dependent | model-dependent | Cold starts; variable | Small free monthly credits / rate-limited | model-dependent | Tight free limits; cold-start latency | varies by model | Free tier, no card to start | A: HTTP · B: HTTP · Web: via backend |
| Web Speech API (zero setup) | Browser-native ASR (not OSS, but free) | n/a | many (Chrome→Google) | Streaming | Real-time | Live partials, no key | Medium-good (varies) | Chrome sends audio to Google; inconsistent across browsers; not for private audio | Browser API (not open-source) | Free, no key, no card | Web only (great Phase-0 demo); Mobile: limited |
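
The ~25 MB file cap listed for the Groq row means long recordings have to be split before upload. The sketch below is one way to work out the split. It assumes uncompressed 16 kHz mono 16-bit WAV input, and the function names `max_chunk_seconds` and `plan_chunks` and the one-second overlap are illustrative choices, not part of any API named in the table.

```python
def max_chunk_seconds(cap_bytes: int, sample_rate: int = 16_000,
                      channels: int = 1, bytes_per_sample: int = 2,
                      header_bytes: int = 44) -> float:
    """Longest uncompressed PCM WAV clip (in seconds) that fits under cap_bytes."""
    bytes_per_second = sample_rate * channels * bytes_per_sample
    return (cap_bytes - header_bytes) / bytes_per_second


def plan_chunks(total_s: float, chunk_s: float, overlap_s: float = 1.0):
    """Return (start, end) windows in seconds that cover [0, total_s].

    Consecutive windows overlap by overlap_s so that a word cut at a
    boundary appears whole in at least one chunk.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")
    step = chunk_s - overlap_s
    windows = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            break
        start += step
    return windows


# Assumed cap of 25 * 10**6 bytes; the table only says "~25 MB".
limit_s = max_chunk_seconds(25_000_000)
one_hour = plan_chunks(total_s=3600.0, chunk_s=600.0, overlap_s=1.0)
```

Under these assumptions a chunk can hold roughly 13 minutes of audio, so 10-minute chunks leave headroom. Compressed formats fit far more audio per megabyte, and the same planner can be reused for models with a fixed 30 s window.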
Every model here runs natively in Python (Hugging Face transformers or the model's own repo). This is the smoothest path for self-hosting. Flutter calls your Python API over HTTPS.
Java has weak native AI libs. Three integration paths: (1) call a free hosted API like Groq over HTTP (easiest), (2) run a tiny Python ASR microservice beside Spring Boot, or (3) use native bindings: Vosk has a Java API, and whisper.cpp or voxtral.c can be wrapped via JNI.
Web never runs heavy models directly; it records audio and calls the backend. Exceptions that run in-browser: Web Speech API, and small models via transformers.js/ONNX/WASM (Whisper-tiny, Moonshine, Vosk-WASM).
For true offline mobile, use Vosk (Android+iOS SDK), Moonshine (RN/Flutter), or whisper.cpp ggml. Larger models (7B+) are too heavy to run on phones in practice; call the backend instead.
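
The "sidecar" pattern mentioned above (a small Python ASR service that Flutter, Spring Boot, or a web backend calls over HTTP) can be sketched with the standard library alone. This is a minimal illustration, not a production server: the `/transcribe` path, the JSON response shape, and the placeholder `transcribe` function are all assumptions made for the example, and a real deployment would swap in an actual model call and add TLS and authentication.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def transcribe(audio: bytes) -> str:
    """Placeholder: replace with a real model call (e.g. a Whisper variant)."""
    return f"<{len(audio)} bytes received>"


class AsrHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Hypothetical endpoint: raw audio bytes in, JSON transcript out.
        if self.path != "/transcribe":
            self.send_error(404, "unknown path")
            return
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        body = json.dumps({"text": transcribe(audio)}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging in this sketch


def make_server(host: str = "127.0.0.1", port: int = 8000) -> ThreadingHTTPServer:
    """Build the server; call .serve_forever() on the result to run it."""
    return ThreadingHTTPServer((host, port), AsrHandler)
```

Running `make_server().serve_forever()` starts the service on localhost. Any client that can issue an HTTP POST (Java's `HttpClient`, Dart's `http` package, `fetch` from a backend) can then send audio bytes to `/transcribe` without linking any AI library.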
Groq Whisper large-v3: fast, generous free tier, 99+ langs incl. CN/JP, just an HTTP call (equally easy from Python or Java).
faster-whisper (large-v3-turbo on GPU, INT8 small on CPU): no quota, free forever.
Qwen3-ASR (52 langs + 22 CN dialects, ~92 ms TTFT) or VibeVoice-ASR (60-min single-pass + diarization).
SenseVoice (CN+JP, ~15× faster than Whisper-large) · Paraformer (Mandarin) · Kotoba-Whisper (JP).
Vosk (~50 MB, Java+Android+iOS) or Moonshine (<200 ms edge).
Voxtral Small (2.9% AA-WER) · IBM Granite 4.1 2B (5.33% WER) · Canary-Qwen 2.5B (5.63%).
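
For the self-hosted faster-whisper option above (large-v3-turbo on GPU, INT8 small on CPU), the choice of settings might look like the sketch below. The helper names `pick_whisper_config` and `transcribe_file` are hypothetical, `float16` on GPU is an assumption the text does not specify, and the `large-v3-turbo` model name requires a reasonably recent faster-whisper release.

```python
def pick_whisper_config(has_gpu: bool) -> dict:
    """Map hardware to the settings suggested in the text."""
    if has_gpu:
        # float16 is an assumed precision; the text only names the model.
        return {"model_size_or_path": "large-v3-turbo",
                "device": "cuda", "compute_type": "float16"}
    return {"model_size_or_path": "small",
            "device": "cpu", "compute_type": "int8"}


def transcribe_file(path: str, has_gpu: bool) -> str:
    """Transcribe one audio file; needs `pip install faster-whisper`."""
    # Imported lazily so the config helper works without the package.
    from faster_whisper import WhisperModel

    model = WhisperModel(**pick_whisper_config(has_gpu))
    # vad_filter skips silent stretches, where Whisper tends to hallucinate.
    segments, _info = model.transcribe(path, vad_filter=True)
    return " ".join(segment.text.strip() for segment in segments)
```

A call such as `transcribe_file("meeting.wav", has_gpu=False)` would download the `small` model on first use and run it with INT8 weights on CPU; the file name here is only an example.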