The quality of AI answers starts before the language model. It starts with speech transcription.
If the system mishears the question, even a strong model ends up answering a distorted version of it.
In Sobes, you can choose between several speech recognition models: Groq Whisper Large v3 Turbo, OpenAI gpt-4o-mini-transcribe, Deepgram Nova-3, and Soniox STT v4. They differ not only in accuracy, but also in latency, recording behavior, language support, and how they handle noisy meetings.
First: what does Realtime STT change?
In the Sobes interface, the important setting is simple: Realtime STT disabled or Realtime STT enabled.
Realtime STT disabled
Sobes sends a finished audio segment for recognition and waits for the final text.
This is how Groq Whisper Large v3 Turbo, OpenAI gpt-4o-mini-transcribe, and also Deepgram Nova-3 and Soniox STT v4 work when Realtime STT is disabled.
This is a simple, stable flow for short phrases: the text arrives as a finished transcript, without intermediate corrections. The tradeoff is that Sobes receives text only after the audio segment is sent. Long phrases make the delay more noticeable, and there is no live stream of words while the person is speaking.
Realtime STT disabled is a good fit when a phrase is recorded, Sobes transcribes it quickly, and then passes the final text to AI.
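The disabled flow can be sketched in a few lines. `transcribe_segment` below is a hypothetical stand-in for whichever batch provider is selected, not a real Sobes or provider API:

```python
# A minimal sketch of the batch flow: record a finished segment, send it,
# and block until one final transcript comes back. `transcribe_segment`
# is a placeholder for a real provider call (Groq, OpenAI, etc.).
import time

def transcribe_segment(audio_bytes: bytes) -> str:
    # Placeholder: a real implementation would call the provider's API.
    return "what is the event loop"

def batch_flow(audio_bytes: bytes) -> tuple[str, float]:
    start = time.monotonic()
    text = transcribe_segment(audio_bytes)   # blocks until final text
    latency = time.monotonic() - start       # transcription latency only
    return text, latency

text, latency = batch_flow(b"\x00" * 16000)
print(text)
```

Note that the latency measured here starts only after the segment is finished, which matches how the numbers in the table below are defined.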
Realtime STT enabled
Sobes streams audio to the model and receives partial text before the phrase is over.
This is how Deepgram Nova-3 and Soniox STT v4 work when Realtime STT is enabled.
In streaming transcription, the text can be a draft at first: the model may revise words while the person keeps speaking. Finalization is the moment when the phrase is over, the model stops changing that text segment, and Sobes can safely pass it forward.
The point of Realtime STT is reaction speed. You can see that the system is hearing speech right now, long answers become readable before the phrase is over, and auto-response can start sooner. The tradeoff is that partial text can change, quality depends more on connection stability, and sometimes you need to wait until the model locks in the final text for the phrase.
Realtime STT is useful when the interviewer asks a long question and you want to see the text almost immediately.
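The partial-then-final behavior can be illustrated with a small consumer loop. The event dictionaries here are an assumed shape for illustration; real streaming providers each define their own message format:

```python
# A sketch of consuming a streaming transcript: partial results may be
# revised until the provider marks the segment final.

def consume_stream(events):
    """Keep the latest draft; forward text only once it is finalized."""
    draft = ""
    finalized = []
    for ev in events:
        if ev["is_final"]:
            finalized.append(ev["text"])   # safe to pass to the AI step
            draft = ""
        else:
            draft = ev["text"]             # still a draft, may change
    return finalized, draft

events = [
    {"text": "what is", "is_final": False},
    {"text": "what is an event", "is_final": False},   # revised draft
    {"text": "what is an event loop", "is_final": True},
]
final, draft = consume_stream(events)
print(final)  # ['what is an event loop']
```

The separation between `draft` and `finalized` is exactly the tradeoff described above: the draft gives immediate feedback, but only finalized text is stable enough to hand to the AI.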
Which recording modes support Realtime STT?
Important detail: Realtime STT works in VAD and Start / Stop modes. In VAD, Sobes detects speech automatically, starts recording, and closes the segment after a pause. In Start / Stop, you control the beginning and end manually, but streaming transcription can still run while recording is active.
In One-Shot, Realtime STT is not used: Sobes takes the last seconds from the audio buffer and sends a finished segment for transcription.
So if you choose Deepgram or Soniox but use One-Shot, use the "Realtime STT disabled" latency numbers as the reference.
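The One-Shot idea, keeping recent audio and cutting the tail on demand, can be sketched with a rolling buffer. The 16 kHz rate and window sizes are illustrative assumptions, not Sobes internals:

```python
# A sketch of a rolling audio buffer: keep the last KEEP_SECONDS of
# samples and, on demand, cut the most recent N seconds into one
# finished segment for batch transcription.
from collections import deque

SAMPLE_RATE = 16_000   # assumed sample rate, samples per second
KEEP_SECONDS = 10

buffer = deque(maxlen=SAMPLE_RATE * KEEP_SECONDS)  # rolling window

def push_samples(samples):
    buffer.extend(samples)          # old samples fall off automatically

def one_shot_segment(seconds: int):
    """Return the most recent `seconds` of audio as one segment."""
    n = SAMPLE_RATE * seconds
    return list(buffer)[-n:]

push_samples(range(SAMPLE_RATE * 30))   # 30 s of fake samples
segment = one_shot_segment(5)
print(len(segment) / SAMPLE_RATE)       # 5.0 seconds of audio
```

Because the segment is only assembled at the moment of the request, there is nothing to stream while it is being "recorded", which is why Realtime STT does not apply here.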
Transcription latency
This is speech recognition latency: how long Sobes waits for text after sending a finished audio segment, or, in streaming mode, how long after the phrase ends until the model locks in the final text. It does not include the time while the person is speaking, and it does not include AI answer generation.
| Model | Realtime STT disabled | Realtime STT enabled | What it means |
|---|---|---|---|
| Groq Whisper Large v3 Turbo | 500-600 ms | Not available | The fastest non-streaming option. Good for short questions. |
| OpenAI gpt-4o-mini-transcribe | 600-900 ms | Not available | Slightly slower than Groq, but usually more careful with difficult speech. |
| Deepgram Nova-3 | about 900 ms | about 300 ms | With Realtime STT enabled, final text appears much faster. |
| Soniox STT v4 | about 2500 ms | about 200 ms | Slowest without Realtime STT, but fastest at locking in final text when Realtime STT is enabled. |
Realtime STT is not just a checkbox. For Deepgram and Soniox, it changes how fast the product feels: text starts appearing almost immediately, and the final transcript arrives faster than when Sobes sends a completed audio segment after the phrase.
Short version: what should you pick?
If you do not want to overthink it:
- Realtime STT works in VAD and Start / Stop. In One-Shot, use the options with Realtime STT disabled.
- Groq Whisper Large v3 Turbo is the fastest option with Realtime STT disabled for short phrases and typical interviews.
- OpenAI gpt-4o-mini-transcribe is a good Realtime STT disabled option when you want more careful transcription through OpenAI.
- Deepgram Nova-3, Realtime STT disabled gives solid quality, but with average latency around 900 ms.
- Deepgram Nova-3, Realtime STT enabled is a fast streaming mode, around 300 ms until final text after a phrase.
- Soniox STT v4, Realtime STT disabled is strong for Russian and mixed-language speech, but has the highest latency: around 2500 ms.
- Soniox STT v4, Realtime STT enabled is the fastest streaming option for Russian and language switching: around 200 ms.
What is WER?
WER means Word Error Rate: the share of incorrectly recognized words.
The lower the WER, the better.
For example, WER of 6% means that roughly 6 out of 100 words were recognized incorrectly: replaced, skipped, or added by mistake.
Important: WER is hard to compare without context. The same model can show 4% on clean English audio and 15% on a noisy call with accents, interruptions, and a poor microphone.
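For intuition, WER is word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. A minimal implementation:

```python
# Compute WER as the classic dynamic-programming edit distance over
# words, normalized by the length of the reference.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(ref)][len(hyp)] / len(ref)

# One substituted word out of five reference words -> 20% WER.
print(wer("the event loop runs tasks", "the event loop run tasks"))  # 0.2
```

Production evaluations usually add text normalization (case, punctuation, numerals) before scoring, which is one reason published WER figures are hard to compare directly.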
WER benchmarks for English and Russian
These are public reference points, not a guarantee for every interview. The source matters because providers use different datasets and evaluation methods.
Groq Whisper Large v3 Turbo.
Groq reports a general WER of around 12%, but does not publish a separate English/Russian breakdown. As a family-level Whisper reference, the FLEURS benchmark shows 4.00% for English and 5.13% for Russian.
OpenAI gpt-4o-mini-transcribe.
The FLEURS benchmark in the Voxtral Realtime article lists 3.65% for English and 5.30% for Russian. OpenAI also says gpt-4o-transcribe and gpt-4o-mini-transcribe improve WER compared with Whisper v2/v3, but its docs do not publish a simple breakdown for these two languages.
Deepgram Nova-3.
For English, Deepgram reports 5.26% for finished audio and 6.84% for streaming. For Russian, the public Soniox comparison lists 8.0%. Deepgram's strength is English variants and accents: American, British, Australian, Indian, and New Zealand English.
Soniox STT v4.
Soniox reports about 6.5% for English and about 6.2% for Russian in a 60-language benchmark. There is no separate English/Russian breakdown for Realtime STT, but Soniox says the v4 streaming mode improves accuracy.
The main takeaway: for Russian, Soniox and Deepgram look stronger than older Whisper-style approaches on real-world benchmarks, while gpt-4o-mini-transcribe looks good on FLEURS. But FLEURS, YouTube, calls, and live interviews are different environments. Model choice should consider both WER and latency.
Models one by one
Groq Whisper Large v3 Turbo
Groq Whisper Large v3 Turbo is the fastest Sobes option when Realtime STT is disabled. It fits short replies, fast interviews, and situations where minimal latency matters more than maximum accuracy on difficult speech.
Its main strength is speed. Groq's ASR guide reports a general WER of around 12% for Whisper Large v3 Turbo and acceleration up to 247x relative to audio duration. That makes it a good option for short remarks where delay matters most.
If the speech contains a lot of Russian, names, abbreviations, English technical words inside Russian speech, noise, or a strong accent, Groq can make more mistakes. In that case, OpenAI or Soniox is usually the better fallback, because preserving the exact wording matters more than raw speed.
OpenAI gpt-4o-mini-transcribe
OpenAI gpt-4o-mini-transcribe transcribes a completed audio segment. It is slower than Groq Whisper Large v3 Turbo, but usually handles speech nuance, technical terms, and question details more carefully.
OpenAI says gpt-4o-transcribe and gpt-4o-mini-transcribe improve WER and language recognition compared with Whisper. In the independent FLEURS benchmark, gpt-4o-mini-transcribe is listed at 3.65% WER for English and 5.30% for Russian.
OpenAI gpt-4o-mini-transcribe is a good higher-quality option with Realtime STT disabled when Groq Whisper Large v3 Turbo makes too many mistakes on your speech or microphone.
Deepgram Nova-3
Deepgram is strong in Realtime STT scenarios. Nova-3 works with both completed audio and streaming, and Deepgram's docs report WER of 6.84% for streaming transcription and 5.26% for completed recordings on a set of 2,703 real-world audio files.
In Sobes, Deepgram is useful when you want to see text almost immediately instead of waiting until the phrase is over. With Realtime STT disabled, average transcription latency is around 900 ms; with Realtime STT enabled, it is around 300 ms. That makes Deepgram convenient for long phrases, live subtitles, mixed speech, and English interviews, especially when the interviewer has a regional accent.
A separate Deepgram advantage is support for English variants. You can choose not only generic English, but also American, British, Australian, Indian, or New Zealand English. This helps in international interviews where the interviewer has an unfamiliar accent and technical terms are mixed with fast conversational English.
For Russian, Deepgram also looks decent: the Soniox comparison lists 8.0% WER for Russian.
Soniox STT v4, Realtime STT disabled
Soniox STT v4 works well for Russian, mixed-language speech, and difficult technical terms, but with Realtime STT disabled it is the slowest option: average transcription latency is around 2500 ms. It makes sense when accuracy on Russian and mixed speech matters more than response speed.
Soniox's strength is multilingual recognition and language switching. It also covers a large pool of less common languages where general-purpose models often degrade more than they do on English. This matters for Russian-speaking interviews too, where a single phrase can contain React, PostgreSQL, Kafka, thread pool, event loop, and Russian word endings around English technical terms.
Soniox STT v4, Realtime STT enabled
Soniox STT v4 with Realtime STT enabled is the best option when you need Russian speech and minimal latency at the same time. On average, the final transcript arrives about 200 ms after the phrase. This option fits especially well in VAD and Start / Stop: text streams live, mixed Russian-English phrases are handled well, and the final text locks in quickly after a pause.
