Sobes.tech

Audio Recording

Questions about recording and voice recognition

Recording Modes

Automatic (VAD)

Recording starts automatically when you speak and stops during pauses. Perfect for continuous interviews.

How it works:

  1. The application analyzes audio with a neural network that distinguishes speech from background noise
  2. As soon as speech is detected, recording begins
  3. When silence occurs, recording ends and the audio is sent for transcription
  4. If speech continues for a long time, the recording is automatically split into parts (chunks)
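The steps above can be sketched as a small state machine. This is only an illustration, not the application's actual implementation: it assumes the neural network yields one speech probability per audio frame, and the names VadSegmenter, threshold, silence_frames and max_frames are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VadSegmenter:
    """Groups frames into recordings based on a per-frame speech probability."""
    threshold: float = 0.5   # hypothetical cutoff for "this frame is speech"
    silence_frames: int = 3  # hypothetical: consecutive quiet frames that end a recording
    max_frames: int = 7      # hypothetical chunk limit, counted in frames
    _current: list = field(default_factory=list)
    _quiet: int = 0

    def push(self, frame, speech_prob):
        """Feed one frame; return a finished recording (list of frames) or None."""
        is_speech = speech_prob >= self.threshold
        if not self._current:
            if is_speech:  # step 2: speech detected, recording begins
                self._current = [frame]
                self._quiet = 0
            return None
        self._current.append(frame)
        self._quiet = 0 if is_speech else self._quiet + 1
        if self._quiet >= self.silence_frames:  # step 3: silence ends the recording
            done = self._current[:-self._quiet]  # drop the trailing silence
            self._current, self._quiet = [], 0
            return done
        if len(self._current) >= self.max_frames:  # step 4: split long speech
            done, self._current, self._quiet = self._current, [], 0
            return done
        return None
```

Each returned list stands for one recording that would be sent for transcription. If speech continues after a split, the next speech frame simply starts a new chunk.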

Manual (Toggle)

You control the start and end of recording with the hotkey Ctrl+R. Suitable when it's important to record only specific moments.

The application constantly records audio into a background buffer. When you press Ctrl+R, the buffered seconds are added to the beginning of the recording, so the part of the conversation just before the keypress is still saved. Each recording is validated by the neural network: if no speech is detected, the recording is automatically discarded.
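A minimal sketch of this behavior, assuming audio arrives as a stream of frames. The class ToggleRecorder, its methods, and the contains_speech callback (a stand-in for the neural network check) are hypothetical names used for illustration only.

```python
from collections import deque


class ToggleRecorder:
    """Manual mode: a pre-roll buffer plus a start/stop toggle."""

    def __init__(self, buffer_frames=4):
        # A bounded deque drops the oldest frame once it is full.
        self.pre_roll = deque(maxlen=buffer_frames)
        self.recording = None  # None while idle

    def on_frame(self, frame):
        """Called for every incoming audio frame."""
        if self.recording is None:
            self.pre_roll.append(frame)   # idle: keep only the latest frames
        else:
            self.recording.append(frame)  # active: grow the recording

    def toggle(self, contains_speech):
        """Simulates Ctrl+R. Returns a finished recording, or None."""
        if self.recording is None:
            # Start: the buffered frames become the beginning of the recording.
            self.recording = list(self.pre_roll)
            self.pre_roll.clear()
            return None
        finished, self.recording = self.recording, None
        # Stop: discard the recording if the (assumed) speech check fails.
        return finished if contains_speech(finished) else None
```

With buffer_frames=2 and frames a, b, c received before the first toggle, the recording starts with b and c, which mirrors how the audio just before the keypress is preserved.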

One-shot (Oneshot)

Works similarly to manual mode, but instead of recording from start to end, it captures a fixed fragment from the buffer. Press the hotkey, and the application saves the last N seconds of audio.

Suitable when you need to quickly capture what was just said without thinking about turning recording on and off.
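Taking "the last N seconds" amounts to slicing the tail of the buffer. The helper below is a hypothetical sketch: the function name and the 16 kHz default sample rate are assumptions, not documented values.

```python
def last_n_seconds(samples, n_seconds, sample_rate=16_000):
    """Return the trailing n_seconds of audio from a sequence of samples."""
    count = int(n_seconds * sample_rate)
    # If the buffer holds less audio than requested, everything is returned.
    return samples[-count:] if count > 0 else samples[:0]
```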

Selecting Recording Mode

Open the settings (gear icon in the side menu) and go to the "Audio Recording" section. Here you can select the recording mode and configure other parameters.


General Settings

Audio Source

Mode | Description
System audio + microphone | Records system audio (interlocutor) and microphone (you). Perfect for transcribing dialogues
System audio only | Records only system audio. Useful if you only need the interlocutor's speech
Microphone only | Records only the microphone. Use if system audio is not needed or causes problems

Microphone

In the settings, you can select a specific microphone. If none is selected, the system default is used.

Audio Output Device (Windows)

On Windows, you can select the device used for capturing system audio, for example headphones or speakers. The application records the audio that is played through the selected device.


Automatic Mode Settings (VAD)

Split into Chunks

Automatically splits long recordings into separate files.

Why this is needed: if the interlocutor speaks for a minute without pauses but asked the question at the beginning, the application sends the first part for transcription and starts generating a response while the interlocutor is still speaking.

Enable this option if you want to get hints earlier. Note that splitting reduces speech recognition accuracy.

Chunk Length

Maximum duration of one audio file. Once this duration is reached, the recording is saved and a new chunk begins.

Range: from 5 to 10 seconds. Default: 7 seconds.
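As a worked example of the arithmetic, the hypothetical helper below estimates how uninterrupted speech would be divided for a given chunk length. It assumes no pauses occur (a pause would end the recording earlier) and is not the application's own code.

```python
def chunk_durations(speech_seconds, chunk_length=7):
    """Durations of the chunks produced by uninterrupted speech."""
    if not 5 <= chunk_length <= 10:
        raise ValueError("chunk length must be between 5 and 10 seconds")
    full, rest = divmod(speech_seconds, chunk_length)
    # Full-length chunks first, then one shorter chunk for the remainder.
    return [chunk_length] * int(full) + ([rest] if rest else [])
```

With the default of 7 seconds, one minute of continuous speech gives eight full chunks and a final 4-second chunk, so the first part can be transcribed after about 7 seconds instead of after the full minute.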


Manual Mode Settings (Toggle)

Buffer Length

The application constantly records audio into a background buffer. When you press Ctrl+R, the last seconds from this buffer (up to the buffer length) are added to the beginning of the recording. Useful if you did not press the hotkey in time.

Range: from 0 to 15 seconds. Default: 4 seconds.


One-shot Mode Settings (Oneshot)

Snapshot Duration

How many seconds of audio to capture from the buffer when the hotkey is pressed.

Range: from 5 to 30 seconds. Default: 20 seconds.
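The ranges and defaults listed for chunk length, buffer length and snapshot duration can be collected into one validated settings object. This is a sketch only: the class and field names are hypothetical and do not reflect how the application stores its configuration.

```python
from dataclasses import dataclass

# Allowed ranges in seconds, taken from the settings descriptions.
RANGES = {
    "chunk_length": (5, 10),
    "buffer_length": (0, 15),
    "snapshot_duration": (5, 30),
}


@dataclass
class RecordingSettings:
    chunk_length: int = 7        # automatic mode (VAD)
    buffer_length: int = 4       # manual mode (Toggle)
    snapshot_duration: int = 20  # one-shot mode

    def __post_init__(self):
        # Reject any value outside its documented range.
        for name, (low, high) in RANGES.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name} must be between {low} and {high}, got {value}")
```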

Clear Buffer After Snapshot

If enabled, the buffer is cleared after each snapshot. This prevents sending the same audio fragment again on multiple consecutive presses.
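The effect of this option can be illustrated with a small sketch. OneShotBuffer and its parameters are hypothetical names, and frames stand in for whatever audio units the application actually buffers.

```python
from collections import deque


class OneShotBuffer:
    """Rolling buffer that hands out its most recent frames on request."""

    def __init__(self, capacity=30, clear_after_snapshot=True):
        self.frames = deque(maxlen=capacity)  # oldest frames fall off automatically
        self.clear_after_snapshot = clear_after_snapshot

    def push(self, frame):
        self.frames.append(frame)

    def take(self, n):
        """Return up to the last n frames, as a hotkey press would."""
        snap = list(self.frames)[-n:] if n > 0 else []
        if self.clear_after_snapshot:
            self.frames.clear()  # a second press cannot resend the same audio
        return snap
```

With clearing enabled, two presses in a row yield the audio once and then an empty snapshot. With clearing disabled, both presses return the same fragment.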


Frequently Asked Questions

Recording triggers from noise

The neural network filters out non-speech sounds well, but a high level of background noise may still cause false triggers. Try reducing the microphone sensitivity in the system settings or using a headset.

Cuts off the beginning of phrases

In manual mode, increase the buffer length.

Recordings are discarded as empty

The neural network checks each recording for the presence of speech. If the model does not detect a voice (for example, only background noise or music was recorded), the recording is automatically discarded. Make sure your microphone is properly configured and your speech is clear enough.