Generate SRT Subtitles from Video — Offline Speech-to-Text

Drop one or more video (or audio) files onto MiniMax Converter and get a synced .srt subtitle file generated by offline Whisper speech-to-text, with 99 languages supported and the spoken language auto-detected. Choose timestamped cues or lyrics-style plain text, and optionally translate to English. Everything runs on your machine: no upload, no file-size limit, no watermark.

How to use it

Drop your video or audio file (or a whole folder) onto MiniMax Converter and choose Transcribe.
Pick the output format: .srt with timestamps for subtitles, or lyrics-style plain text.
Choose options — auto-detect or a specific language, multi-language mode, translate-to-English, or word-level timestamps for karaoke alignment.
Run it; the .srt file is saved next to your source video when transcription completes.

Standard .srt, ready to use

The output is a plain .srt file with numbered cues and HH:MM:SS,mmm timestamps, the format that VLC, YouTube, Premiere, DaVinci Resolve and most other players and editors accept. Timing is taken straight from Whisper's per-segment timestamps, so cues follow the speech as the model segmented it. You can also export .vtt or a plain-text transcript, and turn on word-level timestamps (one word per cue) for tight karaoke-style alignment.
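
For readers who want to see what that format looks like in practice, here is a minimal Python sketch that turns timed segments into .srt text. It is an illustration only, not MiniMax Converter's own code: the (start, end, text) segment shape and the function names srt_timestamp and segments_to_srt are assumptions made for this example.

```python
from typing import Iterable, Tuple

# Hypothetical segment shape: (start seconds, end seconds, text).
Segment = Tuple[float, float, str]


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (SRT puts a comma before the milliseconds)."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered cues, with a blank line between cues."""
    cues = []
    for index, (start, end, text) in enumerate(segments, start=1):
        cues.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    return "\n".join(cues)
```

Each cue is three lines (number, time range, text) followed by a blank line. A .vtt file is laid out similarly but starts with a WEBVTT header line and uses a period instead of a comma before the milliseconds.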

99 languages, auto-detect, and a lyrics mode

Transcription runs on offline Whisper (whisper.cpp), which supports 99 languages and detects the spoken language automatically by default. There's a multi-language mode that re-detects per chunk for mixed-language audio, and an optional translate-to-English pass. For music, a lyrics mode turns the audio into line-broken plain text, with an optional vocal-isolation pre-pass that strips backing music so the model hears the vocals more clearly.

Why offline?

Online subtitle generators upload your video to a server, cap file sizes, queue you behind other jobs, and often watermark or paywall the result. MiniMax Converter runs Whisper locally, so there's no upload and no size limit — a feature film or a multi-hour recording is fine. It's also hardware-accelerated where your machine allows (Apple Silicon CoreML, CUDA or Vulkan on supported GPUs, CPU otherwise), and your footage never leaves your computer.

Questions and answers

What languages can it transcribe?

All 99 languages supported by Whisper. By default it auto-detects the spoken language from the audio; you can also force a specific language, or use multi-language mode to re-detect per segment for mixed-language audio.

Can it transcribe audio files, not just video?

Yes. Drop an audio file (MP3, WAV, M4A, FLAC, etc.) and it transcribes the same way — the app extracts a 16 kHz mono track internally before running Whisper. The lyrics mode is built specifically for songs.
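
The page does not say how the app performs that extraction. As a rough sketch of an equivalent step, assuming the ffmpeg command-line tool is installed, the following Python helper builds a command that converts any input to a 16 kHz mono 16-bit WAV. The function name build_extract_command and the file names are hypothetical.

```python
from pathlib import Path
from typing import List


def build_extract_command(source: Path, target: Path) -> List[str]:
    """Build an ffmpeg command that writes a 16 kHz mono 16-bit PCM WAV."""
    return [
        "ffmpeg",
        "-y",                 # overwrite the target if it already exists
        "-i", str(source),    # input video or audio file
        "-vn",                # ignore any video stream
        "-ac", "1",           # downmix to a single (mono) channel
        "-ar", "16000",       # resample to 16 kHz
        "-c:a", "pcm_s16le",  # 16-bit little-endian PCM samples
        str(target),
    ]
```

To run it, pass the list to subprocess.run(cmd, check=True). Building the command as a list rather than a single string avoids shell-quoting problems with file names that contain spaces.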

Is the timing accurate enough for subtitles?

The .srt cues use Whisper's own per-segment timestamps, which are good for most subtitling. For tighter alignment you can enable word-level timestamps, which emit one word per cue for karaoke-style sync. Quality depends on the model size you pick — larger models are more accurate but slower.
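
To make "one word per cue" concrete, here is a small Python sketch that maps timed words to cues. The Word and Cue shapes are assumptions for illustration. Trimming a word's end time so it never overlaps the next word is a design choice of this sketch, not documented behavior of the app.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    """Hypothetical shape for one recognized word."""
    text: str
    start: float  # seconds
    end: float    # seconds


class Cue(NamedTuple):
    index: int
    start: float
    end: float
    text: str


def words_to_cues(words: List[Word]) -> List[Cue]:
    """Emit one cue per word, trimming any end time that overlaps the next word."""
    cues = []
    for i, word in enumerate(words):
        end = word.end
        if i + 1 < len(words):
            # Keep cues from overlapping so only one word shows at a time.
            end = min(end, words[i + 1].start)
        cues.append(Cue(i + 1, word.start, end, word.text.strip()))
    return cues
```

The resulting cues can then be written out in .srt form like any other cue list, just with much shorter durations.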

Does the transcription happen on a server?

No. Speech-to-text runs entirely on your machine via offline Whisper — nothing is uploaded and there's no file-size cap. It uses GPU acceleration where available (CoreML on Apple Silicon, CUDA/Vulkan on supported GPUs) and falls back to CPU otherwise.

Get MiniMax Converter

Cross-platform desktop app. Linux free for non-commercial use; Windows & macOS one-time €20 license. No subscription, no telemetry, no account.
