A Python-based speech-to-text (STT) transcription tool using OpenAI's Whisper model, with configurable output formats and automated file organization.
- Web Interface: User-friendly Gradio web UI for easy drag-and-drop transcription
- Command Line Interface: Full CLI support for automation and scripting
- Multiple Output Formats: SRT, TXT, JSON, VTT, TSV
- CUDA GPU Acceleration: Automatic NVIDIA GPU detection for faster transcription
- Automatic File Organization: Timestamped output folders
- Environment Configuration: `.env` file support for default settings
- Input/Output Folder Structure: Organized file management
- Language Detection: Auto-detect or specify language
- Model Selection: Choose from different Whisper model sizes
- Task Types: Transcription or translation
- uv package manager (an all-in-one package, environment, and runtime manager)
- FFmpeg (for audio processing)
- NVIDIA GPU with CUDA (optional, for acceleration)
Only if you don't have it already (click to expand)
- Install FFmpeg (if not already installed):

macOS:

```
brew install ffmpeg
```

Ubuntu/Debian:

```
sudo apt update
sudo apt install ffmpeg
```

Windows: Download from the FFmpeg official website
- Install uv (if not already installed):

macOS:

```
brew install uv
```

OR

```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Ubuntu/Debian:

```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Windows:

```
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
- Clone the repository:

```
git clone https://github.com/arifulislamat/whisper-voice-transcription.git
cd whisper-voice-transcription
```
- Install dependencies using uv:

For CPU-only mode (default):

```
uv sync
```

For Linux/Windows with an NVIDIA GPU (CUDA 12.8):

```
uv sync --extra cu128
```

Note: To use the above command, you first need to uncomment lines 29 to 46 in the `pyproject.toml` file.
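After syncing with the CUDA extra, you can confirm that the GPU-enabled PyTorch build is actually in use. This is a quick standalone check, run from the project environment (e.g. `uv run python`):

```python
# Sanity check that PyTorch can see the GPU.
import torch

print(torch.__version__)             # installed PyTorch build
print(torch.cuda.is_available())     # True if a CUDA device is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the name of your NVIDIA card
```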
This project now includes a user-friendly web interface powered by Gradio, perfect for non-technical users or quick transcriptions.
```
uv run python main.py --web
```

This will start a web server (usually at http://127.0.0.1:7860) where you can:
- Upload audio files via drag-and-drop or file picker
- Select Whisper model from dropdown (tiny.en, base.en, small.en, medium.en, large, turbo)
- Choose language (auto-detect or specific language)
- Pick output formats (multiple selection: SRT, TXT, JSON, VTT, TSV)
- Set device preference (auto, CUDA, CPU)
- Start Transcription directly from the browser
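For orientation, a Gradio UI like this can be wired up in a few lines. The sketch below is illustrative only: the function name `transcribe_file` and the exact components are assumptions, not the project's actual code (which lives in `main.py` and `transcription_core.py`):

```python
# Minimal sketch of a Gradio transcription UI (illustrative, not the real app).
import gradio as gr

def transcribe_file(audio_path: str, model_name: str) -> str:
    # Placeholder: the real app would call the shared transcription logic here.
    return f"Would transcribe {audio_path} with model {model_name}"

demo = gr.Interface(
    fn=transcribe_file,
    inputs=[
        gr.Audio(type="filepath", label="Audio file"),
        gr.Dropdown(
            ["tiny.en", "base.en", "small.en", "medium.en", "large", "turbo"],
            value="small.en", label="Whisper model",
        ),
    ],
    outputs=gr.Textbox(label="Transcript"),
)
demo.launch()  # serves at http://127.0.0.1:7860 by default
```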
Place your audio files in the inputs/ folder:

```
mkdir inputs
cp english-voice-example.mp3 inputs/
```

Using uv (simplest):

```
uv run main.py --audio english-voice-example.mp3
```

Override specific settings:

```
uv run main.py --audio english-voice-example.mp3 --model large --language es
```

Multiple output formats:

```
uv run main.py --audio english-voice-example.mp3 --formats srt,txt,json,vtt
```

Translation task:

```
uv run main.py --audio spanish-audio.mp3 --task translate --language es
```

Force GPU/CPU usage:

```
uv run main.py --audio audio.mp3 --device cuda  # Force CUDA
uv run main.py --audio audio.mp3 --device cpu   # Force CPU
```
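`--device auto` presumably resolves to CUDA when a GPU is visible and falls back to CPU otherwise; in PyTorch that decision is a one-liner. A sketch of the idea (not necessarily the project's exact logic):

```python
import torch

def resolve_device(preference: str = "auto") -> str:
    # "auto" picks CUDA when a GPU is available, otherwise CPU.
    if preference == "auto":
        return "cuda" if torch.cuda.is_available() else "cpu"
    return preference  # explicit "cuda" or "cpu" is passed through
```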
Complete configuration override:

```
uv run python main.py --audio audio-file.wav \
  --model medium \
  --language bn \
  --task transcribe \
  --formats srt,txt \
  --device auto
```

Usage:

```
python main.py [OPTIONS]

Options:
  --web               Launch web interface instead of CLI mode
  --audio FILEPATH    Path to audio file (required for CLI mode)
  --model MODEL       Whisper model size (default from .env)
  --language CODE     Language code (default from .env or auto-detect)
  --task TASK         transcribe or translate (default from .env)
  --formats FORMATS   Comma-separated formats (default from .env)
  --device DEVICE     auto, cuda, or cpu (default from .env)
```
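An options surface like this maps naturally onto argparse with `.env`-backed defaults. The following is a hedged sketch of how that wiring might look (names mirror the documented options, but this is not the project's actual code):

```python
# Illustrative CLI wiring with .env defaults (requires python-dotenv).
import argparse
import os

from dotenv import load_dotenv

load_dotenv()  # pull defaults from .env into the environment

parser = argparse.ArgumentParser(description="Whisper voice transcription")
parser.add_argument("--web", action="store_true", help="Launch web interface")
parser.add_argument("--audio", help="Path to audio file (required for CLI mode)")
parser.add_argument("--model", default=os.getenv("WHISPER_MODEL", "small.en"))
parser.add_argument("--language", default=os.getenv("WHISPER_LANGUAGE"))
parser.add_argument("--task", default=os.getenv("WHISPER_TASK", "transcribe"),
                    choices=["transcribe", "translate"])
parser.add_argument("--formats", default=os.getenv("WHISPER_FORMATS", "srt,txt,json"))
parser.add_argument("--device", default=os.getenv("WHISPER_DEVICE", "auto"),
                    choices=["auto", "cuda", "cpu"])
args = parser.parse_args()
```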
| Feature | CLI | Web Interface |
|---|---|---|
| Automation | ✅ Perfect for scripts | ❌ Manual only |
| Ease of use | ❌ Requires command knowledge | ✅ Point and click |
| Batch processing | ✅ Easy with shell scripts | ❌ One file at a time |
| User-friendly | ❌ Technical users only | ✅ Anyone can use |
| Remote access | ❌ Local only | ✅ Can be hosted |
| Mobile support | ❌ No | ✅ Yes |
| File management | ❌ Manual | ✅ Automatic download |
```
whisper-voice-transcription/      # Repo directory
├── main.py                       # Unified CLI & Web interface
├── transcription_core.py         # Shared transcription logic
├── .env                          # Environment configuration
├── inputs/                       # Place audio files here (optional)
├── outputs/                      # Transcription outputs (auto-created)
│   └── YYYYMMDD_HHMMSS/          # Timestamped folders per run
├── pyproject.toml                # Project dependencies
├── uv.lock                       # uv package lock file
├── bangla-voice-example.mp3      # Sample audio (Bangla)
├── english-voice-example.mp3     # Sample audio (English)
└── README.md                     # This file
```
Create or modify the `.env` file to set default values:

```
WHISPER_MODEL=small.en
WHISPER_LANGUAGE=en
WHISPER_TASK=transcribe
WHISPER_FORMATS=srt,txt,json
WHISPER_DEVICE=auto
GRADIO_SHARE=false
```
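Values read from `.env` arrive as strings, so booleans like `GRADIO_SHARE` need explicit parsing. A minimal sketch using python-dotenv (the variable handling here is illustrative, not the project's actual code):

```python
import os

from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads .env from the current directory into os.environ

model = os.getenv("WHISPER_MODEL", "small.en")
formats = os.getenv("WHISPER_FORMATS", "srt,txt,json").split(",")
# Environment values are strings, so booleans must be parsed explicitly:
share = os.getenv("GRADIO_SHARE", "false").strip().lower() == "true"
```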
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~10x |
| base | 74 M | base.en | base | ~1 GB | ~7x |
| small | 244 M | small.en | small | ~2 GB | ~4x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
| turbo | 809 M | N/A | turbo | ~6 GB | ~8x |
- English audio: Use `.en` models for better performance
- Limited VRAM: Start with `tiny` or `base`
- Best quality: Use `large` or `turbo`
- Balanced: `small.en` for English, `small` for multilingual
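If you want to experiment with these sizes outside the app, the reference openai-whisper API maps them directly onto `whisper.load_model`. A standalone example, separate from this project's own wrapper code:

```python
import whisper

# English-only model: faster and more accurate for English audio.
model = whisper.load_model("small.en")   # ~2 GB VRAM
result = model.transcribe("english-voice-example.mp3", task="transcribe")
print(result["text"])

# Multilingual model with an explicit language hint.
model = whisper.load_model("small")
result = model.transcribe("bangla-voice-example.mp3", language="bn")
print(result["text"])
```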
SRT:

```
1
00:00:00,000 --> 00:00:04,000
Hello, this is a sample transcription.

2
00:00:04,000 --> 00:00:08,000
The text appears with timestamps.
```

TXT:

```
Hello, this is a sample transcription.
The text appears with timestamps.
```

JSON:

```
[
  {
    "start": 0.0,
    "end": 4.0,
    "text": "Hello, this is a sample transcription."
  }
]
```

VTT:

```
WEBVTT

00:00:00.000 --> 00:00:04.000
Hello, this is a sample transcription.
```

TSV (tab-separated columns):

```
start	end	speaker	text
0.0	4.0		Hello, this is a sample transcription.
```
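The SRT layout above is simple enough to generate by hand from Whisper's segment output. A minimal sketch (the helper names are illustrative, not functions from this project):

```python
def format_timestamp(seconds: float) -> str:
    # SRT timestamps are HH:MM:SS,mmm with a comma before milliseconds.
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    # segments: iterable of dicts with "start", "end", "text" keys,
    # as produced by whisper's transcribe() result["segments"].
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{format_timestamp(seg['start'])} --> "
            f"{format_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```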
"Audio file not found" error:
- Check that the file path is correct
- Ensure the file exists and is accessible
FFmpeg errors:
- Install FFmpeg using the instructions above
- Ensure FFmpeg is in your system PATH
Web interface won't start:

```
# Check if port 7860 is available
lsof -i :7860

# Kill the process using the port if needed
pkill -f "python main.py --web"
```
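If 7860 is simply taken, Gradio itself accepts an alternative port via `launch(server_port=...)` and also honors the `GRADIO_SERVER_PORT` environment variable. Whether this project exposes that hook is an assumption; check `main.py`. A generic example:

```python
import gradio as gr

# Trivial demo app, only to show the port override.
demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")
demo.launch(server_port=7861)  # use a free port instead of the default 7860
```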
Web upload fails:
- Check that the file format is supported (MP3, WAV, M4A, FLAC, etc.)
- Ensure the file size is reasonable (<500 MB recommended)
- Verify your browser supports file uploads
Out of memory errors:
- Use a smaller model (`tiny`, `base`, `small`)
- Close other applications to free up RAM/VRAM
- Use CPU mode: `--device cpu`
Slow transcription:
- Use GPU if available: ensure CUDA is properly installed
- Choose a smaller model for faster processing
- Check if other processes are using GPU resources
CUDA not detected:

```
# Check CUDA installation
nvidia-smi

# Check PyTorch CUDA support
python -c "import torch; print(torch.cuda.is_available())"
```

To run without uv, activate the virtual environment first.

macOS/Linux:

```
source .venv/bin/activate
```

Windows (CMD):

```
.venv\Scripts\activate.bat
```

Start STT:

```
python main.py --audio english-voice-example.mp3
```

CUDA out of memory:
- Use smaller models (`tiny`, `base`, `small`)
- Reduce the batch size by processing shorter audio clips
- Close other GPU-intensive applications
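One way to act on the "process shorter audio clips" advice above is to split long files before transcription. A sketch using pydub (an additional dependency, assumed to be installed separately; it also requires FFmpeg):

```python
from pydub import AudioSegment

def split_audio(path: str, chunk_minutes: int = 5) -> list[str]:
    # Split an audio file into fixed-length chunks to bound memory use,
    # then transcribe each chunk separately.
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000  # pydub works in milliseconds
    paths = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        out = f"{path}.part{i}.mp3"
        audio[start:start + chunk_ms].export(out, format="mp3")
        paths.append(out)
    return paths
```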
This project is open source under the MIT License; see the license file for details.
We welcome contributions. A minimal workflow:
- Fork the repository.
- Create a branch for your change: `git checkout -b feat/your-feature`.
- Make changes and add tests where applicable.
- Run any project linters/tests and ensure they pass.
- Commit with clear messages and push your branch: `git push origin feat/your-feature`.
- Open a Pull Request against the `main` branch, describe the change, and reference any related issues.
- Address any feedback and iterate as needed.
If you encounter any issues or have questions, please:
- Search existing issues
- Create a new issue with detailed information if one doesn't already exist
Happy transcribing!
