A Python-based speech-to-text (STT) transcription tool using OpenAI's Whisper model, with configurable output formats and automated file organization.
- Web Interface: User-friendly Gradio web UI for easy drag-and-drop transcription
- Command Line Interface: Full CLI support for automation and scripting
- Multiple Output Formats: SRT, TXT, JSON, VTT, TSV
- CUDA GPU Acceleration: Automatic NVIDIA GPU detection for faster transcription
- Automatic File Organization: Timestamped output folders
- Environment Configuration: `.env` file support for default settings
- Input/Output Folder Structure: Organized file management
- Language Detection: Auto-detect or specify language
- Model Selection: Choose from different Whisper model sizes
- Task Types: Transcription or translation
- uv package manager (an all-in-one package, environment, and runtime manager)
- FFmpeg (for audio processing)
- NVIDIA GPU with CUDA (optional, for acceleration)
Only if you don't have it already (click to expand)
- Install FFmpeg (if not already installed):

macOS:

```
brew install ffmpeg
```

Ubuntu/Debian:

```
sudo apt update
sudo apt install ffmpeg
```

Windows: Download from the FFmpeg official website
- Install uv (if not already installed):

macOS:

```
brew install uv
```

OR

```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Ubuntu/Debian:

```
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Windows:

```
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```
- Clone the repository:

```
git clone https://github.com/arifulislamat/whisper-voice-transcription.git
cd whisper-voice-transcription
```
- Install dependencies using uv:

For CPU-only mode (default):

```
uv sync
```

For Linux/Windows with an NVIDIA GPU (CUDA 12.8):

```
uv sync --extra cu128
```

Note: To use the above command, you first need to uncomment lines 29 to 46 in the `pyproject.toml` file.
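After syncing with the CUDA extra, you can confirm that the GPU-enabled PyTorch build is actually in use. This is a quick standalone check, run from the project environment (e.g. `uv run python`):

```python
# Sanity check that PyTorch can see the GPU.
import torch

print(torch.__version__)             # installed PyTorch build
print(torch.cuda.is_available())     # True if a CUDA device is usable
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the name of your NVIDIA card
```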
This project now includes a user-friendly web interface powered by Gradio, perfect for non-technical users or quick transcriptions.
```
uv run python main.py --web
```

This will start a web server (usually at http://127.0.0.1:7860) where you can:
- Upload audio files via drag-and-drop or file picker
- Select Whisper model from dropdown (tiny.en, base.en, small.en, medium.en, large, turbo)
- Choose language (auto-detect or specific language)
- Pick output formats (multiple selection: SRT, TXT, JSON, VTT, TSV)
- Set device preference (auto, CUDA, CPU)
- Start Transcription directly from the browser
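For orientation, a Gradio UI like this can be wired up in a few lines. The sketch below is illustrative only: the function name `transcribe_file` and the exact components are assumptions, not the project's actual code (which lives in `main.py` and `transcription_core.py`):

```python
# Minimal sketch of a Gradio transcription UI (illustrative, not the real app).
import gradio as gr

def transcribe_file(audio_path: str, model_name: str) -> str:
    # Placeholder: the real app would call the shared transcription logic here.
    return f"Would transcribe {audio_path} with model {model_name}"

demo = gr.Interface(
    fn=transcribe_file,
    inputs=[
        gr.Audio(type="filepath", label="Audio file"),
        gr.Dropdown(
            ["tiny.en", "base.en", "small.en", "medium.en", "large", "turbo"],
            value="small.en", label="Whisper model",
        ),
    ],
    outputs=gr.Textbox(label="Transcript"),
)
demo.launch()  # serves at http://127.0.0.1:7860 by default
```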
Place your audio files in the inputs/ folder:

```
mkdir inputs
cp english-voice-example.mp3 inputs/
```

Using uv (simplest):

```
uv run main.py --audio english-voice-example.mp3
```

Override specific settings:

```
uv run main.py --audio english-voice-example.mp3 --model large --language es
```

Multiple output formats:

```
uv run main.py --audio english-voice-example.mp3 --formats srt,txt,json,vtt
```

Translation task:

```
uv run main.py --audio spanish-audio.mp3 --task translate --language es
```

Force GPU/CPU usage:

```
uv run main.py --audio audio.mp3 --device cuda  # Force CUDA
uv run main.py --audio audio.mp3 --device cpu   # Force CPU
```
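`--device auto` presumably resolves to CUDA when a GPU is visible and falls back to CPU otherwise; in PyTorch that decision is a one-liner. A sketch of the idea (not necessarily the project's exact logic):

```python
import torch

def resolve_device(preference: str = "auto") -> str:
    # "auto" picks CUDA when a GPU is available, otherwise CPU.
    if preference == "auto":
        return "cuda" if torch.cuda.is_available() else "cpu"
    return preference  # explicit "cuda" or "cpu" is passed through
```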
Complete configuration override:

```
uv run python main.py --audio audio-file.wav \
  --model medium \
  --language bn \
  --task transcribe \
  --formats srt,txt \
  --device auto
```

Usage:

```
python main.py [OPTIONS]

Options:
  --web               Launch web interface instead of CLI mode
  --audio FILEPATH    Path to audio file (required for CLI mode)
  --model MODEL       Whisper model size (default from .env)
  --language CODE     Language code (default from .env or auto-detect)
  --task TASK         transcribe or translate (default from .env)
  --formats FORMATS   Comma-separated formats (default from .env)
  --device DEVICE     auto, cuda, or cpu (default from .env)
```
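An options surface like this maps naturally onto argparse with `.env`-backed defaults. The following is a hedged sketch of how that wiring might look (names mirror the documented options, but this is not the project's actual code):

```python
# Illustrative CLI wiring with .env defaults (requires python-dotenv).
import argparse
import os

from dotenv import load_dotenv

load_dotenv()  # pull defaults from .env into the environment

parser = argparse.ArgumentParser(description="Whisper voice transcription")
parser.add_argument("--web", action="store_true", help="Launch web interface")
parser.add_argument("--audio", help="Path to audio file (required for CLI mode)")
parser.add_argument("--model", default=os.getenv("WHISPER_MODEL", "small.en"))
parser.add_argument("--language", default=os.getenv("WHISPER_LANGUAGE"))
parser.add_argument("--task", default=os.getenv("WHISPER_TASK", "transcribe"),
                    choices=["transcribe", "translate"])
parser.add_argument("--formats", default=os.getenv("WHISPER_FORMATS", "srt,txt,json"))
parser.add_argument("--device", default=os.getenv("WHISPER_DEVICE", "auto"),
                    choices=["auto", "cuda", "cpu"])
args = parser.parse_args()
```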
| Feature | CLI | Web Interface |
|---|---|---|
| Automation | ✅ Perfect for scripts | ❌ Manual only |
| Ease of use | ❌ Requires command knowledge | ✅ Point and click |
| Batch processing | ✅ Easy with shell scripts | ❌ One file at a time |
| User-friendly | ❌ Technical users only | ✅ Anyone can use |
| Remote access | ❌ Local only | ✅ Can be hosted |
| Mobile support | ❌ No | ✅ Yes |
| File management | ❌ Manual | ✅ Automatic download |
```
whisper-voice-transcription/      # Repo directory
├── main.py                       # Unified CLI & Web interface
├── transcription_core.py         # Shared transcription logic
├── .env                          # Environment configuration
├── inputs/                       # Place audio files here (optional)
├── outputs/                      # Transcription outputs (auto-created)
│   └── YYYYMMDD_HHMMSS/          # Timestamped folders per run
├── pyproject.toml                # Project dependencies
├── uv.lock                       # uv package lock file
├── bangla-voice-example.mp3      # Sample audio (Bangla)
├── english-voice-example.mp3     # Sample audio (English)
└── README.md                     # This file
```
Create or modify the `.env` file to set default values:

```
WHISPER_MODEL=small.en
WHISPER_LANGUAGE=en
WHISPER_TASK=transcribe
WHISPER_FORMATS=srt,txt,json
WHISPER_DEVICE=auto
GRADIO_SHARE=false
```
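Values read from `.env` arrive as strings, so booleans like `GRADIO_SHARE` need explicit parsing. A minimal sketch using python-dotenv (the variable handling here is illustrative, not the project's actual code):

```python
import os

from dotenv import load_dotenv  # python-dotenv

load_dotenv()  # reads .env from the current directory into os.environ

model = os.getenv("WHISPER_MODEL", "small.en")
formats = os.getenv("WHISPER_FORMATS", "srt,txt,json").split(",")
# Environment values are strings, so booleans must be parsed explicitly:
share = os.getenv("GRADIO_SHARE", "false").strip().lower() == "true"
```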
| Size | Parameters | English-only model | Multilingual model | Required VRAM | Relative speed |
|---|---|---|---|---|---|
| tiny | 39 M | tiny.en | tiny | ~1 GB | ~10x |
| base | 74 M | base.en | base | ~1 GB | ~7x |
| small | 244 M | small.en | small | ~2 GB | ~4x |
| medium | 769 M | medium.en | medium | ~5 GB | ~2x |
| large | 1550 M | N/A | large | ~10 GB | 1x |
| turbo | 809 M | N/A | turbo | ~6 GB | ~8x |
- English audio: Use `.en` models for better performance
- Limited VRAM: Start with `tiny` or `base`
- Best quality: Use `large` or `turbo`
- Balanced: `small.en` for English, `small` for multilingual
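If you want to experiment with these sizes outside the app, the reference openai-whisper API maps them directly onto `whisper.load_model`. A standalone example, separate from this project's own wrapper code:

```python
import whisper

# English-only model: faster and more accurate for English audio.
model = whisper.load_model("small.en")   # ~2 GB VRAM
result = model.transcribe("english-voice-example.mp3", task="transcribe")
print(result["text"])

# Multilingual model with an explicit language hint.
model = whisper.load_model("small")
result = model.transcribe("bangla-voice-example.mp3", language="bn")
print(result["text"])
```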
SRT:

```
1
00:00:00,000 --> 00:00:04,000
Hello, this is a sample transcription.

2
00:00:04,000 --> 00:00:08,000
The text appears with timestamps.
```

TXT:

```
Hello, this is a sample transcription.
The text appears with timestamps.
```

JSON:

```
[
  {
    "start": 0.0,
    "end": 4.0,
    "text": "Hello, this is a sample transcription."
  }
]
```

VTT:

```
WEBVTT

00:00:00.000 --> 00:00:04.000
Hello, this is a sample transcription.
```

TSV (tab-separated columns):

```
start	end	speaker	text
0.0	4.0		Hello, this is a sample transcription.
```
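The SRT layout above is simple enough to generate by hand from Whisper's segment output. A minimal sketch (the helper names are illustrative, not functions from this project):

```python
def format_timestamp(seconds: float) -> str:
    # SRT timestamps are HH:MM:SS,mmm with a comma before milliseconds.
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    # segments: iterable of dicts with "start", "end", "text" keys,
    # as produced by whisper's transcribe() result["segments"].
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{format_timestamp(seg['start'])} --> "
            f"{format_timestamp(seg['end'])}\n{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```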
"Audio file not found" error:
- Check that the file path is correct
- Ensure the file exists and is accessible
FFmpeg errors:
- Install FFmpeg using the instructions above
- Ensure FFmpeg is in your system PATH
Web interface won't start:

```
# Check if port 7860 is available
lsof -i :7860

# Kill the process using the port if needed
pkill -f "python main.py --web"
```
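If 7860 is simply taken, Gradio itself accepts an alternative port via `launch(server_port=...)` and also honors the `GRADIO_SERVER_PORT` environment variable. Whether this project exposes that hook is an assumption; check `main.py`. A generic example:

```python
import gradio as gr

# Trivial demo app, only to show the port override.
demo = gr.Interface(fn=lambda text: text, inputs="text", outputs="text")
demo.launch(server_port=7861)  # use a free port instead of the default 7860
```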
Web upload fails:
- Check that the file format is supported (MP3, WAV, M4A, FLAC, etc.)
- Ensure the file size is reasonable (<500 MB recommended)
- Verify your browser supports file uploads
Out of memory errors:
- Use a smaller model (`tiny`, `base`, `small`)
- Close other applications to free up RAM/VRAM
- Use CPU mode: `--device cpu`
Slow transcription:
- Use GPU if available: ensure CUDA is properly installed
- Choose a smaller model for faster processing
- Check if other processes are using GPU resources
CUDA not detected:

```
# Check CUDA installation
nvidia-smi

# Check PyTorch CUDA support
python -c "import torch; print(torch.cuda.is_available())"
```

To run without uv, activate the virtual environment first.

macOS/Linux:

```
source .venv/bin/activate
```

Windows (CMD):

```
.venv\Scripts\activate.bat
```

Start STT:

```
python main.py --audio english-voice-example.mp3
```

CUDA out of memory:
- Use smaller models (`tiny`, `base`, `small`)
- Reduce the batch size by processing shorter audio clips
- Close other GPU-intensive applications
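One way to act on the "process shorter audio clips" advice above is to split long files before transcription. A sketch using pydub (an additional dependency, assumed to be installed separately; it also requires FFmpeg):

```python
from pydub import AudioSegment

def split_audio(path: str, chunk_minutes: int = 5) -> list[str]:
    # Split an audio file into fixed-length chunks to bound memory use,
    # then transcribe each chunk separately.
    audio = AudioSegment.from_file(path)
    chunk_ms = chunk_minutes * 60 * 1000  # pydub works in milliseconds
    paths = []
    for i, start in enumerate(range(0, len(audio), chunk_ms)):
        out = f"{path}.part{i}.mp3"
        audio[start:start + chunk_ms].export(out, format="mp3")
        paths.append(out)
    return paths
```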
This project is open source under the MIT License; see the license file for details.
We welcome contributions. A minimal workflow:
- Fork the repository.
- Create a branch for your change: `git checkout -b feat/your-feature`.
- Make changes and add tests where applicable.
- Run any project linters/tests and ensure they pass.
- Commit with clear messages and push your branch: `git push origin feat/your-feature`.
- Open a Pull Request against the `main` branch, describe the change, and reference any related issues.
- Address any feedback and iterate as needed.
If you encounter any issues or have questions, please:
- Search existing issues
- Create a new issue with detailed information if one doesn't already exist
Happy transcribing!
