A CLI-based audio description generation tool, built in Python and leveraging Ollama, Whisper, CLIP, Coqui TTS and FFMPEG.
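
For orientation, below is a rough sketch of how components like these are commonly combined for audio description: Whisper locates the dialogue, a vision model served by Ollama describes selected frames, Coqui TTS voices the descriptions, and FFMPEG mixes them back into the video (CLIP-based frame selection is omitted). This is a conceptual example only, not the code in `describe_video.py` or `process_video.py`; the model names, timestamps and file paths are placeholders.

```python
# Conceptual sketch only: NOT the repository's implementation. Model names,
# timestamps and file paths are illustrative placeholders.
import subprocess

import ollama            # pip install ollama
import whisper           # pip install openai-whisper
from TTS.api import TTS  # Coqui TTS

VIDEO = "./my_video.mp4"

# 1. Transcribe the dialogue with Whisper so descriptions can be placed in
#    the gaps between spoken lines.
speech = whisper.load_model("base").transcribe(VIDEO)
dialogue = [(seg["start"], seg["end"]) for seg in speech["segments"]]
print(f"Found {len(dialogue)} dialogue segments")

# 2. Grab a frame from a quiet moment with FFMPEG (10 s is a placeholder; a
#    real tool would pick timestamps from the gaps in `dialogue`).
subprocess.run(
    ["ffmpeg", "-y", "-ss", "10", "-i", VIDEO, "-frames:v", "1", "frame.jpg"],
    check=True,
)

# 3. Ask a local vision model served by Ollama to describe the frame.
reply = ollama.chat(
    model="gemma3:12b",
    messages=[{
        "role": "user",
        "content": "Describe this frame for a blind viewer in one sentence.",
        "images": ["frame.jpg"],
    }],
)
description = reply["message"]["content"]

# 4. Voice the description with Coqui TTS.
TTS("tts_models/en/ljspeech/tacotron2-DDC").tts_to_file(
    text=description, file_path="description.wav"
)

# 5. Delay the narration to the chosen timestamp and mix it into the original
#    audio track with FFMPEG, copying the video stream unchanged.
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO, "-i", "description.wav",
     "-filter_complex",
     "[1:a]adelay=10000|10000[d];[0:a][d]amix=inputs=2:duration=first[a]",
     "-map", "0:v", "-map", "[a]", "-c:v", "copy", "sketch_output.mp4"],
    check=True,
)
```
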
These installation instructions are written for macOS.
- Install pyenv for Python 3.11
  - Run `brew install pyenv`
  - Add these commands to your `~/.zshrc` or `~/.bashrc`:
    - `export PYENV_ROOT="$HOME/.pyenv"`
    - `[[ -d $PYENV_ROOT/bin ]] && export PATH="$PYENV_ROOT/bin:$PATH"`
    - `eval "$(pyenv init - zsh)"`
  - Restart your terminal and run `pyenv install 3.11`
- Install FFMPEG, Ollama and espeak
  - Run `brew install ffmpeg` for video editing
  - Run `brew install espeak` for some AI Text-To-Speech
  - Install Ollama from their site: https://ollama.com/download
  - Run `ollama pull gemma3:12b` and `ollama pull nomic-embed-text`
- Run `python3 -m venv ./venv && source ./venv/bin/activate`
  - Every time you start a new terminal for the project, run `source ./venv/bin/activate` (Python Environments will run that for you)
- Run `pip3 install -r requirements.txt`
- Run `python3 ./describe_video.py --input ./my_video.mp4 --output ./my_video_script.txt` to generate a video script file
- Then run `python3 ./process_video.py --input_video ./my_video.mp4 --input_text ./my_video_script.txt --output ./my_video_audio_description.mp4` (a batch-processing sketch follows this list)
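
If you have several videos to describe, a small wrapper can run the two commands above for each file. The sketch below uses exactly the CLI flags documented in the steps above, but the `./videos` folder and the output naming are assumptions made for this example.

```python
# Batch wrapper around the two documented commands. The ./videos folder and
# the output file naming are assumptions made for this example.
import subprocess
from pathlib import Path

for video in sorted(Path("./videos").glob("*.mp4")):
    script = video.with_name(video.stem + "_script.txt")
    described = video.with_name(video.stem + "_audio_description.mp4")

    # Generate the description script for this video.
    subprocess.run(
        ["python3", "./describe_video.py",
         "--input", str(video), "--output", str(script)],
        check=True,
    )

    # Render the audio-described video from the script.
    subprocess.run(
        ["python3", "./process_video.py",
         "--input_video", str(video), "--input_text", str(script),
         "--output", str(described)],
        check=True,
    )
```
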
To run pylint, run `pylint ./describe_video.py`. Pylint configuration is located in `./pyproject.toml`.