Chatbots are great for navigating complex domain-specific languages and APIs, especially if you rarely interact with these interfaces. This use case surfaced quickly with ChatGPT, which proved to be excellent at writing and explaining regular expressions.

But my favorite example of this use case is writing FFmpeg incantations.

For the unfamiliar, FFmpeg is a command-line tool for converting video and audio files. It is ridiculously powerful and ridiculously complex. Most people I know use Google as their primary interface for FFmpeg: search for the job to be done, then copy and paste the command. This pattern quickly migrated to ChatGPT and Claude, which turned out to be just as good at conjuring FFmpeg incantations.

But it can get even easier…

Recently, Simon Willison added an “extraction” feature to his llm utility, enabling us to return only the code from our LLM response, with none of the preamble. The tool looks for the first instance of Markdown-fenced code and returns only that block.
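The idea behind extraction is simple: scan the response for the first Markdown code fence and return only its contents. Here's a rough shell sketch of that logic (illustrative only, not the llm tool's actual implementation; the fence characters are built indirectly so the example contains no literal fences):

```shell
# Build the three-backtick fence string indirectly (octal 140 = backtick).
FENCE=$(printf '\140\140\140')

# Print the contents of the first fenced code block found on stdin.
extract_first_fence() {
    awk -v f="$FENCE" '
        index($0, f) == 1 { if (inblock) exit; inblock = 1; next }
        inblock { print }
    '
}

# A mock LLM response: preamble, a code block, and a sign-off.
response="Here is your command:
$FENCE
ffmpeg -i in.mp4 out.gif
$FENCE
Hope that helps!"

printf '%s\n' "$response" | extract_first_fence
# prints only: ffmpeg -i in.mp4 out.gif
```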

Armed with this feature, we can write a simple shell script to streamline this workflow:

#!/usr/bin/env zsh

# Check if the user wants to execute the command
flag_x=false
for arg in "$@"; do
    if [[ "$arg" == "-x" ]]; then
        flag_x=true
        break
    fi
done

# Hit the model and get the ffmpeg command
output=$(llm "$1" \
    --system "You are an expert at writing commands for ffmpeg. You will be given prompts describing what the user wants to do with ffmpeg. To the best of your abilities, translate these plain-language descriptions into a single one-liner that calls ffmpeg, with all the appropriate flags and input/output specifications. Do not use the variable 'total_frames' in any select statement. Ensure the command is wrapped as a code block." \
    --extract)

# Print the command, and execute it if -x was passed
echo "$output"
if [[ "$flag_x" == true ]]; then
    eval "$output"
fi

Stashing this script in my path as ffmsay, I can now run ffmsay -x 'Do things to video example.mp4'. The -x flag tells the script to not only print the command, but to execute it.
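Concretely, "stashing it in my path" just means dropping the file in a directory on your PATH and marking it executable. A self-contained sketch, using a temp directory and a stub script in place of the real one so it runs anywhere:

```shell
# Stand-in for a directory already on your PATH (e.g. ~/bin).
bindir=$(mktemp -d)

# A stub script in place of the real ffmsay, just to show the mechanics.
printf '#!/bin/sh\necho "ffmsay stub"\n' > "$bindir/ffmsay"

chmod +x "$bindir/ffmsay"   # mark the script executable
PATH="$bindir:$PATH"        # make it reachable by name

ffmsay                      # prints: ffmsay stub
```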

And it works great! Testing it on a random YouTube video I downloaded, here are some tasks I ran:

“Extract the audio from the sample.mp4 video and save it as a stereo mp3 with lossless compression.”

This yielded: ffmpeg -i sample.mp4 -vn -acodec libmp3lame -b:a 320k -ac 2 sample.mp3

“Convert the video sample.mp4 into a video the size of a postage stamp in black and white color.”

Yielding: ffmpeg -i sample.mp4 -vf "scale=50:50,format=gray" -sws_flags lanczos output.mp4

“Extract 9 random keyframes, as a grid in a single image named grid.png, from the video sample.mp4.”

Yielding: ffmpeg -i sample.mp4 -vf "select='eq(pict_type,I)',scale=320:180,tile=3x3" -frames:v 9 grid.png

This one threw a complaint about the output filename not containing a numeric pattern, but it did the job perfectly.

“Render a random snippet 2 seconds long from sample.mp4 as an animated GIF. Adjust the settings for a lightweight website GIF.”

This yielded the most complex call yet: ffmpeg -i sample.mp4 -t 2 -ss $(ffprobe -v error -show_entries format=duration -of default=noprint_wrappers=1:nokey=1 sample.mp4 | awk -v seed=$RANDOM 'BEGIN{srand(seed)} {print rand()*($1-2)}') -vf "fps=10,scale=320:-1:flags=lanczos,split[s0][s1];[s0]palettegen[p];[s1][p]paletteuse" -f gif output.gif
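The interesting trick in that command is computing a random seek offset from the clip's duration. Isolating just that arithmetic, with a hard-coded duration standing in for the ffprobe call so it doesn't need a real video file:

```shell
# Sketch of the random-start-time computation from the generated command.
# The real command pipes ffprobe's reported duration into awk; here we
# pretend ffprobe reported a 30-second video.
duration=30.0

# Pick a random start between 0 and duration-2, leaving room for a 2s snippet.
# $RANDOM seeds awk's RNG (bash/zsh only; srand() alone would also work).
start=$(echo "$duration" | awk -v seed=$RANDOM 'BEGIN{srand(seed)} {print rand()*($1-2)}')

echo "$start"   # some value in the range [0, 28)
```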

And it worked perfectly!

Interfacing with rarely used, complex interfaces is a perfect LLM use case. Previously, API and UX designers had to simplify the interfaces of powerful tools to produce reasonable surface areas for casual users. But with LLMs, we can expose all the complexity with few of the downsides.

This FFmpeg example is one we can handle out of the box, with little precise prompting. For newer or less commonly used interfaces, simple system prompting and/or a bit of fine-tuning can yield similar results.