
Oct 26, 2025
Building a Local AI Video Captioning Tool
Over a few late nights, I built a local AI tool that automatically transcribes videos and overlays captions. It started as a fast, scrappy proof of concept, but now I’m taking the time to rebuild it cleanly, with a local-first desktop focus.
Why I Built It
The idea isn’t totally new. Around a year ago, I got fascinated by those goofy brainrot TikToks, the ones with Minecraft parkour or Subway Surfers gameplay looping in the background while an AI voice reads Reddit stories.
So, I built a little Python pipeline that recreated the format step by step:
- Generate a script using the OpenAI API
- Convert it to audio with ElevenLabs (using the classic Adam voice)
- Transcribe it locally with a Whisper model to get word-level timestamps
- Grab a random gameplay clip (Minecraft, Subway Surfers, etc.)
- Combine everything: overlay the audio and add captions on top of the gameplay
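The captioning step hinges on turning word-level timestamps into short on-screen chunks. Here is a minimal sketch of that grouping, not the project's actual code. It assumes the transcription step yields a list of dicts with `word`, `start`, and `end` (in seconds); the function name `group_words` and the `max_words` and `max_gap` parameters are hypothetical, chosen only for illustration.

```python
def group_words(words, max_words=3, max_gap=0.6):
    """Group consecutive words into short caption chunks.

    Assumed input shape: [{"word": str, "start": float, "end": float}, ...]
    A new chunk starts when the current one is full, or when the pause
    between two words is longer than max_gap seconds.
    """
    chunks = []
    current = []
    for w in words:
        is_full = len(current) >= max_words
        long_pause = bool(current) and w["start"] - current[-1]["end"] > max_gap
        if current and (is_full or long_pause):
            chunks.append(current)
            current = []
        current.append(w)
    if current:
        chunks.append(current)

    # Collapse each chunk into one caption with its own time range.
    return [
        {
            "text": " ".join(w["word"].strip() for w in chunk),
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
        }
        for chunk in chunks
    ]
```

Keeping chunks to two or three words gives the fast, punchy caption rhythm typical of short-form video, while the pause check keeps a caption from lingering across a silence.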
Surprisingly, the results turned out really well. You can check out a sample here:
🎬 YouTube Short
💻 GitHub Project
Over time, I noticed I was using that tool less for brainrot videos and more just to caption existing videos with audio. That’s when I realized that maybe this could evolve into something more practical, creative, and genuinely useful.
The old pipeline worked, but it was slow and hard to reuse. My new proof of concept focuses on:
- Speed and local performance
- Word-level editing (adjusting caption start/end times manually)
- Easy customization of fonts, colors, and styling
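Word-level editing mostly comes down to moving one word's start or end time without letting it collide with its neighbours. The sketch below shows one way that could work, under my own assumptions: `CaptionWord`, `retime_word`, and `min_duration` are hypothetical names, and the SRT timestamp helper is included only because SRT is a common export format, not because the tool is confirmed to use it.

```python
from dataclasses import dataclass


@dataclass
class CaptionWord:
    text: str
    start: float  # seconds
    end: float    # seconds


def retime_word(words, index, new_start, new_end, min_duration=0.05):
    """Move one word's start/end, clamped so it never overlaps its neighbours."""
    lower = words[index - 1].end if index > 0 else 0.0
    upper = words[index + 1].start if index + 1 < len(words) else float("inf")
    start = max(lower, new_start)
    end = min(upper, new_end)
    if end - start < min_duration:
        raise ValueError("edit leaves no room for the word")
    words[index].start = start
    words[index].end = end
    return words[index]


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"
```

Clamping instead of rejecting an overlapping edit means a drag in a timeline UI simply stops at the neighbouring word, which tends to feel more natural than an error message.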
Essentially, I wanted a tool that feels like a local AI video caption studio: fast, accurate, and customizable.
Some Things I’m Still Figuring Out
The original project ran in Docker containers to avoid dependency issues. But for a real desktop app, I’ll need to bundle everything directly into the app itself.
That brings up new questions: how to handle model downloads efficiently, how to manage GPU acceleration locally, and how to make setup feel invisible for the user.
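For the model download question, one common pattern is to cache the file on first run and verify it with a checksum, so later launches skip the download entirely. This is only a sketch of that pattern, not a decision about the app's design: `ensure_model`, its parameters, and the `.part` suffix are all hypothetical, and the URL and checksum would have to come from wherever the model is actually hosted. GPU detection is left out here.

```python
import hashlib
import urllib.request
from pathlib import Path


def sha256_of(path):
    """Hash a file in 1 MiB blocks so large models never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def ensure_model(filename, url, expected_sha256, cache_dir,
                 fetch=urllib.request.urlretrieve):
    """Return the local path of a model file, downloading it only if needed."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / filename

    # Already cached and intact: nothing to do.
    if target.exists() and sha256_of(target) == expected_sha256:
        return target

    # Download under a temporary name so an interrupted download
    # never leaves a half-written file at the final path.
    partial = target.with_name(target.name + ".part")
    fetch(url, str(partial))
    if sha256_of(partial) != expected_sha256:
        partial.unlink()
        raise ValueError(f"checksum mismatch for {filename}")
    partial.replace(target)
    return target
```

Passing `fetch` as a parameter keeps the function testable offline and leaves room to swap in a downloader with progress reporting later.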
These are the problems I want to solve next, slowly, intentionally, and with a cleaner design this time around.