I stream on Twitch and I’ve liked video editing since I was a kid. There’s one glaring issue with making stream highlights though — it takes FOREVER. Think about it. If you have a 5 hour VOD, who’s going to scrub through 5 hours of VOD footage? You? The guy who just spent 5 hours playing video games on stream? Add the editing on top of that. You might spend two hours in editing per one hour of streaming! That adds up way too quickly.
Solving the Problem
Naturally, I want a quick and easy answer because I am a smart and lazy man. The goal is simple: use AI to cut down a stream VOD in an entertaining way.
First we had to decide on our AI strategy. Unfortunately, this is the weakest part of my knowledge. I usually see AI as a tool for code completion and funny jokes. How was I going to use it to cut down a video?
Processing the Data
There are many approaches, and I ended up choosing the simplest and cheapest one. I transcribed the video using Davinci Resolve’s transcribe feature. Of course, you can do this with other popular video editors like Adobe Premiere. The output is a transcript of the stream VOD with all the points in time where the words were said.
I then passed the transcribed audio into ChatGPT with a set of directions from the streamer. I used the following prompts for my results:
You are a clip editor for a streamer. You cut clips out of their streams to post them as youtube highlight reels. You will receive directions from the streamer about which clips to select. The way you select clips is from the transcript provided to you. Pick out the lines of the transcript that match the streamer’s description. Select at most 1 line per 100 lines of transcript. You should select lines that are coherent together. You should select lines that are not technical problems. Do not provide commentary in your response. Do not include the triple quotes in your response. Preserve linebreaks/newlines in your selections. Preserve the word order of the original text. Do not modify the line of text with quotes or bullet points.
A streamer has asked you to select these kinds of clips given the stream transcript in triple quotes. (streamer directions) “””(The transcript)”””
With this prompt, ChatGPT will select lines from the original transcript that it thinks matches our direction. We call this the “condensed transcript.”
This step was not too difficult. Writing a prompt that a baby computer can understand is not difficult for me. I think very literally. ๐ค
Creating the Highlight Reel
Our next problem was converting our condensed transcript into the final product — the highlight reel. Initially, our tool would select the segments of the VOD when the line was said (plus or minus 5 seconds for context) and combine them into one video using ffmpeg.
However, the limitations were quickly apparent. The editor had no idea where the clip came from in the VOD, and he can’t adjust the length of the clips very easily. He would basically need to scrub through the VOD to adjust the clips, which we were trying to avoid in the first place.
There is a fantastic solution to his called an Edit Decision List (.EDL). The film industry uses this universal format to indicate what cuts were made from what source footage. So we know where it came from in the VOD, and where it goes in the highlight reel. The format is supported by every major video editor, so this is exactly what we needed! We can import this file inside Resolve or Premiere, and it’s like someone else cut down the video for you.
We sent the script up in chunks since chatGPT can only process so much text at one time. ChatGPT charges you per word that you send to it. Since these are pretty big transcripts, it ended up being about 30 cents to process a 6 hour VOD. Once our tool created the .EDL file from this, we were in business.
Overall the tool takes about 10 minutes to run, with the only manual step being the transcription of the video using Resolve.
Results
The results below are unedited, raw outputs of the program. Of course, with the edit decision list we could have cleaned them up easily. But we want to see what the AI is doing directly.
Analysis
Benefits
The benefits of getting this right are huge. Right now, video editing is a net negative on your time. That is, for every hour you stream, you need to spend at least another hour editing down the footage. So it’s like your work is 2x your stream length if you want to make quality VODs yourself.
This solution makes the VOD editing trivial, or at least gets you most of the way there. Tapping into Youtube is usually the hardest part of being a streamer. It’s a high competition platform and creating quality content is very time consuming.
Shortcomings
Unfortunately, this solution is not durable enough for a competitive business model.
The AI doesn’t select EVERY cut that meets the direction. Of course, this is out of the box and with no fine-tuning on the model. There’s only so much prompt engineering can do. Clients find it frusturating, of course, because they have an intuition of what moments were the best in their stream. The problem we’re trying to solve is giving it to them based on description.
The AI also cuts people off while they’re speaking. This isn’t a problem if a human goes back to clean it up, but it’s definitely not usable to upload directly to Youtube.
The AI can’t create a cohesive narrative. Stories with beginnings, conflicts, middles, and ends are how we absorb our entertainment. The most popular videos on Youtube always take advantage of this fact. It’s a shortcoming of LLMs in general that they can’t make stories the way that humans can.
Additionally, the AI does not understand quantity very well. You could tell it to select 20% of a script, but that result could easily be 10% or 60%. That’s very irksome when the whole point of the project is to press a button and get a specified length of video.
Lastly, the transcriptions from Resolve are low quality. They get the job done, but it doesn’t recognize different speakers and its audio event transcription (like music or clapping) is very limited. Ideally we want to feed as much info as possible into our program.
How to Fix the Shortcomings
The approach of making ChatGPT do all of the work is flawed. Most of our problems with this project can be solved by reinterpreting ChatGPTs role in the process. We could assign a number from 1-100 ( a “HypeScore”) at different segments of the video. Then we could select the top percentiles of HypeScore and turn that into an Edit Decision List. This solves our quantification problem and is much cheaper to implement and maintain.
The AI cutting people off mid sentence could be fixed by analyzing the waveform of the VOD when generating the .EDL. In general, we’d tell the tool not to make a cut in the middle of a loud noise.
Specifying a final highlight reel length is also trivial with a HypeScore. Our tool determines the video length instead of ChatGPT. We also get more latitude to solve the problem of selecting the right clips. All we have to do is adjust how the HypeScore is calculated.
The low quality transcriptions are also straightforward to fix. We could spend more money on high quality transcription services. But that would be suicide because 1 hour of audio costs several dollars to transcribe.
Ideally we could transcribe high quality audio locally using this library, but it would take 6+ hours to transcribe a 5 hour VOD on my GTX 2080. It would be simpler to buy some Tesla K80s and make a GPU farm. Of course, it would be wise to shore up the other shortcomings before dumping $600 on a spicy bitcoin rig.
There’s little we can do about adding narrative — that’s up to the human editor or any emergent properties of our design.
Alternative Solutions
There are more economical and effective solutions for creating high quality VOD reels. For starters, you could pay a friend to watch your stream and have him write down the timestamps of the best moments with some notes. Having that information saves you from the most time consuming part — scrubbing the footage. You also get the benefit of a fresh pair of eyes on your content and what’s good about it.
Additionally, this video taught me a clever trick. By setting up a footpedal in OBS to activate a special microphone track, you can create bookmarks in your OBS recording. All you have to do is open it in Premiere and you’ll see the waveform of when you pressed the footpedal down.
Conclusion
The tool is in a primitive, but useful state. It can create stream highlight reels in a matter of minutes, even if they are mediocre. If the tool was fully automatic and ran after you ended a stream, I’d edit my VODs pretty much all the time. After all, 70% of the work is done for you. If your goal is to grow your Youtube presence with mediocre content and little effort, this tool is great. However, it is unsuitable as a direct to Youtube highlight reel tool.
The AI is especially good at selecting snippets that sound “youtube worthy”. LLMs are more or less trained on the eye-grabbing headline data of the internet. The AI is most effective as a TikTok making tool since they’re both suited to short contexts.
If you liked this article, please consider supporting me on my ko-fi. If I get 15 subscribers, I will release the source code for this project.