Detecting the Sound of Typing Using fast.ai

There’s a project I would like to work on that requires a large dataset of videos of typing. Instead of spending hours upon hours manually categorizing small video clips, I decided to automate the task using fast.ai.

Before I talk about what I’ve done, I want to emphasize that I am not an AI expert. I have quite a bit of experience with Linux and making various personal apps and scripts, but everything I know about AI comes from watching two lessons of the fast.ai course and playing around with it. The fact that I’ve been able to accomplish so much with so little experience is a true testament to how rapidly this field is improving.

Broadly, the app splits a video into short clips, makes audio spectrograms of those clips, analyzes those images, and determines if the clip contains audio of typing. Below is the process I used to make the machine learning model.

  • Downloaded video
  • Made 15 second clips from video
  • Made spectrograms of those clips
  • Categorized about 150 clips manually
  • Used alexnet and a limited dataset to make an inaccurate model
  • Used inaccurate model to improve signal-to-noise ratio and speed up further categorization
  • Made final model that appears over 90% accurate at finding clips that contain typing

Downloaded and Split the Video

I decided to use Twitch streamer moistcr1tikal’s videos as the base of this project for several reasons. His audio setup is pretty consistent, meaning any model I made would probably be usable on his other videos. He has a large following, meaning that if my final project works and gains any traction, I may get a lot of readers. Finally, I just like his content. Since I had to spend a few hours categorizing clips, it helped to do it with content I like.

I used youtube-dl to download a recently posted Among Us stream. It was a little over 5 hours long; I picked it because he was playing in a public lobby, and his on-screen typing in the lobby chat will be useful for the future project. In the end, I had one large mp4.

youtube-dl <url>

Splitting the file proved a little difficult. After a bit of research, I discovered and used mkvmerge to split the stream into 1017 separate mp4 files (mkvmerge needs an output name via -o and appends -001, -002, and so on to each piece).

mkvmerge -o clip.mp4 --split 15s amongus.mp4

Made Spectrograms and Manually Categorized

Current AI is pretty good at image recognition and categorization. Because of this, I believed the best way to train a model was to convert the audio to images, then train the model on those images. I had a folder full of mp4 files and wanted to make spectrograms of all of them, so I created the shell script below to do that. It uses ffmpeg to convert each mp4 to wav (in /tmp/ so it all stays in memory; to wav because the conversion is fast and lossless), then uses a piece of software called sox to create the spectrogram.

# for every clip: extract the audio to a temporary wav, then render a spectrogram
for f in ./*.mp4
do
    echo "$f"
    ffmpeg -y -i "$f" /tmp/output.wav
    sox /tmp/output.wav -n remix 2 spectrogram -h -r -o "${f}.png"
done

The sox parameters took some experimentation. In the first iteration, I tried the default color palette and limited the frequency range to roughly where human hearing is most sensitive (resampling to 6 kHz caps the spectrogram at 3 kHz).

The spectrograms below were all made from the same clip.
sox /tmp/output.wav -n rate 6k spectrogram -r

Notice the two audio channels (left and right) and the limited color spectrum.

I had issues getting any reasonable results with these settings. For the second iteration, I removed the frequency limit and picked a color palette that was at least easier for a human to read. I’m not sure whether the color choice helped train the model, though.

sox /tmp/output.wav -n remix 2 spectrogram -h -r

Single audio channel, no spectrum limiting, high contrast.

Judging by this spectrogram, he probably uses a noise gate, then a compressor, then a low-pass filter.

With good spectrograms made, I watched random video clips and categorized them based on whether Charlie (moistcr1tikal) was typing or not, moving the clips into the folder hierarchy shown below. The videos in the keysVisible folder adopted the hasKeySounds categorization via a function when forming the DataLoaders; a rough sketch follows the folder list.

PapaMoist/trainingData/noKeySounds
PapaMoist/trainingData/hasKeySounds
PapaMoist/trainingData/hasKeySounds/keysVisible
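
A minimal sketch of what that label function and the DataLoaders can look like with fast.ai. The path handling, validation split, and image size here are illustrative assumptions, not my exact values:

from fastai.vision.all import *

path = Path('PapaMoist/trainingData')

def label_func(fname):
    # keysVisible sits inside hasKeySounds, so anything under that branch
    # keeps the hasKeySounds label; everything else is noKeySounds
    return 'hasKeySounds' if 'hasKeySounds' in fname.parts else 'noKeySounds'

dls = ImageDataLoaders.from_path_func(
    path, get_image_files(path), label_func,
    valid_pct=0.2, seed=42, item_tfms=Resize(224))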

Initially I tried categorizing based on whether there were any key sounds at all, but I found there were too many clips of Charlie pressing keys while he was playing games. Training seemed to work better when I limited the categorization to typing vs. no typing.

I eventually ended up with about 120 noKeySounds clips and about 30 hasKeySounds clips.

Initial Model and Continued Training

150 images with only 30 examples of typing sounds is a tiny dataset for training this kind of model, but I was impatient and curious to see whether it could produce any improvement at all. I fiddled with it for a long time and found that if I did one fine-tune epoch with alexnet, I could get about a 50% error rate based on the metrics. I used this model to copy every clip with a predicted hasKeySounds categorization to a separate folder.
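
Roughly, that step looked like the sketch below. The learner call follows the standard fast.ai pattern; the folder names and the dls from the earlier sketch are assumptions for illustration:

import shutil
from fastai.vision.all import *
from torchvision.models import alexnet

# quick first pass: one fine-tuning epoch on the ~150 labeled spectrograms
learn = cnn_learner(dls, alexnet, metrics=error_rate)
learn.fine_tune(1)

# run the rough model over the remaining spectrograms and copy any clip it
# flags as hasKeySounds into a separate folder for faster manual review
unlabeled = get_image_files(Path('PapaMoist/clips'))     # assumed location
dest = Path('PapaMoist/predictedKeySounds')              # assumed destination
dest.mkdir(exist_ok=True)

for png in unlabeled:
    pred, _, _ = learn.predict(png)
    if pred == 'hasKeySounds':
        clip = png.with_suffix('')     # "clip-001.mp4.png" -> "clip-001.mp4"
        shutil.copy(clip, dest)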

After doing this, I found that the 50% error rate was definitely inaccurate: maybe 2 in 10 of the copied clips actually contained typing sounds. Luckily, this was still a much-improved signal-to-noise ratio, and the confusion matrix indicated that the major issue was the false positive rate (clips being categorized as hasKeySounds when there are none), not the false negative rate. Working from this pre-filtered set let me categorize clips much faster than before.

Once I had categorized about 270 clips, I made a new model using the pretrained model densenet121 and got a spectacular (to me) result of 90% accuracy. Further fiddling got it up to 96% accuracy. Not only that, but based on the confusion matrix and a spot check of the videos, false positives (clips categorized as hasKeySounds but with no typing sounds) seem quite rare, which I much prefer to false negatives because I’m lazy and will need to categorize these clips further, probably in 0.3-second chunks. Any silence I can avoid will be good.
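
The final run and the confusion-matrix check were along these lines. Again, just a sketch; the epoch count is an assumption:

from fastai.vision.all import *
from torchvision.models import densenet121

# dls rebuilt as before, now from ~270 manually categorized spectrograms
learn = cnn_learner(dls, densenet121, metrics=accuracy)
learn.fine_tune(4)    # the exact number of epochs took some fiddling

# check where the model is still wrong, especially the false positives
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix()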

I tried using this model to predict clips from another video, but it was completely inaccurate. The issue is that I’m not sure what the audio codec and quality were on the original video. I’ll probably end up downloading a few more videos and making a more diverse training dataset. Hopefully, after determining that the quality is the same, I’ll be able to use the model more generally.
