Environmental Sound Classification

This was a really interesting week, and I enjoyed playing around with the notebook. For my own little project, I decided to dispense with notebooks and the p5 editor and run everything locally: a small Python server script and a p5.js-based frontend. The model is AST-ESC50, an Audio Spectrogram Transformer trained on the ESC-50 environmental sound dataset.

The frontend records audio in three-second chunks using the MediaRecorder API, then ships each chunk to the server as a POST request. The server converts the WebM blob to a WAV file using pydub/ffmpeg, feeds it to the classifier, and returns the top five predictions with confidence scores. The sketch displays these as labeled bars, updating every cycle.
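The server side of that cycle can be sketched in two small functions. This is a minimal sketch, not my exact script: the function names are my own, the conversion assumes pydub with ffmpeg on the PATH, and the prediction dicts follow the `{"label": ..., "score": ...}` shape that Hugging Face audio-classification pipelines return.

```python
import io

def webm_to_wav(webm_bytes):
    """Convert a WebM blob from MediaRecorder into WAV bytes.

    Assumes pydub is installed and ffmpeg is available on the PATH.
    """
    from pydub import AudioSegment  # deferred import: heavy dependency
    audio = AudioSegment.from_file(io.BytesIO(webm_bytes), format="webm")
    # Mono, 16 kHz is a common input format for audio classifiers.
    audio = audio.set_channels(1).set_frame_rate(16000)
    out = io.BytesIO()
    audio.export(out, format="wav")
    return out.getvalue()

def top_k(predictions, k=5):
    """Keep the k highest-confidence predictions for the frontend bars."""
    return sorted(predictions, key=lambda p: p["score"], reverse=True)[:k]
```

The handler for each POST then just chains the two: convert the chunk, run the classifier, and return `top_k(...)` as JSON.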

When the top prediction crosses a confidence threshold, the frontend asks the server for a relevant image. The server proxies a keyword search to Pixabay’s API and pipes back a photo. This gives the interface a visual element that reflects what the model thinks it’s hearing: a picture of rain, a dog, a keyboard, whatever the top label happens to be.
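The proxy step is a straightforward keyword search. Here is a hedged sketch using only the standard library; it assumes Pixabay’s documented `key`, `q`, and `image_type` query parameters and the `webformatURL` field on each hit, and the function names are placeholders of mine.

```python
import json
import urllib.parse
import urllib.request

PIXABAY_ENDPOINT = "https://pixabay.com/api/"

def build_search_url(api_key, keyword):
    """Build a Pixabay photo-search URL for the top label (e.g. 'rain')."""
    params = urllib.parse.urlencode({
        "key": api_key,
        "q": keyword,
        "image_type": "photo",
        "per_page": 3,  # Pixabay's minimum page size
    })
    return f"{PIXABAY_ENDPOINT}?{params}"

def first_image_url(api_key, keyword):
    """Query Pixabay and return the first matching photo URL, or None."""
    with urllib.request.urlopen(build_search_url(api_key, keyword)) as resp:
        hits = json.load(resp).get("hits", [])
    return hits[0]["webformatURL"] if hits else None
```

Keeping the API key on the server (rather than in the p5 sketch) is the main reason to proxy the search instead of calling Pixabay from the browser.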

The p5 canvas handles all the rendering: a status indicator, the prediction bars, the background image with a fade-in effect, and a footer showing the chunk interval. There’s no DOM manipulation beyond the canvas itself. Clicking anywhere toggles recording on and off.
