Live Camera Efficient YoloV5 Object Recognition (ML, Computer Vision, FFMPEG, Python)

I have a special use case: my security cameras have terrible “movement” detection, so we’d get a lot of useless recordings whenever the bushes shook, while things like the mailman wouldn’t register at all.

I knew I wanted to implement object recognition, and YoloV5 seemed to fit the problem. I’d encourage everyone to take literally 30 seconds to look at the YoloV5 link: analyzing an image takes about four lines of code. The barrier to entry is that low.

I’m also not a Python developer, so I’m sorry about the shit code.

The Idea

I don’t need per-frame analysis, and from my experience with FFMPEG I know it can output frame grabs. Let’s analyze for cars and humans twice a second (every 0.5 seconds). So we just need to stream images, without persisting them to disk (that wouldn’t be efficient), and pipe them into YoloV5.

Streaming Images via FFMPEG

import subprocess
from subprocess import Popen

# Pipes PNG bytes to STDOUT; replace "url" with the camera's RTSP URL
ffmpeg_command = ["ffmpeg", "-rtsp_transport", "tcp", "-i", "url",
                  "-f", "image2pipe", "-c:v", "png", "-vf", "fps=2", "pipe:1"]

pipe = Popen(ffmpeg_command,
             stdout=subprocess.PIPE,
             stderr=subprocess.PIPE,
             bufsize=10**8)

The main thing here is that we’re taking an RTSP input (via TCP) and outputting PNG bytes to STDOUT at 2 fps.

Then we need to read the buffer and look for the decimal byte values 137 80 78 71 13 10 26 10 that mark the start of a PNG, accumulating bytes until the next occurrence of that magic number signals the start of the next image.
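A quick sanity check of that signature (pure Python, no camera needed): the eight decimal values are exactly the bytes b"\x89PNG\r\n\x1a\n", and bytes.find can locate them in a buffer.

```python
# The PNG file signature: eight bytes that open every valid PNG
PNG_SIGNATURE = bytes([137, 80, 78, 71, 13, 10, 26, 10])

# Same bytes, written as a literal
assert PNG_SIGNATURE == b"\x89PNG\r\n\x1a\n"

# bytes.find returns the index of the signature in a buffer (-1 if absent)
buffer = b"junk-before" + PNG_SIGNATURE + b"...image data..."
print(buffer.find(PNG_SIGNATURE))  # 11
```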

# Eight-byte signature that starts every PNG
PNG_SIGNATURE = bytes([137, 80, 78, 71, 13, 10, 26, 10])

started = False
currentPngBytes = b""

while pipe.poll() is None:
    chunk = pipe.stdout.read(10**4)
    # Note: this assumes the signature never straddles a chunk boundary
    markerIndex = chunk.find(PNG_SIGNATURE)

    if markerIndex == -1:
        # No new image starts in this chunk; keep accumulating
        currentPngBytes += chunk
        continue

    if started:
        # The bytes before the marker complete the previous image
        analyzePng(currentPngBytes + chunk[:markerIndex])
    started = True
    currentPngBytes = chunk[markerIndex:]

print("Killing Pipeline")
pipe.kill()

Analyzing the raw image bytes

I load the bytes into a PIL image, crop out the part I don’t want to analyze, set the classifications I care about, then analyze.

To get the classifications of the default COCO dataset, use this gist (but subtract 1 from each id, since YoloV5’s class ids are 0-indexed): https://gist.github.com/AruniRC/7b3dadd004da04c80198557db5da4bda
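For reference, the four 0-indexed ids used in the code below correspond to these standard COCO class names:

```python
# 0-indexed COCO class ids and the objects they detect
cocoClasses = {0: "person", 2: "car", 3: "motorcycle", 7: "truck"}
print(sorted(cocoClasses))  # [0, 2, 3, 7]
```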

import io
import torch
from PIL import Image

# Load the pretrained YoloV5 model once at startup
# (yolov5s is the small variant; it downloads on first run)
model = torch.hub.load("ultralytics/yolov5", "yolov5s")

def analyzePng(pngBytes):
    img = Image.open(io.BytesIO(pngBytes))
    # Crop off the top 50 pixels because I don't want to analyze that section
    croppedImg = img.crop((0, 50, img.width, img.height))
    # COCO classes I care about: person, car, motorcycle, truck
    model.classes = [0, 2, 3, 7]
    results = model(croppedImg)
    results.print()

Summary

In the end, I was extremely surprised by how easy it was to do object detection with YoloV5. It really lowers the barrier to entry. I’m still working on result manipulation, but this will hook into the security code very easily.
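For the result-manipulation step, YoloV5 results can be converted to plain dicts with results.pandas().xyxy[0].to_dict("records"), where each record carries xmin, ymin, xmax, ymax, confidence, class, and name fields. A hypothetical shouldRecord helper (the name, class set, and threshold are my own choices, not from the original script) might look like:

```python
# Hypothetical helper: decide whether a batch of detections should trigger
# a recording. Each detection is a dict shaped like the records returned by
# results.pandas().xyxy[0].to_dict("records").
def shouldRecord(detections, minConfidence=0.5):
    interesting = {"person", "car", "motorcycle", "truck"}
    return any(d["name"] in interesting and d["confidence"] >= minConfidence
               for d in detections)

detections = [
    {"name": "person", "confidence": 0.87},
    {"name": "potted plant", "confidence": 0.92},
]
print(shouldRecord(detections))  # True: a person was seen with high confidence
```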