In this blog post we explain how you can build your own face detection application without much machine learning knowledge. Why? At codecentric everyone has one day per week for professional development and training. Among other things we use this time to get in touch with new technologies and build cool stuff. This time we decided to have a closer look at the Coral USB accelerator. You can see the outcome in the following video.
The application detects faces based on a pre-trained neural network and overlays them with face filters. In order to keep the face filters assigned to individual faces, even if multiple people appear in the video, it tracks the detected faces over time. In this blog post we explain how it works and how you can build your own face detection application with low cost consumer hardware and without much machine learning knowledge.
Our hardware setup
We used the following hardware components:
The Coral USB accelerator connected to a Raspberry Pi 4 is the heart of the setup. The accelerator contains an Edge TPU (Tensor Processing Unit) coprocessor which is optimized for matrix operations. It currently only supports pre-compiled TensorFlow Lite models and can perform 4 trillion operations per second, which allows high inference speed for image classification and object detection using neural networks. When we ran our experiments on the CPU of the Raspberry Pi 4 without the Coral USB accelerator, the application could process between 0.5 and 1.5 frames per second. Using the accelerator, we achieved between 10 and 25 frames per second, depending on how many image manipulation features we added and which image resolution we used.
The USB accelerator is connected to the Raspberry Pi 4 via the USB 3.0 Type C interface. While the accelerator also supports USB 2.0, it is recommended to use USB 3.0 to ensure sufficient data transfer rates. The Pi 4 is the first Pi which has USB 3.0 on board. You can also use older Raspberry Pi versions, but expect USB 2.0 to be a bottleneck which will substantially lower the achievable framerate. For comparison, have a look at the framerate from the experiment we did two years ago with a Pi 3 and the Movidius stick, which was connected via USB 2.0.
We used a Logitech C920 HD Pro webcam for the setup but as we mentioned earlier, many webcams should work and should lead to similar results. After connecting the devices as shown below we can start to install the needed drivers and libraries.
Setup and installation
First you should install a clean Raspbian distribution on your Raspberry Pi using the NOOBS installer. A detailed guide can be found here. With your Raspberry Pi up and running you can install git and clone the repository we prepared to help you get started fast. Installing all needed dependencies for the USB accelerator is quite some work. To save you the time and effort we wrote a script which automates the installation. Simply run install.sh, which is located in the root folder of the repository, and it will install all dependencies. The installation will take a while. After the installation, unplug and replug the Coral USB Accelerator once. Now you should be able to run the face replace demo with python3.7 -m face_replace.
First we have to initialize the detection engine with the pre-trained model contained in the repository.
Now we can read the video stream from the webcam. For this we use the image-utils library. To allow the camera sensor to warm up, we wait 1 second before we start processing the stream.
The next step is to read the individual frames from the video stream and preprocess them. Since the video is recorded mirrored, we flip each frame horizontally using the computer vision library OpenCV. This code, together with the rest of the frame processing, runs inside a loop which continues until the user stops the application.
As the first step of the image preprocessing, we resize the frame. The frame we captured from the video is a numpy ndarray, and the color model of the pixels stored in this array is BGR. Since the face detection engine needs an RGB image as input, we convert the colors from BGR to RGB and create an image out of the array using the imaging library Pillow (PIL).
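The mirroring and the color conversion can be tried out without a camera. The following is a minimal sketch that uses plain numpy slicing on a synthetic frame. In the demo these steps would typically be done with OpenCV (`cv2.flip(frame, 1)` and `cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)`) followed by `PIL.Image.fromarray`. The function name `preprocess` is hypothetical and resizing is left out.

```python
import numpy as np


def preprocess(frame_bgr: np.ndarray) -> np.ndarray:
    """Mirror a BGR frame horizontally and return it as an RGB array."""
    # Reverse the column axis: same effect as cv2.flip(frame, 1).
    mirrored = frame_bgr[:, ::-1, :]
    # Reverse the channel axis: BGR -> RGB.
    rgb = mirrored[:, :, ::-1]
    # Slicing only creates views, so copy to get an independent array.
    return rgb.copy()


# Synthetic 1x2 frame in BGR order: left pixel pure blue, right pixel pure red.
frame = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)
rgb = preprocess(frame)
```

After the call, the red pixel sits on the left and is stored as (255, 0, 0), which is what an RGB consumer such as Pillow expects.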
Using the preprocessed image and the previously initialized model we can now run the face detection. To do so, we call the method detect_with_image on the model. This runs the inference, which means it produces the predicted faces. The method takes multiple inputs: the image, a threshold which defines the minimum confidence for the detected faces, and the top_k parameter, which defines the maximum number of faces the model should detect.
The detected faces are returned as a list of DetectionCandidates, where each entry provides the bounding box of the detected face.
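The detection call and the extraction of the bounding boxes could look roughly like the sketch below. It assumes the engine is an `edgetpu.detection.engine.DetectionEngine` and that each candidate's `bounding_box` holds two corner points. The helper name `detect_face_boxes`, the default values, and the `FakeEngine` stand-in (used only so the sketch runs without the accelerator) are hypothetical.

```python
from types import SimpleNamespace
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def detect_face_boxes(engine, image, threshold: float = 0.5,
                      top_k: int = 10) -> List[Box]:
    """Run inference and return one pixel bounding box per detected face."""
    candidates = engine.detect_with_image(
        image,
        threshold=threshold,   # minimum confidence for a detection
        top_k=top_k,           # maximum number of faces to return
        relative_coord=False,  # ask for pixel instead of relative coordinates
    )
    boxes = []
    for candidate in candidates:
        # bounding_box is assumed to be [[x1, y1], [x2, y2]].
        (x1, y1), (x2, y2) = candidate.bounding_box
        boxes.append((int(x1), int(y1), int(x2), int(y2)))
    return boxes


class FakeEngine:
    """Stand-in for the real engine; returns two fixed candidates."""

    def detect_with_image(self, image, threshold, top_k, relative_coord):
        found = [
            SimpleNamespace(score=0.9, bounding_box=[[10.2, 20.7], [50.0, 80.9]]),
            SimpleNamespace(score=0.3, bounding_box=[[0.0, 0.0], [5.0, 5.0]]),
        ]
        kept = [c for c in found if c.score >= threshold]
        return kept[:top_k]


boxes = detect_face_boxes(FakeEngine(), image=None, threshold=0.5, top_k=10)
```

With a threshold of 0.5 only the first candidate survives. Lowering the threshold admits weaker detections, and `top_k` then caps how many of them are returned.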
In the next step we iterate over the detected faces, extract each bounding box and use it to overlay the face with a face filter. How we determine the face filter is explained in the next section. Note that we apply the face filter to the resized frame (ndarray) instead of the Pillow image.
We achieve this by extracting the coordinates from the bounding box and resizing the face filter to the size of the bounding box.
Afterwards, we can overwrite the original face with the resized face filter by rewriting all pixels inside the bounding box. Since our face filters are PNG images and we want to keep their transparent regions, we have to take the alpha value into account: each pixel is the face filter weighted by its alpha value plus the original image weighted by the inverted alpha value.
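The alpha handling can be sketched with numpy as follows. This is an illustration under assumptions, not the code from the repository: the function name `overlay_filter` is hypothetical, the filter is assumed to be already resized to the bounding box, to have four channels with alpha last, and to use the same color channel order as the frame, and the box is assumed to lie fully inside the frame.

```python
import numpy as np


def overlay_filter(frame: np.ndarray, face_filter: np.ndarray, box) -> None:
    """Blend a 4-channel filter into `frame` in place, inside `box`."""
    x1, y1, x2, y2 = box
    # Alpha in [0, 1], kept 3-D so it broadcasts over the color channels.
    alpha = face_filter[:, :, 3:4].astype(np.float32) / 255.0
    colors = face_filter[:, :, :3].astype(np.float32)
    region = frame[y1:y2, x1:x2].astype(np.float32)
    # Filter weighted by alpha plus original weighted by the inverted alpha.
    blended = alpha * colors + (1.0 - alpha) * region
    frame[y1:y2, x1:x2] = np.rint(blended).astype(np.uint8)


# 4x4 dark frame and a 2x2 filter that is opaque only in its top-left pixel.
frame = np.full((4, 4, 3), 10, dtype=np.uint8)
face_filter = np.zeros((2, 2, 4), dtype=np.uint8)
face_filter[:, :, :3] = 200
face_filter[0, 0, 3] = 255
overlay_filter(frame, face_filter, (1, 1, 3, 3))
```

Only the pixel under the opaque part of the filter changes. Fully transparent filter pixels leave the original frame untouched.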
The last step is to display the manipulated frame. We can easily do this by calling the function imshow and passing the window name and the frame as arguments.
Until now, we didn’t explain how we keep track of faces and cover a face with the same face filter over time. Let’s say you chose a filter randomly. Since the application has no notion of time yet, a different face filter would be chosen for the same person in every frame, which would result in very chaotic filter flickering. Instead, our goal should be to track faces. That’s why we implemented a simple tracking algorithm which assigns a specific face filter to each face even when the position of the face changes from frame to frame.
Caching – keeping faces in memory
If you have already worked with face detection, the first idea which might come to your mind is to apply feature recognition to each detected face and compare the features from frame to frame, which would allow you to track the face. However, this approach would be computationally very expensive and also complex to implement.
Instead, we came up with the idea to store the bounding boxes of the detected faces, together with the related face filter, inside a cache. Using this cache, we calculate the nearest bounding box from frame to frame in order to rediscover the face related to the bounding box. This approach is much easier to implement and requires significantly fewer calculations. However, it has an unwanted side effect. When person B walks in front of person A, they might be able to “steal” the face filter from person A, since person A’s face won’t be visible while covered by person B and the closest face to A’s filter will then be the face of person B. We did not mind this side effect for our experiments ;-).
Each entry of the cache contains the bounding box of the detected face, a face filter randomly chosen from the available face filter collection and the age of the entry.
For each detected face we update the cache with its bounding box and get the face filter for the face.
If the cache is empty or the distance from the new bounding box to the nearest cached bounding box exceeds a defined threshold, we add a new entry to the cache and return a new randomly chosen face filter.
Otherwise, we update the cache entry with the nearest bounding box. This means we overwrite the bounding box of the cached entry with the new one, reduce the age of the entry and return the previously applied face filter.
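The update logic described above can be sketched in a few lines. This is a simplified, dependency-free illustration: it finds the nearest entry with a linear scan rather than a search tree, and the names `CacheEntry`, `update_cache`, `max_distance` and the filter file names are hypothetical. Reducing the age by one step, with zero as the lower bound, is also an assumption.

```python
import math
import random
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)


@dataclass
class CacheEntry:
    box: Box
    face_filter: str
    age: int = 0


def box_distance(a: Box, b: Box) -> float:
    """Euclidean distance between two boxes treated as 4-D points."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def update_cache(cache: List[CacheEntry], box: Box, filters: List[str],
                 max_distance: float) -> str:
    """Return the filter for `box`, reusing the nearest entry if close enough."""
    nearest = min(cache, key=lambda e: box_distance(e.box, box), default=None)
    if nearest is None or box_distance(nearest.box, box) > max_distance:
        # Unknown face: remember it with a randomly chosen filter.
        entry = CacheEntry(box=box, face_filter=random.choice(filters))
        cache.append(entry)
        return entry.face_filter
    # Known face: follow its new position and make the entry younger.
    nearest.box = box
    nearest.age = max(0, nearest.age - 1)
    return nearest.face_filter


cache: List[CacheEntry] = []
filters = ["glasses.png", "hat.png", "mask.png"]
first = update_cache(cache, (100, 100, 200, 200), filters, max_distance=50)
moved = update_cache(cache, (105, 102, 205, 202), filters, max_distance=50)
other = update_cache(cache, (400, 100, 500, 200), filters, max_distance=50)
```

The slightly shifted box reuses the first entry and therefore keeps its filter, while the distant box creates a second entry.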
To find the nearest bounding box we perform a nearest neighbor lookup using the k dimensional search tree provided by the scipy library.
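A nearest neighbor lookup with scipy could look like the following sketch. It assumes the cached boxes are stored as rows of four coordinates; the function name `nearest_box` and the sample values are hypothetical.

```python
import math

import numpy as np
from scipy.spatial import cKDTree


def nearest_box(cached_boxes: np.ndarray, box) -> tuple:
    """Return (index, distance) of the cached box closest to `box`."""
    tree = cKDTree(cached_boxes)        # build the k-d tree over all boxes
    distance, index = tree.query(box)   # k=1 nearest neighbor by default
    return int(index), float(distance)


cached_boxes = np.array([
    [100, 100, 200, 200],
    [400, 120, 480, 210],
])
index, distance = nearest_box(cached_boxes, [110, 105, 210, 205])
```

Here the query box is closest to the first cached box. The tree has to be rebuilt whenever the cached boxes change, which is cheap for the handful of faces that usually appear in a frame.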
Our caching approach allows us to track faces, but while the application is running the cache will grow and use up an increasing amount of memory. Furthermore, it will contain bounding boxes of faces which have left the captured area of the camera. For these reasons we decided to invalidate the cache every 10 iterations of the video loop.
The invalidate method first increases the age of each cache entry and then drops all entries whose age is greater than or equal to the maximum age.
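The invalidation step can be sketched as below. The entries are plain dictionaries here to keep the example short, and the value of `MAX_AGE` as well as the file names are assumptions. Only the interval of 10 iterations is taken from the description above.

```python
from typing import Dict, List


def invalidate(cache: List[Dict], max_age: int) -> List[Dict]:
    """Age every entry by one step and drop those that reached max_age."""
    for entry in cache:
        entry["age"] += 1
    return [entry for entry in cache if entry["age"] < max_age]


INVALIDATE_EVERY = 10  # frames between two invalidation runs
MAX_AGE = 3            # assumed value

cache = [{"filter": "hat.png", "age": 0}, {"filter": "mask.png", "age": 2}]
for frame_index in range(1, 21):
    # Face detection and cache updates would happen here for every frame.
    if frame_index % INVALIDATE_EVERY == 0:
        cache = invalidate(cache, MAX_AGE)
```

In this run the older entry reaches the maximum age at the first invalidation and is dropped, while the younger one survives both runs. In the real loop, entries of faces that are still visible are made younger again by the cache update, so only faces that have disappeared age out.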
In this blog post we showed how you can build your own AI based face detection application using low cost consumer hardware and little machine learning knowledge.
Now that you know how it works, you could try to build your own applications. For example, have a look at the pong game.
We hope that we inspired you to start your own experiments. Share your ideas or results in the comments below and let us know which experiments we should conduct next!
We thank our colleague Marcel Mikl for his support during the implementation of the demo.