How to count people from CCTV cameras

In this tutorial we will see how to detect and count people in CCTV footage as accurately as possible. I will explain the main problems you are likely to run into in a project like this and which strategy works best.

Count people from a CCTV camera in a bakery store

We will work out the best way to count people from a CCTV camera in a bakery store. Before going on with this tutorial, I recommend that you read How Artificial Intelligence counts people and vehicles from CCTV cameras to get a general idea of how object counting works.

We will go through several methods and look for the one that best fits our problem.

Why you should not identify full-length people (with pre-trained models)

As a first try, we will use the most immediate and intuitive approach for this type of project. We have to count people from a CCTV camera in a bakery store, so we first detect the people and then apply a tracking algorithm.

For this step, I used a version of YOLO with a pre-trained model and restricted the detections to the person category only. This is the result:

As you can see, the detection result is quite good because people are identified, but the problems begin as soon as we apply tracking.
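The person-only filtering used in this first attempt can be sketched as follows. This is only an illustration, not the code of the original project: the `Detection` structure, the `keep_people` helper and the 0.5 threshold are my own assumptions, and the detections are made up. The one fact it relies on is that "person" is class index 0 in the 80-class COCO label list that pre-trained YOLO models normally use.

```python
from typing import List, NamedTuple, Tuple

PERSON_CLASS_ID = 0  # index of "person" in the 80-class COCO label list


class Detection(NamedTuple):
    class_id: int
    confidence: float
    box: Tuple[int, int, int, int]  # (x, y, width, height) in pixels


def keep_people(detections: List[Detection], min_confidence: float = 0.5) -> List[Detection]:
    """Return only the confident 'person' detections (hypothetical helper)."""
    return [
        d for d in detections
        if d.class_id == PERSON_CLASS_ID and d.confidence >= min_confidence
    ]


# Made-up detections standing in for the detector output of one frame.
frame_detections = [
    Detection(0, 0.91, (120, 80, 60, 160)),   # person, confident: kept
    Detection(0, 0.32, (300, 90, 55, 150)),   # person, below threshold: dropped
    Detection(56, 0.88, (400, 200, 80, 90)),  # another class: dropped
]
people = keep_people(frame_detections)
print(len(people))
```

A low threshold keeps more people but also more false positives, and both kinds of error end up in the tracking stage.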

There are two types of problems:

  1. Detection of people is decent, but if a detection is inaccurate or the algorithm loses a person for a few seconds, the tracking becomes unreliable.
  2. Tracking doesn’t work well when there are occlusions or when too many people overlap.
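To make the first problem concrete, here is a minimal sketch of a nearest-centroid tracker. It is not the tracker used in this project (a real one is more robust); the class name, the `max_distance` and `max_missed` parameters and the greedy matching are all assumptions chosen to keep the example short. It shows how a person who goes undetected for a few frames comes back with a new ID.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


class CentroidTracker:
    """Greedy nearest-centroid tracker (illustrative only)."""

    def __init__(self, max_distance: float = 50.0, max_missed: int = 5):
        self.max_distance = max_distance  # farthest jump still treated as the same person
        self.max_missed = max_missed      # frames a track may go unseen before it is dropped
        self.next_id = 0
        self.tracks: Dict[int, Point] = {}
        self.missed: Dict[int, int] = {}

    def update(self, centroids: List[Point]) -> Dict[int, Point]:
        unmatched = list(centroids)
        for track_id, last_pos in list(self.tracks.items()):
            # Closest detection that has not been assigned to a track yet.
            best = min(unmatched, key=lambda c: math.dist(c, last_pos), default=None)
            if best is not None and math.dist(best, last_pos) <= self.max_distance:
                self.tracks[track_id] = best
                self.missed[track_id] = 0
                unmatched.remove(best)
            else:
                self.missed[track_id] += 1
                if self.missed[track_id] > self.max_missed:
                    # Track lost: if the person reappears they get a new ID.
                    del self.tracks[track_id]
                    del self.missed[track_id]
        for c in unmatched:  # every leftover detection starts a new track
            self.tracks[self.next_id] = c
            self.missed[self.next_id] = 0
            self.next_id += 1
        return dict(self.tracks)


tracker = CentroidTracker(max_distance=50, max_missed=1)
tracker.update([(100, 100)])         # person appears and gets ID 0
tracker.update([])                   # detector misses them once: track kept
tracker.update([])                   # missed twice: track dropped
ids = tracker.update([(105, 100)])   # same person, detected again
print(ids)                           # {1: (105, 100)}
```

The same person is now ID 1, so any count based on IDs would include them twice. Occlusions cause the same effect, because an overlapped person simply stops being detected for a while.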

2nd Method: Identify people's heads

With the first method we tried to detect people with a pre-trained model, but the result was not good. This time we will try a different approach: detecting the head instead of the whole body. The reasoning is simple: with this type of CCTV camera, and because of where it is mounted, the head is easier to detect than the whole body. Heads also overlap less than full bodies, so there will be fewer occlusion problems.

How do we proceed? In this case we need a custom model. If you want to know more about how to train your own model, you can refer to this lesson: Train YOLO to detect a custom object (online with free GPU).

As you know, to train a model you first need a dataset of images, which are then labeled and passed to the training algorithm. Where do we find these images? There are several options, so let’s see which is the best:

  1. Google Open Images Dataset [bad result]
  2. Images from video footage [good result]

1 Google Open Images Dataset – [bad result]

We can start with the simplest option: search for “Human head” in the Google Open Images Dataset, download all the matching images, and you are ready for training.

Google Open Images Dataset

After 50 hours of training, the graph below on the left shows that the accuracy is quite good. Applied to our project, however (photo below on the right), the result is bad because the model recognizes only one person.

2 Images from video footage – [good result]

To get a good result when counting people from a CCTV camera we have to try another method, because generic images did not work. If we compare the images downloaded from the Google Open Images Dataset with the heads that appear in the CCTV camera images, we can see how different they are.

We repeat the training process, but first we need to extract thousands of images from the video footage. The process used in this step is similar to the one in this project: Agriculture plant analysis with the drone and Artificial Intelligence. You should get something like this:

count people from CCTV video footage
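Extracting images from the footage can be sketched with OpenCV as below. This is a minimal example under my own assumptions, not the script used for the project: the function names, the one-frame-every-N-seconds sampling, the 25 fps fallback and the file names are all hypothetical. Sampling every second or two avoids filling the dataset with near-identical consecutive frames.

```python
from pathlib import Path


def sampling_step(fps: float, every_seconds: float) -> int:
    """Number of frames between two saved images (at least 1)."""
    return max(1, round(fps * every_seconds))


def extract_frames(video_path: str, out_dir: str, every_seconds: float = 1.0) -> int:
    """Save one frame every `every_seconds` as a JPEG; return how many were written."""
    import cv2  # needs opencv-python; imported here so sampling_step works without it

    cap = cv2.VideoCapture(video_path)
    if not cap.isOpened():
        raise IOError(f"Cannot open video: {video_path}")
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # assumed fallback when the file reports 0 fps
    step = sampling_step(fps, every_seconds)
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    saved = 0
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of the video (or a read error)
            break
        if index % step == 0:
            cv2.imwrite(str(Path(out_dir) / f"frame_{index:06d}.jpg"), frame)
            saved += 1
        index += 1
    cap.release()
    return saved


# Example call (file and folder names are placeholders):
# extract_frames("bakery_cctv.mp4", "dataset/images", every_seconds=2.0)
```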

In this case too, after annotating the images and training the model, we look at the graph and the result. At first glance, it already looks excellent.
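For the annotation step, each labeled head has to be written in the format the training tool expects. The sketch below assumes the darknet-style text format commonly used for YOLO training (one line per object: class index, then box center and size, all normalized to the image dimensions). The helper name and the example numbers are mine, and your annotation tool may already export this format for you.

```python
from typing import Tuple


def to_yolo_label(class_id: int, box: Tuple[int, int, int, int],
                  image_width: int, image_height: int) -> str:
    """Convert a pixel box (x_min, y_min, x_max, y_max) into one YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2 / image_width
    y_center = (y_min + y_max) / 2 / image_height
    width = (x_max - x_min) / image_width
    height = (y_max - y_min) / image_height
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# One head (class 0 in a single-class dataset) in a 1000x500 image.
line = to_yolo_label(0, (100, 50, 200, 150), image_width=1000, image_height=500)
print(line)  # 0 0.150000 0.200000 0.100000 0.200000
```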

Count people from a CCTV camera with a Region of Interest (ROI)

Once we have achieved the best possible result for detecting and tracking people, we have to count them. This is possible by defining a Region of Interest (ROI). I used the same technique in the Speed detection from CCTV with OpenCV and Deep Learning project as well.

When a person enters the ROI, the count is updated. It does not matter if the person keeps moving inside the ROI, because the count is tied to the ID of each individual. For this procedure, I recommend defining an area of interest as small as possible to increase accuracy.

Count people from a cctv camera final result
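The ROI counting logic described above can be sketched like this. It is a simplified illustration rather than the project's code: the `RoiCounter` class, the polygon coordinates and the tracked positions are made up, and the point-in-polygon check is a plain ray-casting test written out so the example needs no extra library. The tracker is assumed to provide one ID and one center point per person in every frame.

```python
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Ray-casting test: True if `point` lies inside `polygon`."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


class RoiCounter:
    """Counts each tracked ID once, the first time it is seen inside the ROI."""

    def __init__(self, roi: List[Point]):
        self.roi = roi
        self.counted_ids: Set[int] = set()

    def update(self, tracked: Dict[int, Point]) -> int:
        for track_id, center in tracked.items():
            if point_in_polygon(center, self.roi):
                self.counted_ids.add(track_id)  # a set ignores repeated entries
        return len(self.counted_ids)


roi = [(100, 100), (200, 100), (200, 200), (100, 200)]  # hypothetical ROI
counter = RoiCounter(roi)
counter.update({0: (50, 150)})                   # ID 0 is still outside
counter.update({0: (150, 150)})                  # ID 0 enters: count is 1
counter.update({0: (160, 150), 1: (150, 120)})   # ID 0 moves inside, ID 1 enters
total = counter.update({0: (50, 150)})           # ID 0 leaves: count unchanged
print(total)  # 2
```

Because the count is a set of IDs, a person moving around inside the ROI is never counted twice. The weak point is the ID itself: if the tracker loses a person and assigns a new ID inside the ROI, that person is counted again, which is one reason to keep the ROI small.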

Accuracy when counting people from CCTV

This is already an excellent result, but there is certainly still a lot of room for improvement. For example, if a person moves to an area not covered by the camera, the tracker may lose them.

We have achieved an excellent result for this project, but keep in mind that every scenario needs its own analysis of the cameras and their position, and possibly additional tools such as depth cameras or Deep Learning models created specifically for that type of scenario.

