Multi-camera real-time object detection with WebRTC and YOLO


Ever seen the Jason Bourne trilogy?

In the Jason Bourne trilogy, cameras play a crucial role in the hunt for the titular character, a former CIA assassin suffering from dissociative amnesia who is on the run and trying to uncover the truth about his past.

Scene from the Jason Bourne movie trilogy

In "The Bourne Identity", cameras are used by the CIA to track Bourne's movements and try to locate him.

In "The Bourne Supremacy", the use of cameras intensifies as Bourne is framed for a crime he didn't commit and the CIA is determined to bring him in.

In "The Bourne Ultimatum", cameras continue to be a key tool in the pursuit of Bourne, as the agency uses them to monitor his every move and try to anticipate his next steps.

Despite the extensive use of cameras and other surveillance technology, Bourne is able to stay one step ahead of his pursuers and evade capture thanks to his intelligence, resourcefulness, and combat skills.

Nevertheless, it is possible to build a similar system by combining cameras with computer vision and machine learning techniques for object detection. These techniques allow for the automatic identification and tracking of objects in real-time video streams, enabling a wide range of applications such as security and surveillance.

Using cameras for object detection also comes with challenges: data privacy and security concerns, as well as technical limitations such as low image quality, poor lighting conditions, and limited compute power.

Nevertheless, you will learn to build a surveillance system that uses WebRTC for low-latency streaming and YOLO for object detection.

What is WebRTC?

WebRTC (Web Real-Time Communication) is a collection of specifications that enables a web application running on any device to exchange real-time video, voice, or generic data with a remote party in a peer-to-peer (P2P) fashion.

WebRTC is maintained in the IETF (Internet Engineering Task Force) by the RTCWEB (Real-Time Communication in WEB-browsers) working group.

At a high level, WebRTC consists of the following three components:

  • Signaling: Middleman between peers
  • Connecting: How peer-to-peer happens
  • Secure Communication: Sending and receiving video, and audio

Signaling: Middleman between peers

In order to communicate with other peers, you first need to share who you are and what kind of data you provide or accept. For that you are going to use signaling.

Signaling is the process of coordinating communication. You can think of it as a middleman between peers. Signaling uses the Session Description Protocol (SDP), as defined in RFC 8866, as its message format.

SDP is a plain-text protocol that describes a session as a group of key-value pairs, one field per line. Below is an example from RFC 8866:

o=jdoe 3724394400 3724394405 IN IP4
s=Call to John Smith
i=SDP Offer #1
e=Jane Doe <>
p=+1 617 555-6011
c=IN IP4
t=0 0
m=audio 49170 RTP/AVP 0
m=audio 49180 RTP/AVP 0
m=video 51372 RTP/AVP 99
c=IN IP6 2001:db8::2
a=rtpmap:99 h263-1998/90000

Beware, not all key values are used by WebRTC. Only keys defined in JavaScript Session Establishment Protocol (JSEP) (RFC 8829) are.
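To make the "one field per line" structure concrete, here is a minimal sketch that splits a session description into key-value pairs. The function name parse_sdp and the shortened sample session are hypothetical and only for illustration; in practice you would rely on the parser that ships with your WebRTC library.

```python
def parse_sdp(sdp_text):
    """Split an SDP description into (key, value) pairs, one per line."""
    fields = []
    for line in sdp_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Every SDP line has the form <single-letter key>=<value>
        key, separator, value = line.partition("=")
        if separator != "=" or len(key) != 1:
            raise ValueError(f"not an SDP line: {line!r}")
        fields.append((key, value))
    return fields


# Hypothetical, shortened session description used only for illustration
SAMPLE_SDP = """v=0
s=Example session
t=0 0
m=audio 49170 RTP/AVP 0
m=video 51372 RTP/AVP 99
a=rtpmap:99 h263-1998/90000
"""

fields = parse_sdp(SAMPLE_SDP)
# Collect the media descriptions ("m=" lines) announced by the session
media_lines = [value for key, value in fields if key == "m"]
print(media_lines)
```

Keeping the pairs in a list rather than a dictionary is deliberate: keys such as m= and a= can repeat, and their order matters.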

While the protocol is simple to read and understand, the meaning of the values can make your head spin. I suggest you check out Anatomy of a WebRTC SDP for an interactive explanation of what these values mean.

WebRTC does not handle signaling for you, nor does SDP tell you what kind of transport you must use. It is up to you to use an HTTP REST API, RPC, a WebSocket (WS) server, or even email to send the necessary information before the peer connection is initiated. In most public WebRTC examples, a human acts as the transport layer: you copy and paste SDPs by hand to initiate the peer-to-peer connection.
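Since the transport is up to you, a signaling message can be any format both peers agree on. The sketch below wraps a session description in JSON with type and sdp fields, which mirrors the shape browsers use for session descriptions. The helper names encode_description and decode_description are hypothetical.

```python
import json

VALID_TYPES = {"offer", "answer"}


def encode_description(kind, sdp):
    """Serialize a session description so it can travel over any transport."""
    if kind not in VALID_TYPES:
        raise ValueError(f"unsupported description type: {kind}")
    return json.dumps({"type": kind, "sdp": sdp})


def decode_description(message):
    """Parse a signaling message back into a (type, sdp) tuple."""
    payload = json.loads(message)
    if payload.get("type") not in VALID_TYPES or "sdp" not in payload:
        raise ValueError("malformed signaling message")
    return payload["type"], payload["sdp"]


# The same string could be sent over HTTP, a WebSocket, or pasted by hand
wire = encode_description("offer", "v=0\r\ns=-\r\nt=0 0\r\n")
kind, sdp = decode_description(wire)
```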

NB! You are RESPONSIBLE for making signaling secure. If you do not secure signaling, an attacker can hijack it and gain control over sessions, connections, and content. To secure signaling, use HTTPS and WSS (WebSocket over TLS) so that messages cannot be intercepted unencrypted.

Connecting: How peer-to-peer happens

Once the session description has been communicated, WebRTC establishes peer-to-peer (P2P) connections in order to share media and data. However, establishing peer-to-peer connections can be quite difficult because peers are usually in different networks. On top of that, there are network constraints such as Network Address Translation (NAT), firewalls, antivirus software, corporate policies, etc. In some cases, the peers don't even speak the same network protocols (different IP versions, for example).

Network Address Translation (NAT) is a technique to delay the exhaustion of the available IPv4 address pool.

Interactive Connectivity Establishment (ICE) solves this by making use of the Session Traversal Utilities for NAT (STUN) protocol and its extension, Traversal Using Relays around NAT (TURN).

You can read about ICE in the rather lengthy RFC 8445.

In short, ICE finds the best way to communicate between computers (agents). Using the STUN or TURN protocols, ICE discovers the network architecture and provides transport addresses (ICE candidates) on which the ICE agent can be contacted.

Network Address Translation (NAT) is the magic that makes WebRTC connectivity possible. To make the communication happen, you establish a NAT mapping. For example, Agent 1 uses port 7000 to establish a WebRTC connection with Agent 2. This creates a binding between Agent 1's internal address and port and an external address and port on the NAT. Agent 2 can then reach Agent 1 by sending packets to that external address. Creating a NAT mapping like this is an automated version of setting up port forwarding in your router. RFC 4787 describes this behaviour in detail.
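The mapping idea can be sketched as a small lookup table. This is a toy model, not how any particular router behaves: the class name NatTable, the port range, and the addresses (a private address and one from a documentation range) are all assumptions made for illustration.

```python
class NatTable:
    """Toy model of a NAT: maps internal (ip, port) pairs to external ports."""

    def __init__(self, external_ip, first_port=50000):
        self.external_ip = external_ip
        self.next_port = first_port
        self.outbound = {}  # (internal_ip, internal_port) -> external_port
        self.inbound = {}   # external_port -> (internal_ip, internal_port)

    def send(self, internal_ip, internal_port):
        """Outbound packet: create the mapping on first use and return the
        external address that the remote peer will see."""
        key = (internal_ip, internal_port)
        if key not in self.outbound:
            self.outbound[key] = self.next_port
            self.inbound[self.next_port] = key
            self.next_port += 1
        return self.external_ip, self.outbound[key]

    def receive(self, external_port):
        """Inbound packet: forward only if a mapping exists, else drop (None)."""
        return self.inbound.get(external_port)


nat = NatTable("203.0.113.5")
# Agent 1 sends from its internal port 7000, which creates the binding
public_address = nat.send("192.168.0.10", 7000)
```

Once the binding exists, a packet arriving at the external port is forwarded to the internal agent, while packets to unmapped ports are dropped.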


STUN lets an agent (peer) discover its public IP address and port for the peer-to-peer connection. However, if network conditions are such that an agent's true address is masked, the STUN server cannot supply usable information, and media and data exchange between the agents (peers) fails. TURN, as defined in RFC 8656, is the fallback for cases where direct connectivity is not possible. TURN uses a dedicated server that acts as a proxy between agents. It does introduce some things that you must be aware of:

  • TURN is an extra dependency in your system that consumes resources
  • Latency between peers increases
  • The TURN server might receive a lot of traffic, so make sure it scales
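To illustrate how ICE prefers a direct path over a relayed one, here is a sketch of the candidate priority formula from RFC 8445, using the type preference values the RFC recommends. The function name and the default local preference (a single network interface) are assumptions; real ICE agents additionally compute priorities for candidate pairs, which this sketch leaves out.

```python
# Type preferences recommended by RFC 8445: direct (host) candidates rank
# highest, relayed (TURN) candidates lowest.
TYPE_PREFERENCE = {"host": 126, "prflx": 110, "srflx": 100, "relay": 0}


def candidate_priority(candidate_type, local_preference=65535, component_id=1):
    """priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)"""
    type_preference = TYPE_PREFERENCE[candidate_type]
    return (type_preference << 24) + (local_preference << 8) + (256 - component_id)


candidates = ["relay", "host", "srflx"]
# Highest priority first: the agent tries the most direct route before TURN
ranked = sorted(candidates, key=candidate_priority, reverse=True)
print(ranked)
```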

Now that you know how peers reach each other and how data exchange takes place, it is time to talk about the last piece that completes the picture.

Secure Communication: Sending and receiving video, and audio

Theoretically, you can send and receive an unlimited number of audio and video streams with WebRTC, meaning you could stream your desktop screen while using a headset microphone for the audio stream. The best part is that you largely do not have to think about details such as codecs or fragile network conditions such as bandwidth fluctuations and packet loss, because WebRTC handles them for you. Awesome, right?

Small word of caution: while you can send media in any codec, the client receiving it must support the used codec as well!

The media transmission is done by the RTP (Real-time Transport Protocol) and RTCP (RTP Control Protocol) protocols, both defined in RFC 3550, which obsoletes the original RFC 1889. In short, RTP carries the media streams (audio and video) while RTCP monitors transmission statistics and quality of service.
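To give a feel for what RTP puts on the wire, the sketch below unpacks the fixed 12-byte header that precedes every RTP payload. The function name parse_rtp_header and the sample values are hypothetical; header extensions and CSRC lists are ignored for brevity.

```python
import struct


def parse_rtp_header(packet):
    """Decode the fixed 12-byte RTP header into a dictionary."""
    if len(packet) < 12:
        raise ValueError("an RTP header needs at least 12 bytes")
    # Network byte order: two single bytes, a 16-bit and two 32-bit integers
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    return {
        "version": first >> 6,
        "padding": bool(first & 0x20),
        "extension": bool(first & 0x10),
        "csrc_count": first & 0x0F,
        "marker": bool(second & 0x80),
        "payload_type": second & 0x7F,
        "sequence_number": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
    }


# Hypothetical packet: version 2, dynamic payload type 96
sample = struct.pack("!BBHII", 0x80, 96, 1234, 90000, 0xDEADBEEF)
header = parse_rtp_header(sample)
```

The sequence number and timestamp are what allow the receiver to reorder packets and detect loss.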

Every WebRTC connection is authenticated and encrypted with Datagram Transport Layer Security (DTLS) and the Secure Real-time Transport Protocol (SRTP).

DTLS allows you to negotiate a session and then exchange data securely between two peers. It is a sibling of TLS, the same technology that powers HTTPS, but DTLS uses UDP instead of TCP as the transport layer. SRTP is specifically designed for exchanging media securely.

Object Detection

In computer vision (CV), object detection is the task of locating and classifying different objects within an image. It is a fundamental problem in computer vision with a wide range of applications, including robotics, surveillance, and image and video analysis.

Some examples of object detection include:

  • Identifying pedestrians in an image or video taken from a self-driving car.
  • Detecting cars, pedestrians, and other objects in an image or video taken from a surveillance camera.
  • Detecting objects in images taken by a drone for mapping or inspection purposes.
  • Identifying objects in images or videos taken by a smartphone camera for augmented reality applications.
  • Detecting objects in images or videos for image or video annotation or classification purposes.

Pattern Classification

Traditional computer vision techniques, such as detecting colors, shapes, and contours, can be effective for simple object detection tasks where the objects of interest have distinctive features that can be easily identified. These techniques are generally fast and efficient.

However, traditional techniques are limited in their ability to handle more complex object detection tasks, such as detecting objects that vary in appearance or are occluded by other objects. In these cases, machine learning-based approaches, such as deep learning, are a more appropriate fit. This leads us to YOLO.

Deep learning neural networks are able to learn complex patterns in data and can be trained to perform object detection tasks with high accuracy. These approaches are generally more flexible and can handle a wider range of object detection tasks.


YOLO (You Only Look Once) is a deep learning model developed by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi in 2015. It is a single convolutional neural network (CNN) that is trained to perform object detection by identifying objects in images and predicting the bounding boxes around them.

The key idea behind YOLO is that it only processes an image once, rather than performing object detection in multiple stages. This makes it much faster than other approaches, and allows it to be used in real-time applications.

YOLO works by dividing the input image into a grid of cells, and each cell is responsible for predicting a certain number of bounding boxes. The CNN processes the entire image and produces a set of predictions for each cell in the grid. Each prediction consists of several components: the class probabilities, the bounding box coordinates, and the confidence score.

Convolutional neural network (CNN) is a type of artificial neural network specifically designed to process data that has a grid-like topology, such as an image. CNNs are particularly effective at image recognition tasks because they are able to learn features directly from the input data, rather than requiring the features to be hand-engineered by a human.

The class probabilities indicate the likelihood that the bounding box contains an object of each class. The bounding box coordinates define the location and size of the bounding box, and the confidence score reflects the confidence of the prediction.
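As a small illustration of the grid idea, the sketch below finds the cell that contains the center of an object and is therefore responsible for predicting it. The function name is hypothetical, and the 7 x 7 grid is an assumption taken from the original YOLO paper; other versions use different grid sizes.

```python
def responsible_cell(center_x, center_y, image_width, image_height, grid_size=7):
    """Return the (row, col) of the grid cell containing the box center."""
    # Normalize the center to [0, 1), then scale to grid coordinates.
    # min() keeps a center lying exactly on the right/bottom edge in the grid.
    col = min(int(center_x / image_width * grid_size), grid_size - 1)
    row = min(int(center_y / image_height * grid_size), grid_size - 1)
    return row, col


# An object centered at (320, 100) in a 640 x 480 image
cell = responsible_cell(center_x=320, center_y=100, image_width=640, image_height=480)
print(cell)
```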

To make the predictions more accurate, later YOLO versions use anchor boxes, which are predefined bounding boxes with different aspect ratios. The anchor boxes help the model detect objects that have unusual shapes or sizes, and handle multiple objects whose centers fall in the same grid cell.

Once the predictions have been made, YOLO uses non-maximum suppression to remove overlapping bounding boxes and keep only the most confident ones. The final set of bounding boxes is then used to label the objects in the image.
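Non-maximum suppression can be sketched in a few lines of plain Python. This is a minimal greedy version for illustration, with hypothetical function names and sample boxes in (x, y, width, height) format; production code would typically use an optimized library routine instead.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2, bx2, by2 = ax1 + aw, ay1 + ah, bx1 + bw, by1 + bh
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than the threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for index in order:
        if all(iou(boxes[index], boxes[k]) <= iou_threshold for k in kept):
            kept.append(index)
    return kept


boxes = [(10, 10, 100, 100), (15, 15, 100, 100), (300, 300, 50, 50)]
scores = [0.9, 0.8, 0.7]
# The second box overlaps the first heavily and is suppressed
kept = non_max_suppression(boxes, scores)
print(kept)
```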


Now that you know about WebRTC and YOLO, let's proceed with solving the problem: find a person via object detection across n camera feeds, preferably in real time.

If n were 1 or 2, capturing the data and sending it to a central server for object detection would be a reasonable thing to do.

What if n were 1000? You would need a lot of resources to run this kind of central processing system. Even if you had access to such resources, scaling the system would become very expensive, because the higher the n, the more compute power you need.

What if you used edge computing instead?

Edge computing

Edge computing is a distributed computing paradigm in which computing and data processing tasks are performed at the edge of the network, closer to the devices generating or collecting the data. This is in contrast to traditional computing, in which tasks are performed in a centralized location, often far from the devices generating or collecting the data.

The main motivation for edge computing is to reduce the amount of data that needs to be transmitted over the network, reduce latency, and improve the performance of real-time applications that require low-latency processing of data.

This is useful for object detection, as it allows for real-time processing of data. It also combines well with WebRTC.

As the diagram below shows, a camera equipped with a processing service uses YOLO for object detection and sends the data via WebRTC to the end user. This scales well, assuming each camera is attached to its own resource-constrained node.

[Diagram: Real-time object detection solution. A central node hosts the signaling service; nodes #1, #2, and #3 each run a processing service fed by a video input.]

You can optimize the solution further, as a node doesn't have to process just one camera. You could also use a mesh network instead, where multiple camera feeds are processed in parallel by a hub.

As you can see, only the processing service needs to be built and then bundled with the desired hardware for processing the camera input.

Processing service PoC

Let's build the processing service with Python.

Install some Python dependencies.

pip3 install opencv-python aiortc numpy aiohttp

The solution code can be found at Github -

Copy & paste the videostream-cli example from aiortc.

This will serve as boilerplate for the YOLO detection code as you only need to work with a class that extends aiortc's VideoStreamTrack.

You are going to use YOLOv7, which is based on the paper "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors".

Let's proceed with writing the class.

import os

import cv2
import numpy as np
from aiortc import VideoStreamTrack

model = "yolov7-tiny_480x640.onnx"


class YOLOVideoStreamTrack(VideoStreamTrack):
    """A video track that returns the camera track with detected objects annotated."""

    def __init__(self, conf_thres=0.7, iou_thres=0.5):
        super().__init__()  # don't forget this!
        self.conf_threshold = conf_thres
        self.iou_threshold = iou_thres
        # Capture video from the default camera
        self.video = cv2.VideoCapture(0)
        # Initialize model
        self.net = cv2.dnn.readNet(model)
        # The input size is encoded in the file name as <height>x<width>
        input_shape = os.path.splitext(os.path.basename(model))[0].split('_')[-1].split('x')
        self.input_height = int(input_shape[0])
        self.input_width = int(input_shape[1])
        with open('coco.names', 'r') as f:
            self.class_names = [line.strip() for line in f]
        self.colors = np.random.default_rng(3).uniform(0, 255, size=(len(self.class_names), 3))
        # Names of the unconnected output layers, used for the forward pass
        self.output_names = self.net.getUnconnectedOutLayersNames()

The YOLOVideoStreamTrack class has a constructor which takes two optional arguments: conf_thres and iou_thres. Both of these values will be used for selecting the most appropriate bounding box from the result returned from the object detection model.

A cv2.VideoCapture object is created and initialized with a value of 0, which indicates that it should capture video from the default camera.

The YOLOVideoStreamTrack class also has instance variables for storing a deep learning model, the input shape for the model, a list of class names for the objects that the model can detect, and a list of random colors to use when visualizing the detected objects. These are all initialized in the constructor.

The output_names instance variable is set to a list of the names of the unconnected output layers of the deep learning model. This information is used when running object detection with the YOLO algorithm.
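The constructor derives the model's input size from the file name, which is easy to miss. The standalone sketch below isolates that step; the function name and the models/ directory are hypothetical, and it assumes the file name ends in _<height>x<width> as in the model used here.

```python
import os


def input_shape_from_filename(model_path):
    """Extract (height, width) from a name like 'yolov7-tiny_480x640.onnx'."""
    # Drop the directory and the extension, leaving 'yolov7-tiny_480x640'
    stem = os.path.splitext(os.path.basename(model_path))[0]
    # The last underscore-separated part holds '<height>x<width>'
    height, width = stem.split("_")[-1].split("x")
    return int(height), int(width)


shape = input_shape_from_filename("models/yolov7-tiny_480x640.onnx")
print(shape)
```

Encoding the shape in the file name is convenient for a PoC, but it fails with a ValueError for files that do not follow the naming pattern.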

Now, let's start building the object detection logic. First, you need to prepare the captured frame that is sent to the deep learning model.

def prepare_input(self, image):
    self.img_height, self.img_width = image.shape[:2]
    input_img = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    input_img = cv2.resize(input_img, (self.input_width, self.input_height))
    return input_img

The method first stores the height and width of the input image in the img_height and img_width instance variables of the class.

The input image is then converted from the BGR color space to the RGB color space using the cv2.cvtColor function. This is done because OpenCV delivers frames in BGR channel order, while most deep learning models, this one included, expect RGB input.

The input image is then resized to the dimensions specified by the input_width and input_height instance variables using the cv2.resize function. These dimensions are the expected input size for the deep learning model used for object detection.

Now, let's proceed with the detection method.

def detect(self, frame):
    input_img = self.prepare_input(frame)
    blob = cv2.dnn.blobFromImage(input_img, 1 / 255.0)
    # Perform inference on the image
    self.net.setInput(blob)
    # Runs the forward pass to get output of the output layers
    outputs = self.net.forward(self.output_names)
    boxes, scores, class_ids = self.process_output(outputs)
    return boxes, scores, class_ids

The method first calls the prepare_input method to preprocess the input image. The preprocessed image is then passed to the cv2.dnn.blobFromImage function to create a blob (a multi-dimensional array) that can be passed as input to the deep learning model.
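To show what the blob looks like, here is a plain-Python sketch of the transformation that cv2.dnn.blobFromImage performs when it is called with only a scale factor: pixel values are scaled, and the height x width x channels layout becomes batch x channels x height x width. The function name to_blob and the tiny sample image are hypothetical; OpenCV does this far more efficiently on real arrays.

```python
def to_blob(image, scale):
    """Convert an H x W x C nested list into a 1 x C x H x W nested list."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    # One H x W plane per channel, with every pixel value scaled
    planes = [
        [[image[y][x][c] * scale for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]
    return [planes]  # leading batch dimension of size 1


# Hypothetical 2 x 2 RGB image
image = [
    [[255, 0, 0], [0, 255, 0]],
    [[0, 0, 255], [255, 255, 255]],
]
blob = to_blob(image, 1 / 255.0)
# blob[0][0] is the red plane, blob[0][1] the green plane, and so on
```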

The model is then run on the input data using the net.forward method, which returns the output of the model.

The method then calls the process_output method, passing in the model outputs as an argument. This method processes the output of the model and returns three lists: boxes, scores, and class_ids. The boxes list contains the bounding boxes for the detected objects, the scores list contains the confidence scores for the detected objects, and the class_ids list contains the class IDs for the detected objects.

def process_output(self, output):
    predictions = np.squeeze(output[0])
    # Filter out object confidence scores below threshold
    obj_conf = predictions[:, 4]
    predictions = predictions[obj_conf > self.conf_threshold]
    obj_conf = obj_conf[obj_conf > self.conf_threshold]
    # Multiply class confidence with bounding box confidence
    predictions[:, 5:] *= obj_conf[:, np.newaxis]
    # Get the scores
    scores = np.max(predictions[:, 5:], axis=1)
    # Filter out the objects with a low score
    valid_scores = scores > self.conf_threshold
    predictions = predictions[valid_scores]
    scores = scores[valid_scores]
    # Get the class with the highest confidence
    class_ids = np.argmax(predictions[:, 5:], axis=1)
    # Get bounding boxes for each object
    boxes = self.extract_boxes(predictions)
    # Apply non-maxima suppression to suppress weak, overlapping bounding boxes
    indices = cv2.dnn.NMSBoxes(boxes.tolist(), scores.tolist(), self.conf_threshold, self.iou_threshold)
    if len(indices) > 0:
        indices = indices.flatten()
    return boxes[indices], scores[indices], class_ids[indices]

def rescale_boxes(self, boxes):
    input_shape = np.array([self.input_width, self.input_height, self.input_width, self.input_height])
    boxes = np.divide(boxes, input_shape, dtype=np.float32)
    boxes *= np.array([self.img_width, self.img_height, self.img_width, self.img_height])
    return boxes

def extract_boxes(self, predictions):
    # Extract boxes from predictions
    boxes = predictions[:, :4]
    # Scale boxes to original image dimensions
    boxes = self.rescale_boxes(boxes)
    # Convert boxes from center (cx, cy, w, h) to top-left (x, y, w, h) format
    boxes_ = np.copy(boxes)
    boxes_[..., 0] = boxes[..., 0] - boxes[..., 2] * 0.5
    boxes_[..., 1] = boxes[..., 1] - boxes[..., 3] * 0.5
    return boxes_

The process_output method first selects the first element of the output list and squeezes it to remove any dimensions with size 1. This results in a 2D NumPy array called predictions.

The method then filters out predictions with object confidence scores below the conf_threshold instance variable.

The method then extracts the class IDs for each prediction by taking the index of the maximum value in the last 80 columns of the predictions array and storing the result in the class_ids variable.
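The scoring logic is easier to follow on a single prediction row. The sketch below mirrors the steps above without NumPy; the function name, the three-class row, and its values are hypothetical (the real model outputs 80 class probabilities per row).

```python
def score_prediction(row, conf_threshold=0.7):
    """Return (class_id, score) for one prediction row, or None if rejected.

    Row layout: [cx, cy, w, h, objectness, class_prob_0, class_prob_1, ...]
    """
    objectness = row[4]
    # Step 1: drop rows whose objectness is not above the threshold
    if objectness <= conf_threshold:
        return None
    # Step 2: weight every class probability by the objectness
    class_scores = [objectness * probability for probability in row[5:]]
    # Step 3: the best class must itself clear the threshold
    score = max(class_scores)
    if score <= conf_threshold:
        return None
    return class_scores.index(score), score


row = [320.0, 240.0, 80.0, 160.0, 0.9, 0.05, 0.95, 0.1]
result = score_prediction(row)
```

Here the second class wins with a score of roughly 0.9 * 0.95. A row with the same objectness but a best class probability of 0.6 would be rejected, since its combined score falls below the threshold.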

The method then applies non-maxima suppression (NMS) to the bounding boxes using the cv2.dnn.NMSBoxes function. NMS is a technique used to suppress weak, overlapping bounding boxes and select only the most confident bounding boxes for each object. The indices variable is created to store the indices of the bounding boxes that pass the NMS criteria.

If there are any indices in the indices list, the method flattens the list and uses it to select the bounding boxes, scores, and class IDs for the detected objects. These are then returned as the result of the process_output method.

The rescale_boxes method takes a single argument boxes, which is a list of bounding boxes in the format [x, y, width, height]. The method scales the bounding boxes to the original dimensions of the input image.

The extract_boxes method takes a single argument predictions, which is a 2D NumPy array containing the predictions for each object. The method extracts the bounding boxes from the predictions array and scales them to the original dimensions of the input image using the rescale_boxes method. The method then converts the bounding boxes to the format [x, y, width, height] and returns the resulting list of bounding boxes.
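As a worked example of the two box helpers, the sketch below maps a single center-format box from model-input pixels to a top-left box in frame pixels. The function name and the frame size of 1280 x 720 are assumptions for illustration.

```python
def to_frame_box(box, input_size, frame_size):
    """Map a (cx, cy, w, h) box from model-input pixels to a top-left
    (x, y, w, h) box in frame pixels."""
    cx, cy, w, h = box
    input_w, input_h = input_size
    frame_w, frame_h = frame_size
    # Horizontal and vertical scale factors can differ
    scale_x, scale_y = frame_w / input_w, frame_h / input_h
    cx, w = cx * scale_x, w * scale_x
    cy, h = cy * scale_y, h * scale_y
    # Shift from the box center to its top-left corner
    return cx - w / 2, cy - h / 2, w, h


box = to_frame_box((320, 240, 64, 48), input_size=(640, 480), frame_size=(1280, 720))
print(box)
```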

Lastly, you just need a piece of code that draws these on the camera frame.

def draw_detections(self, frame, boxes, scores, class_ids):
    for box, score, class_id in zip(boxes, scores, class_ids):
        x, y, w, h = box.astype(int)
        color = self.colors[class_id]
        # Draw rectangle
        cv2.rectangle(frame, (x, y), (x+w, y+h), color, thickness=2)
        label = self.class_names[class_id]
        label = f'{label} {int(score * 100)}%'
        cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1)
        cv2.putText(frame, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 1, color, thickness=2)

The draw_detections method uses the cv2.rectangle function to draw a rectangle on the frame using the bounding box coordinates and the color for the current object. The method also draws a label for the object on the image using the cv2.putText function. The label includes the class name and the confidence score for the object.

If you run the code above against a public webcam, this is the result:

POC result

Please keep in mind that this is a PoC, and you need additional code to handle different scenarios such as errors, capture not working, no device found, etc.

And that's it! With just a few Python libraries and some not-so-complicated code, you have built a service that consumes a camera feed, processes it, and sends the result to the end user.

Going further

At large scale, you most likely need to optimize the processing service to run on the specific hardware.

First of all, you would consider using C++ or Rust. If you go with C++, I recommend ncnn, a high-performance neural network inference framework optimized for mobile platforms. Also, beware that working with the WebRTC C++ library can cause major headaches, so as a word of caution, you may be better off writing your own implementation.

Secondly, you could use other object detection models, although YOLO is currently among the most popular ones. You can also add other algorithms, such as detecting unique entities, counting entities, and so on.

Lastly, you might consider using a data channel to send detection data alongside the media channel instead of annotating the media frames. With data channels, you can build a system that only sends data when motion is detected, which saves bandwidth.
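A data channel payload could look like the sketch below, which packs the detection results into a compact JSON message. The message schema, the function name, and the sample values are all hypothetical; the resulting string would then be handed to the data channel's send method.

```python
import json


def detections_to_message(boxes, scores, class_ids, class_names, frame_id):
    """Serialize detections for one frame into a JSON string."""
    detections = [
        {
            "label": class_names[class_id],
            "score": round(float(score), 3),
            "box": [int(value) for value in box],  # x, y, width, height
        }
        for box, score, class_id in zip(boxes, scores, class_ids)
    ]
    return json.dumps({"frame": frame_id, "detections": detections})


message = detections_to_message(
    boxes=[(576, 324, 128, 72)],
    scores=[0.855],
    class_ids=[0],
    class_names=["person"],
    frame_id=42,
)
```

Sending a few hundred bytes of JSON per frame is far cheaper than re-encoding annotated video, and the receiving client can decide for itself how to draw the boxes.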

Nevertheless, I hope you learned something new!

Thank you for reading, and subscribe to keep up with new articles.

