How to use YOLO to accurately track specific people?

OpenCV comes with built-in object tracking capabilities that work right after installation, but they only provide basic tracking: when there are many objects in the frame, or when objects cross paths or are occluded by other objects, tracking is easily interrupted.
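
For comparison, here is a minimal sketch of how OpenCV's built-in tracker API is typically used. This assumes the opencv-contrib-python build; in some OpenCV versions the factory function lives under cv2.legacy.TrackerCSRT_create instead, and "source.mp4" is just a placeholder filename.

import cv2

# Open a video file (or a webcam) and grab the first frame
cap = cv2.VideoCapture("source.mp4")
frameOK, frame = cap.read()

# Let the user draw the initial box and initialize a CSRT tracker on it
bbox = cv2.selectROI("Select ROI", frame)
cv2.destroyWindow("Select ROI")
tracker = cv2.TrackerCSRT_create()  # or cv2.legacy.TrackerCSRT_create() in some builds
tracker.init(frame, bbox)

while frameOK:
    frameOK, frame = cap.read()
    if not frameOK:
        break
    # update() returns False once the tracker loses the target
    found, bbox = tracker.update(frame)
    if found:
        x, y, w, h = [int(v) for v in bbox]
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
    cv2.imshow("OpenCV tracker", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()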

A better approach is to use deep learning models, such as YOLO object recognition, combined with the Intersection-Over-Union (IOU) method. This yields impressive tracking results and is quite fast even on a standard PC or laptop. Let’s take a look at how to do this step by step.

The human detection model we use in this example is the "Accurate human detection model" from yolo.dog, which can accurately detect parts of the human figure, such as heads and bodies, in images.

You could also opt for the general model provided on the official YOLOv8 website, but the results won't be as good, especially for crowded groups, smaller figures, people in swimming pools, and so on. For detecting most human bodies and heads, the "Accurate human detection model" is recommended.

A. Install the necessary Python packages

pip install ultralytics
pip install streamlit
pip install dill
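
The script also imports cv2. The ultralytics package normally pulls in opencv-python as a dependency, but if cv2 is not available in your environment you can install it explicitly:

pip install opencv-python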

B. Create a file named tracking.py, and then edit it with your preferred tool.

C. Import necessary packages

import cv2
import math
from ultralytics import YOLO

D. Define an IOU function.

Intersection over Union (IoU) is a metric used in object detection to measure the overlap between two bounding boxes. It is defined as the area of overlap between the two bounding boxes divided by the area of their union.

The IoU score ranges from 0 to 1, where 0 means there is no overlap and 1 means the bounding boxes are identical. This metric is often used in object detection and segmentation to evaluate how close the predicted bounding box is to the ground truth box.

def iou_boxes(boxA, boxB):
    # boxA and boxB are bounding boxes, each represented by a list of four elements: [x1, y1, x2, y2], where (x1, y1) are the coordinates of the top-left corner and (x2, y2) are the coordinates of the bottom-right corner.
    # Compute the coordinates (xA, yA) of the top-left corner of the intersection of boxA and boxB
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])

    # Compute the coordinates (xB, yB) of the bottom-right corner of the intersection of boxA and boxB
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])

    # Compute the area of the intersection from its width (xB - xA + 1) and height (yB - yA + 1). If there's no intersection, this will be 0.
    interArea = max(0, xB - xA + 1) * max(0, yB - yA + 1)

    # Compute the area of both boxA and boxB
    boxAArea = (boxA[2] - boxA[0] + 1) * (boxA[3] - boxA[1] + 1)
    boxBArea = (boxB[2] - boxB[0] + 1) * (boxB[3] - boxB[1] + 1)

    # Compute the IoU by dividing the area of the intersection by the area of the union (which is the sum of the areas of boxA and boxB, minus the area of the intersection)
    iou = interArea / float(boxAArea + boxBArea - interArea)

    # Return the IoU
    return iou
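
As a quick sanity check of this function: identical boxes should give an IoU of 1.0, disjoint boxes 0.0, and partially overlapping boxes something in between. The box values below are just illustrative.

print(iou_boxes([0, 0, 10, 10], [0, 0, 10, 10]))    # 1.0  (identical boxes)
print(iou_boxes([0, 0, 10, 10], [20, 20, 30, 30]))  # 0.0  (no overlap)
print(iou_boxes([0, 0, 10, 10], [5, 5, 15, 15]))    # about 0.17 (partial overlap)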

E. Load the YOLOv8 model that detects the head and body classes. The weights filename below should point to the head/body detection model file you downloaded.

net = YOLO('yolov8s.pt')
classNames = ["head", "body"]

F. You can use a webcam or a video file as the input source.

# Use a webcam
cap = cv2.VideoCapture(0)
# Use a video file
# cap = cv2.VideoCapture("source.mp4")

G. Create a variable named areaROI, read the first frame, and then start the main loop.

areaROI = None
frameOK, img = cap.read()
while frameOK:

H. If areaROI is None, ask the user to draw a rectangular box around the object to be tracked (using cv2.selectROI), then convert the selection into [x1, y1, x2, y2] form and store it in roi_bbox.

    if areaROI is None:
        areaROI = cv2.selectROI("Select ROI", img)
        cv2.destroyWindow("Select ROI")
        roi_bbox = [areaROI[0], areaROI[1], areaROI[0]+areaROI[2], areaROI[1]+areaROI[3]]

I. Run detection on the current frame.

    results = net(img, stream=True)

J. We only need the body class, so collect every detection of the body class into the bodys list. (A note on using the official COCO model instead follows the code below.)

    bodys = []
    color = (0, 0, 255)
    for r in results:
        boxes = r.boxes

        for box in boxes:
            # bounding box
            x1, y1, x2, y2 = box.xyxy[0]
            x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2) # convert to int values

            # confidence score (not used further in this example, but handy for filtering weak detections)
            confidence = math.ceil((box.conf[0]*100))/100

            # class id
            cls = int(box.cls[0])
            # we use the body class only (class_id = 1)
            if cls == 1:                
                bodys.append([x1, y1, x2, y2])
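
A side note, in case you use the official COCO-pretrained YOLOv8 weights instead of the head/body model: there the person class has id 0, and the class names can be read from the model itself, so the filter inside the loop would look roughly like this (a sketch, not part of the main example):

            # With official COCO weights, 'person' is class id 0
            if cls == 0:  # equivalently: if net.names[cls] == "person"
                bodys.append([x1, y1, x2, y2])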

K. Among all the detected bodies, find the one with the largest IoU against roi_bbox; this is the detection that overlaps the previous position the most, so it should be the object being tracked. Store that box in new_roi_bbox.

    max_iou = 0
    new_roi_bbox = None
    for box in bodys:
        iou = iou_boxes(box, roi_bbox)
        if iou > max_iou:
            max_iou = iou
            new_roi_bbox = box

L. If the best IoU is less than 0.3, the boxes do not overlap enough, so we decide that tracking has been lost, set areaROI back to None, and let the user reselect the object to be tracked.

    if max_iou < 0.3:
        areaROI = None

M. If areaROI is not None, the object is still being tracked, so we draw its bounding box and a text label so the user can see where the tracked object currently is.

    if areaROI is not None:
        [x1,y1,x2,y2] = new_roi_bbox
        font = cv2.FONT_HERSHEY_SIMPLEX
        fontScale = 1
        thickness = 2
        cv2.rectangle(img, (x1, y1), (x2, y2), color, thickness)
        cv2.putText(img, "Here", (x1,y1), font, fontScale, color, thickness)

N. Store new_roi_bbox from the current frame back into roi_bbox, so it becomes the reference box for the next frame.

        roi_bbox = new_roi_bbox

O. Use cv2.imshow to display the frame, then read the next frame and continue the loop (press q to quit). When the loop ends, release the capture and close the windows.

    cv2.imshow("Tracking", img)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

    frameOK, img = cap.read()

cap.release()
cv2.destroyAllWindows()

The complete program code is as follows:

import cv2
import math
from ultralytics import YOLO

def iou_boxes(boxA, boxB):
    # boxA and boxB are bounding boxes, each represented by a list of four elements: [x1, y1, x2, y2], where (x1, y1) are the coordinates of the top-left corner and (x2, y2) are the coordinates of the bottom-right corner.
    # Compute the coordinates (xA, yA) of the top-left corner of the intersection of boxA and boxB
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])

    # Compute the coordinates (xB, yB) of the bottom-right corner of the intersection of boxA and boxB
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])

    # Compute the area of the intersection from its width (xB - xA + 1) and height (yB - yA + 1). If there's no intersection, this will be 0.
    interArea = max(0, xB - xA + 1) * max(0, yB - yA + 1)

    # Compute the area of both boxA and boxB
    boxAArea = (boxA[2] - boxA[0] + 1) * (boxA[3] - boxA[1] + 1)
    boxBArea = (boxB[2] - boxB[0] + 1) * (boxB[3] - boxB[1] + 1)

    # Compute the IoU by dividing the area of the intersection by the area of the union (which is the sum of the areas of boxA and boxB, minus the area of the intersection)
    iou = interArea / float(boxAArea + boxBArea - interArea)

    # Return the IoU
    return iou

# Load the YOLOv8 model
net = YOLO('yolov8s.pt')
classNames = ["head", "body"]

# Use a webcam
# cap = cv2.VideoCapture(0)
# Use a video file
cap = cv2.VideoCapture("source.mp4")

areaROI = None
frameOK, img = cap.read()
while frameOK:
    if areaROI is None:
        areaROI = cv2.selectROI("Select ROI", img)
        cv2.destroyWindow("Select ROI")
        roi_bbox = [areaROI[0], areaROI[1], areaROI[0]+areaROI[2], areaROI[1]+areaROI[3]]

    # Start to detect, and store in results parameter
    results = net(img, stream=True)

    bodys = []
    color = (0, 0, 255)
    for r in results:
        boxes = r.boxes

        for box in boxes:
            # bounding box
            x1, y1, x2, y2 = box.xyxy[0]
            x1, y1, x2, y2 = int(x1), int(y1), int(x2), int(y2) # convert to int values

            # confidence score (not used further in this example, but handy for filtering weak detections)
            confidence = math.ceil((box.conf[0]*100))/100

            # class id
            cls = int(box.cls[0])
            # we use the body class only (class_id = 1)
            if cls == 1:                
                bodys.append([x1, y1, x2, y2])

    # Find the detected body with the largest IoU against the current ROI box
    max_iou = 0
    new_roi_bbox = None
    for box in bodys:
        iou = iou_boxes(box, roi_bbox)
        if iou > max_iou:
            max_iou = iou
            new_roi_bbox = box

    # If the best overlap is too small, the track is lost; ask the user to reselect
    if max_iou < 0.3:
        areaROI = None

    if areaROI is not None:
        [x1,y1,x2,y2] = new_roi_bbox
        # object details
        font = cv2.FONT_HERSHEY_SIMPLEX
        fontScale = 1
        thickness = 2
        cv2.rectangle(img, (x1, y1), (x2, y2), color, thickness)
        cv2.putText(img, "Here", (x1,y1), font, fontScale, color, thickness)

        roi_bbox = new_roi_bbox

    cv2.imshow("Tracking", img)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

    frameOK, img = cap.read()

cap.release()
cv2.destroyAllWindows()

Finally, save the tracking.py code file and execute the command below.

python tracking.py