YOLO Model Comparison: v8, v11, v12 & v26

Electro‑optical sensors such as digital cameras generate rich streams of data. Object detectors like the Ultralytics YOLO family distil those streams into actionable information for tasks ranging from pollution monitoring and self‑driving cars to intrusion detection and 3D reconstruction. In this tutorial you will compare YOLOv8, YOLOv11, YOLOv12 and YOLOv26 on a common video. The exercise walks you through downloading (or recording) a video, installing the required packages, running all four models on every frame, overlaying statistics and assembling the results into a 2×2 grid. Because every detector processes the same frames, you can visually gauge differences in detection quality and inference speed.

Background on the models

Before diving into the hands‑on portion, it’s useful to summarize what makes each generation of YOLO unique. These short notes are distilled from the official Ultralytics documentation.

  • YOLOv8 introduced a new backbone and neck architecture that improves feature extraction, an anchor‑free split head for more efficient detection, and a balance between accuracy and speed that makes it suitable for real‑time applications. Pretrained weights are available for detection, segmentation, classification and pose estimation tasks.
  • YOLOv11 builds on v8 with improved backbone and neck designs and refined training pipelines; it delivers higher accuracy with fewer parameters (YOLO11m achieves a higher mAP on COCO than YOLOv8m with a smaller parameter count). It supports object detection, instance segmentation, classification, pose estimation and oriented bounding boxes (OBB).
  • YOLOv12 is a community‑driven release from early 2025. It departs from conventional CNNs with an attention‑centric architecture, achieving high accuracy through novel attention mechanisms but at the cost of increased memory use and slower CPU throughput; Ultralytics therefore recommends v11 (servers) or v26 (edge devices) for most production workloads. Key innovations include an area attention mechanism that reduces computational overhead, a residual efficient layer aggregation network (R‑ELAN) for better feature aggregation and several optimizations of the attention pipeline.
  • YOLOv26 is engineered for edge and low‑power devices. Its native end‑to‑end architecture eliminates the non‑maximum suppression (NMS) step, resulting in faster, lighter inference. Additional innovations include a MuSGD optimizer that adapts techniques from large‑language‑model training and task‑specific improvements such as multi‑scale segmentation and residual log‑likelihood estimation for pose estimation. These changes deliver higher accuracy on small objects and up to 43 % faster CPU inference.

These differences motivate why you might favor a particular version in your own projects. The exercise below will help you see those differences in action.
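
Despite the architectural differences, all four generations share the same Python interface. As a quick sanity check (once the packages from the next section are installed), the minimal sketch below loads one weight file and prints the raw detections for a sample image; the image URL is the standard example from the Ultralytics documentation, and any of the four weight names used later in this tutorial can be substituted.

from ultralytics import YOLO

# Every generation loads through the same interface; Ultralytics downloads
# the named weight file automatically on first use.
model = YOLO('yolo26n.pt')   # or 'yolov8x-pose.pt', 'yolo11x-pose.pt', 'yolo12x.pt'

# Run a one-off prediction on a sample image and inspect the raw detections.
results = model('https://ultralytics.com/images/bus.jpg')
print(results[0].boxes.xyxy)   # bounding boxes as (x1, y1, x2, y2)
print(results[0].boxes.conf)   # confidence scores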

Install dependencies

You will need Python 3, OpenCV and the Ultralytics package. In a fresh environment the following command installs both:

pip install opencv-python ultralytics
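
To confirm the environment is ready, you can print the installed versions and run the built‑in environment check (this assumes a recent Ultralytics release, which exposes a checks() utility):

import cv2
import ultralytics

# Report the installed versions and verify the Ultralytics environment.
print('OpenCV     :', cv2.__version__)
print('Ultralytics:', ultralytics.__version__)
ultralytics.checks()   # prints Python, PyTorch and hardware information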

Download or create a video

You can supply any video file you like. For example, download a sample video of a runner:

import urllib.request
urllib.request.urlretrieve('http://apmonitor.com/dde/uploads/Main/runner.mp4','runner.mp4')
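
If you would rather create your own clip than download one, a few seconds of webcam footage works just as well. The sketch below is one way to record it with OpenCV; the camera index, frame rate and duration are assumptions you may need to adjust for your hardware.

import cv2

# Record roughly five seconds from the default webcam into my_clip.mp4.
cap = cv2.VideoCapture(0)                     # camera index 0 is assumed
fps, seconds = 20, 5                          # assumed capture settings
w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out = cv2.VideoWriter('my_clip.mp4', cv2.VideoWriter_fourcc(*'mp4v'), fps, (w, h))
for _ in range(fps * seconds):
    ret, frame = cap.read()
    if not ret:
        break
    out.write(frame)
cap.release(); out.release()

Set input_video = 'my_clip.mp4' in the comparison script below to use your own recording.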

Python script: process and compare the models

The script below loads four YOLO models, processes each frame of the input video with every model, overlays the model name, frame number and inference time on each result, combines the four annotated frames into a 2×2 grid, and writes the output at half the original frame rate. Save this script as compare.py and run it from the command line:

import time, cv2, numpy as np
from ultralytics import YOLO

# Weight files for the four model generations; Ultralytics downloads each
# file automatically on first use. Pose weights are used for v8 and v11,
# detection weights for v12 and v26.
MODEL_WEIGHTS = {
    'YOLOv8':  'yolov8x-pose.pt',
    'YOLOv11': 'yolo11x-pose.pt',
    'YOLOv12': 'yolo12x.pt',
    'YOLOv26': 'yolo26n.pt',
}

def overlay_stats(img, name, frame_no, ms):
    # Draw the model name, frame number and inference time as a thick dark
    # outline with a thin white fill so the text stays readable on any frame.
    for idx, text in enumerate([name,f'Frame: {frame_no}',f'Time: {ms:.1f} ms']):
        y = 30 + idx * 30
        cv2.putText(img, text, (10,y), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0,0,0), 4, cv2.LINE_AA)
        cv2.putText(img, text, (10,y), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (255,255,255), 1, cv2.LINE_AA)
    return img

# Open the input video and prepare an output writer for the 2x2 grid,
# written at half the original frame rate.
input_video = 'runner.mp4'
output_video = 'compare_' + input_video
cap = cv2.VideoCapture(input_video)
fps  = cap.get(cv2.CAP_PROP_FPS)
w    = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
h    = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
out_fps = max(fps/2.0, 1)   # half speed, but never below 1 fps
out = cv2.VideoWriter(output_video, cv2.VideoWriter_fourcc(*'mp4v'), out_fps, (w*2, h*2))

# Load all four models once, outside the frame loop.
models = {name: YOLO(weight) for name, weight in MODEL_WEIGHTS.items()}
frame_no = 0
while True:
    ret, frame = cap.read()
    if not ret: break
    frame_no += 1
    sub = []
    for name, model in models.items():
        # Time a single inference call for this model on the current frame.
        t0 = time.perf_counter()
        res = model(frame, verbose=False)   # verbose=False silences per-frame logs
        ms = (time.perf_counter() - t0)*1000
        annotated = res[0].plot()           # draw the detections on a copy of the frame
        sub.append(overlay_stats(annotated, name, frame_no, ms))
    # Assemble the four annotated frames into a 2x2 grid:
    # top row = v8 | v11, bottom row = v12 | v26.
    top    = np.hstack((sub[0], sub[1]))
    bottom = np.hstack((sub[2], sub[3]))
    combined = np.vstack((top, bottom))
    out.write(combined)
cap.release(); out.release()

Run the script (edit input_video in compare.py if you supply a different video file):

python compare.py

View the comparison video

The output has the same number of frames as the original video but plays at half speed. Each quadrant corresponds to a model (top‑left = v8, top‑right = v11, bottom‑left = v12, bottom‑right = v26). Open the output file in a video player and pause or slow the playback to inspect individual frames.

Each quadrant of the video shows the same moving object along with the model name, frame number and processing time.
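
If you prefer to review the result from Python instead of an external player, a minimal OpenCV playback loop such as the one below steps through the comparison video (the filename assumes the runner.mp4 example; press q to stop early):

import cv2

# Play the comparison video frame by frame; press 'q' to quit.
cap = cv2.VideoCapture('compare_runner.mp4')
delay = int(1000 / max(cap.get(cv2.CAP_PROP_FPS), 1))   # milliseconds between frames
while True:
    ret, frame = cap.read()
    if not ret:
        break
    cv2.imshow('YOLO comparison', frame)
    if cv2.waitKey(delay) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()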

Activity: compare inference times

1. Use a video of your choice (e.g. a short clip of you in the scene) and run the comparison script.
2. Examine the resulting 2×2 video and note which models detect objects correctly and which miss detections.
3. Measure the average inference time for each model, either by pausing the video and reading the overlay or by logging the timings programmatically (see the sketch after this list). How do the speeds compare?
4. Swap the weight files in MODEL_WEIGHTS for different model scales (e.g. yolov8s.pt, yolo11m.pt) and observe how accuracy and speed change.
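
For step 3, rather than reading values off the overlay, the comparison loop can be adapted to accumulate the per‑frame timings and report an average per model. The following is a minimal sketch that reuses the MODEL_WEIGHTS dictionary and the runner.mp4 example from compare.py:

import time, cv2
from ultralytics import YOLO

MODEL_WEIGHTS = {
    'YOLOv8':  'yolov8x-pose.pt',
    'YOLOv11': 'yolo11x-pose.pt',
    'YOLOv12': 'yolo12x.pt',
    'YOLOv26': 'yolo26n.pt',
}
models = {name: YOLO(w) for name, w in MODEL_WEIGHTS.items()}
times = {name: [] for name in models}   # per-frame inference times in ms

cap = cv2.VideoCapture('runner.mp4')
while True:
    ret, frame = cap.read()
    if not ret:
        break
    for name, model in models.items():
        t0 = time.perf_counter()
        model(frame, verbose=False)
        times[name].append((time.perf_counter() - t0) * 1000)
cap.release()

for name, ms in times.items():
    if ms:
        print(f'{name}: average {sum(ms)/len(ms):.1f} ms over {len(ms)} frames')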

✅ Knowledge check

1. What architectural innovation enables YOLO26 to perform inference without the non‑maximum suppression (NMS) post‑processing step?

A. End‑to‑end model design
Correct. YOLO26 produces predictions directly from the network without the need for NMS.
B. The anchor‑free split head
Incorrect. The anchor‑free split head is a feature of YOLOv8, not YOLO26.
C. Area attention mechanism
Incorrect. Area attention is one of the innovations in YOLO12, not YOLO26.
D. MuSGD optimizer
Incorrect. The MuSGD optimizer improves training stability but does not remove the need for NMS.

2. Which statement best describes YOLO12?

A. It introduces an attention‑centric architecture but may be slower and use more memory than earlier models
Correct. YOLO12 departs from CNNs with an attention‑centric architecture and can have higher memory use and slower CPU throughput.
B. It removes Distribution Focal Loss and eliminates NMS
Incorrect. Removing DFL and NMS are features of YOLO26.
C. It introduces a new backbone and neck and achieves higher mAP with fewer parameters
Incorrect. Those improvements are characteristic of YOLO11.
D. It is optimized for real‑time edge deployment and 43 % faster on CPUs
Incorrect. This description fits YOLO26.
