From camera to alert: real-time detection with YOLO + Supervision

June 10, 2026

The naive version of software that "watches" a security camera is three lines: grab a frame, run a model, draw boxes. That's fine for a demo; in the field it's useless. In SafeEye, the real engineering is turning a noisy, never-ending video stream into a reliable decision: seeing the right object, in the right zone, for long enough, and raising an alert without false alarms. This post walks the whole chain.

Architecture: why not a single process?

The first instinct is to cram everything into one Python script. After a few cameras that collapses: model inference saturates the CPU/GPU, the API can't respond, and one frozen camera stalls everything. So SafeEye is split into services:

Worker: reads streams, runs the model, does tracking and rule evaluation. The heavy lifting lives here.
API (FastAPI + SQLAlchemy): manages cameras, rules, violation logs and alerts. Lightweight and always up.
Redis: the messaging/queue layer between worker and API — loose coupling so one slowing down doesn't drag the other.
Panel (Next.js): live status and violation history.

This split isn't just tidiness; it's resilience. If a camera drops its RTSP connection you restart that worker, and the rest keeps running.

1. Capture: reading the stream

Cameras usually expose RTSP. OpenCV reads it, but production has two traps: (1) the connection dropping, and (2) a buffer piling up so you process stale frames. The second is sneaky: nothing errors, but the model is analyzing what the camera saw 5 seconds ago.

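import time
import cv2

def open_stream(url):
    cap = cv2.VideoCapture(url, cv2.CAP_FFMPEG)
    cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)  # ask for a 1-frame buffer; not every backend honors it
    return cap

cap = open_stream(rtsp_url)
while True:
    ok, frame = cap.read()
    if not ok:                       # dropped → back off briefly, then reconnect
        cap.release()
        time.sleep(1.0)              # avoid hammering a camera that is down
        cap = open_stream(rtsp_url)  # same backend and buffer settings as before
        continue
    process(frame)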
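
Because the buffer-size hint is not honored everywhere, a common workaround for the stale-frame trap is to drain the stream on a background thread and keep only the newest frame. The sketch below is one way to do that, under assumptions: `LatestFrameReader` is a hypothetical helper (not part of OpenCV), and it accepts any callable that returns `(ok, frame)` the way `cap.read` does.

```python
import threading


class LatestFrameReader:
    """Continuously call read() on a background thread, keeping only the newest frame."""

    def __init__(self, read):
        self._read = read                  # e.g. cap.read from a cv2.VideoCapture
        self._lock = threading.Lock()
        self._frame = None
        self._seq = 0                      # how many frames have been read so far
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def _run(self):
        while not self._stopped.is_set():
            ok, frame = self._read()
            if not ok:                     # stream ended or dropped; caller reconnects
                break
            with self._lock:               # overwrite: older frames are discarded
                self._frame = frame
                self._seq += 1

    def latest(self):
        """Return (sequence_number, frame); frame is None until the first read succeeds."""
        with self._lock:
            return self._seq, self._frame

    @property
    def running(self):
        return self._thread.is_alive()

    def stop(self):
        self._stopped.set()
        self._thread.join(timeout=2.0)
```

With a real camera this would be started as `LatestFrameReader(cap.read).start()`. The processing loop then calls `latest()` at its own pace and skips the work if the sequence number has not changed. When `running` turns false, the stream has dropped and it is time to reconnect.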

2. Detection: picking the right YOLO

Detection runs on Ultralytics YOLO. Model choice is a speed/accuracy trade-off: yolov8n (nano) is very fast but weak on small/distant objects; yolov8m/l are more accurate but realistically need a GPU to keep up with live video. In the field, a small model (nano or small) plus a well-tuned threshold is often more practical than a giant one.

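from ultralytics import YOLO
model = YOLO("yolov8s.pt")  # "small" variant; swap in yolov8n.pt for the nano model

# conf: confidence threshold; low = many boxes/false positives, high = misses
# iou: overlap threshold used by non-max suppression to drop duplicate boxes
# classes: only what we care about (e.g. person=0 in the COCO classes) → cut the noise
results = model(frame, conf=0.4, iou=0.5, classes=[0], verbose=False)[0]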

A pretrained model is enough for common objects; for a custom one (e.g. a hard hat) you fine-tune on your own data. The lesson: most of your accuracy comes not from the model but from threshold + class filter + a good camera angle.
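
To make the threshold trade-off concrete, here is a minimal sketch of a threshold sweep over a small hand-labeled sample from your own camera. Everything in it is an assumption for illustration: `sweep` is a hypothetical helper, and the input is a list of `(confidence, is_correct)` pairs you would produce by reviewing the model's boxes by hand, plus the number of real objects in those frames.

```python
def sweep(scored, n_ground_truth, thresholds):
    """Return (threshold, precision, recall) rows for a labeled sample.

    scored: (confidence, is_correct) for every box the model emitted.
    n_ground_truth: number of real objects present in the reviewed frames.
    """
    rows = []
    for t in thresholds:
        kept = [ok for conf, ok in scored if conf >= t]   # boxes surviving the threshold
        tp = sum(kept)                                    # correct boxes among them
        precision = tp / len(kept) if kept else 1.0       # no boxes → no false alarms
        recall = tp / n_ground_truth if n_ground_truth else 0.0
        rows.append((t, precision, recall))
    return rows


sample = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.3, False)]
for t, p, r in sweep(sample, n_ground_truth=4, thresholds=[0.25, 0.4, 0.7]):
    print(f"conf>={t:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy sample, raising the threshold from 0.25 to 0.7 lifts precision from 0.60 to 1.00 while recall falls from 0.75 to 0.50. That is the "many boxes vs. misses" trade-off from the comments above, expressed in numbers.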

3. Tracking: turning jitter into identity

Raw detection is per-frame: the same person is a new box every frame, with no identity. To say "in this zone for 5 seconds" you need persistent identity. That's where Supervision + ByteTrack come in:

import supervision as sv
tracker = sv.ByteTrack()

dets = sv.Detections.from_ultralytics(results)
dets = tracker.update_with_detections(dets)   # a persistent tracker_id per object

ByteTrack tries to keep identity even through brief occlusions. That's critical both for time-based rules and for "don't count the same event twice".
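
To show what "persistent identity" means mechanically, here is a deliberately simplified sketch: greedy matching of new boxes to existing tracks by IoU, with a small tolerance for missed frames. This is not ByteTrack (which also uses motion prediction and low-confidence detections). `GreedyIoUTracker` and its parameters are hypothetical names used only for illustration.

```python
from itertools import count


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


class GreedyIoUTracker:
    def __init__(self, min_iou=0.3, max_missed=5):
        self.min_iou = min_iou          # below this overlap, boxes are different objects
        self.max_missed = max_missed    # frames a track may go unseen before it is dropped
        self.tracks = {}                # track_id -> {"box": box, "missed": int}
        self._ids = count(1)

    def update(self, boxes):
        """Return one track id per input box, in input order."""
        # Score every (track, detection) pair; best overlaps are matched first.
        pairs = sorted(
            ((iou(t["box"], box), tid, i)
             for tid, t in self.tracks.items()
             for i, box in enumerate(boxes)),
            reverse=True,
        )
        assigned, used = {}, set()      # detection index -> track id; tracks already taken
        for score, tid, i in pairs:
            if score < self.min_iou:
                break
            if tid in used or i in assigned:
                continue
            assigned[i] = tid
            used.add(tid)
        # Unmatched detections start new tracks; matched ones refresh their track.
        for i, box in enumerate(boxes):
            if i not in assigned:
                assigned[i] = next(self._ids)
            self.tracks[assigned[i]] = {"box": box, "missed": 0}
        # Tracks with no detection this frame age, and expire after max_missed frames.
        seen = set(assigned.values())
        for tid in list(self.tracks):
            if tid not in seen:
                self.tracks[tid]["missed"] += 1
                if self.tracks[tid]["missed"] > self.max_missed:
                    del self.tracks[tid]
        return [assigned[i] for i in range(len(boxes))]
```

The `max_missed` tolerance is what lets an identity survive a brief occlusion: a person who disappears for a frame or two comes back with the same id, so their dwell time is not reset and the same event is not counted twice.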

4. Zones & time: adding context

A detection alone is meaningless; where and how long matter. With Supervision's PolygonZone you define a region of interest and filter who's inside:

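from collections import defaultdict

dwell = defaultdict(float)            # tracker_id → seconds spent in the zone
zone = sv.PolygonZone(polygon=AREA_POLYGON)  # AREA_POLYGON: Nx2 array of pixel coordinates
in_zone = zone.trigger(dets)          # boolean mask of objects in the zone
for tracker_id in dets.tracker_id[in_zone]:
    dwell[int(tracker_id)] += dt      # dt: seconds since the previously processed frame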
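
For intuition about what a zone test boils down to, here is a dependency-free sketch: a ray-casting point-in-polygon check applied to the bottom-center of each box (roughly where a person's feet are), plus dwell accounting. The names `point_in_polygon` and `update_dwell` are hypothetical, and the choice to restart the clock when an object leaves the zone is an assumption of this sketch, not something the snippet above does.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:                           # crossing lies to the right of the point
                inside = not inside
    return inside


def update_dwell(dwell, tracked_boxes, polygon, dt):
    """Accumulate seconds in the zone per track; mutates and returns dwell.

    tracked_boxes: {track_id: (x1, y1, x2, y2)} for the current frame.
    dt: seconds elapsed since the previously processed frame.
    """
    for tid, (x1, y1, x2, y2) in tracked_boxes.items():
        foot_x, foot_y = (x1 + x2) / 2, y2            # bottom-center anchor of the box
        if point_in_polygon(foot_x, foot_y, polygon):
            dwell[tid] = dwell.get(tid, 0.0) + dt
        else:
            dwell.pop(tid, None)                      # left the zone → restart the clock
    return dwell
```

Resetting on exit makes the rule mean "continuously in the zone for N seconds". If you want "N seconds in total", drop the `else` branch and clear the entry only when the track disappears.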

5. Rule engine: from detection to violation

All the value comes together here. A rule binds the trio detection + zone + time into a decision: "if object X stays in zone Y for more than N seconds, it's a violation". The critical part is debounce: a single-frame flash must never trigger an alert.

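for tid in list(dwell):               # every track we are holding state for
    if tid not in current_ids:        # object gone → clear its state
        dwell.pop(tid, None)
        active.pop(tid, None)
    elif dwell[tid] >= rule.threshold_s and not active.get(tid, False):
        active[tid] = True            # fire once per track, not once per frame
        publish(redis, "violation", {"rule": rule.id, "track": tid, "ts": now})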

The violation event is published to Redis; the API writes it to the violation_log and raises the alert. The worker "sees", the API "remembers and manages" — responsibilities are separated.
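
Since the worker and the API only share a message, it helps to pin down what that message looks like. The sketch below is an assumption about the shape, based on the three fields published above. `ViolationEvent` is a hypothetical name, and the real schema may carry more (camera id, snapshot path, and so on).

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ViolationEvent:
    rule: str      # id of the rule that fired
    track: int     # tracker id of the offending object
    ts: float      # Unix timestamp of the violation

    def to_json(self) -> str:
        """Serialize for the wire (worker side)."""
        return json.dumps(asdict(self), separators=(",", ":"))

    @classmethod
    def from_json(cls, payload: str) -> "ViolationEvent":
        """Parse and type-check a message (API side)."""
        data = json.loads(payload)
        return cls(rule=str(data["rule"]), track=int(data["track"]), ts=float(data["ts"]))
```

One practical detail: tracker ids coming out of a NumPy array are NumPy integers, which `json.dumps` refuses to serialize, so cast them with `int(...)` before building the event. With redis-py, the worker side would then be something like `r.publish("violation", event.to_json())`, with the API parsing each received message through `ViolationEvent.from_json`.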

6. Performance and scaling

opencv-python-headless: no GUI dependency on the server, a lighter image.
Frame skipping: you rarely need all 25 FPS; processing every 3rd frame cuts inference load to roughly a third and is unnoticeable for most rules, as long as dwell time is measured in elapsed seconds rather than in frame counts.
Resolution: downscale to the model's input size (640 px by default for YOLOv8); it lowers latency and memory.
Multiple cameras: scale each camera as a separate worker/process; Redis is the shared backbone.
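
Frame skipping interacts with the dwell timer: if you skip frames, the `dt` you add per processed frame has to grow accordingly. A minimal sketch of a skipping loop that stays correct, assuming each frame arrives paired with a capture timestamp (`every_nth` is a hypothetical helper):

```python
def every_nth(timestamped_frames, n):
    """Yield (frame, dt) for every n-th frame.

    timestamped_frames: iterable of (timestamp_seconds, frame).
    dt: seconds since the previously yielded frame (0.0 for the first one).
    """
    last_ts = None
    for i, (ts, frame) in enumerate(timestamped_frames):
        if i % n:                      # skip everything except frames 0, n, 2n, ...
            continue
        dt = 0.0 if last_ts is None else ts - last_ts
        last_ts = ts
        yield frame, dt
```

Deriving `dt` from timestamps instead of assuming `1 / fps` also keeps the rules honest when the camera's real frame rate drifts or a reconnect leaves a gap in the stream.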

Lessons from the field

Stability matters as much as model accuracy. What kills false alarms in the field isn't a bigger model; it's the time window, the tracking ID, and a well-placed camera. An operator turns the system off after a few false alarms — so reliability isn't a feature, it's a precondition.

Lighting changes, camera angle and weather affect accuracy more than the model does, which is why you tune thresholds and zones on site. For the whole architecture: SafeEye project page.