
From camera to alert: real-time detection with YOLO + Supervision
The naive version of software that "watches" a security camera is three lines: grab a frame, run a model, draw boxes. That's fine for a demo; in the field it's useless. In SafeEye, the real engineering is turning a noisy, never-ending video stream into a reliable decision: seeing the right object, in the right zone, for long enough, and raising an alert without false alarms. This post walks the whole chain.
Architecture: why not a single process?
The first instinct is to cram everything into one Python script. After a few cameras that collapses: model inference saturates the CPU/GPU, the API can't respond, and one frozen camera stalls everything. So SafeEye is split into services:
- Worker: reads streams, runs the model, does tracking and rule evaluation. The heavy lifting lives here.
- API (FastAPI + SQLAlchemy): manages cameras, rules, violation logs and alerts. Lightweight and always up.
- Redis: the messaging/queue layer between worker and API — loose coupling so one slowing down doesn't drag the other.
- Panel (Next.js): live status and violation history.
This split isn't just tidiness; it's resilience. If a camera drops its RTSP connection you restart that worker, and the rest keeps running.
1. Capture: reading the stream
Cameras usually expose RTSP. OpenCV reads it, but production has two traps: (1) the connection dropping, and (2) a buffer piling up so you process stale frames. The second is sneaky: the model ends up analyzing what the camera saw 5 seconds ago.
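import time
import cv2

cap = cv2.VideoCapture(rtsp_url, cv2.CAP_FFMPEG)
cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)  # ask for the freshest frame; some backends ignore this
while True:
    ok, frame = cap.read()
    if not ok:  # dropped → release, wait briefly, reconnect with the same settings
        cap.release()
        time.sleep(1.0)
        cap = cv2.VideoCapture(rtsp_url, cv2.CAP_FFMPEG)
        cap.set(cv2.CAP_PROP_BUFFERSIZE, 1)
        continue
    process(frame)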
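Because the buffer-size hint is not honored everywhere, a common complement is to drain the stream on a background thread and keep only the newest frame. The following is a minimal sketch of that idea, not SafeEye's actual code: the class name LatestFrameReader and its methods are hypothetical, and it accepts any object with a read() method returning (ok, frame), such as a cv2.VideoCapture. Reconnect handling is left out to keep it short.

```python
import threading
import time


class LatestFrameReader:
    """Drain a capture object on a background thread, keeping only the newest frame."""

    def __init__(self, cap):
        self._cap = cap                # anything with read() -> (ok, frame)
        self._lock = threading.Lock()
        self._frame = None
        self._seq = 0                  # incremented on every successful read
        self._running = True
        self._thread = threading.Thread(target=self._drain, daemon=True)
        self._thread.start()

    def _drain(self):
        while self._running:
            ok, frame = self._cap.read()
            if not ok:
                time.sleep(0.01)       # avoid a busy loop while the stream is down
                continue
            with self._lock:
                self._frame = frame    # overwrite: older frames are simply dropped
                self._seq += 1

    def latest(self):
        """Return (seq, frame); frame is None until the first successful read."""
        with self._lock:
            return self._seq, self._frame

    def stop(self):
        self._running = False
        self._thread.join(timeout=1.0)
```

The processing loop then calls latest() and skips the iteration when seq has not changed since the previous call, so a slow model falls behind by dropping frames rather than by building up a backlog.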
2. Detection: picking the right YOLO
Detection runs on Ultralytics YOLO. Model choice is a speed/accuracy trade-off: yolov8n (nano) is very fast but weak on small/distant objects; yolov8m/l are more accurate but need a GPU to keep up in real time. In the field, nano + a good threshold is often more practical than a giant model.
from ultralytics import YOLO
model = YOLO("yolov8s.pt")  # small variant; use "yolov8n.pt" for nano
# conf: confidence threshold — low = many boxes/false positives, high = misses
# classes: only what we care about (e.g. person=0) → cut the noise
results = model(frame, conf=0.4, iou=0.5, classes=[0], verbose=False)[0]
A pretrained model is enough for common objects; for a custom one (e.g. a hard hat) you fine-tune on your own data. The lesson: most of your accuracy comes not from the model but from threshold + class filter + a good camera angle.
3. Tracking: turning jitter into identity
Raw detection is per-frame: the same person is a new box every frame, with no identity. To say "in this zone for 5 seconds" you need persistent identity. That's where Supervision + ByteTrack come in:
import supervision as sv
tracker = sv.ByteTrack()
dets = sv.Detections.from_ultralytics(results)
dets = tracker.update_with_detections(dets) # a persistent tracker_id per object
ByteTrack tries to keep identity even through brief occlusions. That's critical both for time-based rules and for "don't count the same event twice".
4. Zones & time: adding context
A detection alone is meaningless; where and how long matter. With Supervision's PolygonZone you define a region of interest and filter who's inside:
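from collections import defaultdict

dwell = defaultdict(float)                   # seconds in zone per tracker_id (create once)
zone = sv.PolygonZone(polygon=AREA_POLYGON)  # AREA_POLYGON: Nx2 array of pixel coordinates
in_zone = zone.trigger(dets)                 # boolean mask of objects in the zone
for tracker_id in dets.tracker_id[in_zone]:
    dwell[int(tracker_id)] += dt             # accumulate dwell time in the zone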
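To make the zone test less of a black box, here is a standard-library sketch of the underlying idea: a ray-casting point-in-polygon check applied to one anchor point per box. It is an illustration, not Supervision's implementation. The function names are hypothetical, using the bottom-center of the box as the anchor is an assumption of this sketch, and resetting dwell time when an object leaves the zone is a design choice rather than something the pipeline above prescribes.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def bottom_center(box):
    """Anchor point of an (x1, y1, x2, y2) box: roughly where a person's feet are."""
    x1, y1, x2, y2 = box
    return (x1 + x2) / 2, y2


def update_dwell(dwell, boxes_by_id, polygon, dt):
    """Add dt seconds for every tracked box in the zone; reset boxes that are outside."""
    for tid, box in boxes_by_id.items():
        x, y = bottom_center(box)
        if point_in_polygon(x, y, polygon):
            dwell[tid] = dwell.get(tid, 0.0) + dt
        else:
            dwell.pop(tid, None)                      # left the zone: start from zero next time
    return dwell
```

Anchoring on the bottom edge rather than the box center tends to match floor-level zones better, since a tall box can overlap a zone on screen while the person stands outside it.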
5. Rule engine: from detection to violation
All the value comes together here. A rule binds the trio detection + zone + time into a decision: "if object X stays in zone Y for more than N seconds, it's a violation". The critical part is debounce: a single-frame flash must never trigger an alert.
for tid in list(dwell):
    if tid not in current_ids:  # object gone → clear its state
        dwell.pop(tid, None)
        active.pop(tid, None)
    elif dwell[tid] >= rule.threshold_s and not active.get(tid, False):
        active[tid] = True  # latch: one alert per track, not one per frame
        publish(redis, "violation", {"rule": rule.id, "track": tid, "ts": now})
The violation event is published to Redis; the API writes it to the violation_log and raises the alert. The worker "sees", the API "remembers and manages" — responsibilities are separated.
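The publish helper used above is not shown in this post, so here is one way it might look. This is a sketch under assumptions: it uses Redis pub/sub through the client's publish(channel, message) method and JSON as the wire format, and decode_violation is a hypothetical name for the API-side counterpart. The helper works with any client object exposing publish, which also makes it easy to test without a running Redis.

```python
import json


def publish(client, channel, payload):
    """Serialize payload as JSON and publish it on the given channel.

    Returns whatever the client's publish() returns (for Redis pub/sub,
    the number of subscribers that received the message).
    """
    message = json.dumps(payload, sort_keys=True)
    return client.publish(channel, message)


def decode_violation(message):
    """API side: turn a received message (str or bytes) back into a dict."""
    if isinstance(message, bytes):
        message = message.decode("utf-8")
    return json.loads(message)
```

Two caveats apply. Tracker ids taken from NumPy arrays need an int() cast first, because json.dumps rejects NumPy integer types. And pub/sub is fire-and-forget: if the API is not subscribed at that moment the event is lost, so a Redis list or stream may be a better fit when violations must survive a restart.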
6. Performance and scaling
- opencv-python-headless: no GUI dependency on the server, a lighter image.
- Frame skipping: you rarely need all 25 FPS; processing every 3rd frame cuts GPU load to a third and is unnoticeable for most rules.
- Resolution: downscale to the model's input size; it lowers latency and memory.
- Multiple cameras: scale each camera as a separate worker/process; Redis is the shared backbone.
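Frame skipping from the list above can be sketched as a small generator. The name read_every_nth is hypothetical. It accepts any object with grab() and read() methods, which cv2.VideoCapture provides, and how much work grab() actually saves compared to read() depends on the backend.

```python
def read_every_nth(cap, n):
    """Yield every n-th frame from cap, advancing past the others with grab()."""
    while True:
        for _ in range(n - 1):
            if not cap.grab():     # advance the stream without retrieving the frame
                return
        ok, frame = cap.read()
        if not ok:
            return
        yield frame
```

When frames are skipped, the dwell-time step has to grow with them: at 25 FPS and n = 3, each processed frame represents 3 / 25 = 0.12 s. Deriving dt from wall-clock timestamps instead avoids the problem when the camera's real frame rate drifts.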
Lessons from the field
Stability matters as much as model accuracy. What kills false alarms in the field isn't a bigger model; it's the time window, the tracking ID, and a well-placed camera. An operator turns the system off after a few false alarms — so reliability isn't a feature, it's a precondition.
Lighting changes, camera angle and weather affect accuracy more than the model does, which is why you tune thresholds and zones on site. For the whole architecture, see the SafeEye project page.