Computer Vision — Image Representations, CNNs, Detection, Segmentation, and Metrics
A concise technical overview of modern computer vision architectures and evaluation methodologies, plus deployment considerations for edge and cloud inference.
Feature extraction and learned representations power modern vision systems.
Image Representations & Preprocessing
Images are tensors of shape H×W×C. Common preprocessing steps include normalization, resizing, and augmentation (random flips, crops, color jitter). Feature extractors learn hierarchical representations, from low-level edges up to high-level semantic concepts.
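As a minimal sketch of the normalization and flip-augmentation steps described above (the ImageNet channel statistics shown are a common convention for pretrained backbones, not something specific to this article):

```python
import numpy as np

def preprocess(img, mean, std, flip=False):
    """Normalize an HxWxC uint8 image and optionally flip it horizontally."""
    x = img.astype(np.float32) / 255.0   # scale pixel values to [0, 1]
    x = (x - mean) / std                 # per-channel normalization
    if flip:
        x = x[:, ::-1, :]                # horizontal flip (mirror left-right)
    return x

# Channel statistics commonly used with ImageNet-pretrained backbones
mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
std = np.array([0.229, 0.224, 0.225], dtype=np.float32)

img = np.zeros((4, 4, 3), dtype=np.uint8)       # dummy 4x4 RGB image
out = preprocess(img, mean, std, flip=True)     # shape stays (4, 4, 3)
```

Random crops and color jitter follow the same pattern: deterministic array operations driven by sampled parameters.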
Convolutional Neural Networks
Convolutions, pooling/strided convs, residual connections (ResNet), and depthwise separable convolutions (MobileNet) form the backbone of vision models. Transfer learning with pretrained backbones is standard.
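The parameter savings behind MobileNet-style depthwise separable convolutions can be seen with simple counting; this hypothetical comparison assumes a 3×3 kernel mapping 128 to 256 channels and ignores biases:

```python
def conv_params(c_in, c_out, k):
    """Parameters of a standard kxk convolution (bias ignored)."""
    return k * k * c_in * c_out

def dw_separable_params(c_in, c_out, k):
    """Depthwise kxk conv (one filter per input channel) + 1x1 pointwise conv."""
    return k * k * c_in + c_in * c_out

std_p = conv_params(128, 256, 3)          # 294,912 parameters
sep_p = dw_separable_params(128, 256, 3)  # 1,152 + 32,768 = 33,920 parameters
ratio = std_p / sep_p                     # roughly 8.7x fewer parameters
```

The same factorization reduces multiply-accumulate operations by a similar ratio, which is why these blocks dominate mobile-oriented backbones.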
Detection & Segmentation
Detectors: two-stage (Faster R-CNN) vs single-shot (YOLO/SSD). Segmentation: semantic (FCN, DeepLab), instance (Mask R-CNN), and panoptic segmentation (unified). Key trade-offs: speed vs accuracy, anchor-based vs anchor-free paradigms.
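Both anchor-based and anchor-free detectors typically post-process raw predictions with non-maximum suppression (NMS). A minimal greedy NMS sketch over `[x1, y1, x2, y2]` boxes, assuming score-sorted suppression at a fixed IoU threshold:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop overlapping rivals, repeat."""
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = order[1:]
        # Intersection of box i with all remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # suppress high-overlap, lower-scored boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], dtype=np.float32)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```

Production detectors use batched, class-aware variants (and sometimes soft-NMS), but the greedy core is the same.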
[Diagram: typical detector architecture — Backbone (CNN) → Head (Boxes, Masks)]
Evaluation Metrics
Classification: accuracy and F1; detection: mAP (mean Average Precision), typically averaged over IoU thresholds; segmentation: IoU / Dice. Also consider calibration and per-class analysis for imbalanced datasets.
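The segmentation metrics above reduce to set overlap between boolean masks. A minimal sketch, including the identity Dice = 2·IoU / (1 + IoU):

```python
import numpy as np

def mask_iou(pred, target):
    """Intersection over Union between two boolean masks."""
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union > 0 else 1.0

def dice(pred, target):
    """Dice coefficient: 2|A ∩ B| / (|A| + |B|)."""
    inter = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return 2 * inter / total if total > 0 else 1.0

pred = np.zeros((4, 4), dtype=bool)
target = np.zeros((4, 4), dtype=bool)
pred[0, 0:4] = True     # predicted mask: 4 pixels
target[0, 2:4] = True   # ground truth: 2 pixels, both inside the prediction
iou_val = mask_iou(pred, target)   # 2 / 4 = 0.5
dice_val = dice(pred, target)      # 4 / 6 ≈ 0.667
```

Per-class means of these quantities give mIoU and mean Dice, the standard semantic-segmentation benchmarks.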
Deployment
Edge inference relies on quantization, pruning, and hardware accelerators (VPUs, NPUs); cloud inference accommodates larger models and request batching. Real-time video analytics additionally requires pipeline-level optimizations to meet FPS and latency targets.
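Post-training quantization, one of the edge techniques mentioned above, can be sketched as symmetric per-tensor int8 quantization (a simplified scheme; real toolchains also calibrate activations and may quantize per-channel):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q."""
    scale = np.abs(w).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 codes."""
    return q.astype(np.float32) * scale

w = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
q, s = quantize_int8(w)
w_hat = dequantize(q, s)  # per-element error is at most scale / 2
```

The int8 codes cut weight storage by 4× versus float32 and enable integer arithmetic on NPUs; the `scale` factor is all that is needed to rescale accumulator outputs.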
References
- He et al., “Deep Residual Learning for Image Recognition” (ResNet), CVPR 2016
- Ren et al., “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, NeurIPS 2015
- Lin et al., “Focal Loss for Dense Object Detection” (RetinaNet), ICCV 2017