Single Shot MultiBox Detector (SSD)


  • Single Shot: the tasks of object localization and classification are done in a single forward pass of the network.

  • MultiBox: the name of a technique for bounding-box regression developed by Szegedy et al. (covered briefly below).

  • Detector: the network is an object detector that also classifies the detected objects.

Architecture

SSD300 uses a VGG-16 backbone (with its fully connected layers converted to convolutional Conv6 and Conv7), followed by extra convolutional layers Conv8_2 through Conv11_2. Detections are predicted from six feature maps of decreasing resolution: 38×38, 19×19, 10×10, 5×5, 3×3 and 1×1.

(Figure: SSD architecture)

Choosing default boundary boxes

SSD defines a scale value for each feature map layer. Starting from the left, Conv4_3 detects objects at the smallest scale, 0.2 (or sometimes 0.1), and the scale then increases linearly, reaching 0.9 at the rightmost layer.
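
With $m$ feature maps used for prediction, the per-layer scale given in the SSD paper interpolates linearly between $s_{min} = 0.2$ and $s_{max} = 0.9$:

$s_k = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(k - 1), \quad k \in [1, m]$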

Combining the scale value with the target aspect ratios, we compute the width and the height of the default boxes. For layers making 6 predictions, SSD starts with 5 target aspect ratios: 1, 2, 3, 1/2 and 1/3. Then the width and the height of the default boxes are calculated as:

$w_k^a = s_k \sqrt{a_r}, \quad h_k^a = \frac{s_k}{\sqrt{a_r}}$

For aspect ratio 1, an extra default box with scale $s_k' = \sqrt{s_k s_{k+1}}$ is also added, which is how 5 aspect ratios yield 6 boxes per location.
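
To make this concrete, here is a minimal Python sketch of the box generation (the helper names and the example layer are illustrative, not from the original):

    import math

    def default_box_sizes(s_k, s_k_next, aspect_ratios=(1.0, 2.0, 3.0, 0.5, 1.0 / 3.0)):
        """Return (w, h) pairs for one feature map layer, relative to the image size."""
        sizes = [(s_k * math.sqrt(a), s_k / math.sqrt(a)) for a in aspect_ratios]
        # Extra box for aspect ratio 1 at the intermediate scale s'_k = sqrt(s_k * s_{k+1})
        s_prime = math.sqrt(s_k * s_k_next)
        sizes.append((s_prime, s_prime))
        return sizes  # 6 (w, h) pairs per feature map location

    def scale(k, m=6, s_min=0.2, s_max=0.9):
        """Scale of the k-th feature map, as in the formula above."""
        return s_min + (s_max - s_min) * (k - 1) / (m - 1)

    print(default_box_sizes(scale(2), scale(3)))  # e.g. the second map (Conv7)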

Default boundary boxes are chosen manually. The criterion for matching a prior to a ground-truth box is IoU (Intersection over Union), also called the Jaccard index: the more overlap, the better the match. The matching process works as follows:

# Stage 1: each ground-truth box claims the prior it overlaps most
for every ground-truth box:
    match the ground-truth box with the prior having the biggest IoU
# Stage 2: remaining priors match any ground truth they overlap enough
for every prior:
    ious = IoU(prior, ground_truth_boxes)
    max_iou = max(ious)
    if max_iou > threshold:
        i = argmax(ious)
        match the prior with ground_truth_boxes[i]
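
Below is a runnable numpy version of this matching; the (x1, y1, x2, y2) box format, the IoU helper, and the 0.5 threshold are illustrative assumptions, not from the original.

    import numpy as np

    def iou(box, boxes):
        """IoU between one box and an (N, 4) array of boxes, format (x1, y1, x2, y2)."""
        x1 = np.maximum(box[0], boxes[:, 0])
        y1 = np.maximum(box[1], boxes[:, 1])
        x2 = np.minimum(box[2], boxes[:, 2])
        y2 = np.minimum(box[3], boxes[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_box = (box[2] - box[0]) * (box[3] - box[1])
        area_boxes = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return inter / (area_box + area_boxes - inter)

    def match_priors(priors, gt_boxes, threshold=0.5):
        """Map each prior to a ground-truth index; -1 means background (unmatched)."""
        ious = np.stack([iou(p, gt_boxes) for p in priors])  # (num_priors, num_gt)
        matches = np.full(len(priors), -1, dtype=int)
        # Stage 1: every ground-truth box claims its best-overlapping prior
        for g in range(gt_boxes.shape[0]):
            matches[np.argmax(ious[:, g])] = g
        # Stage 2: any unmatched prior with enough overlap also gets matched
        for p in range(len(priors)):
            if matches[p] == -1 and ious[p].max() > threshold:
                matches[p] = int(np.argmax(ious[p]))
        return matches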

YOLO, by contrast, uses k-means clustering on the training-set bounding boxes to determine its default boxes.

Try to verify the number of default boxes in SSD300 (the one implemented)

The center of each default box is $\left(\frac{i+0.5}{|f_k|}, \frac{j+0.5}{|f_k|}\right)$, where $|f_k|$ is the size of the $k$-th square feature map and $i, j \in [0, |f_k|)$.

For Conv4_3, Conv10_2 and Conv11_2, we only associate 4 default boxes at each feature map location, omitting the aspect ratios $\frac{1}{3}$ and $3$.

Conv4_3: $38 \cdot 38 \cdot 4 = 5776$

Conv7: $19 \cdot 19 \cdot 6 = 2166$

Conv8_2: $10 \cdot 10 \cdot 6 = 600$

Conv9_2: $5 \cdot 5 \cdot 6 = 150$

Conv10_2: $3 \cdot 3 \cdot 4 = 36$

Conv11_2: $1 \cdot 1 \cdot 4 = 4$

Total: $5776 + 2166 + 600 + 150 + 36 + 4 = 8732$
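
As a quick check, summing these per-layer counts in Python reproduces the total:

    # (feature map size, default boxes per location) for each SSD300 prediction layer
    layers = {"conv4_3": (38, 4), "conv7": (19, 6), "conv8_2": (10, 6),
              "conv9_2": (5, 6), "conv10_2": (3, 4), "conv11_2": (1, 4)}
    total = sum(f * f * n for f, n in layers.values())
    print(total)  # 8732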

Loss Function

MultiBox's loss function also combined two critical components that made their way into SSD:

  • Confidence Loss: this measures how confident the network is of the objectness of the computed bounding box. Categorical cross-entropy is used to compute this loss.

  • Location Loss: this measures how far the network's predicted bounding boxes are from the ground-truth ones in the training set. The smooth L1 norm is used here.

$multiboxLoss = confidenceLoss + \alpha \cdot locationLoss$

where the $\alpha$ term balances the contribution of the location loss.
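
A minimal numpy sketch of this combined loss, assuming the priors have already been matched and their offsets encoded (function and argument names are illustrative):

    import numpy as np

    def smooth_l1(x):
        """Smooth L1, applied elementwise."""
        return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

    def multibox_loss(class_logits, class_targets, loc_preds, loc_targets, alpha=1.0):
        """class_logits: (N, C); class_targets: (N,) ints; loc_*: (N, 4) matched boxes."""
        # Confidence loss: categorical cross-entropy over the class scores
        shifted = class_logits - class_logits.max(axis=1, keepdims=True)  # stability
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        conf_loss = -log_probs[np.arange(len(class_targets)), class_targets].mean()
        # Location loss: smooth L1 between predicted and ground-truth offsets
        loc_loss = smooth_l1(loc_preds - loc_targets).sum(axis=1).mean()
        return conf_loss + alpha * loc_loss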

Hard Negative Mining

During training, most default boxes have a low IoU with every ground-truth box and are therefore interpreted as negative training examples, so we can end up with a disproportionate number of negatives. Instead of using all negative predictions, SSD sorts the negatives by confidence loss and keeps only the hardest ones, maintaining a negative-to-positive ratio of around 3:1. Negative samples must still be kept, because the network also needs to learn, and be explicitly told, what constitutes an incorrect detection.

(Figure: example of hard negative mining)
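
A sketch of that 3:1 selection (per-box confidence losses are assumed to be precomputed; names are illustrative):

    import numpy as np

    def select_hard_negatives(conf_losses, is_positive, neg_pos_ratio=3):
        """Keep all positives plus the highest-loss negatives, at ~3:1 neg:pos."""
        num_pos = int(is_positive.sum())
        neg_losses = np.where(is_positive, -np.inf, conf_losses)  # exclude positives
        num_neg = min(neg_pos_ratio * num_pos, int((~is_positive).sum()))
        hard_neg_idx = np.argsort(neg_losses)[::-1][:num_neg]  # hardest first
        keep = is_positive.copy()
        keep[hard_neg_idx] = True
        return keep  # boolean mask over boxes to include in the confidence loss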

Data Augmentation

  • Generate additional training examples from patches of the original image sampled at different minimum IoU ratios with the objects (e.g. 0.1, 0.3, 0.5), as well as random patches.

  • Each image is also randomly flipped horizontally with probability 0.5 (see the sketch below).
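
A sketch of the random horizontal flip (the IoU-constrained patch sampling is omitted; the array layout is an assumption):

    import numpy as np

    def random_hflip(image, boxes, p=0.5):
        """Flip an (H, W, C) image and its (N, 4) x1,y1,x2,y2 boxes with probability p."""
        if np.random.rand() < p:
            image = image[:, ::-1]
            w = image.shape[1]
            boxes = boxes.copy()
            boxes[:, [0, 2]] = w - boxes[:, [2, 0]]  # mirror the x-coordinates
        return image, boxes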
