ML_101
Faster RCNN


Last updated 3 years ago


Region Proposal Network

  1. First, the image passes through convolutional layers, which extract feature maps.

  2. Then the RPN slides a small window over every location of the feature map.

  3. For each location, $k$ ($k=9$) anchor boxes are used (3 scales of 128, 256 and 512, and 3 aspect ratios of 1:1, 1:2, 2:1) for generating region proposals.

  4. A cls layer outputs $2k$ scores, indicating whether or not there is an object in each of the $k$ boxes.

  5. A reg layer outputs $4k$ values for the coordinates (box center coordinates, width and height) of the $k$ boxes.

  6. For a feature map of size $W \times H$, there are $W \times H \times k$ anchors in total.
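
The anchor layout in the list above can be sketched in a few lines of plain Python. This is a minimal sketch rather than the paper's implementation: the function name `make_anchors`, the stride of 16 pixels between feature-map cells, the convention that a scale `s` means an anchor area of `s * s`, and the reading of each ratio as height divided by width are all assumptions made for illustration.

```python
import math

def make_anchors(feat_w, feat_h, stride=16,
                 scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return (cx, cy, w, h) anchors, k = len(scales) * len(ratios) per location."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # Center of this feature-map cell in image pixels (assumed mapping).
            cx = (x + 0.5) * stride
            cy = (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # Keep the area at s * s while setting height / width = r.
                    w = s / math.sqrt(r)
                    h = s * math.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return anchors

k = 9
anchors = make_anchors(feat_w=4, feat_h=3)
print(len(anchors))    # W * H * k anchors in total
print(2 * k, 4 * k)    # per-location sizes of the cls and reg outputs
```

With a 4 x 3 feature map the sketch produces 4 * 3 * 9 anchors, matching the $W \times H \times k$ count in step 6.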

Network Architecture

  • Similar to Fast R-CNN, the image is provided as an input to a convolutional network which provides a convolutional feature map.

  • Instead of using the selective search algorithm to identify region proposals, a separate network (the Region Proposal Network) predicts them from the feature map.

  • The predicted region proposals are then reshaped to a fixed size by an RoI pooling layer. The pooled features are used to classify the object within each proposed region and to predict the offset values for its bounding box.
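
To make the RoI pooling step more concrete, here is a rough sketch of max pooling one region of a single-channel feature map down to a fixed grid. The function `roi_max_pool`, the half-open `(x0, y0, x1, y1)` region format in feature-map cells, the 2 x 2 output size, and the floor/ceil bin boundaries are assumptions for illustration; real implementations work on multi-channel tensors and differ in how they quantize the bins.

```python
import math

def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """Max-pool the region roi of a 2D feature map to an out_h x out_w grid."""
    x0, y0, x1, y1 = roi  # half-open: columns x0..x1-1, rows y0..y1-1
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range of bin i; floor/ceil guarantees every bin is non-empty.
        r_start = y0 + math.floor(i * h / out_h)
        r_end = y0 + math.ceil((i + 1) * h / out_h)
        row = []
        for j in range(out_w):
            c_start = x0 + math.floor(j * w / out_w)
            c_end = x0 + math.ceil((j + 1) * w / out_w)
            row.append(max(feature[r][c]
                           for r in range(r_start, r_end)
                           for c in range(c_start, c_end)))
        pooled.append(row)
    return pooled

# Toy 5 x 6 feature map whose value at (row, col) is 6 * row + col.
feature = [[6 * r + c for c in range(6)] for r in range(5)]
small = roi_max_pool(feature, (0, 0, 3, 2))  # region 3 wide, 2 tall
large = roi_max_pool(feature, (1, 1, 6, 5))  # region 5 wide, 4 tall
print(small, large)
```

Both regions come out as 2 x 2 grids even though their input sizes differ, which is what lets the following fully connected layers accept proposals of any shape.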

Bounding Box Regression

Given a predicted bounding box coordinate $\mathbf{p} = (p_x, p_y, p_w, p_h)$ (center coordinate, width, height) and its corresponding ground truth box coordinates $\mathbf{g} = (g_x, g_y, g_w, g_h)$, the regressor is configured to learn a scale-invariant transformation between the two centers and a log-scale transformation between the widths and heights. All the transformation functions take $\mathbf{p}$ as input.

$$
\begin{aligned}
\hat{g}_x &= p_w d_x(\mathbf{p}) + p_x \\
\hat{g}_y &= p_h d_y(\mathbf{p}) + p_y \\
\hat{g}_w &= p_w \exp({d_w(\mathbf{p})}) \\
\hat{g}_h &= p_h \exp({d_h(\mathbf{p})})
\end{aligned}
$$

An obvious benefit of applying such a transformation is that all the bounding box correction functions, $d_i(\mathbf{p})$ where $i \in \{ x, y, w, h \}$, can take any value in $(-\infty, +\infty)$. The targets for them to learn are:

$$
\begin{aligned}
t_x &= (g_x - p_x) / p_w \\
t_y &= (g_y - p_y) / p_h \\
t_w &= \log(g_w/p_w) \\
t_h &= \log(g_h/p_h)
\end{aligned}
$$

A standard regression model can solve the problem by minimizing the sum of squared errors (SSE) loss with regularization:

$$
\mathcal{L}_\text{reg} = \sum_{i \in \{x, y, w, h\}} (t_i - d_i(\mathbf{p}))^2 + \lambda \|\mathbf{w}\|^2
$$

The regularization term is critical here, and the R-CNN paper picked the best $\lambda$ by cross-validation. It is also noteworthy that not all the predicted bounding boxes have corresponding ground truth boxes. For example, if there is no overlap, it does not make sense to run bbox regression. Here, only a predicted box with a nearby ground truth box with at least 0.6 IoU is kept for training the bbox regression model.

Loss Function

Faster R-CNN is optimized for a multi-task loss function, similar to Fast R-CNN.

| Symbol | Explanation |
| --- | --- |
| $p_i$ | Predicted probability of anchor $i$ being an object. |
| $p^*_i$ | Ground truth label (binary) of whether anchor $i$ is an object. |
| $t_i$ | Predicted four parameterized coordinates. |
| $t^*_i$ | Ground truth coordinates. |
| $N_\text{cls}$ | Normalization term, set to the mini-batch size (~256) in the paper. |
| $N_\text{box}$ | Normalization term, set to the number of anchor locations (~2400) in the paper. |
| $\lambda$ | A balancing parameter, set to ~10 in the paper (so that the $\mathcal{L}_\text{cls}$ and $\mathcal{L}_\text{box}$ terms are roughly equally weighted). |

The multi-task loss function combines the losses of classification and bounding box regression:

$$
\mathcal{L} = \mathcal{L}_\text{cls} + \mathcal{L}_\text{box}
$$

$$
\mathcal{L}(\{p_i\}, \{t_i\}) = \frac{1}{N_\text{cls}} \sum_i \mathcal{L}_\text{cls} (p_i, p^*_i) + \frac{\lambda}{N_\text{box}} \sum_i p^*_i \cdot L_1^\text{smooth}(t_i - t^*_i)
$$

where $\mathcal{L}_\text{cls}$ is the log loss function over two classes, as we can easily translate a multi-class classification into a binary classification by predicting a sample being a target object versus not. $L_1^\text{smooth}$ is the smooth L1 loss.

$$
\mathcal{L}_\text{cls} (p_i, p^*_i) = - p^*_i \log p_i - (1 - p^*_i) \log (1 - p_i)
$$
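
As a sanity check on the multi-task loss, the following sketch evaluates it for a handful of anchors in plain Python. It is an illustration under stated assumptions, not the reference implementation: the function names, the clamping constant `eps`, and the toy inputs are hypothetical, and `smooth_l1` uses the usual definition from Fast R-CNN ($0.5x^2$ for $|x| < 1$, otherwise $|x| - 0.5$), which this page does not spell out.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic for |x| < 1, linear otherwise."""
    return 0.5 * x * x if abs(x) < 1.0 else abs(x) - 0.5

def log_loss(p, p_star, eps=1e-12):
    """Binary log loss; p is clamped by eps (an assumption) to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -p_star * math.log(p) - (1 - p_star) * math.log(1 - p)

def rpn_loss(p, p_star, t, t_star, n_cls=256, n_box=2400, lam=10.0):
    """Classification term plus lambda-weighted box term over all anchors."""
    cls = sum(log_loss(pi, ps) for pi, ps in zip(p, p_star)) / n_cls
    box = 0.0
    for ps, ti, tsi in zip(p_star, t, t_star):
        # p_star gates the box term: only positive anchors contribute.
        box += ps * sum(smooth_l1(a - b) for a, b in zip(ti, tsi))
    return cls + lam * box / n_box

# Two toy anchors: the first is positive, the second is background.
p = [0.9, 0.2]
p_star = [1, 0]
t = [(0.1, 0.0, 0.0, 2.0), (5.0, 5.0, 5.0, 5.0)]
t_star = [(0.0, 0.0, 0.0, 0.0), (0.0, 0.0, 0.0, 0.0)]
loss = rpn_loss(p, p_star, t, t_star)
print(loss)
```

Because the box term is multiplied by $p^*_i$, the large offsets of the background anchor do not affect the result.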
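
Returning to the Bounding Box Regression section, the target parameterization and its inverse can be checked with a short round trip. This is a minimal sketch: `encode`, `decode`, and `iou` are hypothetical helper names, boxes are assumed to be `(cx, cy, w, h)` tuples, and the example boxes are made up. The 0.6 IoU threshold is the one quoted in that section.

```python
import math

def encode(p, g):
    """Regression targets t for a predicted box p and ground truth box g."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return ((gx - px) / pw, (gy - py) / ph,
            math.log(gw / pw), math.log(gh / ph))

def decode(p, d):
    """Apply corrections d = (dx, dy, dw, dh) to box p, giving g_hat."""
    px, py, pw, ph = p
    dx, dy, dw, dh = d
    return (pw * dx + px, ph * dy + py,
            pw * math.exp(dw), ph * math.exp(dh))

def iou(a, b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter)

p = (50.0, 50.0, 20.0, 40.0)        # predicted box
g_near = (52.0, 50.0, 24.0, 40.0)   # ground truth with high overlap
g_far = (55.0, 48.0, 30.0, 30.0)    # ground truth with lower overlap

# Keep only pairs with IoU >= 0.6 for training the regressor.
pairs = [(p, g) for g in (g_near, g_far) if iou(p, g) >= 0.6]
targets = [encode(pp, gg) for pp, gg in pairs]
recovered = decode(p, encode(p, g_near))
print(len(pairs), targets, recovered)
```

Decoding the encoded targets returns the ground truth box, which shows that a regressor predicting $d_i(\mathbf{p}) = t_i$ exactly would recover $\mathbf{g}$.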