Faster R-CNN
First, the image passes through the convolutional layers, which extract feature maps.
Then the Region Proposal Network (RPN) slides a small window over every location of the feature map.
At each location, k (k=9) anchor boxes are used (3 scales with box areas of 128², 256² and 512² pixels, and 3 aspect ratios of 1:1, 1:2 and 2:1) to generate region proposals.
A cls layer outputs 2k scores estimating whether or not there is an object in each of the k boxes.
A reg layer outputs 4k values for the coordinates (box center coordinates, width and height) of the k boxes.
For a feature map of size $W \times H$, there are $WHk$ anchors in total.
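The anchor layout described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference implementation: the function names `base_anchors` and `generate_anchors` are hypothetical, the stride of 16 pixels is an assumption (it matches a VGG-16 backbone), and the aspect ratio is interpreted here as height divided by width.

```python
import math

def base_anchors(scales=(128, 256, 512), ratios=(1.0, 0.5, 2.0)):
    """Return k = len(scales) * len(ratios) anchor shapes as (width, height)."""
    shapes = []
    for s in scales:
        for r in ratios:
            # Assumption: r = height / width, and each anchor keeps area s * s.
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            shapes.append((w, h))
    return shapes

def generate_anchors(feat_w, feat_h, stride=16):
    """Place every base anchor at every feature-map location.

    Returns a list of (cx, cy, w, h) boxes in image pixels.
    The stride (image pixels per feature-map cell) is an assumed value.
    """
    shapes = base_anchors()
    anchors = []
    for j in range(feat_h):
        for i in range(feat_w):
            # Center of the feature-map cell, mapped back to image coordinates.
            cx = (i + 0.5) * stride
            cy = (j + 0.5) * stride
            for w, h in shapes:
                anchors.append((cx, cy, w, h))
    return anchors

anchors = generate_anchors(feat_w=60, feat_h=40)
print(len(anchors))  # W * H * k = 60 * 40 * 9 = 21600
```

With roughly 2400 feature-map locations and k = 9, this gives on the order of 20,000 anchors per image, which is why the RPN later samples a mini-batch of anchors rather than training on all of them.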
As in Fast R-CNN, the image is fed to a convolutional network, which produces a convolutional feature map.
Instead of running the selective search algorithm on the feature map to identify region proposals, a separate network, the Region Proposal Network (RPN), predicts them.
The predicted region proposals are then reshaped to a fixed size by an RoI pooling layer, and the pooled features are used to classify the content of each proposed region and to predict the offset values for its bounding box.
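To make the reshaping step concrete, here is a rough sketch of RoI max pooling on a single-channel feature map stored as a nested list. The function name `roi_max_pool`, the inclusive `(x0, y0, x1, y1)` RoI format in feature-map cells, and the 2×2 output size are all assumptions chosen for illustration; real implementations operate on multi-channel tensors and typically pool to 7×7.

```python
def roi_max_pool(feature, roi, out_h=2, out_w=2):
    """Max-pool one region of interest to a fixed out_h x out_w grid.

    feature: 2-D list indexed as feature[row][col].
    roi: (x0, y0, x1, y1), inclusive, in feature-map cells (assumed format).
    """
    x0, y0, x1, y1 = roi
    roi_w = x1 - x0 + 1
    roi_h = y1 - y0 + 1
    pooled = []
    for py in range(out_h):
        # Bin edges: floor for the start, ceiling for the end, so no bin is empty.
        ys = y0 + (py * roi_h) // out_h
        ye = y0 + -(-((py + 1) * roi_h) // out_h)
        row = []
        for px in range(out_w):
            xs = x0 + (px * roi_w) // out_w
            xe = x0 + -(-((px + 1) * roi_w) // out_w)
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled

# A 4 x 5 feature map whose values increase left to right, top to bottom.
feature = [[r * 5 + c for c in range(5)] for r in range(4)]
print(roi_max_pool(feature, roi=(1, 0, 4, 3)))  # [[7, 9], [17, 19]]
```

Whatever the size of the proposal, the output grid has the same shape, which is what lets the fully connected classification and regression heads accept every region.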
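Given a predicted bounding box $\mathbf{p} = (p_x, p_y, p_w, p_h)$ (center coordinates, width, height) and its corresponding ground truth box $\mathbf{g} = (g_x, g_y, g_w, g_h)$, the regressor is configured to learn a scale-invariant transformation between the two centers and a log-scale transformation between the widths and heights. All the transformation functions take $\mathbf{p}$ as input.

$$
\begin{aligned}
\hat{g}_x &= p_w d_x(\mathbf{p}) + p_x \\
\hat{g}_y &= p_h d_y(\mathbf{p}) + p_y \\
\hat{g}_w &= p_w \exp(d_w(\mathbf{p})) \\
\hat{g}_h &= p_h \exp(d_h(\mathbf{p}))
\end{aligned}
$$

An obvious benefit of applying such a transformation is that all the bounding box correction functions $d_i(\mathbf{p})$, where $i \in \{x, y, w, h\}$, can take any value in $(-\infty, +\infty)$. The targets for them to learn are:

$$
\begin{aligned}
t_x &= (g_x - p_x) / p_w \\
t_y &= (g_y - p_y) / p_h \\
t_w &= \log(g_w / p_w) \\
t_h &= \log(g_h / p_h)
\end{aligned}
$$

A standard regression model can solve the problem by minimizing the SSE loss with regularization:

$$
\mathcal{L}_\text{reg} = \sum_{i \in \{x, y, w, h\}} \left(t_i - d_i(\mathbf{p})\right)^2 + \lambda \lVert \mathbf{w} \rVert^2
$$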
The regularization term is critical here, and the R-CNN paper picked the best λ by cross-validation. It is also noteworthy that not all predicted bounding boxes have a corresponding ground truth box. For example, if a predicted box overlaps no ground truth box, it does not make sense to run bbox regression on it. Here, a predicted box is kept for training the bbox regression model only if it has a nearby ground truth box with an IoU of at least 0.6.
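The target computation and the IoU filter described above could look roughly like this in plain Python. It is a sketch under stated assumptions: boxes are `(cx, cy, w, h)` tuples, and the helper names `encode`, `decode`, `iou` and `training_pairs` are hypothetical, not taken from any library.

```python
import math

def encode(p, g):
    """Regression targets that map predicted box p onto ground truth box g."""
    px, py, pw, ph = p
    gx, gy, gw, gh = g
    return ((gx - px) / pw,        # center shift, scaled by box width
            (gy - py) / ph,        # center shift, scaled by box height
            math.log(gw / pw),     # log-scale width ratio
            math.log(gh / ph))     # log-scale height ratio

def decode(p, d):
    """Apply corrections d = (dx, dy, dw, dh) to box p (inverse of encode)."""
    px, py, pw, ph = p
    dx, dy, dw, dh = d
    return (pw * dx + px, ph * dy + py, pw * math.exp(dw), ph * math.exp(dh))

def iou(a, b):
    """Intersection over union of two (cx, cy, w, h) boxes."""
    ax0, ay0, ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2, a[0] + a[2] / 2, a[1] + a[3] / 2
    bx0, by0, bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2, b[0] + b[2] / 2, b[1] + b[3] / 2
    iw = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    ih = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def training_pairs(proposals, gts, thresh=0.6):
    """Keep only proposals whose best-matching ground truth has IoU >= thresh."""
    pairs = []
    for p in proposals:
        best = max(gts, key=lambda g: iou(p, g))
        if iou(p, best) >= thresh:
            pairs.append((p, best, encode(p, best)))
    return pairs

p, g = (50, 50, 20, 40), (55, 60, 40, 20)
t = encode(p, g)
print(t)             # targets for d_x, d_y, d_w, d_h
print(decode(p, t))  # recovers g up to floating-point error
```

Dividing the center shift by the box size is what makes the targets scale-invariant: the same relative shift produces the same target for a small box and a large one.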
Faster R-CNN is optimized for a multi-task loss function, similar to Fast R-CNN.
| Symbol | Explanation |
| --- | --- |
| $p_i$ | Predicted probability of anchor i being an object. |
| $p_i^*$ | Ground truth label (binary) of whether anchor i is an object. |
| $t_i$ | Predicted four parameterized coordinates. |
| $t_i^*$ | Ground truth coordinates. |
| $N_\text{cls}$ | Normalization term, set to the mini-batch size (~256) in the paper. |
| $N_\text{box}$ | Normalization term, set to the number of anchor locations (~2400) in the paper. |
| $\lambda$ | A balancing parameter, set to ~10 in the paper (so that the $\mathcal{L}_\text{cls}$ and $\mathcal{L}_\text{box}$ terms are roughly equally weighted). |
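The multi-task loss function combines the losses of classification and bounding box regression:

$$
\begin{aligned}
\mathcal{L} &= \mathcal{L}_\text{cls} + \mathcal{L}_\text{box} \\
\mathcal{L}(\{p_i\}, \{t_i\}) &= \frac{1}{N_\text{cls}} \sum_i \mathcal{L}_\text{cls}(p_i, p_i^*) + \frac{\lambda}{N_\text{box}} \sum_i p_i^* \cdot L_1^\text{smooth}(t_i - t_i^*)
\end{aligned}
$$

where $\mathcal{L}_\text{cls}$ is the log loss function over two classes, as we can easily translate a multi-class classification into a binary classification by predicting whether a sample is a target object or not:

$$
\mathcal{L}_\text{cls}(p_i, p_i^*) = -p_i^* \log p_i - (1 - p_i^*) \log(1 - p_i)
$$

$L_1^\text{smooth}$ is the smooth L1 loss, which equals $0.5x^2$ when $|x| < 1$ and $|x| - 0.5$ otherwise. Because the regression term is multiplied by $p_i^*$, only anchors labeled as objects contribute to the box loss.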
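As a rough illustration of how the two terms combine, the loss can be written in plain Python as below. This is a sketch, not the paper's code: the function names are hypothetical, the default normalizers and balancing weight (256, 2400, 10) are the approximate values quoted above, and the small `eps` clamp is an added assumption to avoid taking the log of zero.

```python
import math

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5

def log_loss(p, p_star, eps=1e-12):
    """Binary log loss for predicted objectness p and 0/1 label p_star."""
    p = min(max(p, eps), 1.0 - eps)  # clamp (assumption) to keep log finite
    return -(p_star * math.log(p) + (1 - p_star) * math.log(1 - p))

def rpn_loss(probs, labels, t_pred, t_true, n_cls=256, n_box=2400, lam=10.0):
    """Classification loss plus lam-weighted box loss over a batch of anchors.

    probs, labels: per-anchor objectness probability and 0/1 label.
    t_pred, t_true: per-anchor 4-tuples of parameterized box coordinates.
    """
    cls_term = sum(log_loss(p, y) for p, y in zip(probs, labels)) / n_cls
    # Multiplying by the label y means background anchors add no box loss.
    box_term = sum(
        y * sum(smooth_l1(a - b) for a, b in zip(tp, tt))
        for y, tp, tt in zip(labels, t_pred, t_true)
    ) / n_box
    return cls_term + lam * box_term

loss = rpn_loss(
    probs=[0.9, 0.2],
    labels=[1, 0],
    t_pred=[(0.1, 0.0, 0.2, 0.0), (0.0, 0.0, 0.0, 0.0)],
    t_true=[(0.0, 0.0, 0.0, 0.0), (3.0, 3.0, 3.0, 3.0)],
)
print(loss)
```

In the usage example the second anchor is background, so its large coordinate error is ignored and only its classification loss counts.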