Fast R-CNN
Instead of feeding the region proposals to the CNN, the author feeds the input image to the CNN to generate a convolutional feature map.
From the convolutional feature map, the regions of proposals are identified and warped into squares, and an RoI pooling layer reshapes them to a fixed size so that they can be fed into a fully connected layer.
From the RoI feature vector, a softmax layer predicts the class of the proposed region, and a sibling layer predicts the offset values for its bounding box, as sketched below.
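A minimal sketch of this pipeline, assuming a VGG16 backbone (with its last max pool removed, so the feature map stride is 16) and torchvision's `roi_pool`; the layer sizes, class count, and example proposals are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torchvision

K = 20                                   # assumed number of foreground classes
# Conv feature extractor; dropping the final max pool gives stride 16.
backbone = torchvision.models.vgg16(weights=None).features[:-1]

image = torch.randn(1, 3, 512, 512)      # one input image
feature_map = backbone(image)            # the convolution runs ONCE per image

# Region proposals in image coordinates: (batch_index, x1, y1, x2, y2).
proposals = torch.tensor([[0, 10., 20., 200., 220.],
                          [0, 50., 60., 300., 310.]])

# RoI pooling reshapes each region to a fixed 7x7 grid; spatial_scale
# maps image coordinates onto the stride-16 feature map.
rois = torchvision.ops.roi_pool(feature_map, proposals,
                                output_size=(7, 7), spatial_scale=1.0 / 16)

# Fully connected layers produce the RoI feature vector, followed by two
# sibling heads: softmax classification and per-class box offsets.
fc = torch.nn.Sequential(torch.nn.Flatten(),
                         torch.nn.Linear(512 * 7 * 7, 4096), torch.nn.ReLU())
cls_head = torch.nn.Linear(4096, K + 1)         # K classes + background
box_head = torch.nn.Linear(4096, 4 * (K + 1))   # (x, y, w, h) per class

feats = fc(rois)
class_probs = cls_head(feats).softmax(dim=-1)
box_offsets = box_head(feats)
```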
The model is optimized for a loss combining two tasks (classification + localization):
| Symbol | Explanation |
| --- | --- |
| $u$ | True class label, $u \in \{0, 1, \dots, K\}$; by convention, the catch-all background class has $u = 0$. |
| $p$ | Discrete probability distribution (per RoI) over $K + 1$ classes: $p = (p_0, \dots, p_K)$, computed by a softmax over the $K + 1$ outputs of a fully connected layer. |
| $v$ | True bounding box $v = (v_x, v_y, v_w, v_h)$. |
| $t^u$ | Predicted bounding-box correction, $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$. |
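For reference, the correction $t^u$ follows the standard R-CNN parameterization: a scale-invariant translation plus a log-space size shift of a proposal box onto its ground-truth box. The symbols $P$ (proposal) and $G$ (ground truth) are introduced here only for illustration:

$$t_x = (G_x - P_x)/P_w, \qquad t_y = (G_y - P_y)/P_h, \qquad t_w = \log(G_w/P_w), \qquad t_h = \log(G_h/P_h)$$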
The loss function sums up the cost of classification and bounding-box prediction: $\mathcal{L} = \mathcal{L}_\text{cls} + \mathcal{L}_\text{box}$. For a "background" RoI, $\mathcal{L}_\text{box}$ is ignored by the indicator function $\mathbb{1}[u \geq 1]$, defined as:

$$\mathbb{1}[u \geq 1] = \begin{cases} 1 & \text{if } u \geq 1 \\ 0 & \text{otherwise} \end{cases}$$
The overall loss function is:

$$\mathcal{L}(p, u, t^u, v) = \mathcal{L}_\text{cls}(p, u) + \mathbb{1}[u \geq 1]\, \mathcal{L}_\text{box}(t^u, v)$$

$$\mathcal{L}_\text{cls}(p, u) = -\log p_u$$

$$\mathcal{L}_\text{box}(t^u, v) = \sum_{i \in \{x, y, w, h\}} L_1^\text{smooth}(t^u_i - v_i)$$
The bounding-box loss $\mathcal{L}_\text{box}$ should measure the difference between $t^u_i$ and $v_i$ using a robust loss function. The smooth L1 loss is adopted here, and it is claimed to be less sensitive to outliers:

$$L_1^\text{smooth}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
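To make the indicator and the per-class offset selection concrete, here is a hedged sketch of this multi-task loss in PyTorch; the function name `fast_rcnn_loss`, the tensor shapes, and the batch normalization of the box term are assumptions for illustration. (The paper additionally weights $\mathcal{L}_\text{box}$ by a hyperparameter $\lambda$, set to 1.)

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_scores, box_offsets, labels, box_targets, K):
    """class_scores: (N, K+1) raw logits; box_offsets: (N, 4*(K+1));
    labels: (N,) with 0 = background; box_targets: (N, 4) targets v."""
    # L_cls(p, u) = -log p_u, i.e. cross-entropy over the K+1 classes.
    cls_loss = F.cross_entropy(class_scores, labels)

    # L_box is computed only for foreground RoIs (the indicator 1[u >= 1]).
    fg = labels >= 1
    if fg.any():
        # Select the 4 offsets predicted for each RoI's true class u.
        offsets = box_offsets.view(-1, K + 1, 4)[fg, labels[fg]]
        # Smooth L1: 0.5 x^2 if |x| < 1, else |x| - 0.5, summed elementwise.
        box_loss = F.smooth_l1_loss(offsets, box_targets[fg], reduction="sum")
        box_loss = box_loss / labels.numel()  # assumed batch normalization
    else:
        box_loss = class_scores.new_zeros(())

    return cls_loss + box_loss
```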
The reason Fast R-CNN is faster than R-CNN is that the ~2000 region proposals no longer have to be fed through the convolutional neural network individually. Instead, the convolution operation is done only once per image, and a feature map is generated from it.