Fast R-CNN
Instead of feeding the region proposals to the CNN, the authors feed the whole input image to the CNN to generate a convolutional feature map.
From the convolutional feature map, they identify the region proposals, warp them into squares, and use a RoI pooling layer to reshape them into a fixed size so that they can be fed into a fully connected layer.
From the RoI feature vector, a softmax layer predicts the class of the proposed region, and a regression head predicts the offset values for its bounding box.
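The RoI pooling step above can be sketched with NumPy. The function name `roi_pool`, the `(x1, y1, x2, y2)` coordinate convention, and the 2×2 output size are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one region of interest down to a fixed output size.

    feature_map: (H, W) array; roi: (x1, y1, x2, y2) in feature-map
    coordinates. Hypothetical sketch, not the authors' code.
    """
    x1, y1, x2, y2 = roi
    region = feature_map[y1:y2, x1:x2]
    out_h, out_w = output_size
    # Split the region into a roughly even out_h x out_w grid of bins.
    h_edges = np.linspace(0, region.shape[0], out_h + 1).astype(int)
    w_edges = np.linspace(0, region.shape[1], out_w + 1).astype(int)
    pooled = np.empty(output_size)
    for i in range(out_h):
        for j in range(out_w):
            pooled[i, j] = region[h_edges[i]:h_edges[i + 1],
                                  w_edges[j]:w_edges[j + 1]].max()
    return pooled

fm = np.arange(36, dtype=float).reshape(6, 6)
pooled = roi_pool(fm, (0, 0, 4, 4))  # fixed 2x2 output regardless of RoI size
```

However large or small the proposed region, the output always has the same shape, which is what lets RoIs of different sizes feed a single fully connected layer.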
The reason Fast R-CNN is faster than R-CNN is that the ~2,000 region proposals are not each fed through the convolutional neural network. Instead, the convolution operation is done only once per image, and a shared feature map is generated from it.

The model is optimized for a loss combining two tasks (classification + localization):

| Symbol | Explanation |
| --- | --- |
| $u$ | True class label, $u \in \{0, 1, \dots, K\}$; by convention, the catch-all background class has $u = 0$. |
| $p$ | Discrete probability distribution (per RoI) over $K + 1$ classes: $p = (p_0, \dots, p_K)$, computed by a softmax over the $K + 1$ outputs of a fully connected layer. |
| $v$ | True bounding box $v = (v_x, v_y, v_w, v_h)$. |
| $t^u$ | Predicted bounding box correction, $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$. |

The loss function sums up the cost of classification and bounding box prediction: $\mathcal{L} = \mathcal{L}_\text{cls} + \mathcal{L}_\text{box}$. For a "background" RoI, $\mathcal{L}_\text{box}$ is ignored by the indicator function $\mathbb{1}[u \geq 1]$, defined as:

$$\mathbb{1}[u \geq 1] = \begin{cases} 1 & \text{if } u \geq 1 \\ 0 & \text{otherwise} \end{cases}$$

The overall loss function is:

$$\mathcal{L}(p, u, t^u, v) = \mathcal{L}_\text{cls}(p, u) + \lambda \, \mathbb{1}[u \geq 1] \, \mathcal{L}_\text{box}(t^u, v)$$

where the classification term is the cross-entropy loss $\mathcal{L}_\text{cls}(p, u) = -\log p_u$.
The bounding box loss $\mathcal{L}_\text{box}$ should measure the difference between $t^u$ and $v$ using a robust loss function. The smooth L1 loss is adopted here, and it is claimed to be less sensitive to outliers:

$$\mathcal{L}_\text{box}(t^u, v) = \sum_{i \in \{x, y, w, h\}} L_1^\text{smooth}(t^u_i - v_i)$$

$$L_1^\text{smooth}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
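Putting the pieces together, the multi-task loss can be sketched in NumPy; the function names and the toy numbers below are mine for illustration, not from the paper:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: 0.5 x^2 when |x| < 1, |x| - 0.5 otherwise."""
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x**2, np.abs(x) - 0.5)

def fast_rcnn_loss(p, u, t_u, v, lam=1.0):
    """L(p, u, t^u, v) = -log p_u + lam * [u >= 1] * sum_i smooth_l1(t^u_i - v_i)."""
    l_cls = -np.log(p[u])          # classification: cross-entropy on the true class
    if u == 0:                     # background RoI: box loss is ignored
        return l_cls
    l_box = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum()
    return l_cls + lam * l_box

# Toy example: one object class (K = 1) plus background.
p = np.array([0.1, 0.9])           # softmax output for one RoI
loss_fg = fast_rcnn_loss(p, u=1, t_u=[0.5, 0, 0, 0], v=[0, 0, 0, 0])
loss_bg = fast_rcnn_loss(p, u=0, t_u=[0.5, 0, 0, 0], v=[0, 0, 0, 0])
```

Because the box differences in the toy example are all below 1, the box term uses the quadratic branch of the smooth L1 loss; a large localization error would instead contribute only linearly, which is the claimed robustness to outliers.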