Fast RCNN
Network Architecture

Instead of feeding the region proposals to the CNN, the author feeds the whole input image to the CNN once to generate a convolutional feature map.
From the convolutional feature map, the region proposals are identified and projected onto the feature map, and a RoI pooling layer reshapes each of them to a fixed size so that it can be fed into fully connected layers.
From the RoI feature vector, a softmax layer predicts the class of the proposed region, and a sibling regression layer predicts the offset values for the bounding box.
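The RoI pooling step can be sketched as a max-pool over an adaptive grid. This is a simplified illustration of the idea, not the paper's implementation; `roi_pool` is a hypothetical helper, and it assumes the RoI (given in feature-map coordinates) spans at least `output_size` cells in each dimension:

```python
import numpy as np

def roi_pool(feature_map, roi, output_size):
    """Max-pool an RoI of arbitrary size into a fixed output_size grid.

    feature_map: (C, H, W) array; roi: (x1, y1, x2, y2) in feature-map
    coordinates; output_size: (h_out, w_out) of the pooled grid.
    Sketch only: assumes the RoI is at least output_size in each dimension.
    """
    x1, y1, x2, y2 = roi
    h_out, w_out = output_size
    region = feature_map[:, y1:y2, x1:x2]
    c, h, w = region.shape
    # Split the region into an h_out x w_out grid of sub-windows
    y_edges = np.linspace(0, h, h_out + 1).astype(int)
    x_edges = np.linspace(0, w, w_out + 1).astype(int)
    pooled = np.zeros((c, h_out, w_out), dtype=feature_map.dtype)
    for i in range(h_out):
        for j in range(w_out):
            sub = region[:, y_edges[i]:y_edges[i + 1], x_edges[j]:x_edges[j + 1]]
            pooled[:, i, j] = sub.max(axis=(1, 2))  # max over each sub-window
    return pooled
```

Whatever the RoI's size, the output is always `(C, h_out, w_out)`, which is what lets the fixed-size fully connected layers consume variable-size proposals.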
Loss Function
The model is optimized for a loss combining two tasks (classification + localization):
| Symbol | Explanation |
| --- | --- |
| $u$ | True class label, $u \in \{0, 1, \dots, K\}$; by convention, the catch-all background class has $u = 0$. |
| $p$ | Discrete probability distribution (per RoI) over $K + 1$ classes: $p = (p_0, \dots, p_K)$, computed by a softmax over the $K + 1$ outputs of a fully connected layer. |
| $v$ | True bounding box, $v = (v_x, v_y, v_w, v_h)$. |
| $t^u$ | Predicted bounding box correction, $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$. |
The loss function sums up the cost of classification and bounding box prediction: $\mathcal{L} = \mathcal{L}_\text{cls} + \mathcal{L}_\text{box}$. For "background" RoIs, $\mathcal{L}_\text{box}$ is ignored by the indicator function $\mathbb{1}[u \geq 1]$, defined as:

$$\mathbb{1}[u \geq 1] = \begin{cases} 1 & \text{if } u \geq 1 \\ 0 & \text{otherwise} \end{cases}$$

The overall loss function is:

$$\mathcal{L}(p, u, t^u, v) = \mathcal{L}_\text{cls}(p, u) + \mathbb{1}[u \geq 1]\, \mathcal{L}_\text{box}(t^u, v)$$

$$\mathcal{L}_\text{cls}(p, u) = -\log p_u$$

$$\mathcal{L}_\text{box}(t^u, v) = \sum_{i \in \{x, y, w, h\}} L_1^\text{smooth}(t^u_i - v_i)$$

The bounding box loss $\mathcal{L}_\text{box}$ should measure the difference between $t^u_i$ and $v_i$ using a robust loss function. The smooth L1 loss is adopted here, and it is claimed to be less sensitive to outliers:

$$L_1^\text{smooth}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
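The smooth L1 loss and the combined classification + localization loss can be sketched numerically. This is a minimal illustration with hypothetical function names; `lam` is the balancing weight between the two terms, assumed to be 1 here:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1 loss, elementwise:
    0.5 * x^2      if |x| < 1
    |x| - 0.5      otherwise
    Quadratic near zero, linear for large residuals, so outliers
    contribute a bounded gradient (unlike an L2 loss).
    """
    x = np.asarray(x, dtype=float)
    return np.where(np.abs(x) < 1, 0.5 * x ** 2, np.abs(x) - 0.5)

def fast_rcnn_loss(p, u, t_u, v, lam=1.0):
    """Combined loss sketch: -log p_u plus the box loss for non-background RoIs.

    p: softmax class probabilities; u: true class (0 = background);
    t_u: predicted box correction for class u; v: regression target.
    """
    l_cls = -np.log(p[u])
    # The indicator 1[u >= 1] drops the box loss for background RoIs
    l_box = smooth_l1(np.asarray(t_u) - np.asarray(v)).sum() if u >= 1 else 0.0
    return l_cls + lam * l_box
```

Note how the loss is quadratic for small residuals (`smooth_l1(0.5) == 0.125`) but only linear for large ones (`smooth_l1(2.0) == 1.5`), which is the claimed robustness to outliers.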

Implementation
Why faster than R-CNN?
The reason Fast R-CNN is faster than R-CNN is that the ~2,000 region proposals no longer each require their own pass through the convolutional neural network. Instead, the convolution operation is done only once per image, and all proposals share the resulting feature map.
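The per-proposal work that remains is just a cheap coordinate projection onto the shared feature map, followed by pooling. A sketch with a hypothetical `project_roi` helper (the stride of 16 is an assumption matching a VGG-16-style backbone, where the feature map is 1/16 of the image resolution):

```python
def project_roi(roi_img, stride=16):
    """Project an RoI from image coordinates to feature-map coordinates.

    stride is the backbone's total downsampling factor (assumed 16 here).
    Because the feature map is computed once per image, each proposal
    only needs this projection plus a pooling step, instead of its own
    CNN forward pass as in R-CNN.
    """
    x1, y1, x2, y2 = roi_img
    # Floor the top-left corner, ceil the bottom-right corner
    return (x1 // stride, y1 // stride, -(-x2 // stride), -(-y2 // stride))
```

So for 2,000 proposals, R-CNN runs 2,000 CNN forward passes while Fast R-CNN runs one, plus 2,000 of these near-free projections and pooling operations.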