Fast R-CNN

Network Architecture

Fast R-CNN architecture (figure)
  • Instead of feeding the region proposals to the CNN, the author feeds the whole input image to the CNN once to generate a convolutional feature map.

  • From the convolutional feature map, the author identifies the regions of proposals, warps them into squares, and uses an RoI pooling layer to reshape them to a fixed size so that they can be fed into a fully connected layer.

  • From the RoI feature vector, the author uses a softmax layer to predict the class of the proposed region and a separate regression layer to predict the offset values for the bounding box (see the sketch below).
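A minimal sketch of this forward path using PyTorch and torchvision's RoI pooling operator; the backbone feature map, layer sizes, number of classes, and the 1/16 spatial scale are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

num_classes = 21                                   # e.g. 20 object classes + background (assumed)
features = torch.randn(1, 512, 32, 32)             # conv feature map of one 512x512 image (assumed backbone)
rois = torch.tensor([[0., 10., 10., 200., 150.],   # proposals as (batch_idx, x1, y1, x2, y2)
                     [0., 50., 40., 300., 280.]])  # in input-image coordinates

# RoI pooling: every proposal becomes a fixed-size 512x7x7 tensor
pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
flat = pooled.flatten(start_dim=1)                 # (num_rois, 512*7*7) RoI feature vectors

fc = nn.Sequential(nn.Linear(512 * 7 * 7, 1024), nn.ReLU())
cls_head = nn.Linear(1024, num_classes)            # class scores -> softmax over K+1 classes
bbox_head = nn.Linear(1024, num_classes * 4)       # per-class box offsets (t_x, t_y, t_w, t_h)

h = fc(flat)
class_probs = cls_head(h).softmax(dim=1)           # p in the loss section below
box_deltas = bbox_head(h)                          # t^u in the loss section below
```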

Loss Function

The model is optimized for a loss combining two tasks (classification + localization):

| Symbol | Explanation |
| --- | --- |
| $u$ | True class label, $u \in \{0, 1, \dots, K\}$; by convention, the catch-all background class has $u = 0$. |
| $p$ | Discrete probability distribution (per RoI) over $K + 1$ classes: $p = (p_0, \dots, p_K)$, computed by a softmax over the $K + 1$ outputs of a fully connected layer. |
| $v$ | True bounding box regression target, $v = (v_x, v_y, v_w, v_h)$. |
| $t^u$ | Predicted bounding box correction for class $u$, $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$; see the target parameterization sketch below. |
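The regression targets $v$ (and the predictions $t^u$) use the usual R-CNN box parameterization: center offsets scaled by the proposal size, and log-scale width/height ratios. A small sketch of how the targets could be computed; the function name and box layout below are illustrative assumptions:

```python
import torch

def box_regression_targets(proposals, gt_boxes):
    """Targets v = (v_x, v_y, v_w, v_h) for each proposal, using the standard
    R-CNN parameterization; boxes are (x1, y1, x2, y2) tensors of shape (N, 4)."""
    pw, ph = proposals[:, 2] - proposals[:, 0], proposals[:, 3] - proposals[:, 1]
    px, py = proposals[:, 0] + 0.5 * pw, proposals[:, 1] + 0.5 * ph
    gw, gh = gt_boxes[:, 2] - gt_boxes[:, 0], gt_boxes[:, 3] - gt_boxes[:, 1]
    gx, gy = gt_boxes[:, 0] + 0.5 * gw, gt_boxes[:, 1] + 0.5 * gh

    v_x = (gx - px) / pw          # center shift, scaled by proposal width
    v_y = (gy - py) / ph          # center shift, scaled by proposal height
    v_w = torch.log(gw / pw)      # log-scale width ratio
    v_h = torch.log(gh / ph)      # log-scale height ratio
    return torch.stack([v_x, v_y, v_w, v_h], dim=1)
```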

The loss function sums up the cost of classification and bounding box prediction: $\mathcal{L} = \mathcal{L}_\text{cls} + \mathcal{L}_\text{box}$. For a "background" RoI, $\mathcal{L}_\text{box}$ is ignored by the indicator function $\mathbb{1}[u \geq 1]$, defined as:

$$\mathbb{1}[u \geq 1] = \begin{cases} 1 & \text{if } u \geq 1 \\ 0 & \text{otherwise} \end{cases}$$

The overall loss function is:

$$\mathcal{L}(p, u, t^u, v) = \mathcal{L}_\text{cls}(p, u) + \mathbb{1}[u \geq 1] \, \mathcal{L}_\text{box}(t^u, v)$$

$$\mathcal{L}_\text{cls}(p, u) = -\log p_u$$

$$\mathcal{L}_\text{box}(t^u, v) = \sum_{i \in \{x, y, w, h\}} L_1^\text{smooth}(t^u_i - v_i)$$

The bounding box loss $\mathcal{L}_\text{box}$ should measure the difference between $t^u_i$ and $v_i$ using a robust loss function. The smooth L1 loss is adopted here because it is less sensitive to outliers than the L2 loss used for bounding box regression in R-CNN.

$$L_1^\text{smooth}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
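Putting the pieces together, a minimal sketch of the multi-task loss in PyTorch; it assumes class logits of shape (N, K+1), per-class box deltas of shape (N, (K+1)*4), integer labels $u$, and regression targets $v$, with all names below purely illustrative:

```python
import torch
import torch.nn.functional as F

def fast_rcnn_loss(class_logits, box_deltas, labels, targets):
    """L = L_cls + [u >= 1] * L_box over a batch of RoIs (illustrative sketch)."""
    # L_cls = -log p_u; cross_entropy applies the softmax internally
    cls_loss = F.cross_entropy(class_logits, labels)

    # select the 4 offsets t^u predicted for the true class u of each RoI
    n = labels.numel()
    t_u = box_deltas.view(n, -1, 4)[torch.arange(n), labels]

    # indicator [u >= 1]: background RoIs (u = 0) contribute no localization loss
    fg = labels >= 1
    if fg.any():
        # smooth_l1_loss with its default beta=1.0 matches the piecewise definition above
        box_loss = F.smooth_l1_loss(t_u[fg], targets[fg], reduction="sum") / n
    else:
        box_loss = torch.zeros((), device=class_logits.device)

    return cls_loss + box_loss
```

Note that this would be called on the raw classification scores (before the softmax in the architecture sketch above), since `cross_entropy` expects logits.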

Implementation

Why faster than R-CNN?

The reason Fast R-CNN is faster than R-CNN is that the roughly 2,000 region proposals are not each fed through the convolutional neural network. Instead, the convolution operation is done only once per image, and all proposals share the resulting feature map (see the sketch below).
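The contrast can be sketched as follows; the single conv layer standing in for the backbone and the random proposal boxes are purely illustrative assumptions:

```python
import torch
from torchvision.ops import roi_pool

backbone = torch.nn.Conv2d(3, 512, kernel_size=3, stride=16, padding=1)  # stand-in for a real CNN
image = torch.randn(1, 3, 512, 512)

xy = torch.rand(2000, 2) * 400                                # ~2000 hypothetical proposals
proposals = torch.cat([xy, xy + torch.rand(2000, 2) * 100 + 8], dim=1)   # (x1, y1, x2, y2)

# R-CNN style: crop + warp each proposal, then ~2000 separate CNN forward passes
# for box in proposals: feats = backbone(warp_and_crop(image, box))      # (pseudocode)

# Fast R-CNN style: one CNN forward pass, then cheap RoI pooling for all proposals
feature_map = backbone(image)                                 # computed once per image
rois = torch.cat([torch.zeros(2000, 1), proposals], dim=1)    # prepend batch index 0
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)                                           # torch.Size([2000, 512, 7, 7])
```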
