Fast RCNN

ROI Pooling

[Figure: fast rcnn]

Network Architecture

  • Instead of feeding the region proposals to the CNN, the author feeds the whole input image to the CNN to generate a convolutional feature map.

  • From the convolutional feature map, the author identifies the regions of proposals, warps them into squares, and, using an RoI pooling layer, reshapes them to a fixed size so that they can be fed into a fully connected layer (see the sketch below).

  • From the RoI feature vector, the author uses a softmax layer to predict the class of the proposed region, along with the offset values for its bounding box.
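
A minimal sketch of the RoI pooling step using torchvision's built-in roi_pool (the feature-map shape and stride assume a VGG16 backbone on a 224x224 input; the box coordinates are made up for illustration):

import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 512, 14, 14)   # VGG16 conv features for a 224x224 image (stride 16)
# One proposal in image coordinates: (batch_index, x1, y1, x2, y2)
boxes = torch.tensor([[0., 32., 32., 160., 160.]])
pooled = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)   # torch.Size([1, 512, 7, 7]) -- fixed size regardless of the RoI's size

Whatever the RoI's size in the image, the pooled output has the same fixed shape, which is what lets a single fully connected head consume every proposal.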

Loss Function

The model is optimized for a loss combining two tasks (classification + localization):

| Symbol | Explanation |
| --- | --- |
| $u$ | True class label, $u \in \{0, 1, \dots, K\}$; by convention, the catch-all background class has $u = 0$. |
| $p$ | Discrete probability distribution (per RoI) over $K + 1$ classes: $p = (p_0, \dots, p_K)$, computed by a softmax over the $K + 1$ outputs of a fully connected layer. |
| $v$ | True bounding box $v = (v_x, v_y, v_w, v_h)$. |
| $t^u$ | Predicted bounding box correction for class $u$, $t^u = (t^u_x, t^u_y, t^u_w, t^u_h)$. |

The loss function sums up the cost of classification and bounding box prediction: $\mathcal{L} = \mathcal{L}_\text{cls} + \mathcal{L}_\text{box}$. For "background" RoIs, $\mathcal{L}_\text{box}$ is ignored by the indicator function $\mathbb{1}[u \geq 1]$, defined as:

$$\mathbb{1}[u \geq 1] = \begin{cases} 1 & \text{if } u \geq 1 \\ 0 & \text{otherwise} \end{cases}$$

The overall loss function is:

$$\mathcal{L}(p, u, t^u, v) = \mathcal{L}_\text{cls}(p, u) + \mathbb{1}[u \geq 1] \, \mathcal{L}_\text{box}(t^u, v)$$

$$\mathcal{L}_\text{cls}(p, u) = -\log p_u$$

$$\mathcal{L}_\text{box}(t^u, v) = \sum_{i \in \{x, y, w, h\}} L_1^\text{smooth}(t^u_i - v_i)$$

The bounding box loss $\mathcal{L}_\text{box}$ should measure the difference between $t^u_i$ and $v_i$ using a robust loss function. The smooth L1 loss is adopted here, and it is claimed to be less sensitive to outliers than the L2 loss:

$$L_1^\text{smooth}(x) = \begin{cases} 0.5 x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
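
A tiny numeric sketch of this loss for a single RoI (the probabilities, targets, and class count are made up; PyTorch's smooth_l1_loss stands in for $L_1^\text{smooth}$ with its default threshold of 1):

import torch
import torch.nn.functional as F

# One RoI, K = 2 foreground classes plus background (class 0)
p   = torch.tensor([0.1, 0.7, 0.2])            # softmax output over K + 1 classes
u   = 1                                        # true class label; u >= 1, so L_box counts
t_u = torch.tensor([0.20, -0.10, 0.05, 0.30])  # predicted correction (x, y, w, h) for class u
v   = torch.tensor([0.25, -0.05, 0.00, 0.25])  # ground-truth regression target

l_cls = -torch.log(p[u])                           # L_cls = -log p_u
l_box = F.smooth_l1_loss(t_u, v, reduction='sum')  # sum of smooth L1 over x, y, w, h
loss  = l_cls + (1 if u >= 1 else 0) * l_box       # indicator zeroes L_box for background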

Implementation

import numpy as np
import torch
import torch.nn as nn
import torchvision

N_CLASS = 20  # number of foreground classes; 20 assumes PASCAL VOC, adjust for your dataset

class SlowROIPool(nn.Module):
    # RoI pooling: crop each proposal out of the feature map, then
    # adaptively max-pool the crop to a fixed output size.
    def __init__(self, output_size):
        super().__init__()
        self.maxpool = nn.AdaptiveMaxPool2d(output_size)
        self.size = output_size

    def forward(self, images, rois, roi_idx):
        # images: (N, C, H, W) feature maps
        # rois: (n, 4) numpy array of normalized (x1, y1, x2, y2) coordinates
        # roi_idx: (n,) indices mapping each RoI to its image in the batch
        n = rois.shape[0]
        h = images.size(2)
        w = images.size(3)
        x1 = rois[:, 0]
        y1 = rois[:, 1]
        x2 = rois[:, 2]
        y2 = rois[:, 3]

        # Scale normalized coordinates to feature-map pixels
        x1 = np.floor(x1 * w).astype(int)
        x2 = np.ceil(x2 * w).astype(int)
        y1 = np.floor(y1 * h).astype(int)
        y2 = np.ceil(y2 * h).astype(int)

        res = []
        for i in range(n):
            img = images[roi_idx[i]].unsqueeze(0)
            img = img[:, :, y1[i]:y2[i], x1[i]:x2[i]]  # crop the RoI
            img = self.maxpool(img)                    # pool to a fixed size
            res.append(img)
        res = torch.cat(res, dim=0)
        return res

class RCNN(nn.Module):
    def __init__(self):
        super().__init__()

        # VGG16 (with batch norm) backbone; drop the final max-pool so a
        # 224x224 input yields a 14x14 feature map
        rawnet = torchvision.models.vgg16_bn(pretrained=True)
        self.seq = nn.Sequential(*list(rawnet.features.children())[:-1])
        self.roipool = SlowROIPool(output_size=(7, 7))
        # Reuse VGG's fully connected layers, minus the final classifier
        self.feature = nn.Sequential(*list(rawnet.classifier.children())[:-1])

        # Run a dummy image and full-image RoI through the network
        # to infer the flattened feature dimension
        _x = torch.zeros(1, 3, 224, 224)
        _r = np.array([[0., 0., 1., 1.]])
        _ri = np.array([0])
        _x = self.feature(self.roipool(self.seq(_x), _r, _ri).view(1, -1))
        feature_dim = _x.size(1)
        self.cls_score = nn.Linear(feature_dim, N_CLASS+1)  # K classes + background
        self.bbox = nn.Linear(feature_dim, 4*(N_CLASS+1))   # per-class box regression

        self.cel = nn.CrossEntropyLoss()
        self.sl1 = nn.SmoothL1Loss()

    def forward(self, inp, rois, ridx):
        res = self.seq(inp)                  # one backbone pass per image
        res = self.roipool(res, rois, ridx)  # fixed-size features per RoI
        res = res.detach()                   # do not backprop into the backbone here
        res = res.view(res.size(0), -1)
        feat = self.feature(res)

        cls_score = self.cls_score(feat)
        bbox = self.bbox(feat).view(-1, N_CLASS+1, 4)
        return cls_score, bbox

    def calc_loss(self, probs, bbox, labels, gt_bbox):
        loss_sc = self.cel(probs, labels)  # L_cls: cross-entropy over K + 1 classes
        # Select each RoI's predicted box for its true class
        lbl = labels.view(-1, 1, 1).expand(labels.size(0), 1, 4)
        # Mask out background RoIs (label 0): the indicator 1[u >= 1]
        mask = (labels != 0).float().view(-1, 1).expand(labels.size(0), 4)
        loss_loc = self.sl1(bbox.gather(1, lbl).squeeze(1) * mask, gt_bbox * mask)
        lmb = 1.0
        loss = loss_sc + lmb * loss_loc
        return loss, loss_sc, loss_loc
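
A hypothetical usage sketch (batch size, RoI coordinates, labels, and targets are made up for illustration):

model = RCNN()
images = torch.randn(2, 3, 224, 224)
rois = np.array([[0.00, 0.00, 0.50, 0.50],    # normalized (x1, y1, x2, y2)
                 [0.25, 0.25, 1.00, 1.00]])
ridx = np.array([0, 1])                       # which image each RoI belongs to
scores, deltas = model(images, rois, ridx)    # (2, N_CLASS+1), (2, N_CLASS+1, 4)

labels = torch.tensor([1, 0])                 # class 0 = background
gt_bbox = torch.randn(2, 4)                   # regression targets (random here)
loss, l_cls, l_box = model.calc_loss(scores, deltas, labels, gt_bbox)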

Why faster than R-CNN?

Fast R-CNN is faster than R-CNN because the roughly 2,000 region proposals no longer each require their own pass through the convolutional neural network. Instead, the convolution operation is done only once per image, a feature map is generated from it, and every proposal is pooled from that shared map.
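
Schematically (pseudocode; backbone, crop, and roi_pool are placeholders, not a real API):

# R-CNN: one backbone pass per proposal
# feats = [backbone(crop(image, roi)) for roi in proposals]   # ~2000 forward passes per image

# Fast R-CNN: one backbone pass per image, then cheap pooling
# fmap  = backbone(image)                                     # 1 forward pass per image
# feats = [roi_pool(fmap, roi) for roi in proposals]          # inexpensive crops of the feature map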
