ML_101
  • Introduction
  • ML Fundamentals
    • Basics
    • Optimization
    • How to prevent overfitting
    • Linear Algebra
    • Clustering
    • Calculate Parameters in CNN
    • Normalization
    • Confidence Interval
    • Quantization
  • Classical Machine Learning
    • Basics
    • Unsupervised Learning
  • Neural Networks
    • Basics
    • Activation function
    • Different Types of Convolution
    • Resnet
    • Mobilenet
  • Loss
    • L1 and L2 Loss
    • Hinge Loss
    • Cross-Entropy Loss
    • Binary Cross-Entropy Loss
    • Categorical Cross-Entropy Loss
    • (Optional) Focal Loss
    • (Optional) CORAL Loss
  • Computer Vision
    • Two Stage Object Detection
      • Metrics
      • ROI
      • R-CNN
      • Fast RCNN
      • Faster RCNN
      • Mask RCNN
    • One Stage Object Detection
      • FPN
      • YOLO
      • Single Shot MultiBox Detector(SSD)
    • Segmentation
      • Panoptic Segmentation
      • PSPNet
    • FaceNet
    • GAN
    • Imbalance problem in object detection
  • NLP
    • Embedding
    • RNN
    • LSTM
    • LSTM Ext.
    • RNN for text prediction
    • BLEU
    • Seq2Seq
    • Attention
    • Self Attention
    • Attention without RNN
    • Transformer
    • BERT
  • Parallel Computing
    • Communication
    • MapReduce
    • Parameter Server
    • Decentralized And Ring All Reduce
    • Federated Learning
    • Model Parallelism: GPipe
  • Anomaly Detection
    • DBSCAN
    • Autoencoder
  • Visualization
    • Saliency Maps
    • Fooling images
    • Class Visualization
Powered by GitBook
On this page
  • RoIAlign
  • Loss Function

Was this helpful?

  1. Computer Vision
  2. Two Stage Object Detection

Mask RCNN

PreviousFaster RCNNNextOne Stage Object Detection

Last updated 3 years ago

Was this helpful?

Mask R-CNN () extends Faster R-CNN to pixel-level image segmentation. The key point is to decouple the classification and the pixel-level mask prediction tasks. Based on the framework of , it added a third branch for predicting an object mask in parallel with the existing branches for classification and localization. The mask branch is a small fully-connected network applied to each RoI, predicting a segmentation mask in a pixel-to-pixel manner.

Because pixel-level segmentation requires much more fine-grained alignment than bounding boxes, mask R-CNN improves the RoI pooling layer (named "RoIAlign layer") so that RoI can be better and more precisely mapped to the regions of the original image.

RoIAlign

Loss Function

The multi-task loss function of Mask R-CNN combines the loss of classification, localization and segmentation mask: L=Lcls+Lbox+Lmask\mathcal{L} = \mathcal{L}_\text{cls} + \mathcal{L}_\text{box} + \mathcal{L}_\text{mask}L=Lcls​+Lbox​+Lmask​, where Lcls\mathcal{L}_\text{cls}Lcls​ and Lbox\mathcal{L}_\text{box}Lbox​ are same as in Faster R-CNN.

The mask branch generates a mask of dimension m x m for each RoI and each class; K classes in total. Thus, the total output is of size K⋅m2K \cdot m^2K⋅m2. Because the model is trying to learn a mask for each class, there is no competition among classes for generating masks.

Lmask\mathcal{L}_\text{mask}Lmask​ is defined as the average binary cross-entropy loss, only including k-th mask if the region is associated with the ground truth class k.

Lmask=−1m2∑1≤i,j≤m[yijlog⁡y^ijk+(1−yij)log⁡(1−y^ijk)]\mathcal{L}_\text{mask} = - \frac{1}{m^2} \sum_{1 \leq i, j \leq m} \big[ y_{ij} \log \hat{y}^k_{ij} + (1-y_{ij}) \log (1- \hat{y}^k_{ij}) \big]Lmask​=−m21​1≤i,j≤m∑​[yij​logy^​ijk​+(1−yij​)log(1−y^​ijk​)]

where yijy_{ij}yij​ is the label of a cell (i, j) in the true mask for the region of size m x m; y^ijk\hat{y}_{ij}^ky^​ijk​ is the predicted value of the same cell in the mask learned for the ground-truth class k.

The RoIAlign layer is designed to fix the location misalignment caused by quantization in the RoI pooling. RoIAlign removes the hash quantization, for example, by using x/16 instead of [x/16], so that the extracted features can be properly aligned with the input pixels. is used for computing the floating-point location values in the input.

Bilinear interpolation
He et al., 2017
Faster R-CNN