Categorical Cross-Entropy Loss


Last updated 3 years ago


Also called Softmax Loss. It is a Softmax activation plus a Cross-Entropy loss. If we use this loss, we will train a CNN to output a probability over the $C$ classes for each image. It is used for multi-class classification.

In the specific (and usual) case of multi-class classification the labels are one-hot, so only the positive class $C_p$ keeps its term in the loss. There is only one element of the target vector $t$ which is not zero, $t_i = t_p$. So, discarding the elements of the summation which are zero due to the target labels, we can write:

$$CE = -\log\left( \frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}} \right)$$

Where $s_p$ is the CNN score for the positive class.
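As a minimal sketch of the formula above, assuming plain Python lists of scores (the function names `softmax` and `categorical_cross_entropy` and the example scores are hypothetical, not from the original page), the loss for one sample could be computed like this:

```python
import math

def softmax(scores):
    """Softmax over a list of class scores, shifted by the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(scores, positive_index):
    """CE = -log(softmax(scores)[positive_index]) for a one-hot target."""
    return -math.log(softmax(scores)[positive_index])

# Hypothetical scores for C = 3 classes; class 0 is the positive class.
scores = [2.0, 1.0, 0.1]
loss = categorical_cross_entropy(scores, positive_index=0)
print(loss)
```

Only the softmax probability of the positive class enters the loss, but that probability is normalized by the scores of all classes.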

With the loss defined, we now have to compute its gradient with respect to the output neurons of the CNN in order to backpropagate it through the net and optimize the loss function by tuning the net parameters. So we need to compute the gradient of the CE loss with respect to each CNN class score in $s$. The loss terms coming from the negative classes are zero. However, the loss gradient with respect to those negative classes does not vanish, since the Softmax of the positive class also depends on the negative class scores.
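One way to make this dependence explicit, offered as a sketch rather than as the author's own derivation, is to expand the logarithm of the Softmax:

```latex
CE = -\log\left( \frac{e^{s_p}}{\sum_{j}^{C} e^{s_j}} \right)
   = -s_p + \log\left( \sum_{j}^{C} e^{s_j} \right)
```

The first term involves only $s_p$, while the second involves every score $s_j$, negative classes included, which is why their gradients are not zero.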

After some calculus, the derivative takes one form for the positive class and another for the other (negative) classes. Both expressions are given further down this page, after the code listings.

→ Skip this part if you are not interested in Facebook or me using Softmax Loss for multi-label classification, which is not standard.

The gradient has different expressions for positive and negative classes; both are written out further down this page, after the multi-label loss definition.

These expressions are easily inferred from the single-label gradient expressions.

As neither the Caffe Softmax with Loss layer nor the Multinomial Logistic Loss layer accepts multi-label targets, I implemented my own PyCaffe Softmax loss layer, following the specifications of the Facebook paper. Caffe Python layers let us easily customize the operations done in the forward and backward passes of the layer:

Forward pass: Loss computation

    # Method of a torch.autograd.Function subclass; requires `import torch`.
    @staticmethod
    def forward(ctx, x, target):
        """
        forward propagation
        """
        assert x.dim() == 2, "dimension of input should be 2"
        # Log-softmax via logsumexp avoids overflow in exp() and log(0) for extreme scores
        log_y = x - torch.logsumexp(x, dim=1, keepdim=True)
        y = torch.exp(log_y)

        # parameter "target" is a LongTensor of class labels; convert it into one-hot vectors
        t = torch.zeros_like(y)
        t.scatter_(1, target.unsqueeze(1), 1.0)

        # Mean cross-entropy over the batch; sum() already returns a tensor
        output = (-t * log_y).sum() / y.size(0)
        ctx.save_for_backward(y, t)

        return output

Backward pass: Gradients computation

    @staticmethod
    def backward(ctx, grad_output):
        """
        backward propagation
        softmax with ce loss backprop see https://www.youtube.com/watch?v=5-rVLSc2XdE
        """
        # y: softmax probabilities, t: one-hot targets, both saved in forward
        y, t = ctx.saved_tensors

        # Gradient of the batch-averaged loss w.r.t. the scores: (softmax - one_hot) / batch size
        grad_input = grad_output * (y - t) / y.size(0)
        # target holds integer labels, so it receives no gradient
        return grad_input, None

The gradient expression will be the same for all classes in $C$ except for the ground truth class $C_p$, because the score of $C_p$ ($s_p$) is in the numerator.

$$\frac{\partial}{\partial s_{p}} \left( -\log\left( \frac{e^{s_{p}}}{\sum_{j}^{C} e^{s_{j}}} \right) \right) = \left( \frac{e^{s_{p}}}{\sum_{j}^{C} e^{s_{j}}} - 1 \right)$$

$$\frac{\partial}{\partial s_{n}} \left( -\log\left( \frac{e^{s_{p}}}{\sum_{j}^{C} e^{s_{j}}} \right) \right) = \left( \frac{e^{s_{n}}}{\sum_{j}^{C} e^{s_{j}}} \right)$$

Where $s_n$ is the score of any negative class in $C$ different from $C_p$.
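The two expressions above can be checked numerically. The following is a minimal sketch in plain Python, assuming hypothetical helper names (`ce_loss`, `ce_gradient`, `numerical_gradient`) and made-up scores; it compares the analytic gradient with central finite differences of the loss:

```python
import math

def softmax(scores):
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(scores, p):
    """Single-label loss: -log of the softmax probability of positive class p."""
    return -math.log(softmax(scores)[p])

def ce_gradient(scores, p):
    """Analytic gradient: softmax(s)_k - 1 for k == p, softmax(s)_k otherwise."""
    probs = softmax(scores)
    return [prob - 1.0 if k == p else prob for k, prob in enumerate(probs)]

def numerical_gradient(scores, p, eps=1e-6):
    """Central finite differences of ce_loss with respect to each score."""
    grad = []
    for k in range(len(scores)):
        up = list(scores)
        up[k] += eps
        down = list(scores)
        down[k] -= eps
        grad.append((ce_loss(up, p) - ce_loss(down, p)) / (2 * eps))
    return grad

scores = [2.0, 1.0, 0.1]  # hypothetical scores, positive class p = 0
analytic = ce_gradient(scores, 0)
numeric = numerical_gradient(scores, 0)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic, max_diff)
```

The gradient is negative for the positive class and positive for every negative class, so a gradient descent step raises $s_p$ and lowers the other scores.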

Caffe: SoftmaxWithLoss Layer. Is limited to multi-class classification.

Pytorch: CrossEntropyLoss. Is limited to multi-class classification.

TensorFlow: softmax_cross_entropy. Is limited to multi-class classification.

In this Facebook work they claim that, despite being counter-intuitive, Categorical Cross-Entropy loss, or Softmax loss, worked better than Binary Cross-Entropy loss in their multi-label classification problem.

When Softmax loss is used in a multi-label scenario, the gradients get a bit more complex, since the loss contains an element for each positive class. Let $M$ be the positive classes of a sample. The CE loss with Softmax activations would be:

$$CE = \frac{1}{M} \sum_{p}^{M} -\log\left( \frac{e^{s_{p}}}{\sum_{j}^{C} e^{s_{j}}} \right)$$

Where each $s_p$ in $M$ is the CNN score for each positive class. As in the Facebook paper, I introduce a scaling factor $1/M$ to make the loss invariant to the number of positive classes, which may be different per sample.
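A minimal sketch of this multi-label loss in plain Python, assuming the positive classes are passed as a list of indices (the function names and example scores are hypothetical):

```python
import math

def log_softmax(scores):
    """Log of the softmax, computed with the log-sum-exp shift for stability."""
    m = max(scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_total for s in scores]

def multilabel_softmax_loss(scores, positives):
    """CE = (1/M) * sum over positive classes p of -log softmax(scores)[p]."""
    log_probs = log_softmax(scores)
    return -sum(log_probs[p] for p in positives) / len(positives)

scores = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for C = 4 classes
loss_two = multilabel_softmax_loss(scores, positives=[0, 1])  # M = 2
loss_one = multilabel_softmax_loss(scores, positives=[0])     # M = 1
print(loss_two, loss_one)
```

With a single positive class ($M = 1$) this reduces to the single-label Softmax loss, and the $1/M$ factor makes the result an average over the positive classes rather than a sum.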

$$\frac{\partial}{\partial s_{pi}} \left( \frac{1}{M} \sum_{p}^{M} -\log\left( \frac{e^{s_{p}}}{\sum_{j}^{C} e^{s_{j}}} \right) \right) = \frac{1}{M} \left( \left( \frac{e^{s_{pi}}}{\sum_{j}^{C} e^{s_{j}}} - 1 \right) + (M - 1) \frac{e^{s_{pi}}}{\sum_{j}^{C} e^{s_{j}}} \right)$$

Where $s_{pi}$ is the score of any positive class.

$$\frac{\partial}{\partial s_{n}} \left( \frac{1}{M} \sum_{p}^{M} -\log\left( \frac{e^{s_{p}}}{\sum_{j}^{C} e^{s_{j}}} \right) \right) = \frac{e^{s_{n}}}{\sum_{j}^{C} e^{s_{j}}}$$
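The positive-class expression simplifies to the Softmax probability minus $1/M$, since $\frac{1}{M}\left((q - 1) + (M - 1)q\right) = q - \frac{1}{M}$ for a Softmax probability $q$. The sketch below, written in plain Python with hypothetical function names and made-up scores, computes both gradient expressions and compares them with finite differences of the multi-label loss:

```python
import math

def softmax(scores):
    m = max(scores)  # shift by the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def multilabel_loss(scores, positives):
    """(1/M) * sum over positive classes of -log softmax probability."""
    probs = softmax(scores)
    return -sum(math.log(probs[p]) for p in positives) / len(positives)

def multilabel_gradient(scores, positives):
    """softmax(s)_k - 1/M for positive classes, softmax(s)_k for negative ones."""
    probs = softmax(scores)
    m = len(positives)
    return [prob - 1.0 / m if k in positives else prob
            for k, prob in enumerate(probs)]

def numerical_gradient(scores, positives, eps=1e-6):
    """Central finite differences of multilabel_loss with respect to each score."""
    grad = []
    for k in range(len(scores)):
        up = list(scores)
        up[k] += eps
        down = list(scores)
        down[k] -= eps
        diff = multilabel_loss(up, positives) - multilabel_loss(down, positives)
        grad.append(diff / (2 * eps))
    return grad

scores = [2.0, 1.0, 0.1, -1.0]  # hypothetical scores for C = 4 classes
positives = [0, 1]              # M = 2 positive classes
analytic = multilabel_gradient(scores, positives)
numeric = numerical_gradient(scores, positives)
max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print(analytic, max_diff)
```

As in the single-label case, the gradient components sum to zero, because the Softmax probabilities sum to one and the $M$ positive classes each subtract $1/M$.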

For full code, take a look at .

The Caffe Python layer of this Softmax loss supporting a multi-label setup with real numbers labels is available
