# Binary Cross-Entropy Loss

Also called **Sigmoid Cross-Entropy loss**. It is a **Sigmoid activation** plus a **Cross-Entropy loss**. Unlike **Softmax loss**, it is independent for each vector component (class), meaning that the loss computed for every CNN output vector component is not affected by the other component values. That’s why it is used for **multi-label classification**, where the insight of an element belonging to a certain class should not influence the decision for another class. It’s called **Binary Cross-Entropy Loss** because it sets up a binary classification problem between $$C'=2$$ classes for every class in $$C$$, as explained above. So when using this loss, the formulation of **Cross-Entropy loss** for binary problems is often used:

$$
CE = -\sum\_{i=1}^{C'=2}t\_{i} log (f(s\_{i})) = -t\_{1} log(f(s\_{1})) - (1 - t\_{1}) log(1 - f(s\_{1}))
$$

![](https://637078585-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MYsi-h_n0zY_8MKKgyu%2Fuploads%2Fgit-blob-4fc1ff1dd988d8ea392692f6f1fcde02ea4b91e7%2Fsigmoid_CE_pipeline.png?alt=media)

This would be the pipeline for each one of the $$C$$ classes. We set up $$C$$ independent binary classification problems ($$C'=2$$). Then we sum the losses over the different binary problems: we sum the gradients of every binary problem to backpropagate, and the losses to monitor the global loss. $$s\_1$$ and $$t\_1$$ are the score and the ground truth label for the class $$C\_1$$, which is also the class $$C\_i$$ in $$C$$. $$s\_2=1-s\_1$$ and $$t\_2=1-t\_1$$ are the score and the ground truth label of the class $$C\_2$$, which is not a "class" in our original problem with $$C$$ classes, but a class we create to set up the binary problem with $$C\_1=C\_i$$. We can understand it as a background class.
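As a sketch of this pipeline, the per-class losses for a single sample can be computed with NumPy. The scores and targets below are made-up values for illustration, and `sigmoid` is a hypothetical helper for $$f()$$:

```python
import numpy as np


def sigmoid(s):
    return 1 / (1 + np.exp(-s))


# Hypothetical example: one sample, C = 3 classes, multi-label targets
scores = np.array([2.0, -1.0, 0.5])   # one score s_1 per class C_i
targets = np.array([1.0, 0.0, 1.0])   # t_1 per class (several can be 1)

# Per-class binary cross-entropy: each class is its own C' = 2 problem,
# with f(s_2) = 1 - f(s_1) and t_2 = 1 - t_1 playing the background role
per_class = -targets * np.log(sigmoid(scores)) \
            - (1 - targets) * np.log(1 - sigmoid(scores))

# The global loss sums over the C independent binary problems
total = per_class.sum()
```

Note that changing one class's score or target only changes that class's term in `per_class`, which is exactly the independence property described above.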

The loss can be expressed as:

$$
CE = \begin{cases} -log(f(s\_{1})) & \text{if } t\_{1} = 1 \\ -log(1 - f(s\_{1})) & \text{if } t\_{1} = 0 \end{cases}
$$

Where $$t\_1=1$$ means that the class $$C\_1=C\_i$$ is positive for this sample.

In this case, the activation function does not depend on scores of classes in $$C$$ other than $$C\_1=C\_i$$. So the gradient with respect to each score $$s\_i$$ in $$s$$ will only depend on the loss given by its binary problem.

The gradient with respect to the score $$s\_i=s\_1$$ can be written as:

$$
\frac{\partial}{\partial s\_{i}} \left ( CE(f(s\_{i})) \right ) = t\_{1} (f(s\_{1}) - 1) + (1 - t\_{1}) f(s\_{1})
$$

Where $$f()$$ is the **sigmoid** function. It can also be written as:

$$
\frac{\partial}{\partial s\_{i}} \left ( CE(f(s\_{i})) \right ) = \begin{cases} f(s\_{i}) - 1 & \text{if } t\_{i} = 1 \\ f(s\_{i}) & \text{if } t\_{i} = 0 \end{cases}
$$
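Both branches collapse to $$f(s\_i) - t\_i$$, which is easy to verify numerically. A minimal sketch (the values of `s` and `t` are arbitrary, chosen only for the check):

```python
import numpy as np


def sigmoid(s):
    return 1 / (1 + np.exp(-s))


def bce(s, t):
    # binary cross-entropy for a single score s and label t
    return -t * np.log(sigmoid(s)) - (1 - t) * np.log(1 - sigmoid(s))


s, t = 0.7, 1.0

# analytic gradient from the formula above: f(s) - t
grad = sigmoid(s) - t

# numerical gradient via central finite differences
eps = 1e-6
grad_num = (bce(s + eps, t) - bce(s - eps, t)) / (2 * eps)
```

With `t = 1` the gradient is negative, so gradient descent pushes the score up, as expected for a positive class.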

```python
import numpy as np


def binary_cross_entropy(X, y):
    m = y.shape[0]
    y = y.reshape(m)
    s = X.reshape(m)
    # apply the sigmoid: f(s) = 1 / (1 + e^-s)
    fX = 1 / (1 + np.exp(-s))
    # CE = -y * log(f(s)) - (1 - y) * log(1 - f(s)), averaged over the m samples
    ce = np.sum(-y * np.log(fX) - (1 - y) * np.log(1 - fX)) / m
    return ce


X = np.array([[9.7], [0]])  # scores, shape (N, 1)
Y = np.array([[0], [1]])    # ground truth labels, shape (N, 1)

print(binary_cross_entropy(X, Y))
```

> Refer [here](https://www.ics.uci.edu/~pjsadows/notes.pdf) for a detailed loss derivation.

* Caffe: [Sigmoid Cross-Entropy Loss Layer](http://caffe.berkeleyvision.org/tutorial/layers/sigmoidcrossentropyloss.html)
* Pytorch: [BCEWithLogitsLoss](https://pytorch.org/docs/master/nn.html#bcewithlogitsloss)
* TensorFlow: [sigmoid\_cross\_entropy](https://www.tensorflow.org/api_docs/python/tf/losses/sigmoid_cross_entropy)
