FaceNet
The network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training.
The embedding is trained such that the squared distance between all faces of the same identity, independent of imaging conditions, is small, whereas the squared distance between a pair of face images from different identities is large.
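As a rough, non-authoritative sketch of this pipeline in PyTorch (the backbone, the 128-dimensional embedding size and the input resolution are placeholder assumptions, not details given here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingNet(nn.Module):
    """Backbone CNN -> d-dimensional embedding -> L2 normalization onto the unit hypersphere."""
    def __init__(self, backbone: nn.Module, backbone_dim: int, embedding_dim: int = 128):
        super().__init__()
        self.backbone = backbone                    # any CNN that outputs a flat feature vector
        self.fc = nn.Linear(backbone_dim, embedding_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = self.backbone(x)                 # (batch, backbone_dim)
        embedding = self.fc(features)               # (batch, embedding_dim)
        return F.normalize(embedding, p=2, dim=1)   # ||f(x)||_2 = 1 for every row

# e.g. with a deliberately tiny stand-in backbone
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 160 * 160, 512), nn.ReLU())
model = EmbeddingNet(backbone, backbone_dim=512)
embeddings = model(torch.randn(4, 3, 160, 160))    # embeddings.norm(dim=1) is all ones
```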
There are two obvious choices for how to generate the triplets used in the loss:
Generate triplets offline every n steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.
Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.
To have a meaningful representation of the anchor-positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each minibatch. In practice, around 40 faces are selected per identity per minibatch. Additionally, randomly sampled negative faces are added to each mini-batch.
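One way the batch composition described above could look in code; the identity count per batch, the number of extra random negatives and the `images_by_identity` mapping are assumptions made for illustration, and only the ~40 faces per identity comes from the text:

```python
import random

def sample_minibatch(images_by_identity, n_identities=45, faces_per_identity=40,
                     n_random_negatives=200, rng=random):
    """Compose one mini-batch: ~40 faces for each of a set of identities
    (45 x 40 ~= 1,800 exemplars), plus some randomly sampled negative faces.
    images_by_identity maps an identity label to a list of image references."""
    identities = rng.sample(list(images_by_identity), n_identities)
    batch = []
    for identity in identities:
        faces = images_by_identity[identity]
        for face in rng.sample(faces, min(faces_per_identity, len(faces))):
            batch.append((identity, face))
    # add randomly sampled faces from identities outside the selected set as extra negatives
    others = [ident for ident in images_by_identity if ident not in identities]
    for _ in range(n_random_negatives):
        ident = rng.choice(others)
        batch.append((ident, rng.choice(images_by_identity[ident])))
    rng.shuffle(batch)
    return batch
```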
Instead of picking the hardest positive, the authors use all anchor-positive pairs in a mini-batch while still selecting the hard negatives. In practice, the all anchor-positive method was more stable and converged slightly faster at the beginning of training.
These negative exemplars are called semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because the squared distance is close to the anchor-positive distance.
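A minimal sketch of this online selection, assuming the L2-normalized embeddings and integer identity labels of one mini-batch are already available; the margin value and the exact choice among the semi-hard candidates are assumptions rather than details from the text:

```python
import torch

def select_semi_hard_triplets(embeddings: torch.Tensor, labels: torch.Tensor, alpha: float = 0.2):
    """Online triplet selection within one mini-batch: keep every anchor-positive
    pair and, for each, pick the hardest negative that still satisfies the
    semi-hard condition d(a, p) < d(a, n) < d(a, p) + alpha."""
    # pairwise squared L2 distances between all embeddings in the batch
    dist = torch.cdist(embeddings, embeddings, p=2).pow(2)
    triplets = []
    for a in range(len(labels)):
        for p in range(len(labels)):
            if p == a or labels[p] != labels[a]:
                continue                                  # use all anchor-positive pairs
            d_ap = dist[a, p]
            semi_hard = (labels != labels[a]) & (dist[a] > d_ap) & (dist[a] < d_ap + alpha)
            if semi_hard.any():
                # hardest negative among the semi-hard candidates
                d_neg = dist[a].clone()
                d_neg[~semi_hard] = float("inf")
                triplets.append((a, p, int(d_neg.argmin())))
    return triplets
```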
In most experiments the authors used a batch size of around 1,800 exemplars.
Rectified linear units are used as the non-linear activation function.
More recent versions use Inception+ResNet.
The authors strive for an embedding $f(x)$, from an image $x$ into a feature space $\mathbb{R}^d$, such that the squared distance between faces of the same identity is small, whereas the squared distance between faces of different identities is large.
The embedding is represented by $f(x) \in \mathbb{R}^d$. It embeds an image $x$ into a $d$-dimensional Euclidean space. Additionally, we constrain this embedding to live on the $d$-dimensional hypersphere, i.e. $\|f(x)\|_2 = 1$.
Here the objective is to make sure that an image $x_i^a$ (anchor) of a specific person is closer to all other images $x_i^p$ (positive) of the same person than it is to any image $x_i^n$ (negative) of any other person:
$$\|f(x_i^a) - f(x_i^p)\|_2^2 + \alpha < \|f(x_i^a) - f(x_i^n)\|_2^2, \quad \forall \left(f(x_i^a), f(x_i^p), f(x_i^n)\right) \in \mathcal{T}$$
Here $\alpha$ is a margin that is enforced between positive and negative pairs, and $\mathcal{T}$ is the set of all possible triplets in the training set, which has cardinality $N$.
The objective is then to minimize the loss:
$$L = \sum_{i}^{N} \left[\, \|f(x_i^a) - f(x_i^p)\|_2^2 - \|f(x_i^a) - f(x_i^n)\|_2^2 + \alpha \,\right]_+$$
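For illustration, the loss above could be written as follows for a batch of already-selected triplets; the default margin of 0.2 is an assumption, while the hinge and the summation mirror the formula:

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor: torch.Tensor, positive: torch.Tensor, negative: torch.Tensor,
                 alpha: float = 0.2) -> torch.Tensor:
    """L = sum_i [ ||f(x_i^a) - f(x_i^p)||^2 - ||f(x_i^a) - f(x_i^n)||^2 + alpha ]_+"""
    d_ap = (anchor - positive).pow(2).sum(dim=1)   # squared anchor-positive distances
    d_an = (anchor - negative).pow(2).sum(dim=1)   # squared anchor-negative distances
    return F.relu(d_ap - d_an + alpha).sum()       # hinge keeps only the violating triplets
```

PyTorch also ships `torch.nn.TripletMarginLoss`, which implements the same hinge but uses non-squared L2 distances by default.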
Given $x_i^a$, we want to select an $x_i^p$ (hard positive) such that $\mathrm{argmax}_{x_i^p} \|f(x_i^a) - f(x_i^p)\|_2^2$ and similarly an $x_i^n$ (hard negative) such that $\mathrm{argmin}_{x_i^n} \|f(x_i^a) - f(x_i^n)\|_2^2$.
Selecting the hardest negatives can in practice lead to bad local minima early on in training; specifically, it can result in a collapsed model (i.e. $f(x) = 0$). In order to mitigate this, it helps to select $x_i^n$ such that
$$\|f(x_i^a) - f(x_i^p)\|_2^2 < \|f(x_i^a) - f(x_i^n)\|_2^2$$
Those negatives lie inside the margin $\alpha$.
The authors evaluate the method on the face verification task: given a pair of two face images, a squared L2 distance threshold $D(x_i, x_j) \leq d$ is used to determine the classification of same and different. All face pairs $(i, j)$ of the same identity are denoted with $\mathcal{P}_{\text{same}}$, whereas all pairs of different identities are denoted with $\mathcal{P}_{\text{diff}}$.

The set of all true accepts is defined as
$$\mathrm{TA}(d) = \{(i, j) \in \mathcal{P}_{\text{same}},\ \text{with}\ D(x_i, x_j) \leq d\}.$$
These are the face pairs $(i, j)$ that were correctly classified as same at threshold $d$. Similarly,
$$\mathrm{FA}(d) = \{(i, j) \in \mathcal{P}_{\text{diff}},\ \text{with}\ D(x_i, x_j) \leq d\}$$
is the set of all pairs that was incorrectly classified as same (false accept).

The validation rate $\mathrm{VAL}(d)$ and the false accept rate $\mathrm{FAR}(d)$ for a given face distance $d$ are then defined as
$$\mathrm{VAL}(d) = \frac{|\mathrm{TA}(d)|}{|\mathcal{P}_{\text{same}}|}, \qquad \mathrm{FAR}(d) = \frac{|\mathrm{FA}(d)|}{|\mathcal{P}_{\text{diff}}|}$$
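Assuming the squared distances for all pairs in $\mathcal{P}_{\text{same}}$ and $\mathcal{P}_{\text{diff}}$ have already been computed and stored as tensors (an assumption about the data layout), the two rates could be evaluated as:

```python
import torch

def val_far(dist_same: torch.Tensor, dist_diff: torch.Tensor, d: float):
    """VAL(d) = |TA(d)| / |P_same| and FAR(d) = |FA(d)| / |P_diff| at threshold d.
    dist_same / dist_diff hold D(x_i, x_j) for all same- / different-identity pairs."""
    ta = (dist_same <= d).sum().item()    # true accepts
    fa = (dist_diff <= d).sum().item()    # false accepts
    return ta / len(dist_same), fa / len(dist_diff)

# sweeping d trades off VAL (higher is better) against FAR (lower is better)
```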