FaceNet

Model structure

The network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. This is followed by the triplet loss during training.

Namely, the authors strive for an embedding $f(x)$, from an image $x$ into a feature space $\mathbf{R}^d$, such that:

  • the squared distance between all faces of the same identity is small, independent of imaging conditions.

  • the squared distance between a pair of face images from different identities is large.
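
A minimal sketch of the L2 normalization step that produces the embedding (NumPy; the function and argument names are illustrative, not from the paper):

```python
import numpy as np

def l2_normalize(z, eps=1e-12):
    """Scale each row of z (shape: batch x d) to unit L2 norm, so
    the embedding f(x) lies on the d-dimensional hypersphere."""
    return z / (np.linalg.norm(z, axis=1, keepdims=True) + eps)
```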

Triplet Loss

The embedding is represented by $f(x) \in \mathbf{R}^d$. It embeds an image $x$ into a $d$-dimensional Euclidean space. Additionally, this embedding is constrained to live on the $d$-dimensional hypersphere, i.e. $\|f(x)\|_2 = 1$.

The objective is to make sure that an image $x^a_i$ (anchor) of a specific person is closer to all other images $x^p_i$ (positive) of the same person than it is to any image $x^n_i$ (negative) of any other person:

$$\|f(x^a_i)-f(x^p_i)\|^2 + \alpha < \|f(x^a_i)-f(x^n_i)\|^2 \quad \forall\, (f(x^a_i), f(x^p_i), f(x^n_i)) \in \tau$$

$\alpha$ is a margin that is enforced between positive and negative pairs.

$\tau$ is the set of all possible triplets in the training set and has cardinality $N$.

The objective is then to minimize the loss:

$$L = \sum_i \big[ \|f(x^a_i)-f(x^p_i)\|^2 - \|f(x^a_i)-f(x^n_i)\|^2 + \alpha \big]_+$$
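
A minimal sketch of this loss in NumPy (names are illustrative; the default margin of 0.2 matches the value the paper reports using):

```python
import numpy as np

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """Triplet loss over a batch of embeddings.

    f_a, f_p, f_n: (batch, d) anchor/positive/negative embeddings,
    assumed L2-normalized; alpha is the margin.
    """
    pos = np.sum((f_a - f_p) ** 2, axis=1)  # ||f(x_a) - f(x_p)||^2
    neg = np.sum((f_a - f_n) ** 2, axis=1)  # ||f(x_a) - f(x_n)||^2
    # The [.]_+ hinge: only triplets violating the margin contribute.
    return np.sum(np.maximum(pos - neg + alpha, 0.0))
```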

Triplet Selection

Given $x^a_i$, we want to select a hard positive $x^p_i$ via $\mathrm{argmax}_{x^p_i} \|f(x^a_i)-f(x^p_i)\|^2$ and, similarly, a hard negative $x^n_i$ via $\mathrm{argmin}_{x^n_i} \|f(x^a_i)-f(x^n_i)\|^2$.

Two obvious choices:

  • Generate triplets offline every $n$ steps, using the most recent network checkpoint and computing the argmin and argmax on a subset of the data.

  • Generate triplets online. This can be done by selecting the hard positive/negative exemplars from within a mini-batch.

Online Triplets Generation

  • To have a meaningful representation of the anchor-positive distances, it needs to be ensured that a minimal number of exemplars of any one identity is present in each mini-batch.

    • around 40 faces are selected per identity per mini-batch.

    • randomly sampled negative faces are added to each mini-batch.

  • Instead of picking the hardest positive, all anchor-positive pairs in a mini-batch are used, while still selecting the hard negatives. The all anchor-positive method was more stable and converged slightly faster at the beginning of training.

  • Selecting the hardest negatives can in practice lead to bad local minima early in training; specifically, it can result in a collapsed model (i.e. $f(x) = 0$). To mitigate this, it helps to select $x^n_i$ such that

    $$\|f(x^a_i)-f(x^p_i)\|^2_2 < \|f(x^a_i)-f(x^n_i)\|^2_2$$

    These negative exemplars are called semi-hard, as they are further away from the anchor than the positive exemplar, but still hard because their squared distance is close to the anchor-positive distance. Those negatives lie inside the margin $\alpha$. A mining sketch follows this list.

  • In most experiments the authors used a batch size of around 1,800 exemplars.
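
A minimal sketch of this online semi-hard mining rule (NumPy; all names are illustrative, and picking one candidate at random among the semi-hard negatives is a choice of this sketch, not prescribed by the paper):

```python
import numpy as np

def semi_hard_triplets(emb, labels, alpha=0.2, rng=None):
    """Return (anchor, positive, negative) index triplets using all
    anchor-positive pairs and one semi-hard negative per pair.

    emb: (n, d) L2-normalized embeddings; labels: (n,) identity ids.
    """
    rng = np.random.default_rng() if rng is None else rng
    labels = np.asarray(labels)
    sq = np.sum(emb ** 2, axis=1)
    # Pairwise squared L2 distances, clipped against float error.
    dist = np.maximum(sq[:, None] + sq[None, :] - 2.0 * emb @ emb.T, 0.0)
    triplets = []
    for a in range(len(labels)):
        for p in np.flatnonzero(labels == labels[a]):
            if p == a:
                continue  # every anchor-positive pair is used
            d_ap = dist[a, p]
            # Semi-hard: farther than the positive, but inside the margin.
            mask = (labels != labels[a]) & (dist[a] > d_ap) & (dist[a] < d_ap + alpha)
            candidates = np.flatnonzero(mask)
            if candidates.size:
                triplets.append((a, int(p), int(rng.choice(candidates))))
    return triplets
```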

Deep Convolutional Networks

  • Rectified linear units are used as the non-linear activation function.

  • More recent implementations use Inception+ResNet.

Model Evaluation

The authors evaluate the method on the face verification task, i.e. given a pair of face images, a squared L2 distance threshold $D(x_i, x_j)$ is used to determine the classification of same and different. All face pairs $(i, j)$ of the same identity are denoted with $\mathcal{P}_{same}$, whereas all pairs of different identities are denoted with $\mathcal{P}_{diff}$.

  • The set of all true accepts is:

    $$TA(d) = \{ (i,j) \in \mathcal{P}_{same},\ \text{with}\ D(x_i, x_j) \le d \}$$

    These are the face pairs $(i, j)$ that were correctly classified as same at threshold $d$.

  • The set of all pairs that were incorrectly classified as same (false accepts) is:

    $$FA(d) = \{ (i,j) \in \mathcal{P}_{diff},\ \text{with}\ D(x_i, x_j) \le d \}$$

The validation rate $VAL(d)$ and the false accept rate $FAR(d)$ for a given face distance $d$ are then defined as:

$$VAL(d) = \frac{|TA(d)|}{|\mathcal{P}_{same}|}, \qquad FAR(d) = \frac{|FA(d)|}{|\mathcal{P}_{diff}|}$$
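
These two rates can be computed directly from pair distances; a minimal sketch (NumPy; names are illustrative, and both pair sets are assumed non-empty):

```python
import numpy as np

def val_far(dist, same, d):
    """VAL(d) and FAR(d) at a squared-distance threshold d.

    dist: (m,) squared L2 distances D(x_i, x_j) for m face pairs.
    same: (m,) booleans, True if the pair shares an identity.
    """
    accept = dist <= d                     # pairs classified as "same"
    ta = np.count_nonzero(accept & same)   # |TA(d)|: true accepts
    fa = np.count_nonzero(accept & ~same)  # |FA(d)|: false accepts
    val = ta / np.count_nonzero(same)      # |TA(d)| / |P_same|
    far = fa / np.count_nonzero(~same)     # |FA(d)| / |P_diff|
    return val, far
```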

