BERT


What is BERT?

  • BERT [1] is a method for pre-training the Transformer's [2] encoder.

  • How? By training on two tasks:

    • Predict masked words.

    • Predict the next sentence.

Task 1: Predict Masked Words

Randomly mask a word

  • "The _ sat on the mat"

  • What is the masked word?

Predict the masked word

  • Let e be the one-hot vector of the masked word "cat", and let p be the output probability distribution at the masked position.

  • Loss = CrossEntropy(e, p).

  • Perform one gradient descent step on this loss to update the model parameters (see the sketch below).
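
A minimal numpy sketch of this loss, assuming a toy five-word vocabulary and a made-up output distribution p:

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary (assumption)
masked_word = "cat"

# e: one-hot vector of the masked word "cat".
e = np.zeros(len(vocab))
e[vocab.index(masked_word)] = 1.0

# p: output probability distribution at the masked position
# (made-up numbers standing in for the encoder + softmax output).
p = np.array([0.10, 0.70, 0.05, 0.05, 0.10])

# Cross-entropy between the one-hot target and the predicted distribution.
loss = -np.sum(e * np.log(p))
print(loss)  # -log(0.70) ≈ 0.357
```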

Task 2: Predict the Next Sentence

  • Given the sentence: "calculus is a branch of math"

  • Is this the next sentence?

    • "it was developed by newton and leibniz"

  • Is this the next sentence?

    • "panda is native to south central china"

Input Representation

  • Input:

    • [CLS] “calculus is a branch of math”

    • [SEP] “it was developed by newton and leibniz”

  • Target: true

  • Input:

    • [CLS] “calculus is a branch of math”

    • [SEP] “panda is native to south central china”

  • Target: false

Note: [CLS] is a token for classification. [SEP] is for separating sentences.
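
A minimal sketch of packing the two sentences into one input sequence, assuming whitespace tokenization and the simplified layout shown above (real BERT uses WordPiece tokens, appends a final [SEP], and adds position and segment embeddings):

```python
def build_input(sentence_a, sentence_b):
    """Pack two sentences into one BERT-style input with segment ids."""
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"] + sentence_b.split()
    # Segment ids tell the encoder which sentence each token belongs to.
    segment_ids = [0] * (len(sentence_a.split()) + 2) + [1] * len(sentence_b.split())
    return tokens, segment_ids

tokens, segment_ids = build_input(
    "calculus is a branch of math",
    "it was developed by newton and leibniz",
)
print(tokens)       # ['[CLS]', 'calculus', ..., '[SEP]', 'it', ..., 'leibniz']
print(segment_ids)  # [0, 0, ..., 0, 1, ..., 1]
```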

Combining the two methods

  • Input:

    • "[CLS] calculus is a [MASK] of math [SEP] it [MASK] developed by newton and leibniz".

    • Targets: true, “branch”, “was”.

Training

  • Loss 1 is for binary classification (i.e., predicting the next sentence).

  • Loss 2 and Loss 3 are for multi-class classification (i.e., predicting the two masked words).

  • The objective function is the sum of the three loss functions.

  • Update the model parameters by performing one gradient descent step (see the sketch below).
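
A minimal PyTorch sketch of this objective on the combined example above (two masks plus one next-sentence label). The ToyEncoder, token ids, and mask positions are made up for illustration; none of the names come from the actual BERT code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, seq_len = 1000, 64, 12

class ToyEncoder(nn.Module):
    """Stand-in for the Transformer encoder, just to make the example runnable."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.mlm_head = nn.Linear(hidden, vocab_size)   # predicts masked words
        self.nsp_head = nn.Linear(hidden, 2)            # predicts is-next-sentence

    def forward(self, token_ids):
        h = self.embed(token_ids)                        # (1, seq_len, hidden)
        return self.mlm_head(h), self.nsp_head(h[:, 0])  # NSP uses the [CLS] position

model = ToyEncoder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

token_ids = torch.randint(0, vocab_size, (1, seq_len))  # the masked input sequence
masked_positions = [4, 11]               # positions of the two [MASK] tokens (made up)
masked_targets = torch.tensor([42, 7])   # made-up ids standing in for "branch", "was"
nsp_target = torch.tensor([1])           # 1 = the second sentence really follows

mlm_logits, nsp_logits = model(token_ids)

# Loss 1: binary classification (next-sentence prediction).
loss1 = F.cross_entropy(nsp_logits, nsp_target)
# Loss 2 and Loss 3: multi-class classification over the vocabulary, one per mask.
loss2 = F.cross_entropy(mlm_logits[0, masked_positions[0]].unsqueeze(0), masked_targets[0:1])
loss3 = F.cross_entropy(mlm_logits[0, masked_positions[1]].unsqueeze(0), masked_targets[1:2])

# Objective = sum of the three losses; one gradient-descent step updates all parameters.
loss = loss1 + loss2 + loss3
optimizer.zero_grad()
loss.backward()
optimizer.step()
```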

Data

  • BERT does not need manually labeled data. (Nice! Manual labeling is expensive.)

  • Use large-scale data, e.g., English Wikipedia (2.5 billion words).

  • Randomly mask words (with some tricks, sketched below).

  • 50% of the "next sentences" are real. (The other 50% are randomly sampled sentences.)
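
A minimal sketch of the usual masking tricks from [1]: 15% of tokens are chosen; of those, 80% are replaced by [MASK], 10% by a random word, and 10% are left unchanged. The toy vocabulary and the mask_tokens helper are made up:

```python
import random

def mask_tokens(tokens, vocab, mask_rate=0.15):
    """Randomly mask tokens for the masked-word prediction task."""
    masked, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() >= mask_rate:
            continue                          # leave this token alone
        targets[i] = tok                      # remember the true word at this position
        r = random.random()
        if r < 0.8:
            masked[i] = "[MASK]"              # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = random.choice(vocab)  # 10%: replace with a random word
        # else: 10% keep the original word unchanged
    return masked, targets

vocab = ["the", "cat", "sat", "on", "mat"]
print(mask_tokens("the cat sat on the mat".split(), vocab))
```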

Cost of computation

  • BERT Base

    • 110M parameters.

    • 16 TPUs, 4 days of training (without hyper-parameter tuning).

  • BERT Large

    • 340M parameters.

    • 64 TPUs, 4 days of training (without hyper-parameter tuning).

Reference

  1. Devlin, Chang, Lee, and Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.

  2. Vaswani et al. Attention is all you need. In NIPS, 2017.
