BERT
BERT [1] is a method for pre-training the Transformer [2] encoder.
How?
Task 1: predict the masked words.
Task 2: predict the next sentence.
"The _ sat on the mat"
What is the masked word?
e: one-hot vector of the masked word "cat".
p: output probability distribution at the masked position.
Loss = CrossEntropy(e, p).
Perform one gradient descent step to update the model parameters.
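As a minimal sketch (in PyTorch, which the paper itself does not prescribe), the masked-word loss can be computed as follows; the vocabulary size, the word id of "cat", and the random logits standing in for the encoder output are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

vocab_size = 30000  # illustrative vocabulary size

# Suppose the Transformer encoder produced unnormalized scores (logits)
# over the vocabulary at the masked position "_".
logits = torch.randn(vocab_size)

# p: output probability distribution at the masked position.
p = F.softmax(logits, dim=-1)

# e: one-hot vector of the masked word "cat" (the id 1234 is an assumption).
cat_id = 1234
e = F.one_hot(torch.tensor(cat_id), num_classes=vocab_size).float()

# CrossEntropy(e, p): negative log-likelihood of the true word under p.
loss = -(e * torch.log(p)).sum()

# A real training step would then call loss.backward() and an optimizer
# step to update the encoder's parameters (one gradient descent step).
```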
Given the sentence: "calculus is a branch of math"
Is this the next sentence?
"it was developed by newton and leibniz"
Is this the next sentence?
"panda is native to south central china"
Input:
[CLS] “calculus is a branch of math”
[SEP] “it was developed by newton and leibniz”
Target: true
Input:
[CLS] “calculus is a branch of math”
[SEP] “panda is native to south central china”
Target: false
Note: [CLS] is a token for classification. [SEP] is for separating sentences.
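Below is a minimal sketch of how such a training pair can be assembled. The [CLS] and [SEP] tokens are from the text; the whitespace tokenizer and the function name are illustrative assumptions (BERT actually uses WordPiece tokenization).

```python
def build_nsp_example(sentence_a: str, sentence_b: str, is_next: bool):
    """Assemble one next-sentence-prediction example (illustrative sketch).

    [CLS] is the classification token; [SEP] separates the two sentences.
    Tokenization here is a plain whitespace split, not BERT's WordPiece.
    """
    tokens = (["[CLS]"] + sentence_a.split()
              + ["[SEP]"] + sentence_b.split() + ["[SEP]"])
    target = 1 if is_next else 0  # true -> 1, false -> 0
    return tokens, target

# The two examples from the text:
ex_true = build_nsp_example("calculus is a branch of math",
                            "it was developed by newton and leibniz", True)
ex_false = build_nsp_example("calculus is a branch of math",
                             "panda is native to south central china", False)
```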
Input:
"[CLS] calculus is a [MASK] of math [SEP] it [MASK] developed by newton and leibniz".
Targets: true, “branch”, “was”.
Loss 1 is for binary classification (i.e., predicting the next sentence).
Loss 2 and Loss 3 are for multi-class classification (i.e., predicting the two masked words).
The objective function is the sum of the three loss functions.
Update the model parameters by performing one gradient descent step.
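A minimal sketch of the combined objective in PyTorch; the logits are random stand-ins for the encoder's output heads, and the word ids for "branch" and "was" are assumptions.

```python
import torch
import torch.nn.functional as F

vocab_size = 30000  # illustrative vocabulary size

# Stand-ins for the two output heads of the encoder (random values here):
nsp_logits = torch.randn(1, 2)            # [CLS] head: scores for false/true
mlm_logits = torch.randn(2, vocab_size)   # one row per [MASK] position

# Targets from the example: true, "branch", "was" (word ids are assumptions).
nsp_target = torch.tensor([1])            # 1 = "true"
mlm_targets = torch.tensor([5021, 307])   # assumed ids of "branch" and "was"

# Loss 1: binary classification (next-sentence prediction).
loss1 = F.cross_entropy(nsp_logits, nsp_target)

# Loss 2 + Loss 3: multi-class classification over the vocabulary,
# one cross-entropy term per masked position.
loss2_plus_3 = F.cross_entropy(mlm_logits, mlm_targets, reduction="sum")

# Objective: sum of the three losses; one gradient descent step would
# follow via objective.backward() and an optimizer step.
objective = loss1 + loss2_plus_3
```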
BERT does not need manually labeled data. (Nice! Manual labeling is expensive.)
Use large-scale data, e.g., English Wikipedia (2.5 billion words).
Randomly mask 15% of the words (with some tricks).
50% of the second sentences are the real next sentence; the other 50% are randomly sampled (fake).
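The data-preparation step can be sketched as follows, assuming whitespace tokenization. The 15% masking rate and the 50/50 sentence sampling follow the paper; the function and variable names are illustrative, and the paper's further masking tricks (occasionally keeping the word or substituting a random one) are omitted.

```python
import random

def make_pretraining_example(sent_a, real_next, random_sent, mask_rate=0.15):
    """Build one unlabeled pretraining example (illustrative sketch).

    With probability 0.5 the second sentence is the real next sentence
    (target: true); otherwise it is a randomly sampled sentence
    (target: false). Roughly `mask_rate` of the ordinary tokens are
    replaced by [MASK], and the original words are kept as targets.
    """
    if random.random() < 0.5:
        sent_b, is_next = real_next, True
    else:
        sent_b, is_next = random_sent, False

    tokens = ["[CLS]"] + sent_a.split() + ["[SEP]"] + sent_b.split() + ["[SEP]"]

    masked_tokens, mlm_targets = [], []
    for position, token in enumerate(tokens):
        if token not in ("[CLS]", "[SEP]") and random.random() < mask_rate:
            masked_tokens.append("[MASK]")
            mlm_targets.append((position, token))  # what to predict, and where
        else:
            masked_tokens.append(token)

    return masked_tokens, mlm_targets, is_next
```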
BERT Base
110M parameters.
16 TPUs, 4 days of training (without hyper-parameter tuning).
BERT Large
340M parameters.
64 TPUs, 4 days of training (without hyper-parameter tuning).
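For reference, the architecture hyper-parameters reported in the paper, written as a small Python dictionary; the dictionary layout is only an illustration, not an official configuration format.

```python
# Architecture hyper-parameters reported in the BERT paper;
# the dictionary itself is just an illustrative sketch.
BERT_CONFIGS = {
    "base":  {"layers": 12, "hidden_size": 768,  "attention_heads": 12},
    "large": {"layers": 24, "hidden_size": 1024, "attention_heads": 16},
}
```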
[1] Devlin, Chang, Lee, and Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT, 2019.
[2] Vaswani et al. Attention Is All You Need. In NIPS, 2017.