Seq2Seq

1. Tokenization & Build Dictionary

  • input_texts => [Eng_Tokenizer] => input_tokens

  • target_texts => [Deu_Tokenizer] => target_tokens

    • Use 2 different tokenizers for the 2 languages.

    • Then build 2 different dictionaries.

  • Tokenization can be char-level or word-level (see the char-level sketch below).

Eng_Tokenizer: "I_am_okay." => ['i', '_', 'a', 'm', ..., 'a', 'y']
Deu_Tokenizer: "Es geht mir gut" => ['e', 's', '_', ..., 'u', 't']
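
As a concrete illustration, here is a minimal char-level sketch with one dictionary per language. The corpora and helper names (build_char_dict, eng_dict, deu_dict) are illustrative assumptions, not the page's original code.

```python
# Minimal sketch: char-level tokenization with one dictionary per language.
def build_char_dict(texts):
    """Map every character seen in a corpus to an integer id."""
    chars = sorted({ch for text in texts for ch in text.lower()})
    return {ch: i for i, ch in enumerate(chars)}

input_texts = ["I am okay."]            # English source sentences
target_texts = ["Es geht mir gut"]      # German target sentences

eng_dict = build_char_dict(input_texts)    # e.g. {' ': 0, '.': 1, 'a': 2, ...}
deu_dict = build_char_dict(target_texts)   # built independently from the German text

# Tokenize: each sentence becomes a list of char ids from its own dictionary.
input_tokens = [[eng_dict[ch] for ch in text.lower()] for text in input_texts]
target_tokens = [[deu_dict[ch] for ch in text.lower()] for text in target_texts]
```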

Question: Why 2 different tokenizers and dictionaries?

Answer: At the char level, different languages have different alphabets/character sets.

  • English: A a, B b, C c …, Z z. (26 letters × 2).

  • German: 26 letters, 3 umlauts (Ä,Ö,Ü), and one ligature (ß).

  • Greek: Α α, Β β, Γ γ, Δ δ, …, Ω ω. (24 letters × 2).

  • Chinese: 金 木 水 火 土 … 赵 钱 孙 李 (a few thousand characters).

Question: Why 2 different tokenizers and dictionaries?

Answer: At the word level, different languages have different vocabularies.

2. One-Hot Encoding
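
A minimal sketch of this step, assuming char-id sequences like those produced in Section 1: each sentence is encoded as a row of a [num_sentences, max_seq_len, dict_size] tensor of 0s and 1s. The function name and toy ids are illustrative.

```python
# Minimal sketch: one-hot encode integer token sequences into a
# [num_sentences, max_seq_len, dict_size] tensor.
import numpy as np

def one_hot_encode(token_seqs, dict_size, max_len):
    out = np.zeros((len(token_seqs), max_len, dict_size), dtype="float32")
    for i, seq in enumerate(token_seqs):
        for t, token_id in enumerate(seq):
            out[i, t, token_id] = 1.0       # one 1 per time step; the rest stay 0
    return out

# Two toy sentences already mapped to char ids, with a dictionary of size 30:
encoder_input_data = one_hot_encode([[4, 0, 2, 9], [7, 1]], dict_size=30, max_len=4)
print(encoder_input_data.shape)             # (2, 4, 30)
```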

3. Training Seq2Seq Model
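
A sketch of one common way to set this up, in the style of the standard Keras char-level encoder-decoder example: the encoder LSTM passes its final states (h, c) to the decoder, and training uses teacher forcing. Dictionary sizes, dimensions, and variable names are assumptions, not the page's original code.

```python
# Sketch: char-level seq2seq training model (Keras functional API).
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 256
num_encoder_tokens = 70   # size of the English char dictionary (illustrative)
num_decoder_tokens = 80   # size of the German char dictionary (illustrative)

# Encoder: keep only the final states (h, c); the per-step outputs are discarded.
encoder_inputs = keras.Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: initialized with the encoder's final states, predicts the next char
# at every time step (teacher forcing: its input is the target shifted by one).
decoder_inputs = keras.Input(shape=(None, num_decoder_tokens))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = layers.Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
# model.fit([encoder_input_data, decoder_input_data], decoder_target_data, ...)
```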

4. Inference
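
At inference time the target sentence is not available, so the decoder generates one char at a time, feeding each prediction back in as its next input. Below is a greedy-decoding sketch that reuses the layer names from the training sketch above; the start/stop handling and the translate helper are assumptions.

```python
# Sketch: greedy decoding, reusing encoder_inputs, encoder_states, decoder_inputs,
# decoder_lstm, decoder_dense, latent_dim, num_decoder_tokens from the training sketch.
import numpy as np

encoder_model = keras.Model(encoder_inputs, encoder_states)

state_h_in = keras.Input(shape=(latent_dim,))
state_c_in = keras.Input(shape=(latent_dim,))
dec_out, h, c = decoder_lstm(decoder_inputs, initial_state=[state_h_in, state_c_in])
decoder_model = keras.Model([decoder_inputs, state_h_in, state_c_in],
                            [decoder_dense(dec_out), h, c])

def translate(one_hot_sentence, id_to_char, start_id, stop_char, max_len=60):
    states = encoder_model.predict(one_hot_sentence)    # encoder's final (h, c)
    target = np.zeros((1, 1, num_decoder_tokens))
    target[0, 0, start_id] = 1.0                        # feed the start token first
    decoded = ""
    for _ in range(max_len):
        probs, h, c = decoder_model.predict([target] + states)
        next_id = int(np.argmax(probs[0, -1, :]))       # greedy: most likely char
        if id_to_char[next_id] == stop_char:
            break
        decoded += id_to_char[next_id]
        target = np.zeros((1, 1, num_decoder_tokens))   # feed the prediction back in
        target[0, 0, next_id] = 1.0
        states = [h, c]                                 # carry the decoder state forward
    return decoded
```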

Summary

  • Encoder’s final states (h_T and c_T) have all the information of the English sentence.

  • If the sentence is long, the final states have forgotten early inputs.

  • Bi-LSTM (left-to-right and right-to-left) has longer memory.

  • Use Bi-LSTM in the encoder; use unidirectional LSTM in the decoder.

  • Use word-level tokenization instead of char-level.

    • The average length of English words is 4.5 letters.

    • The sequences will be 4.5x shorter.

    • Shorter sequence -> less likely to forget.

  • But you will need a large dataset!

    • Embedding Layer has many parameters -> overfitting!

    • The number of (frequently used) chars is ~10^2 -> one-hot encoding suffices.

    • The number of (frequently used) words is ~10^4 -> must use an embedding layer.
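
A sketch that follows these suggestions: word-level tokens through an Embedding layer, a Bi-LSTM encoder, and a unidirectional LSTM decoder. Vocabulary sizes and dimensions are illustrative assumptions, not values from the original page.

```python
# Sketch: word-level seq2seq with a Bi-LSTM encoder and a unidirectional decoder.
from tensorflow import keras
from tensorflow.keras import layers

src_vocab, tgt_vocab = 10_000, 10_000   # word-level vocabularies (illustrative)
emb_dim, latent_dim = 128, 256

# Encoder: embedding + bidirectional LSTM; concatenate forward/backward states.
enc_in = keras.Input(shape=(None,), dtype="int32")
enc_emb = layers.Embedding(src_vocab, emb_dim)(enc_in)
_, fh, fc, bh, bc = layers.Bidirectional(
    layers.LSTM(latent_dim, return_state=True))(enc_emb)
enc_h = layers.Concatenate()([fh, bh])
enc_c = layers.Concatenate()([fc, bc])

# Decoder: unidirectional LSTM, state size 2*latent_dim to match the encoder.
dec_in = keras.Input(shape=(None,), dtype="int32")
dec_emb = layers.Embedding(tgt_vocab, emb_dim)(dec_in)
dec_out, _, _ = layers.LSTM(2 * latent_dim, return_sequences=True,
                            return_state=True)(dec_emb, initial_state=[enc_h, enc_c])
dec_out = layers.Dense(tgt_vocab, activation="softmax")(dec_out)

model = keras.Model([enc_in, dec_in], dec_out)
model.compile(optimizer="rmsprop", loss="sparse_categorical_crossentropy")
```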