Seq2Seq
1. Tokenization & Build Dictionary
input_texts => [Eng_Tokenizer] => input_tokens
target_texts => [Deu_Tokenizer] => target_tokens
Use 2 different tokenizers for the 2 languages.
Then build 2 different dictionaries.
Tokenization can be char-level or word-level.
Question: Why 2 different tokenizers and dictionaries?
Answer: At the char level, different languages have different alphabets/characters.
English: A a, B b, C c …, Z z. (26 letters × 2).
German: 26 letters, 3 umlauts (Ä,Ö,Ü), and one ligature (ß).
Greek: Α α, Β β, Γ γ, Δ δ, …, Ω ω. (24 letters × 2).
Chinese: 金 木 水 火 土 … 赵 钱 孙 李 (a few thousand characters).
Question: Why 2 different tokenizers and dictionaries?
Answer: At the word level, different languages have different vocabularies.
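A minimal sketch of this step, assuming Keras' Tokenizer at char level; the tiny input_texts / target_texts lists below are placeholder data, not the real dataset, and the \t / \n markers are an assumed start/end convention.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Placeholder data: in practice these are the full English / German corpora.
input_texts  = ["Go.", "Run!", "I see."]
target_texts = ["\tGeh.\n", "\tLauf!\n", "\tIch sehe.\n"]   # \t / \n as start / end markers

# One tokenizer (and therefore one dictionary) per language.
eng_tokenizer = Tokenizer(char_level=True, filters='', lower=False)
deu_tokenizer = Tokenizer(char_level=True, filters='', lower=False)

eng_tokenizer.fit_on_texts(input_texts)    # builds the English char -> index dictionary
deu_tokenizer.fit_on_texts(target_texts)   # builds the German char -> index dictionary

input_tokens  = eng_tokenizer.texts_to_sequences(input_texts)
target_tokens = deu_tokenizer.texts_to_sequences(target_texts)

# Two different dictionaries, one per language:
print(len(eng_tokenizer.word_index), len(deu_tokenizer.word_index))
```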
2. One-Hot Encoding
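A minimal sketch of one-hot encoding the token ids, assuming NumPy and the Keras Tokenizer above (its ids start at 1, hence the +1 on the vector length).

```python
import numpy as np

def one_hot(sequences, vocab_size, max_len):
    """Encode a batch of token-id sequences as a (batch, max_len, vocab_size + 1) tensor."""
    x = np.zeros((len(sequences), max_len, vocab_size + 1), dtype='float32')
    for i, seq in enumerate(sequences):
        for t, token_id in enumerate(seq[:max_len]):
            x[i, t, token_id] = 1.0          # exactly one 1 per time step
    return x

# e.g. encoder_input_data = one_hot(input_tokens, len(eng_tokenizer.word_index), max_len=30)
```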
3. Training Seq2Seq Model
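A minimal sketch of the training graph, assuming the Keras functional API, one-hot inputs, and an LSTM with 256 units; num_eng_tokens / num_deu_tokens would come from the two dictionaries (the values below are illustrative). The decoder is trained with teacher forcing: it receives the target sequence shifted by one step and predicts the next char.

```python
from tensorflow import keras
from tensorflow.keras import layers

num_eng_tokens, num_deu_tokens = 50, 60     # illustrative dictionary sizes
latent_dim = 256

# Encoder: read the English sequence, keep only the final states h, c.
encoder_inputs = keras.Input(shape=(None, num_eng_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: start from the encoder's final states, predict the next char at every step.
decoder_inputs = keras.Input(shape=(None, num_deu_tokens))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = layers.Dense(num_deu_tokens, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
# model.fit([encoder_input_data, decoder_input_data], decoder_target_data, ...)
```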
4. Inference
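A minimal sketch of greedy decoding at inference time, reusing the trained layers from the training sketch above; '\t' / '\n' are the assumed start / end-of-sequence chars, and deu_index / deu_index_inv are the German char-to-id and id-to-char dictionaries.

```python
import numpy as np
from tensorflow import keras

# Re-wire the trained layers into two inference models.
encoder_model = keras.Model(encoder_inputs, encoder_states)

state_h_in = keras.Input(shape=(latent_dim,))
state_c_in = keras.Input(shape=(latent_dim,))
dec_out, h, c = decoder_lstm(decoder_inputs, initial_state=[state_h_in, state_c_in])
dec_out = decoder_dense(dec_out)
decoder_model = keras.Model([decoder_inputs, state_h_in, state_c_in], [dec_out, h, c])

def translate(input_seq, deu_index, deu_index_inv, max_len=60):
    states = encoder_model.predict(input_seq)              # final h, c of the encoder
    target = np.zeros((1, 1, num_deu_tokens))
    target[0, 0, deu_index['\t']] = 1.0                    # start-of-sequence char
    decoded = ''
    for _ in range(max_len):
        probs, h, c = decoder_model.predict([target] + states)
        token_id = int(np.argmax(probs[0, -1, :]))         # greedy: pick the most likely char
        char = deu_index_inv[token_id]
        if char == '\n':                                   # stop at the end-of-sequence char
            break
        decoded += char
        target = np.zeros((1, 1, num_deu_tokens))          # feed the prediction back in
        target[0, 0, token_id] = 1.0
        states = [h, c]
    return decoded
```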
Summary
Encoder’s final states (𝐡_T and 𝐜_T) have all the information of the English sentence.
If the sentence is long, the final states may have forgotten the early inputs.
Bi-LSTM (left-to-right and right-to-left) has longer memory.
Use Bi-LSTM in the encoder; use unidirectional LSTM in the decoder.
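A minimal sketch of that change, assuming Keras' Bidirectional wrapper: the forward and backward final states are concatenated, so the unidirectional decoder LSTM uses 2 × latent_dim units to match.

```python
from tensorflow import keras
from tensorflow.keras import layers

num_eng_tokens, num_deu_tokens, latent_dim = 50, 60, 256   # illustrative sizes

# Bi-LSTM encoder: reads the sentence left-to-right and right-to-left.
encoder_inputs = keras.Input(shape=(None, num_eng_tokens))
bi_encoder = layers.Bidirectional(layers.LSTM(latent_dim, return_state=True))
_, fwd_h, fwd_c, bwd_h, bwd_c = bi_encoder(encoder_inputs)
state_h = layers.Concatenate()([fwd_h, bwd_h])
state_c = layers.Concatenate()([fwd_c, bwd_c])

# Decoder stays unidirectional; its width matches the concatenated encoder states.
decoder_inputs = keras.Input(shape=(None, num_deu_tokens))
decoder_outputs, _, _ = layers.LSTM(2 * latent_dim, return_sequences=True,
                                    return_state=True)(decoder_inputs,
                                                       initial_state=[state_h, state_c])
```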
Word-level tokenization instead of char-level.
The average length of English words is 4.5 letters.
The sequences will be 4.5x shorter.
Shorter sequences -> less likely to forget.
But you will need a large dataset!
The number of (frequently used) chars is small (on the order of 10²) -> one-hot suffices.
The number of (frequently used) words is large (on the order of 10⁴) -> must use an embedding layer.
Embedding Layer has many parameters -> overfitting!
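A back-of-the-envelope illustration of that last point, using assumed, illustrative sizes: a word-level vocabulary of about 10⁴ entries and a 128-dimensional embedding.

```python
from tensorflow.keras import layers

vocab_size, embedding_dim = 10_000, 128              # illustrative sizes
embedding = layers.Embedding(vocab_size, embedding_dim)

# The embedding table alone holds vocab_size * embedding_dim trainable weights:
print(vocab_size * embedding_dim)                    # 1,280,000 parameters -> easy to overfit
```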