# Seq2Seq

## 1. Tokenization & Build Dictionary

* input\_texts => \[Eng\_Tokenizer] => input\_tokens
* target\_texts => \[Deu\_Tokenizer] => target\_tokens
  * Use 2 different tokenizers for the 2 languages.
  * Then build 2 different dictionaries.
* Tokenization can be char-level or word-level

```
Eng_Tokenizer: "I_am_okay." => ['i', '_', 'a', 'm', ..., 'a', 'y']
Deu_Tokenizer: "Es geht mir gut" => ['e', 's', '_', ..., 'u', 't']
```
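The steps above can be sketched in plain Python. This is a minimal illustration, not the page's actual tokenizer: the helper names `char_tokenize` and `build_dictionary` are hypothetical, punctuation is kept rather than filtered, and reserving index 0 for padding is an assumed convention.

```python
def char_tokenize(text):
    """Char-level tokenization: lowercase the text and split it into characters."""
    return list(text.lower())


def build_dictionary(texts):
    """Map every distinct character to an integer index.

    Indices start at 1; index 0 is left free for padding (an assumed convention).
    """
    vocab = sorted({ch for text in texts for ch in char_tokenize(text)})
    return {ch: i for i, ch in enumerate(vocab, start=1)}


input_texts = ["I am okay."]
target_texts = ["Es geht mir gut"]

# One dictionary per language, built only from that language's texts.
eng_dict = build_dictionary(input_texts)
deu_dict = build_dictionary(target_texts)

input_tokens = [char_tokenize(t) for t in input_texts]
input_ids = [[eng_dict[ch] for ch in tokens] for tokens in input_tokens]

print(input_tokens[0])
print(input_ids[0])
```

Because each dictionary is built from its own corpus, a character that appears only in the German texts never takes up an index in the English dictionary.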

### Question: Why 2 different tokenizers and dictionaries?

Answer (char level): Different languages use different alphabets/characters.

* English: A a, B b, C c …, Z z. (26 letters × 2).
* German: 26 letters, 3 umlauts (Ä,Ö,Ü), and one ligature (ß).
* Greek: Α α, Β β, Γ γ, Δ δ, …, Ω ω. (24 letters × 2).
* Chinese: 金 木 水 火 土 … 赵 钱 孙 李 (a few thousand characters).

Answer (word level): Different languages have different vocabularies, so one word dictionary cannot serve both.
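For the word-level case, a rough sketch shows how two separate vocabularies come out of the two corpora. The names `word_tokenize` and `build_vocab` are hypothetical, and splitting on `\w+` is a simplifying assumption (a real tokenizer would treat punctuation and contractions more carefully).

```python
import re


def word_tokenize(text):
    """Word-level tokenization: lowercase, then keep runs of word characters."""
    return re.findall(r"\w+", text.lower())


def build_vocab(texts):
    """Map every distinct word to an index, starting at 1 (0 assumed reserved for padding)."""
    words = sorted({w for text in texts for w in word_tokenize(text)})
    return {w: i for i, w in enumerate(words, start=1)}


eng_vocab = build_vocab(["I am okay."])
deu_vocab = build_vocab(["Es geht mir gut"])

print(eng_vocab)
print(deu_vocab)
```

In this toy example the two vocabularies share no entries at all, which is why each language gets its own dictionary.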

## 2. One-Hot Encoding

![](/files/VQllFmAuwxggXfcJDi7S)
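As a minimal sketch of this step, each token index can be turned into a vector with a single 1, so that a sentence becomes a matrix with one row per token. The function names and the toy dictionary are hypothetical, and the extra column for index 0 assumes that 0 is reserved for padding.

```python
def one_hot(index, vocab_size):
    """Return a vector of length vocab_size with a 1 at position index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


def encode_sequence(tokens, dictionary):
    """Turn a token sequence into a list of one-hot rows (one row per token)."""
    vocab_size = len(dictionary) + 1  # +1 because index 0 is assumed reserved
    return [one_hot(dictionary[tok], vocab_size) for tok in tokens]


toy_dictionary = {"a": 1, "b": 2, "c": 3}  # hypothetical 3-character dictionary
matrix = encode_sequence(list("cab"), toy_dictionary)

for row in matrix:
    print(row)
```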

## 3. Training Seq2Seq Model

![](/files/stv916OaQeGNQbUa1rZa)

![](/files/hO6I7obYNaPx8D1JEf3w)
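The page gives no text for this step, so the following is only a sketch of one common way to prepare decoder training data (teacher forcing), stated here as an assumption: the decoder reads the target sequence shifted right by one position and is trained to predict the next token. The start/stop symbols and the helper name `make_decoder_pair` are hypothetical choices.

```python
START, STOP = "\t", "\n"  # assumed start and stop symbols


def make_decoder_pair(target_text):
    """Build (decoder_input, decoder_target) for one target sentence.

    The decoder input is the target shifted right by one step, so at each
    position the model is asked to predict the token that comes next.
    """
    tokens = [START] + list(target_text) + [STOP]
    decoder_input = tokens[:-1]   # START, t1, ..., tn
    decoder_target = tokens[1:]   # t1, ..., tn, STOP
    return decoder_input, decoder_target


dec_in, dec_out = make_decoder_pair("gut")
for x, y in zip(dec_in, dec_out):
    print(repr(x), "->", repr(y))
```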

## 4. Inference

![](/files/A4yP6eVcPmYvdb93g5Gz)

![](/files/PafgyiNMliXRRjjeGUKc)

![](/files/8A4XeZQbnO5SvnSkHGiK)

![](/files/P96GCZEc6pfNixHb3vfz)
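As a rough sketch of the inference loop, the decoder can be run one step at a time, feeding each predicted token and the updated state back in until a stop symbol appears. Greedy selection (always taking the most probable token) is an assumption here; sampling from the distribution is another option. `toy_step` is a hypothetical stand-in for a trained decoder, and the start/stop symbols are assumed.

```python
START, STOP = "\t", "\n"  # assumed start and stop symbols


def greedy_decode(initial_state, step, max_len=50):
    """Generate text token by token.

    step(token, state) must return (probs, new_state), where probs maps
    each candidate token to its probability.
    """
    state, token, output = initial_state, START, []
    for _ in range(max_len):
        probs, state = step(token, state)
        token = max(probs, key=probs.get)  # greedy choice
        if token == STOP:
            break
        output.append(token)
    return "".join(output)


def toy_step(token, state):
    """Hypothetical decoder: ignores its input and spells out a fixed string."""
    target = "gut" + STOP
    probs = {ch: 0.0 for ch in target}
    probs[target[state]] = 1.0
    return probs, state + 1


print(greedy_decode(0, toy_step))
```

In a real model, `initial_state` would be the encoder's final states and `step` would run one LSTM step followed by a softmax layer.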

## Summary

![](/files/gM7vZCDgpuz7YRngodwK)

* The encoder’s final states ($$\mathbf{h}_t$$ and $$\mathbf{c}_t$$) have all the information of the English sentence.
* If the sentence is long, the final states may have forgotten the early inputs.
* A Bi-LSTM (left-to-right and right-to-left) has longer memory, because the right-to-left pass reads the early inputs last.
* Use Bi-LSTM in the encoder; use unidirectional LSTM in the decoder.
* Use word-level tokenization instead of char-level.
  * The average length of English words is 4.5 letters.
  * The sequences will be about 4.5x shorter.
  * Shorter sequence -> less likely to forget.
* But word-level tokenization needs a large dataset!
  * **The number of (frequently used) chars is \~**$$10^2$$ **-> one-hot suffices.**
  * **The number of (frequently used) words is \~**$$10^4$$ **-> must use embedding.**
  * The embedding layer has many parameters -> risk of overfitting on a small dataset.
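The trade-off in the last bullets can be made concrete with a little arithmetic. This is a back-of-the-envelope sketch: the embedding dimension of 256 and the 90-character sentence are assumed values chosen for illustration, not figures from the page.

```python
def embedding_params(vocab_size, embedding_dim):
    """Weights in an embedding table: one row of size embedding_dim per token."""
    return vocab_size * embedding_dim


CHAR_VOCAB = 10**2    # ~number of frequently used chars (from the text)
WORD_VOCAB = 10**4    # ~number of frequently used words (from the text)
EMBEDDING_DIM = 256   # assumed embedding size

# Char level: a one-hot vector of width ~100 is small enough to feed directly.
one_hot_width = CHAR_VOCAB

# Word level: the embedding table alone adds millions of trainable weights.
word_embedding_weights = embedding_params(WORD_VOCAB, EMBEDDING_DIM)

# Sequence length: at 4.5 letters per word, a 90-letter sentence is ~20 words.
sentence_chars = 90   # hypothetical sentence length
approx_words = sentence_chars / 4.5

print(one_hot_width, word_embedding_weights, approx_words)
```

The shorter sequences ease the forgetting problem, while the extra embedding weights are what make a large dataset necessary.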

