BLEU
Example of poor machine translation output with high precision

Candidate: the the the the the the the
Reference 1: the cat is on the mat
Reference 2: there is a cat on the mat
Of the seven words in the candidate translation, all of them appear in the reference translations. Thus the candidate text is given a unigram precision of

$$P = \frac{m}{w_t} = \frac{7}{7} = 1,$$

where $m$ is the number of words from the candidate that are found in the reference, and $w_t$ is the total number of words in the candidate. This is a perfect score, despite the fact that the candidate translation above retains little of the content of either of the references.

The modification that BLEU makes is fairly straightforward. For each word in the candidate translation, the algorithm takes its maximum total count, $m_{max}$, in any of the reference translations. In the example above, the word "the" appears twice in reference 1, and once in reference 2. Thus $m_{max} = 2$.

For the candidate translation, the count $m_w$ of each word is clipped to a maximum of $m_{max}$ for that word. In this case, "the" has $m_w = 7$ and $m_{max} = 2$, so $m_w$ is clipped to 2. These clipped counts are then summed over all distinct words in the candidate. This sum is then divided by the total number of unigrams in the candidate translation. In the above example, the modified unigram precision score would be:

$$P = \frac{2}{7}$$
In practice, however, using individual words as the unit of comparison is not optimal. Instead, BLEU computes the same modified precision metric using n-grams.
Here is an example of a 2-gram (bigram) modified precision calculation:
Candidate: the cat the cat on the mat
Reference 1: the cat is on the mat
Reference 2: there is a cat on the mat
| bigram | count | count_clip |
| --- | --- | --- |
| the cat | 2 | 1 |
| cat the | 1 | 0 |
| cat on | 1 | 1 |
| on the | 1 | 1 |
| the mat | 1 | 1 |
Here count is the number of times each bigram occurs in the candidate, and count_clip is that count clipped to the maximum number of times the bigram appears in any single reference. Summing the clipped counts and dividing by the total number of candidate bigrams gives a modified bigram precision of (1 + 0 + 1 + 1 + 1) / 6 = 2/3.
The modified n-gram precisions are combined into a single score as a weighted geometric mean:

$$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$$

where $p_n$ is the BLEU (modified precision) score computed on n-grams of order $n$ only, and $w_n$ is the weight given to each order (typically uniform, $w_n = 1/N$).

The n-gram length $N$ found to have the "highest correlation with monolingual human judgements"[1] is four. The unigram scores account for the adequacy of the translation, i.e. how much information is retained, while the longer n-gram scores account for the fluency of the translation, or the extent to which it reads like "good English".

$BP$ is the brevity penalty: $BP = 1$ if the candidate length $c$ exceeds the effective reference length $r$, and $BP = e^{1 - r/c}$ otherwise. Because shorter translations tend to achieve higher precision, a system could otherwise inflate its score by generating very short output; the penalty counteracts this by penalizing candidates that are shorter than the references.
The Python Natural Language Toolkit (NLTK) library provides an implementation of BLEU scoring. You can use it to evaluate generated text by comparing it with reference text.

NLTK provides the sentence_bleu() function to evaluate a candidate sentence against one or more reference sentences.
The reference sentences must be provided as a list of sentences, where each sentence is a list of tokens. The candidate sentence is provided as a list of tokens. For example:
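Here is a minimal sketch of such a call, using illustrative token lists in which the candidate exactly matches the first reference:

```python
from nltk.translate.bleu_score import sentence_bleu

# two reference translations, each a list of tokens (illustrative sentences)
reference = [['this', 'is', 'a', 'test'], ['this', 'is', 'test']]
# candidate translation as a list of tokens; identical to the first reference
candidate = ['this', 'is', 'a', 'test']

score = sentence_bleu(reference, candidate)
print(score)  # 1.0
```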
Running this example will output a perfect score of 1.0, because the candidate sentence exactly matches one of the reference sentences.
NLTK also provides a function called corpus_bleu() to calculate the BLEU score for multiple sentences, such as a paragraph or a document.
The references must be specified as a list of documents, where each document is a list of reference sentences, and each alternative reference sentence is itself a list of tokens; in other words, the references are a list of lists of token lists. The candidate documents must be specified as a list, where each document is a list of tokens; in other words, the candidates are a list of token lists.
This can be a little confusing; below is an example of two references for one document.
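A minimal sketch of such a call, again with illustrative token lists (one candidate document with two alternative references):

```python
from nltk.translate.bleu_score import corpus_bleu

# one document with two alternative references: a list of lists of token lists
references = [[['this', 'is', 'a', 'test'], ['this', 'is', 'test']]]
# one candidate document: a list of token lists
candidates = [['this', 'is', 'a', 'test']]

score = corpus_bleu(references, candidates)
print(score)  # 1.0
```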
Running this example outputs a perfect score, as before.
The BLEU scoring functions in NLTK allow you to assign weights to the different n-gram orders when calculating the BLEU score.
This gives you the flexibility to calculate different types of BLEU scores, such as individual and cumulative n-gram scores.
Let's take a look.
An individual n-gram score evaluates matches for a single n-gram order only, such as single words (1-grams) or word pairs (2-grams, or bigrams).
The weights are specified as a tuple, where each position corresponds to an n-gram order. To calculate the BLEU score for 1-gram matches only, specify a weight of 1 for 1-grams and 0 for 2-, 3- and 4-grams, i.e. weights of (1, 0, 0, 0). For example:
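A minimal sketch with illustrative sentences chosen so that exactly half of the candidate's tokens appear in the reference (matching the score reported below):

```python
from nltk.translate.bleu_score import sentence_bleu

# illustrative sentences: 4 of the 8 candidate tokens appear in the reference
reference = [['this', 'is', 'a', 'test', 'of', 'the', 'bleu', 'score']]
candidate = ['this', 'is', 'a', 'test', 'written', 'by', 'another', 'system']

# weights (1, 0, 0, 0) count only 1-gram matches
score = sentence_bleu(reference, candidate, weights=(1, 0, 0, 0))
print(score)  # 0.5
```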
Running this example will output a score of 0.5.
We can repeat this example for each individual n-gram order from 1-gram to 4-gram as follows:
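Using the same illustrative sentences:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [['this', 'is', 'a', 'test', 'of', 'the', 'bleu', 'score']]
candidate = ['this', 'is', 'a', 'test', 'written', 'by', 'another', 'system']

# each weight tuple isolates a single n-gram order
print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))
```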
Running the example, the results are as follows:
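With the illustrative sentences above, the individual scores are just the raw n-gram precisions 4/8, 3/7, 2/6 and 1/5:

```
Individual 1-gram: 0.500000
Individual 2-gram: 0.428571
Individual 3-gram: 0.333333
Individual 4-gram: 0.200000
```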
Although we can calculate individual n-gram scores in this way, this is not how the method was intended to be used, and the resulting scores do not carry much meaning on their own, nor are they very interpretable.
A cumulative score combines the individual n-gram scores from order 1 up to n by taking their weighted geometric mean.
By default, sentence_bleu() and corpus_bleu() calculate the cumulative 4-gram BLEU score, also known as the BLEU-4 score.
BLEU-4 assigns a weight of 1/4 (25%, or 0.25) to each of the 1-gram, 2-gram, 3-gram and 4-gram precisions. For example:
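A sketch with the same illustrative sentences as above (these weights are also the defaults):

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [['this', 'is', 'a', 'test', 'of', 'the', 'bleu', 'score']]
candidate = ['this', 'is', 'a', 'test', 'written', 'by', 'another', 'system']

# cumulative 4-gram (BLEU-4) score: weighted geometric mean of the 1- to 4-gram precisions
score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(score)  # ~0.35
```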
Running this example outputs the cumulative BLEU-4 score.
The cumulative 1-gram BLEU score uses the same weights as the individual 1-gram score, namely (1, 0, 0, 0). The cumulative 2-gram BLEU score gives a weight of 50% each to the 1-gram and 2-gram precisions, and the cumulative 3-gram BLEU score gives a weight of 33% each to the 1-gram, 2-gram and 3-gram precisions.
Let's make this concrete by calculating the cumulative BLEU-1, BLEU-2, BLEU-3 and BLEU-4 scores:
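Again with the same illustrative sentences (approximate scores shown as comments):

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [['this', 'is', 'a', 'test', 'of', 'the', 'bleu', 'score']]
candidate = ['this', 'is', 'a', 'test', 'written', 'by', 'another', 'system']

# cumulative scores spread the weight evenly over n-gram orders 1..n
print('Cumulative 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))              # ~0.50
print('Cumulative 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0.5, 0.5, 0, 0)))          # ~0.46
print('Cumulative 3-gram: %f' % sentence_bleu(reference, candidate, weights=(1/3, 1/3, 1/3, 0)))        # ~0.41
print('Cumulative 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25)))  # ~0.35
```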
Running this example outputs the four cumulative scores. They differ noticeably from one another and are more expressive than the individual n-gram scores.
When describing the performance of a text generation system, the cumulative BLEU-1 to BLEU-4 scores are usually reported.
In this section, we work through a few more examples to build intuition for how the BLEU score behaves.
At the sentence level, we use the following reference sentence to illustrate:
the quick brown fox jumped over the lazy dog
First, let's look at a perfect score.
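A sketch in which the candidate is identical to the reference (the token lists spell out the sentence above):

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

score = sentence_bleu(reference, candidate)
print(score)  # 1.0
```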
Running the example outputs a perfect score of 1.0.
Next, let's change one word, replacing "quick" with "fast".
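Only the candidate changes:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']

score = sentence_bleu(reference, candidate)
print(score)  # ~0.75
```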
The result is a slight drop in the score.
Now try changing two words, replacing "quick" with "fast" and "lazy" with "sleepy".
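Two tokens of the candidate now differ from the reference:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'fast', 'brown', 'fox', 'jumped', 'over', 'the', 'sleepy', 'dog']

score = sentence_bleu(reference, candidate)
print(score)  # ~0.49
```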
Running this example, we can see that the score drops roughly linearly with the number of changed words.
What if all the words in the candidate sentence are different from those in the reference sentence?
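A sketch with a candidate that shares no tokens with the reference (the tokens are arbitrary placeholders):

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['a', 'completely', 'different', 'sentence', 'with', 'no', 'overlapping', 'words', 'here']

score = sentence_bleu(reference, candidate)
print(score)  # 0.0 (NLTK may also warn about the missing n-gram overlaps)
```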
We get the worst possible score of 0.0.
Now, let's try a candidate sentence that has fewer words than the reference sentence (here, the last two words are dropped), but in which every remaining word is correct.
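The candidate is the reference with its last two tokens removed:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the']

score = sentence_bleu(reference, candidate)
print(score)  # ~0.75; all n-gram precisions are 1, so the drop comes from the brevity penalty
```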
The score is very similar to the case with a single word error, even though no word here is wrong: all of the n-gram precisions are perfect, and the drop comes entirely from the brevity penalty.
What if the candidate sentence is two words longer than the reference sentence?
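A sketch that appends two extra (arbitrary) tokens to an otherwise perfect candidate:

```python
from nltk.translate.bleu_score import sentence_bleu

reference = [['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog']]
candidate = ['the', 'quick', 'brown', 'fox', 'jumped', 'over', 'the', 'lazy', 'dog', 'from', 'space']

score = sentence_bleu(reference, candidate)
print(score)  # ~0.79
```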
Once again, we can see that our intuition holds: the extra words only dilute the n-gram precisions moderately, and the brevity penalty does not punish candidates that are longer than the reference.