BLEU (Bilingual Evaluation Understudy) is a standard metric for evaluating the quality of machine translation and other text-generation tasks. It scores a candidate translation by measuring its n-gram overlap with one or more reference translations.
```mermaid
graph TD
A[Input: candidate + reference translations] --> B[Extract n-grams]
B --> C[Count n-gram overlap]
C --> D[Compute precisions]
D --> E[Geometric mean]
E --> F[Apply brevity penalty BP]
F --> G[Final BLEU score]
B --> B1[1-gram: single words]
B --> B2[2-gram: word pairs]
B --> B3[3-gram: word triples]
B --> B4[4-gram: word quadruples]
style A fill:#e1f5fe
style G fill:#e8f5e8
style E fill:#fff3e0
style F fill:#ffebee
```
```python
import collections

def _get_ngrams(segment, max_order):
    """Extract all n-grams up to max_order, with their counts."""
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(0, len(segment) - order + 1):
            ngram = tuple(segment[i : i + order])
            ngram_counts[ngram] += 1
    return ngram_counts
```
Example:

```python
segment = ["the", "cat", "is", "on", "the", "mat"]
# 1-grams: ("the",), ("cat",), ("is",), ("on",), ("the",), ("mat",)
# 2-grams: ("the","cat"), ("cat","is"), ("is","on"), ("on","the"), ("the","mat")
# 3-grams: ("the","cat","is"), ("cat","is","on"), ("is","on","the"), ("on","the","mat")
# 4-grams: ("the","cat","is","on"), ("cat","is","on","the"), ("is","on","the","mat")
```
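The extraction above can be checked directly by running `_get_ngrams` on the example segment (a minimal self-contained sketch; the function body is the one shown above):

```python
import collections

def _get_ngrams(segment, max_order):
    """Extract all n-grams up to max_order, with their counts."""
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(0, len(segment) - order + 1):
            ngram_counts[tuple(segment[i : i + order])] += 1
    return ngram_counts

segment = ["the", "cat", "is", "on", "the", "mat"]
counts = _get_ngrams(segment, 4)
print(counts[("the",)])                          # 2 ("the" occurs twice)
print(counts[("the", "cat")])                    # 1
print(sum(1 for ng in counts if len(ng) == 4))   # 3 distinct 4-grams
```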
```python
# Merge n-gram counts from all reference translations
merged_ref_ngram_counts = collections.Counter()
for reference in references:
    merged_ref_ngram_counts |= _get_ngrams(reference, max_order)

# Count n-grams in the candidate translation
translation_ngram_counts = _get_ngrams(translation, max_order)

# Overlap: element-wise minimum of the two Counters (clipped counts)
overlap = translation_ngram_counts & merged_ref_ngram_counts
for ngram in overlap:
    matches_by_order[len(ngram) - 1] += overlap[ngram]
```
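The `&` operator on `collections.Counter` takes the element-wise minimum, which is exactly the "clipping" BLEU needs: a candidate n-gram is credited at most as many times as it appears in the references. A small standalone illustration (the counts are hypothetical):

```python
import collections

# Hypothetical unigram counts for a candidate and a reference
cand = collections.Counter({("the",): 2, ("cat",): 1, ("is",): 1})
ref = collections.Counter({("the",): 2, ("cat",): 1, ("sits",): 1})

overlap = cand & ref          # element-wise minimum (clipped counts)
print(sum(overlap.values()))  # 3 matched unigrams: the(2) + cat(1)
```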
With the overlap counted, the precision at each order is computed as follows:

```python
precisions = [0] * max_order
for i in range(0, max_order):
    if smooth:
        # Smoothing from Lin & Och (2004)
        precisions[i] = (matches_by_order[i] + 1.0) / (
            possible_matches_by_order[i] + 1.0
        )
    else:
        if possible_matches_by_order[i] > 0:
            precisions[i] = (
                float(matches_by_order[i]) / possible_matches_by_order[i]
            )
        else:
            precisions[i] = 0.0
```
Precision formula:

P_n = (number of matched n-grams) / (total number of n-grams in the candidate)
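Using the match counts from the worked example later in this section, the per-order precisions can be computed in one pass (a minimal sketch; the list names mirror the snippet above):

```python
# Counts from the worked example: 5/6 unigrams, 3/5 bigrams, 0/4, 0/3
matches_by_order = [5, 3, 0, 0]
possible_matches_by_order = [6, 5, 4, 3]

precisions = [
    m / p if p > 0 else 0.0
    for m, p in zip(matches_by_order, possible_matches_by_order)
]
print([round(p, 3) for p in precisions])  # [0.833, 0.6, 0.0, 0.0]
```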
```python
if min(precisions) > 0:
    p_log_sum = sum((1.0 / max_order) * math.log(p) for p in precisions)
    geo_mean = math.exp(p_log_sum)
else:
    geo_mean = 0
```
Geometric mean formula:

geo_mean = (P_1 × P_2 × P_3 × P_4)^(1/4)
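The log-sum-exp form in the code is numerically equivalent to the product formula above. A small sketch showing both the zero-collapse case and a normal case (the second precision vector is made up for illustration):

```python
import math

def geo_mean(precisions, max_order=4):
    # Any zero precision collapses the geometric mean to zero,
    # since log(0) is undefined.
    if min(precisions) > 0:
        return math.exp(sum((1.0 / max_order) * math.log(p) for p in precisions))
    return 0.0

print(geo_mean([0.833, 0.6, 0.0, 0.0]))          # 0.0
print(round(geo_mean([0.8, 0.6, 0.5, 0.4]), 4))  # (0.8*0.6*0.5*0.4) ** 0.25
```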
```python
ratio = float(translation_length) / reference_length
if ratio > 1.0:
    bp = 1.0  # candidate is longer than the reference: no penalty
else:
    bp = math.exp(1 - 1.0 / ratio)  # candidate is shorter: apply a penalty
```
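Wrapped as a function, the penalty is 1.0 at equal length and shrinks exponentially as the candidate gets shorter (the function name is mine, not from the snippet above):

```python
import math

def brevity_penalty(translation_length, reference_length):
    ratio = float(translation_length) / reference_length
    if ratio > 1.0:
        return 1.0  # no penalty for longer candidates
    return math.exp(1 - 1.0 / ratio)

print(brevity_penalty(6, 6))            # 1.0 (equal length)
print(round(brevity_penalty(3, 6), 4))  # exp(1 - 2) = exp(-1) ≈ 0.3679
```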
Applying the brevity penalty:

```python
bleu = geo_mean * bp
```
Full formula:

BLEU = BP × exp(∑(w_n × log(P_n)))

where:

- BP: brevity penalty factor
- w_n: weight for each n-gram order (typically 1/N)
- P_n: n-gram precision at order n

Worked example. Candidate: "the cat is on the mat"; reference: "the cat sits on the mat".
Step 1: n-gram extraction

```
Candidate n-grams:
  1-gram: the(2), cat(1), is(1), on(1), mat(1)
  2-gram: (the,cat)(1), (cat,is)(1), (is,on)(1), (on,the)(1), (the,mat)(1)

Reference n-grams:
  1-gram: the(2), cat(1), sits(1), on(1), mat(1)
  2-gram: (the,cat)(1), (cat,sits)(1), (sits,on)(1), (on,the)(1), (the,mat)(1)
```

Step 2: overlap counting

```
1-gram matches: the(2), cat(1), on(1), mat(1) = 5
2-gram matches: (the,cat)(1), (on,the)(1), (the,mat)(1) = 3
3-gram matches: 0
4-gram matches: 0
```

Step 3: precisions

```
P1 = 5/6 = 0.833  # 5 matches / 6 candidate 1-grams
P2 = 3/5 = 0.600  # 3 matches / 5 candidate 2-grams
P3 = 0/4 = 0.000  # 0 matches / 4 candidate 3-grams
P4 = 0/3 = 0.000  # 0 matches / 3 candidate 4-grams
```

Step 4: geometric mean

```
# P3 and P4 are zero, so the geometric mean is zero
geo_mean = 0
```

Step 5: brevity penalty

```
ratio = 6/6 = 1.0
BP = 1.0  # no penalty
```

Final BLEU: 0 × 1.0 = 0
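The whole pipeline can be assembled into one self-contained sketch that reproduces this result. This is a minimal illustration, not PaddleOCR's exact implementation; in particular, taking the shortest reference as `reference_length` is an assumption:

```python
import collections
import math

def _get_ngrams(segment, max_order):
    """Collect n-gram counts up to max_order."""
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(len(segment) - order + 1):
            ngram_counts[tuple(segment[i:i + order])] += 1
    return ngram_counts

def compute_bleu(references, translation, max_order=4, smooth=False):
    # Merge references (max count per n-gram), then clip with the candidate
    merged_ref = collections.Counter()
    for reference in references:
        merged_ref |= _get_ngrams(reference, max_order)
    overlap = _get_ngrams(translation, max_order) & merged_ref

    matches = [0] * max_order
    for ngram in overlap:
        matches[len(ngram) - 1] += overlap[ngram]
    possible = [max(len(translation) - order + 1, 0)
                for order in range(1, max_order + 1)]

    if smooth:
        precisions = [(m + 1.0) / (p + 1.0) for m, p in zip(matches, possible)]
    else:
        precisions = [m / p if p > 0 else 0.0 for m, p in zip(matches, possible)]

    if min(precisions) > 0:
        geo_mean = math.exp(sum(math.log(p) / max_order for p in precisions))
    else:
        geo_mean = 0.0

    # Assumption: use the shortest reference as the effective length
    ratio = len(translation) / min(len(r) for r in references)
    bp = 1.0 if ratio > 1.0 else math.exp(1 - 1.0 / ratio)
    return geo_mean * bp

candidate = "the cat is on the mat".split()
reference = "the cat sits on the mat".split()
print(compute_bleu([reference], candidate))  # 0.0 (P3 = P4 = 0)
```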
When any n-gram precision is zero, the geometric mean, and with it the BLEU score, collapses to zero. Smoothing addresses this:

```python
if smooth:
    precisions[i] = (matches_by_order[i] + 1.0) / (
        possible_matches_by_order[i] + 1.0
    )
```

Lin & Och (2004) smoothing: add 1 to both numerator and denominator so that no precision is ever zero.
```python
# Merge n-grams across multiple reference translations
merged_ref_ngram_counts = collections.Counter()
for reference in references:
    merged_ref_ngram_counts |= _get_ngrams(reference, max_order)
```

Strategy: for each n-gram, keep the maximum count observed in any single reference.
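`Counter`'s `|=` operator implements exactly this maximum-count merge, as a small standalone check shows (the counts are hypothetical):

```python
import collections

ref1 = collections.Counter({("the",): 2, ("cat",): 1})
ref2 = collections.Counter({("the",): 1, ("dog",): 1})

merged = collections.Counter()
for ref in (ref1, ref2):
    merged |= ref  # union keeps the maximum count per n-gram

print(merged[("the",)])  # 2 (maximum across references, not the sum 3)
print(merged[("dog",)])  # 1
```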
Typical BLEU scores reported in the documentation:

| Model | Task | BLEU score | Notes |
|---|---|---|---|
| UniMERNet | Math formula recognition | 85.91% | High-quality formula recognition |
| LaTeX-OCR | LaTeX formula recognition | ~80% | Handles complex formulas |
| PP-FormulaNet | Formula recognition | ~75% | Balances accuracy and speed |
The BLEU algorithm gives PaddleOCR's sequence-generation tasks, formula recognition in particular, an internationally standard quality-evaluation mechanism that captures both the accuracy and the fluency of generated text.