
How the compute_bleu Algorithm Works

1. Overview of BLEU

BLEU (Bilingual Evaluation Understudy) is the standard metric for evaluating the quality of machine translation and other text generation. It scores a candidate translation by measuring its n-gram overlap with one or more reference translations.

2. Core Algorithm Flow

graph TD
    A[Input: candidate + reference translations] --> B[Extract n-grams]
    B --> C[Count n-gram overlap]
    C --> D[Compute precisions]
    D --> E[Geometric mean]
    E --> F[Apply brevity penalty BP]
    F --> G[Final BLEU score]
    
    B --> B1[1-gram: single words]
    B --> B2[2-gram: word pairs]
    B --> B3[3-gram: word triples]
    B --> B4[4-gram: word quadruples]
    
    style A fill:#e1f5fe
    style G fill:#e8f5e8
    style E fill:#fff3e0
    style F fill:#ffebee

3. Detailed Algorithm Steps

3.1 N-gram Extraction

import collections

def _get_ngrams(segment, max_order):
    """Count all n-grams up to the maximum order."""
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(0, len(segment) - order + 1):
            ngram = tuple(segment[i : i + order])
            ngram_counts[ngram] += 1
    return ngram_counts

Example:

segment = ["the", "cat", "is", "on", "the", "mat"]

# 1-gram: ("the",), ("cat",), ("is",), ("on",), ("the",), ("mat",)
# 2-gram: ("the","cat"), ("cat","is"), ("is","on"), ("on","the"), ("the","mat")
# 3-gram: ("the","cat","is"), ("cat","is","on"), ("is","on","the"), ("on","the","mat")
# 4-gram: ("the","cat","is","on"), ("cat","is","on","the"), ("is","on","the","mat")
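The extraction can be checked with a short, self-contained sketch that re-implements `_get_ngrams` as described above:

```python
import collections

def _get_ngrams(segment, max_order):
    """Count all n-grams of the segment up to max_order."""
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(len(segment) - order + 1):
            ngram_counts[tuple(segment[i : i + order])] += 1
    return ngram_counts

segment = ["the", "cat", "is", "on", "the", "mat"]
counts = _get_ngrams(segment, 2)
print(counts[("the",)])        # 2: "the" occurs twice
print(counts[("the", "cat")])  # 1: the bigram occurs once
print(sum(counts.values()))    # 11: six 1-grams + five 2-grams
```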

3.2 N-gram Overlap Counting

# Merge the n-grams of all reference translations
merged_ref_ngram_counts = collections.Counter()
for reference in references:
    merged_ref_ngram_counts |= _get_ngrams(reference, max_order)

# Count the n-grams of the candidate translation
translation_ngram_counts = _get_ngrams(translation, max_order)

# Compute the (clipped) overlap
overlap = translation_ngram_counts & merged_ref_ngram_counts
for ngram in overlap:
    matches_by_order[len(ngram) - 1] += overlap[ngram]

How the overlap count works:

  • For each n-gram, take the minimum of its counts in the candidate and in the references (`Counter &` computes exactly this element-wise minimum)
  • This clipping prevents the same n-gram from being credited more often than it appears in the references
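The clipping can be seen in isolation with `collections.Counter` (the counts below are hypothetical, not tied to any sentence):

```python
import collections

cand_counts = collections.Counter({("the",): 3, ("cat",): 1})  # hypothetical candidate counts
ref_counts = collections.Counter({("the",): 2, ("mat",): 1})   # hypothetical reference counts

overlap = cand_counts & ref_counts  # element-wise minimum (clipped matches)
print(overlap[("the",)])  # 2: credited at most as often as in the reference
print(overlap[("cat",)])  # 0: absent from the reference
```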

3.3 Precision Calculation

precisions = [0] * max_order
for i in range(0, max_order):
    if smooth:
        # Lin et al. 2004 smoothing
        precisions[i] = (matches_by_order[i] + 1.0) / (
            possible_matches_by_order[i] + 1.0
        )
    else:
        if possible_matches_by_order[i] > 0:
            precisions[i] = (
                float(matches_by_order[i]) / possible_matches_by_order[i]
            )
        else:
            precisions[i] = 0.0

Precision formula:

P_n = (number of matched n-grams) / (total number of n-grams in the candidate)

3.4 Geometric Mean

if min(precisions) > 0:
    p_log_sum = sum((1.0 / max_order) * math.log(p) for p in precisions)
    geo_mean = math.exp(p_log_sum)
else:
    geo_mean = 0

Geometric mean formula (for max_order = 4):

geo_mean = (P_1 × P_2 × P_3 × P_4)^(1/4)
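The log-space form and the direct product give the same result; a quick check with assumed precision values:

```python
import math

precisions = [0.833, 0.600, 0.250, 0.125]  # assumed example values, all > 0
max_order = len(precisions)

# Equal weights 1/N applied in log space, then exponentiated
p_log_sum = sum((1.0 / max_order) * math.log(p) for p in precisions)
geo_mean = math.exp(p_log_sum)

# Direct product form: (P1 * P2 * P3 * P4) ** (1/4)
direct = (precisions[0] * precisions[1] * precisions[2] * precisions[3]) ** 0.25
print(abs(geo_mean - direct) < 1e-12)  # True: both forms agree
```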

3.5 Brevity Penalty (BP)

ratio = float(translation_length) / reference_length

if ratio > 1.0:
    bp = 1.0  # candidate longer than the reference: no penalty
else:
    bp = math.exp(1 - 1.0 / ratio)  # candidate shorter: apply a penalty

What BP does:

  • Prevents overly short translations from receiving inflated scores
  • Encourages outputs of appropriate length
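A numerical sketch of the penalty as defined above (the helper name `brevity_penalty` is illustrative):

```python
import math

def brevity_penalty(translation_length, reference_length):
    """BP as defined above: penalize only candidates shorter than the reference."""
    ratio = float(translation_length) / reference_length
    if ratio > 1.0:
        return 1.0
    return math.exp(1 - 1.0 / ratio)

print(brevity_penalty(12, 10))  # 1.0: longer candidate, no penalty
print(brevity_penalty(10, 10))  # 1.0: equal length, exp(0) = 1
print(brevity_penalty(5, 10))   # ~0.368: half length, exp(1 - 2) = e^-1
```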

4. Final BLEU Score

bleu = geo_mean * bp

Complete formula:

BLEU = BP × exp(∑ (w_n × log(P_n)))

where:

  • BP: the brevity penalty factor
  • w_n: the n-gram weights (typically 1/N)
  • P_n: the n-gram precisions
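Putting the steps together, a minimal end-to-end sketch (modeled on the pieces above, not the production compute_bleu; it handles a single sentence pair, and using the shortest reference length for BP is an assumption of this sketch):

```python
import collections
import math

def _get_ngrams(segment, max_order):
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(len(segment) - order + 1):
            ngram_counts[tuple(segment[i : i + order])] += 1
    return ngram_counts

def compute_bleu_sketch(references, translation, max_order=4, smooth=False):
    """BLEU for one candidate against one or more references (tokenized lists)."""
    # Merge references: per-n-gram maximum across references
    merged = collections.Counter()
    for reference in references:
        merged |= _get_ngrams(reference, max_order)

    # Clipped matches per order
    overlap = _get_ngrams(translation, max_order) & merged
    matches = [0] * max_order
    for ngram, count in overlap.items():
        matches[len(ngram) - 1] += count
    possible = [max(len(translation) - order + 1, 0)
                for order in range(1, max_order + 1)]

    # Modified n-gram precisions, optionally smoothed (Lin et al. 2004)
    precisions = [0.0] * max_order
    for i in range(max_order):
        if smooth:
            precisions[i] = (matches[i] + 1.0) / (possible[i] + 1.0)
        elif possible[i] > 0:
            precisions[i] = matches[i] / possible[i]

    # Geometric mean of the precisions
    if min(precisions) > 0:
        geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_order)
    else:
        geo_mean = 0.0

    # Brevity penalty against the shortest reference (assumption of this sketch)
    ref_len = min(len(r) for r in references)
    ratio = len(translation) / ref_len
    bp = 1.0 if ratio > 1.0 else math.exp(1 - 1.0 / ratio)
    return geo_mean * bp

candidate = "the cat is on the mat".split()
reference = "the cat sits on the mat".split()
print(compute_bleu_sketch([reference], candidate))  # 0.0 (one precision is 0)
print(round(compute_bleu_sketch([reference], candidate, smooth=True), 3))
```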

5. Worked Example

5.1 Step-by-Step Calculation

Candidate: "the cat is on the mat"
Reference: "the cat sits on the mat"

Step 1: N-gram extraction

# Candidate n-grams:
1-gram: the(2), cat(1), is(1), on(1), mat(1)
2-gram: (the,cat)(1), (cat,is)(1), (is,on)(1), (on,the)(1), (the,mat)(1)

# Reference n-grams:
1-gram: the(2), cat(1), sits(1), on(1), mat(1)
2-gram: (the,cat)(1), (cat,sits)(1), (sits,on)(1), (on,the)(1), (the,mat)(1)

Step 2: Overlap counting

# 1-gram matches: the(2), cat(1), on(1), mat(1) = 5 matches
# 2-gram matches: (the,cat)(1), (on,the)(1), (the,mat)(1) = 3 matches
# 3-gram matches: (on,the,mat)(1) = 1 match
# 4-gram matches: 0 matches

Step 3: Precision

P1 = 5/6 = 0.833  # 5 matches / 6 candidate 1-grams
P2 = 3/5 = 0.600  # 3 matches / 5 candidate 2-grams
P3 = 1/4 = 0.250  # 1 match / 4 candidate 3-grams
P4 = 0/3 = 0.000  # 0 matches / 3 candidate 4-grams

Step 4: Geometric mean

# Since P4 is 0, the geometric mean is 0
geo_mean = 0

Step 5: Brevity penalty

ratio = 6/6 = 1.0
BP = 1.0  # no penalty

Final BLEU: 0 × 1.0 = 0
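The counts can be verified directly (a self-contained check re-using the `_get_ngrams` logic from section 3.1):

```python
import collections

def _get_ngrams(segment, max_order):
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(len(segment) - order + 1):
            ngram_counts[tuple(segment[i : i + order])] += 1
    return ngram_counts

candidate = "the cat is on the mat".split()
reference = "the cat sits on the mat".split()

overlap = _get_ngrams(candidate, 4) & _get_ngrams(reference, 4)
matches = [0, 0, 0, 0]
for ngram, count in overlap.items():
    matches[len(ngram) - 1] += count

print(matches)  # [5, 3, 1, 0] matches per order
precisions = [matches[n] / (len(candidate) - n) for n in range(4)]
print([round(p, 3) for p in precisions])  # [0.833, 0.6, 0.25, 0.0]
```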

6. Smoothing

Without smoothing, a single zero n-gram precision drives the geometric mean, and hence the whole BLEU score, to 0, as in the worked example. Smoothing fixes this:

if smooth:
    precisions[i] = (matches_by_order[i] + 1.0) / (
        possible_matches_by_order[i] + 1.0
    )

Lin et al. 2004 smoothing: add 1 to both the numerator and the denominator so that no precision can be zero.
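A minimal numeric illustration of the smoothed formula (the match counts below are assumed for illustration):

```python
matches_by_order = [2, 1, 0, 0]           # assumed clipped match counts
possible_matches_by_order = [4, 3, 2, 1]  # candidate n-gram totals for a 4-word sentence

smoothed = [
    (m + 1.0) / (p + 1.0)
    for m, p in zip(matches_by_order, possible_matches_by_order)
]
print(smoothed)  # every precision is now strictly positive
```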

7. Handling Multiple References

# Merge the n-grams of all reference translations
merged_ref_ngram_counts = collections.Counter()
for reference in references:
    merged_ref_ngram_counts |= _get_ngrams(reference, max_order)

Strategy: for each n-gram, take the maximum count found in any single reference.
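`Counter |=` implements exactly this per-n-gram maximum; a small sketch with two hypothetical references:

```python
import collections

ref1 = collections.Counter({("the",): 2, ("cat",): 1})  # hypothetical reference 1
ref2 = collections.Counter({("the",): 1, ("dog",): 1})  # hypothetical reference 2

merged = collections.Counter()
for ref in (ref1, ref2):
    merged |= ref  # per-key maximum, not the sum

print(merged[("the",)])  # 2: the max across references, not 2 + 1 = 3
print(merged[("dog",)])  # 1: present in only one reference
```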

8. Characteristics of the Algorithm

8.1 Strengths

  • Standardized: an internationally accepted evaluation metric
  • Multi-granular: considers 1-gram through 4-gram matches at once
  • Robust: supports multiple reference translations
  • Penalty mechanism: guards against overly short outputs

8.2 Limitations

  • Word-order sensitive: the same words in a different order score differently
  • Blind to synonyms: semantically equivalent words do not match
  • Reference-dependent: relies on high-quality reference translations

9. Application in PaddleOCR

Typical BLEU scores reported in the documentation:

| Model         | Task                             | BLEU score | Notes                            |
|---------------|----------------------------------|------------|----------------------------------|
| UniMERNet     | Mathematical formula recognition | 85.91%     | High-quality formula recognition |
| LaTeX-OCR     | LaTeX formula recognition        | ~80%       | Handles complex formulas         |
| PP-FormulaNet | Formula recognition              | ~75%       | Balances accuracy and speed      |

BLEU gives PaddleOCR's sequence-generation tasks (formula recognition in particular) an internationally standard quality metric that captures both the accuracy and the fluency of the generated text.