_match_html_rows_to_paddle_groups is a dynamic-programming (DP) algorithm that matches HTML table rows to the text groups detected by OCR.
Input 1: HTML table rows (7 rows)
Input 2: OCR text groups (48 groups)
# HTML row texts (7 rows)
html_row_texts = [
    '部门department:01372403999柜员searchteller:ebf0000...',  # row 0
    '账号/卡号account/cardno:6222621010026732125...',         # row 1
    '查询起日querystartingdate:2023-08-12...',                 # row 2
    '查询时间querytime:2024年08月12日...',                     # row 3
    '证件种类idtype:第二代居民身份证...',                       # row 4
    'serialnum序号transdate交易日期...',                       # row 5 (header row)
    '12023-08-1213:04:52网上银行卡转入...'                     # row 6 (data row)
]
# OCR group texts (48 groups)
group_texts = [
    '部门department:01372403999...',      # group 0
    '账号/卡号account/cardno:...',         # group 1
    '查询起日querystartingdate:...',       # group 2
    ...
    '964642s0110305'                      # group 47
]
if len(html_rows) == len(grouped_boxes):  # 7 == 48? False
    # this branch is not taken
Actual scenario: 7 rows ≠ 48 groups, so the DP algorithm is needed.
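For the trivial case where the counts do match, the result is a plain 1:1 mapping. A minimal sketch of that fast path, purely illustrative (the helper name and its internals are assumptions, not the real function's code):

```python
# Hedged sketch: a plausible 1:1 fast path when row and group counts match.
# The mapping mirrors the structure of the final result shown later.
def direct_one_to_one_mapping(html_rows, grouped_boxes):
    """Map HTML row i to OCR group i when the counts are equal."""
    assert len(html_rows) == len(grouped_boxes)
    return {i: [i] for i in range(len(html_rows))}
```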
To let one HTML row match several OCR groups, merged_cache precomputes every allowed combination of consecutive groups:
MAX_MERGE = 4  # merge at most 4 OCR groups
merged_cache = {
    (0, 1): '部门department:01372403999...',           # group 0 alone
    (0, 2): '部门department:...账号/卡号account...',    # groups 0+1
    (0, 3): '部门department:...查询起日...',            # groups 0+1+2
    (0, 4): '部门department:...查询时间...',            # groups 0+1+2+3
    (1, 1): '账号/卡号account/cardno:...',              # group 1 alone
    (1, 2): '账号/卡号...查询起日...',                   # groups 1+2
    ...
    (47, 1): '964642s0110305'                          # group 47 alone
}
Purpose: avoid re-concatenating the same texts over and over during the DP, which improves performance.
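A minimal sketch of how such a cache could be precomputed, assuming group_texts holds the OCR group strings in reading order and that merged texts are joined without a separator (both are assumptions; the real precomputation code is not shown in this document):

```python
# Hedged sketch: precompute merged texts for every (start index, merge count)
# pair, keyed exactly like the merged_cache dict above.
MAX_MERGE = 4

def build_merged_cache(group_texts, max_merge=MAX_MERGE):
    cache = {}
    for start in range(len(group_texts)):
        for count in range(1, max_merge + 1):
            if start + count > len(group_texts):
                break
            # Concatenate `count` consecutive group texts starting at `start`.
            cache[(start, count)] = ''.join(group_texts[start:start + count])
    return cache
```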
import numpy as np

# dp[i][j] = best total score after matching HTML rows 0..i,
#            with row i's match ending at OCR group j
dp = np.full((7, 48), -np.inf)  # initialize to negative infinity

# path[(i, j)] = (prev_j, start_j)
#   prev_j:  OCR group index where the previous row ended
#   start_j: OCR group index where the current row starts
path = {}
Goal: find the best match between HTML row 0 and the OCR groups.
SEARCH_WINDOW = 15  # search only within the first 15 groups
MAX_MERGE = 4       # merge at most 4 groups
# Enumerate all candidate matches for row 0
for end_j in range(min(48, 15 + 4)):   # end_j: 0..18 (SEARCH_WINDOW + MAX_MERGE)
    for count in range(1, 5):          # count: 1..4
        start_j = end_j - count + 1    # starting group index
        if start_j < 0:
            continue
        # e.g. end_j=3, count=2 → start_j=2
        # meaning: HTML row 0 matches OCR groups 2+3
        current_text = merged_cache.get((start_j, count), "")
        similarity = calculate_similarity(html_row_texts[0], current_text)
        penalty = start_j * 0.1        # penalty for skipping leading groups
        score = similarity - penalty
        if score > 0.1 and score > dp[0][end_j]:  # record only scores that are high enough and improve the cell
            dp[0][end_j] = score
            path[(0, end_j)] = (-1, start_j)      # -1 marks the first row
Example computation:
| end_j | count | start_j | OCR groups matched | similarity | penalty | score | recorded? |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | group 0 | 0.95 | 0.0 | 0.95 | ✅ |
| 1 | 1 | 1 | group 1 | 0.15 | 0.1 | 0.05 | ❌ |
| 1 | 2 | 0 | groups 0+1 | 0.90 | 0.0 | 0.90 | ✅ |
| 2 | 3 | 0 | groups 0+1+2 | 0.98 | 0.0 | 0.98 | ✅ |
| ... | ... | ... | ... | ... | ... | ... | ... |
Result: suppose dp[0][2] = 0.98 is the best, i.e. HTML row 0 matches OCR groups 0+1+2.
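The similarity values above come from calculate_similarity, whose implementation is not given in this document. A minimal sketch, assuming a simple whitespace-insensitive string ratio (e.g. difflib); the real function may differ:

```python
# Hedged sketch: one plausible text-similarity measure (an assumption; the
# document does not specify how calculate_similarity is implemented).
from difflib import SequenceMatcher

def calculate_similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity between two strings, ignoring whitespace."""
    a = ''.join(a.split())
    b = ''.join(b.split())
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a, b).ratio()
```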
State-transition equation (maximized over all valid prev_j < start_j ≤ end_j, with end_j - start_j + 1 ≤ MAX_MERGE):
dp[i][end_j] = max {
    dp[i-1][prev_j] + match_score(i, start_j..end_j) - gap_penalty
}
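A concrete instance of this recurrence, using the numbers from the row-1 example further below: dp[1][4] = dp[0][2] + match_score(1, 3..4) - gap_penalty = 0.98 + 0.92 - 0.0 = 1.90.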
Concrete steps:
# Find all positions where dp[i-1][j] > -inf
valid_prev_indices = [j for j in range(48) if dp[i-1][j] > -np.inf]
# e.g. when processing row 1:
# valid_prev_indices = [2]   # because dp[0][2] = 0.98
if len(valid_prev_indices) > 30:
    # keep only the 30 highest-scoring positions
    valid_prev_indices.sort(key=lambda j: dp[i-1][j], reverse=True)
    valid_prev_indices = valid_prev_indices[:30]
# If the current HTML row has no match, it may be skipped (inheriting the previous row's state)
for prev_j in valid_prev_indices:
    score_skip = dp[i-1][prev_j] - 0.3   # SKIP_HTML_PENALTY
    if score_skip > dp[i][prev_j]:
        dp[i][prev_j] = score_skip
        path[(i, prev_j)] = (prev_j, prev_j + 1)   # empty range marks a skipped row
html_text = html_row_texts[i]            # text of the current HTML row
for prev_j in valid_prev_indices:        # e.g. prev_j = 2
    prev_score = dp[i-1][prev_j]         # e.g. 0.98
    max_gap = min(15, 48 - prev_j - 1)   # search at most 15 groups ahead
    for gap in range(max_gap):           # gap: 0..14
        start_j = prev_j + 1 + gap       # e.g. gap=0 → start_j=3
        for count in range(1, 5):        # count: 1..4
            end_j = start_j + count - 1  # e.g. count=2 → end_j=4
            # fetch the merged text
            current_text = merged_cache.get((start_j, count), "")
            # length pre-filter (discard obviously hopeless candidates)
            if len(html_text) > 10 and len(current_text) < len(html_text) * 0.2:
                continue
            # similarity
            similarity = calculate_similarity(html_text, current_text)
            # penalties
            gap_penalty = gap * 0.1      # penalty for skipped groups
            len_penalty = ...            # penalty for length mismatch
            current_score = similarity - gap_penalty - len_penalty
            if current_score > 0.1:      # transition only when the gain is positive
                total_score = prev_score + current_score
                if total_score > dp[i][end_j]:
                    dp[i][end_j] = total_score
                    path[(i, end_j)] = (prev_j, start_j)
Example: processing HTML row 1 (the account-number row)
Assume dp[0][2] = 0.98 (the previous row ended at OCR group 2)
| gap | start_j | count | end_j | OCR groups matched | similarity | gap penalty | total score | updated? |
|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 3 | group 3 | 0.85 | 0.0 | 0.98 + 0.85 = 1.83 | ✅ |
| 0 | 3 | 2 | 4 | groups 3+4 | 0.92 | 0.0 | 0.98 + 0.92 = 1.90 | ✅ |
| 1 | 4 | 1 | 4 | group 4 | 0.20 | 0.1 | 0.20 − 0.1 = 0.10, not above the 0.1 threshold | ❌ |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
Final: dp[1][4] = 1.90, path[(1, 4)] = (2, 3)
best_end_j = -1
best_last_row = -1
max_score = -np.inf
# Scan from the last row backwards; the best endpoint is not necessarily in
# row 6, because trailing HTML rows may have been skipped.
for i in range(6, -1, -1):   # i: 6, 5, 4, ..., 0
    for j in range(48):
        if dp[i][j] > max_score:
            max_score = dp[i][j]
            best_end_j = j
            best_last_row = i
# e.g. dp[6][47] = 5.5 is the maximum
# → best_end_j = 47, best_last_row = 6
mapping = {}
curr_i = best_last_row   # 6
curr_j = best_end_j      # 47
while curr_i >= 0:
    if (curr_i, curr_j) in path:
        prev_j, start_j = path[(curr_i, curr_j)]          # e.g. (45, 46)
        # If start_j <= curr_j, this row consumed OCR groups
        if start_j <= curr_j:                             # 46 <= 47? True
            indices = list(range(start_j, curr_j + 1))    # [46, 47]
            mapping[curr_i] = indices                     # HTML row 6 matches OCR groups 46+47
        # A skipped row (start_j > curr_j) consumed nothing; either way,
        # step back to the previous row.
        curr_j = prev_j   # 45
        curr_i -= 1       # 5
    else:
        break
# Backtracking continues...
# mapping[5] = [40, 41, 42, 43, 44, 45]
# mapping[4] = [38, 39]
# ...
# mapping[0] = [0, 1, 2]
Final mapping:
mapping = {
    0: [0, 1, 2],          # HTML row 0 → OCR groups 0+1+2
    1: [3, 4],             # HTML row 1 → OCR groups 3+4
    2: [5, 6],             # HTML row 2 → OCR groups 5+6
    3: [7, 8, 9],          # HTML row 3 → OCR groups 7+8+9
    4: [10, 11],           # HTML row 4 → OCR groups 10+11
    5: [12, 13, ..., 39],  # HTML row 5 → OCR groups 12..39 (header row)
    6: [40, 41, ..., 47]   # HTML row 6 → OCR groups 40..47 (data row)
}
used_groups = {0, 1, 2, ..., 47}   # groups already consumed by the mapping
unused_groups = []                 # empty here, so this step is skipped
# If there are unused groups, find the nearest matched group and merge them into that row
for row_idx in mapping:
    if mapping[row_idx]:
        # Sort each row's OCR groups by their Y coordinate to keep them in order
        mapping[row_idx].sort(key=lambda idx: grouped_boxes[idx]['y_center'])
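The merge-to-nearest rule is only described in a comment above. A minimal sketch of one way it could be implemented, assuming each grouped_boxes entry exposes a 'y_center' coordinate as used in the sorting step (the distance metric and everything else here are assumptions):

```python
# Hedged sketch: attach each unused OCR group to the row whose matched groups
# are vertically closest. Illustrative only; not the real post-processing code.
def absorb_unused_groups(mapping, grouped_boxes, unused_groups):
    for g in unused_groups:
        gy = grouped_boxes[g]['y_center']
        # Pick the matched row whose closest group is nearest in Y.
        best_row = min(
            (row for row, idxs in mapping.items() if idxs),
            key=lambda row: min(abs(grouped_boxes[i]['y_center'] - gy)
                                for i in mapping[row]),
        )
        mapping[best_row].append(g)
    # Keep each row's groups sorted, mirroring the step above.
    for idxs in mapping.values():
        idxs.sort(key=lambda i: grouped_boxes[i]['y_center'])
    return mapping
```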
              OCR group index →
HTML row ↓    0      1      2      3      4      5      6     ...     47
    0         -      -    [0.98]   -      -      -      -     ...      -
    1         -      -      -      -    [1.90]   -      -     ...      -
    2         -      -      -      -      -      -    [2.85]  ...      -
    3         -      -      -      -      -      -      -    [3.80]    -
    4         -      -      -      -      -      -      -    [4.75]    -
    5         -      -      -      -      -      -      -    [5.20]    -
    6         -      -      -      -      -      -      -     ...    [5.50]  ← optimal endpoint
HTML row 0 → OCR groups [0, 1, 2]
    ↓
HTML row 1 → OCR groups [3, 4]
    ↓
HTML row 2 → OCR groups [5, 6]
    ↓
   ...
    ↓
HTML row 6 → OCR groups [40..47]
In short, through dynamic programming this algorithm produces, for every HTML row, the list of OCR group indices it corresponds to; this mapping is then used for the subsequent cell-level matching.
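As a small usage sketch of that output (the downstream cell-matching code is not part of this document), the mapping can be used to rebuild each row's OCR text and compare it against the HTML row text, assuming the group_texts and html_row_texts structures from the examples above:

```python
# Hedged sketch: consume the mapping by joining each row's OCR group texts,
# e.g. as input to the later cell-level matching step.
def merged_text_per_row(mapping, group_texts):
    return {row: ''.join(group_texts[i] for i in idxs)
            for row, idxs in mapping.items()}

# Example:
# row_texts = merged_text_per_row(mapping, group_texts)
# row_texts[0]  # → OCR text reconstructed for HTML row 0 (groups 0+1+2)
```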