@@ -0,0 +1,910 @@
+"""
+Table cell matcher
+Matches HTML table cells against PaddleOCR bounding boxes.
+"""
+import re
+import sys
+from collections import Counter
+from pathlib import Path
+from typing import List, Dict, Tuple, Optional
+
+import numpy as np
+from bs4 import BeautifulSoup
+
+try:
+    from rapidfuzz import fuzz
+except ImportError:
+    from fuzzywuzzy import fuzz
+
+# Add the ocr_platform root to the Python path (needed to import ocr_utils)
+ocr_platform_root = Path(__file__).parents[3]  # ocr_merger -> ocr_tools -> ocr_platform -> repository.git
+if str(ocr_platform_root) not in sys.path:
+    sys.path.insert(0, str(ocr_platform_root))
+
+try:
+    from .text_matcher import TextMatcher
+    from ocr_utils import BBoxExtractor  # imported from ocr_utils
+except ImportError:
+    from text_matcher import TextMatcher
+    from ocr_utils import BBoxExtractor  # imported from ocr_utils
+
+
+class TableCellMatcher:
+    """Table cell matcher."""
+
+    def __init__(self, text_matcher: TextMatcher,
+                 x_tolerance: int = 3,
+                 y_tolerance: int = 10,
+                 skew_threshold: float = 0.3):
+        """
+        Args:
+            text_matcher: text matcher used for normalization and scoring
+            x_tolerance: X-axis tolerance (for column boundary checks)
+            y_tolerance: Y-axis tolerance (for row grouping)
+            skew_threshold: skew correction threshold (degrees)
+        """
+        self.text_matcher = text_matcher
+        self.x_tolerance = x_tolerance
+        self.y_tolerance = y_tolerance
+        self.skew_threshold = skew_threshold  # skew correction threshold (degrees)
+
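+    # Construction sketch (illustration only; TextMatcher construction may
+    # need arguments of its own):
+    #   matcher = TableCellMatcher(TextMatcher(), x_tolerance=3,
+    #                              y_tolerance=10, skew_threshold=0.3)
+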
+    def enhance_table_html_with_bbox(self, html: str, paddle_text_boxes: List[Dict],
+                                     start_pointer: int, table_bbox: Optional[List[int]] = None) -> Tuple[str, List[Dict], int, float]:
+        """
+        Add bbox info to an HTML table (optimized: row-level dynamic programming).
+
+        Returns:
+            (enhanced_html, cells, new_pointer, skew_angle):
+            the enhanced HTML, the matched cell list, the new pointer position, and the skew angle
+        """
+        soup = BeautifulSoup(html, 'html.parser')
+        cells = []
+        skew_angle = 0.0
+
+        # 🔑 Step 1: filter the paddle boxes that fall inside the table region
+        table_region_boxes, actual_table_bbox = self._filter_boxes_in_table_region(
+            paddle_text_boxes[start_pointer:],
+            table_bbox,
+            html
+        )
+
+        if not table_region_boxes:
+            print("⚠️ No paddle boxes found in the table region")
+            return str(soup), cells, start_pointer, skew_angle
+
+        print(f"📊 Table region: {len(table_region_boxes)} text boxes")
+
+        # 🔑 Step 2: group the table-region boxes into rows
+        grouped_boxes, skew_angle = self._group_paddle_boxes_by_rows(
+            table_region_boxes,
+            y_tolerance=self.y_tolerance,
+            auto_correct_skew=True,
+            skew_threshold=self.skew_threshold
+        )
+
+        # 🔑 Step 3: sort each group by x coordinate
+        for group in grouped_boxes:
+            group['boxes'].sort(key=lambda x: x['bbox'][0])
+
+        grouped_boxes.sort(key=lambda g: g['y_center'])
+
+        # 🔑 Step 4: smart-match HTML rows to paddle row groups
+        html_rows = soup.find_all('tr')
+        row_mapping = self._match_html_rows_to_paddle_groups(html_rows, grouped_boxes)
+
+        print(f"   HTML rows: {len(html_rows)}, mapping: {len([v for v in row_mapping.values() if v])} valid entries")
+
+        # 🔑 Step 5: walk the HTML table, matching within each row via DP
+        for row_idx, row in enumerate(html_rows):
+            group_indices = row_mapping.get(row_idx, [])
+
+            if not group_indices:
+                continue
+
+            # Merge the boxes of all mapped groups
+            current_boxes = []
+            for group_idx in group_indices:
+                if group_idx < len(grouped_boxes):
+                    current_boxes.extend(grouped_boxes[group_idx]['boxes'])
+
+            # Re-sort by x to guarantee left-to-right order
+            current_boxes.sort(key=lambda x: x['bbox'][0])
+
+            html_cells = row.find_all(['td', 'th'])
+            if not html_cells:
+                continue
+
+            # 🎯 Core change: row-level DP replaces the old sequential matching
+            # Input: the row's HTML cells and its OCR boxes
+            # Output: the list of match results
+            dp_results = self._match_cells_in_row_dp(html_cells, current_boxes)
+
+            print(f"   Row {row_idx + 1}: {len(html_cells)} columns, {len(dp_results)} cells matched")
+
+            # Unpack the DP results and fill the cells list
+            for res in dp_results:
+                cell_idx = res['cell_idx']
+                match_info = res['match_info']
+
+                cell_element = html_cells[cell_idx]
+                cell_text = cell_element.get_text(strip=True)
+
+                matched_boxes = match_info['boxes']
+                matched_text = match_info['text']
+                score = match_info['score']
+
+                # Mark the boxes as used
+                paddle_indices = []
+                for box in matched_boxes:
+                    box['used'] = True
+                    paddle_indices.append(box.get('paddle_bbox_index', -1))
+
+                # Compute the merged bbox (preferring the original coordinates in original_bbox)
+                merged_bbox = self._merge_boxes_bbox(matched_boxes)
+
+                # Inject the HTML attributes
+                cell_element['data-bbox'] = f"[{merged_bbox[0]},{merged_bbox[1]},{merged_bbox[2]},{merged_bbox[3]}]"
+                cell_element['data-score'] = f"{score:.4f}"
+                cell_element['data-paddle-indices'] = str(paddle_indices)
+
+                # Build the return structure (kept consistent with the original function)
+                cells.append({
+                    'type': 'table_cell',
+                    'text': cell_text,
+                    'matched_text': matched_text,
+                    'bbox': merged_bbox,
+                    'row': row_idx + 1,
+                    'col': cell_idx + 1,
+                    'score': score,
+                    'paddle_bbox_indices': paddle_indices
+                })
+
+                print(f"      Column {cell_idx + 1}: '{cell_text[:15]}...' matched {len(matched_boxes)} boxes (score: {score:.1f})")
+
+        # Compute the new pointer position (logic unchanged: based on the used flag)
+        used_count = sum(1 for box in table_region_boxes if box.get('used'))
+        new_pointer = start_pointer + used_count
+
+        print(f"   Total matched: {len(cells)} cells")
+
+        return str(soup), cells, new_pointer, skew_angle
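+
+    # Usage sketch (illustration only; `matcher`, `html` and `boxes` are
+    # assumed to be built elsewhere):
+    #   enhanced, cells, ptr, angle = matcher.enhance_table_html_with_bbox(
+    #       html, boxes, start_pointer=0, table_bbox=[0, 0, 800, 600])
+    #   # each matched <td>/<th> now carries data-bbox, data-score and
+    #   # data-paddle-indices attributes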
+
+    def _merge_boxes_bbox(self, boxes: List[Dict]) -> List[int]:
+        """Helper: merge the coordinates of several boxes into one bbox."""
+        if not boxes:
+            return [0, 0, 0, 0]
+
+        # Prefer original_bbox; fall back to bbox
+        def get_coords(b):
+            return b.get('original_bbox', b['bbox'])
+
+        x1 = min(get_coords(b)[0] for b in boxes)
+        y1 = min(get_coords(b)[1] for b in boxes)
+        x2 = max(get_coords(b)[2] for b in boxes)
+        y2 = max(get_coords(b)[3] for b in boxes)
+        return [x1, y1, x2, y2]
+
+    def _match_cells_in_row_dp(self, html_cells: List, row_boxes: List[Dict]) -> List[Dict]:
+        """
+        Match the cells of one row against its OCR boxes with dynamic programming.
+        Goal: find the assignment that maximizes the total match score of the row.
+        """
+        n_cells = len(html_cells)
+        n_boxes = len(row_boxes)
+
+        # dp[i][j]: maximum score when the first i cells have consumed the first j boxes
+        dp = np.full((n_cells + 1, n_boxes + 1), -np.inf)
+        dp[0][0] = 0
+
+        # path[i][j] = (prev_j, matched_info), used for backtracking
+        path = {}
+
+        # Maximum number of boxes that may be merged into one cell
+        MAX_MERGE = 5
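+
+        # Recurrence sketch (illustration only):
+        #   dp[i][j] = max(dp[i-1][j],                    # cell i left unmatched
+        #                  max over prev_j in [j-MAX_MERGE, j-1] of
+        #                      dp[i-1][prev_j] + score(cell_i, row_boxes[prev_j:j]))
+        # e.g. cells ["张三", "25"] with boxes ["张", "三", "25"]: the optimum
+        # assigns boxes[0:2] to cell 0 and boxes[2:3] to cell 1.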
+
+        for i in range(1, n_cells + 1):
+            cell = html_cells[i-1]
+            cell_text = cell.get_text(strip=True)
+
+            # An empty cell inherits the previous state (i.e. the cell is skipped)
+            if not cell_text:
+                for j in range(n_boxes + 1):
+                    if dp[i-1][j] > -np.inf:
+                        dp[i][j] = dp[i-1][j]
+                        path[(i, j)] = (j, None)
+                continue
+
+            # Iterate over the current box pointer j
+            for j in range(n_boxes + 1):
+                # Strategy A: the cell matches no box (missing cell / OCR miss)
+                if dp[i-1][j] > dp[i][j]:
+                    dp[i][j] = dp[i-1][j]
+                    path[(i, j)] = (j, None)
+
+                # Strategy B: the cell matches k boxes (from prev_j to j)
+                # Limit the search: look back at most MAX_MERGE boxes
+                search_limit = max(0, j - MAX_MERGE)
+
+                # Skipping a few noise boxes in between would also be possible
+                # (a large span from prev_j to j with only part of it used),
+                # but for simplicity we consume row_boxes[prev_j:j] contiguously
+                for prev_j in range(j - 1, search_limit - 1, -1):
+                    if dp[i-1][prev_j] == -np.inf:
+                        continue
+
+                    candidate_boxes = row_boxes[prev_j:j]
+
+                    # Combine the texts (joined with spaces)
+                    merged_text = " ".join([b['text'] for b in candidate_boxes])
+
+                    # Score the candidate
+                    score = self._compute_match_score(cell_text, merged_text)
+
+                    # Only passing matches are considered
+                    if score > 50:
+                        new_score = dp[i-1][prev_j] + score
+                        if new_score > dp[i][j]:
+                            dp[i][j] = new_score
+                            path[(i, j)] = (prev_j, {
+                                'text': merged_text,
+                                'boxes': candidate_boxes,
+                                'score': score
+                            })
+
+        # --- Backtrack to recover the best solution ---
+        best_j = np.argmax(dp[n_cells])
+        if dp[n_cells][best_j] == -np.inf:
+            return []
+
+        results = []
+        curr_i, curr_j = n_cells, best_j
+
+        while curr_i > 0:
+            step_info = path.get((curr_i, curr_j))
+            if step_info:
+                prev_j, match_info = step_info
+                if match_info:
+                    results.append({
+                        'cell_idx': curr_i - 1,
+                        'match_info': match_info
+                    })
+                curr_j = prev_j
+            curr_i -= 1
+
+        return results[::-1]
+
+    def _compute_match_score(self, cell_text: str, box_text: str) -> float:
+        """
+        Pure scoring function: rate how well a cell's text matches a candidate
+        box text. Contains all of the defensive logic.
+        """
+        # 1. Preprocessing
+        cell_norm = self.text_matcher.normalize_text(cell_text)
+        box_norm = self.text_matcher.normalize_text(box_text)
+
+        if not cell_norm or not box_norm:
+            return 0.0
+
+        # --- ⚡️ Fast defenses ---
+        len_cell = len(cell_norm)
+        len_box = len(box_norm)
+
+        # A large length mismatch scores 0 outright
+        # (unless it is a containment case with distinctive content)
+        if len_box > len_cell * 3 + 5:
+            if len_cell < 5:
+                return 0.0
+
+        # --- 🔍 Core similarity measures ---
+        cell_proc = self._preprocess_text_for_matching(cell_text)
+        box_proc = self._preprocess_text_for_matching(box_text)
+
+        # A. Token sort (handles reordering)
+        score_sort = fuzz.token_sort_ratio(cell_proc, box_proc)
+
+        # B. Partial ratio (handles truncation/containment)
+        score_partial = fuzz.partial_ratio(cell_norm, box_norm)
+
+        # C. Subsequence (handles noise insertions)
+        score_subseq = 0.0
+        if len_cell > 5:
+            score_subseq = self._calculate_subsequence_score(cell_norm, box_norm)
+
+        # --- 🛡️ Deeper defenses ---
+
+        # 1. Short-text defenses
+        if score_partial > 80:
+            has_content = lambda t: bool(re.search(r'[a-zA-Z0-9\u4e00-\u9fa5]', t))
+
+            # Pure-symbol defense
+            if not has_content(cell_norm) and has_content(box_norm):
+                if len_box > len_cell + 2:
+                    score_partial = 0.0
+
+            # Tiny-fragment defense
+            elif len_cell <= 2 and len_box > 8:
+                score_partial = 0.0
+
+            # Coverage defense
+            else:
+                coverage = len_cell / len_box if len_box > 0 else 0
+                if coverage < 0.3 and score_sort < 45:
+                    score_partial = 0.0
+
+        # 2. Subsequence defense
+        if score_subseq > 80:
+            if len_box > len_cell * 1.5:
+                if re.match(r'^[\d\-\:\.\s]+$', cell_norm) and len_cell < 12:
+                    score_subseq = 0.0
+
+        # --- 📊 Final blend ---
+        final_score = max(score_sort, score_partial, score_subseq)
+
+        # Exact-match bonus
+        if cell_norm == box_norm:
+            final_score = 100.0
+        elif cell_norm in box_norm:
+            final_score = min(100, final_score + 5)
+
+        return final_score
+
+    def _filter_boxes_in_table_region(self, paddle_boxes: List[Dict],
+                                      table_bbox: Optional[List[int]],
+                                      html: str) -> Tuple[List[Dict], List[int]]:
+        """
+        Filter the paddle boxes that fall inside the table region.
+
+        Strategy:
+        1. When table_bbox is given, filter by the (expanded) bounding box.
+
+        Args:
+            paddle_boxes: paddle OCR results
+            table_bbox: table bounding box [x1, y1, x2, y2]
+            html: HTML content (reserved for content validation)
+
+        Returns:
+            (filtered boxes, actual table bounding box)
+        """
+        if not paddle_boxes:
+            return [], [0, 0, 0, 0]
+
+        # 🎯 Strategy 1: use the provided table_bbox (with expanded borders)
+        if table_bbox and len(table_bbox) == 4:
+            x1, y1, x2, y2 = table_bbox
+
+            # Expand the borders (text may sit slightly outside the frame)
+            margin = 20
+            expanded_bbox = [
+                max(0, x1 - margin),
+                max(0, y1 - margin),
+                x2 + margin,
+                y2 + margin
+            ]
+
+            filtered = []
+            for box in paddle_boxes:
+                bbox = box['bbox']
+                box_center_x = (bbox[0] + bbox[2]) / 2
+                box_center_y = (bbox[1] + bbox[3]) / 2
+
+                # Keep the box if its center lies in the expanded region
+                if (expanded_bbox[0] <= box_center_x <= expanded_bbox[2] and
+                        expanded_bbox[1] <= box_center_y <= expanded_bbox[3]):
+                    filtered.append(box)
+
+            if filtered:
+                # Compute the actual bounding box of the kept boxes
+                actual_bbox = [
+                    min(b['bbox'][0] for b in filtered),
+                    min(b['bbox'][1] for b in filtered),
+                    max(b['bbox'][2] for b in filtered),
+                    max(b['bbox'][3] for b in filtered)
+                ]
+                return filtered, actual_bbox
+            else:
+                return [], [0, 0, 0, 0]
+        else:
+            raise ValueError(f"table_bbox is not valid: table_bbox={table_bbox}")
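+
+    # Example (illustration only): with table_bbox=[100, 100, 500, 400] and
+    # margin=20 the expanded region is [80, 80, 520, 420]; a box centered at
+    # (95, 110) is kept, one centered at (60, 110) is dropped.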
+
+    def _group_paddle_boxes_by_rows(self, paddle_boxes: List[Dict],
+                                    y_tolerance: int = 10,
+                                    auto_correct_skew: bool = True,
+                                    skew_threshold: float = 0.3) -> Tuple[List[Dict], float]:
+        """
+        Group paddle_text_boxes by y coordinate (clustering) - enhanced version.
+
+        Args:
+            paddle_boxes: list of Paddle OCR text boxes
+            y_tolerance: Y-coordinate tolerance (pixels)
+            auto_correct_skew: whether to auto-correct skew
+            skew_threshold: skew angle (degrees) above which correction kicks in
+
+        Returns:
+            (groups, skew_angle); each group is {'y_center': float, 'boxes': List[Dict]}
+        """
+        skew_angle = 0.0
+        if not paddle_boxes:
+            return [], skew_angle
+
+        # 🎯 Step 1: detect and correct skew (via BBoxExtractor)
+        if auto_correct_skew:
+            skew_angle = BBoxExtractor.calculate_skew_angle(paddle_boxes)
+
+            if abs(skew_angle) > skew_threshold:
+                max_x = max(box['bbox'][2] for box in paddle_boxes)
+                max_y = max(box['bbox'][3] for box in paddle_boxes)
+                image_size = (max_x, max_y)
+
+                print(f"   🔧 Correcting skew angle: {skew_angle:.2f}°")
+
+                # Rotation needed to undo the skew (clockwise)
+                correction_angle = -skew_angle
+
+                paddle_boxes = BBoxExtractor.correct_boxes_skew(
+                    paddle_boxes, correction_angle, image_size
+                )
+
+        # 🎯 Step 2: group by the corrected y coordinate
+        boxes_with_y = []
+        for box in paddle_boxes:
+            bbox = box['bbox']
+            y_center = (bbox[1] + bbox[3]) / 2
+            boxes_with_y.append({
+                'y_center': y_center,
+                'box': box
+            })
+
+        # Sort by y coordinate
+        boxes_with_y.sort(key=lambda x: x['y_center'])
+
+        groups = []
+        current_group = None
+
+        for item in boxes_with_y:
+            if current_group is None:
+                # Start a new group
+                current_group = {
+                    'y_center': item['y_center'],
+                    'boxes': [item['box']]
+                }
+            else:
+                if abs(item['y_center'] - current_group['y_center']) <= y_tolerance:
+                    current_group['boxes'].append(item['box'])
+                    # Update the group's center (running mean)
+                    current_group['y_center'] = sum(
+                        (b['bbox'][1] + b['bbox'][3]) / 2 for b in current_group['boxes']
+                    ) / len(current_group['boxes'])
+                else:
+                    groups.append(current_group)
+                    current_group = {
+                        'y_center': item['y_center'],
+                        'boxes': [item['box']]
+                    }
+
+        if current_group:
+            groups.append(current_group)
+
+        print(f"   ✓ Grouping done: {len(groups)} rows")
+
+        return groups, skew_angle
+
+    def _match_html_rows_to_paddle_groups(self, html_rows: List,
+                                          grouped_boxes: List[Dict]) -> Dict[int, List[int]]:
+        """
+        Smart-match HTML rows to paddle groups (enhanced DP: HTML rows may be
+        skipped so the matching chain cannot break).
+        """
+        if not html_rows or not grouped_boxes:
+            return {}
+
+        mapping = {}
+
+        # Extract the HTML row texts (used for validation)
+        html_row_texts = []
+        for row in html_rows:
+            cells = row.find_all(['td', 'th'])
+            texts = [self.text_matcher.normalize_text(c.get_text(strip=True)) for c in cells]
+            html_row_texts.append("".join(texts))
+
+        # Precompute every group's text (used for validation)
+        group_texts = []
+        for group in grouped_boxes:
+            boxes = group['boxes']
+            texts = [self.text_matcher.normalize_text(b['text']) for b in boxes]
+            group_texts.append("".join(texts))
+
+        # 🎯 Strategy 1: when the counts are equal, verify content similarity
+        # first - a blind 1:1 mapping is not safe
+        if len(html_rows) == len(grouped_boxes):
+            # Similarity of every (HTML row, paddle group) pair
+            similarity_matrix = []
+            for i, html_text in enumerate(html_row_texts):
+                row_similarities = []
+                for j, group_text in enumerate(group_texts):
+                    similarity = self._calculate_similarity(html_text, group_text)
+                    row_similarities.append(similarity)
+                similarity_matrix.append(row_similarities)
+
+            # Use the 1:1 mapping only if every diagonal element is (close to)
+            # the best match of its row
+            diagonal_ok = True
+            min_similarity_threshold = 0.3  # minimum similarity threshold
+
+            for i in range(len(html_rows)):
+                diag_sim = similarity_matrix[i][i]
+                # Is the diagonal element the best match of this row?
+                max_sim_in_row = max(similarity_matrix[i])
+                # Reject the 1:1 mapping if the diagonal similarity is too low
+                # or clearly beaten by another column
+                if diag_sim < min_similarity_threshold or (max_sim_in_row > diag_sim + 0.1):
+                    diagonal_ok = False
+                    break
+
+            # Only a well-matching diagonal justifies the simple 1:1 mapping
+            if diagonal_ok:
+                print("   ✓ Equal row counts with a well-matched diagonal, using the 1:1 mapping")
+                for i in range(len(html_rows)):
+                    mapping[i] = [i]
+                return mapping
+            else:
+                print("   ⚠️ Equal row counts but mismatched content, falling back to the DP matching")
+                # Fall through to the DP below (html_row_texts and group_texts
+                # are already computed)
+
+        n_html = len(html_row_texts)
+        n_paddle = len(grouped_boxes)
+        # Pruning parameter
+        beam_width = self._get_adaptive_beam_width(n_html)
+
+        # ⚡️ Optimization 3: precompute the merged texts
+        MAX_MERGE = 4
+        merged_cache = {}
+        for j in range(n_paddle):
+            current_t = ""
+            for k in range(MAX_MERGE):
+                if j + k < n_paddle:
+                    current_t += group_texts[j + k]
+                    merged_cache[(j, k + 1)] = current_t
+                else:
+                    break
+
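+        # merged_cache[(j, k)] now holds group_texts[j] + ... + group_texts[j + k - 1];
+        # e.g. merged_cache[(2, 3)] concatenates groups 2, 3 and 4.
+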
+        # --- Dynamic programming (DP) ---
+        # dp[i][j]: maximum score when HTML rows 0..i have been matched and the
+        # consumed paddle groups end at index j
+        # Initialize with -inf
+        dp = np.full((n_html, n_paddle), -np.inf)
+        # Path for backtracking: path[i][j] = (prev_j, start_j)
+        #   prev_j: paddle index where the previous row ended
+        #   start_j: paddle index where the current row starts
+        #   (one row may span several groups)
+        path = {}
+
+        # Tuning parameters
+        SEARCH_WINDOW = 15         # forward search window
+        SKIP_PADDLE_PENALTY = 0.1  # penalty for skipping a paddle group
+        SKIP_HTML_PENALTY = 0.3    # key: penalty for skipping an HTML row
+        # --- 1. Initialize the first row ---
+        # Option A: match paddle groups
+        first_row_matched = False
+        for end_j in range(min(n_paddle, SEARCH_WINDOW + MAX_MERGE)):
+            for count in range(1, MAX_MERGE + 1):
+                start_j = end_j - count + 1
+                if start_j < 0:
+                    continue
+
+                current_text = merged_cache.get((start_j, count), "")
+                similarity = self._calculate_similarity(html_row_texts[0], current_text)
+
+                penalty = start_j * SKIP_PADDLE_PENALTY
+                score = similarity - penalty
+
+                # Only a reasonable score becomes a valid state
+                if score > 0.1:
+                    if score > dp[0][end_j]:
+                        dp[0][end_j] = score
+                        path[(0, end_j)] = (-1, start_j)
+                        first_row_matched = True
+
+        # Option B: if the first row matched nothing, seed a default initial
+        # state (i.e. allow skipping the first row) so that later rows can
+        # start matching from the first OCR group
+        if not first_row_matched:
+            print("   ⚠️ The first row (header) matches no OCR group; allowing it to be skipped")
+            if n_paddle > 0:
+                # Start at index 0 with a low score (the first HTML row was skipped)
+                dp[0][0] = -SKIP_HTML_PENALTY  # negative score marks the skipped row
+                path[(0, 0)] = (-1, 0)  # no OCR group consumed
+                first_row_matched = True  # mark as handled to avoid reseeding
+
+        # --- 2. State transitions ---
+        for i in range(1, n_html):
+            html_text = html_row_texts[i]
+
+            # All valid positions of the previous row
+            valid_prev_indices = [j for j in range(n_paddle) if dp[i-1][j] > -np.inf]
+
+            # 🛡️ Key fix: if the previous row has no valid state (the earlier
+            # rows could not be matched at all), start from the first OCR group
+            # and treat the earlier HTML rows as skipped
+            if not valid_prev_indices:
+                print(f"   ⚠️ Row {i}: no valid previous state, starting from the first OCR group (skipping earlier HTML rows)")
+                if n_paddle > 0:
+                    dp[i][0] = -SKIP_HTML_PENALTY * i  # penalty proportional to the skipped rows
+                    path[(i, 0)] = (-1, 0)  # no OCR group consumed
+                    valid_prev_indices = [0]  # seed one initial state
+
+            # Pruning: to keep the DP fast, if the previous row has too many
+            # feasible states (prev_j), keep only the beam_width highest-scoring
+            # ones as starting points (avoids combinatorial blow-up)
+            if len(valid_prev_indices) > beam_width:
+                valid_prev_indices.sort(key=lambda j: dp[i-1][j], reverse=True)
+                valid_prev_indices = valid_prev_indices[:beam_width]
+
+            # 🛡️ Key fix: allow skipping the current HTML row (inherit the
+            # previous row's state); the paddle pointer j stays unchanged
+            for prev_j in valid_prev_indices:
+                score_skip = dp[i-1][prev_j] - SKIP_HTML_PENALTY
+                if score_skip > dp[i][prev_j]:
+                    dp[i][prev_j] = score_skip
+                    # Record the path: start_j = prev_j + 1 encodes an empty
+                    # range (no group consumed)
+                    path[(i, prev_j)] = (prev_j, prev_j + 1)
+
+            # An empty row only inherits states; skip the matching entirely
+            if not html_text:
+                continue
+
+            # Regular matching
+            for prev_j in valid_prev_indices:
+                prev_score = dp[i-1][prev_j]
+
+                max_gap = min(SEARCH_WINDOW, n_paddle - prev_j - 1)
+
+                for gap in range(max_gap):
+                    start_j = prev_j + 1 + gap
+
+                    for count in range(1, MAX_MERGE + 1):
+                        end_j = start_j + count - 1
+                        if end_j >= n_paddle:
+                            break
+
+                        current_text = merged_cache.get((start_j, count), "")
+
+                        # Length pre-filter
+                        h_len = len(html_text)
+                        p_len = len(current_text)
+                        if h_len > 10 and p_len < h_len * 0.2:
+                            continue
+
+                        similarity = self._calculate_similarity(html_text, current_text)
+
+                        # Penalties:
+                        # 1. skip penalty (gap)
+                        # 2. length penalty (discourages over-merging)
+                        len_penalty = 0.0
+                        if h_len > 0:
+                            ratio = p_len / h_len
+                            if ratio > 2.0:
+                                len_penalty = (ratio - 2.0) * 0.2
+
+                        current_score = similarity - (gap * SKIP_PADDLE_PENALTY) - len_penalty
+
+                        # Transition only on positive gain
+                        if current_score > 0.1:
+                            total_score = prev_score + current_score
+
+                            if total_score > dp[i][end_j]:
+                                dp[i][end_j] = total_score
+                                path[(i, end_j)] = (prev_j, start_j)
+
+        # --- 3. Backtrack the optimal path ---
+        # Find the highest-scoring end position, preferring the last row;
+        # if the last row matched nothing, walk upwards
+        best_end_j = -1
+        max_score = -np.inf
+
+        found_end = False
+        for i in range(n_html - 1, -1, -1):
+            for j in range(n_paddle):
+                if dp[i][j] > max_score:
+                    max_score = dp[i][j]
+                    best_end_j = j
+                    best_last_row = i
+            if max_score > -np.inf:
+                found_end = True
+                break
+
+        mapping = {}
+        used_groups = set()
+
+        if found_end:
+            curr_i = best_last_row
+            curr_j = best_end_j
+
+            while curr_i >= 0:
+                if (curr_i, curr_j) in path:
+                    prev_j, start_j = path[(curr_i, curr_j)]
+
+                    # start_j <= curr_j: paddle groups were consumed
+                    # start_j > curr_j: the HTML row was skipped (empty range)
+                    if start_j <= curr_j:
+                        indices = list(range(start_j, curr_j + 1))
+                        mapping[curr_i] = indices
+                        used_groups.update(indices)
+                    else:
+                        mapping[curr_i] = []
+
+                    curr_j = prev_j
+                    curr_i -= 1
+                else:
+                    break
+
+        # Fill in the unmatched rows
+        for i in range(n_html):
+            if i not in mapping:
+                mapping[i] = []
+
+        # --- 4. Post-processing: assign the unmatched (orphan) groups ---
+        unused_groups = [i for i in range(len(grouped_boxes)) if i not in used_groups]
+
+        if unused_groups:
+            print(f"   ℹ️ Found {len(unused_groups)} unmatched paddle groups: {unused_groups}")
+            for unused_idx in unused_groups:
+                unused_group = grouped_boxes[unused_idx]
+                unused_y_min = min(b['bbox'][1] for b in unused_group['boxes'])
+                unused_y_max = max(b['bbox'][3] for b in unused_group['boxes'])
+
+                above_idx = None
+                below_idx = None
+                above_distance = float('inf')
+                below_distance = float('inf')
+
+                # Nearest used group above
+                for i in range(unused_idx - 1, -1, -1):
+                    if i in used_groups:
+                        above_idx = i
+                        above_group = grouped_boxes[i]
+                        max_y_box = max(above_group['boxes'], key=lambda b: b['bbox'][3])
+                        above_y_center = (max_y_box['bbox'][1] + max_y_box['bbox'][3]) / 2
+                        above_distance = abs(unused_y_min - above_y_center)
+                        break
+
+                # Nearest used group below
+                for i in range(unused_idx + 1, len(grouped_boxes)):
+                    if i in used_groups:
+                        below_idx = i
+                        below_group = grouped_boxes[i]
+                        min_y_box = min(below_group['boxes'], key=lambda b: b['bbox'][1])
+                        below_y_center = (min_y_box['bbox'][1] + min_y_box['bbox'][3]) / 2
+                        below_distance = abs(below_y_center - unused_y_max)
+                        break
+
+                closest_used_idx = None
+                merge_direction = ""
+
+                if above_idx is not None and below_idx is not None:
+                    if above_distance < below_distance:
+                        closest_used_idx = above_idx
+                        merge_direction = "above"
+                    else:
+                        closest_used_idx = below_idx
+                        merge_direction = "below"
+                elif above_idx is not None:
+                    closest_used_idx = above_idx
+                    merge_direction = "above"
+                elif below_idx is not None:
+                    closest_used_idx = below_idx
+                    merge_direction = "below"
+
+                if closest_used_idx is not None:
+                    # Find the HTML row that owns the closest used group
+                    target_html_row = None
+                    for html_row_idx, group_indices in mapping.items():
+                        if closest_used_idx in group_indices:
+                            target_html_row = html_row_idx
+                            break
+
+                    if target_html_row is not None:
+                        if unused_idx not in mapping[target_html_row]:
+                            mapping[target_html_row].append(unused_idx)
+                            mapping[target_html_row].sort()
+                            print(f"      • Group {unused_idx} merged into HTML row {target_html_row} (row {merge_direction})")
+                            used_groups.add(unused_idx)
+
+        # 🔑 Final pass: sort each row's group indices by y coordinate
+        for row_idx in mapping:
+            if mapping[row_idx]:
+                mapping[row_idx].sort(key=lambda idx: grouped_boxes[idx]['y_center'])
+
+        return mapping
+
+    def _get_adaptive_beam_width(self, html_row_count: int) -> int:
+        """Adapt the pruning beam width to the number of HTML rows."""
+        if html_row_count <= 20:
+            return 10
+        elif html_row_count <= 40:
+            return 15
+        else:
+            return 20  # capped at 20 rather than 30
+
+    def _calculate_similarity(self, text1: str, text2: str) -> float:
+        """
+        Similarity of two texts, blending character coverage with sequence
+        similarity (performance-optimized version).
+        """
+        if not text1 or not text2:
+            return 0.0
+
+        len1, len2 = len(text1), len(text2)
+
+        # ⚡️ Optimization 1: quick length check
+        # A large length mismatch (say 50 chars vs 2) cannot be a match
+        if len1 > 0 and len2 > 0:
+            min_l, max_l = min(len1, len2), max(len1, len2)
+            if max_l > 10 and min_l / max_l < 0.2:
+                return 0.0
+
+        # 1. Character coverage (character overlap)
+        c1 = Counter(text1)
+        c2 = Counter(text2)
+
+        intersection = c1 & c2
+        overlap_count = sum(intersection.values())
+
+        coverage = overlap_count / len1 if len1 > 0 else 0
+
+        # ⚡️ Optimization 2: skip the expensive fuzz call on low coverage
+        # Below 30% character overlap the texts are essentially unrelated, so
+        # the sequence similarity is not worth computing
+        if coverage < 0.3:
+            return coverage * 0.7
+
+        # 2. Sequence similarity
+        # token_sort_ratio tolerates a certain amount of reordering
+        seq_score = fuzz.token_sort_ratio(text1, text2) / 100.0
+
+        return (coverage * 0.7) + (seq_score * 0.3)
+
+    def _preprocess_text_for_matching(self, text: str) -> str:
+        """
+        Preprocess text: insert spaces between runs of different character
+        types (e.g. Chinese vs digits/Latin) so that token_sort_ratio can
+        tokenize and match more accurately.
+        """
+        if not text:
+            return ""
+        # Insert a space between Chinese and non-Chinese (digits/letters),
+        # e.g. "2024年" -> "2024 年", "ID号码123" -> "ID号码 123"
+        text = re.sub(r'([\u4e00-\u9fa5])([a-zA-Z0-9])', r'\1 \2', text)
+        text = re.sub(r'([a-zA-Z0-9])([\u4e00-\u9fa5])', r'\1 \2', text)
+        return text
+
+    def _calculate_subsequence_score(self, target: str, source: str) -> float:
+        """
+        Subsequence match score (handles OCR noise insertions).
+        Example: target="12345", source="12(date)34(time)5" -> score close to 100.
+        """
+        # 1. Keep only letters and digits; symbols are ignored as interference
+        t_clean = "".join(c for c in target if c.isalnum())
+        s_clean = "".join(c for c in source if c.isalnum())
+
+        if not t_clean or not s_clean:
+            return 0.0
+
+        # 2. Greedy subsequence matching
+        t_idx, s_idx = 0, 0
+        matches = 0
+
+        while t_idx < len(t_clean) and s_idx < len(s_clean):
+            if t_clean[t_idx] == s_clean[s_idx]:
+                matches += 1
+                t_idx += 1
+                s_idx += 1
+            else:
+                # Skip a noise character in source
+                s_idx += 1
+
+        # 3. Score
+        match_rate = matches / len(t_clean)
+
+        # Bail out early on a low match rate
+        if match_rate < 0.8:
+            return match_rate * 100
+
+        # 4. Noise penalty (prevents false positives such as target="1",
+        # source="123456789")
+        noise_len = len(s_clean) - matches
+
+        # A certain share of noise is allowed (inserted dates/times typically
+        # make up 30%-50% of the total length); start deducting once the noise
+        # exceeds 60% of the target length
+        penalty = 0
+        if noise_len > len(t_clean) * 0.6:
+            excess_noise = noise_len - (len(t_clean) * 0.6)
+            penalty = excess_noise * 0.5  # 0.5 points per extra noise character
+            penalty = min(penalty, 20)  # capped at 20 points
+
+        final_score = (match_rate * 100) - penalty
+        return max(0, final_score)