zhengchun/ocr_platform: Unified platform for unstructured document recognition @ 26b500f3448b212e3911ff4484531a1476e1b6e3

That is a very good idea! "Smart line drawing" is indeed an effective fallback for the hard problem of recognizing borderless tables.

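🎯 Core idea

Your scenario has three key constraints, and these are exactly the prior knowledge we can exploit:

Single-column list layout: there are no nested tables or complex merged cells.
Fixed column widths: within one document, the X boundaries of the columns are stable.
Variable row height: each transaction record may span 1 to N lines of text.

Based on these constraints, we can design a "column boundary detection + row separator inference" pipeline.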

📐 Proposed approach: smart line-drawing pipeline

Overall flow

┌──────────────────────────────────────────────────────────────────┐
│                          Original image                          │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  Stage 1: OCR (PaddleOCR)                                        │
│  Output: List[{text, bbox, score}]                               │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  Stage 2: Column boundary detection                              │
│  Method: X-coordinate histogram clustering / projection analysis │
│  Output: List[(x_left, x_right)], one pair per column            │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  Stage 3: Row separator inference                                │
│  Method: detect Y-coordinate changes in an "anchor column"       │
│  Output: List[y_separator], one Y coordinate per row             │
└──────────────────────────────────────────────────────────────────┘
                                │
                                ▼
┌──────────────────────────────────────────────────────────────────┐
│  Stage 4: Draw lines & re-recognize (optional) or build table    │
│  Option A: draw lines on the image -> bordered-table model       │
│  Option B: crop each cell -> per-cell OCR (more precise)         │
│  Option C: build HTML from boundaries (no lines, pure logic)     │
└──────────────────────────────────────────────────────────────────┘

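Stage 2: Column boundary detection (core)

Principle: in a bank statement, the left-edge X coordinates of the text in each column are highly consistent. We can collect the x_left coordinate of every OCR box and find the cluster centers, which give the column boundaries.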

import numpy as np
from typing import List, Dict, Tuple
from collections import defaultdict

class TableLineDrawer:
    """Smart table line drawer."""

    def __init__(self, x_tolerance: int = 15, min_column_width: int = 30):
        self.x_tolerance = x_tolerance
        self.min_column_width = min_column_width

    def detect_column_boundaries(self, ocr_boxes: List[Dict],
                                  page_width: int) -> List[Tuple[int, int]]:
        """
        Detect column boundaries.

        Args:
            ocr_boxes: OCR results [{text, bbox: [x1,y1,x2,y2]}]
            page_width: page width in pixels

        Returns:
            List of column boundaries [(x_left, x_right), ...]
        """
        if not ocr_boxes:
            return []

        # 1. Collect the left-edge x coordinate of every box
        x_lefts = [box['bbox'][0] for box in ocr_boxes]

        # 2. Find peaks with a histogram (a simple form of clustering)
        # Bin width = x_tolerance
        bins = range(0, page_width + self.x_tolerance, self.x_tolerance)
        hist, bin_edges = np.histogram(x_lefts, bins=bins)

        # 3. Keep the significant peaks (count >= threshold)
        # Threshold: at least 3 boxes, or 5% of all boxes, whichever is larger
        threshold = max(3, len(ocr_boxes) * 0.05)

        peak_bins = []
        for i, count in enumerate(hist):
            if count >= threshold:
                peak_bins.append(bin_edges[i])

        # 4. Merge adjacent peaks (small offsets can spread one column over neighboring bins)
        column_lefts = self._merge_close_values(peak_bins, self.x_tolerance * 2)

        # 5. Infer the right boundary of each column
        # Take the largest x_right among the boxes assigned to the column,
        # capped just before the left boundary of the next column
        column_boundaries = []
        for i, x_left in enumerate(column_lefts):
            # Find all boxes that belong to this column
            col_boxes = [b for b in ocr_boxes if abs(b['bbox'][0] - x_left) < self.x_tolerance * 2]

            if col_boxes:
                # Right boundary = largest x_right among these boxes
                x_right = max(b['bbox'][2] for b in col_boxes)

                # But it must not pass the left boundary of the next column
                if i + 1 < len(column_lefts):
                    x_right = min(x_right, column_lefts[i + 1] - 5)

                column_boundaries.append((int(x_left), int(x_right)))

        return column_boundaries
    
    def _merge_close_values(self, values: List[float], tolerance: float) -> List[float]:
        """Merge values that lie within `tolerance` of each other."""
        if not values:
            return []

        values = sorted(values)
        merged = [values[0]]

        for v in values[1:]:
            if v - merged[-1] > tolerance:
                merged.append(v)
            else:
                # Replace the last entry with the average of the two
                merged[-1] = (merged[-1] + v) / 2

        return merged
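
The pipeline diagram also names projection analysis as a way to find columns, but the code above only shows the histogram approach. The following is a minimal sketch of the projection idea, not part of the original class: it marks every X position covered by some box and treats the uncovered gaps as column separators. The function name `columns_by_projection` and the `min_gap` parameter are assumptions made for this illustration.

```python
from typing import List, Tuple

import numpy as np


def columns_by_projection(
    ocr_boxes: List[dict], page_width: int, min_gap: int = 10
) -> List[Tuple[int, int]]:
    """Return (x_left, x_right) runs of X positions covered by at least one box."""
    # Project all boxes onto the X axis.
    covered = np.zeros(page_width, dtype=bool)
    for box in ocr_boxes:
        x1, _, x2, _ = box["bbox"]
        lo, hi = max(0, int(x1)), min(page_width, int(x2))
        covered[lo:hi] = True

    # Collect maximal runs of covered positions (right edge is exclusive).
    runs: List[Tuple[int, int]] = []
    start = None
    for x, is_covered in enumerate(covered):
        if is_covered and start is None:
            start = x
        elif not is_covered and start is not None:
            runs.append((start, x))
            start = None
    if start is not None:
        runs.append((start, page_width))

    # Merge runs separated by a gap narrower than min_gap (hypothetical threshold).
    merged: List[Tuple[int, int]] = []
    for left, right in runs:
        if merged and left - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], right)
        else:
            merged.append((left, right))
    return merged


boxes = [
    {"text": "2024-01-01", "bbox": [10, 10, 90, 30]},
    {"text": "2024-01-02", "bbox": [12, 40, 92, 60]},
    {"text": "Transfer", "bbox": [120, 10, 200, 30]},
    {"text": "Salary payment", "bbox": [120, 40, 260, 60]},
]
print(columns_by_projection(boxes, page_width=300))  # [(10, 92), (120, 260)]
```

One caveat of this sketch: a single long box that spans the gap between two columns (a page title, for example) bridges them into one run, so such boxes would need to be filtered out first.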

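Stage 3: Row separator inference (core)

Principle: bank statements usually have an "anchor column" (such as "transaction date" or "serial number") in which every value marks the start of a new row. We can:

Identify the anchor column (usually the first or second column, with a stable content format such as dates).

Use the y_top of each box in the anchor column as the starting line of a row.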

    def detect_row_separators(self, ocr_boxes: List[Dict],
                               column_boundaries: List[Tuple[int, int]],
                               anchor_column_idx: int = 0) -> List[int]:
        """
        Detect row separators.

        Args:
            ocr_boxes: OCR results
            column_boundaries: column boundaries
            anchor_column_idx: index of the anchor column (usually the date column)

        Returns:
            List of row separator Y coordinates (ascending)
        """
        if not column_boundaries or anchor_column_idx >= len(column_boundaries):
            return []

        anchor_col = column_boundaries[anchor_column_idx]

        # 1. Find all boxes whose horizontal center falls inside the anchor column
        anchor_boxes = []
        for box in ocr_boxes:
            x_center = (box['bbox'][0] + box['bbox'][2]) / 2
            if anchor_col[0] <= x_center <= anchor_col[1]:
                anchor_boxes.append(box)

        # 2. Sort by Y coordinate
        anchor_boxes.sort(key=lambda b: b['bbox'][1])

        # 3. Use the y_top of each anchor box as a row separator
        # Note: the table header should also be filtered out; this snippet does not do that yet
        row_separators = []

        for box in anchor_boxes:
            # Cast to int so the result matches the declared return type
            y_top = int(box['bbox'][1])

            # Filter: a box too close to the previous separator is probably
            # wrapped text belonging to the same row
            if row_separators and y_top - row_separators[-1] < 20:
                continue

            row_separators.append(y_top)

        return row_separators
    
    def identify_anchor_column(self, ocr_boxes: List[Dict],
                                column_boundaries: List[Tuple[int, int]]) -> int:
        """
        Automatically identify the anchor column (usually the date or serial-number column).

        Rules:
        1. Prefer a column in date format (yyyy-mm-dd or yyyy/mm/dd)
        2. Otherwise look for plain numeric serial numbers (1, 2, 3...)
        3. Default to the first column
        """
        import re
        date_pattern = re.compile(r'^\d{4}[-/]\d{2}[-/]\d{2}')
        seq_pattern = re.compile(r'^\d{1,3}$')

        for col_idx, col_bound in enumerate(column_boundaries):
            # Collect the boxes whose left edge lies inside this column
            col_boxes = [b for b in ocr_boxes
                        if col_bound[0] <= b['bbox'][0] <= col_bound[1]]

            if not col_boxes:
                continue

            # Check whether most boxes are in date format
            date_count = sum(1 for b in col_boxes if date_pattern.match(b['text']))
            if date_count > len(col_boxes) * 0.5:
                return col_idx

            # Check whether this is a serial-number column
            seq_count = sum(1 for b in col_boxes if seq_pattern.match(b['text']))
            if seq_count > len(col_boxes) * 0.5:
                return col_idx

        return 0  # Default to the first column
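
The row-separator code mentions filtering out the table header but does not implement it. One possible way to do that, sketched here under the assumption that the anchor column holds dates, is to accept only anchor boxes whose text matches the same date pattern used for anchor detection. The function name `row_starts_from_anchor` and the `min_row_gap` parameter are hypothetical; `min_row_gap` mirrors the hard-coded 20-pixel threshold above.

```python
import re
from typing import Dict, List

# Same date format as in identify_anchor_column: yyyy-mm-dd or yyyy/mm/dd.
DATE_PATTERN = re.compile(r"^\d{4}[-/]\d{2}[-/]\d{2}")


def row_starts_from_anchor(anchor_boxes: List[Dict], min_row_gap: int = 20) -> List[int]:
    """Return row-start Y coordinates, skipping anchor boxes that are not dates."""
    starts: List[int] = []
    for box in sorted(anchor_boxes, key=lambda b: b["bbox"][1]):
        if not DATE_PATTERN.match(box["text"]):
            continue  # header cell or wrapped continuation text
        y_top = int(box["bbox"][1])
        if starts and y_top - starts[-1] < min_row_gap:
            continue  # too close to the previous row start
        starts.append(y_top)
    return starts


anchor_boxes = [
    {"text": "Transaction date", "bbox": [10, 5, 90, 25]},
    {"text": "2024-01-01", "bbox": [10, 40, 90, 60]},
    {"text": "2024-01-02", "bbox": [10, 100, 90, 120]},
]
print(row_starts_from_anchor(anchor_boxes))  # [40, 100]
```

With this filter the header lies above the first separator, so a builder that only assigns boxes at or below the first separator drops it from the data rows.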

Stage 4: Output options

Option A: draw lines, then re-recognize (recommended as the fallback)

Draw the detected table lines on the original image, then call a bordered-table recognition model (such as PaddleStructure).

    def draw_table_lines(self, image: np.ndarray,
                          column_boundaries: List[Tuple[int, int]],
                          row_separators: List[int],
                          line_color: Tuple[int, int, int] = (0, 0, 0),
                          line_thickness: int = 1) -> np.ndarray:
        """
        Draw table lines on the image.
        """
        import cv2

        result = image.copy()

        # Nothing to draw without both column and row information
        if not column_boundaries or not row_separators:
            return result

        # Outer bounds of the table (cv2.line needs integer coordinates)
        table_left = int(column_boundaries[0][0]) - 5
        table_right = int(column_boundaries[-1][1]) + 5
        table_top = int(row_separators[0]) - 5
        table_bottom = int(row_separators[-1]) + 30  # Leave room for the last row

        # Draw vertical lines (column separators)
        # Leftmost line first
        cv2.line(result, (table_left, table_top), (table_left, table_bottom), line_color, line_thickness)

        for col_left, col_right in column_boundaries:
            # Right boundary line of each column
            cv2.line(result, (int(col_right), table_top), (int(col_right), table_bottom), line_color, line_thickness)

        # Draw horizontal lines (row separators), slightly above each row's text
        for y in row_separators:
            cv2.line(result, (table_left, int(y) - 2), (table_right, int(y) - 2), line_color, line_thickness)

        # Bottom line
        cv2.line(result, (table_left, table_bottom), (table_right, table_bottom), line_color, line_thickness)

        return result

Option C: build structured data directly (no line drawing)

Use the detected boundaries to assign each OCR box directly to its cell.

    def build_structured_table(self, ocr_boxes: List[Dict],
                                column_boundaries: List[Tuple[int, int]],
                                row_separators: List[int]) -> List[List[str]]:
        """
        Build structured table data directly.

        Returns:
            2D list where table[row_idx][col_idx] = cell_text
        """
        n_rows = len(row_separators)
        n_cols = len(column_boundaries)

        # Initialize the table; each cell collects a list of text fragments
        table = [[[] for _ in range(n_cols)] for _ in range(n_rows)]

        for box in ocr_boxes:
            x_center = (box['bbox'][0] + box['bbox'][2]) / 2
            y_center = (box['bbox'][1] + box['bbox'][3]) / 2

            # Find the column this box belongs to (with a 10-pixel margin)
            col_idx = -1
            for i, (x_left, x_right) in enumerate(column_boundaries):
                if x_left - 10 <= x_center <= x_right + 10:
                    col_idx = i
                    break

            # Find the row this box belongs to; the last row is open-ended
            row_idx = -1
            for i in range(len(row_separators)):
                row_top = row_separators[i]
                row_bottom = row_separators[i + 1] if i + 1 < len(row_separators) else float('inf')

                if row_top <= y_center < row_bottom:
                    row_idx = i
                    break

            # Add the text to its cell; boxes outside every row or column are dropped
            if row_idx >= 0 and col_idx >= 0:
                table[row_idx][col_idx].append(box['text'])

        # Join the text fragments of each cell
        result = []
        for row in table:
            result.append([' '.join(texts) for texts in row])

        return result
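
The pipeline diagram describes Option C as building HTML from the boundary information, while the method above returns a 2D list of strings. As a minimal sketch of the missing last step, the 2D list can be rendered as an HTML table with the standard library. The function name `table_to_html` is an assumption for this illustration, and the output has no header row or styling.

```python
from html import escape
from typing import List


def table_to_html(table_data: List[List[str]]) -> str:
    """Render a 2D list of cell texts as a plain HTML table."""
    rows = []
    for row in table_data:
        # Escape cell text so characters like & and < stay valid HTML.
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        rows.append(f"<tr>{cells}</tr>")
    return "<table>" + "".join(rows) + "</table>"


sample = [
    ["2024-01-01", "Transfer", "100.00"],
    ["2024-01-02", "A & B Store", "-20.00"],
]
print(table_to_html(sample))
```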

🔧 Complete fallback flow

class TableLineFallback:
    """Fallback processor for borderless tables."""

    def __init__(self):
        self.line_drawer = TableLineDrawer()

    def process(self, image_path: str, ocr_result: List[Dict]) -> Dict:
        """
        Fallback processing flow.

        Args:
            image_path: path to the image
            ocr_result: PaddleOCR results

        Returns:
            {
                'method': 'line_drawing_fallback',
                'column_boundaries': [(x_left, x_right), ...],
                'row_separators': [y, ...],
                'table_data': [[...], [...], ...],
                'debug_image': np.ndarray (the image with lines drawn)
            }
        """
        import cv2

        # 1. Read the image to get its size
        image = cv2.imread(image_path)
        if image is None:
            # cv2.imread returns None instead of raising when the file cannot be read
            raise FileNotFoundError(f"Cannot read image: {image_path}")
        page_width = image.shape[1]

        # 2. Detect column boundaries
        column_boundaries = self.line_drawer.detect_column_boundaries(ocr_result, page_width)
        print(f"Detected {len(column_boundaries)} columns")

        # 3. Identify the anchor column
        anchor_idx = self.line_drawer.identify_anchor_column(ocr_result, column_boundaries)
        print(f"Anchor column: column {anchor_idx + 1}")

        # 4. Detect row separators
        row_separators = self.line_drawer.detect_row_separators(
            ocr_result, column_boundaries, anchor_idx
        )
        print(f"Detected {len(row_separators)} rows")

        # 5. Build the structured data
        table_data = self.line_drawer.build_structured_table(
            ocr_result, column_boundaries, row_separators
        )

        # 6. (Optional) Draw lines for debugging or re-recognition
        debug_image = self.line_drawer.draw_table_lines(
            image, column_boundaries, row_separators
        )

        return {
            'method': 'line_drawing_fallback',
            'column_boundaries': column_boundaries,
            'row_separators': row_separators,
            'table_data': table_data,
            'debug_image': debug_image
        }

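📊 Option comparison

| Option | Pros | Cons | Best for |
|---|---|---|---|
| A: draw lines + bordered-table model | Leverages mature bordered-table recognition | Needs an extra model call, which adds latency | When the highest accuracy is the goal |
| B: crop cells directly | Each cell is OCR'd independently, which avoids text running together | Complex to implement; boundary alignment must be handled | Cells with complex content |
| C: pure-logic construction | Fastest; no extra model needed | Depends on OCR accuracy | Most bank statements |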
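
Option B is listed in the comparison but has no code in this answer. A rough sketch of its cropping step is shown here, assuming the image is a NumPy array as returned by OpenCV; each crop would then be passed to an OCR engine, which is left out because that call depends on the engine used. The function name `crop_cells` and the `last_row_height` parameter are hypothetical; `last_row_height` mirrors the 30 pixels reserved for the last row in `draw_table_lines`.

```python
from typing import List, Tuple

import numpy as np


def crop_cells(
    image: np.ndarray,
    column_boundaries: List[Tuple[int, int]],
    row_separators: List[int],
    last_row_height: int = 30,
) -> List[List[np.ndarray]]:
    """Cut the image into one sub-image per (row, column) cell."""
    if not column_boundaries or not row_separators:
        return []
    height, width = image.shape[:2]

    # Each row ends where the next one starts; the last row gets a fixed height.
    bottoms = list(row_separators[1:]) + [min(height, row_separators[-1] + last_row_height)]

    cells: List[List[np.ndarray]] = []
    for top, bottom in zip(row_separators, bottoms):
        row = []
        for x_left, x_right in column_boundaries:
            # Clamp to the image so slightly out-of-range boundaries stay valid.
            row.append(image[max(0, top):bottom, max(0, x_left):min(width, x_right)])
        cells.append(row)
    return cells


page = np.zeros((100, 200, 3), dtype=np.uint8)
cells = crop_cells(page, [(10, 60), (80, 150)], [20, 50])
print(len(cells), cells[0][0].shape, cells[1][1].shape)  # 2 (30, 50, 3) (30, 70, 3)
```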

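💡 Suggestions

Start with Option C: for standard bank statements (China Merchants Bank, Bank of Communications, etc.), pure-logic construction is usually sufficient.
Keep Option A as the fallback: if validation of the Option C result fails (for example, a row has the wrong number of cells), switch to line drawing + the bordered-table model.
Anchor column detection is the key: the anchor column determines how accurately rows are split. The date column is the best anchor because its format is stable and it appears exactly once per row.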
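
The validation that decides when to fall back to Option A is not spelled out above. One possible check, sketched here with assumed rules, flags a table when any row has an empty anchor cell or too few non-empty cells. The function name `needs_fallback` and the `min_filled_ratio` threshold are hypothetical and would need tuning on real statements.

```python
from typing import List


def needs_fallback(
    table_data: List[List[str]], anchor_idx: int = 0, min_filled_ratio: float = 0.5
) -> bool:
    """Return True when the pure-logic result looks unreliable."""
    if not table_data or not table_data[0]:
        return True
    n_cols = len(table_data[0])
    for row in table_data:
        if len(row) != n_cols:
            return True  # ragged row
        if not row[anchor_idx].strip():
            return True  # every row should start with an anchor value
        filled = sum(1 for cell in row if cell.strip())
        if filled / n_cols < min_filled_ratio:
            return True  # too many empty cells in this row
    return False


good = [["2024-01-01", "Transfer", "100.00"], ["2024-01-02", "Fee", "-2.00"]]
bad = [["2024-01-01", "Transfer", "100.00"], ["", "Fee", "-2.00"]]
print(needs_fallback(good), needs_fallback(bad))  # False True
```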

Would you like me to provide the complete runnable code?
