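Batch Processing Module

Overview

The batch processing module applies a table template learned from the first page to multiple files. It is intended for scenarios such as multi-page bank statements.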

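Architecture

Core idea

1. Learn a template from the first page: mainly the column boundaries (vertical lines).
2. Apply the template to every page:
   - Vertical lines (column boundaries): reused unchanged
   - Horizontal lines (row separators): recomputed adaptively for each page
3. Process pages in parallel to speed up large batches.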
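The column-reuse and row-recomputation idea can be sketched as follows. This is a minimal illustration, not the logic in batch_service.py: it assumes `table_bbox` is ordered `[x1, y1, x2, y2]`, that each OCR text box is reduced to its vertical centre, and that rows are separated by a fixed pixel gap. The `horizontal_lines` key, the `gap` parameter, and both function names are hypothetical.

```python
from typing import Dict, List, Sequence


def split_rows(y_centers: Sequence[float], gap: float = 12.0) -> List[float]:
    """Return separator y-values between rows of text.

    Centres closer than `gap` pixels to the previous one share a row.
    """
    ys = sorted(y_centers)
    if not ys:
        return []
    rows: List[List[float]] = [[ys[0]]]
    for y in ys[1:]:
        if y - rows[-1][-1] > gap:
            rows.append([y])      # large vertical gap: start a new row
        else:
            rows[-1].append(y)    # same row as the previous centre
    # Place each separator midway between adjacent rows.
    return [(upper[-1] + lower[0]) / 2 for upper, lower in zip(rows, rows[1:])]


def apply_template(template: Dict, y_centers: Sequence[float]) -> Dict:
    """Reuse the template's columns and recompute rows for one page."""
    _, top, _, bottom = template["table_bbox"]  # assumed [x1, y1, x2, y2]
    inside = [y for y in y_centers if top <= y <= bottom]
    return {
        "vertical_lines": list(template["vertical_lines"]),      # copied as-is
        "horizontal_lines": [top, *split_rows(inside), bottom],  # hypothetical key
        "table_bbox": list(template["table_bbox"]),
    }
```

Only the horizontal lines depend on the page content here, which is why a template learned once can serve pages with different row counts.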

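Differences from the old system

Feature	Old system (batch_processor.py)	New system (batch_service.py)
Dependency	SmartTableLineGenerator	TableAnalyzer
Data structure	TableStructure dataclass	Dict
Column detection	Standalone ColumnDetector module	Built-in clustering algorithm
Row detection	AdaptiveRowSplitter	TableAnalyzer.analyze()
Interface	Command-line tool	FastAPI REST API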
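Because the new system passes plain dicts where the old one used a dataclass, code that still holds a dataclass instance needs a conversion step. The sketch below shows one way to do that with the standard library. The `TableStructure` class here is a hypothetical stand-in: its fields simply mirror the keys of the `template_structure` request field, and the real class may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TableStructure:
    """Hypothetical stand-in; the real class's fields are not shown in this README."""
    vertical_lines: List[int] = field(default_factory=list)
    table_bbox: List[int] = field(default_factory=list)
    total_cols: int = 0
    mode: str = "cluster"


def to_template_dict(structure: TableStructure) -> dict:
    """Convert the dataclass to a plain dict suitable for a JSON request body."""
    return asdict(structure)
```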

API usage

1. Batch processing

Endpoint: POST /api/batch/process

Request body:

{
  "template_structure": {
    "vertical_lines": [100, 200, 300, 400],
    "table_bbox": [50, 100, 800, 2000],
    "total_cols": 5,
    "mode": "cluster"
  },
  "file_pairs": [
    {
      "json_path": "/path/to/page_001.json",
      "image_path": "/path/to/page_001.png"
    },
    {
      "json_path": "/path/to/page_002.json",
      "image_path": "/path/to/page_002.png"
    }
  ],
  "output_dir": "/path/to/output",
  "parallel": true,
  "adjust_rows": true
}

Response:

{
  "success": true,
  "total": 20,
  "processed": 18,
  "failed": 2,
  "results": [
    {
      "success": true,
      "json_path": "/path/to/page_001.json",
      "image_path": "/path/to/page_001.png",
      "structure_path": "/path/to/output/page_001_structure.json",
      "filename": "page_001.png",
      "rows": 45,
      "cols": 5
    }
  ],
  "message": "批量处理完成: 成功 18/20"
}
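For callers outside the frontend, the request above can be assembled with the Python standard library. This is a sketch under assumptions: the base URL (for example `http://localhost:8000`) is not given in this README, and `build_process_request` and `summarize` are hypothetical helper names. The request object is only built here; sending it would be a call to `urllib.request.urlopen`.

```python
import json
from typing import Dict, List
from urllib.request import Request


def build_process_request(base_url: str, template: Dict,
                          file_pairs: List[Dict[str, str]], output_dir: str,
                          parallel: bool = True,
                          adjust_rows: bool = True) -> Request:
    """Build (but do not send) a POST request for /api/batch/process."""
    body = {
        "template_structure": template,
        "file_pairs": file_pairs,
        "output_dir": output_dir,
        "parallel": parallel,
        "adjust_rows": adjust_rows,
    }
    return Request(
        base_url.rstrip("/") + "/api/batch/process",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize(response: Dict) -> str:
    """Format the counters of a batch response as one line."""
    return (f"{response['processed']}/{response['total']} processed, "
            f"{response['failed']} failed")
```

Applied to the example response above, `summarize` would report 18 of 20 pages processed and 2 failed.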

2. Batch drawing

Endpoint: POST /api/batch/draw

Request body:

{
  "results": [
    {
      "success": true,
      "image_path": "/path/to/page_001.png",
      "structure_path": "/path/to/output/page_001_structure.json",
      "filename": "page_001.png"
    }
  ],
  "line_width": 2,
  "line_color": [0, 0, 0]
}
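The draw request is normally derived from a batch-processing response by keeping only the pages that succeeded. A minimal sketch of that step follows. `build_draw_payload` is a hypothetical helper, and the validation rules (a width of at least 1, three colour channels in the range 0 to 255) are assumptions, since this README does not state the accepted ranges or the channel order.

```python
from typing import Dict, Sequence

# Fields that appear in the draw request example above.
DRAW_FIELDS = ("success", "image_path", "structure_path", "filename")


def build_draw_payload(process_response: Dict, line_width: int = 2,
                       line_color: Sequence[int] = (0, 0, 0)) -> Dict:
    """Build a /api/batch/draw body from a /api/batch/process response."""
    if line_width < 1:
        raise ValueError("line_width must be at least 1")  # assumed rule
    if len(line_color) != 3 or any(not 0 <= c <= 255 for c in line_color):
        raise ValueError("line_color must be three values in 0..255")  # assumed rule
    results = [
        {key: item[key] for key in DRAW_FIELDS if key in item}
        for item in process_response.get("results", [])
        if item.get("success")  # skip pages that failed processing
    ]
    return {
        "results": results,
        "line_width": line_width,
        "line_color": list(line_color),
    }
```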

Frontend integration example

import { batchApi } from '@/api'

// Batch processing
async function processBatch() {
  try {
    const response = await batchApi.batchProcess({
      template_structure: editorStore.structure,
      file_pairs: templateStore.filePairs.map(pair => ({
        json_path: pair.json_path,
        image_path: pair.image_path
      })),
      output_dir: templateStore.scanConfig.outputDir,
      parallel: true,
      adjust_rows: true
    })

    console.log(`Processed: ${response.processed}/${response.total}`)

    // Optional: batch drawing, only for pages that were processed successfully
    if (response.success && response.processed > 0) {
      const drawResponse = await batchApi.batchDraw({
        results: response.results.filter(r => r.success),
        line_width: 2,
        line_color: [0, 0, 0]
      })
      console.log(`Drawn: ${drawResponse.drawn}/${drawResponse.total}`)
    }
  } catch (error) {
    console.error('Batch processing failed:', error)
  }
}

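Workflow

Typical use case: batch annotation of bank statements

1. Annotate the first page manually
   - Annotate the first page in the editor
   - Adjust the row and column structure until it looks right
   - Save it as the template
2. Select a data source
   - Choose the matching source under "Predefined data sources"
   - The system scans all file pairs automatically
3. Apply the template in batch
   - Click the "Batch process" button
   - The system applies the first page's column structure to every page
   - Each page's row structure adapts to its actual OCR content
4. Review the results
   - All _structure.json files are saved to the output directory
   - Optionally, draw table-line images for verification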
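The automatic scan for file pairs in the workflow above can be pictured as matching each OCR JSON file with the image that shares its base name. The sketch below assumes that naming convention (as in `page_001.json` and `page_001.png` from the request example) and a flat directory; `scan_file_pairs` is a hypothetical name, and the real scanner may work differently.

```python
from pathlib import Path
from typing import Dict, List


def scan_file_pairs(data_dir: str, image_ext: str = ".png") -> List[Dict[str, str]]:
    """Pair every *.json file with the image of the same base name."""
    pairs: List[Dict[str, str]] = []
    for json_file in sorted(Path(data_dir).glob("*.json")):
        image_file = json_file.with_suffix(image_ext)
        if image_file.is_file():  # skip JSON files that have no matching image
            pairs.append({
                "json_path": str(json_file),
                "image_path": str(image_file),
            })
    return pairs
```

The returned list has the same shape as the `file_pairs` field of the batch processing request.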

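Parameters

adjust_rows:
- true (recommended): recompute the row separators on each page, to accommodate differing content heights
- false: reuse the template's row structure unchanged; suitable only when row heights are identical across pages
parallel:
- true (recommended): process pages in parallel; faster
- false: process pages serially; easier to debug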
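The effect of the `parallel` flag can be illustrated with a thread pool. This is a sketch rather than the service's actual code: `process_one` stands for whatever per-page function the backend uses, `run_batch` and the `error` key are hypothetical, and the default of 4 workers is an assumption. Each page's exception is caught separately so that one bad page does not abort the batch.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_batch(pairs: List[Dict[str, str]],
              process_one: Callable[[Dict[str, str]], Dict],
              parallel: bool = True, max_workers: int = 4) -> Dict:
    """Process every file pair and collect per-page results and counters."""

    def safe(pair: Dict[str, str]) -> Dict:
        try:
            return {"success": True, **process_one(pair)}
        except Exception as exc:  # record the failure instead of raising
            return {"success": False,
                    "json_path": pair.get("json_path"),
                    "error": str(exc)}  # hypothetical field

    if parallel:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = list(pool.map(safe, pairs))  # map preserves input order
    else:
        results = [safe(pair) for pair in pairs]

    processed = sum(1 for r in results if r["success"])
    return {"total": len(results), "processed": processed,
            "failed": len(results) - processed, "results": results}
```

Both branches return results in input order, so switching `parallel` off for debugging should not change the output. In Python, threads help most when the per-page work is dominated by file I/O such as reading OCR JSON.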

Performance

Serial: ~1-2 s per page
Parallel: ~0.3-0.5 s per page (4 threads)
Bottlenecks: reading OCR data and the row-splitting algorithm

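Future extensions

Phase 3+: advanced features

- Real-time progress updates over WebSocket
- Background task queue (Celery)
- Batch processing history
- Retry mechanism for failed pages
- Visual comparison of processing results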

File layout

backend/
├── services/
│   └── batch_service.py       # Core batch processing logic
├── api/
│   └── batch.py               # REST API endpoints
└── main.py                    # Registers the batch router

frontend/
└── src/
    └── api/
        └── batch.ts           # Frontend API client

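Compatibility with the old system

The old table_line_generator/batch_processor.py is kept for now as a reference, but the new system is recommended:

✅ Simpler implementation
✅ Unified architecture
✅ Web API integration
✅ Integrated frontend and backend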
