# Batch Processing Module

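## Overview

The batch processing module applies a table template learned from the first page to multiple files. It is intended for multi-page documents such as bank statements.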

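## Architecture

### Core approach
1. **Learn a template from the first page**: mainly the column boundaries (vertical lines)
2. **Apply it to every page**:
   - **Vertical lines (column boundaries)**: reused unchanged
   - **Horizontal lines (row separators)**: recomputed adaptively for each page
3. **Process in parallel**: improves throughput on large batches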
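
The approach above can be sketched in a few lines. This is a minimal illustration and not the code in `batch_service.py`: the helper names `rows_from_ocr` and `apply_template`, the `horizontal_lines` key, and the `(x0, y0, x1, y1)` box format are all assumptions made for the example.

```python
from typing import Dict, List, Sequence


def rows_from_ocr(boxes: Sequence[Sequence[float]]) -> List[float]:
    """Derive horizontal line positions from OCR boxes given as (x0, y0, x1, y1).

    Boxes that overlap vertically are merged into one text row. A line is
    placed at the top of the first row, midway between consecutive rows,
    and at the bottom of the last row.
    """
    if not boxes:
        return []
    spans = sorted((b[1], b[3]) for b in boxes)
    rows = [list(spans[0])]
    for top, bottom in spans[1:]:
        if top < rows[-1][1]:  # overlaps the current row vertically
            rows[-1][1] = max(rows[-1][1], bottom)
        else:
            rows.append([top, bottom])
    lines = [rows[0][0]]
    lines += [(a[1] + b[0]) / 2 for a, b in zip(rows, rows[1:])]
    lines.append(rows[-1][1])
    return lines


def apply_template(template: Dict, boxes: Sequence[Sequence[float]]) -> Dict:
    """Reuse the template's columns and recompute rows for one page."""
    return {
        "vertical_lines": list(template["vertical_lines"]),  # copied as-is
        "horizontal_lines": rows_from_ocr(boxes),            # page-specific
        "total_cols": template["total_cols"],
    }
```

The split mirrors the design choice: column positions are stable across pages of the same statement, while the number and height of rows vary from page to page.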

### Differences from the legacy system

| Feature | Legacy system (batch_processor.py) | New system (batch_service.py) |
|------|---------------------------|-------------------------|
| Dependency | SmartTableLineGenerator | TableAnalyzer |
| Data structure | TableStructure dataclass | Dict structure |
| Column detection | Standalone ColumnDetector module | Built-in clustering algorithm |
| Row detection | AdaptiveRowSplitter | TableAnalyzer.analyze() |
| Interface | Command-line tool | FastAPI REST API |
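
The source does not say which clustering algorithm the new system uses for column detection. As a rough illustration of the idea only, the sketch below groups the left edges of OCR boxes with simple one-dimensional gap clustering; the function name `cluster_1d` and the `tol` parameter are hypothetical.

```python
from typing import Iterable, List


def cluster_1d(values: Iterable[float], tol: float = 10.0) -> List[float]:
    """Group values whose sorted neighbours lie within `tol` of each other.

    Returns the mean of each group, which can serve as a candidate
    column-boundary position.
    """
    clusters: List[List[float]] = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tol:
            clusters[-1].append(v)   # close enough: same column
        else:
            clusters.append([v])     # gap too large: start a new column
    return [sum(c) / len(c) for c in clusters]
```

A real implementation would likely also filter out small clusters caused by stray OCR boxes.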

## API Usage

### 1. Batch processing

**Endpoint**: `POST /api/batch/process`

**Request body**:
```json
{
  "template_structure": {
    "vertical_lines": [100, 200, 300, 400],
    "table_bbox": [50, 100, 800, 2000],
    "total_cols": 5,
    "mode": "cluster"
  },
  "file_pairs": [
    {
      "json_path": "/path/to/page_001.json",
      "image_path": "/path/to/page_001.png"
    },
    {
      "json_path": "/path/to/page_002.json",
      "image_path": "/path/to/page_002.png"
    }
  ],
  "output_dir": "/path/to/output",
  "parallel": true,
  "adjust_rows": true
}
```
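
For callers outside the frontend, the request can be assembled and sent with the Python standard library. This is a sketch under stated assumptions: the helper names are hypothetical, the check that `vertical_lines`, `table_bbox`, and `total_cols` are required is an assumption (the source does not list required fields), and the server URL is left to the caller.

```python
import json
import urllib.request


def build_batch_request(template, file_pairs, output_dir,
                        parallel=True, adjust_rows=True):
    """Build the JSON body for POST /api/batch/process.

    `file_pairs` is an iterable of (json_path, image_path) tuples.
    """
    # Assumed-required keys; adjust to match the actual server-side schema.
    for key in ("vertical_lines", "table_bbox", "total_cols"):
        if key not in template:
            raise ValueError(f"template_structure is missing {key!r}")
    return {
        "template_structure": template,
        "file_pairs": [
            {"json_path": j, "image_path": i} for j, i in file_pairs
        ],
        "output_dir": output_dir,
        "parallel": parallel,
        "adjust_rows": adjust_rows,
    }


def post_json(url, payload):
    """POST `payload` as JSON and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Usage would look like `post_json(base_url + "/api/batch/process", build_batch_request(...))`, where `base_url` is wherever the backend is deployed.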

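**Response** (illustrative values for a 20-file batch, with `results` abbreviated to one entry; the `message` string is returned in Chinese and reads "Batch processing complete: 18/20 succeeded"):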
```json
{
  "success": true,
  "total": 20,
  "processed": 18,
  "failed": 2,
  "results": [
    {
      "success": true,
      "json_path": "/path/to/page_001.json",
      "image_path": "/path/to/page_001.png",
      "structure_path": "/path/to/output/page_001_structure.json",
      "filename": "page_001.png",
      "rows": 45,
      "cols": 5
    }
  ],
  "message": "批量处理完成: 成功 18/20"
}
```
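
A caller will usually want to separate successful pages from failed ones. The sketch below shows one way to do that; the function name `summarize` is hypothetical, and the shape of a failed entry (assumed here to carry `success: false` plus `filename` or `image_path`) is not documented in the source.

```python
def summarize(response):
    """Split a batch-process response into successes and failures."""
    ok = [r for r in response["results"] if r.get("success")]
    failed = [r for r in response["results"] if not r.get("success")]
    return {
        # Sanity check on the counters reported by the server.
        "consistent": (
            response["processed"] + response["failed"] == response["total"]
        ),
        "structure_paths": [r["structure_path"] for r in ok],
        # Assumption: failed entries still identify the page they refer to.
        "failed_files": [
            r.get("filename") or r.get("image_path") for r in failed
        ],
    }
```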

### 2. Batch drawing

**Endpoint**: `POST /api/batch/draw`

**Request body**:
```json
{
  "results": [
    {
      "success": true,
      "image_path": "/path/to/page_001.png",
      "structure_path": "/path/to/output/page_001_structure.json",
      "filename": "page_001.png"
    }
  ],
  "line_width": 2,
  "line_color": [0, 0, 0]
}
```
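
The draw request is typically built from the results of a previous batch-processing call. The following sketch keeps only successful entries and the four fields shown in the request body; the helper name `build_draw_request` is hypothetical, and the validation assumes `line_color` holds three 0-255 channel values (the channel order is not stated in the source).

```python
def build_draw_request(process_results, line_width=2, line_color=(0, 0, 0)):
    """Build the JSON body for POST /api/batch/draw."""
    if line_width <= 0:
        raise ValueError("line_width must be positive")
    if len(line_color) != 3 or any(not 0 <= c <= 255 for c in line_color):
        raise ValueError("line_color must be three values in 0..255")
    keep = ("success", "image_path", "structure_path", "filename")
    return {
        # Pages that failed processing have no structure file to draw.
        "results": [
            {k: r[k] for k in keep}
            for r in process_results
            if r.get("success")
        ],
        "line_width": line_width,
        "line_color": list(line_color),
    }
```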

## Frontend integration example

```typescript
import { batchApi } from '@/api'

// Run batch processing with the current editor structure as the template
async function processBatch() {
  try {
    const response = await batchApi.batchProcess({
      template_structure: editorStore.structure,
      file_pairs: templateStore.filePairs.map(pair => ({
        json_path: pair.json_path,
        image_path: pair.image_path
      })),
      output_dir: templateStore.scanConfig.outputDir,
      parallel: true,
      adjust_rows: true
    })

    console.log(`Processed: ${response.processed}/${response.total}`)

    // Optional: draw table lines for the pages that succeeded
    if (response.success && response.processed > 0) {
      const drawResponse = await batchApi.batchDraw({
        results: response.results.filter(r => r.success),
        line_width: 2,
        line_color: [0, 0, 0]
      })
      console.log(`Drawn: ${drawResponse.drawn}/${drawResponse.total}`)
    }
  } catch (error) {
    console.error('Batch processing failed:', error)
  }
}
```

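## Workflow

### Typical use case: batch annotation of bank statements

1. **Annotate the first page manually**
   - Annotate the first page in the editor
   - Adjust the row and column structure until it is correct
   - Save it as a template

2. **Select a data source**
   - Choose the matching entry under "Predefined data sources"
   - The system scans for all file pairs automatically

3. **Apply the template in batch**
   - Click the "Batch process" button
   - The system applies the first page's column structure to every page
   - Each page's row structure adapts to its actual OCR content

4. **Review the results**
   - All `_structure.json` files are saved to the output directory
   - Optionally, render images with the table lines drawn in order to verify the result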
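
The file-pair scan in step 2 can be pictured as matching each OCR JSON file with the image that shares its base name. The source does not describe the actual scanning rules, so the sketch below is an assumption throughout: the function name `scan_file_pairs`, the same-directory layout, and the `.png` suffix are chosen to match the example paths in this document.

```python
from pathlib import Path


def scan_file_pairs(directory, image_suffix=".png"):
    """Pair each OCR JSON file with the image that has the same stem."""
    pairs = []
    for json_file in sorted(Path(directory).glob("*.json")):
        # Skip generated output in case it lives alongside the input.
        if json_file.name.endswith("_structure.json"):
            continue
        image = json_file.with_suffix(image_suffix)
        if image.exists():
            pairs.append(
                {"json_path": str(json_file), "image_path": str(image)}
            )
    return pairs
```

The returned list has the same shape as the `file_pairs` field of the batch-processing request.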

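### Parameters

- **adjust_rows**:
  - `true` (recommended): recompute the row separators for each page, so pages with different content heights are handled correctly
  - `false`: reuse the template's row structure unchanged; suitable only when row heights are identical on every page

- **parallel**:
  - `true` (recommended): process pages in parallel for higher throughput
  - `false`: process pages one at a time, which is easier to debug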
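
The effect of the `parallel` flag can be illustrated with a thread pool from the standard library. This is a sketch, not the service's implementation: `run_batch` and `process_page` are hypothetical names, and the default of 4 workers is taken from the thread count quoted in the performance figures.

```python
from concurrent.futures import ThreadPoolExecutor


def run_batch(process_page, file_pairs, parallel=True, max_workers=4):
    """Run `process_page` over all pairs, in a thread pool or serially.

    Each page is wrapped so that one failure does not abort the batch.
    Results are returned in input order in both modes.
    """
    def safe(pair):
        try:
            return {"success": True, **process_page(pair)}
        except Exception as exc:  # record the failure and keep going
            return {
                "success": False,
                "image_path": pair.get("image_path"),
                "error": str(exc),
            }

    if parallel:
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(safe, file_pairs))
    return [safe(pair) for pair in file_pairs]
```

Because both modes return results in the same order, switching `parallel` off to debug a failing page does not change which entry corresponds to which file.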

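## Performance

- **Serial processing**: ~1-2 s per page
- **Parallel processing**: ~0.3-0.5 s per page (4 threads)
- **Bottlenecks**: reading OCR data and the row-splitting algorithm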
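
To put the per-page figures in context, the helper below converts them into a rough batch duration. It is plain arithmetic on the approximate numbers quoted above; the names are hypothetical, and real timings will depend on hardware and page content.

```python
# Approximate per-page timings in seconds, as quoted in this document.
SERIAL = (1.0, 2.0)
PARALLEL_4_THREADS = (0.3, 0.5)


def estimate_seconds(pages, per_page):
    """Return a (low, high) wall-clock estimate in seconds for `pages` pages."""
    low, high = per_page
    return round(pages * low, 1), round(pages * high, 1)
```

For a 200-page statement this gives roughly 200-400 s in serial mode and 60-100 s with 4 threads.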

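## Future work

### Phase 3+: Advanced features
- [ ] Real-time progress updates over WebSocket
- [ ] Background task queue (Celery)
- [ ] Batch processing history
- [ ] Retry mechanism for failed pages
- [ ] Visual comparison of processing results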

## File overview

```
backend/
├── services/
│   └── batch_service.py       # Core batch-processing logic
├── api/
│   └── batch.py               # REST API endpoints
└── main.py                    # Registers the batch router

frontend/
└── src/
    └── api/
        └── batch.ts           # Frontend API client
```

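## Compatibility with the legacy system

The legacy `table_line_generator/batch_processor.py` is kept for now as a reference, but the new system is recommended:
- ✅ Simpler implementation
- ✅ Unified architecture
- ✅ Web API integration
- ✅ Integrated frontend and backend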