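8 commits 75d01a1ed5 ... b210ab056b

Author SHA1 Message Date
  zhch158_admin b210ab056b fix(optimize watermark handling and layout detection config): update several bank_statement config files, adjust the watermark-removal settings, enable pre-detection processing, optimize the layout detection module, and add OCR recognition and table classification to improve the accuracy and flexibility of the overall OCR pipeline. 1 month ago
  zhch158_admin 70f36c0904 fix(adjust watermark handling and cell preprocessing config): update the watermark method and contrast-enhancement settings in bank_statement_yusys_local.yaml, adjusting thresholds and enabled flags to improve OCR results and flexibility. 1 month ago
  zhch158_admin b11fe5592e fix(adjust threshold to optimize watermark handling): change the threshold setting in the watermark module, lowering the cell-level threshold from 170 to 155, to improve OCR accuracy and flexibility. 1 month ago
  zhch158_admin a2311846f1 feat(enhance second-pass OCR and cell preprocessing): add test classes and cases to test_second_pass_ocr_aggregate.py that verify the short-text minimum-character setting, the contrast adjustment in cell preprocessing, and the watermark-handling logic, to improve OCR accuracy and flexibility. 1 month ago
  zhch158_admin df98998bd5 feat(optimize text filling and OCR recognition logic): update the TextFiller class, add a short-text minimum-character setting, refactor the recognition logic to support more flexible text parsing and score normalization, and improve cell contrast adjustment and enhancement, to improve OCR accuracy and flexibility. 1 month ago
  zhch158_admin eb694a01bb feat(add watermark evaluation and synthesis modules): add evaluate.py to compare watermark removal by the baseline and the LaMa GAN method, add lama_inpaint.py for LaMa model inference, and add watermark_synthesis.py to synthesize watermarks and generate the matching masks, improving watermark evaluation and synthesis. 1 month ago
  zhch158_admin d25c465024 feat(add cell preprocessing parameter sweep): add a parameter grid sweep example to cell_preprocess_lab.py, add cell_sweep.py to sweep preprocessing parameters over cell crops (watermark removal, contrast adjustment, and other settings) to improve OCR flexibility and accuracy, and delete the no-longer-used cell121_sweep.py. 1 month ago
  zhch158_admin 95bfd4baed feat(update watermark removal module docs): expand the watermark removal docs with a detailed description of capabilities, use cases, and parameters, and add an explanation of page-level and cell-level processing to make the module easier to understand and use. 1 month ago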

+ 250 - 98
docs/ocr_tools/universal_doc_parser/水印去除技术文档.md

@@ -2,12 +2,26 @@
 
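 ## Overview
 
-The watermark removal module (`ocr_utils/watermark_utils.py`) provides **two independent layers of watermark removal**, each tuned for different document types and scenarios:
+Watermark removal lives in the `ocr_utils/watermark/` package; `ocr_utils/watermark_utils.py` remains as a backward-compatible entry point (re-export). The core orchestration class is **`WatermarkProcessor`**, which supports two preset families (`presets.py`): **page-level (page)** and **cell-level (cell)**.
 
-| Layer | Target | Use case | Characteristics |
-|------|---------|---------|------|
-| **PDF level** | XObjects of text-based PDFs | Text-based PDFs such as bank statements | Keeps text searchable; lossless |
-| **Image level** | Pixels of scans / rendered images | Scans, images | Pixel-level processing, suited to pre-OCR preprocessing |
+Beyond PDF/page-level preprocessing, scenarios such as bank statements can remove watermarks again from individual cell crops during the **second-pass OCR of wired tables** (`text_filling.cell_preprocess`).
+
+| Layer | Target | Config location | Use case |
+|------|---------|---------|---------|
+| **PDF level** | XObjects of text-based PDFs | `input.txt_pdf_watermark_removal` | Text-based PDFs, before rendering |
+| **Page-level image** | Full-page rendered image | `preprocessor.watermark_removal` | Before page-level OCR of scans (optional) |
+| **Cell-level image** | Cell crop | `table_recognition_wired.second_pass_ocr.cell_preprocess.watermark` | Before second-pass OCR (cell-first recommended) |
+
+**Implementation modules:**
+
+| Path | Responsibility |
+|------|------|
+| `ocr_utils/watermark/presets.py` | Page/cell presets, `merge_watermark_config` |
+| `ocr_utils/watermark/removal.py` | `threshold` / `masked_adaptive` watermark removal |
+| `ocr_utils/watermark/processor.py` | `WatermarkProcessor` facade |
+| `ocr_utils/watermark/pdf.py` | XObject watermark removal for text-based PDFs |
+| `models/adapters/wired_table/text_filling.py` | Cell-level preprocessing + second-pass OCR |
+| `ocr_tools/cell_preprocess_lab/cell_sweep.py` | Single-cell parameter grid sweep (tuning) |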
 
 ---
 
@@ -35,13 +49,17 @@ graph TB
     F[Image input] --> J[Stage 2: image-level watermark removal]
     
     J --> K{watermark_removal enabled?}
-    K -->|Yes| L[Detect light diagonal watermark]
+    K -->|Yes| L[WatermarkProcessor page]
     K -->|No| N
-    L --> M[Remove watermark by thresholding]
+    L --> M[method: threshold / masked / masked_adaptive]
     M --> N[Orientation correction]
     
     N --> O[Layout detection]
-    O --> P[OCR recognition]
+    O --> P[Table OCR]
+    P --> Q{Second-pass OCR cell-level wm?}
+    Q -->|Yes| R[WatermarkProcessor cell + upscale]
+    Q -->|No| S
+    R --> S[In-cell OCR]
     
     style C fill:#e1f5ff
     style E fill:#e1f5ff
@@ -186,17 +204,81 @@ def scan_pdf_watermark_xobjs(pdf_bytes: bytes, sample_pages: int = 3) -> bool:
 
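 ### Use cases
 
-**Scans/images (`pdf_type='ocr'`)**: these cannot be handled through the PDF's internal structure; only pixel-level processing of the rendered image is possible.
+- **Page level**: scans/images (`pdf_type='ocr'`), invoked in `MinerUPreprocessor` via `WatermarkProcessor(scope="page")`.
+- **Cell level**: before the **second-pass OCR** of wired tables, invoked on `raw_crop` via `WatermarkProcessor(scope="cell")` (configured independently of the page level).
 
-### Principle
+The currently recommended strategy for bank statements is **cell-first**: set page-level `watermark_removal.enabled: false` and focus tuning on the cell-level `cell_preprocess.watermark`.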
+
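+### Grayscale convention (required reading)
+
+OpenCV and PIL share one convention for **8-bit grayscale images**:
+
+| Gray value | Appearance |
+|--------|------|
+| **0** | Black (dark strokes, ink) |
+| **255** | White (background, paper) |
+| Values in between (e.g. 100~220) | Gray (light watermark, faint strokes, scan noise) |
+
+On a typical scanned bank statement:
+
+- **Chinese character strokes**: low gray values (dark, usually well below threshold)
+- **Paper background / light diagonal watermark**: high gray values (light, usually above threshold)
+
+> **Note**: some UIs or Photoshop may display "0 = white", but this repository's code follows OpenCV: **0 = black, 255 = white**.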
+
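+### Watermark removal methods (`method`)
+
+| method | Description | What to write in YAML |
+|--------|------|-----------|
+| `threshold` | Global `gray > threshold → 255`; simple and fast | `method` plus optional `threshold` |
+| `masked` | Locate the watermark region with a mask, then process it | `method` only (detailed parameters come from the preset) |
+| `masked_adaptive` | Mask plus adaptive thresholding inside the mask | `method` only (detailed parameters come from the preset) |
+
+Preset defaults (`presets.py`, used when the YAML does not override them):
+
+| scope | Default `threshold` | Default `contrast_enhancement` |
+|-------|------------------|----------------------------|
+| **page** | 175 | enabled |
+| **cell** | 155 | disabled |
+
+`merge_watermark_config(scope, user_cfg)` merges the user's YAML with the presets in the table above; detailed parameters such as `mask` / `hough` / `adaptive` need not be written into the scenario YAML.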
+
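+### How the `threshold` method works
 
 Watermark characteristics in financial documents such as bank statements:
 
-- **Light color**: gray values are usually between 160 and 220 (between body text and background)
-- **Diagonal**: usually laid out at 45°
-- **Sparse text**: watermark text covers a small share of the area
+- **Light color**: gray values mostly 160~220 (between body text and white paper)
+- **Diagonal**: typically repeated text at 45°
+- **Low coverage**: a sparse, light texture relative to the whole page or cell
+
+Core code (`removal.py`):
+
+```python
+cleaned = gray.copy()
+cleaned[gray > threshold] = 255   # pixels brighter than the threshold → white
+```
+
+**Semantics**: keep pixels with **gray ≤ threshold** (dark text) and paint **brighter** pixels white, which weakens light watermarks.
+
+### What raising / lowering `threshold` actually does
+
+Rule: a pixel turns white only when `gray > threshold` → **threshold is the dividing line for "how bright counts as background"**.
 
-Based on these characteristics, the module uses **thresholding**: pixels whose gray value is above the threshold are set to white, keeping the dark body text.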
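+| Action | Whitening strength | Pixels painted white | Effect on watermark | Effect on body text / OCR |
+|------|----------|------------------|--------|----------------|
+| **Lower** threshold (e.g. 175→155) | **Stronger** | More mid-gray pixels (e.g. 156~175) also turn white | Removed more cleanly | Faint strokes and edges washed out by the watermark may be eaten away; with a cleaner background, det sometimes finds a whole line more easily |
+| **Raise** threshold (e.g. 155→175) | **Weaker** | Only brighter pixels turn white | Diagonal streaks and light-gray noise tend to remain | More strokes are preserved; leftover interference can cause fragmented det boxes and high-scoring but wrong short text |
+
+Mnemonic:
+
+- **threshold ↓** → removes light tones more aggressively → whiter background, **faint text is easily damaged**
+- **threshold ↑** → more conservative → **watermark tends to remain**, but dark text is safer
+
+Tuning advice (for a single cell, sweep with `cell_sweep.py` **on the `*_raw.png` original**; do not remove the watermark a second time from an already preprocessed debug image):
+
+1. Sweep **155~175** first, checking whether the OCR text is complete and the det boxes are stable.
+2. **Do not look only at the rec score**: when threshold is too high, a high-scoring but wrong short text can appear (e.g. only "折取款").
+3. Cell-level and page-level **threshold may differ** (presets: page=175, cell=155); write each into the YAML according to its own sweep results.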
 
 ### Watermark detection (`detect_watermark`)
 
@@ -232,79 +314,136 @@ def detect_watermark(image, midtone_low=100, midtone_high=220, ratio_threshold=0
     return diagonal_count >= 2
 ```
 
-### Watermark removal (`remove_watermark_from_image`)
+### Watermark removal API
+
+**Recommended (same call for page level and cell level):**
 
 ```python
-def remove_watermark_from_image(image, threshold=160, morph_close_kernel=0):
-    """
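-    Remove light diagonal text watermarks from an image.
-    
-    Principle:
-    - Body text is dark black (gray < threshold)
-    - The watermark is light gray (gray > threshold)
-    - Pixels above the threshold are set to white (255)
-    
-    Args:
-        threshold: gray threshold; 140-180 recommended, default 160
-        morph_close_kernel: morphological closing kernel size; 0 skips the step
-    """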
-    gray = to_grayscale(image)
-    
-    # Threshold: keep the dark body text
-    cleaned = gray.copy()
-    cleaned[gray > threshold] = 255
-    
-    # Optional: morphological closing to fill broken characters
-    if morph_close_kernel > 0:
-        kernel = np.ones((morph_close_kernel, morph_close_kernel), np.uint8)
-        cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)
-    
-    return cleaned
+from ocr_utils.watermark import WatermarkProcessor, merge_watermark_config
+
+processor = WatermarkProcessor.from_user_config(
+    {"enabled": True, "method": "threshold", "threshold": 155},
+    scope="cell",  # or "page"
+)
+cleaned_bgr, stages = processor.process(cell_bgr_image, force=True)
+# stages may include "wm", "contrast", etc.; used for the debug JSON
+```
+
+**Low level (compatible with legacy code):**
+
+```python
+from ocr_utils.watermark import merge_watermark_config, remove_watermark_from_image_rgb
+
+out = remove_watermark_from_image_rgb(
+    image,
+    threshold=175,
+    watermark_removal_cfg=merge_watermark_config("page", {"method": "threshold"}),
+)
 ```
 
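-### Parameters
+### Parameters (`method: threshold`)
 
-| Parameter | Default | Description | Tuning advice |
-|------|--------|------|---------|
-| `threshold` | 160 | Gray threshold | 140-180; larger is more conservative (watermark may remain) |
-| `morph_close_kernel` | 0 | Morphological kernel size | 0 recommended for non-binary images (closing is counterproductive) |
+| Parameter | page preset | cell preset | Description |
+|------|-----------|-----------|------|
+| `threshold` | 175 | 155 | See "raising / lowering" above; **larger is more conservative**, smaller whitens more |
+| `morph_close_kernel` | 0 | 0 | Closing kernel; **0 = off** (recommended; closing on non-binary images tends to add noise) |
+| `detect_before_remove` | true | false | Page level can detect before removing; cell level usually processes directly with `force=True` |
+| `contrast_enhancement` | on by default | off by default | `text_restore` after watermark removal; off by default at cell level, enable when needed |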
+
+---
+
+## Stage 3: cell-level second-pass OCR preprocessing
+
+### Flow
+
+```
+table image raw_crop
+  → WatermarkProcessor(cell)     # wm
+  → optional denoise / contrast  # YAML switches
+  → upscale (light.upscale_min_side, e.g. 192)
+  → det line splitting / whole-cell fallback OCR
+```
+
+Debug output (`tablecell_ocr/`):
+
+| File | Meaning |
+|------|------|
+| `cellNNN_*_*.png` | Preprocessed image fed to OCR |
+| `cellNNN_*_*_raw.png` | Original crop before watermark removal (input for `cell_sweep` tuning) |
+| `cellNNN_*_*.json` | Contains `preprocess_stages`, `debug_images`, `lines`/`whole`, etc. |
+
+### Parameter exploration tool
+
+```bash
+cd ocr_tools/cell_preprocess_lab
+
+# Single-cell sweep (automatically prefers *_raw.png)
+python cell_sweep.py /path/to/cell219_empty_empty_raw.png \
+  -o ./output/cell219_sweep -t "ATM存折取款"
+
+# Batch over a tablecell_ocr directory
+python cell_sweep.py /path/to/tablecell_ocr/ -o ./sweep_out --quick
+```
+
+The report `sweep_report.json` holds, for each combination, `text`, `score` (the weighted recognition score), and `boxes[]` (per-box scores).
 
 ---
 
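 ## Configuration
 
-### Full configuration example
+### Full configuration example (`bank_statement_yusys_local.yaml`)
 
 ```yaml
-# Input config - PDF-level watermark removal
 input:
-  dpi: 200
   txt_pdf_watermark_removal:
-    enabled: true        # whether to enable PDF-level watermark removal
-    sample_pages: 3      # number of pages in the quick pre-scan
+    enabled: true
+    sample_pages: 3
 
-# Preprocessing config - image-level watermark removal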
 preprocessor:
-  module: "mineru"
-  orientation_classifier:
-    enabled: true
+  order: orient_first
   watermark_removal:
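-    enabled: true           # whether to enable image-level watermark removal
-    threshold: 160          # gray threshold
-    morph_close_kernel: 0   # morphological kernel size (0 recommended)
+    enabled: false              # cell-first: page level can stay off
+    detect_before_remove: true
+    method: threshold
+    threshold: 175              # page preset default is 175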
+    contrast_enhancement:
+      enabled: false
+
+table_recognition_wired:
+  second_pass_ocr:
+    suspicious_short_min_chars: 4
+    cell_preprocess:
+      watermark:
+        enabled: true
+        method: threshold
+        threshold: 155          # better set explicitly; falls back to the cell preset (155) if omitted
+      denoise:
+        enabled: false
+      contrast:
+        enabled: false          # Pass1: optional text_restore
+      light:
+        upscale_min_side: 192
+    enhance_retry:
+      enabled: false            # Pass2 enhanced retry (sibling of cell_preprocess)
 ```
 
 ### Configuration reference
 
-| Config path | Type | Default | Description |
-|---------|------|--------|------|
-| `input.txt_pdf_watermark_removal.enabled` | bool | `false` | PDF-level watermark removal switch |
-| `input.txt_pdf_watermark_removal.sample_pages` | int | 3 | Number of pre-scan pages |
-| `preprocessor.watermark_removal.enabled` | bool | `false` | Image-level watermark removal switch |
-| `preprocessor.watermark_removal.threshold` | int | 160 | Gray threshold |
-| `preprocessor.watermark_removal.morph_close_kernel` | int | 0 | Morphological kernel size |
+| Config path | Description |
+|---------|------|
+| `input.txt_pdf_watermark_removal.*` | PDF XObject watermark removal |
+| `preprocessor.watermark_removal.*` | Page-level `WatermarkProcessor(scope=page)` |
+| `preprocessor.watermark_removal.method` | `threshold` \| `masked` \| `masked_adaptive` |
+| `preprocessor.watermark_removal.threshold` | `threshold` method only; see "raising / lowering" |
+| `second_pass_ocr.cell_preprocess.watermark.*` | Cell-level `WatermarkProcessor(scope=cell)` |
+| `second_pass_ocr.cell_preprocess.light.upscale_min_side` | Upscale of the shortest side after watermark removal |
+| `second_pass_ocr.enhance_retry` | Pass2 preprocessing (sibling of `cell_preprocess`, not a child of it) |
+
+**Notes:**
 
-**Note**: neither setting is on by default; each runs only when `enabled: true` is set explicitly in the YAML.
+- `morph_close_kernel` is already `0` in the preset, so it generally **does not need to be written into the YAML**.
+- The cell-level `threshold` **should be set explicitly after a sweep**; do not assume it matches the page level.
+- Processing runs only when `enabled: true`; the page-level and cell-level switches are independent.
 
 ---
 
@@ -337,26 +476,29 @@ if pdf_type == 'ocr':  # condition ①: scans only
 
 # mineru_adapter.py: MinerUPreprocessor.process()
 
-if config.get('watermark_removal', {}).get('enabled', False):  # condition ②
-    image = remove_watermark_from_image_rgb(image, threshold=160)
+processor = WatermarkProcessor.from_user_config(wm_cfg, scope="page")
+if processor.enabled:
+    image, _ = processor.process(image)  # runs detect_before_remove + method internally
 ```
 
 **Trigger conditions**:
 1. The PDF type is `ocr` (scan)
 2. `preprocessor.watermark_removal.enabled: true`
 
+**Cell-level second-pass OCR** (`text_filling.py`): when the table body triggers second-pass OCR, `_preprocess_cell_for_ocr` is called on `raw_crop` → `WatermarkProcessor(scope="cell")`.
+
 ---
 
-## Comparison of the two stages
+## Comparison across layers
 
-| Dimension | Stage 1 (PDF level) | Stage 2 (image level) |
-|------|------------------|-----------------|
-| **Target** | Text-based PDFs | Scans/images |
-| **Processing layer** | PDF XObject | Image pixels |
-| **Keeps text searchable** | ✅ Yes | ❌ No |
-| **Lossless** | ✅ Yes | ❌ No (pixels modified) |
-| **When** | Before rendering | After rendering, before detection |
-| **Dependencies** | PyMuPDF (fitz) | OpenCV, NumPy |
+| Dimension | PDF level | Page-level image | Cell-level image |
+|------|----------|----------|----------|
+| **Target** | XObjects of text-based PDFs | Full-page rendered image | Cell crop |
+| **Config** | `input.txt_pdf_*` | `preprocessor.watermark_removal` | `second_pass_ocr.cell_preprocess.watermark` |
+| **Default threshold preset** | - | 175 | 155 |
+| **Keeps PDF text layer** | ✅ | - | - |
+| **When** | Before rendering | Before layout/OCR | Before in-cell second-pass OCR |
+| **Dependencies** | PyMuPDF | OpenCV | OpenCV |
 
 ---
 
@@ -367,9 +509,9 @@ if config.get('watermark_removal', {}).get('enabled', False):  # condition ②
 ```python
 # pipeline_manager_v2.py
 
-from ocr_utils.watermark_utils import (
+from ocr_utils.watermark import (
     scan_pdf_watermark_xobjs,
-    remove_txt_pdf_watermark
+    remove_txt_pdf_watermark,
 )
 
 class EnhancedDocPipeline:
@@ -400,23 +542,29 @@ class EnhancedDocPipeline:
 ```python
 # models/adapters/mineru_adapter.py
 
-from ocr_utils.watermark_utils import remove_watermark_from_image_rgb
+from ocr_utils.watermark import WatermarkProcessor
 
 class MinerUPreprocessor:
     def process(self, image):
-        # Image-level watermark removal (before orientation correction)
-        if self.config.get('watermark_removal', {}).get('enabled', False):
-            threshold = self.config.get('watermark_removal', {}).get('threshold', 160)
-            image = remove_watermark_from_image_rgb(image, threshold=threshold)
-        
-        # Orientation correction
-        if self.orientation_classifier:
-            angle = self.orientation_classifier.predict(image)
-            image = self._apply_rotation(image, angle)
-        
+        wm_cfg = self.config.get("watermark_removal") or {}
+        processor = WatermarkProcessor.from_user_config(wm_cfg, scope="page")
+        if processor.enabled:
+            image, _ = processor.process(image)
+        # Orientation correction ...
         return image, angle
 ```
 
+### Cell-level second-pass OCR integration
+
+```python
+# models/adapters/wired_table/text_filling.py
+
+self._cell_wm_processor = WatermarkProcessor.from_user_config(wm_user, scope="cell")
+
+cell_img, stages = self._preprocess_cell_for_ocr(raw_crop, mode="light")
+# example stages: ["wm", "upscale"] or ["wm", "contrast", "upscale"]
+```
+
 ---
 
 ## Usage examples
@@ -475,17 +623,21 @@ python main_v2.py -i doc.pdf -c config.yaml --scene bank_statement --debug
 
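 ## Notes
 
-1. **The two stages are complementary**: stage 1 handles text-based PDFs and stage 2 handles scans, so in practice they never both run
-2. **Choosing the threshold**: `threshold=160` suits most bank statements; raise it moderately if light text is being removed
-3. **Morphology**: `morph_close_kernel=0` is the recommended value; closing on a non-binary image can introduce noise
-4. **Large files**: `sample_pages=3` gives a quick pre-scan and avoids full processing of large files without watermarks
-5. **Dependencies**: PDF-level removal needs `PyMuPDF`; image-level removal needs `OpenCV`
+1. **Three complementary layers**: the PDF, page, and cell levels can be switched independently; for bank statements **cell-first** is recommended (page-level wm off, cell-level wm on).
+2. **Grayscale direction**: **0 = black, 255 = white**; `gray > threshold → 255` means pixels "brighter than the threshold" are painted white.
+3. **Threshold direction**: **raising** it is more conservative (watermark tends to remain, text is less damaged); **lowering** it is more aggressive (cleaner background, faint strokes easily eaten away). Tune the page and cell levels separately.
+4. **Do not confuse it with the det threshold**: `ocr_recognition.det_threshold` filters OCR detection boxes and is unrelated to the watermark `threshold`.
+5. **Tuning input**: `cell_sweep.py` should use `*_raw.png` (the original crop); do not sweep the already preprocessed `cell*_empty_empty.png` (that removes the watermark twice and distorts the conclusions).
+6. **Morphology**: the preset has `morph_close_kernel=0`; enabling closing on non-binary images is not recommended.
+7. **Dependencies**: the PDF level needs `PyMuPDF`; the image levels need `OpenCV`.
 
 ---
 
 ## References
 
-- `ocr_utils/watermark_utils.py` - watermark utility implementation
-- `core/pipeline_manager_v2.py` - pipeline integration
-- `models/adapters/mineru_adapter.py` - preprocessor integration
-- `config/bank_statement_*.yaml` - configuration examples
+- `ocr_utils/watermark/` - implementation package (presets / removal / processor / pdf)
+- `ocr_utils/watermark_utils.py` - compatibility re-export
+- `ocr_tools/cell_preprocess_lab/cell_sweep.py` - cell-level parameter sweep
+- `models/adapters/mineru_adapter.py` - page-level preprocessing
+- `models/adapters/wired_table/text_filling.py` - cell-level second-pass OCR
+- `config/bank_statement_yusys_local.yaml` - scenario configuration example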
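
Reviewer sketch: the `threshold` rule documented in this file (`gray > threshold → 255`) can be checked on a handful of pixel values. The snippet below is a plain-Python stand-in for the NumPy one-liner quoted from `removal.py`, so it runs without OpenCV or NumPy. The helper name `whiten_above` and the four sample gray values are illustrative assumptions, not code or data from the repository.

```python
from typing import List

Row = List[int]


def whiten_above(gray_rows: List[Row], threshold: int) -> List[Row]:
    """Paint every pixel brighter than `threshold` white (255).

    Plain-Python equivalent of `cleaned[gray > threshold] = 255`,
    using the 0 = black, 255 = white convention.
    """
    return [[255 if px > threshold else px for px in row] for row in gray_rows]


# Hypothetical sample row: dark stroke, faint stroke, light watermark, paper.
sample = [[40, 160, 170, 240]]

page_result = whiten_above(sample, 175)  # page preset value from the doc
cell_result = whiten_above(sample, 155)  # cell preset value from the doc

print(page_result)
print(cell_result)
```

With 175 the mid-gray watermark pixel (170) survives; with 155 it is whitened, but so is the faint stroke at 160. That is the trade-off the "raising / lowering" table describes.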

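Reviewer sketch: the doc states that `merge_watermark_config(scope, user_cfg)` layers the user's YAML over per-scope presets. One way to picture that is a recursive dictionary merge. The `threshold` and `contrast_enhancement` defaults below come from the doc's preset table; the names `PRESETS_SKETCH` and `merge_cfg_sketch`, the dictionary layout, and the recursive-merge rule are assumptions and do not reproduce `presets.py`.

```python
import copy
from typing import Any, Dict

# Defaults taken from the doc's preset table; the dict layout is assumed.
PRESETS_SKETCH: Dict[str, Dict[str, Any]] = {
    "page": {"threshold": 175, "contrast_enhancement": {"enabled": True}},
    "cell": {"threshold": 155, "contrast_enhancement": {"enabled": False}},
}


def merge_cfg_sketch(scope: str, user_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the preset for `scope` with `user_cfg` layered on top."""
    merged = copy.deepcopy(PRESETS_SKETCH[scope])

    def _merge(base: Dict[str, Any], override: Dict[str, Any]) -> None:
        for key, value in override.items():
            if isinstance(value, dict) and isinstance(base.get(key), dict):
                _merge(base[key], value)  # merge nested sections key by key
            else:
                base[key] = value  # scalar or new key: the user value wins

    _merge(merged, user_cfg)
    return merged


cell_cfg = merge_cfg_sketch("cell", {"enabled": True})
page_cfg = merge_cfg_sketch(
    "page", {"threshold": 165, "contrast_enhancement": {"enabled": False}}
)
print(cell_cfg["threshold"], page_cfg["threshold"])
```

Under these assumptions a scenario YAML only needs the keys it wants to change, which matches the doc's advice to keep detailed parameters out of the YAML.
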
+ 0 - 194
ocr_tools/cell_preprocess_lab/cell121_sweep.py

@@ -1,194 +0,0 @@
-#!/usr/bin/env python3
-"""cell121 parameter sweep: watermark removal method / threshold / contrast / upscale / det threshold / whole-cell rec."""
-from __future__ import annotations
-
-import json
-import os
-import sys
-from itertools import product
-from pathlib import Path
-from typing import Any, Dict, List, Optional, Tuple
-
-import cv2
-import numpy as np
-
-_repo_root = Path(__file__).resolve().parents[2]
-if str(_repo_root) not in sys.path:
-    sys.path.insert(0, str(_repo_root))
-
-from ocr_utils.watermark import WatermarkProcessor, merge_watermark_config
-from ocr_utils.watermark.contrast import apply_contrast_enhancement_config
-
-CELL121 = Path(
-    "/Users/zhch158/workspace/data/流水分析/彭_广东兴宁农村商业银行/"
-    "bank_statement_yusys_local/debug/table_recognition_wired/tablecell_ocr/"
-    "彭_广东兴宁农村商业银行_page_002_0/cell121_empty_empty.png"
-)
-OUT_DIR = Path(__file__).parent / "output/彭_广东兴宁农村商业银行/cell121_sweep"
-MODEL_DIR = Path(
-    "/Users/zhch158/models/modelscope_cache/models/OpenDataLab/"
-    "PDF-Extract-Kit-1___0/models/OCR/paddleocr_torch"
-)
-
-TARGET = "20240927"
-
-
-def _upscale(img: np.ndarray, min_side: int) -> np.ndarray:
-    h, w = img.shape[:2]
-    if h >= min_side and w >= min_side:
-        return img
-    s = max(min_side / max(h, 1), min_side / max(w, 1), 1.0)
-    return cv2.resize(img, None, fx=s, fy=s, interpolation=cv2.INTER_CUBIC)
-
-
-def _preprocess(
-    raw: np.ndarray,
-    *,
-    method: str,
-    thresh: Optional[int],
-    contrast: bool,
-    upscale: int,
-) -> np.ndarray:
-    user: Dict[str, Any] = {"enabled": True, "method": method}
-    if method == "threshold" and thresh is not None:
-        user["threshold"] = thresh
-    cfg = merge_watermark_config("cell", user)
-    img, _ = WatermarkProcessor(cfg, scope="cell").process(raw, force=True)
-    if contrast:
-        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
-        ce = dict(cfg.get("contrast_enhancement") or {})
-        ce["enabled"] = True
-        ce["text_black_target"] = 88
-        gray = apply_contrast_enhancement_config(gray, ce)
-        img = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
-    return _upscale(img, upscale)
-
-
-def _ocr(engine: Any, img: np.ndarray, *, det: bool, rec: bool) -> Dict[str, Any]:
-    try:
-        res = engine.ocr(img, det=det, rec=rec)
-        texts: List[str] = []
-        if res and res[0]:
-            if det:
-                for item in res[0]:
-                    if item and len(item) >= 2 and item[1]:
-                        texts.append(str(item[1][0] or ""))
-            else:
-                for item in res[0]:
-                    if isinstance(item, (list, tuple)) and len(item) >= 1:
-                        texts.append(str(item[0] or ""))
-        text = "".join(texts).strip()
-        return {
-            "text": text,
-            "det": det,
-            "rec": rec,
-            "n_boxes": len(res[0]) if res and res[0] else 0,
-        }
-    except Exception as e:
-        return {"text": "", "error": str(e), "det": det, "rec": rec}
-
-
-def _make_engine(det_thresh: float) -> Any:
-    from ocr_tools.pytorch_models.pytorch_paddle import PytorchPaddleOCR
-
-    return PytorchPaddleOCR(
-        lang="ch",
-        det_model_path=str(MODEL_DIR / "ch_PP-OCRv5_det_infer.pth"),
-        rec_model_path=str(MODEL_DIR / "ch_PP-OCRv4_rec_server_doc_infer.pth"),
-        det_db_box_thresh=det_thresh,
-    )
-
-
-def main() -> None:
-    if not CELL121.is_file():
-        raise FileNotFoundError(CELL121)
-    raw = cv2.imread(str(CELL121))
-    OUT_DIR.mkdir(parents=True, exist_ok=True)
-
-    methods = ["threshold", "masked_adaptive"]
-    thresholds = [155, 165, 170, 175, 180, None]
-    contrasts = [False, True]
-    upscales = [64, 96, 128, 192]
-    det_threshs = [0.2, 0.3, 0.4, 0.5]
-    ocr_modes = [("det_rec", True, True), ("whole_rec", False, True)]
-
-    results: List[Dict[str, Any]] = []
-    hits: List[Dict[str, Any]] = []
-    engines: Dict[float, Any] = {}
-
-    total = 0
-    for method, thresh, contrast, upscale, det_th in product(
-        methods, thresholds, contrasts, upscales, det_threshs
-    ):
-        if method != "threshold" and thresh is not None:
-            continue
-        if det_th not in engines:
-            print(f"Loading OCR det_db_box_thresh={det_th} ...")
-            engines[det_th] = _make_engine(det_th)
-
-        img = _preprocess(
-            raw, method=method, thresh=thresh, contrast=contrast, upscale=upscale
-        )
-        tag = (
-            f"{method}_t{thresh or 'd'}_c{int(contrast)}_u{upscale}_det{det_th}"
-        )
-        cv2.imwrite(str(OUT_DIR / f"{tag}.png"), img)
-
-        for mode_name, det, rec in ocr_modes:
-            total += 1
-            ocr = _ocr(engines[det_th], img, det=det, rec=rec)
-            row = {
-                "tag": tag,
-                "method": method,
-                "threshold": thresh,
-                "contrast": contrast,
-                "upscale": upscale,
-                "det_db_box_thresh": det_th,
-                "ocr_mode": mode_name,
-                **ocr,
-            }
-            results.append(row)
-            t = row.get("text", "")
-            if TARGET in t or (len(t) >= 6 and t.isdigit()):
-                row["match"] = "full" if TARGET in t else "partial"
-                hits.append(row)
-                print(f"HIT [{row['match']}] {mode_name} {tag} -> {t!r}")
-
-    # Baseline on the original image
-    for det_th in [0.3, 0.5]:
-        if det_th not in engines:
-            engines[det_th] = _make_engine(det_th)
-        for mode_name, det, rec in ocr_modes:
-            ocr = _ocr(engines[det_th], _upscale(raw, 128), det=det, rec=rec)
-            row = {
-                "tag": "raw_upscale128",
-                "det_db_box_thresh": det_th,
-                "ocr_mode": mode_name,
-                **ocr,
-            }
-            results.append(row)
-            if TARGET in (row.get("text") or ""):
-                hits.append(row)
-
-    report = {
-        "input": str(CELL121),
-        "target": TARGET,
-        "total_trials": total,
-        "hits": hits,
-        "all_results": results,
-    }
-    out_json = OUT_DIR / "cell121_sweep_report.json"
-    out_json.write_text(json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8")
-
-    print(f"\nFinished {total} OCR trials, {len(hits)} hits")
-    print(f"Report: {out_json}")
-    if hits:
-        print("\nBest hits:")
-        for h in hits[:10]:
-            print(f"  {h.get('ocr_mode')} {h.get('tag')}: {h.get('text')!r}")
-    else:
-        print("No full match for 20240927; check cell121_sweep/*.png and the partial results in the report")
-
-
-if __name__ == "__main__":
-    main()
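
Reviewer sketch: the grid in the removed script pairs explicit `threshold` values only with the `threshold` method and lets other methods run once with the preset (`None`). The snippet below replays that skip rule with the lists hard-coded in `main()` to count the OCR calls involved. `count_trials` is a hypothetical helper, no OCR is run, and the raw-image baseline calls are not counted.

```python
from itertools import product
from typing import Optional, Sequence


def count_trials(
    methods: Sequence[str],
    thresholds: Sequence[Optional[int]],
    contrasts: Sequence[bool],
    upscales: Sequence[int],
    det_threshs: Sequence[float],
    n_ocr_modes: int = 2,
) -> int:
    """Count OCR calls using the same skip rule as the sweep loop."""
    combos = 0
    for method, thresh, _contrast, _upscale, _det in product(
        methods, thresholds, contrasts, upscales, det_threshs
    ):
        # Explicit thresholds only apply to the "threshold" method.
        if method != "threshold" and thresh is not None:
            continue
        combos += 1
    return combos * n_ocr_modes


total = count_trials(
    methods=["threshold", "masked_adaptive"],
    thresholds=[155, 165, 170, 175, 180, None],
    contrasts=[False, True],
    upscales=[64, 96, 128, 192],
    det_threshs=[0.2, 0.3, 0.4, 0.5],
)
print(total)
```

The full grid comes to 448 OCR calls for a single cell (224 preprocessing combinations times two OCR modes), which gives a sense of why a reduced grid is useful when sweeping a whole directory.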

+ 3 - 0
ocr_tools/cell_preprocess_lab/cell_preprocess_lab.py

@@ -8,6 +8,9 @@
     python cell_preprocess_lab.py cell219.png -o /tmp/cell_lab
     python cell_preprocess_lab.py /path/to/tablecell_ocr/ -o /tmp/batch --compare-methods
     python cell_preprocess_lab.py cell217.png -o /tmp/out --denoise --contrast
+
+For parameter grid sweeps, see cell_sweep.py:
+    python cell_sweep.py cell219_empty_empty_raw.png -o ./out -t "ATM存折取款"
 """
 from __future__ import annotations
 

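Reviewer sketch: the `score` reported by the `cell_sweep.py` added below is a mean of per-box recognition scores weighted by character count. The standalone snippet shows that weighting on made-up box values; `weighted_rec_score` is an illustrative name and the two boxes are hypothetical, not output from a real run.

```python
from typing import Any, Dict, List


def weighted_rec_score(boxes: List[Dict[str, Any]]) -> float:
    """Mean of box scores, each weighted by the length of the box text."""
    total_len = sum(len(b.get("text") or "") for b in boxes)
    if total_len <= 0:
        return 0.0  # no recognized characters: nothing to average
    weighted = sum(
        len(b.get("text") or "") * float(b.get("score") or 0.0) for b in boxes
    )
    return weighted / total_len


# Hypothetical det output: a short confident box and a longer, weaker one.
boxes = [
    {"text": "ATM", "score": 0.9},
    {"text": "存折取款", "score": 0.6},
]
score = weighted_rec_score(boxes)
print(round(score, 6))
```

Here the result is about 0.73 rather than the unweighted 0.75, because the four-character box counts for more than the three-character one. Even so, a short fragment recognized on its own can still score high, so the doc's advice to check the text as well as the score applies.
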
+ 554 - 0
ocr_tools/cell_preprocess_lab/cell_sweep.py

@@ -0,0 +1,554 @@
+#!/usr/bin/env python3
+"""
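+Parameter sweep over preprocessing of cell crops: watermark removal / threshold / contrast / upscale / det threshold / OCR mode.
+
+Starts from the **original image** (`*_raw.png`) by default, matching the pipeline's second-pass OCR and avoiding a second watermark removal on an already preprocessed debug image.
+
+Usage: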
+    python cell_sweep.py cell219_empty_empty_raw.png -o ./out -t "ATM存折取款"
+    python cell_sweep.py /path/to/tablecell_ocr/ -o ./out
+    python cell_sweep.py cell.png --quick --no-save-images
+    OCR_DET_MODEL_PATH=... OCR_REC_MODEL_PATH=... python cell_sweep.py cell.png
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import os
+import sys
+from itertools import product
+from pathlib import Path
+from typing import Any, Dict, Iterable, List, Optional, Sequence, Tuple
+
+import cv2
+import numpy as np
+
+_repo_root = Path(__file__).resolve().parents[2]
+if str(_repo_root) not in sys.path:
+    sys.path.insert(0, str(_repo_root))
+
+from ocr_utils.watermark import WatermarkProcessor, merge_watermark_config
+from ocr_utils.watermark.contrast import apply_contrast_enhancement_config
+
+_IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".webp"}
+_DEFAULT_MODEL_DIR = Path(
+    "/Users/zhch158/models/modelscope_cache/models/OpenDataLab/"
+    "PDF-Extract-Kit-1___0/models/OCR/paddleocr_torch"
+)
+
+
+def _parse_csv_ints(s: str) -> List[Optional[int]]:
+    out: List[Optional[int]] = []
+    for part in s.split(","):
+        part = part.strip()
+        if not part or part.lower() in ("none", "d", "default"):
+            out.append(None)
+        else:
+            out.append(int(part))
+    return out
+
+
+def _parse_csv_floats(s: str) -> List[float]:
+    return [float(x.strip()) for x in s.split(",") if x.strip()]
+
+
+def _parse_csv_bools(s: str) -> List[bool]:
+    out: List[bool] = []
+    for part in s.split(","):
+        p = part.strip().lower()
+        if p in ("1", "true", "yes", "on"):
+            out.append(True)
+        elif p in ("0", "false", "no", "off"):
+            out.append(False)
+        else:
+            raise ValueError(f"Invalid bool value: {part!r}")
+    return out
+
+
+def _default_model_dir() -> Path:
+    det = os.environ.get("OCR_DET_MODEL_PATH")
+    if det:
+        return Path(det).parent
+    return _DEFAULT_MODEL_DIR
+
+
+def _upscale(img: np.ndarray, min_side: int) -> np.ndarray:
+    h, w = img.shape[:2]
+    if h >= min_side and w >= min_side:
+        return img
+    s = max(min_side / max(h, 1), min_side / max(w, 1), 1.0)
+    return cv2.resize(img, None, fx=s, fy=s, interpolation=cv2.INTER_CUBIC)
+
+
+def _preprocess(
+    raw: np.ndarray,
+    *,
+    method: str,
+    thresh: Optional[int],
+    contrast: bool,
+    upscale: int,
+    text_black_target: int,
+) -> np.ndarray:
+    user: Dict[str, Any] = {"enabled": True, "method": method}
+    if method == "threshold" and thresh is not None:
+        user["threshold"] = thresh
+    cfg = merge_watermark_config("cell", user)
+    img, _ = WatermarkProcessor(cfg, scope="cell").process(raw, force=True)
+    if contrast:
+        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
+        ce = dict(cfg.get("contrast_enhancement") or {})
+        ce["enabled"] = True
+        ce["text_black_target"] = text_black_target
+        gray = apply_contrast_enhancement_config(gray, ce)
+        img = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)
+    return _upscale(img, upscale)
+
+
+def _parse_rec_pair(rec_part: Any) -> Tuple[str, float]:
+    """Parse the recognition result from an OCR (text, score) pair or a nested structure."""
+    if rec_part is None:
+        return "", 0.0
+    if isinstance(rec_part, (list, tuple)) and len(rec_part) >= 2:
+        if isinstance(rec_part[0], (list, tuple, dict)):
+            return "", 0.0
+        txt = str(rec_part[0] or "").strip()
+        try:
+            sc = float(rec_part[1] or 0.0)
+        except (TypeError, ValueError):
+            sc = 0.0
+        return txt, (sc if txt else 0.0)
+    if isinstance(rec_part, (list, tuple)) and len(rec_part) == 1:
+        txt = str(rec_part[0] or "").strip()
+        return txt, 0.0
+    return "", 0.0
+
+
+def _aggregate_rec_score(boxes: List[Dict[str, Any]]) -> float:
+    """Character-count-weighted mean of recognition scores (matches the pipeline's aggregate_line_ocr)."""
+    total_len = sum(len(b.get("text") or "") for b in boxes)
+    if total_len <= 0:
+        return 0.0
+    weighted = sum(
+        len(b.get("text") or "") * float(b.get("score") or 0.0) for b in boxes
+    )
+    return weighted / total_len
+
+
+def _ocr(engine: Any, img: np.ndarray, *, det: bool, rec: bool) -> Dict[str, Any]:
+    empty: Dict[str, Any] = {
+        "text": "",
+        "score": 0.0,
+        "boxes": [],
+        "det": det,
+        "rec": rec,
+        "n_boxes": 0,
+    }
+    try:
+        res = engine.ocr(img, det=det, rec=rec)
+        items = res[0] if res and res[0] is not None else []
+        boxes_out: List[Dict[str, Any]] = []
+
+        if det:
+            for item in items:
+                if not item or len(item) < 2:
+                    continue
+                text, score = _parse_rec_pair(item[1])
+                bbox = item[0]
+                if hasattr(bbox, "tolist"):
+                    bbox = bbox.tolist()
+                entry: Dict[str, Any] = {
+                    "text": text,
+                    "score": round(score, 6),
+                }
+                if bbox is not None:
+                    entry["det_bbox"] = bbox
+                boxes_out.append(entry)
+        else:
+            for item in items:
+                text, score = _parse_rec_pair(item)
+                if not text and isinstance(item, (list, tuple)) and len(item) >= 1:
+                    text, score = _parse_rec_pair(item[0])
+                boxes_out.append({"text": text, "score": round(score, 6)})
+
+        text = "".join(b["text"] for b in boxes_out if b.get("text")).strip()
+        agg_score = _aggregate_rec_score(boxes_out)
+        return {
+            "text": text,
+            "score": round(agg_score, 6),
+            "boxes": boxes_out,
+            "det": det,
+            "rec": rec,
+            "n_boxes": len(boxes_out),
+        }
+    except Exception as e:
+        out = dict(empty)
+        out["error"] = str(e)
+        return out
+
+
+def _make_engine(det_thresh: float, model_dir: Path) -> Any:
+    from ocr_tools.pytorch_models.pytorch_paddle import PytorchPaddleOCR
+
+    det_path = os.environ.get("OCR_DET_MODEL_PATH") or str(
+        model_dir / "ch_PP-OCRv5_det_infer.pth"
+    )
+    rec_path = os.environ.get("OCR_REC_MODEL_PATH") or str(
+        model_dir / "ch_PP-OCRv4_rec_server_doc_infer.pth"
+    )
+    return PytorchPaddleOCR(
+        lang="ch",
+        det_model_path=det_path,
+        rec_model_path=rec_path,
+        det_db_box_thresh=det_thresh,
+    )
+
+
+def resolve_input_image(path: Path, *, prefer_raw: bool) -> Path:
+    """Prefer the *_raw.png that accompanies the pipeline debug output."""
+    if not prefer_raw or path.stem.endswith("_raw"):
+        return path
+    raw_path = path.parent / f"{path.stem}_raw{path.suffix}"
+    if raw_path.is_file():
+        print(f"  Using original image: {raw_path.name} (skipping {path.name})")
+        return raw_path
+    return path
+
+
+def collect_inputs(path: Path, *, prefer_raw: bool) -> List[Path]:
+    if path.is_file():
+        if path.suffix.lower() not in _IMAGE_SUFFIXES:
+            raise ValueError(f"Unsupported image format: {path}")
+        return [resolve_input_image(path, prefer_raw=prefer_raw)]
+
+    if not path.is_dir():
+        raise FileNotFoundError(path)
+
+    all_images = sorted(
+        p
+        for p in path.iterdir()
+        if p.is_file() and p.suffix.lower() in _IMAGE_SUFFIXES
+    )
+    if not all_images:
+        raise FileNotFoundError(f"No images in directory: {path}")
+
+    if prefer_raw:
+        raws = [p for p in all_images if p.stem.endswith("_raw")]
+        if raws:
+            return raws
+
+    chosen: List[Path] = []
+    for p in all_images:
+        if p.stem.endswith("_raw"):
+            continue
+        raw_sibling = p.parent / f"{p.stem}_raw{p.suffix}"
+        if prefer_raw and raw_sibling.is_file():
+            continue
+        chosen.append(p)
+    return chosen or all_images
+
+
+def _match_hit(text: str, target: Optional[str]) -> Optional[str]:
+    if not text:
+        return None
+    if not target:
+        return "nonempty"
+    if target in text:
+        return "full"
+    if len(target) >= 6 and target.isdigit() and len(text) >= 6 and text.isdigit():
+        return "partial"
+    return None
+
+
+def run_sweep(
+    input_path: Path,
+    out_dir: Path,
+    *,
+    prefer_raw: bool,
+    target: Optional[str],
+    model_dir: Path,
+    methods: Sequence[str],
+    thresholds: Sequence[Optional[int]],
+    contrasts: Sequence[bool],
+    upscales: Sequence[int],
+    det_threshs: Sequence[float],
+    text_black_target: int,
+    save_images: bool,
+    run_baseline: bool,
+    baseline_upscale: int,
+) -> Dict[str, Any]:
+    resolved = resolve_input_image(input_path, prefer_raw=prefer_raw)
+    raw = cv2.imread(str(resolved))
+    if raw is None:
+        raise RuntimeError(f"Cannot read image: {resolved}")
+
+    stem = resolved.stem.removesuffix("_raw")
+    cell_out = out_dir / stem
+    cell_out.mkdir(parents=True, exist_ok=True)
+
+    ocr_modes: List[Tuple[str, bool, bool]] = [
+        ("det_rec", True, True),
+        ("whole_rec", False, True),
+    ]
+
+    results: List[Dict[str, Any]] = []
+    hits: List[Dict[str, Any]] = []
+    engines: Dict[float, Any] = {}
+    total = 0
+
+    for method, thresh, contrast, upscale, det_th in product(
+        methods, thresholds, contrasts, upscales, det_threshs
+    ):
+        if method != "threshold" and thresh is not None:
+            continue
+        if det_th not in engines:
+            print(f"  [{stem}] 加载 OCR det_db_box_thresh={det_th} ...")
+            engines[det_th] = _make_engine(det_th, model_dir)
+
+        img = _preprocess(
+            raw,
+            method=method,
+            thresh=thresh,
+            contrast=contrast,
+            upscale=upscale,
+            text_black_target=text_black_target,
+        )
+        tag = f"{method}_t{thresh or 'd'}_c{int(contrast)}_u{upscale}_det{det_th}"
+        if save_images:
+            cv2.imwrite(str(cell_out / f"{tag}.png"), img)
+
+        for mode_name, det, rec in ocr_modes:
+            total += 1
+            ocr = _ocr(engines[det_th], img, det=det, rec=rec)
+            row: Dict[str, Any] = {
+                "tag": tag,
+                "method": method,
+                "threshold": thresh,
+                "contrast": contrast,
+                "upscale": upscale,
+                "det_db_box_thresh": det_th,
+                "ocr_mode": mode_name,
+                **ocr,
+            }
+            results.append(row)
+            m = _match_hit(row.get("text", ""), target)
+            if m:
+                row["match"] = m
+                hits.append(row)
+                print(
+                    f"  HIT [{m}] {mode_name} {tag} "
+                    f"score={row.get('score')} -> {row.get('text')!r}"
+                )
+
+    if run_baseline:
+        for det_th in det_threshs:
+            if det_th not in engines:
+                engines[det_th] = _make_engine(det_th, model_dir)
+            base_img = _upscale(raw, baseline_upscale)
+            if save_images:
+                cv2.imwrite(str(cell_out / f"baseline_upscale{baseline_upscale}.png"), base_img)
+            for mode_name, det, rec in ocr_modes:
+                ocr = _ocr(engines[det_th], base_img, det=det, rec=rec)
+                row = {
+                    "tag": f"baseline_upscale{baseline_upscale}",
+                    "det_db_box_thresh": det_th,
+                    "ocr_mode": mode_name,
+                    **ocr,
+                }
+                results.append(row)
+                m = _match_hit(row.get("text", ""), target)
+                if m:
+                    row["match"] = m
+                    hits.append(row)
+
+    report = {
+        "input": str(resolved),
+        "input_requested": str(input_path),
+        "output_dir": str(cell_out),
+        "target": target,
+        "total_trials": total,
+        "hits": hits,
+        "all_results": results,
+    }
+    report_path = cell_out / "sweep_report.json"
+    report_path.write_text(
+        json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8"
+    )
+    return report
+
+
+def _build_arg_parser() -> argparse.ArgumentParser:
+    p = argparse.ArgumentParser(
+        description="单元格图预处理 + OCR 参数网格扫描(对齐 pipeline 格级二次 OCR)",
+    )
+    p.add_argument(
+        "input",
+        type=Path,
+        help="单元格裁剪图路径,或 tablecell_ocr 目录(批量扫描)",
+    )
+    p.add_argument(
+        "-o",
+        "--output",
+        type=Path,
+        default=None,
+        help="输出目录,默认 <input_dir|input_parent>/sweep_out/<stem>",
+    )
+    p.add_argument(
+        "-t",
+        "--target",
+        default=None,
+        help="期望 OCR 文本;用于标记 HIT(子串匹配)。省略则任意非空为 HIT",
+    )
+    p.add_argument(
+        "--model-dir",
+        type=Path,
+        default=None,
+        help="PaddleOCR torch 模型目录(含 det/rec .pth),也可用 OCR_*_MODEL_PATH",
+    )
+    p.add_argument(
+        "--no-prefer-raw",
+        action="store_true",
+        help="不自动选用同名的 *_raw.png",
+    )
+    p.add_argument(
+        "--quick",
+        action="store_true",
+        help="缩小网格(threshold 170,175 × upscale 128,192 × det 0.3,0.5)",
+    )
+    p.add_argument(
+        "--methods",
+        default="threshold,masked_adaptive",
+        help="去水印方式,逗号分隔",
+    )
+    p.add_argument(
+        "--thresholds",
+        default="155,165,170,175,180,none",
+        help="threshold 法的阈值;none=预设默认",
+    )
+    p.add_argument(
+        "--contrasts",
+        default="false,true",
+        help="是否 contrast,逗号分隔 false,true",
+    )
+    p.add_argument(
+        "--upscales",
+        default="64,96,128,192",
+        help="最短边放大目标,逗号分隔整数",
+    )
+    p.add_argument(
+        "--det-threshs",
+        default="0.2,0.3,0.4,0.5",
+        help="det_db_box_thresh,逗号分隔",
+    )
+    p.add_argument(
+        "--text-black-target",
+        type=int,
+        default=88,
+        help="contrast text_restore 目标黑度",
+    )
+    p.add_argument(
+        "--no-save-images",
+        action="store_true",
+        help="不写出中间预处理 png(仅报告)",
+    )
+    p.add_argument(
+        "--no-baseline",
+        action="store_true",
+        help="跳过「仅放大、不去水印」对照组",
+    )
+    p.add_argument(
+        "--baseline-upscale",
+        type=int,
+        default=128,
+        help="baseline 对照组的最短边放大",
+    )
+    return p
+
+
+def main(argv: Optional[Sequence[str]] = None) -> None:
+    args = _build_arg_parser().parse_args(argv)
+    inputs = collect_inputs(args.input, prefer_raw=not args.no_prefer_raw)
+    if not inputs:
+        raise SystemExit("未找到可扫描的图像")
+
+    if args.output is not None:
+        out_root = args.output
+    elif args.input.is_file():
+        out_root = args.input.parent / "sweep_out"
+    else:
+        out_root = args.input / "sweep_out"
+    out_root.mkdir(parents=True, exist_ok=True)
+
+    model_dir = args.model_dir or _default_model_dir()
+    methods = [m.strip() for m in args.methods.split(",") if m.strip()]
+
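+    if args.quick:
+        # None keeps non-threshold methods (e.g. masked_adaptive) in the grid:
+        # run_sweep skips them whenever an explicit threshold is set.
+        thresholds = [170, 175, None]
+        upscales = [128, 192]
+        det_threshs = [0.3, 0.5]
+        contrasts = [False, True]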
+    else:
+        thresholds = _parse_csv_ints(args.thresholds)
+        upscales = [int(x) for x in args.upscales.split(",") if x.strip()]
+        det_threshs = _parse_csv_floats(args.det_threshs)
+        contrasts = _parse_csv_bools(args.contrasts)
+
+    print(f"扫描 {len(inputs)} 张图 -> {out_root}")
+    print(f"  methods={methods} thresholds={thresholds} upscales={upscales}")
+    if args.target:
+        print(f"  target={args.target!r}")
+
+    summary: List[Dict[str, Any]] = []
+    for img_path in inputs:
+        print(f"\n=== {img_path.name} ===")
+        report = run_sweep(
+            img_path,
+            out_root,
+            prefer_raw=not args.no_prefer_raw,
+            target=args.target,
+            model_dir=model_dir,
+            methods=methods,
+            thresholds=thresholds,
+            contrasts=contrasts,
+            upscales=upscales,
+            det_threshs=det_threshs,
+            text_black_target=args.text_black_target,
+            save_images=not args.no_save_images,
+            run_baseline=not args.no_baseline,
+            baseline_upscale=args.baseline_upscale,
+        )
+        summary.append(
+            {
+                "input": report["input"],
+                "hits": len(report["hits"]),
+                "report": str(Path(report["output_dir"]) / "sweep_report.json"),
+            }
+        )
+
+    index_path = out_root / "sweep_index.json"
+    index_path.write_text(
+        json.dumps(summary, ensure_ascii=False, indent=2), encoding="utf-8"
+    )
+    print(f"\n全部完成,索引: {index_path}")
+    for s in summary:
+        print(f"  {s['input']}: {s['hits']} hits -> {s['report']}")
+
+
+if __name__ == "__main__":
+    if len(sys.argv) == 1:
+        print("ℹ️  未提供命令行参数,使用默认配置运行...")
+        default_config = {
+            "input": "/Users/zhch158/workspace/data/流水分析/彭_广东兴宁农村商业银行/bank_statement_yusys_local/debug/table_recognition_wired/tablecell_ocr/彭_广东兴宁农村商业银行_page_002_0/cell219_empty_empty_raw.png",
+            "output": "./output/彭_广东兴宁农村商业银行/cell219_sweep",
+            "target": "ATM存折取款",
+        }
+        sys.argv = [sys.argv[0], default_config["input"]]
+        for key, value in default_config.items():
+            if key == "input":
+                continue
+            flag = f"--{key.replace('_', '-')}"
+            if isinstance(value, bool) and value:
+                sys.argv.append(flag)
+            elif not isinstance(value, bool):
+                sys.argv.extend([flag, str(value)])
+
+    sys.exit(main())

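The grid enumeration in `run_sweep` can be sketched in isolation. This is a minimal sketch under stated assumptions, not the project's code: `enumerate_trials` is a hypothetical helper that mirrors only the `product(...)` loop and its skip rule, to show why a non-threshold method such as `masked_adaptive` only runs when the threshold list contains a `None` entry.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Trial = Tuple[str, Optional[int], bool, int, float]


def enumerate_trials(
    methods: Sequence[str],
    thresholds: Sequence[Optional[int]],
    contrasts: Sequence[bool],
    upscales: Sequence[int],
    det_threshs: Sequence[float],
) -> List[Trial]:
    """Hypothetical helper: list the preprocessing combinations a sweep would run."""
    trials: List[Trial] = []
    for method, thresh, contrast, upscale, det_th in product(
        methods, thresholds, contrasts, upscales, det_threshs
    ):
        # Same skip rule as run_sweep: an explicit threshold only applies to "threshold".
        if method != "threshold" and thresh is not None:
            continue
        trials.append((method, thresh, contrast, upscale, det_th))
    return trials


methods = ["threshold", "masked_adaptive"]
with_none = enumerate_trials(methods, [170, None], [False], [128], [0.3])
without_none = enumerate_trials(methods, [170], [False], [128], [0.3])
print(len(with_none), len(without_none))
```

Each surviving combination is then evaluated under both OCR modes (`det_rec` and `whole_rec`), so the number of OCR calls is twice the number of trials listed here.
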
+ 439 - 0
ocr_tools/gan_experiments_lab/evaluate.py

@@ -0,0 +1,439 @@
+"""
+Watermark-removal evaluation script: compares the baseline (masked_adaptive) with the LaMa GAN method.
+
+Usage:
+    cd ocr_platform/ocr_tools/gan_experiments_lab
+
+    # Compare all images under test_images/input/
+    python evaluate.py
+
+    # Specify the input/output directories
+    python evaluate.py --input ./test_images/synthetic/ --output ./output/synthetic_compare
+
+    # Compute PSNR/SSIM when clean reference images are available
+    python evaluate.py --input ./test_images/synthetic/ --clean-dir ./test_images/clean/
+
+Outputs:
+    output/compare/     - side-by-side comparison (original | baseline | GAN, plus the mask overlay when a mask is found)
+    output/inpainted/   - baseline and GAN results
+    output/mask_debug/  - mask visualization
+    output/metrics/     - evaluation metrics JSON
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+import time
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
+
+import cv2
+import numpy as np
+
+# 将 ocr_platform 根目录加入 sys.path,以便导入 ocr_utils
+_repo_root = Path(__file__).parents[2]
+if str(_repo_root) not in sys.path:
+    sys.path.insert(0, str(_repo_root))
+
+from loguru import logger
+from PIL import Image
+
+from ocr_utils.watermark import (
+    WatermarkProcessor,
+    build_watermark_mask,
+    detect_watermark,
+    merge_watermark_config,
+    render_watermark_mask_overlay,
+)
+from lama_inpaint import LamaInpainter
+
+# ── 评估指标 ────────────────────────────────────────────────────
+
+
+def _to_gray(img: np.ndarray) -> np.ndarray:
+    if img.ndim == 3:
+        return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float64)
+    return img.astype(np.float64)
+
+
+def compute_psnr(img1: np.ndarray, img2: np.ndarray) -> float:
+    g1, g2 = _to_gray(img1), _to_gray(img2)
+    mse = np.mean((g1 - g2) ** 2)
+    if mse < 1e-10:
+        return 100.0
+    return float(20 * np.log10(255.0 / np.sqrt(mse)))
+
+
+def compute_ssim(img1: np.ndarray, img2: np.ndarray) -> float:
+    """Simplified SSIM on grayscale images (11x11 Gaussian window, sigma=1.5)."""
+
+    g1, g2 = _to_gray(img1), _to_gray(img2)
+    k1, k2 = 0.01, 0.03
+    l = 255.0
+    c1, c2 = (k1 * l) ** 2, (k2 * l) ** 2
+
+    kernel = cv2.getGaussianKernel(11, 1.5)
+    window = np.outer(kernel, kernel)
+    window /= window.sum()
+
+    mu1 = cv2.filter2D(g1, -1, window, borderType=cv2.BORDER_REFLECT)
+    mu2 = cv2.filter2D(g2, -1, window, borderType=cv2.BORDER_REFLECT)
+    mu1_sq = mu1 * mu1
+    mu2_sq = mu2 * mu2
+    mu1_mu2 = mu1 * mu2
+    sigma1_sq = cv2.filter2D(g1 * g1, -1, window, borderType=cv2.BORDER_REFLECT) - mu1_sq
+    sigma2_sq = cv2.filter2D(g2 * g2, -1, window, borderType=cv2.BORDER_REFLECT) - mu2_sq
+    sigma12 = cv2.filter2D(g1 * g2, -1, window, borderType=cv2.BORDER_REFLECT) - mu1_mu2
+
+    num = (2 * mu1_mu2 + c1) * (2 * sigma12 + c2)
+    denom = (mu1_sq + mu2_sq + c1) * (sigma1_sq + sigma2_sq + c2)
+    ssim_map = num / (denom + 1e-10)
+    return float(ssim_map.mean())
+
+
+# ── 水印配置 ────────────────────────────────────────────────────
+
+
+def _baseline_config() -> Dict[str, Any]:
+    return merge_watermark_config("page", {
+        "method": "masked_adaptive",
+        "threshold": 175,
+        "contrast_enhancement": {"enabled": True, "method": "text_restore", "text_black_target": 85},
+    })
+
+
+def _gan_wm_config() -> Dict[str, Any]:
+    return merge_watermark_config("page", {"method": "masked_adaptive", "threshold": 175})
+
+
+# ── 单图处理 ────────────────────────────────────────────────────
+
+
+def _load_image(path: Path) -> np.ndarray:
+    """加载图片为 BGR ndarray。"""
+    pil = Image.open(str(path)).convert("RGB")
+    np_img = np.array(pil)
+    return cv2.cvtColor(np_img, cv2.COLOR_RGB2BGR)
+
+
+def _run_baseline(bgr: np.ndarray, cfg: Dict[str, Any]) -> Tuple[np.ndarray, Dict[str, Any]]:
+    """运行 masked_adaptive 方法。"""
+    proc = WatermarkProcessor(cfg, scope="page")
+    debug: Dict[str, Any] = {}
+    result, stages = proc.process(bgr, apply_removal=True, removal_debug=debug)
+    return np.asarray(result), debug
+
+
+def _run_gan(
+    bgr: np.ndarray,
+    wm_cfg: Dict[str, Any],
+    inpainter: LamaInpainter,
+) -> Tuple[np.ndarray, Dict[str, Any]]:
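+    """
+    Repair the watermark region with the GAN.
+
+    1. Detect the watermark region with build_watermark_mask
+    2. Inpaint it with LaMa
+    3. Fall back to the baseline if no mask is found or inpainting fails
+    """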
+    debug: Dict[str, Any] = {"mode": "gan"}
+
+    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
+    mask_cfg = wm_cfg.get("mask", {})
+    wm_mask, mask_debug = build_watermark_mask(gray, bgr=bgr, **mask_cfg)
+
+    debug.update({k: v for k, v in mask_debug.items()
+                  if not isinstance(v, np.ndarray)})
+    debug["wm_mask"] = wm_mask
+
+    if not np.any(wm_mask):
+        logger.info("  未检测到水印区域,跳过GAN")
+        debug["mode"] = "gan_no_mask"
+        clean_gray, _ = _run_baseline(bgr, wm_cfg)
+        return clean_gray, debug
+
+    logger.info(f"  水印区域: {wm_mask.sum()} 像素 "
+                f"({100 * wm_mask.sum() / wm_mask.size:.2f}%)")
+
+    t0 = time.perf_counter()
+    result = inpainter.inpaint(bgr, wm_mask)
+    elapsed = time.perf_counter() - t0
+
+    if result is not None:
+        debug["mode"] = "gan"
+        debug["gan_success"] = True
+        debug["gan_inference_time_s"] = round(elapsed, 2)
+        logger.info(f"  GAN修复成功 ({elapsed:.1f}s)")
+        # 对修复结果做对比度增强
+        from ocr_utils.watermark.contrast import apply_contrast_enhancement_config
+        ce_cfg = wm_cfg.get("contrast_enhancement")
+        result_gray = cv2.cvtColor(result, cv2.COLOR_BGR2GRAY)
+        result_gray = apply_contrast_enhancement_config(result_gray, ce_cfg)
+        return result_gray, debug
+
+    # GAN 失败,回退
+    logger.warning("  GAN修复失败,回退 baseline")
+    debug["mode"] = "gan_fallback"
+    debug["fallback_reason"] = "gan_inference_failed"
+    clean_gray, fallback_debug = _run_baseline(bgr, wm_cfg)
+    debug["fallback_debug"] = fallback_debug
+    return clean_gray, debug
+
+
+# ── 输出 ──────────────────────────────────────────────────────────
+
+
+def _make_compare_image(
+    bgr: np.ndarray,
+    baseline_gray: np.ndarray,
+    gan_result: np.ndarray,
+    wm_mask: Optional[np.ndarray] = None,
+) -> np.ndarray:
+    """Build a three-panel comparison image (four panels when a watermark mask is present)."""
+
+    def _to_bgr(arr: np.ndarray) -> np.ndarray:
+        if arr.ndim == 2:
+            return cv2.cvtColor(arr, cv2.COLOR_GRAY2BGR)
+        return arr
+
+    def _resize(arr: np.ndarray, target_h: int, target_w: int) -> np.ndarray:
+        if arr.shape[0] != target_h or arr.shape[1] != target_w:
+            return cv2.resize(arr, (target_w, target_h))
+        return arr
+
+    # The GAN result may be grayscale or BGR; _to_bgr leaves BGR input untouched
+    gan_bgr = _to_bgr(gan_result)
+
+    # 统一尺寸
+    ref_h, ref_w = bgr.shape[:2]
+    baseline_bgr = _to_bgr(baseline_gray) if baseline_gray.ndim == 2 else baseline_gray
+    baseline_bgr = _resize(baseline_bgr, ref_h, ref_w)
+    gan_bgr = _resize(gan_bgr, ref_h, ref_w)
+
+    panels = [bgr, baseline_bgr, gan_bgr]
+
+    # 如果有mask,叠加到原图上作为第四联
+    if wm_mask is not None and np.any(wm_mask):
+        mask_overlay = render_watermark_mask_overlay(bgr, wm_mask)
+        panels.append(mask_overlay)
+
+    # 添加标签
+    labels = ["Original", "Baseline (masked_adaptive)", "GAN (LaMa)"]
+    if len(panels) == 4:
+        labels.append("Watermark Mask")
+
+    labeled = []
+    for panel, label in zip(panels, labels):
+        # Append a label bar below the panel
+        bar = np.ones((36, panel.shape[1], 3), dtype=np.uint8) * 240
+        cv2.putText(bar, label, (12, 24), cv2.FONT_HERSHEY_SIMPLEX, 0.55, (0, 0, 0), 1)
+        labeled.append(np.vstack([panel, bar]))
+
+    # 水平拼接
+    max_h = max(p.shape[0] for p in labeled)
+    for i in range(len(labeled)):
+        if labeled[i].shape[0] < max_h:
+            pad = np.ones((max_h - labeled[i].shape[0], labeled[i].shape[1], 3), dtype=np.uint8) * 255
+            labeled[i] = np.vstack([labeled[i], pad])
+
+    return np.hstack(labeled)
+
+
+def _save_result(
+    stem: str,
+    result: np.ndarray,
+    output_dir: Path,
+    prefix: str = "",
+) -> Path:
+    """Save a result image as <stem>_<prefix>.png (cv2.imwrite handles grayscale and BGR alike)."""
+    p = output_dir / f"{stem}_{prefix}.png"
+    cv2.imwrite(str(p), result)
+    return p
+
+
+def _save_metrics_json(
+    metrics_list: List[Dict[str, Any]],
+    output_dir: Path,
+) -> None:
+    output_dir.mkdir(parents=True, exist_ok=True)
+    p = output_dir / "metrics.json"
+    p.write_text(json.dumps(metrics_list, ensure_ascii=False, indent=2), encoding="utf-8")
+    logger.info(f"评估指标: {p}")
+
+
+# ── 主函数 ────────────────────────────────────────────────────────
+
+
+def evaluate(
+    input_dir: Path,
+    output_root: Path,
+    *,
+    clean_dir: Optional[Path] = None,
+    device: str = "cpu",
+    gan_only: bool = False,
+) -> None:
+    """批量评估。"""
+    img_files = sorted([
+        f for f in input_dir.iterdir()
+        if f.suffix.lower() in {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".webp"}
+    ])
+    if not img_files:
+        logger.error(f"{input_dir} 下没有图片文件")
+        return
+
+    # 输出目录
+    out_compare = output_root / "compare"
+    out_inpainted = output_root / "inpainted"
+    out_mask = output_root / "mask_debug"
+    out_metrics = output_root / "metrics"
+    for d in [out_compare, out_inpainted, out_mask, out_metrics]:
+        d.mkdir(parents=True, exist_ok=True)
+
+    baseline_cfg = _baseline_config()
+    wm_cfg = _gan_wm_config()
+
+    inpainter = LamaInpainter(device=device)
+    available = inpainter.is_available
+    logger.info(f"LaMa 可用: {available}, backend: {inpainter._backend or '未加载'}")
+
+    if not available and not gan_only:
+        logger.warning("LaMa backend 不可用,GAN将回退到OpenCV inpaint")
+
+    all_metrics: List[Dict[str, Any]] = []
+
+    for f in img_files:
+        logger.info(f"\n处理: {f.name}")
+        stem = f.stem
+        bgr = _load_image(f)
+
+        # 检查是否有对应 clean 参考图
+        clean_img: Optional[np.ndarray] = None
+        if clean_dir:
+            for ext in (".png", ".jpg", ".jpeg"):
+                clean_path = clean_dir / f"{stem}{ext}"
+                if clean_path.exists():
+                    clean_img = _load_image(clean_path)
+                    break
+            if clean_img is None:
+                # 尝试移除 _watermarked 后缀
+                clean_name = stem.replace("_watermarked", "")
+                for ext in (".png", ".jpg", ".jpeg"):
+                    clean_path = clean_dir / f"{clean_name}{ext}"
+                    if clean_path.exists():
+                        clean_img = _load_image(clean_path)
+                        break
+
+        # 检测水印
+        gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
+        has_wm = detect_watermark(gray, ratio_threshold=0.025)
+        logger.info(f"  水印检测: {'有水印' if has_wm else '无水印'}")
+
+        # ── Baseline ──
+        logger.info("  运行 baseline (masked_adaptive)...")
+        t0 = time.perf_counter()
+        baseline_result, baseline_debug = _run_baseline(bgr, baseline_cfg)
+        baseline_time = time.perf_counter() - t0
+        logger.info(f"  baseline 耗时: {baseline_time:.1f}s")
+
+        # ── GAN ──
+        t0 = time.perf_counter()
+        gan_result, gan_debug = _run_gan(bgr, wm_cfg, inpainter)
+        gan_time = time.perf_counter() - t0
+
+        # ── 保存结果 ──
+        _save_result(stem, baseline_result, out_inpainted, "baseline")
+        if gan_result.ndim == 2:
+            gan_save_bgr = cv2.cvtColor(gan_result, cv2.COLOR_GRAY2BGR)
+        else:
+            gan_save_bgr = gan_result
+        _save_result(stem, gan_save_bgr, out_inpainted, "gan")
+
+        # ── 掩膜可视化 ──
+        wm_mask = gan_debug.get("wm_mask")
+        if wm_mask is not None and np.any(wm_mask):
+            mask_overlay = render_watermark_mask_overlay(bgr, wm_mask)
+            _save_result(stem, mask_overlay, out_mask, "mask_overlay")
+
+        # ── 对比图 ──
+        compare_img = _make_compare_image(bgr, baseline_result, gan_save_bgr, wm_mask)
+        _save_result(stem, compare_img, out_compare, "compare")
+
+        # ── 评估指标 ──
+        metrics: Dict[str, Any] = {
+            "file": f.name,
+            "has_watermark": has_wm,
+            "baseline_time_s": round(baseline_time, 2),
+            "gan_time_s": round(gan_time, 2),
+            "gan_mode": gan_debug.get("mode", "unknown"),
+        }
+        if clean_img is not None:
+            # baseline vs clean
+            metrics["baseline_psnr"] = round(compute_psnr(baseline_result, clean_img), 2)
+            metrics["baseline_ssim"] = round(compute_ssim(baseline_result, clean_img), 4)
+            # gan vs clean
+            metrics["gan_psnr"] = round(compute_psnr(gan_save_bgr, clean_img), 2)
+            metrics["gan_ssim"] = round(compute_ssim(gan_save_bgr, clean_img), 4)
+            logger.info(
+                f"  PSNR: baseline={metrics['baseline_psnr']}dB, "
+                f"GAN={metrics['gan_psnr']}dB"
+            )
+        all_metrics.append(metrics)
+
+    _save_metrics_json(all_metrics, out_metrics)
+
+    # 汇总
+    logger.info(f"\n{'='*50}")
+    logger.info(f"评估完成,共 {len(img_files)} 张图")
+    logger.info(f"  对比图:   {out_compare}")
+    logger.info(f"  修复结果: {out_inpainted}")
+    logger.info(f"  掩膜:     {out_mask}")
+    logger.info(f"  指标:     {out_metrics}")
+
+    # Average only over images that had a clean reference (and therefore PSNR values)
+    scored = [m for m in all_metrics if "baseline_psnr" in m]
+    if scored:
+        avg_baseline_psnr = np.mean([m["baseline_psnr"] for m in scored])
+        avg_gan_psnr = np.mean([m["gan_psnr"] for m in scored])
+        logger.info(f"  Mean PSNR: baseline={avg_baseline_psnr:.1f}dB, GAN={avg_gan_psnr:.1f}dB")
+
+
+def main():
+    root = Path(__file__).parent
+
+    parser = argparse.ArgumentParser(
+        description="去水印评估:baseline vs GAN",
+        formatter_class=argparse.RawDescriptionHelpFormatter,
+        epilog=__doc__,
+    )
+    parser.add_argument("--input", type=Path, default=root / "test_images" / "input",
+                        help="输入图片目录")
+    parser.add_argument("--output", type=Path, default=root / "output",
+                        help="输出根目录")
+    parser.add_argument("--clean-dir", type=Path, default=None,
+                        help="clean参考图目录(用于计算PSNR/SSIM)")
+    parser.add_argument("--device", type=str, default="cpu",
+                        choices=["cpu", "cuda", "mps"],
+                        help="推理设备")
+    parser.add_argument("--gan-only", action="store_true",
+                        help="仅运行GAN(跳过baseline)")
+    args = parser.parse_args()
+
+    evaluate(
+        args.input,
+        args.output,
+        clean_dir=args.clean_dir,
+        device=args.device,
+        gan_only=args.gan_only,
+    )
+
+
+if __name__ == "__main__":
+    main()

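The final summary in `evaluate` averages PSNR across images, but only some images may have a clean reference. A minimal sketch of averaging over just the rows that carry a metric follows; `mean_metric` is a hypothetical helper and the sample rows are made-up values for illustration, not results from this repository.

```python
from statistics import fmean
from typing import Any, Dict, List, Optional


def mean_metric(rows: List[Dict[str, Any]], key: str) -> Optional[float]:
    """Average `key` over the rows that contain it; return None when no row does."""
    values = [float(row[key]) for row in rows if key in row]
    return fmean(values) if values else None


# Illustrative rows only: the second image has no clean reference, so no PSNR keys.
rows = [
    {"file": "a.png", "baseline_psnr": 30.0, "gan_psnr": 32.0},
    {"file": "b.png"},
    {"file": "c.png", "baseline_psnr": 20.0, "gan_psnr": 28.0},
]
print(mean_metric(rows, "baseline_psnr"))
print(mean_metric(rows, "baseline_ssim"))
```

Filtering on key presence avoids two pitfalls: counting missing values as 0, which drags the mean down, and deciding whether to report at all based on whichever image happened to be processed last.
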
+ 245 - 0
ocr_tools/gan_experiments_lab/lama_inpaint.py

@@ -0,0 +1,245 @@
+"""
+LaMa (Large Mask Inpainting) inference module.
+
+Wraps loading and inference for a pretrained LaMa model. Backends, in priority order:
+1. the simple_lama_inpainting pip package (simplest)
+2. a local lama repository checkout (big-lama checkpoint)
+3. OpenCV inpainting (last-resort fallback, no GAN)
+
+Usage:
+    from gan_experiments_lab.lama_inpaint import LamaInpainter
+    inpaint = LamaInpainter(device="cpu")
+    result = inpaint.inpaint(bgr_image, mask_bool)
+"""
+from __future__ import annotations
+
+import sys
+from pathlib import Path
+from typing import Optional
+
+import cv2
+import numpy as np
+from loguru import logger
+
+
+def _check_simple_lama() -> bool:
+    try:
+        import simple_lama_inpainting  # noqa: F401
+        return True
+    except ImportError:
+        return False
+
+
+def _check_lama_repo() -> Optional[Path]:
+    """Return the path of a local lama repository checkout, or None if none is found."""
+    candidates = [
+        Path(__file__).parent / "lama",
+        Path(__file__).parents[2] / "lama",
+        Path.home() / "lama",
+        Path("/tmp/lama"),
+    ]
+    for p in candidates:
+        if (p / "saicinpainting" / "__init__.py").exists():
+            return p
+    return None
+
+
+class LamaInpainter:
+    """LaMa inpainting 门面,自动选择可用后端。"""
+
+    def __init__(
+        self,
+        *,
+        device: str = "cpu",
+        inference_size: Optional[int] = None,
+        pad_to_multiple: int = 8,
+    ):
+        self._device = device
+        self._inference_size = inference_size  # None = 保持原尺寸
+        self._pad_to_multiple = pad_to_multiple
+        self._model = None
+        self._backend = None  # "simple_lama" | "lama_repo" | "opencv"
+        self._lama_repo_path: Optional[Path] = None
+
+    @property
+    def is_available(self) -> bool:
+        if self._backend is not None:
+            return self._backend != "opencv"
+        if _check_simple_lama():
+            self._backend = "simple_lama"
+            return True
+        if _check_lama_repo():
+            self._backend = "lama_repo"
+            return True
+        return False
+
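+    def load(self) -> bool:
+        """Load the model, trying backends in priority order; return whether one loaded."""
+        if self._model is not None:
+            return True
+
+        # Fall through to the next backend when a higher-priority one fails to load.
+        if _check_simple_lama() and self._load_simple_lama():
+            return True
+        repo = _check_lama_repo()
+        if repo and self._load_lama_repo(repo):
+            return True
+
+        logger.warning("No LaMa backend could be loaded; falling back to OpenCV inpainting")
+        self._backend = "opencv"
+        return False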
+
+    def _load_simple_lama(self) -> bool:
+        try:
+            from simple_lama_inpainting import SimpleLama
+            self._model = SimpleLama(device=self._device)
+            self._backend = "simple_lama"
+            logger.info(f"LaMa (simple_lama_inpainting) 已加载, device={self._device}")
+            return True
+        except Exception as e:
+            logger.warning(f"simple_lama_inpainting 加载失败: {e}")
+            return False
+
+    def _load_lama_repo(self, repo_path: Path) -> bool:
+        try:
+            if str(repo_path) not in sys.path:
+                sys.path.insert(0, str(repo_path))
+
+            from omegaconf import OmegaConf
+            from saicinpainting.training.trainers import load_checkpoint
+
+            config_path = repo_path / "big-lama" / "config.yaml"
+            ckpt_path = repo_path / "big-lama" / "models" / "best.ckpt"
+
+            if not config_path.exists() or not ckpt_path.exists():
+                logger.warning(
+                    f"lama 模型文件缺失。请下载: "
+                    f"wget https://github.com/Sanster/models/releases/download/add_big_lama/big-lama.zip && "
+                    f"unzip big-lama.zip -d {repo_path}"
+                )
+                return False
+
+            conf = OmegaConf.load(str(config_path))
+            conf.training_model.predict_only = True
+            conf.visualizer.kind = "noop"
+
+            model = load_checkpoint(conf, str(ckpt_path), strict=False, map_location="cpu")
+            model.eval()
+            if self._device != "cpu":
+                # .to() also covers "mps", which .cuda() would not
+                model.to(self._device)
+            self._model = model
+            self._lama_repo_path = repo_path
+            self._backend = "lama_repo"
+            logger.info(f"LaMa (lama_repo) 已加载, device={self._device}")
+            return True
+        except Exception as e:
+            logger.warning(f"lama_repo 加载失败: {e}")
+            return False
+
+    def inpaint(self, image: np.ndarray, mask: np.ndarray) -> Optional[np.ndarray]:
+        """
+        修复图像。
+
+        Args:
+            image: BGR ndarray (H, W, 3), uint8
+            mask: bool ndarray (H, W), True=需要修复的水印区域
+
+        Returns:
+            BGR ndarray (H, W, 3), uint8, or None
+        """
+        if not self._model:
+            if not self.load():
+                return self._opencv_inpaint(image, mask)
+
+        if self._backend == "simple_lama":
+            return self._inpaint_simple_lama(image, mask)
+        elif self._backend == "lama_repo":
+            return self._inpaint_lama_repo(image, mask)
+        else:
+            return self._opencv_inpaint(image, mask)
+
+    def _inpaint_simple_lama(self, image: np.ndarray, mask: np.ndarray) -> Optional[np.ndarray]:
+        try:
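+            orig_h, orig_w = image.shape[:2]
+            rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
+            mask_u8 = mask.astype(np.uint8) * 255
+            # Resize on demand
+            if self._inference_size:
+                rgb, mask_u8, _ = self._resize_to_inference(rgb, mask_u8)
+            in_h, in_w = rgb.shape[:2]
+            # np.asarray covers a backend that returns a PIL image instead of an ndarray;
+            # the crop drops any padding the backend may have added.
+            result_rgb = np.ascontiguousarray(np.asarray(self._model(rgb, mask_u8))[:in_h, :in_w])
+            if (in_h, in_w) != (orig_h, orig_w):
+                # cv2.resize expects dsize as (width, height)
+                result_rgb = cv2.resize(result_rgb, (orig_w, orig_h))
+            return cv2.cvtColor(result_rgb, cv2.COLOR_RGB2BGR)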
+        except Exception as e:
+            logger.warning(f"simple_lama 推理失败: {e}")
+            return None
+
+    def _inpaint_lama_repo(self, image: np.ndarray, mask: np.ndarray) -> Optional[np.ndarray]:
+        try:
+            import torch
+            import torch.nn.functional as F
+            from saicinpainting.evaluation.data import pad_tensor_to_modulo
+
+            rgb = cv2.cvtColor(image.astype(np.float32) / 255.0, cv2.COLOR_BGR2RGB)
+            mask_f = mask.astype(np.float32)
+            orig_h, orig_w = rgb.shape[:2]
+
+            # Optionally resize before inference
+            if self._inference_size:
+                rgb, mask_f, _ = self._resize_image_mask(rgb, mask_f)
+            in_h, in_w = rgb.shape[:2]
+
+            img_t = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)
+            mask_t = torch.from_numpy(mask_f).unsqueeze(0).unsqueeze(0)
+
+            img_t = pad_tensor_to_modulo(img_t, self._pad_to_multiple)
+            mask_t = pad_tensor_to_modulo(mask_t, self._pad_to_multiple)
+
+            if self._device != "cpu":
+                img_t = img_t.to(self._device)
+                mask_t = mask_t.to(self._device)
+
+            with torch.no_grad():
+                output = self._model(img_t, mask_t)
+                # output shape: (B, C, H, W)
+                result = output[0].permute(1, 2, 0).cpu().numpy()
+                # Crop the padding back to the inference size
+                result = result[:in_h, :in_w, :]
+
+            result = np.clip(result, 0, 1)
+            result_u8 = (result * 255).astype(np.uint8)
+            if (in_h, in_w) != (orig_h, orig_w):
+                # Restore the original size; cv2.resize expects dsize as (width, height)
+                result_u8 = cv2.resize(result_u8, (orig_w, orig_h), interpolation=cv2.INTER_CUBIC)
+            return cv2.cvtColor(result_u8, cv2.COLOR_RGB2BGR)
+        except Exception as e:
+            logger.warning(f"lama_repo 推理失败: {e}")
+            return None
+
+    def _resize_to_inference(self, rgb: np.ndarray, mask: np.ndarray) -> tuple:
+        h, w = rgb.shape[:2]
+        size = self._inference_size or min(h, w)
+        scale = size / min(h, w)
+        new_w, new_h = int(w * scale), int(h * scale)
+        rgb_rs = cv2.resize(rgb, (new_w, new_h), interpolation=cv2.INTER_CUBIC)
+        mask_rs = cv2.resize(mask, (new_w, new_h), interpolation=cv2.INTER_NEAREST)
+        return rgb_rs, mask_rs, (w, h)
+
+    def _resize_image_mask(self, rgb: np.ndarray, mask: np.ndarray) -> tuple:
+        h, w = rgb.shape[:2]
+        size = self._inference_size or min(h, w)
+        scale = size / min(h, w)
+        new_w, new_h = int(w * scale), int(h * scale)
+        rgb_rs = cv2.resize(rgb, (new_w, new_h), interpolation=cv2.INTER_CUBIC)
+        mask_rs = cv2.resize(mask, (new_w, new_h), interpolation=cv2.INTER_NEAREST)
+        return rgb_rs, mask_rs, (w, h)
+
+    def _opencv_inpaint(self, image: np.ndarray, mask: np.ndarray) -> np.ndarray:
+        """OpenCV Telea inpainting fallback (not GAN-based)."""
+        logger.info("Falling back to OpenCV inpainting")
+        # Binarize first so bool, 0/1, and 0/255 masks all map to a valid 0/255 uint8 mask
+        mask_u8 = (mask > 0).astype(np.uint8) * 255
+        return cv2.inpaint(image, mask_u8, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
+
+
+if __name__ == "__main__":
+    # Quick smoke test
+    print("LaMa backend check:")
+    print(f"  simple_lama_inpainting: {_check_simple_lama()}")
+    repo = _check_lama_repo()
+    print(f"  lama_repo:              {repo}")
+    inpaint = LamaInpainter(device="cpu")
+    print(f"  is_available:           {inpaint.is_available}")
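Review note on the resize/pad round trip in the lama_repo inference path above. The following is a minimal sketch, assuming the model returns a tensor with the same spatial size as its padded input: the padding is cropped at the resized size, and only afterwards is the result scaled back to the original size. The helper names (`inference_shape`, `padded_shape`, `round_trip_plan`) are hypothetical and are not part of `lama_inpaint.py`. Integer arithmetic is used in place of the float `scale`, so the short side lands exactly on the target.

```python
def inference_shape(h: int, w: int, short_side: int) -> tuple:
    """Scale (h, w) so the shorter side equals short_side (integer math)."""
    base = min(h, w)
    return max(1, h * short_side // base), max(1, w * short_side // base)


def padded_shape(h: int, w: int, modulo: int) -> tuple:
    """Round h and w up to the next multiple of modulo."""
    return h + (-h) % modulo, w + (-w) % modulo


def round_trip_plan(h: int, w: int, short_side: int, modulo: int) -> dict:
    """Sizes at each stage: model input, crop after inference, final resize."""
    in_h, in_w = inference_shape(h, w, short_side)
    pad_h, pad_w = padded_shape(in_h, in_w, modulo)
    return {
        "model_input": (pad_h, pad_w),   # resized, then padded
        "crop_to": (in_h, in_w),         # strip padding at the resized size
        "resize_back_to": (h, w),        # restore the caller's image size
    }
```

For a 300x500 page with a short side of 256 and a modulo of 8, the plan crops to 256x426 before resizing back to 300x500; cropping to 300x500 directly would keep the padded border.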

BIN
ocr_tools/gan_experiments_lab/test_images/input/彭_广东兴宁农村商业银行_page_002.png


+ 222 - 0
ocr_tools/gan_experiments_lab/watermark_synthesis.py

@@ -0,0 +1,222 @@
+"""
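+Watermark synthesis script: overlays slanted, light-colored text watermarks on clean images and writes the watermarked image plus an exact mask.
+
+Usage:
+    python watermark_synthesis.py                          # demo with default parameters
+    python watermark_synthesis.py --input ./test_images/clean/   # specify the input directory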
+    python watermark_synthesis.py --text "SAMPLE" --opacity 0.15 --angle 45
+"""
+from __future__ import annotations
+
+import argparse
+import math
+from pathlib import Path
+from typing import Optional
+
+import cv2
+import numpy as np
+from loguru import logger
+from PIL import Image, ImageDraw, ImageFont
+
+
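+def _find_font() -> str:
+    """Find an available font (CJK-capable where possible); return "" if none is found, so the caller falls back to PIL's default font."""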
+    candidates = [
+        "/System/Library/Fonts/PingFang.ttc",
+        "/System/Library/Fonts/STHeiti Light.ttc",
+        "/System/Library/Fonts/Hiragino Sans GB.ttc",
+        "/usr/share/fonts/truetype/dejavu/DejaVuSans.ttf",
+        "/usr/share/fonts/opentype/noto/NotoSansCJK-Regular.ttc",
+    ]
+    for fp in candidates:
+        if Path(fp).exists():
+            return fp
+    logger.warning("No CJK font found; falling back to the PIL default font")
+    return ""
+
+
+def _text_size_to_font_size(text_height_px: int) -> int:
+    """Estimate font_size from the target text height in pixels."""
+    return int(text_height_px * 1.15)
+
+
+def _render_watermark_tile(
+    pil_img: Image.Image,
+    text: str,
+    font_path: str,
+    font_size: int,
+    opacity: float,
+    angle_deg: float,
+    spacing_x: int,
+    spacing_y: int,
+) -> tuple[np.ndarray, np.ndarray]:
+    """
+    Tile slanted watermark text over the image and return (watermarked_np, mask_np).
+
+    mask_np: H×W bool array, True where a watermark pixel was drawn.
+    """
+    w, h = pil_img.size
+    text_height = int(font_size / 1.15)
+    gray_value = int(255 * (1 - opacity))
+
+    # Build the watermark text mask on an oversized canvas so the rotated area still covers the image
+    diag = int(math.sqrt(w * w + h * h)) + text_height * 4
+    tile_w = diag
+    tile_h = diag
+
+    tile = Image.new("L", (tile_w, tile_h), 0)
+    draw = ImageDraw.Draw(tile)
+    font = ImageFont.truetype(font_path, font_size) if font_path else ImageFont.load_default()
+
+    # Step = spacing + text height (not text width), so long strings can overlap horizontally when spacing_x is small
+    step_x = text_height + spacing_x
+    step_y = text_height + spacing_y
+
+    for y in range(0, tile_h, step_y):
+        for x in range(0, tile_w, step_x):
+            draw.text((x, y), text, fill=255, font=font)
+
+    # Rotate
+    tile_rot = tile.rotate(angle_deg, expand=False, fillcolor=0)
+
+    # Crop to the original image size (center-aligned)
+    cx, cy = tile_rot.size[0] // 2, tile_rot.size[1] // 2
+    left = cx - w // 2
+    top = cy - h // 2
+    watermark_tile = tile_rot.crop((left, top, left + w, top + h))
+
+    mask_np = np.array(watermark_tile) > 0
+
+    # Blend onto the original image
+    base = np.array(pil_img.convert("RGB"))
+    alpha = opacity
+    result = base.copy()
+    result[mask_np] = (
+        base[mask_np].astype(np.float32) * (1 - alpha)
+        + np.array([gray_value, gray_value, gray_value], dtype=np.float32) * alpha
+    ).astype(np.uint8)
+
+    return result, mask_np
+
+
+def synthesize_watermark(
+    input_path: Path,
+    output_dir: Path,
+    *,
+    text: str = "SAMPLE",
+    font_path: str = "",
+    text_height_px: int = 36,
+    opacity: float = 0.12,
+    angle_deg: float = 45.0,
+    spacing_x: int = 180,
+    spacing_y: int = 180,
+    save_mask: bool = True,
+) -> Path:
+    """
+    Synthesize a watermark on the input image and write the result to output_dir.
+
+    Returns:
+        Path of the watermarked image.
+    """
+    output_dir.mkdir(parents=True, exist_ok=True)
+    pil_img = Image.open(str(input_path)).convert("RGB")
+
+    fp = font_path or _find_font()
+    font_size = _text_size_to_font_size(text_height_px)
+
+    logger.info(
+        f"合成水印: {input_path.name} | "
+        f"text='{text}' font_size={font_size} opacity={opacity} angle={angle_deg}"
+    )
+
+    result_np, mask_np = _render_watermark_tile(
+        pil_img, text, fp, font_size, opacity, angle_deg, spacing_x, spacing_y
+    )
+
+    out_name = f"{input_path.stem}_watermarked{input_path.suffix}"
+    out_path = output_dir / out_name
+    Image.fromarray(result_np).save(str(out_path))
+    logger.info(f"  水印图: {out_path}")
+
+    if save_mask:
+        mask_path = output_dir / f"{input_path.stem}_mask.png"
+        cv2.imwrite(str(mask_path), (mask_np.astype(np.uint8) * 255))
+        logger.info(f"  mask:   {mask_path}")
+
+    return out_path
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Watermark synthesis tool")
+    parser.add_argument("--input", type=Path, default=None,
+                        help="input image or directory (default: test_images/clean/)")
+    parser.add_argument("--output", type=Path, default=None,
+                        help="output directory (default: test_images/synthetic/)")
+    parser.add_argument("--text", type=str, default="行内内部使用",
+                        help="watermark text")
+    parser.add_argument("--text-height", type=int, default=48,
+                        help="text height in pixels (default: 48)")
+    parser.add_argument("--opacity", type=float, default=0.10,
+                        help="watermark opacity, 0 to 1 (default: 0.10)")
+    parser.add_argument("--angle", type=float, default=45.0,
+                        help="watermark tilt angle in degrees (default: 45)")
+    parser.add_argument("--spacing-x", type=int, default=250,
+                        help="horizontal spacing between watermark texts (default: 250 px)")
+    parser.add_argument("--spacing-y", type=int, default=250,
+                        help="vertical spacing between watermark texts (default: 250 px)")
+    parser.add_argument("--font", type=str, default="",
+                        help="path to a font file")
+    parser.add_argument("--no-mask", action="store_true",
+                        help="do not save the mask")
+    parser.add_argument("--demo", action="store_true",
+                        help="build demo images from the test images under test_images/input/")
+    args = parser.parse_args()
+
+    root = Path(__file__).parent
+    input_dir = args.input or (root / "test_images" / "clean")
+    output_dir = args.output or (root / "test_images" / "synthetic")
+
+    if args.demo:
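+        # No clean images available: add one more synthetic watermark layer on top of the already-watermarked images in input/ as a demo
+        img_files = sorted(root.glob("test_images/input/*"))
+        if not img_files:
+            logger.error("No test images under test_images/input/; add some images and retry")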
+            return
+        input_dir = root / "test_images" / "input"
+        output_dir = root / "test_images" / "synthetic"
+
+    input_dir = Path(input_dir)
+    output_dir = Path(output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    if input_dir.is_dir():
+        img_files = sorted([
+            f for f in input_dir.iterdir()
+            if f.suffix.lower() in {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff", ".webp"}
+        ])
+    elif input_dir.is_file():
+        img_files = [input_dir]
+    else:
+        logger.error(f"输入路径不存在: {input_dir}")
+        return
+
+    if not img_files:
+        logger.warning(f"{input_dir} 下没有图片文件")
+        return
+
+    for f in img_files:
+        synthesize_watermark(
+            f, output_dir,
+            text=args.text,
+            font_path=args.font,
+            text_height_px=args.text_height,
+            opacity=args.opacity,
+            angle_deg=args.angle,
+            spacing_x=args.spacing_x,
+            spacing_y=args.spacing_y,
+            save_mask=not args.no_mask,
+        )
+
+
+if __name__ == "__main__":
+    main()
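Review note on the blend in `_render_watermark_tile`. The sketch below reproduces the per-channel arithmetic for a single masked pixel, as a worked example of how visible the synthetic watermark ends up. `blend_pixel` is a hypothetical helper written for this note, not a function in the script. Because `opacity` enters twice (once in `gray_value`, once as `alpha`), a white pixel darkens by only about 255·opacity², which may be fainter than intended for evaluation data.

```python
def blend_pixel(base: int, opacity: float) -> int:
    """One channel of a masked pixel, mirroring the blend formula in the script."""
    gray_value = int(255 * (1 - opacity))              # watermark ink level
    mixed = base * (1 - opacity) + gray_value * opacity
    return int(mixed)  # same truncation as astype(np.uint8) for in-range values


if __name__ == "__main__":
    for opacity in (0.10, 0.12, 0.30):
        print(opacity, blend_pixel(255, opacity), blend_pixel(0, opacity))
```

With the CLI default of 0.10, a white background pixel (255) becomes 252, so the watermark sits roughly three gray levels below the paper; even at 0.30 it only reaches 231.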

+ 162 - 58
ocr_tools/universal_doc_parser/config/bank_statement_glm_vl.yaml

@@ -19,46 +19,57 @@ preprocessor:
     model_dir: null  # 使用默认路径
   unwarping:
     enabled: false
-  # -------------------------------------------------------
-  # 水印去除配置(适用于银行流水浅色斜向文字水印)
-  # -------------------------------------------------------
+  # Page-level watermark removal (for detailed parameters see ocr_utils/watermark/presets.py, PAGE_WATERMARK_PRESETS)
   watermark_removal:
-    enabled: false           # 是否启用水印去除
-    method: threshold # threshold | masked | masked_adaptive
-    threshold: 175          # 全局阈值或掩膜失败时的回退阈值(140-180)
-    morph_close_kernel: 0   # 去水印后灰度图闭运算,0 跳过
-    # 去水印后对比度增强(text_restore 将笔画拉深,比全局 gamma 更接近原图)
+    enabled: false
+    detect_before_remove: true
+    method: threshold   # threshold | masked | masked_adaptive
+    threshold: 175
     contrast_enhancement:
-      enabled: true
-      method: text_restore   # text_restore | clahe | gamma | linear
-      text_black_target: 85  # 略提高,减轻去水印后笔画被拉花(原 75 过深)
-      background_threshold: 248
-      text_lo_percentile: 1.0
-      text_hi_percentile: 99.0
-      gamma: 0.75            # method=gamma 时生效
-      clip_limit: 2.0        # method=clahe
-      tile_grid_size: 8
-      black_percentile: 2.0  # method=linear
-      white_percentile: 98.0
+      enabled: false
+      method: text_restore
+      text_black_target: 85
     debug_options:
-      enabled: false              # 由命令行 --debug / --debug-layout 统一控制
-      output_dir: null            # null 时使用 pipeline 输出目录
-      prefix: ""                  # 文件名前缀(运行时注入 page_name)
-      subdir: watermark_removal   # 输出至 debug/watermark_removal/
-      save_compare: true          # 保存左右对比图 *_watermark_compare.*
-      image_format: "png"         # jpg / png
+      enabled: false
+      output_dir: null
+      prefix: ""
+      subdir: watermark_removal
+      save_compare: true
+      image_format: "png"
 
 # ============================================================
-# Layout 检测配置 - 使用 PP-DocLayoutV3
+# Layout detection config - smart router (picks the model directly by scene)
 # ============================================================
 layout_detection:
-  module: "paddle"
-  model_name: "PP-DocLayoutV3"
-  model_dir: "PaddlePaddle/PP-DocLayoutV3_safetensors"
-  device: "cpu"
-  conf: 0.3
-  num_threads: 4
-  batch_size: 1
+  module: "smart_router"
+  strategy: "scene"  # pick the model directly by scene; skips ocr_eval
+
+  # Scene strategy: the layout model used for each named scene
+  scene_strategy:
+    bank_statement:
+      model: "docling"
+    financial_report:
+      model: "paddle_ppdoclayoutv3"
+  default_model: "docling"
+
+  # Model definitions
+  models:
+    docling:
+      module: "docling"
+      model_name: "docling-layout-old"
+      model_dir: "ds4sd/docling-layout-old"
+      device: "cpu"
+      conf: 0.3
+      num_threads: 4
+
+    paddle_ppdoclayoutv3:
+      module: "paddle"
+      model_name: "PP-DocLayoutV3"
+      model_dir: "PaddlePaddle/PP-DocLayoutV3_safetensors"
+      device: "cpu"
+      conf: 0.3
+      num_threads: 4
+      batch_size: 1
   
   # 后处理配置
   post_process:
@@ -70,7 +81,7 @@ layout_detection:
 
   # Debug 可视化(底图为 inference_image,与 Layout 检测输入一致)
   debug_options:
-    enabled: true              # 由命令行 --debug / --debug-layout 控制
+    enabled: false              # 由命令行 --debug / --debug-layout 控制
     output_dir: null            # null 时由 pipeline 按页注入
     prefix: ""
     subdir: layout_detection    # 输出至 debug/layout_detection/
@@ -80,7 +91,123 @@ layout_detection:
     image_format: "png"
 
 # ============================================================
-# VL识别配置 - 使用 GLM-OCR
+# OCR recognition config
+# ============================================================
+ocr_recognition:
+  module: "mineru"
+  language: "ch"
+  det_threshold: 0.5
+  unclip_ratio: 1.5
+  enable_merge_det_boxes: false
+  batch_size: 8
+  device: "cpu"
+
+  # Debug visualization (base image is inference_image, the same input as full-page OCR)
+  debug_options:
+    enabled: false              # controlled by the --debug / --debug-ocr CLI flags
+    output_dir: null
+    prefix: ""
+    subdir: ocr_recognition     # written to debug/ocr_recognition/
+    save_json: true
+    image_format: png
+
+# ============================================================
+# Table classification config (automatically separates wired and wireless tables)
+# ============================================================
+table_classification:
+  enabled: true               # enable automatic table classification
+  module: "paddle"            # classifier: paddle (MinerU PaddleTableClsModel)
+  confidence_threshold: 0.5   # classification confidence threshold
+  batch_size: 16              # batch size
+
+  # Debug visualization config
+  debug_options:
+    enabled: false              # controlled by the --debug / --debug-table CLI flags
+    output_dir: null            # when null, injected per page by the pipeline
+    prefix: ""
+    subdir: table_classification  # written to debug/table_classification/
+    save_table_lines: true      # overlay of the paddle line detection
+    image_format: "png"
+
+# ============================================================
+# Wired-table recognition config (MinerU UNet)
+# ============================================================
+table_recognition_wired:
+  use_wired_unet: false      # wired-table UNet recognition is disabled
+  upscale_ratio: 3.333
+  need_ocr: true
+  row_threshold: 10
+  col_threshold: 15
+  ocr_conf_threshold: 0.9       # per-cell OCR confidence threshold
+  cell_crop_margin: 2
+  use_custom_postprocess: true  # use the custom post-processing (enabled by default)
+
+  # Enable deskew correction
+  enable_deskew: true
+
+  # 🆕 Enable multi-source cell fusion
+  use_cell_fusion: true
+  
+  # Fusion engine config
+  cell_fusion:
+    # RT-DETR model path (required)
+    rtdetr_model_path: "/Users/zhch158/models/pytorch_models/Table/RT-DETR-L_wired_table_cell_det.onnx"
+    
+    # Fusion weights
+    unet_weight: 0.6        # UNet weight (stronger on structure)
+    rtdetr_weight: 0.4      # RT-DETR weight (more robust)
+    
+    # Thresholds
+    iou_merge_threshold: 0.7    # high-IoU merge threshold (weighted average when > 0.7)
+    iou_nms_threshold: 0.5      # NMS de-duplication threshold
+    rtdetr_conf_threshold: 0.5  # RT-DETR confidence threshold
+    
+    # Feature switches
+    enable_ocr_compensation: true      # enable OCR edge compensation
+
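+  # Per-cell second-pass OCR (det line splitting + whole-cell/strip fallback + stroke-enhancement retry for low scores)
+  second_pass_ocr:
+    reocr_mode: bank_statement       # empty body cells always re-run; an empty cell also re-runs when most cells in its row are non-empty
+    header_row: 0                    # header row index (0 = first row)
+    row_peer_min_nonempty: 5         # when at least N cells in the row are non-empty, an empty cell also triggers second-pass OCR
+    line_min_score: 0.8              # lines scoring below this are dropped from both text and scoring
+    drop_low_score_blocks: true
+    whole_cell_fallback: true        # whole-cell fallback with det=False, plus strip scanning
+    prefer_whole_on_tie: true
+    whole_longer_min_extra_chars: 2  # prefer the whole-cell/strip text when it is at least N characters longer than the line-split text
+    strip_fallback_aspect_ratio: 1.8 # when height/width >= this value and at most 1 line is detected, split lines with sliding strips
+    suspicious_short_min_chars: 4    # high-scoring but too-short text still runs the whole-cell/strip fallback (independent of enhance_retry)
+    cell_preprocess:
+      watermark:
+        enabled: true
+        method: threshold
+      denoise:
+        enabled: false   # median filtering tends to blur strokes in small cells; compare in the lab with --denoise
+      contrast:
+        enabled: false   # optional after Pass 1 watermark removal; compare text_restore in the lab
+        method: text_restore
+        text_black_target: 88
+      light:
+        upscale_min_side: 192  # 128 or 192 for hard date-column cases
+    enhance_retry:
+      enabled: false
+      # with enabled: true, Pass 2 preprocessing runs; defaults are in code (cell_preprocess.enhance_retry is deprecated)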
+
+  # Debug visualization config
+  debug_options:
+    enabled: false              # controlled by the --debug / --debug-table CLI flags
+    output_dir: null            # when null, injected per page by the pipeline
+    prefix: ""
+    subdir: table_recognition_wired  # written to debug/table_recognition_wired/
+    save_table_lines: true
+    save_connected_components: true
+    save_grid_structure: true
+    save_text_overlay: true
+    image_format: "png"
+    # Per-cell second-pass OCR crops: debug/table_recognition_wired/tablecell_ocr/
+
+# ============================================================
+# VL recognition config - uses GLM-OCR (wireless tables + seal recognition)
 # ============================================================
 vl_recognition:
   module: "glmocr"
@@ -116,29 +243,6 @@ vl_recognition:
   
   # 场景特定配置
   table_recognition:
-    bank_statement_mode: true
-
-# ============================================================
-# OCR识别配置
-# ============================================================
-ocr_recognition:
-  module: "mineru" 
-  language: "ch"
-  det_threshold: 0.6
-  unclip_ratio: 1.5
-  enable_merge_det_boxes: false
-  batch_size: 8
-  device: "cpu"
-
-
-  # Debug 可视化(底图为 inference_image,与整页 OCR 输入一致)
-  debug_options:
-    enabled: false              # 由命令行 --debug / --debug-ocr 控制
-    output_dir: null
-    prefix: ""
-    subdir: ocr_recognition     # 输出至 debug/ocr_recognition/
-    save_json: true
-    image_format: png
 
 # ============================================================
 # 输出配置
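Review note on the `strategy: "scene"` layout routing added in this file. The sketch below shows one way the lookup could resolve, assuming the router maps a scene name to a key under `models` and falls back to `default_model` for unknown scenes. `resolve_layout_model` and the trimmed `LAYOUT_CFG` dict are hypothetical illustrations; the real smart-router code is not shown in this diff and may behave differently.

```python
# Trimmed copy of the layout_detection block above, as a plain dict.
LAYOUT_CFG = {
    "strategy": "scene",
    "scene_strategy": {
        "bank_statement": {"model": "docling"},
        "financial_report": {"model": "paddle_ppdoclayoutv3"},
    },
    "default_model": "docling",
    "models": {
        "docling": {"module": "docling", "model_name": "docling-layout-old"},
        "paddle_ppdoclayoutv3": {"module": "paddle", "model_name": "PP-DocLayoutV3"},
    },
}


def resolve_layout_model(cfg: dict, scene: str) -> dict:
    """Hypothetical lookup: scene -> model key -> model config, with a default fallback."""
    entry = cfg.get("scene_strategy", {}).get(scene, {})
    key = entry.get("model", cfg["default_model"])
    if key not in cfg["models"]:
        raise KeyError(f"layout model {key!r} is not defined under 'models'")
    return cfg["models"][key]
```

Validating the key against `models` up front turns a typo in `scene_strategy` into an immediate configuration error instead of a failure at inference time.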

+ 17 - 25
ocr_tools/universal_doc_parser/config/bank_statement_mineru_vl.yaml

@@ -19,35 +19,27 @@ preprocessor:
     model_dir: null  # 使用默认路径
   unwarping:
     enabled: false
-  # -------------------------------------------------------
-  # 水印去除配置(适用于银行流水浅色斜向文字水印)
-  # -------------------------------------------------------
+  # Page-level watermark removal (for detailed parameters see ocr_utils/watermark/presets.py, PAGE_WATERMARK_PRESETS)
   watermark_removal:
-    enabled: false           # 是否启用水印去除
-    method: threshold # threshold | masked | masked_adaptive
-    threshold: 175          # 全局阈值或掩膜失败时的回退阈值(140-180)
-    morph_close_kernel: 0   # 去水印后灰度图闭运算,0 跳过
-    # 去水印后对比度增强(text_restore 将笔画拉深,比全局 gamma 更接近原图)
+    enabled: false
+    detect_before_remove: true
+    method: threshold   # threshold | masked | masked_adaptive
+    threshold: 175
     contrast_enhancement:
-      enabled: true
-      method: text_restore   # text_restore | clahe | gamma | linear
-      text_black_target: 85  # 略提高,减轻去水印后笔画被拉花(原 75 过深)
-      background_threshold: 248
-      text_lo_percentile: 1.0
-      text_hi_percentile: 99.0
-      gamma: 0.75            # method=gamma 时生效
-      clip_limit: 2.0        # method=clahe
-      tile_grid_size: 8
-      black_percentile: 2.0  # method=linear
-      white_percentile: 98.0
+      enabled: false
+      method: text_restore
+      text_black_target: 85
     debug_options:
-      enabled: false              # 由命令行 --debug / --debug-layout 统一控制
-      output_dir: null            # null 时使用 pipeline 输出目录
-      prefix: ""                  # 文件名前缀(运行时注入 page_name)
-      subdir: watermark_removal   # 输出至 debug/watermark_removal/
-      save_compare: true          # 保存左右对比图 *_watermark_compare.*
-      image_format: "png"         # jpg / png
+      enabled: false
+      output_dir: null
+      prefix: ""
+      subdir: watermark_removal
+      save_compare: true
+      image_format: "png"
 
+# ============================================================
+# Layout detection config - MinerU-VL (layout detection through the VLM service)
+# ============================================================
 layout_detection:
   # MinerU-VL layout(通过 VLM 服务做版式检测)
   module: "mineru_vl"

+ 18 - 26
ocr_tools/universal_doc_parser/config/bank_statement_paddle_vl.yaml

@@ -19,35 +19,27 @@ preprocessor:
     model_dir: null  # 使用默认路径
   unwarping:
     enabled: false
-  # -------------------------------------------------------
-  # 水印去除配置(适用于银行流水浅色斜向文字水印)
-  # -------------------------------------------------------
+  # Page-level watermark removal (for detailed parameters see ocr_utils/watermark/presets.py, PAGE_WATERMARK_PRESETS)
   watermark_removal:
-    enabled: false           # 是否启用水印去除
-    method: threshold # threshold | masked | masked_adaptive
-    threshold: 175          # 全局阈值或掩膜失败时的回退阈值(140-180)
-    morph_close_kernel: 0   # 去水印后灰度图闭运算,0 跳过
-    # 去水印后对比度增强(text_restore 将笔画拉深,比全局 gamma 更接近原图)
+    enabled: false
+    detect_before_remove: true
+    method: threshold   # threshold | masked | masked_adaptive
+    threshold: 175
     contrast_enhancement:
-      enabled: true
-      method: text_restore   # text_restore | clahe | gamma | linear
-      text_black_target: 85  # 略提高,减轻去水印后笔画被拉花(原 75 过深)
-      background_threshold: 248
-      text_lo_percentile: 1.0
-      text_hi_percentile: 99.0
-      gamma: 0.75            # method=gamma 时生效
-      clip_limit: 2.0        # method=clahe
-      tile_grid_size: 8
-      black_percentile: 2.0  # method=linear
-      white_percentile: 98.0
+      enabled: false
+      method: text_restore
+      text_black_target: 85
     debug_options:
-      enabled: false              # 由命令行 --debug / --debug-layout 统一控制
-      output_dir: null            # null 时使用 pipeline 输出目录
-      prefix: ""                  # 文件名前缀(运行时注入 page_name)
-      subdir: watermark_removal   # 输出至 debug/watermark_removal/
-      save_compare: true          # 保存左右对比图 *_watermark_compare.*
-      image_format: "png"         # jpg / png
+      enabled: false
+      output_dir: null
+      prefix: ""
+      subdir: watermark_removal
+      save_compare: true
+      image_format: "png"
 
+# ============================================================
+# Layout detection config - smart router (picks the model directly by scene)
+# ============================================================
 layout_detection:
   # module: "paddle"
   # model_name: "RT-DETR-H_layout_17cls"
@@ -104,7 +96,7 @@ vl_recognition:
 ocr_recognition:
   module: "mineru" 
   language: "ch"
-  det_threshold: 0.6
+  det_threshold: 0.5
   unclip_ratio: 1.5
   enable_merge_det_boxes: false
   batch_size: 8

+ 38 - 29
ocr_tools/universal_doc_parser/config/bank_statement_paddle_vl_local.yaml

@@ -22,34 +22,23 @@ preprocessor:
     model_dir: null  # 使用默认路径
   unwarping:
     enabled: false
-  # -------------------------------------------------------
-  # 水印去除配置(适用于银行流水浅色斜向文字水印)
-  # -------------------------------------------------------
+  # Page-level watermark removal (for detailed parameters see ocr_utils/watermark/presets.py, PAGE_WATERMARK_PRESETS)
   watermark_removal:
-    enabled: false           # 是否启用水印去除
-    method: threshold # threshold | masked | masked_adaptive
-    threshold: 175          # 全局阈值或掩膜失败时的回退阈值(140-180)
-    morph_close_kernel: 0   # 去水印后灰度图闭运算,0 跳过
-    # 去水印后对比度增强(text_restore 将笔画拉深,比全局 gamma 更接近原图)
+    enabled: false
+    detect_before_remove: true
+    method: threshold   # threshold | masked | masked_adaptive
+    threshold: 175
     contrast_enhancement:
-      enabled: true
-      method: text_restore   # text_restore | clahe | gamma | linear
-      text_black_target: 85  # 略提高,减轻去水印后笔画被拉花(原 75 过深)
-      background_threshold: 248
-      text_lo_percentile: 1.0
-      text_hi_percentile: 99.0
-      gamma: 0.75            # method=gamma 时生效
-      clip_limit: 2.0        # method=clahe
-      tile_grid_size: 8
-      black_percentile: 2.0  # method=linear
-      white_percentile: 98.0
+      enabled: false
+      method: text_restore
+      text_black_target: 85
     debug_options:
-      enabled: false              # 由命令行 --debug / --debug-layout 统一控制
-      output_dir: null            # null 时使用 pipeline 输出目录
-      prefix: ""                  # 文件名前缀(运行时注入 page_name)
-      subdir: watermark_removal   # 输出至 debug/watermark_removal/
-      save_compare: true          # 保存左右对比图 *_watermark_compare.*
-      image_format: "png"         # jpg / png
+      enabled: false
+      output_dir: null
+      prefix: ""
+      subdir: watermark_removal
+      save_compare: true
+      image_format: "png"
 
 # ============================================================
 # Layout 检测配置 - 智能路由器(按场景直接选择模型)
@@ -180,13 +169,33 @@ table_recognition_wired:
     # 功能开关
     enable_ocr_compensation: true      # 启用OCR边缘补偿
 
-
-  # 单元格二次 OCR(det 分行 + 整格兜底 + 低分块过滤)
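+  # Per-cell second-pass OCR (det line splitting + whole-cell/strip fallback + stroke-enhancement retry for low scores)
   second_pass_ocr:
-    line_min_score: 0.8
+    reocr_mode: bank_statement       # empty body cells always re-run; an empty cell also re-runs when most cells in its row are non-empty
+    header_row: 0                    # header row index (0 = first row)
+    row_peer_min_nonempty: 5         # when at least N cells in the row are non-empty, an empty cell also triggers second-pass OCR
+    line_min_score: 0.8              # lines scoring below this are dropped from both text and scoring
     drop_low_score_blocks: true
-    whole_cell_fallback: true
+    whole_cell_fallback: true        # whole-cell fallback with det=False, plus strip scanning
     prefer_whole_on_tie: true
+    whole_longer_min_extra_chars: 2  # prefer the whole-cell/strip text when it is at least N characters longer than the line-split text
+    strip_fallback_aspect_ratio: 1.8 # when height/width >= this value and at most 1 line is detected, split lines with sliding strips
+    suspicious_short_min_chars: 4    # high-scoring but too-short text still runs the whole-cell/strip fallback (independent of enhance_retry)
+    cell_preprocess:
+      watermark:
+        enabled: true
+        method: threshold
+      denoise:
+        enabled: false   # median filtering tends to blur strokes in small cells; compare in the lab with --denoise
+      contrast:
+        enabled: false   # optional after Pass 1 watermark removal; compare text_restore in the lab
+        method: text_restore
+        text_black_target: 88
+      light:
+        upscale_min_side: 192  # 128 or 192 for hard date-column cases
+    enhance_retry:
+      enabled: false
+      # with enabled: true, Pass 2 preprocessing runs; defaults are in code (cell_preprocess.enhance_retry is deprecated)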
 
   # Debug 可视化配置
   debug_options:
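Review note on the `second_pass_ocr` thresholds above. This is a minimal sketch of how `line_min_score`, `drop_low_score_blocks`, and `suspicious_short_min_chars` could interact, assuming each detected line carries a (text, score) pair. The function names and the choice of `min` as the cell score are assumptions made for illustration; the actual `TextFiller` logic is not part of this diff.

```python
def filter_lines(lines, line_min_score=0.8, drop_low_score_blocks=True):
    """Drop low-scoring lines from both the joined text and the cell score."""
    kept = [
        (text, score)
        for text, score in lines
        if not drop_low_score_blocks or score >= line_min_score
    ]
    joined = "".join(text for text, _ in kept)
    # Assumption: the cell score is the weakest surviving line; 0.0 if nothing survives.
    cell_score = min((score for _, score in kept), default=0.0)
    return joined, cell_score


def needs_whole_cell_fallback(text, score, line_min_score=0.8,
                              suspicious_short_min_chars=4):
    """A low score OR a suspiciously short result triggers the fallback pass."""
    return score < line_min_score or len(text) < suspicious_short_min_chars
```

Under these assumptions a cell recognized as "12" with score 0.99 still goes through the whole-cell/strip fallback, which matches the comment that the short-text check is independent of `enhance_retry`.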

+ 99 - 55
ocr_tools/universal_doc_parser/config/bank_statement_smart_router.yaml

@@ -21,35 +21,27 @@ preprocessor:
     model_dir: null  # 使用默认路径
   unwarping:
     enabled: false
-  # -------------------------------------------------------
-  # 水印去除配置(适用于银行流水浅色斜向文字水印)
-  # -------------------------------------------------------
+  # Page-level watermark removal (for detailed parameters see ocr_utils/watermark/presets.py, PAGE_WATERMARK_PRESETS)
   watermark_removal:
-    enabled: false           # 是否启用水印去除
-    method: threshold # threshold | masked | masked_adaptive
-    threshold: 175          # 全局阈值或掩膜失败时的回退阈值(140-180)
-    morph_close_kernel: 0   # 去水印后灰度图闭运算,0 跳过
-    # 去水印后对比度增强(text_restore 将笔画拉深,比全局 gamma 更接近原图)
+    enabled: false
+    detect_before_remove: true
+    method: threshold   # threshold | masked | masked_adaptive
+    threshold: 175
     contrast_enhancement:
-      enabled: true
-      method: text_restore   # text_restore | clahe | gamma | linear
-      text_black_target: 85  # 略提高,减轻去水印后笔画被拉花(原 75 过深)
-      background_threshold: 248
-      text_lo_percentile: 1.0
-      text_hi_percentile: 99.0
-      gamma: 0.75            # method=gamma 时生效
-      clip_limit: 2.0        # method=clahe
-      tile_grid_size: 8
-      black_percentile: 2.0  # method=linear
-      white_percentile: 98.0
+      enabled: false
+      method: text_restore
+      text_black_target: 85
     debug_options:
-      enabled: false              # 由命令行 --debug / --debug-layout 统一控制
-      output_dir: null            # null 时使用 pipeline 输出目录
-      prefix: ""                  # 文件名前缀(运行时注入 page_name)
-      subdir: watermark_removal   # 输出至 debug/watermark_removal/
-      save_compare: true          # 保存左右对比图 *_watermark_compare.*
-      image_format: "png"         # jpg / png
-
+      enabled: false
+      output_dir: null
+      prefix: ""
+      subdir: watermark_removal
+      save_compare: true
+      image_format: "png"
+
+# ============================================================
+# Layout detection config - smart router (strategy: ocr_eval)
+# ============================================================
 layout_detection:
   module: "smart_router"
   strategy: "ocr_eval"  # ocr_eval(推荐,基于OCR评估选择最佳), auto(快速模式,基于文档特征)
@@ -73,14 +65,6 @@ layout_detection:
       model_dir: null  # 使用默认路径
       device: "cpu"
     
-  # Debug 可视化配置(与 MinerUWiredTableRecognizer.DebugOptions 对齐)
-  # 默认关闭。开启后将保存:layout检测结果
-  debug_options:
-    enabled: true               # 是否开启调试可视化输出
-    output_dir: null             # 调试输出目录;null不输出
-    prefix: ""                  # 保存文件名前缀(如设置为页码)
-
-  
   # 可选:回退模型(当所有模型都失败时使用)
   fallback_model:
     module: "mineru"
@@ -90,11 +74,25 @@ layout_detection:
   # 后处理配置
   post_process:
     # 将大面积文本块转换为表格(后处理)
-    convert_large_text_to_table: true
-    min_text_area_ratio: 0.25
-    min_text_width_ratio: 0.4
-    min_text_height_ratio: 0.3
+    convert_large_text_to_table: true  # whether to enable
+    min_text_area_ratio: 0.25         # minimum area ratio (25%)
+    min_text_width_ratio: 0.4         # minimum width ratio (40%)
+    min_text_height_ratio: 0.3        # minimum height ratio (30%)
+
+  # Debug visualization (base image is inference_image, the same input as layout detection)
+  debug_options:
+    enabled: false              # controlled by the --debug / --debug-layout CLI flags
+    output_dir: null            # when null, injected per page by the pipeline
+    prefix: ""
+    subdir: layout_detection    # written to debug/layout_detection/
+    save_raw: true              # before post-processing
+    save_post_processed: true   # after post-processing
+    save_json: true
+    image_format: "png"
 
+# ============================================================
+# OCR recognition config
+# ============================================================
 ocr_recognition:
   module: "mineru"
   language: "ch"
@@ -104,7 +102,6 @@ ocr_recognition:
   batch_size: 8
   device: "cpu"
 
-
   # Debug 可视化(底图为 inference_image,与整页 OCR 输入一致)
   debug_options:
     enabled: true              # 由命令行 --debug / --debug-ocr 控制
@@ -114,56 +111,100 @@ ocr_recognition:
     save_json: true
     image_format: png
 
+# ============================================================
 # 表格分类配置(自动区分有线/无线表格)
+# ============================================================
 table_classification:
   enabled: true               # 是否启用自动表格分类(默认关闭,使用手动配置)
   module: "paddle"            # 分类模型:paddle(MinerU PaddleTableClsModel)
   confidence_threshold: 0.5   # 分类置信度阈值
   batch_size: 16              # 批处理大小
 
-
-
-  # Debug 可视化(底图为 inference_image,与 Layout 检测输入一致)
+  # Debug 可视化配置
   debug_options:
-    enabled: true              # 由命令行 --debug / --debug-layout 控制
+    enabled: false              # 由命令行 --debug / --debug-table 统一控制
     output_dir: null            # null 时由 pipeline 按页注入
     prefix: ""
-    subdir: layout_detection    # 输出至 debug/layout_detection/
-    save_raw: true              # 后处理前
-    save_post_processed: true   # 后处理后
-    save_json: true
+    subdir: table_classification  # 输出至 debug/table_classification/
+    save_table_lines: true      # paddle 线条检测叠加图
     image_format: "png"
 
-# 有线表格识别专用配置
+# ============================================================
+# Wired-table recognition config (MinerU UNet)
+# ============================================================
 table_recognition_wired:
   use_wired_unet: true
   upscale_ratio: 3.333
   need_ocr: true
   row_threshold: 10
   col_threshold: 15
-  ocr_conf_threshold: 0.8
+  ocr_conf_threshold: 0.9       # per-cell OCR confidence threshold
   cell_crop_margin: 2
   use_custom_postprocess: true  # 是否使用自定义后处理(默认启用)
 
   # 是否启用倾斜矫正
   enable_deskew: true
 
+  # 🆕 Enable multi-source cell fusion
+  use_cell_fusion: true
+  
+  # Fusion engine config
+  cell_fusion:
+    # RT-DETR model path (required)
+    rtdetr_model_path: "/Users/zhch158/models/pytorch_models/Table/RT-DETR-L_wired_table_cell_det.onnx"
+    
+    # Fusion weights
+    unet_weight: 0.6        # UNet weight (stronger on structure)
+    rtdetr_weight: 0.4      # RT-DETR weight (more robust)
+    
+    # Thresholds
+    iou_merge_threshold: 0.7    # high-IoU merge threshold (weighted average when > 0.7)
+    iou_nms_threshold: 0.5      # NMS de-duplication threshold
+    rtdetr_conf_threshold: 0.5  # RT-DETR confidence threshold
+    
+    # Feature switches
+    enable_ocr_compensation: true      # enable OCR edge compensation
 
-  # 单元格二次 OCR(det 分行 + 整格兜底 + 低分块过滤)
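+  # Per-cell second-pass OCR (det line splitting + whole-cell/strip fallback + stroke-enhancement retry for low scores)
   second_pass_ocr:
-    line_min_score: 0.8
+    reocr_mode: bank_statement       # empty body cells always re-run; an empty cell also re-runs when most cells in its row are non-empty
+    header_row: 0                    # header row index (0 = first row)
+    row_peer_min_nonempty: 5         # when at least N cells in the row are non-empty, an empty cell also triggers second-pass OCR
+    line_min_score: 0.8              # lines scoring below this are dropped from both text and scoring
     drop_low_score_blocks: true
-    whole_cell_fallback: true
+    whole_cell_fallback: true        # whole-cell fallback with det=False, plus strip scanning
     prefer_whole_on_tie: true
+    whole_longer_min_extra_chars: 2  # prefer the whole-cell/strip text when it is at least N characters longer than the line-split text
+    strip_fallback_aspect_ratio: 1.8 # when height/width >= this value and at most 1 line is detected, split lines with sliding strips
+    suspicious_short_min_chars: 4    # high-scoring but too-short text still runs the whole-cell/strip fallback (independent of enhance_retry)
+    cell_preprocess:
+      watermark:
+        enabled: true
+        method: threshold
+      denoise:
+        enabled: false   # median filtering tends to blur strokes in small cells; compare in the lab with --denoise
+      contrast:
+        enabled: false   # optional after Pass 1 watermark removal; compare text_restore in the lab
+        method: text_restore
+        text_black_target: 88
+      light:
+        upscale_min_side: 192  # 128 or 192 for hard date-column cases
+    enhance_retry:
+      enabled: false
+      # with enabled: true, Pass 2 preprocessing runs; defaults are in code (cell_preprocess.enhance_retry is deprecated)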
 
   # Debug 可视化配置
   debug_options:
-    enabled: true              # 由命令行 --debug / --debug-table 统一控制
+    enabled: false              # 由命令行 --debug / --debug-table 统一控制
     output_dir: null            # null 时由 pipeline 按页注入
     prefix: ""
-    subdir: table_classification  # 输出至 debug/table_classification/
-    save_table_lines: true      # paddle 线条检测叠加图
+    subdir: table_recognition_wired  # written to debug/table_recognition_wired/
+    save_table_lines: true
+    save_connected_components: true
+    save_grid_structure: true
+    save_text_overlay: true
     image_format: "png"
+    # Per-cell second-pass OCR crops: debug/table_recognition_wired/tablecell_ocr/
 
 # VLM 表格识别配置(当分类为 'wireless' 时使用)
 vl_recognition:
@@ -187,6 +228,9 @@ vl_recognition:
   # 表格识别特定配置
   table_recognition:
 
+# ============================================================
+# Output configuration
+# ============================================================
 output:
   create_subdir: false
   save_pdf_images: true

+ 9 - 19
ocr_tools/universal_doc_parser/config/bank_statement_yusys_local.yaml

@@ -26,11 +26,10 @@ preprocessor:
   watermark_removal:
     enabled: false
     detect_before_remove: true
-    method: masked_adaptive   # threshold | masked | masked_adaptive
+    method: threshold   # threshold | masked | masked_adaptive
     threshold: 175
-    morph_close_kernel: 0
     contrast_enhancement:
-      enabled: true
+      enabled: false
       method: text_restore
       text_black_target: 85
     debug_options:
@@ -180,31 +179,22 @@ table_recognition_wired:
     prefer_whole_on_tie: true
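     whole_longer_min_extra_chars: 2  # Prefer whole-cell/strip text when it is at least N characters longer than the line-split text
     strip_fallback_aspect_ratio: 1.8 # Split into sliding strips when height/width >= this value and at most 1 line was detected
+    suspicious_short_min_chars: 4    # High-scoring but too-short text still runs the whole-cell/strip fallback (independent of enhance_retry)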
     cell_preprocess:
       watermark:
         enabled: true
-        method: masked_adaptive
+        method: threshold
       denoise:
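         enabled: false   # Median filtering tends to blur strokes in small cells; compare with --denoise in the lab script
-        method: median
       contrast:
-        enabled: false
+        enabled: false   # Optional after Pass 1 watermark removal; compare text_restore in the lab script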
         method: text_restore
         text_black_target: 88
       light:
-        upscale_min_side: 64
-      enhance_retry:
-        enabled: false
-        score_below: 0.90
-        min_chars: 4
-        short_text_in_tall_cell: true
-        contrast:
-          enabled: true
-          method: text_restore
-          text_black_target: 75
-        sharpen:
-          enabled: false
-          amount: 0.3
+        upscale_min_side: 192  # 128 or 192 for hard date-column cases
+    enhance_retry:
+      enabled: false
+      # When enabled: true, Pass 2 preprocessing runs; see the code for defaults (cell_preprocess.enhance_retry is deprecated)
 
   # Debug visualization configuration
   debug_options:

+ 0 - 158
ocr_tools/universal_doc_parser/config/bank_statement_yusys_v2.yaml

@@ -1,158 +0,0 @@
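-# Bank transaction statement scenario configuration v2
-# Supports the full pipeline: PDF classification → orientation detection → layout detection → parallel OCR/VLM processing → coordinate matching
-
-scene_name: "bank_statement"
-description: "银行交易流水、对账单等场景 - 增强版"
-
-# ============================================================
-# Input configuration
-# ============================================================
-input:
-  supported_formats: [".pdf", ".png", ".jpg", ".jpeg", ".bmp", ".tiff"]
-  dpi: 200  # DPI used when rendering PDF pages to images
-  txt_pdf_watermark_removal:
-    enabled: true   # Remove watermark XObjects from text PDFs before rendering (keeps text searchable)
-    sample_pages: 3  # Quick pre-check that scans the first N pages
-
-# ============================================================
-# Preprocessing configuration (orientation detection)
-# ============================================================
-preprocessor:
-  module: "mineru"
-  orientation_classifier:
-    enabled: true  # Enabled automatically for scans, skipped automatically for digital PDFs
-    model_name: "paddle_orientation_classification"
-    model_dir: null  # Use the default path
-  unwarping:
-    enabled: false  # Image unwarping (optional)
-  # -------------------------------------------------------
-  # Watermark removal configuration (for light, diagonal text watermarks on bank statements)
-  # -------------------------------------------------------
-  watermark_removal:
-    enabled: true           # Whether to enable watermark removal
-    threshold: 160          # Grayscale threshold (140-180): pixels above this value are treated as watermark and turned white
-                            # Larger values are more conservative (watermark residue); smaller values are more aggressive (light body text is lost)
-    morph_close_kernel: 0   # Morphological closing kernel size (pixels); the default morph_kernel was changed to 0 (closing is counterproductive on non-binary images)
-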
-# ============================================================
-# 版式检测配置
-# ============================================================
-layout_detection:
-  module: "docling"
-  model_name: "docling-layout-old"
-  model_dir: ds4sd/docling-layout-old  # 使用默认路径,自动下载 doclayout_yolo_docstructbench_imgsz1280_2501.pt
-  device: "cpu"  # 可选: "cpu", "cuda", "mps"
-  conf: 0.3
-  num_threads: 4
-  
-  # 后处理配置
-  post_process:
-    # 将大面积文本块转换为表格(后处理)
-    convert_large_text_to_table: true  # 是否启用
-    min_text_area_ratio: 0.25         # 最小面积占比(25%)
-    min_text_width_ratio: 0.4         # 最小宽度占比(40%)
-    min_text_height_ratio: 0.3        # 最小高度占比(30%)
-
-# ============================================================
-# VL识别配置(表格、公式)
-# ============================================================
-vl_recognition:
-  # 可选: "mineru" (MinerU VLM) 或 "paddle" (PaddleOCR-VL)
-  module: "paddle"
-  model_name: "PaddleOCR-VL-0.9B"
-  
-  # 后端配置
-  backend: "http-client"  # 可选: "http-client", "vllm-engine", "transformers"
-  server_url: "http://10.192.72.11:20016"  # PaddleOCR-VL 服务地址
-  
-  # 图片尺寸限制(避免序列长度超限)
-  max_image_size: 4096
-  resize_mode: 'max'  # 'max' 保持宽高比, 'fixed' 固定尺寸
-  
-  device: "cpu"
-  batch_size: 1
-  
-  model_params:
-    max_concurrency: 10
-    http_timeout: 600
-  
-  # 表格识别特定配置
-  table_recognition:
-    bank_statement_mode: true      # 银行流水优化模式
-
-# ============================================================
-# OCR识别配置(文本检测+识别)
-# ============================================================
-ocr_recognition:
-  module: "mineru"
-  language: "ch"  # 语言: ch, ch_lite, en, japan 等
-  det_threshold: 0.6  # 检测阈值
-  unclip_ratio: 1.5   # 文本框扩展比例
-  enable_merge_det_boxes: false  # 不合并框
-  batch_size: 8
-  device: "cpu"
-
-# ============================================================
-# 输出配置
-# ============================================================
-output:
-  # 基础输出
-  create_subdir: false       # 创建子目录
-  save_json: true           # 保存 middle.json(MinerU标准格式)
-  save_markdown: true       # 保存 Markdown 文件
-  save_html: true           # 保存表格 HTML 文件
-  
-  # Debug 输出(通过命令行 --debug 开启)
-  save_layout_image: true  # 保存 layout 可视化图片
-  save_ocr_image: true     # 保存 OCR 可视化图片
-  draw_type_label: true     # 在可视化图片上标注类型
-  draw_bbox_number: true    # 在可视化图片上标注序号
-  
-  # 增强输出
-  save_enhanced_json: true  # 保存增强版 JSON(包含单元格坐标)
-
-  normalize_numbers: true  # 金额数字标准化(全角→半角)
-
-# ============================================================
-# 场景特定配置
-# ============================================================
-scene_config:
-  bank_statement:
-    # 表格结构特征
-    table_structure: "single_column_list"  # 单栏列表形式
-    merged_cells: false                     # 无合并单元格
-    
-    # 预期列名(用于验证)
-    expected_columns: ["日期", "摘要", "收入", "支出", "余额"]
-    
-    # 验证规则
-    amount_validation: true   # 金额格式验证
-    date_validation: true     # 日期格式验证
-    balance_validation: true  # 余额一致性验证
-    
-  processing_rules:
-    # 表格处理规则
-    table_rules:
-      - detect_table_type: ["wired", "wireless"]  # 检测有线/无线表格
-      - extract_header_automatically: true         # 自动提取表头
-      - validate_amount_format: true               # 验证金额格式
-      - merge_continuation_rows: true              # 合并续行
-      
-    # OCR后处理规则
-    ocr_rules:
-      - filter_low_confidence: 0.7      # 过滤低置信度结果
-      - merge_adjacent_text: true       # 合并相邻文本
-      - number_format_normalization: true  # 数字格式标准化
-
-# ============================================================
-# 跨页表格合并配置
-# ============================================================
-cross_page_merge:
-  enabled: true
-  # 判断表格是否跨页的条件
-  conditions:
-    - table_at_page_bottom: true    # 表格位于页面底部
-    - table_at_page_top: true       # 下一页表格位于顶部
-    - similar_column_count: true    # 列数相似
-    - header_match: false           # 表头匹配(跨页表格通常没有重复表头)
-

+ 38 - 30
ocr_tools/universal_doc_parser/config/bank_statement_yusys_v4.yaml

@@ -21,34 +21,23 @@ preprocessor:
     model_dir: null  # Use the default path
   unwarping:
     enabled: false
-  # -------------------------------------------------------
-  # Watermark removal configuration (for light, diagonal text watermarks on bank statements)
-  # -------------------------------------------------------
+  # Page-level watermark (for detailed parameters see PAGE_WATERMARK_PRESETS in ocr_utils/watermark/presets.py)
   watermark_removal:
-    enabled: false           # Whether to enable watermark removal
-    method: threshold # threshold | masked | masked_adaptive
-    threshold: 175          # Global threshold, or the fallback threshold when masking fails (140-180)
-    morph_close_kernel: 0   # Closing on the grayscale image after watermark removal; 0 skips it
-    # Contrast enhancement after watermark removal (text_restore darkens strokes and stays closer to the original than global gamma)
+    enabled: false
+    detect_before_remove: true
+    method: threshold   # threshold | masked | masked_adaptive
+    threshold: 175
     contrast_enhancement:
-      enabled: true
-      method: text_restore   # text_restore | clahe | gamma | linear
-      text_black_target: 85  # Raised slightly to reduce stroke artifacts after watermark removal (the previous 75 was too dark)
-      background_threshold: 248
-      text_lo_percentile: 1.0
-      text_hi_percentile: 99.0
-      gamma: 0.75            # Takes effect when method=gamma
-      clip_limit: 2.0        # method=clahe
-      tile_grid_size: 8
-      black_percentile: 2.0  # method=linear
-      white_percentile: 98.0
+      enabled: false
+      method: text_restore
+      text_black_target: 85
     debug_options:
-      enabled: false              # Controlled centrally by the --debug / --debug-layout command-line flags
-      output_dir: null            # When null, the pipeline output directory is used
-      prefix: ""                  # File name prefix (page_name is injected at runtime)
-      subdir: watermark_removal   # Output goes to debug/watermark_removal/
-      save_compare: true          # Save the side-by-side comparison image *_watermark_compare.*
-      image_format: "png"         # jpg / png
+      enabled: false
+      output_dir: null
+      prefix: ""
+      subdir: watermark_removal
+      save_compare: true
+      image_format: "png"
 
 # ============================================================
 # Layout detection configuration - smart router (selects the model directly by scenario)
@@ -115,7 +104,6 @@ ocr_recognition:
   batch_size: 8
   device: "cpu"
 
-
   # Debug visualization (the base image is inference_image, identical to the full-page OCR input)
   debug_options:
     enabled: false              # Controlled by the --debug / --debug-ocr command-line flags
@@ -179,13 +167,33 @@ table_recognition_wired:
     # Feature switches
     enable_ocr_compensation: true      # Enable OCR edge compensation
 
-
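-  # Cell second-pass OCR (det line splitting + whole-cell fallback + low-score block filtering)
+  # Cell second-pass OCR (det line splitting + whole-cell/strip fallback + low-score stroke-enhancement retry)
   second_pass_ocr:
-    line_min_score: 0.8
+    reocr_mode: bank_statement       # Always re-run empty body cells; also re-run an empty cell when most cells in its row are non-empty
+    header_row: 0                    # Header row index (0 = first row)
+    row_peer_min_nonempty: 5         # Trigger second-pass OCR on an empty cell when its row has at least N non-empty cells
+    line_min_score: 0.8              # Lines scoring below this are dropped from both text and scoring
     drop_low_score_blocks: true
-    whole_cell_fallback: true
+    whole_cell_fallback: true        # Whole-cell det=False fallback + strip scan
     prefer_whole_on_tie: true
+    whole_longer_min_extra_chars: 2  # Prefer whole-cell/strip text when it is at least N characters longer than the line-split text
+    strip_fallback_aspect_ratio: 1.8 # Split into sliding strips when height/width >= this value and at most 1 line was detected
+    suspicious_short_min_chars: 4    # High-scoring but too-short text still runs the whole-cell/strip fallback (independent of enhance_retry)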
+    cell_preprocess:
+      watermark:
+        enabled: true
+        method: threshold
+      denoise:
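+        enabled: false   # Median filtering tends to blur strokes in small cells; compare with --denoise in the lab script
+      contrast:
+        enabled: false   # Optional after Pass 1 watermark removal; compare text_restore in the lab script
+        method: text_restore
+        text_black_target: 88
+      light:
+        upscale_min_side: 192  # 128 or 192 for hard date-column cases
+    enhance_retry:
+      enabled: false
+      # When enabled: true, Pass 2 preprocessing runs; see the code for defaults (cell_preprocess.enhance_retry is deprecated)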
 
   # Debug visualization configuration
   debug_options:

+ 107 - 23
ocr_tools/universal_doc_parser/models/adapters/wired_table/text_filling.py

@@ -63,6 +63,7 @@ class TextFiller:
         self.second_pass_row_peer_min_nonempty: int = int(
             sp_cfg.get("row_peer_min_nonempty", 5)
         )
+        _short_min = sp_cfg.get("suspicious_short_min_chars")
         cpp = sp_cfg.get("cell_preprocess") or {}
         if not isinstance(cpp, dict):
             cpp = {}
@@ -70,16 +71,18 @@ class TextFiller:
         if not isinstance(light, dict):
             light = {}
         self.second_pass_light_upscale_min: int = int(
-            light.get("upscale_min_side", 64)
+            light.get("upscale_min_side", 192)
         )
-        er = cpp.get("enhance_retry") or {}
+        er = sp_cfg.get("enhance_retry") or cpp.get("enhance_retry") or {}
         if not isinstance(er, dict):
             er = {}
+        if _short_min is None:
+            _short_min = er.get("min_chars", 4)
+        self.second_pass_suspicious_short_min_chars: int = int(_short_min)
         self.second_pass_enhance_retry_enabled: bool = bool(er.get("enabled", True))
         self.second_pass_enhance_score_below: float = float(
             er.get("score_below", 0.90)
         )
-        self.second_pass_enhance_min_chars: int = int(er.get("min_chars", 4))
         self.second_pass_enhance_short_tall: bool = bool(
             er.get("short_text_in_tall_cell", True)
         )
@@ -101,7 +104,7 @@ class TextFiller:
         denoise = cpp.get("denoise") or {}
         if not isinstance(denoise, dict):
             denoise = {}
-        self._cell_denoise_enabled: bool = bool(denoise.get("enabled", True))
+        self._cell_denoise_enabled: bool = bool(denoise.get("enabled", False))
         self._cell_denoise_method: str = str(denoise.get("method", "median"))
         cell_contrast = cpp.get("contrast") or {}
         if not isinstance(cell_contrast, dict):
@@ -245,12 +248,40 @@ class TextFiller:
         return x1, y1, x2, y2
 
     @staticmethod
+    def _normalize_rec_score(score: float) -> float:
+        """Normalize a recognition score to [0, 1]; some engines return 0-100."""
+        try:
+            sc = float(score)
+        except (TypeError, ValueError):
+            return 0.0
+        if sc != sc:  # NaN
+            return 0.0
+        if sc > 1.0:
+            if sc <= 100.0:
+                return sc / 100.0
+            return 0.0
+        if sc < 0.0:
+            return 0.0
+        return sc
+
+    @staticmethod
+    def _parse_det_rec_item(item: Any) -> Tuple[str, float]:
+        """Parse one item of a combined det+rec result: [[box], (text, score)]."""
+        if item is None:
+            return "", 0.0
+        if isinstance(item, (list, tuple)) and len(item) >= 2:
+            head = item[0]
+            if isinstance(head, (list, tuple)) and len(head) >= 4:
+                return TextFiller._parse_single_rec_item(item[1])
+        return TextFiller._parse_single_rec_item(item)
+
+    @staticmethod
     def _parse_single_rec_item(rec_item: Any) -> Tuple[str, float]:
         if rec_item is None:
             return "", 0.0
         if isinstance(rec_item, tuple) and len(rec_item) >= 2:
             txt = str(rec_item[0] or "").strip()
-            sc = float(rec_item[1] or 0.0)
+            sc = TextFiller._normalize_rec_score(float(rec_item[1] or 0.0))
             return txt, 0.0 if not txt else sc
         if isinstance(rec_item, list) and len(rec_item) >= 2:
             if isinstance(rec_item[0], (list, tuple, dict)):
@@ -266,15 +297,19 @@ class TextFiller:
                     total_len = sum(len(t) for t in texts_list)
                     if total_len > 0:
                         weighted = sum(len(t) * s for t, s in zip(texts_list, scores_list)) / total_len
-                        return combined, weighted
-                    return combined, sum(scores_list) / len(scores_list)
+                        return combined, TextFiller._normalize_rec_score(weighted)
+                    return combined, TextFiller._normalize_rec_score(
+                        sum(scores_list) / len(scores_list)
+                    )
                 return "", 0.0
             txt = str(rec_item[0] or "").strip()
-            sc = float(rec_item[1] or 0.0)
+            sc = TextFiller._normalize_rec_score(float(rec_item[1] or 0.0))
             return txt, 0.0 if not txt else sc
         if isinstance(rec_item, dict):
             txt = str(rec_item.get("text") or rec_item.get("label") or "").strip()
-            sc = float(rec_item.get("score") or rec_item.get("confidence") or 0.0)
+            sc = TextFiller._normalize_rec_score(
+                float(rec_item.get("score") or rec_item.get("confidence") or 0.0)
+            )
             return txt, 0.0 if not txt else sc
         return "", 0.0
 
@@ -293,7 +328,18 @@ class TextFiller:
             items = self._extract_ocr_batch_results(rec_res)
             if not items:
                 return "", 0.0
-            return self._parse_single_rec_item(items[0] if len(items) == 1 else items)
+            blocks: List[Tuple[str, float]] = []
+            for item in items:
+                text, score = self._parse_det_rec_item(item)
+                if text:
+                    blocks.append((text, score))
+            if not blocks:
+                return "", 0.0
+            return self.aggregate_line_ocr(
+                blocks,
+                line_min_score=0.0,
+                drop_low_score_blocks=False,
+            )
         except Exception as e:
             logger.warning(f"整格 OCR 失败: {e}")
             return "", 0.0
@@ -418,7 +464,11 @@ class TextFiller:
         return cell_img
 
     def _apply_cell_contrast(
-        self, cell_img: np.ndarray, contrast_cfg: Dict[str, Any]
+        self,
+        cell_img: np.ndarray,
+        contrast_cfg: Dict[str, Any],
+        *,
+        sharpen_cfg: Optional[Dict[str, Any]] = None,
     ) -> np.ndarray:
         from ocr_utils.watermark.contrast import apply_contrast_enhancement_config
 
@@ -429,8 +479,9 @@ class TextFiller:
         else:
             gray = cell_img
         gray = apply_contrast_enhancement_config(gray, contrast_cfg)
-        if self.second_pass_enhance_sharpen.get("enabled", False):
-            amount = float(self.second_pass_enhance_sharpen.get("amount", 0.3))
+        sharpen = sharpen_cfg or {}
+        if sharpen.get("enabled", False):
+            amount = float(sharpen.get("amount", 0.3))
             blurred = cv2.GaussianBlur(gray, (0, 0), 1.0)
             gray = cv2.addWeighted(gray, 1.0 + amount, blurred, -amount, 0)
         if cell_img.ndim == 3:
@@ -451,12 +502,18 @@ class TextFiller:
             img = self._denoise_cell(img)
             stages.append("denoise")
 
-        if mode == "enhance":
+        if mode == "light":
+            if self._cell_contrast_cfg.get("enabled", False) and "wm" in stages:
+                img = self._apply_cell_contrast(img, self._cell_contrast_cfg)
+                stages.append("contrast")
+        elif mode == "enhance":
             contrast_cfg = self.second_pass_enhance_contrast
             if self._cell_contrast_cfg.get("enabled", False):
                 contrast_cfg = self._cell_contrast_cfg
             if contrast_cfg.get("enabled", False) and "wm" in stages:
-                img = self._apply_cell_contrast(img, contrast_cfg)
+                img = self._apply_cell_contrast(
+                    img, contrast_cfg, sharpen_cfg=self.second_pass_enhance_sharpen
+                )
                 stages.append("contrast")
 
         img = self._upscale_cell_if_small(img)
@@ -473,10 +530,18 @@ class TextFiller:
         strip_score: float = 0.0,
     ) -> Tuple[str, float, str]:
         """Return (text, score, strategy)."""
+        line_score = self._normalize_rec_score(line_score)
+        whole_score = self._normalize_rec_score(whole_score)
+        strip_score = self._normalize_rec_score(strip_score)
+
         candidates: List[Tuple[str, float, str]] = []
         if line_text:
             candidates.append((line_text, line_score, "lines"))
-        if whole_text and self.second_pass_whole_fallback:
+        if (
+            whole_text
+            and self.second_pass_whole_fallback
+            and 0.0 < whole_score <= 1.0
+        ):
             candidates.append((whole_text, whole_score, "whole"))
         if strip_text:
             candidates.append((strip_text, strip_score, "strip"))
@@ -487,6 +552,7 @@ class TextFiller:
         if (
             whole_text
             and line_text
+            and 0.0 < whole_score <= 1.0
             and line_score > whole_score
             and len(whole_text) >= len(line_text) + self.second_pass_whole_longer_extra
             and len(whole_text) > len(line_text)
@@ -567,7 +633,7 @@ class TextFiller:
         if (
             line_text
             and line_score >= base_conf_th
-            and len(line_text) < self.second_pass_enhance_min_chars
+            and len(line_text) < self.second_pass_suspicious_short_min_chars
         ):
             return True
         return False
@@ -587,7 +653,7 @@ class TextFiller:
             reasons.append("not_accepted")
         if score < self.second_pass_enhance_score_below:
             reasons.append("score_below_threshold")
-        if text and len(text) < self.second_pass_enhance_min_chars:
+        if text and len(text) < self.second_pass_suspicious_short_min_chars:
             reasons.append("suspicious_short_text")
         h, w = cell_img.shape[:2]
         if (
@@ -595,7 +661,7 @@ class TextFiller:
             and w > 0
             and h / w >= self.second_pass_strip_aspect
             and len(result.get("lines") or []) <= 1
-            and len(text) < self.second_pass_enhance_min_chars + 2
+            and len(text) < self.second_pass_suspicious_short_min_chars + 2
         ):
             reasons.append("tall_cell_single_line")
         return bool(reasons), reasons
@@ -620,7 +686,7 @@ class TextFiller:
             whole_text, whole_score = self._recognize_whole_cell(cell_img)
             whole_skipped = None
         elif line_text and line_score >= base_conf_th:
-            if len(line_text) < self.second_pass_enhance_min_chars:
+            if len(line_text) < self.second_pass_suspicious_short_min_chars:
                 whole_skipped = "short_text_high_score"
             else:
                 whole_skipped = "line_score>=%.2f" % base_conf_th
@@ -757,6 +823,7 @@ class TextFiller:
         debug_img: np.ndarray,
         result: Dict[str, Any],
         *,
+        raw_img: Optional[np.ndarray] = None,
         first_pass_text: str = "",
         first_pass_score: float = 0.0,
         trigger_reasons: Optional[List[str]] = None,
@@ -769,15 +836,31 @@ class TextFiller:
         if pass_label:
             stem += f"_{pass_label}"
         stem += f"_{strategy}_{tag}"
-        png_path = os.path.join(cell_ocr_dir, f"{stem}.png")
+        preprocessed_name = f"{stem}.png"
+        preprocessed_path = os.path.join(cell_ocr_dir, preprocessed_name)
         try:
-            cv2.imwrite(png_path, debug_img)
+            cv2.imwrite(preprocessed_path, debug_img)
         except Exception as e:
             logger.warning(f"保存单元格OCR图片失败 (cell {cell_idx}): {e}")
             return
+
+        raw_name: Optional[str] = None
+        if raw_img is not None and raw_img.size > 0:
+            raw_name = f"{stem}_raw.png"
+            raw_path = os.path.join(cell_ocr_dir, raw_name)
+            try:
+                cv2.imwrite(raw_path, raw_img)
+            except Exception as e:
+                logger.warning(f"保存单元格原图失败 (cell {cell_idx}): {e}")
+                raw_name = None
+
         payload = {
             "cell_idx": cell_idx,
             "bbox": bbox,
+            "debug_images": {
+                "raw": raw_name,
+                "preprocessed": preprocessed_name,
+            },
             "first_pass": {"text": first_pass_text, "score": first_pass_score},
             "trigger_reason": trigger_reasons or [],
             "lines": result.get("lines") or [],
@@ -828,7 +911,7 @@ class TextFiller:
         
         if text_len == 1:
             # Single character: raise the threshold by 0.1 (capped at 0.92)
-            return min(0.95, base_threshold + 0.1)
+            return min(0.92, base_threshold + 0.1)
         elif text_len <= 3:
             # 2-3 characters: raise the threshold slightly by 0.02
             return min(0.92, base_threshold + 0.02)
@@ -1456,6 +1539,7 @@ class TextFiller:
                         cell_idx,
                         debug_img,
                         result,
+                        raw_img=raw_crop,
                         first_pass_text=fp_text,
                         first_pass_score=fp_score,
                         trigger_reasons=trigger_reasons,
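
The `TextFiller.__init__` hunks above move `suspicious_short_min_chars` to the top level of `second_pass_ocr` while still honoring the older `enhance_retry.min_chars` key. The lookup order can be sketched in isolation as follows. The function name `resolve_suspicious_short_min_chars` is hypothetical and exists only for this illustration; the key names and the default of 4 are taken from the diff.

```python
from typing import Any, Dict


def resolve_suspicious_short_min_chars(sp_cfg: Dict[str, Any]) -> int:
    """Resolve the short-text threshold (hypothetical helper mirroring the diff)."""
    cpp = sp_cfg.get("cell_preprocess") or {}
    if not isinstance(cpp, dict):
        cpp = {}
    # Top-level enhance_retry wins; the nested (deprecated) location is the fallback.
    er = sp_cfg.get("enhance_retry") or cpp.get("enhance_retry") or {}
    if not isinstance(er, dict):
        er = {}
    value = sp_cfg.get("suspicious_short_min_chars")
    if value is None:
        # Legacy key, kept so that older configs continue to work.
        value = er.get("min_chars", 4)
    return int(value)


top_level = resolve_suspicious_short_min_chars({"suspicious_short_min_chars": 6})
legacy = resolve_suspicious_short_min_chars({"enhance_retry": {"min_chars": 3}})
nested_legacy = resolve_suspicious_short_min_chars(
    {"cell_preprocess": {"enhance_retry": {"min_chars": 5}}}
)
default = resolve_suspicious_short_min_chars({})
```

One consequence of the `or` chain is that any non-empty top-level `enhance_retry` mapping shadows the nested one entirely, even if it does not define `min_chars`.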

+ 56 - 1
ocr_tools/universal_doc_parser/tests/test_second_pass_ocr_aggregate.py

@@ -72,7 +72,7 @@ class TestShouldRunWholeFallback:
             config={
                 "second_pass_ocr": {
                     "whole_cell_fallback": True,
-                    "enhance_retry": {"min_chars": 4},
+                    "suspicious_short_min_chars": 4,
                 }
             },
         )
@@ -94,6 +94,61 @@ class TestShouldRunWholeFallback:
         assert f._should_run_whole_fallback("", 0.0, cell, [], 0.9)
 
 
+class TestCellPreprocessConfig:
+    def test_suspicious_short_from_top_level(self):
+        f = TextFiller(
+            ocr_engine=None,
+            config={"second_pass_ocr": {"suspicious_short_min_chars": 6}},
+        )
+        assert f.second_pass_suspicious_short_min_chars == 6
+
+    def test_light_contrast_stage_when_enabled(self):
+        import numpy as np
+
+        f = TextFiller(
+            ocr_engine=None,
+            config={
+                "second_pass_ocr": {
+                    "cell_preprocess": {
+                        "watermark": {"enabled": True, "method": "threshold"},
+                        "contrast": {
+                            "enabled": True,
+                            "method": "text_restore",
+                            "text_black_target": 88,
+                        },
+                    }
+                }
+            },
+        )
+        cell = np.ones((40, 80, 3), dtype=np.uint8) * 200
+        _, stages = f._preprocess_cell_for_ocr(cell, mode="light")
+        assert "wm" in stages
+        assert "contrast" in stages
+
+
+class TestWholeCellParse:
+    def test_parse_det_rec_item_uses_rec_not_box(self):
+        item = [
+            [[146.0, 15.0], [199.0, 15.0], [199.0, 85.0], [146.0, 85.0]],
+            ("/", 0.9213118553161621),
+        ]
+        t, s = TextFiller._parse_det_rec_item(item)
+        assert t == "/"
+        assert abs(s - 0.9213118553161621) < 1e-6
+
+    def test_normalize_rec_score_percent(self):
+        assert abs(TextFiller._normalize_rec_score(92.5) - 0.925) < 1e-6
+        assert TextFiller._normalize_rec_score(0.921) == 0.921
+        assert TextFiller._normalize_rec_score(999) == 0.0
+
+    def test_pick_line_when_whole_score_invalid(self):
+        f = TextFiller(ocr_engine=None, config={"second_pass_ocr": {}})
+        t, s, strat = f._pick_line_vs_whole("/", 0.92, "146.0199.0146.0/", 999.0)
+        assert t == "/"
+        assert strat == "lines"
+        assert abs(s - 0.92) < 1e-6
+
+
 class TestPickBetterOcrResult:
     def test_reject_invalid_pass2_score(self):
         pass1 = {"final_text": "取款", "final_score": 0.99, "accepted": True}
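
The normalization cases tested above (`92.5`, `0.921`, `999`) interact with the length-weighted averaging used when several recognized blocks are merged. The following is a minimal standalone sketch of that interaction, not the project's implementation: `normalize_score` follows the branches of `_normalize_rec_score` shown in the diff, while `weighted_score` is a hypothetical helper that normalizes each block before averaging (the diff averages first and normalizes the result).

```python
from typing import List, Tuple


def normalize_score(score: float) -> float:
    """Map a recognizer score onto [0, 1]; implausible values become 0."""
    try:
        sc = float(score)
    except (TypeError, ValueError):
        return 0.0
    if sc != sc:  # NaN never equals itself
        return 0.0
    if sc > 1.0:
        # Values in (1, 100] are read as percentages; anything larger is rejected.
        return sc / 100.0 if sc <= 100.0 else 0.0
    return max(sc, 0.0)


def weighted_score(blocks: List[Tuple[str, float]]) -> float:
    """Length-weighted mean of per-block scores (hypothetical helper)."""
    total_len = sum(len(text) for text, _ in blocks)
    if total_len == 0:
        return 0.0
    return sum(len(text) * normalize_score(s) for text, s in blocks) / total_len


percent = normalize_score(92.5)
borderline = normalize_score(1.5)
rejected = normalize_score(999)
merged = weighted_score([("2024-01-05", 0.9), ("/", 50.0)])
```

Note the borderline case: a score such as `1.5` falls into the percentage branch and comes out as `0.015`, so a slightly out-of-range value is treated as very low confidence rather than clamped to `1.0`.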

+ 1 - 1
ocr_utils/watermark/presets.py

@@ -113,7 +113,7 @@ def _base_preset(scope: Scope, method: Method) -> Dict[str, Any]:
         if scope == "cell"
         else copy.deepcopy(_CONTRAST_PAGE_DEFAULT)
     )
-    threshold = 175 if scope == "page" else 170
+    threshold = 175 if scope == "page" else 155
     cfg: Dict[str, Any] = {
         "enabled": True,
         "detect_before_remove": scope == "page",

+ 1 - 1
ocr_utils/watermark/processor.py

@@ -45,7 +45,7 @@ class WatermarkProcessor:
 
     @property
     def threshold(self) -> int:
-        return int(self.config.get("threshold", 175))
+        return int(self.config.get("threshold", 155))
 
     @property
     def morph_close_kernel(self) -> int: