1 month ago · 95bfd4baed
--- a/docs/ocr_tools/universal_doc_parser/水印去除技术文档.md
+++ b/docs/ocr_tools/universal_doc_parser/水印去除技术文档.md
@@ -2,12 +2,26 @@
 
				 
			
 
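				 ## Overview
				 
				-The watermark removal module (`ocr_utils/watermark_utils.py`) provides **two independent layers of watermark removal**, each optimized for different document types and scenarios:
				+Watermark removal lives in the `ocr_utils/watermark/` package; `ocr_utils/watermark_utils.py` remains as a backward-compatible entry point (re-export). The core orchestration class is **`WatermarkProcessor`**, which supports two preset families defined in `presets.py`: **page-level (page)** and **cell-level (cell)**.
				 
				-| Layer | Target | Use case | Characteristics |
				-|------|---------|---------|------|
				-| **PDF level** | XObjects in text-based PDFs | Text-based PDFs such as bank statements | Keeps text searchable; lossless |
				-| **Image level** | Pixels of scans / rendered images | Scans, images | Pixel-level processing, suited to pre-OCR preprocessing |
				+Beyond PDF-level and page-level preprocessing, scenarios such as bank statements can also remove watermarks from individual cell crops during the **second-pass OCR of wired tables** (`text_filling.cell_preprocess`).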
			
 
				+
			
 
				+| Layer | Target | Config location | Use case |
				+|------|---------|---------|---------|
				+| **PDF level** | XObjects in text-based PDFs | `input.txt_pdf_watermark_removal` | Text-based PDFs, before rendering |
				+| **Page-level image** | Full-page rendered image | `preprocessor.watermark_removal` | Before page-level OCR of scans (optional) |
				+| **Cell-level image** | Cell crop | `table_recognition_wired.second_pass_ocr.cell_preprocess.watermark` | Before second-pass OCR (cell-first recommended) |
			
 
				+
			
 
				+**Implementation modules:**
				+
				+| Path | Responsibility |
				+|------|------|
				+| `ocr_utils/watermark/presets.py` | Page-level/cell-level presets, `merge_watermark_config` |
				+| `ocr_utils/watermark/removal.py` | `threshold` / `masked_adaptive` watermark removal |
				+| `ocr_utils/watermark/processor.py` | `WatermarkProcessor` facade |
				+| `ocr_utils/watermark/pdf.py` | XObject watermark removal for text-based PDFs |
				+| `models/adapters/wired_table/text_filling.py` | Cell-level preprocessing + second-pass OCR |
				+| `ocr_tools/cell_preprocess_lab/cell_sweep.py` | Single-cell parameter grid sweep (tuning) |
			
 
				 
			
 
				 ---
			
 
				 
			
@@ -35,13 +49,17 @@ graph TB
 
				     F[Image input] --> J[Stage 2: image-level watermark removal]
				     
				     J --> K{watermark_removal enabled?}
				-    K -->|Yes| L[Detect light diagonal watermark]
				+    K -->|Yes| L[WatermarkProcessor page]
				     K -->|No| N
				-    L --> M[Remove watermark by thresholding]
				+    L --> M[method: threshold / masked / masked_adaptive]
				     M --> N[Orientation correction]
				     
				     N --> O[Layout detection]
				-    O --> P[OCR recognition]
				+    O --> P[Table OCR]
				+    P --> Q{Cell-level wm in second-pass OCR?}
				+    Q -->|Yes| R[WatermarkProcessor cell + upscale]
				+    Q -->|No| S
				+    R --> S[In-cell OCR]
			
 
				     
			
 
				     style C fill:#e1f5ff
			
 
				     style E fill:#e1f5ff
			
@@ -186,17 +204,81 @@ def scan_pdf_watermark_xobjs(pdf_bytes: bytes, sample_pages: int = 3) -> bool:
 
				 
			
 
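				 ### Applicable scenarios
				 
				-**Scans/images (`pdf_type='ocr'`)**: the PDF's internal structure cannot be used, so only pixel-level processing of the rendered image is possible.
				+- **Page level**: scans/images (`pdf_type='ocr'`), invoked in `MinerUPreprocessor` via `WatermarkProcessor(scope="page")`.
				+- **Cell level**: before the **second-pass OCR** of wired tables, invoked on `raw_crop` via `WatermarkProcessor(scope="cell")` (configured independently of the page level).
				 
				-### Principle
				+The currently recommended strategy for bank statements is **cell-first**: set page-level `watermark_removal.enabled: false` and focus tuning on cell-level `cell_preprocess.watermark`.
				+
				+### Grayscale convention (required reading)
				+
				+**8-bit grayscale images** in OpenCV / PIL follow one convention:
				+
				+| Gray value | Appearance |
				+|--------|------|
				+| **0** | Black (dark strokes, ink) |
				+| **255** | White (background, paper) |
				+| Intermediate values (e.g. 100-220) | Gray (light watermarks, faint strokes, scan noise) |
				+
				+On a typical scanned bank statement:
				+
				+- **Chinese character strokes**: low gray values (dark, usually well below threshold)
				+- **Paper background / light diagonal watermark**: high gray values (light, usually above threshold)
				+
				+> **Note**: some UIs or Photoshop may display values as "0 = white", but this repository's code follows OpenCV: **0 = black, 255 = white**.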
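
The convention can be checked with a few lines of plain Python. This is a minimal sketch: `bgr_to_gray` is a hypothetical helper that is not part of this repository, and it assumes the BT.601 luma weights that OpenCV uses for BGR to gray conversion.

```python
def bgr_to_gray(b: int, g: int, r: int) -> int:
    """Hypothetical helper: 8-bit luma of one BGR pixel using BT.601 weights."""
    return int(round(0.114 * b + 0.587 * g + 0.299 * r))


black = bgr_to_gray(0, 0, 0)              # ink
white = bgr_to_gray(255, 255, 255)        # paper
light_gray = bgr_to_gray(200, 200, 200)   # e.g. a light watermark pixel
```

Ink sits at the low end and paper at the high end, so a light watermark falls between them, nearer to white.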
			
 
				+
			
 
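				+### Watermark removal methods (`method`)
				+
				+| method | Description | What the YAML needs |
				+|--------|------|-----------|
				+| `threshold` | Global `gray > threshold → 255`; simple and fast | `method` + optional `threshold` |
				+| `masked` | Locates the watermark region with a mask, then processes it | `method` only (detailed parameters come from the preset) |
				+| `masked_adaptive` | Mask + adaptive threshold inside the mask | `method` only (detailed parameters come from the preset) |
				+
				+Preset defaults (`presets.py`, used when the YAML does not override them):
				+
				+| scope | Default `threshold` | Default `contrast_enhancement` |
				+|-------|------------------|----------------------------|
				+| **page** | 175 | enabled |
				+| **cell** | 155 | disabled |
				+
				+`merge_watermark_config(scope, user_cfg)` merges the user YAML with the presets in the table above; detailed parameters such as `mask` / `hough` / `adaptive` need not be written into the scenario YAML.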
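
One plausible way to do such a merge is a recursive overlay of the user dictionary on a copy of the preset. The sketch below is an assumption, not the code in `presets.py`: `PRESETS_SKETCH` and `merge_config_sketch` are hypothetical names, and the preset contents only mirror the two defaults listed in the table.

```python
import copy

# Hypothetical presets that mirror the table; the real presets.py may hold more keys.
PRESETS_SKETCH = {
    "page": {"threshold": 175, "contrast_enhancement": {"enabled": True}},
    "cell": {"threshold": 155, "contrast_enhancement": {"enabled": False}},
}


def merge_config_sketch(scope, user_cfg):
    """Overlay user_cfg on a deep copy of the preset for scope (illustrative only)."""

    def overlay(base, extra):
        for key, value in extra.items():
            if isinstance(value, dict) and isinstance(base.get(key), dict):
                overlay(base[key], value)   # merge nested sections key by key
            else:
                base[key] = value           # user value wins
        return base

    return overlay(copy.deepcopy(PRESETS_SKETCH[scope]), user_cfg or {})


cell_cfg = merge_config_sketch("cell", {"enabled": True, "method": "threshold"})
page_cfg = merge_config_sketch(
    "page", {"threshold": 165, "contrast_enhancement": {"enabled": False}}
)
```

With this shape, a YAML that sets only `enabled` and `method` still ends up with the scope's default `threshold`, and the shared preset is never mutated.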
			
 
				+
			
 
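				+### How the `threshold` method works
				 
				 Watermark characteristics of financial documents such as bank statements:
				 
				-- **Light color**: gray values usually fall between 160 and 220 (between body text and background)
				-- **Diagonal**: usually laid out at 45°
				-- **Sparse text**: watermark text covers a small share of the area
				+- **Light color**: gray values mostly in 160-220 (between body text and white paper)
				+- **Diagonal**: commonly repeated text at 45°
				+- **Small coverage**: a sparse, light texture relative to the whole page/cell
				+
				+Core code (`removal.py`):
				+
				+```python
				+cleaned = gray.copy()
				+cleaned[gray > threshold] = 255   # pixels brighter than the threshold → white
				+```
				+
				+**Semantics**: keep pixels with **gray ≤ threshold** (dark text) and paint **brighter** pixels white, which weakens light watermarks.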
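
As a self-contained illustration of that rule, the sketch below wraps the two lines in a function and runs them on a tiny hand-made array. `threshold_whiten` is a hypothetical name chosen for this example; only the two-line masking step comes from the snippet above.

```python
import numpy as np


def threshold_whiten(gray: np.ndarray, threshold: int) -> np.ndarray:
    """Return a copy of gray where every pixel brighter than threshold is set to 255."""
    cleaned = gray.copy()
    cleaned[gray > threshold] = 255
    return cleaned


# Dark stroke (30), two borderline values (155, 156), watermark-like grays, paper.
gray = np.array([[30, 155, 156], [180, 220, 255]], dtype=np.uint8)
cleaned = threshold_whiten(gray, threshold=155)
```

The comparison is strict, so a pixel equal to the threshold (155 here) is kept while 156 is whitened.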
			
 
				+
			
 
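				+### What raising / lowering `threshold` actually does
				+
				+Rule: a pixel turns white only when `gray > threshold` → **threshold is the dividing line for "how bright counts as background"**.
				 
				-Based on these characteristics, **thresholding** is used: pixels with gray values above the threshold are set to white, keeping the dark body text.
				+| Action | Whitening strength | Pixels painted white | Effect on watermark | Effect on body text / OCR |
				+|------|----------|------------------|--------|----------------|
				+| **Lower** threshold (e.g. 175→155) | **Stronger** | More mid-gray pixels (e.g. 156-175) also turn white | Removed more cleanly | Faint strokes and edges washed out by the watermark may be eaten away; with a cleaner background, det sometimes finds a whole line more easily |
				+| **Raise** threshold (e.g. 155→175) | **Weaker** | Only brighter pixels turn white | Diagonal stripes and light-gray noise tend to remain | More strokes are kept; residual interference can cause fragmented det boxes and high-scoring but wrong short texts |
				+
				+Mnemonic:
				+
				+- **threshold ↓** → removes light tones more aggressively → whiter background, **faint text is easily damaged**
				+- **threshold ↑** → more conservative → **watermark tends to remain**, but dark text is safer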
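
The direction of the effect can be checked numerically. This sketch uses a synthetic strip containing each gray level once (not real scan data) and counts how many pixels the rule `gray > threshold` selects at the two preset values; `selected_count` is a hypothetical helper.

```python
import numpy as np

# Synthetic strip: every gray level from 0 to 255 appears exactly once.
ramp = np.arange(256, dtype=np.uint8).reshape(1, 256)


def selected_count(gray: np.ndarray, threshold: int) -> int:
    """Number of pixels that `gray > threshold` would paint white."""
    return int(np.count_nonzero(gray > threshold))


at_155 = selected_count(ramp, 155)   # levels 156..255
at_175 = selected_count(ramp, 175)   # levels 176..255
extra_when_lowered = at_155 - at_175  # levels 156..175
```

Lowering the threshold from 175 to 155 adds exactly the band 156 to 175 to the whitened set, which is where both watermark residue and faint strokes tend to sit.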
			
 
				+
			
 
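				+Tuning advice (for a single cell, run `cell_sweep.py` **on the original `*_raw.png`**; do not remove the watermark a second time from an already-preprocessed debug image):
				+
				+1. Sweep **155-175** first, judging by whether the OCR text is complete and whether det boxes are stable.
				+2. **Do not look at the rec score alone**: when threshold is too high, high-scoring but wrong short texts can appear (e.g. only "折取款").
				+3. Cell-level and page-level **thresholds may differ** (presets: page=175, cell=155); write each into the YAML according to its own sweep results.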
			
 
				 
			
 
				 ### Watermark detection (`detect_watermark`)
			
 
				 
			
@@ -232,79 +314,136 @@ def detect_watermark(image, midtone_low=100, midtone_high=220, ratio_threshold=0
 
				     return diagonal_count >= 2
			
 
				 ```
			
 
				 
			
 
				-### Watermark removal (`remove_watermark_from_image`)
				+### Watermark removal API
				+
				+**Recommended (unified for page level / cell level):**
				 
				 ```python
				-def remove_watermark_from_image(image, threshold=160, morph_close_kernel=0):
				-    """
				-    Remove light diagonal text watermarks from an image
				-    
				-    Principle:
				-    - Body text is dark black (gray < threshold)
				-    - Watermark is light gray (gray > threshold)
				-    - Pixels above the threshold are set to white (255)
				-    
				-    Args:
				-        threshold: gray threshold, 140-180 recommended, default 160
				-        morph_close_kernel: morphological closing kernel; 0 skips it
				-    """
				-    gray = to_grayscale(image)
				-    
				-    # Threshold: keep the dark body text
				-    cleaned = gray.copy()
				-    cleaned[gray > threshold] = 255
				-    
				-    # Optional: morphological closing to bridge broken characters
				-    if morph_close_kernel > 0:
				-        kernel = np.ones((morph_close_kernel, morph_close_kernel), np.uint8)
				-        cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)
				-    
				-    return cleaned
				+from ocr_utils.watermark import WatermarkProcessor, merge_watermark_config
				+
				+processor = WatermarkProcessor.from_user_config(
				+    {"enabled": True, "method": "threshold", "threshold": 155},
				+    scope="cell",  # or "page"
				+)
				+cleaned_bgr, stages = processor.process(cell_bgr_image, force=True)
				+# stages may include "wm", "contrast", etc., for use in the debug JSON
				+```
				+
				+**Low level (compatible with legacy code):**
				+
				+```python
				+from ocr_utils.watermark import merge_watermark_config, remove_watermark_from_image_rgb
			
 
				+
			
 
				+out = remove_watermark_from_image_rgb(
			
 
				+    image,
			
 
				+    threshold=175,
			
 
				+    watermark_removal_cfg=merge_watermark_config("page", {"method": "threshold"}),
			
 
				+)
			
 
				 ```
			
 
				 
			
 
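				-### Parameters
				+### Parameters (`method: threshold`)
				 
				-| Parameter | Default | Description | Tuning advice |
				-|------|--------|------|---------|
				-| `threshold` | 160 | Gray threshold | 140-180; larger is more conservative (watermark may remain) |
				-| `morph_close_kernel` | 0 | Morphological kernel size | 0 recommended for non-binary images (closing is counterproductive) |
				+| Parameter | page preset | cell preset | Description |
				+|------|-----------|-----------|------|
				+| `threshold` | 175 | 155 | See the raising/lowering section above; **larger is more conservative**, smaller whitens more |
				+| `morph_close_kernel` | 0 | 0 | Closing kernel; **0 = off** (recommended; closing on non-binary images tends to introduce noise) |
				+| `detect_before_remove` | true | false | Page level can detect before removing; cell level usually processes directly with `force=True` |
				+| `contrast_enhancement` | on by default | off by default | `text_restore` after watermark removal; off by default at cell level, enable when needed |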
			
 
				+
			
 
				+---
			
 
				+
			
 
				+## Stage 3: cell-level second-pass OCR preprocessing
				+
				+### Flow
				+
				+```
				+Table image raw_crop
				+  → WatermarkProcessor(cell)     # wm
				+  → optional denoise / contrast  # YAML switches
				+  → upscale (light.upscale_min_side, e.g. 192)
				+  → det line splitting / whole fallback OCR
				+```
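
The upscale step can be pictured as a size calculation. The sketch below is an assumption about what "upscale to a minimum short side" means (keep the aspect ratio, never shrink); `upscale_size_sketch` is a hypothetical helper, and the interpolation and rounding used in `text_filling.py` are not described in this document.

```python
def upscale_size_sketch(height: int, width: int, min_side: int = 192):
    """Return (new_height, new_width) so the shorter side is at least min_side."""
    short = min(height, width)
    if short <= 0:
        raise ValueError("image sides must be positive")
    if short >= min_side:
        return height, width            # already large enough, leave untouched
    scale = min_side / short
    return round(height * scale), round(width * scale)


small_cell = upscale_size_sketch(48, 320)    # a short, wide table cell
large_cell = upscale_size_sketch(200, 300)   # already above the minimum
```

A cell crop 48 pixels high would be scaled by 4 under these assumptions, while a crop whose short side already reaches 192 is passed through unchanged.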
			
 
				+
			
 
				+Debug output (`tablecell_ocr/`):
				+
				+| File | Meaning |
				+|------|------|
				+| `cellNNN_*_*.png` | Preprocessed image fed to OCR |
				+| `cellNNN_*_*_raw.png` | Original crop without watermark removal (input for `cell_sweep` tuning) |
				+| `cellNNN_*_*.json` | Contains `preprocess_stages`, `debug_images`, `lines`/`whole`, etc. |
			
 
				+
			
 
				+### Parameter exploration tool
				+
				+```bash
				+cd ocr_tools/cell_preprocess_lab
				+
				+# Single-cell sweep (automatically prefers *_raw.png)
				+python cell_sweep.py /path/to/cell219_empty_empty_raw.png \
				+  -o ./output/cell219_sweep -t "ATM存折取款"
				+
				+# Batch over a tablecell_ocr directory
				+python cell_sweep.py /path/to/tablecell_ocr/ -o ./sweep_out --quick
				+```
				+
				+The report `sweep_report.json` contains, for each combination, `text`, `score` (weighted recognition score), and `boxes[]` (per-box scores).
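
When reading such a report, ranking by `score` alone can pick a confident but truncated result. The sketch below prefers entries whose `text` equals the expected string and uses `score` only as a tie-breaker. The list-of-dicts layout, the `threshold` key, and the sample values are assumptions made up for illustration; only `text` and `score` are named in the document.

```python
def pick_best_sketch(entries, expected_text):
    """Prefer an exact text match, then the higher score (illustrative only)."""
    return max(entries, key=lambda e: (e["text"] == expected_text, e["score"]))


# Made-up entries: the higher score belongs to the truncated text.
entries = [
    {"threshold": 175, "text": "折取款", "score": 0.98},
    {"threshold": 155, "text": "ATM存折取款", "score": 0.91},
]
best = pick_best_sketch(entries, "ATM存折取款")
```

Here the entry at threshold 155 wins despite its lower score, because it reproduces the full expected text.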
			
 
				 
			
 
				 ---
			
 
				 
			
 
				 ## Configuration
				 
				-### Full configuration example
				+### Full configuration example (`bank_statement_yusys_local.yaml`)
				 
				 ```yaml
				-# Input configuration - PDF-level watermark removal
				 input:
				-  dpi: 200
				   txt_pdf_watermark_removal:
				-    enabled: true        # enable PDF-level watermark removal
				-    sample_pages: 3      # number of pages in the quick pre-scan
				+    enabled: true
				+    sample_pages: 3
				 
				-# Preprocessing configuration - image-level watermark removal
				 preprocessor:
				-  module: "mineru"
				-  orientation_classifier:
				-    enabled: true
				+  order: orient_first
				   watermark_removal:
				-    enabled: true           # enable image-level watermark removal
				-    threshold: 160          # gray threshold
				-    morph_close_kernel: 0   # morphological kernel size (0 recommended)
				+    enabled: false              # cell-first: page level can be off
				+    detect_before_remove: true
				+    method: threshold
				+    threshold: 175              # page-level preset default is 175
				+    contrast_enhancement:
				+      enabled: false
				+
				+table_recognition_wired:
				+  second_pass_ocr:
				+    suspicious_short_min_chars: 4
				+    cell_preprocess:
				+      watermark:
				+        enabled: true
				+        method: threshold
				+        threshold: 155          # best written explicitly; falls back to the cell preset 155 if omitted
				+      denoise:
				+        enabled: false
				+      contrast:
				+        enabled: false          # Pass1 optional text_restore
				+      light:
				+        upscale_min_side: 192
				+    enhance_retry:
				+      enabled: false            # Pass2 enhancement retry (sibling of cell_preprocess)
			
 
				 ```
			
 
				 
			
 
				 ### Configuration reference
				 
				-| Config path | Type | Default | Description |
				-|---------|------|--------|------|
				-| `input.txt_pdf_watermark_removal.enabled` | bool | `false` | PDF-level watermark removal switch |
				-| `input.txt_pdf_watermark_removal.sample_pages` | int | 3 | Pages in the pre-scan |
				-| `preprocessor.watermark_removal.enabled` | bool | `false` | Image-level watermark removal switch |
				-| `preprocessor.watermark_removal.threshold` | int | 160 | Gray threshold |
				-| `preprocessor.watermark_removal.morph_close_kernel` | int | 0 | Morphological kernel size |
				+| Config path | Description |
				+|---------|------|
				+| `input.txt_pdf_watermark_removal.*` | PDF XObject watermark removal |
				+| `preprocessor.watermark_removal.*` | Page-level `WatermarkProcessor(scope=page)` |
				+| `preprocessor.watermark_removal.method` | `threshold` \| `masked` \| `masked_adaptive` |
				+| `preprocessor.watermark_removal.threshold` | `threshold` method only; see the raising/lowering section |
				+| `second_pass_ocr.cell_preprocess.watermark.*` | Cell-level `WatermarkProcessor(scope=cell)` |
				+| `second_pass_ocr.cell_preprocess.light.upscale_min_side` | Upscale the shortest side after watermark removal |
				+| `second_pass_ocr.enhance_retry` | Pass2 preprocessing (sibling of `cell_preprocess`, not a child of it) |
				+
				+**Notes:**
				 
				-**Note**: neither setting has a default; watermark removal is triggered only when `enabled: true` is set explicitly in the YAML.
				+- `morph_close_kernel` is already `0` in the preset, so it generally **does not need to appear in the YAML**.
				+- Cell-level `threshold` **should be set explicitly after a sweep**; do not assume it matches the page level.
				+- Processing runs only when `enabled: true`; the page-level and cell-level switches are independent.
			
 
				 
			
 
				 ---
			
 
				 
			
@@ -337,26 +476,29 @@ if pdf_type == 'ocr':  # condition ①: scans only
				 
				 # mineru_adapter.py: MinerUPreprocessor.process()
				 
				-if config.get('watermark_removal', {}).get('enabled', False):  # condition ②
				-    image = remove_watermark_from_image_rgb(image, threshold=160)
				+processor = WatermarkProcessor.from_user_config(wm_cfg, scope="page")
				+if processor.enabled:
				+    image, _ = processor.process(image)  # internally: detect_before_remove + method
				 ```
				 
				 **Trigger conditions**:
				 1. PDF type is `ocr` (scan)
				 2. `preprocessor.watermark_removal.enabled: true`
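
The two conditions combine into a simple predicate. This is a sketch only: `should_run_page_wm` is a hypothetical function, and it assumes the `preprocessor` section is available as a plain dictionary.

```python
def should_run_page_wm(pdf_type, preprocessor_cfg):
    """True only for scans (pdf_type == 'ocr') with watermark_removal.enabled set."""
    wm_cfg = (preprocessor_cfg or {}).get("watermark_removal") or {}
    return pdf_type == "ocr" and bool(wm_cfg.get("enabled", False))


scan_enabled = should_run_page_wm("ocr", {"watermark_removal": {"enabled": True}})
scan_disabled = should_run_page_wm("ocr", {"watermark_removal": {"enabled": False}})
text_pdf = should_run_page_wm("txt", {"watermark_removal": {"enabled": True}})
no_section = should_run_page_wm("ocr", {})
```

Only the first call returns `True`; a missing `watermark_removal` section behaves like `enabled: false`.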
			
 
				 
			
 
				+**Cell-level second-pass OCR** (`text_filling.py`): when the table body triggers second-pass OCR, `raw_crop` goes through `_preprocess_cell_for_ocr` → `WatermarkProcessor(scope="cell")`.
			
 
				+
			
 
				 ---
			
 
				 
			
 
				-## Two-stage comparison
				+## Comparison of layers
				 
				-| Dimension | Stage 1 (PDF level) | Stage 2 (image level) |
				-|------|------------------|-----------------|
				-| **Target** | Text-based PDF | Scans/images |
				-| **Processing level** | PDF XObject | Image pixels |
				-| **Keeps text searchable** | ✅ Yes | ❌ No |
				-| **Lossless** | ✅ Yes | ❌ No (pixels modified) |
				-| **When** | Before rendering | After rendering, before detection |
				-| **Dependencies** | PyMuPDF (fitz) | OpenCV, NumPy |
				+| Dimension | PDF level | Page-level image | Cell-level image |
				+|------|----------|----------|----------|
				+| **Target** | XObjects in text-based PDFs | Full-page rendered image | Cell crop |
				+| **Config** | `input.txt_pdf_*` | `preprocessor.watermark_removal` | `second_pass_ocr.cell_preprocess.watermark` |
				+| **Default threshold preset** | - | 175 | 155 |
				+| **Keeps the PDF text layer** | ✅ | - | - |
				+| **When** | Before rendering | Before Layout/OCR | Before in-cell second-pass OCR |
				+| **Dependencies** | PyMuPDF | OpenCV | OpenCV |
			
 
				 
			
 
				 ---
			
 
				 
			
@@ -367,9 +509,9 @@ if config.get('watermark_removal', {}).get('enabled', False):  # condition ②
 
				 ```python
			
 
				 # pipeline_manager_v2.py
			
 
				 
			
 
				-from ocr_utils.watermark_utils import (
			
 
				+from ocr_utils.watermark import (
			
 
				     scan_pdf_watermark_xobjs,
			
 
				-    remove_txt_pdf_watermark
			
 
				+    remove_txt_pdf_watermark,
			
 
				 )
			
 
				 
			
 
				 class EnhancedDocPipeline:
			
@@ -400,23 +542,29 @@ class EnhancedDocPipeline:
 
				 ```python
			
 
				 # models/adapters/mineru_adapter.py
			
 
				 
			
 
				-from ocr_utils.watermark_utils import remove_watermark_from_image_rgb
			
 
				+from ocr_utils.watermark import WatermarkProcessor
			
 
				 
			
 
				 class MinerUPreprocessor:
			
 
				     def process(self, image):
			
 
				-        # Image-level watermark removal (before orientation correction)
				-        if self.config.get('watermark_removal', {}).get('enabled', False):
				-            threshold = self.config.get('watermark_removal', {}).get('threshold', 160)
				-            image = remove_watermark_from_image_rgb(image, threshold=threshold)
				-        
				-        # Orientation correction
				-        if self.orientation_classifier:
				-            angle = self.orientation_classifier.predict(image)
				-            image = self._apply_rotation(image, angle)
				-        
				+        wm_cfg = self.config.get("watermark_removal") or {}
				+        processor = WatermarkProcessor.from_user_config(wm_cfg, scope="page")
				+        if processor.enabled:
				+            image, _ = processor.process(image)
				+        # Orientation correction ...
			
 
				         return image, angle
			
 
				 ```
			
 
				 
			
 
				+### Cell-level second-pass OCR integration
				+
				+```python
				+# models/adapters/wired_table/text_filling.py
				+
				+self._cell_wm_processor = WatermarkProcessor.from_user_config(wm_user, scope="cell")
				+
				+cell_img, stages = self._preprocess_cell_for_ocr(raw_crop, mode="light")
				+# example stages: ["wm", "upscale"] or ["wm", "contrast", "upscale"]
			
 
				+```
			
 
				+
			
 
				 ---
			
 
				 
			
 
				 ## Usage examples
			
@@ -475,17 +623,21 @@ python main_v2.py -i doc.pdf -c config.yaml --scene bank_statement --debug
 
				 
			
 
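				 ## Notes
				 
				-1. **The two stages are complementary**: stage 1 handles text-based PDFs and stage 2 handles scans, so they never both run on the same document
				-2. **Threshold choice**: `threshold=160` suits most bank statements; raise it moderately if light text is wrongly removed
				-3. **Morphology**: `morph_close_kernel=0` is the recommended value; closing on non-binary images may introduce noise
				-4. **Large-file optimization**: `sample_pages=3` gives a quick pre-scan and avoids fully processing large files that have no watermark
				-5. **Dependencies**: PDF-level removal needs `PyMuPDF`; image-level removal needs `OpenCV`
				+1. **Three complementary layers**: PDF level, page level and cell level can be switched independently; **cell-first** is recommended for bank statements (page-level wm off, cell-level wm on).
				+2. **Grayscale direction**: **0 = black, 255 = white**; `gray > threshold → 255` means pixels "brighter than the threshold" are painted white.
				+3. **Threshold direction**: **raising** is more conservative (watermark tends to remain, less damage to text); **lowering** is more aggressive (cleaner background, faint strokes easily eaten). Tune page level and cell level separately.
				+4. **Do not confuse it with the det threshold**: `ocr_recognition.det_threshold` filters OCR detection boxes and is unrelated to the watermark-removal `threshold`.
				+5. **Tuning input**: `cell_sweep.py` should be given `*_raw.png` (the original crop); do not sweep an already-preprocessed `cell*_empty_empty.png` (that removes the watermark twice and distorts the conclusions).
				+6. **Morphology**: `morph_close_kernel=0` in the preset; enabling closing on non-binary images is not recommended.
				+7. **Dependencies**: PDF level needs `PyMuPDF`; image level needs `OpenCV`.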
			
 
				 
			
 
				 ---
			
 
				 
			
 
				 ## References
				 
				-- `ocr_utils/watermark_utils.py` - watermark utility functions
				-- `core/pipeline_manager_v2.py` - pipeline integration
				-- `models/adapters/mineru_adapter.py` - preprocessor integration
				-- `config/bank_statement_*.yaml` - configuration examples
				+- `ocr_utils/watermark/` - implementation package (presets / removal / processor / pdf)
				+- `ocr_utils/watermark_utils.py` - backward-compatible re-export
				+- `ocr_tools/cell_preprocess_lab/cell_sweep.py` - cell-level parameter sweep
				+- `models/adapters/mineru_adapter.py` - page-level preprocessing
				+- `models/adapters/wired_table/text_filling.py` - cell-level second-pass OCR
				+- `config/bank_statement_yusys_local.yaml` - scenario configuration example