1 month ago · 95bfd4baed
--- a/docs/ocr_tools/universal_doc_parser/水印去除技术文档.md
+++ b/docs/ocr_tools/universal_doc_parser/水印去除技术文档.md
@@ -2,12 +2,26 @@
 
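 ## Overview
-The watermark removal module (`ocr_utils/watermark_utils.py`) provides **two independent layers of watermark removal**, each optimized for different document types and scenarios:
+Watermark removal lives in the `ocr_utils/watermark/` package; `ocr_utils/watermark_utils.py` remains as a backward-compatible entry point (re-export). The core orchestration class is **`WatermarkProcessor`**, which supports two preset families defined in `presets.py`: **page-level (page)** and **cell-level (cell)**.
-| Level | Target | Use case | Characteristics |
-|------|---------|---------|------|
-| **PDF level** | XObjects of text-based PDFs | Text-based PDFs such as bank statements | Keeps text searchable; lossless |
-| **Image level** | Pixels of scans / rendered images | Scans, images | Pixel-level processing, suited to pre-OCR preprocessing |
+In addition to PDF-level and page-level preprocessing, scenarios such as bank statements can remove watermarks again from individual cell crops during the **second-pass OCR of wired tables** (`text_filling.cell_preprocess`).
+
+| Level | Target | Config location | Use case |
+|------|---------|---------|---------|
+| **PDF level** | XObjects of text-based PDFs | `input.txt_pdf_watermark_removal` | Text-based PDFs, before rendering |
+| **Page-level image** | Full-page rendered image | `preprocessor.watermark_removal` | Before page-level OCR of scans (optional) |
+| **Cell-level image** | Cell crop | `table_recognition_wired.second_pass_ocr.cell_preprocess.watermark` | Before second-pass OCR (cell-first recommended) |
+
+**Implementation modules:**
+
+| Path | Responsibility |
+|------|------|
+| `ocr_utils/watermark/presets.py` | Page-level / cell-level presets, `merge_watermark_config` |
+| `ocr_utils/watermark/removal.py` | `threshold` / `masked_adaptive` watermark removal |
+| `ocr_utils/watermark/processor.py` | `WatermarkProcessor` facade |
+| `ocr_utils/watermark/pdf.py` | XObject watermark removal for text-based PDFs |
+| `models/adapters/wired_table/text_filling.py` | Cell-level preprocessing + second-pass OCR |
+| `ocr_tools/cell_preprocess_lab/cell_sweep.py` | Single-cell parameter grid sweep (tuning) |
 ---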
@@ -35,13 +49,17 @@ graph TB
 
     F[Image input] --> J[Stage 2: image-level watermark removal]
     J --> K{watermark_removal enabled?}
-    K -->|Yes| L[Detect light diagonal watermark]
+    K -->|Yes| L[WatermarkProcessor page]
     K -->|No| N
-    L --> M[Remove watermark by thresholding]
+    L --> M[method: threshold / masked / masked_adaptive]
     M --> N[Orientation correction]
     N --> O[Layout detection]
-    O --> P[OCR recognition]
+    O --> P[Table OCR]
+    P --> Q{Cell-level wm for second-pass OCR?}
+    Q -->|Yes| R[WatermarkProcessor cell + upscale]
+    Q -->|No| S
+    R --> S[In-cell OCR]
     style C fill:#e1f5ff
     style E fill:#e1f5ff
@@ -186,17 +204,81 @@ def scan_pdf_watermark_xobjs(pdf_bytes: bytes, sample_pages: int = 3) -> bool:
 
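 ### Applicable scenarios
-**Scans / images (`pdf_type='ocr'`)**: these cannot be handled through the PDF's internal structure; only pixel-level processing of the rendered image is possible.
+- **Page level**: scans / images (`pdf_type='ocr'`), invoked in `MinerUPreprocessor` via `WatermarkProcessor(scope="page")`.
+- **Cell level**: before the **second-pass OCR** of wired tables, invoked on `raw_crop` via `WatermarkProcessor(scope="cell")` (configured independently of the page level).
-### Principle
+The currently recommended strategy for bank statements is **cell-first**: set page-level `watermark_removal.enabled: false` and focus tuning on the cell-level `cell_preprocess.watermark`.
+
+### Grayscale value convention (required reading)
+
+**8-bit grayscale images** in OpenCV / PIL follow one convention:
+
+| Gray value | Appearance |
+|--------|------|
+| **0** | Black (dark strokes, ink) |
+| **255** | White (background, paper) |
+| Intermediate values (e.g. 100-220) | Gray (light watermarks, faint strokes, scan noise) |
+
+On a typical scanned bank statement:
+
+- **Chinese character strokes**: low gray values (dark, usually far below threshold)
+- **Paper background / light diagonal watermark**: high gray values (light, usually above threshold)
+
+> **Note**: some UIs or Photoshop may display values as "0 = white", but this repository's code follows OpenCV: **0 = black, 255 = white**.
+
+### Watermark removal methods (`method`)
+
+| method | Description | Required in YAML |
+|--------|------|-----------|
+| `threshold` | Global `gray > threshold → 255`; simple and fast | `method` + optional `threshold` |
+| `masked` | Locates the watermark region with a mask, then processes it | `method` only (fine-grained parameters come from the preset) |
+| `masked_adaptive` | Mask + adaptive threshold inside the mask | `method` only (fine-grained parameters come from the preset) |
+
+Preset defaults (`presets.py`, used when the YAML does not override them):
+
+| scope | Default `threshold` | Default `contrast_enhancement` |
+|-------|------------------|----------------------------|
+| **page** | 175 | enabled |
+| **cell** | 155 | disabled |
+
+`merge_watermark_config(scope, user_cfg)` merges the user's YAML with the presets in the table above; fine-grained parameters such as `mask` / `hough` / `adaptive` do not need to appear in the scenario YAML.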
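The preset-plus-override behaviour described for `merge_watermark_config` can be pictured with a small stand-alone sketch. Everything below is hypothetical: `PRESETS_SKETCH` and `merge_config_sketch` are illustrative names rather than the contents of `presets.py`, the recursive merge of nested keys is an assumption, and only the `threshold` and `contrast_enhancement` defaults are taken from the preset table.

```python
from copy import deepcopy

# Hypothetical stand-in for presets.py; only these two defaults come from the doc.
PRESETS_SKETCH = {
    "page": {"threshold": 175, "contrast_enhancement": {"enabled": True}},
    "cell": {"threshold": 155, "contrast_enhancement": {"enabled": False}},
}


def merge_config_sketch(scope, user_cfg):
    """Overlay user_cfg on the preset for `scope`, merging nested dicts key by key."""

    def overlay(base, override):
        merged = deepcopy(base)
        for key, value in (override or {}).items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = overlay(merged[key], value)  # merge nested section
            else:
                merged[key] = deepcopy(value)              # user value wins
        return merged

    return overlay(PRESETS_SKETCH[scope], user_cfg)


# YAML that only sets `enabled` inherits the cell preset threshold.
cell_cfg = merge_config_sketch("cell", {"enabled": True})
# YAML that overrides threshold and contrast keeps the user's values.
page_cfg = merge_config_sketch(
    "page", {"threshold": 165, "contrast_enhancement": {"enabled": False}}
)
print(cell_cfg["threshold"], page_cfg["threshold"])
```

Under these assumptions a scenario YAML only needs the keys it wants to change; the preset dictionaries themselves are never mutated.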
														
 
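+
+### How the `threshold` method works
 Watermark characteristics in financial documents such as bank statements:
-- **Light color**: gray values usually fall between 160 and 220 (between body text and background)
-- **Diagonal**: usually laid out at a 45° angle
-- **Sparse text**: watermark text covers a small share of the area
+- **Light color**: gray values mostly 160-220 (between body text and white paper)
+- **Diagonal**: commonly repeated text at a 45° angle
+- **Small share of the area**: a sparse, light texture relative to the whole page / cell
+
+Core code (`removal.py`):
+
+```python
+cleaned = gray.copy()
+cleaned[gray > threshold] = 255   # pixels brighter than the threshold → white
+```
+
+**Semantics**: pixels with **gray ≤ threshold** (dark text) are kept, and **brighter** pixels are painted white like the paper, which weakens light watermarks.
+
+### What raising / lowering `threshold` actually does
+
+Decision rule: a pixel turns white only when `gray > threshold` → **threshold is the cutoff for "how bright counts as background"**.
-Based on these characteristics, **thresholding** is used: pixels whose gray value exceeds the threshold are set to white, and the dark body text is kept.
+| Action | Whitening strength | Pixel range painted white | Effect on watermark | Effect on body text / OCR |
+|------|----------|------------------|--------|----------------|
+| **Lower** threshold (e.g. 175→155) | **Stronger** | More mid-gray pixels (e.g. 156-175) also turn white | Removed more cleanly | Faint strokes and edges washed out by the watermark may be eroded; with a cleaner background, det sometimes finds a whole line more easily |
+| **Raise** threshold (e.g. 155→175) | **Weaker** | Only brighter pixels turn white | Diagonal stripes and light-gray noise tend to remain | More of each stroke is kept; residual interference can cause fragmented det boxes and high-scoring but wrong short text |
+
+Mnemonic:
+
+- **threshold ↓** → removes light colors more aggressively → whiter background, **faint text is easily damaged**
+- **threshold ↑** → more conservative → **watermark tends to remain**, but dark text is safer
+
+Tuning advice (for a single cell, run `cell_sweep.py` on the **original `*_raw.png` image**; do not run watermark removal a second time on an already preprocessed debug image):
+
+1. Sweep **155-175** first, judging by whether the OCR text is complete and whether the det boxes are stable.
+2. **Do not look at the rec score alone**: when threshold is too high, high-scoring but wrong short text can appear (e.g. only 「折取款」).
+3. Cell-level and page-level **thresholds may differ** (presets: page=175, cell=155); write each into the YAML according to its own sweep results.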
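To make the raise / lower trade-off concrete, here is a minimal pure-Python sketch of the same `gray > threshold → 255` rule applied to one synthetic pixel row. The helper `whiten_above` and the pixel values are made up for illustration; the project itself operates on NumPy arrays in `removal.py`.

```python
def whiten_above(gray_rows, threshold):
    """Set every 8-bit pixel brighter than `threshold` to 255 (white); keep the rest."""
    # 256-entry lookup table: values <= threshold map to themselves, brighter ones to 255.
    table = bytes(v if v <= threshold else 255 for v in range(256))
    return [bytes(row).translate(table) for row in gray_rows]


# Synthetic row: dark stroke (40), faint stroke (160), watermark (170, 200), paper (250).
row = [40, 160, 170, 200, 250]

at_175 = list(whiten_above([row], 175)[0])  # page preset: conservative
at_155 = list(whiten_above([row], 155)[0])  # cell preset: more aggressive

print(at_175)  # [40, 160, 170, 255, 255]
print(at_155)  # [40, 255, 255, 255, 255]
```

At 175 the faint stroke (160) survives but so does the darker part of the watermark (170); at 155 both are painted white. This is the "cleaner background versus eroded faint text" trade-off in miniature.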
														
 
 ### Watermark detection (`detect_watermark`)
														
@@ -232,79 +314,136 @@ def detect_watermark(image, midtone_low=100, midtone_high=220, ratio_threshold=0
 
     return diagonal_count >= 2
 ```
-### Watermark removal (`remove_watermark_from_image`)
+### Watermark removal API
+
+**Recommended (same API for page level and cell level):**
 ```python
-def remove_watermark_from_image(image, threshold=160, morph_close_kernel=0):
-    """
-    Remove light diagonal text watermarks from an image
-    
-    Principle:
-    - Body text is dark black (gray < threshold)
-    - Watermark is light gray (gray > threshold)
-    - Pixels above the threshold are set to white (255)
-    
-    Args:
-        threshold: gray threshold, 140-180 recommended, default 160
-        morph_close_kernel: kernel size for morphological closing; 0 skips it
-    """
-    gray = to_grayscale(image)
-    
-    # Thresholding: keep the dark body text
-    cleaned = gray.copy()
-    cleaned[gray > threshold] = 255
-    
-    # Optional: morphological closing to bridge broken characters
-    if morph_close_kernel > 0:
-        kernel = np.ones((morph_close_kernel, morph_close_kernel), np.uint8)
-        cleaned = cv2.morphologyEx(cleaned, cv2.MORPH_CLOSE, kernel)
-    
-    return cleaned
+from ocr_utils.watermark import WatermarkProcessor, merge_watermark_config
+
+processor = WatermarkProcessor.from_user_config(
+    {"enabled": True, "method": "threshold", "threshold": 155},
+    scope="cell",  # or "page"
+)
+cleaned_bgr, stages = processor.process(cell_bgr_image, force=True)
+# stages may contain "wm", "contrast", etc., for use in the debug JSON
+```
+
+**Low level (compatible with legacy code):**
+
+```python
+from ocr_utils.watermark import remove_watermark_from_image_rgb
+
+out = remove_watermark_from_image_rgb(
+    image,
+    threshold=175,
+    watermark_removal_cfg=merge_watermark_config("page", {"method": "threshold"}),
+)
 ```
-### Parameters
+### Parameters (`method: threshold`)
-| Parameter | Default | Description | Tuning advice |
-|------|--------|------|---------|
-| `threshold` | 160 | Gray threshold | 140-180; larger is more conservative (watermark may remain) |
-| `morph_close_kernel` | 0 | Morphological kernel size | Set to 0 for non-binary images (closing is counterproductive) |
+| Parameter | page preset | cell preset | Description |
+|------|-----------|-----------|------|
+| `threshold` | 175 | 155 | See "raising / lowering" above; **larger is more conservative**, smaller whitens more strongly |
+| `morph_close_kernel` | 0 | 0 | Closing kernel; **0 = off** (recommended; closing on non-binary images tends to introduce noise) |
+| `detect_before_remove` | true | false | Page level can detect before removing; cell level usually processes directly with `force=True` |
+| `contrast_enhancement` | on by default | off by default | `text_restore` after watermark removal; off by default at cell level, enable when needed |
+
+---
+
+## Stage 3: cell-level preprocessing for second-pass OCR
+
+### Flow
+
+```
+Table image raw_crop
+  → WatermarkProcessor(cell)     # wm
+  → optional denoise / contrast  # YAML switches
+  → upscale (light.upscale_min_side, e.g. 192)
+  → det line splitting / whole fallback OCR
+```
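The upscale step in the flow above can be sketched as a pure size calculation. This is a guess at the intent of `light.upscale_min_side`, not the code in `text_filling.py`: the helper name `upscale_size`, the aspect-preserving scale and the "never shrink" rule are all assumptions.

```python
def upscale_size(width, height, min_side):
    """Return (new_width, new_height) whose shortest side is at least `min_side`."""
    shortest = min(width, height)
    if shortest >= min_side:
        return width, height              # already large enough: leave untouched
    scale = min_side / shortest           # uniform scale keeps the aspect ratio
    return round(width * scale), round(height * scale)


# A thin 300x48 cell crop is scaled 4x so its height reaches 192.
print(upscale_size(300, 48, 192))   # (1200, 192)
# A crop whose shortest side is already >= 192 is returned unchanged.
print(upscale_size(400, 200, 192))  # (400, 200)
```

The resulting size would then be handed to an image resize call; the interpolation mode the project uses is not stated in this document.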
														
 
+
+Debug output (`tablecell_ocr/`):
+
+| File | Meaning |
+|------|------|
+| `cellNNN_*_*.png` | Preprocessed image fed to OCR |
+| `cellNNN_*_*_raw.png` | Original crop before watermark removal (input for `cell_sweep` tuning) |
+| `cellNNN_*_*.json` | Contains `preprocess_stages`, `debug_images`, `lines`/`whole`, etc. |
+
+### Parameter exploration tool
+
+```bash
+cd ocr_tools/cell_preprocess_lab
+
+# Sweep a single cell (automatically prefers *_raw.png)
+python cell_sweep.py /path/to/cell219_empty_empty_raw.png \
+  -o ./output/cell219_sweep -t "ATM存折取款"
+
+# Batch over a tablecell_ocr directory
+python cell_sweep.py /path/to/tablecell_ocr/ -o ./sweep_out --quick
+```
+
+The report `sweep_report.json` records, for each parameter combination, `text`, `score` (weighted recognition score) and `boxes[]` (per-box scores).
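One way to act on the "do not look at the rec score alone" advice when reading a sweep report is to rank combinations by text correctness first and score second. The sketch below assumes a flat list of entries with a `threshold` key next to `text` and `score`; that layout, the helper `pick_best` and the sample numbers are invented for illustration and are not the actual `sweep_report.json` schema.

```python
def pick_best(entries, expected_text):
    """Prefer entries whose text matches exactly; break ties by score."""
    exact = [e for e in entries if e["text"] == expected_text]
    pool = exact or entries               # fall back to all entries if none match
    return max(pool, key=lambda e: e["score"])


# Synthetic sweep results: the highest score belongs to a truncated, wrong text.
entries = [
    {"threshold": 155, "text": "ATM存折取款", "score": 0.93},
    {"threshold": 165, "text": "ATM存折取款", "score": 0.95},
    {"threshold": 175, "text": "折取款", "score": 0.98},
]

best = pick_best(entries, "ATM存折取款")
print(best["threshold"])  # 165
```

Sorting by `score` alone would have picked threshold 175 here, which is exactly the high-scoring short-text failure the tuning advice warns about.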
														
 
 ---
 ## Configuration
-### Full configuration example
+### Full configuration example (`bank_statement_yusys_local.yaml`)
 ```yaml
-# Input configuration - PDF-level watermark removal
 input:
-  dpi: 200
   txt_pdf_watermark_removal:
-    enabled: true        # whether to enable PDF-level watermark removal
-    sample_pages: 3      # number of pages in the quick pre-scan
+    enabled: true
+    sample_pages: 3
-# Preprocessing configuration - image-level watermark removal
 preprocessor:
-  module: "mineru"
-  orientation_classifier:
-    enabled: true
+  order: orient_first
   watermark_removal:
-    enabled: true           # whether to enable image-level watermark removal
-    threshold: 160          # gray threshold
-    morph_close_kernel: 0   # morphological kernel size (0 recommended)
+    enabled: false              # cell-first: page level can be off
+    detect_before_remove: true
+    method: threshold
+    threshold: 175              # page preset default is 175
+    contrast_enhancement:
+      enabled: false
+
+table_recognition_wired:
+  second_pass_ocr:
+    suspicious_short_min_chars: 4
+    cell_preprocess:
+      watermark:
+        enabled: true
+        method: threshold
+        threshold: 155          # best written explicitly; if omitted, the cell preset 155 applies
+      denoise:
+        enabled: false
+      contrast:
+        enabled: false          # Pass1 optional text_restore
+      light:
+        upscale_min_side: 192
+    enhance_retry:
+      enabled: false            # Pass2 enhanced retry (sibling of cell_preprocess)
 ```
 ### Configuration reference
-| Config path | Type | Default | Description |
-|---------|------|--------|------|
-| `input.txt_pdf_watermark_removal.enabled` | bool | `false` | PDF-level watermark removal switch |
-| `input.txt_pdf_watermark_removal.sample_pages` | int | 3 | Number of pages to pre-scan |
-| `preprocessor.watermark_removal.enabled` | bool | `false` | Image-level watermark removal switch |
-| `preprocessor.watermark_removal.threshold` | int | 160 | Gray threshold |
-| `preprocessor.watermark_removal.morph_close_kernel` | int | 0 | Morphological kernel size |
+| Config path | Description |
+|---------|------|
+| `input.txt_pdf_watermark_removal.*` | PDF XObject watermark removal |
+| `preprocessor.watermark_removal.*` | Page-level `WatermarkProcessor(scope=page)` |
+| `preprocessor.watermark_removal.method` | `threshold` \| `masked` \| `masked_adaptive` |
+| `preprocessor.watermark_removal.threshold` | `threshold` method only; see "raising / lowering" |
+| `second_pass_ocr.cell_preprocess.watermark.*` | Cell-level `WatermarkProcessor(scope=cell)` |
+| `second_pass_ocr.cell_preprocess.light.upscale_min_side` | Upscale of the shortest side after watermark removal |
+| `second_pass_ocr.enhance_retry` | Pass2 preprocessing (a sibling of `cell_preprocess`, not a child of it) |
+
+**Notes:**
-**Note**: neither setting has a default; `enabled: true` must be set explicitly in the YAML for it to trigger.
+- `morph_close_kernel` is already `0` in the preset, so it generally **does not need to appear in the YAML**.
+- Cell-level `threshold` **should be set explicitly after a sweep**; do not assume it matches the page level.
+- Processing runs only when `enabled: true`; the page-level and cell-level switches are independent of each other.
 ---
@@ -337,26 +476,29 @@ if pdf_type == 'ocr':  # condition ①: scanned documents only
 
 # mineru_adapter.py: MinerUPreprocessor.process()
-if config.get('watermark_removal', {}).get('enabled', False):  # condition ②
-    image = remove_watermark_from_image_rgb(image, threshold=160)
+processor = WatermarkProcessor.from_user_config(wm_cfg, scope="page")
+if processor.enabled:
+    image, _ = processor.process(image)  # internally: detect_before_remove + method
 ```
 **Trigger conditions**:
 1. PDF type is `ocr` (scanned document)
 2. `preprocessor.watermark_removal.enabled: true`
+**Cell-level second-pass OCR** (`text_filling.py`): when the table body triggers second-pass OCR, `raw_crop` is passed to `_preprocess_cell_for_ocr` → `WatermarkProcessor(scope="cell")`.
+
 ---
-## Comparison of the two stages
+## Comparison across levels
-| Dimension | Stage 1 (PDF level) | Stage 2 (image level) |
-|------|------------------|-----------------|
-| **Target** | Text-based PDFs | Scans / images |
-| **Processing layer** | PDF XObject | Image pixels |
-| **Keeps text searchable** | ✅ Yes | ❌ No |
-| **Lossless** | ✅ Yes | ❌ No (pixels are modified) |
-| **When it runs** | Before rendering | After rendering, before detection |
-| **Dependencies** | PyMuPDF (fitz) | OpenCV, NumPy |
+| Dimension | PDF level | Page-level image | Cell-level image |
+|------|----------|----------|----------|
+| **Target** | XObjects of text-based PDFs | Full-page rendered image | Cell crop |
+| **Config** | `input.txt_pdf_*` | `preprocessor.watermark_removal` | `second_pass_ocr.cell_preprocess.watermark` |
+| **Default threshold preset** | n/a | 175 | 155 |
+| **Keeps PDF text layer** | ✅ | n/a | n/a |
+| **When it runs** | Before rendering | Before layout / OCR | Before in-cell second-pass OCR |
+| **Dependencies** | PyMuPDF | OpenCV | OpenCV |
 ---
@@ -367,9 +509,9 @@ if config.get('watermark_removal', {}).get('enabled', False):  # condition ②
 
 ```python
 # pipeline_manager_v2.py
-from ocr_utils.watermark_utils import (
+from ocr_utils.watermark import (
     scan_pdf_watermark_xobjs,
-    remove_txt_pdf_watermark
+    remove_txt_pdf_watermark,
 )
 class EnhancedDocPipeline:
@@ -400,23 +542,29 @@ class EnhancedDocPipeline:
 
 ```python
 # models/adapters/mineru_adapter.py
-from ocr_utils.watermark_utils import remove_watermark_from_image_rgb
+from ocr_utils.watermark import WatermarkProcessor
 class MinerUPreprocessor:
     def process(self, image):
-        # Image-level watermark removal (before orientation correction)
-        if self.config.get('watermark_removal', {}).get('enabled', False):
-            threshold = self.config.get('watermark_removal', {}).get('threshold', 160)
-            image = remove_watermark_from_image_rgb(image, threshold=threshold)
-
-        # Orientation correction
-        if self.orientation_classifier:
-            angle = self.orientation_classifier.predict(image)
-            image = self._apply_rotation(image, angle)
-
+        wm_cfg = self.config.get("watermark_removal") or {}
+        processor = WatermarkProcessor.from_user_config(wm_cfg, scope="page")
+        if processor.enabled:
+            image, _ = processor.process(image)
+        # Orientation correction ...
         return image, angle
 ```
+### Cell-level second-pass OCR integration
+
+```python
+# models/adapters/wired_table/text_filling.py
+
+self._cell_wm_processor = WatermarkProcessor.from_user_config(wm_user, scope="cell")
+
+cell_img, stages = self._preprocess_cell_for_ocr(raw_crop, mode="light")
+# example stages: ["wm", "upscale"] or ["wm", "contrast", "upscale"]
+```
+
 ---
 ## Usage examples
@@ -475,17 +623,21 @@ python main_v2.py -i doc.pdf -c config.yaml --scene bank_statement --debug
 
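 ## Notes
-1. **The two stages are complementary**: stage 1 handles text-based PDFs and stage 2 handles scans, so they never both run on the same document
-2. **Choosing the threshold**: `threshold=160` suits most bank statements; raise it if light text is being removed by mistake
-3. **Morphology**: `morph_close_kernel=0` is the recommended value; closing on non-binary images can introduce noise
-4. **Large-file optimization**: `sample_pages=3` does a quick pre-scan and avoids full processing of large files without watermarks
-5. **Dependencies**: PDF-level removal needs `PyMuPDF`; image-level removal needs `OpenCV`
+1. **Three complementary layers**: the PDF level, page level and cell level can each be switched on or off independently; **cell-first** is recommended for bank statements (page-level wm off, cell-level wm on).
+2. **Grayscale direction**: **0 = black, 255 = white**; `gray > threshold → 255` means pixels "brighter than the threshold" are painted white.
+3. **Threshold direction**: **raising** it is more conservative (watermark tends to remain, text is damaged less); **lowering** it is more aggressive (cleaner background, faint strokes are easily eroded). Tune the page level and the cell level separately.
+4. **Do not confuse it with the det threshold**: `ocr_recognition.det_threshold` filters OCR detection boxes and is unrelated to the watermark-removal `threshold`.
+5. **Tuning input**: give `cell_sweep.py` the `*_raw.png` files (original crops); do not sweep the already preprocessed `cell*_empty_empty.png` (that amounts to removing the watermark twice and distorts the conclusions).
+6. **Morphology**: the preset sets `morph_close_kernel=0`; enabling closing on non-binary images is not recommended.
+7. **Dependencies**: the PDF level needs `PyMuPDF`; the image levels need `OpenCV`.
 ---
 ## References
-- `ocr_utils/watermark_utils.py` - watermark utility function implementation
-- `core/pipeline_manager_v2.py` - pipeline integration
-- `models/adapters/mineru_adapter.py` - preprocessor integration
-- `config/bank_statement_*.yaml` - configuration examples
+- `ocr_utils/watermark/` - implementation package (presets / removal / processor / pdf)
+- `ocr_utils/watermark_utils.py` - backward-compatible re-export
+- `ocr_tools/cell_preprocess_lab/cell_sweep.py` - cell-level parameter sweep
+- `models/adapters/mineru_adapter.py` - page-level preprocessing
+- `models/adapters/wired_table/text_filling.py` - cell-level second-pass OCR
+- `config/bank_statement_yusys_local.yaml` - scenario configuration example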