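# Watermark Removal Technical Documentation

## Overview

Watermark removal lives in the `ocr_utils/watermark/` package; the backward-compatible public entry point is `ocr_utils/watermark_utils.py` (a re-export). The core orchestration class is **`WatermarkProcessor`**, which supports two sets of presets defined in `presets.py`: **page-level (page)** and **cell-level (cell)**.

Besides PDF-level and page-level preprocessing, scenarios such as bank statements can run watermark removal again on each cropped cell image during the **second-pass OCR of wired tables** (`text_filling.cell_preprocess`).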

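### Two Completely Different Watermark Mechanisms

**Key insight**: watermarks in text-based PDFs and in image PDFs are two entirely different mechanisms, and their configurations must not be mixed.

| Dimension | Text-based PDF (`pdf_type='txt'`) | Image PDF (`pdf_type='ocr'`) |
|------|---------------------------|---------------------------|
| **Watermark form** | Form/Image XObject inside the PDF | Rasterized light-colored text texture |
| **Removal principle** | Manipulate the PDF byte stream (clear or blank out the XObject), then render | OpenCV pixel-level thresholding |
| **Threshold required** | **No** | Yes (`threshold` and related parameters) |
| **Config location** | `input.txt_pdf_watermark_removal` | `preprocessor.watermark_removal` |
| **Recommended default** | **On** (`enabled: true`) | **Off** (`enabled: false`; works poorly on image PDFs) |
| **Implementation file** | `ocr_utils/watermark/pdf.py` | `ocr_utils/watermark/removal.py` |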

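### Three Layers of Watermark Protection

| Layer | Target | Config location | Use case |
|------|---------|---------|---------|
| **PDF level** | XObjects in text-based PDFs | `input.txt_pdf_watermark_removal` | Text-based PDFs, before rendering |
| **Page-level image** | Full-page rendered image | `preprocessor.watermark_removal` | Before page-level OCR of scanned documents (optional) |
| **Cell-level image** | Cropped cell image | `table_recognition_wired.second_pass_ocr.cell_preprocess.watermark` | Before second-pass OCR (cell-first recommended) |

**Current strategy** (bank statement scenario): `txt_pdf_watermark_removal.enabled: true` (text-based PDFs use XObject removal), `preprocessor.watermark_removal.enabled: false` (pixel-level removal is turned off for image PDFs), and cell-level removal is configured independently.

**Implementation modules:**

| Path | Responsibility |
|------|------|
| `ocr_utils/watermark/presets.py` | Page-level and cell-level presets, `merge_watermark_config` |
| `ocr_utils/watermark/removal.py` | `threshold` / `masked_adaptive` watermark removal |
| `ocr_utils/watermark/processor.py` | `WatermarkProcessor` facade |
| `ocr_utils/watermark/pdf.py` | XObject watermark removal for text-based PDFs |
| `models/adapters/wired_table/text_filling.py` | Cell-level preprocessing + second-pass OCR |
| `ocr_tools/cell_preprocess_lab/cell_sweep.py` | Per-cell parameter grid sweep (tuning) |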

---

## Processing Flow

```mermaid
graph TB
    A[Input document] --> B{Is it a PDF?}
    
    B -->|Yes| C[Stage 1: PDF-level watermark removal<br/>XObject cleanup, no threshold needed]
    B -->|No| F
    
    C --> D{txt_pdf_watermark_removal enabled?}
    D -->|Yes| E[Scan the first N pages for watermark XObjects]
    D -->|No| G
    E --> E1{Watermark found?}
    E1 -->|Yes| E2[Form XObject → clear content stream<br/>Image XObject → replace with all white]
    E1 -->|No| G
    E2 --> G
    
    G[Render to image] --> H{PDF type?}
    H -->|Text-based txt| I[Skip stage 2<br/>watermark already removed at the PDF level]
    H -->|Scanned ocr| J
    I --> N
    
    F[Image input] --> J[Stage 2: image-level watermark removal<br/>pixel thresholding]
    
    J --> K{watermark_removal enabled?}
    K -->|Yes| L[WatermarkProcessor page<br/>method: threshold]
    K -->|No| N
    L --> M[gray > threshold → 255]
    M --> N[Orientation correction]
    
    N --> O[Layout detection]
    O --> P[Table OCR]
    P --> Q{Cell-level wm in second-pass OCR?}
    Q -->|Yes| R[WatermarkProcessor cell + upscale]
    Q -->|No| S
    R --> S[In-cell OCR]
    
    style C fill:#e1f5ff
    style E fill:#e1f5ff
    style E2 fill:#e1f5ff
    style J fill:#fff4e1
    style L fill:#fff4e1
    style M fill:#fff4e1
```

---

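## Stage 1: PDF-Level Watermark Removal

### Applicable Scenarios

**Text-based PDF (`pdf_type='txt'`)**: the PDF contains an extractable text layer, and the watermark is usually overlaid on the text as an XObject.

### Principle

Watermarks in PDF files are usually implemented with one of two XObject types:

1. **Form XObject**: a vector drawing object, typically carrying a rotation matrix and transparency settings
2. **Image XObject**: a bitmap object, usually a semi-transparent full-page background image

PyMuPDF (fitz) is used to manipulate the PDF's internal structure directly, **clearing or replacing the content stream of the watermark XObject** without affecting the searchability of the text layer.

### Watermark XObject Detection Rules

#### Form XObject detection (`_is_watermark_xobj`)

A Form XObject is classified as a watermark if it meets any one of the following rules:

| Rule | Description | Rationale |
|------|------|------|
| Rotation transform | The sin/cos rotation components of a `cm` operator in the content stream are non-zero | Watermarks are usually placed diagonally at 45° |
| Transparency group + transparency operators | `/Group` is present and the content stream contains `ca/CA` | Watermarks are semi-transparent |
| Transparency group + large stream | `/Group` is present and the stream is larger than 2 KB | Many repeated drawing operations indicate a tiled watermark |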

```python
# Detection logic (pseudocode); has_rotation_transform and
# has_transparency_operators are helper predicates not shown here
def _is_watermark_xobj(doc, xref, obj_str):
    if "/Form" not in obj_str:
        return False
    
    stream_text = doc.xref_stream(xref).decode("latin-1")
    
    # Rule 1: rotation transform
    if has_rotation_transform(stream_text):
        return True
    
    # Rules 2-3: transparency group checks
    if "/Group" in obj_str:
        if has_transparency_operators(stream_text):
            return True
        if len(stream_text) > 2048:
            return True
    
    return False
```
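
The rotation rule can be illustrated with a small, dependency-free sketch. This is not the repository's `has_rotation_transform`; it is a hypothetical version that assumes a rotation shows up as a non-zero `b` or `c` operand in an `a b c d e f cm` operator, and that the content stream has already been decoded to text. The regular expression and the `eps` tolerance are illustrative choices.

```python
import re

# One PDF number: "1", "1.", "0.7071", ".5", optionally signed.
_NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)"

# Six numeric operands followed by the `cm` operator: a b c d e f cm
_CM_PATTERN = re.compile(r"\s+".join([f"({_NUM})"] * 6) + r"\s+cm\b")


def has_rotation_transform(stream_text: str, eps: float = 1e-3) -> bool:
    """Return True if any `cm` matrix has a non-zero off-diagonal term (b or c)."""
    for match in _CM_PATTERN.finditer(stream_text):
        a, b, c, d, e, f = (float(group) for group in match.groups())
        # For a pure translation/scale matrix b == c == 0; a rotation makes them non-zero.
        if abs(b) > eps or abs(c) > eps:
            return True
    return False


# A plain translation is not a rotation; a 45° matrix is.
print(has_rotation_transform("q 1 0 0 1 100 200 cm /Fm0 Do Q"))
print(has_rotation_transform("q 0.7071 0.7071 -0.7071 0.7071 0 0 cm /Fm0 Do Q"))
```

A text-level check like this only sees matrices written literally in the stream, so it should be read as an approximation of the rule in the table rather than a full content-stream parser.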

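#### Image XObject detection (`_is_watermark_image_xobj`)

All of the following conditions must hold:

| Condition | Description |
|------|------|
| `/Subtype /Image` | Confirms the object is an image |
| `/SMask` present | Has an alpha channel (semi-transparent) |
| Width >= 600 and height >= 800 | Full-page size (rules out small icons) |
| Mean pixel value >= 240 | Almost entirely white (watermark text is sparse) |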
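
The four conditions combine with a logical AND. The sketch below expresses that as a pure predicate over already-extracted metadata; `ImageXObjectInfo` and `looks_like_watermark_image` are hypothetical names used only for illustration, and how the real `_is_watermark_image_xobj` obtains these values from the PDF is not shown in this document.

```python
from dataclasses import dataclass


@dataclass
class ImageXObjectInfo:
    """Metadata assumed to be extracted from the image XObject beforehand."""
    subtype: str       # e.g. "/Image"
    has_smask: bool    # True when the dictionary carries an /SMask entry
    width: int
    height: int
    mean_pixel: float  # mean of the decoded pixel values, 0-255


def looks_like_watermark_image(
    info: ImageXObjectInfo,
    min_width: int = 600,
    min_height: int = 800,
    min_mean: float = 240.0,
) -> bool:
    """All four conditions from the table must hold at the same time."""
    return (
        info.subtype == "/Image"
        and info.has_smask
        and info.width >= min_width
        and info.height >= min_height
        and info.mean_pixel >= min_mean
    )


full_page = ImageXObjectInfo("/Image", True, 1240, 1754, 251.3)
small_icon = ImageXObjectInfo("/Image", True, 64, 64, 250.0)
print(looks_like_watermark_image(full_page), looks_like_watermark_image(small_icon))
```

Requiring every condition keeps logos, stamps, and scanned page images (which are large but not nearly white) from being blanked by mistake.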

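### Removal Procedure

```python
from typing import Optional

import fitz  # PyMuPDF


def remove_txt_pdf_watermark(pdf_bytes: bytes) -> Optional[bytes]:
    """
    Remove watermarks natively from a text-based PDF.

    How each type is handled:
    - Form XObject: clear the content stream (update_stream(b""))
    - Image XObject: replace with all-white pixels and remove DecodeParms

    Returns:
        The PDF bytes after watermark removal, or None if no watermark was found.
    """
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    changed = False

    for page in doc:
        # Handle Form XObject watermarks
        for xref, name, *_ in page.get_xobjects():
            if _is_watermark_xobj(doc, xref, doc.xref_object(xref)):
                doc.update_stream(xref, b"")  # clear the content stream
                changed = True

        # Handle Image XObject watermarks
        for img_tuple in page.get_images(full=True):
            img_xref = img_tuple[0]
            if _is_watermark_image_xobj(doc, img_xref, doc.xref_object(img_xref)):
                _blank_watermark_image(doc, img_xref)  # replace with all white
                changed = True

    # Match the documented contract: None when nothing was removed
    return doc.tobytes(garbage=4, deflate=True) if changed else None
```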

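### Key Technical Details

**Why `/DecodeParms` must be removed**:

When an Image XObject uses Predictor compression, `/DecodeParms` must be removed before calling `update_stream`. Otherwise the renderer attempts Predictor decoding, fails, and falls back to the original data, so the watermark remains visible.

```python
def _blank_watermark_image(doc, img_xref):
    # Read the image dimensions from the XObject dictionary
    w = int(doc.xref_get_key(img_xref, "Width")[1])
    h = int(doc.xref_get_key(img_xref, "Height")[1])
    channels = 3  # assumption: DeviceRGB; use 1 for DeviceGray
    # Key step: remove DecodeParms first
    doc.xref_set_key(img_xref, "DecodeParms", "null")
    # Then replace the stream with all-white pixels
    doc.update_stream(img_xref, bytes([255]) * (w * h * channels))
```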

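### Quick Pre-scan (`scan_pdf_watermark_xobjs`)

For large PDFs (such as financial reports), a read-only scan first determines whether any watermark is present, which avoids unnecessary full processing:

```python
def scan_pdf_watermark_xobjs(pdf_bytes: bytes, sample_pages: int = 3) -> bool:
    """
    Quickly scan the first N pages for watermark XObjects.
    
    Args:
        sample_pages: upper bound on the number of pages scanned, default 3
            (bank statements usually carry the watermark on the first few pages)
    
    Returns:
        True if a watermark XObject was found
    """
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    for i in range(min(sample_pages, len(doc))):
        # Check Form XObjects and Image XObjects
        ...
    return False
```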

---

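## Stage 2: Image-Level Watermark Removal

### Applicable Scenarios

- **Page level**: scanned documents and images (`pdf_type='ocr'`), invoked in `MinerUPreprocessor` through `WatermarkProcessor(scope="page")`.
- **Cell level**: before the **second-pass OCR** of wired tables, invoked on `raw_crop` through `WatermarkProcessor(scope="cell")` (configured independently of the page level).

The currently recommended strategy for bank statements is **cell-first**: set page-level `watermark_removal.enabled: false` and focus tuning on cell-level `cell_preprocess.watermark`.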

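### Grayscale Convention (Required Reading)

OpenCV and PIL share one convention for **8-bit grayscale images**:

| Gray value | Appearance |
|--------|------|
| **0** | Black (dark strokes, ink) |
| **255** | White (background, paper) |
| Intermediate values (e.g. 100 to 220) | Gray (light watermarks, faint strokes, scan noise) |

On a typical scanned bank statement:

- **Chinese character strokes**: low gray values (dark, usually far below the threshold)
- **Paper background / light diagonal watermark**: high gray values (light, usually above the threshold)

> **Note**: some UIs or Photoshop may display values with "0 = white", but the code in this repository follows OpenCV: **0 = black, 255 = white**.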

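### Removal Methods (`method`)

| method | Description | What to write in YAML |
|--------|------|-----------|
| `threshold` | Global `gray > threshold → 255`; simple and fast | `method` plus optional `threshold` |
| `masked` | Locate the watermark region with a mask, then process it | `method` only (fine-grained parameters come from the preset) |
| `masked_adaptive` | Mask plus adaptive thresholding inside the mask | `method` only (fine-grained parameters come from the preset) |

Preset defaults (`presets.py`, used when YAML does not override them):

| scope | Default `threshold` | Default `contrast_enhancement` |
|-------|------------------|----------------------------|
| **page** | 175 | enabled |
| **cell** | 155 | disabled |

`merge_watermark_config(scope, user_cfg)` merges the user's YAML with the presets in the table above; fine-grained parameters such as `mask` / `hough` / `adaptive` do not need to be written in the scenario YAML.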
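
A preset-plus-override merge of this kind is commonly implemented as a recursive dictionary merge. The sketch below is an assumption about the general shape, not the actual `merge_watermark_config`: `PRESETS_SKETCH` only contains the two defaults listed in the table, and `merge_config_sketch` is a hypothetical name.

```python
from copy import deepcopy

# Only the defaults documented in the preset table; the real presets hold more keys.
PRESETS_SKETCH = {
    "page": {"threshold": 175, "contrast_enhancement": {"enabled": True}},
    "cell": {"threshold": 155, "contrast_enhancement": {"enabled": False}},
}


def _deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where values in `override` win, merging nested dicts."""
    merged = deepcopy(base)
    for key, value in (override or {}).items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = _deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def merge_config_sketch(scope: str, user_cfg: dict) -> dict:
    """Overlay the user's YAML fragment on the preset for the given scope."""
    return _deep_merge(PRESETS_SKETCH[scope], user_cfg)


# The user only switches the feature on; threshold falls back to the cell preset.
print(merge_config_sketch("cell", {"enabled": True, "method": "threshold"}))
# The user overrides one value; the nested preset default is kept.
print(merge_config_sketch("page", {"threshold": 165}))
```

Merging recursively is what lets a scenario YAML override a single nested key without having to restate the rest of the preset.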

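### How the `threshold` Method Works

Watermarks in financial documents such as bank statements have these characteristics:

- **Light color**: gray values mostly in 160 to 220 (between body text and white paper)
- **Diagonal**: typically repeated text at 45°
- **Small footprint**: a sparse, light texture relative to the whole page or cell

Core code (`removal.py`):

```python
cleaned = gray.copy()
cleaned[gray > threshold] = 255   # pixels brighter than the threshold → white
```

**Semantics**: pixels with **gray value ≤ threshold** (dark text) are kept, and **brighter** pixels are painted white like the paper, which weakens light watermarks.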

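### What Raising or Lowering `threshold` Actually Does

Rule: a pixel turns white only when `gray > threshold`, so **threshold is the cut-off for "how bright counts as background"**.

| Action | Whitening strength | Pixels that get whitened | Effect on watermark | Effect on body text / OCR |
|------|----------|------------------|--------|----------------|
| **Lower** threshold (e.g. 175→155) | **Stronger** | More mid-gray pixels (e.g. 156 to 175) also turn white | Removed more cleanly | Faint strokes and edges washed out by the watermark may be eaten away; with a cleaner background, det sometimes detects a whole line more easily |
| **Raise** threshold (e.g. 155→175) | **Weaker** | Only brighter pixels turn white | Diagonal texture and light gray noise tend to remain | More strokes are preserved; residual interference can cause fragmented det boxes and short, wrong text with high scores |

Mnemonic:

- **threshold ↓** → removes light colors more aggressively → whiter background, **faint text is easily damaged**
- **threshold ↑** → more conservative → **watermark tends to remain**, but dark text is safer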
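
The direction of the threshold is easy to get backwards, so here is a tiny worked example. It is a plain-list stand-in for the `gray > threshold → 255` rule, written without NumPy so it runs anywhere; `whiten_above` and the sample gray values are made up for illustration.

```python
def whiten_above(gray_rows, threshold):
    """Set every pixel brighter than `threshold` to 255 and keep the rest."""
    return [[255 if px > threshold else px for px in row] for row in gray_rows]


# One row of pixels: dark stroke, two faint strokes, watermark, paper.
sample = [[40, 160, 170, 200, 250]]

print(whiten_above(sample, 175))  # [[40, 160, 170, 255, 255]]
print(whiten_above(sample, 155))  # [[40, 255, 255, 255, 255]]
```

At 175 the faint strokes (160, 170) survive and only the watermark and paper are whitened; at 155 the same faint strokes are wiped out along with the watermark, which is the "lower threshold damages faint text" trade-off.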

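Tuning advice (for a single cell, run `cell_sweep.py` on the **original `*_raw.png` image**; do not remove the watermark a second time on an already preprocessed debug image):

1. Sweep **155 to 175** first, and judge by whether the OCR text is complete and the det boxes are stable.
2. **Do not look at the rec score alone**: when threshold is too high, a short but wrong text can score highly (for example only "折取款").
3. Cell-level and page-level **thresholds may differ** (presets: page=175, cell=155); write each into YAML according to the sweep results.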

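### Watermark Detection (`detect_watermark`)

Detection uses a two-stage strategy:

1. **Midtone detection**: measure the proportion of pixels with gray values between 100 and 220
2. **Diagonal verification**: use the Hough line transform to verify that diagonal texture is present

```python
import cv2
import numpy as np


def detect_watermark(image, midtone_low=100, midtone_high=220, ratio_threshold=0.03):
    """
    Detect whether the image contains a light, diagonal text watermark.
    
    Steps:
    1. Extract midtone pixels (100-220) and compute their ratio
    2. If the ratio reaches 3%, run diagonal verification
    3. Apply Canny edge detection + Hough line transform
    4. Count diagonal lines at 30-60°
    """
    gray = to_grayscale(image)  # project helper, not shown here
    
    # Stage 1: midtone detection
    midtone_mask = (gray > midtone_low) & (gray < midtone_high)
    if midtone_mask.sum() / gray.size < ratio_threshold:
        return False
    
    # Stage 2: diagonal verification (Canny expects an 8-bit image, not a bool mask)
    edges = cv2.Canny(midtone_mask.astype(np.uint8) * 255, 50, 150)
    lines = cv2.HoughLines(edges, rho=1, theta=np.pi/180, threshold=80)
    
    # Count diagonal (30-60°) lines; count_diagonal_lines is a project helper
    diagonal_count = count_diagonal_lines(lines, angle_range=(30, 60))
    return diagonal_count >= 2
```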
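
The `count_diagonal_lines` helper is not shown in this document. A minimal sketch of what such a helper might look like follows, under two assumptions: the lines arrive as `(rho, theta)` pairs in the convention of `cv2.HoughLines`, where `theta` is the angle of the line's normal in radians, and "diagonal" means an inclination of 30 to 60° from horizontal in either direction. With real OpenCV output, the array would first be flattened, for example with `lines.reshape(-1, 2)`.

```python
import math


def count_diagonal_lines(lines, angle_range=(30, 60)):
    """Count lines whose inclination from horizontal falls inside angle_range (degrees)."""
    if lines is None:  # cv2.HoughLines returns None when nothing is found
        return 0
    low, high = angle_range
    count = 0
    for rho, theta in lines:
        # theta describes the normal; the line itself is perpendicular to it.
        direction = (math.degrees(theta) + 90.0) % 180.0
        # Fold into [0, 90] so that 45° and 135° both count as diagonal.
        inclination = min(direction, 180.0 - direction)
        if low <= inclination <= high:
            count += 1
    return count


sample_lines = [
    (10.0, math.radians(45)),   # diagonal
    (20.0, math.radians(135)),  # diagonal, opposite slant
    (5.0, 0.0),                 # vertical line (normal is horizontal)
    (7.0, math.radians(90)),    # horizontal line
]
print(count_diagonal_lines(sample_lines))  # 2
```

Table rulings are horizontal or vertical, so restricting the count to diagonal angles keeps ordinary table borders from being mistaken for watermark texture.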

### Watermark Removal API

**Recommended (same for page level and cell level):**

```python
from ocr_utils.watermark import WatermarkProcessor

processor = WatermarkProcessor.from_user_config(
    {"enabled": True, "method": "threshold", "threshold": 155},
    scope="cell",  # or "page"
)
cleaned_bgr, stages = processor.process(cell_bgr_image, force=True)
# stages may include "wm", "contrast", etc., for use in the debug JSON
```

**Low-level (compatible with legacy code):**

```python
from ocr_utils.watermark import merge_watermark_config, remove_watermark_from_image_rgb

out = remove_watermark_from_image_rgb(
    image,
    threshold=175,
    watermark_removal_cfg=merge_watermark_config("page", {"method": "threshold"}),
)
```

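### Parameters (`method: threshold`)

| Parameter | page preset | cell preset | Description |
|------|-----------|-----------|------|
| `threshold` | 175 | 155 | See the raise/lower discussion above; **larger is more conservative**, smaller whitens more strongly |
| `morph_close_kernel` | 0 | 0 | Kernel for morphological closing; **0 = off** (recommended, since closing on a non-binary image tends to introduce noise) |
| `detect_before_remove` | true | false | Page level can detect before removing; cell level usually processes directly with `force=True` |
| `contrast_enhancement` | on by default | off by default | `text_restore` after watermark removal; off by default at cell level, enable when needed |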

---

## Stage 3: Cell-Level Second-Pass OCR Preprocessing

### Flow

```
Table image raw_crop
  → WatermarkProcessor(cell)     # wm
  → optional denoise / contrast  # YAML switches
  → upscale (light.upscale_min_side, e.g. 192)
  → det line splitting / whole fallback OCR
```
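
The upscale step is described only by its `upscale_min_side` parameter. One plausible reading, sketched below as an assumption rather than the repository's behavior, is that a crop is enlarged with its aspect ratio preserved until its shortest side reaches the minimum, and left untouched if it is already large enough. `upscale_size` is a hypothetical helper that only computes the target size.

```python
def upscale_size(width: int, height: int, min_side: int) -> tuple:
    """Return the (width, height) a crop would have after min-side upscaling."""
    shortest = min(width, height)
    if shortest <= 0:
        raise ValueError("image dimensions must be positive")
    if shortest >= min_side:
        return width, height  # already large enough, no resize
    scale = min_side / shortest
    return round(width * scale), round(height * scale)


# A 240x48 cell with min_side=96 is doubled; a 300x200 cell is left alone.
print(upscale_size(240, 48, 96))   # (480, 96)
print(upscale_size(300, 200, 96))  # (300, 200)
```

Running the upscale after watermark removal, as in the flow above, means the threshold operates on original pixel values rather than on interpolated ones.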

Debug output (`tablecell_ocr/`):

| File | Meaning |
|------|------|
| `cellNNN_*_*.png` | Preprocessed image fed to OCR |
| `cellNNN_*_*_raw.png` | Original crop before watermark removal (input for `cell_sweep` tuning) |
| `cellNNN_*_*.json` | Contains `preprocess_stages`, `debug_images`, `lines`/`whole`, etc. |

### Parameter Exploration Tool

```bash
cd ocr_tools/cell_preprocess_lab

# Sweep a single cell (automatically prefers *_raw.png)
python cell_sweep.py /path/to/cell219_empty_empty_raw.png \
  -o ./output/cell219_sweep -t "ATM存折取款"

# Batch over a tablecell_ocr directory
python cell_sweep.py /path/to/tablecell_ocr/ -o ./sweep_out --quick
```

The report `sweep_report.json` contains, for each combination, `text`, `score` (weighted recognition score), and `boxes[]` (per-box scores).
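
Because a high score can accompany wrong text, a report is best read by filtering on the expected text before ranking by score. The sketch below assumes the report has been loaded as a list of dictionaries with the documented `text` and `score` fields; the `threshold` key on each entry and the function name `pick_best_combo` are assumptions made for illustration.

```python
def pick_best_combo(entries, expected_text):
    """Among combinations that reproduce the expected text, return the top-scoring one."""
    exact = [entry for entry in entries if entry.get("text") == expected_text]
    if not exact:
        return None  # no combination read the cell correctly
    return max(exact, key=lambda entry: entry.get("score", 0.0))


report = [
    {"threshold": 175, "text": "折取款", "score": 0.98},       # highest score, wrong text
    {"threshold": 155, "text": "ATM存折取款", "score": 0.93},
    {"threshold": 165, "text": "ATM存折取款", "score": 0.90},
]

best = pick_best_combo(report, "ATM存折取款")
print(best["threshold"])  # 155
```

Ranking by score alone would have picked the threshold-175 entry here, which is the failure mode the tuning advice warns about.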

---

## Configuration

### Full Configuration Example (`bank_statement_yusys_local.yaml`)

```yaml
input:
  txt_pdf_watermark_removal:
    enabled: true          # text-based PDF: XObject cleanup, no threshold needed
    sample_pages: 3

preprocessor:
  order: orient_first
  watermark_removal:
    enabled: false         # image PDF: pixel-threshold removal, off by default (works poorly)
    detect_before_remove: true
    method: threshold
    threshold: 175
    contrast_enhancement:
      enabled: false

table_recognition_wired:
  second_pass_ocr:
    cell_preprocess:
      watermark:
        enabled: true
        method: threshold
        threshold: 155     # cell-level threshold, independent of the page level
      denoise:
        enabled: false
      contrast:
        enabled: false
      upscale_min_side: 96
    enhance_retry:
      enabled: true
      upscale_min_side: 128
      contrast:
        enabled: true
        method: clahe
        clip_limit: 1.0
        tile_grid_size: 4
```

### Configuration Reference

| Config path | Description |
|---------|------|
| `input.txt_pdf_watermark_removal.*` | PDF XObject watermark removal |
| `preprocessor.watermark_removal.*` | Page-level `WatermarkProcessor(scope=page)` |
| `preprocessor.watermark_removal.method` | `threshold` \| `masked` \| `masked_adaptive` |
| `preprocessor.watermark_removal.threshold` | `threshold` method only; see the raise/lower discussion |
| `second_pass_ocr.cell_preprocess.watermark.*` | Cell-level `WatermarkProcessor(scope=cell)` |
| `second_pass_ocr.cell_preprocess.light.upscale_min_side` | Upscale of the shortest side after watermark removal |
| `second_pass_ocr.enhance_retry` | Pass 2 preprocessing (a sibling of `cell_preprocess`, not a child of it) |

**Notes:**

- `morph_close_kernel` is already `0` in the preset and generally **does not need to be written in YAML**.
- Cell-level `threshold` **should be set explicitly after a sweep**; do not assume it matches the page level.
- Removal runs only when `enabled: true`; the page-level and cell-level switches are independent.
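
Since the three switches live at different depths of the configuration, a quick way to see which layers are active is to walk the documented paths on the loaded config. The sketch below works on a plain dictionary (as produced by any YAML loader); `watermark_switches` and `_is_enabled` are hypothetical helpers, not part of the repository.

```python
def _is_enabled(config: dict, *path: str) -> bool:
    """Follow `path` through nested dicts and read the `enabled` flag, defaulting to False."""
    node = config
    for key in path:
        if not isinstance(node, dict):
            return False
        node = node.get(key)
    return isinstance(node, dict) and bool(node.get("enabled", False))


def watermark_switches(config: dict) -> dict:
    """Report the three independent watermark switches described in this document."""
    return {
        "pdf": _is_enabled(config, "input", "txt_pdf_watermark_removal"),
        "page": _is_enabled(config, "preprocessor", "watermark_removal"),
        "cell": _is_enabled(
            config, "table_recognition_wired", "second_pass_ocr",
            "cell_preprocess", "watermark",
        ),
    }


config = {
    "input": {"txt_pdf_watermark_removal": {"enabled": True, "sample_pages": 3}},
    "preprocessor": {"watermark_removal": {"enabled": False, "threshold": 175}},
    "table_recognition_wired": {
        "second_pass_ocr": {
            "cell_preprocess": {"watermark": {"enabled": True, "threshold": 155}}
        }
    },
}
print(watermark_switches(config))  # {'pdf': True, 'page': False, 'cell': True}
```

The printed result corresponds to the recommended bank statement setup: PDF-level and cell-level removal on, page-level removal off.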

---

## Trigger Conditions

### Stage 1 Trigger Conditions

```python
# pipeline_manager_v2.py: process_document()

if is_pdf:
    wm_cfg = config.get('input', {}).get('txt_pdf_watermark_removal', {})
    if wm_cfg.get('enabled', False):                    # condition ①
        if scan_pdf_watermark_xobjs(pdf_bytes, sample_pages=3):  # condition ②
            cleaned = remove_txt_pdf_watermark(pdf_bytes)
```

**Trigger conditions**:
1. The file is a PDF
2. `enabled: true`
3. The scan finds a watermark XObject

> **Note**: this stage **needs no threshold parameter**; it operates directly on PDF XObjects.

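### Stage 2 Trigger Conditions

```python
# pipeline_manager_v2.py: _process_single_page()

# Page-level watermark removal runs inside prepare_detection_image
# The preprocessor reads its settings from config[preprocessor][watermark_removal]
detection_image, rotate_angle = self.preprocessor.prepare_detection_image(
    original_image.copy(),
    pdf_rotate_angle=pdf_rotate_angle,
    use_orientation_classifier=pdf_type == 'ocr',  # only scanned documents go through orientation classification
)
```

**Trigger conditions**:
1. `preprocessor.watermark_removal.enabled: true`
2. The preprocessor performs pixel-level removal internally (`method: threshold`, etc.)

> **Current recommendation**: for bank statements, stage 2 page-level removal is **off by default** (`enabled: false`), because pixel-threshold removal works poorly on image PDFs and text-based PDFs already have their watermark removed in stage 1. Cell-level removal is controlled independently during second-pass OCR.

**Cell-level second-pass OCR** (`text_filling.py`): when the table body triggers second-pass OCR, `_preprocess_cell_for_ocr` is called on `raw_crop`, which in turn calls `WatermarkProcessor(scope="cell")`.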

---

## Layer Comparison

| Dimension | PDF level | Page-level image | Cell-level image |
|------|----------|----------|----------|
| **Target** | XObjects in text-based PDFs | Full-page rendered image | Cell crop |
| **Config** | `input.txt_pdf_*` | `preprocessor.watermark_removal` | `second_pass_ocr.cell_preprocess.watermark` |
| **Default threshold preset** | n/a | 175 | 155 |
| **Preserves the PDF text layer** | ✅ | n/a | n/a |
| **When it runs** | Before rendering | Before Layout/OCR | Before in-cell second-pass OCR |
| **Dependency** | PyMuPDF | OpenCV | OpenCV |

---

## Code Integration

### Pipeline Integration

```python
# pipeline_manager_v2.py

from ocr_utils.watermark import (
    scan_pdf_watermark_xobjs,
    remove_txt_pdf_watermark,
)

class EnhancedDocPipeline:
    def process_document(self, doc_path):
        # Stage 1: PDF-level watermark removal
        _pdf_bytes_override = None
        if is_pdf and config['input']['txt_pdf_watermark_removal']['enabled']:
            raw_bytes = doc_path.read_bytes()
            if scan_pdf_watermark_xobjs(raw_bytes):
                _pdf_bytes_override = remove_txt_pdf_watermark(raw_bytes)
        
        # Render the PDF (using the watermark-free bytes when available)
        images, pdf_type, pdf_doc = PDFUtils.load_and_classify_document(
            doc_path, pdf_bytes=_pdf_bytes_override
        )
        
        # Process page by page
        for page_idx, original_image in enumerate(images):
            # Stage 2: image-level watermark removal (inside preprocessor.process)
            if pdf_type == 'ocr':
                detection_image, angle = self.preprocessor.process(original_image)
            
            # Layout detection, OCR...
```

### Preprocessor Integration

```python
# models/adapters/mineru_adapter.py

from ocr_utils.watermark import WatermarkProcessor

class MinerUPreprocessor:
    def process(self, image):
        wm_cfg = self.config.get("watermark_removal") or {}
        processor = WatermarkProcessor.from_user_config(wm_cfg, scope="page")
        if processor.enabled:
            image, _ = processor.process(image)
        # orientation correction ...
        return image, angle
```

### Cell-Level Second-Pass OCR Integration

```python
# models/adapters/wired_table/text_filling.py

self._cell_wm_processor = WatermarkProcessor.from_user_config(wm_user, scope="cell")

cell_img, stages = self._preprocess_cell_for_ocr(raw_crop, mode="light")
# example stages: ["wm", "upscale"] or ["wm", "contrast", "upscale"]
```

---

## Usage Examples

### Command Line

```bash
# Process a watermarked bank statement PDF
python main_v2.py -i bank_statement.pdf -c config/bank_statement_yusys_v4.yaml --scene bank_statement

# Already enabled in the config file:
# input.txt_pdf_watermark_removal.enabled: true
# preprocessor.watermark_removal.enabled: true
```

### Python API

```python
from core.pipeline_manager_v2 import EnhancedDocPipeline

# Use a YAML file that includes the watermark removal settings
with EnhancedDocPipeline("config/bank_statement_yusys_v4.yaml") as pipeline:
    results = pipeline.process_document("document.pdf")
```

---

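## Debugging and Verification

### Log Output

```python
# Stage 1 logs (message text is emitted in Chinese, shown verbatim)
logger.info(f"🧹 文字型 PDF 原生去水印完成（{doc_path.name}）")  # native watermark removal finished for a text-based PDF
logger.debug(f"  [Form XObject] 清空水印 xref={xref}, name={name}")  # watermark stream cleared
logger.debug(f"  [Image XObject] 替换水印图像 xref={img_xref}")  # watermark image replaced

# Stage 2 logs
logger.info(f"🧹 Watermark removed (threshold={threshold})")
```

### Visual Verification

In debug mode, the effect can be verified by comparing the images before and after watermark removal:

```bash
# Enable debug mode
python main_v2.py -i doc.pdf -c config.yaml --scene bank_statement --debug

# Output files:
# {doc}_pdf_page_001.png  - rendered page image (after watermark removal)
# {doc}_page_001_layout.png - Layout visualization
```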

---

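## Notes

1. **The three mechanisms are independent**: PDF XObject cleanup (no threshold), page-level pixel removal (thresholded), and cell-level removal (thresholded) can each be switched on or off separately. For bank statements the recommendation is **PDF XObject + cell level** (page-level wm off).
2. **Grayscale direction**: **0 = black, 255 = white**; `gray > threshold → 255` whitens the pixels that are brighter than the threshold.
3. **threshold direction**: **raising** it is more conservative (watermark tends to remain, text is damaged less); **lowering** it is more aggressive (cleaner background, faint strokes are easily eaten away). Tune page level and cell level separately.
4. **Do not confuse it with the det threshold**: `ocr_recognition.det_threshold` filters OCR detection boxes and is unrelated to the watermark `threshold`.
5. **Tuning input**: `cell_sweep.py` should use `*_raw.png` (the original crop). Do not sweep an already preprocessed `cell*_empty_empty.png`; that amounts to removing the watermark twice and distorts the conclusions.
6. **Morphology**: `morph_close_kernel=0` in the preset; closing is not recommended on non-binary images.
7. **Dependencies**: the PDF level requires `PyMuPDF`; the image level requires `OpenCV`.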

---

## References

- `ocr_utils/watermark/`: implementation package (presets / removal / processor / pdf)
- `ocr_utils/watermark_utils.py`: backward-compatible re-export
- `ocr_tools/cell_preprocess_lab/cell_sweep.py`: cell-level parameter sweep
- `models/adapters/mineru_adapter.py`: page-level preprocessing
- `models/adapters/wired_table/text_filling.py`: cell-level second-pass OCR
- `config/bank_statement_yusys_local.yaml`: scenario configuration example