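Watermark Removal Technical Documentation

Overview

Watermark removal lives in the ocr_utils/watermark/ package; ocr_utils/watermark_utils.py remains as a backward-compatible entry point (re-export). The core orchestration class is WatermarkProcessor, which supports two sets of presets defined in presets.py: page-level (page) and cell-level (cell).

Beyond PDF-level and page-level preprocessing, scenarios such as bank statements can run watermark removal a second time on individual cell crops during the second-pass OCR of wired tables (text_filling.cell_preprocess).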

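Level	Target	Config location	When to use
PDF level	XObjects of a text-based PDF	`input.txt_pdf_watermark_removal`	Text-based PDFs, before rendering
Page-level image	Full-page rendered image	`preprocessor.watermark_removal`	Before page-level OCR of scanned documents (optional)
Cell-level image	Cell crop	`table_recognition_wired.second_pass_ocr.cell_preprocess.watermark`	Before second-pass OCR (cell-first recommended)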

Implementation modules:

Path	Responsibility
`ocr_utils/watermark/presets.py`	Page-level / cell-level presets, `merge_watermark_config`
`ocr_utils/watermark/removal.py`	`threshold` / `masked_adaptive` watermark removal
`ocr_utils/watermark/processor.py`	`WatermarkProcessor` facade
`ocr_utils/watermark/pdf.py`	XObject watermark removal for text-based PDFs
`models/adapters/wired_table/text_filling.py`	Cell-level preprocessing + second-pass OCR
`ocr_tools/cell_preprocess_lab/cell_sweep.py`	Per-cell parameter grid sweep (tuning)

Processing Flow

graph TB
    A[Input document] --> B{Is it a PDF?}
    
    B -->|Yes| C[Stage one: PDF-level watermark removal]
    B -->|No| F
    
    C --> D{txt_pdf_watermark_removal enabled?}
    D -->|Yes| E[Scan the first N pages for watermark XObjects]
    D -->|No| G
    E --> E1{Watermark found?}
    E1 -->|Yes| E2[Clear the XObject content stream]
    E1 -->|No| G
    E2 --> G
    
    G[Render to images] --> H{PDF type?}
    H -->|Text-based txt| I[Skip stage two]
    H -->|Scanned ocr| J
    
    F[Image input] --> J[Stage two: image-level watermark removal]
    
    J --> K{watermark_removal enabled?}
    K -->|Yes| L[WatermarkProcessor page]
    K -->|No| N
    L --> M[method: threshold / masked / masked_adaptive]
    M --> N[Orientation correction]
    
    N --> O[Layout detection]
    O --> P[Table OCR]
    P --> Q{Cell-level wm in second-pass OCR?}
    Q -->|Yes| R[WatermarkProcessor cell + upscale]
    Q -->|No| S
    R --> S[In-cell OCR]
    
    style C fill:#e1f5ff
    style E fill:#e1f5ff
    style E2 fill:#e1f5ff
    style J fill:#fff4e1
    style L fill:#fff4e1
    style M fill:#fff4e1

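Stage One: PDF-Level Watermark Removal

When to use

Text-based PDFs (pdf_type='txt'): the PDF contains an extractable text layer, and the watermark is usually overlaid on the text as an XObject.

How it works

Watermarks in PDF files are usually implemented with one of two XObject types:

Form XObject: a vector drawing object that carries rotation, transparency, and similar transforms
Image XObject: a bitmap object, usually a semi-transparent full-page background image

PyMuPDF (fitz) is used to edit the PDF's internal structure directly: the content stream of each watermark XObject is cleared or replaced, without affecting the searchability of the text layer.

Watermark XObject Detection Rules

Form XObject Watermark Check (`_is_watermark_xobj`)

An object is classified as a watermark if it meets any one of the following rules:

Rule	Description	Rationale
Rotation transform	The sin/cos components of a `cm` operator in the content stream are non-zero	Watermarks are usually placed diagonally at 45°
Transparency group + transparency operators	`/Group` is present and the content stream contains `ca/CA`	Watermarks are semi-transparent
Transparency group + large stream	`/Group` is present and the stream is larger than 2 KB	Heavy repeated drawing = tiled watermark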
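# Pseudocode of the decision logic
# (has_rotation_transform / has_transparency_operators are helper predicates not shown here)
def _is_watermark_xobj(doc, xref, obj_str):
    if "/Form" not in obj_str:
        return False
    
    stream_text = doc.xref_stream(xref).decode("latin-1")
    
    # Rule 1: rotation transform
    if has_rotation_transform(stream_text):
        return True
    
    # Rules 2-3: transparency group related
    if "/Group" in obj_str:
        if has_transparency_operators(stream_text):
            return True
        if len(stream_text) > 2048:  # larger than 2 KB
            return True
    
    return False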
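
The two helper predicates named in the pseudocode are not shown in this document. The following is a minimal sketch of how they might look, under two assumptions that are not confirmed by the source: that "rotation" means the off-diagonal `b`/`c` operands of an `a b c d e f cm` matrix are non-zero, and that the alpha keys appear literally as `/ca` or `/CA` followed by a number in the text being searched. The regular expressions and the `eps` parameter are illustrative choices.

```python
import re

# A PDF number: optional sign, digits with optional fraction, or a bare fraction.
_NUM = r"[-+]?(?:\d+\.?\d*|\.\d+)"

# Six numeric operands followed by the `cm` operator: "a b c d e f cm".
_CM_RE = re.compile(
    r"(?<![\w.+-])" + r"\s+".join(f"({_NUM})" for _ in range(6)) + r"\s+cm(?!\w)"
)

# "/ca <number>" or "/CA <number>" (assumed textual form of the alpha keys).
_ALPHA_RE = re.compile(r"/(?:ca|CA)\s+" + _NUM)


def has_rotation_transform(stream_text, eps=1e-6):
    """Return True if any `cm` matrix has a non-zero b or c operand."""
    for match in _CM_RE.finditer(stream_text):
        a, b, c, d, e, f = (float(v) for v in match.groups())
        if abs(b) > eps or abs(c) > eps:
            return True
    return False


def has_transparency_operators(stream_text):
    """Return True if an alpha key (/ca or /CA) with a numeric value appears."""
    return _ALPHA_RE.search(stream_text) is not None
```

A pure translation or scale matrix such as `1 0 0 1 50 50 cm` has `b = c = 0` and is therefore not flagged, while a 45° rotation (`0.7071 0.7071 -0.7071 0.7071 ...`) is.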

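Image XObject Watermark Check (`_is_watermark_image_xobj`)

All of the following conditions must hold:

Condition	Description
`/Subtype /Image`	Confirms the object is an image
`/SMask` is present	Has a transparency channel (semi-transparent)
Width >= 600 and height >= 800	Full-page size (rules out small icons)
Mean pixel value >= 240	Almost entirely white (watermark text is sparse)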
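
The four conditions combine with a logical AND. As a rough illustration, the check can be written as a pure function over values that have already been extracted from the PDF object; the function name, parameter names, and the idea of passing pre-extracted values are assumptions for this sketch, not the signature of `_is_watermark_image_xobj`.

```python
def looks_like_watermark_image(subtype, has_smask, width, height, mean_pixel,
                               min_width=600, min_height=800, min_mean=240.0):
    """Apply the four documented conditions; every one of them must hold."""
    return bool(
        subtype == "/Image"            # it is an image object
        and has_smask                  # it has a soft mask (semi-transparent)
        and width >= min_width         # full-page width, not a small icon
        and height >= min_height       # full-page height
        and mean_pixel >= min_mean     # nearly all white, sparse watermark text
    )
```

A 1240 x 1754 image with a soft mask and a mean of 250 passes; the same image without `/SMask`, or a 64 x 64 icon, does not.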

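Processing Method

from typing import Optional

import fitz  # PyMuPDF


def remove_txt_pdf_watermark(pdf_bytes: bytes) -> Optional[bytes]:
    """
    Remove watermarks natively from a text-based PDF.
    
    How each type is handled:
    - Form XObject: clear the content stream (update_stream(b""))
    - Image XObject: replace with all-white pixels and remove DecodeParms
    
    Returns:
        The cleaned PDF bytes, or None if no watermark was found.
    """
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    found = False
    
    for page in doc:
        # Handle Form XObject watermarks
        for xref, name, *_ in page.get_xobjects():
            obj_str = doc.xref_object(xref)  # object definition as text
            if _is_watermark_xobj(doc, xref, obj_str):
                doc.update_stream(xref, b"")  # clear the content stream
                found = True
        
        # Handle Image XObject watermarks
        for img_tuple in page.get_images(full=True):
            img_xref = img_tuple[0]
            obj_str = doc.xref_object(img_xref)
            if _is_watermark_image_xobj(doc, img_xref, obj_str):
                _blank_watermark_image(doc, img_xref)  # replace with all white
                found = True
    
    if not found:
        return None  # matches the contract in the docstring
    return doc.tobytes(garbage=4, deflate=True)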

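Key Technical Detail

Why /DecodeParms must be removed:

When an Image XObject uses Predictor compression, /DecodeParms must be removed before calling update_stream. Otherwise the renderer attempts Predictor decoding, fails, and falls back to the original data, so the watermark stays visible.

def _blank_watermark_image(doc, img_xref):
    # Sketch: read the image size first (assumes 8 bits per component)
    pix = fitz.Pixmap(doc, img_xref)
    w, h, channels = pix.width, pix.height, pix.n
    # Key step: remove DecodeParms first
    doc.xref_set_key(img_xref, "DecodeParms", "null")
    # Then overwrite the stream with all-white pixels
    doc.update_stream(img_xref, bytes([255]) * (w * h * channels))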
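Fast Pre-Scan (`scan_pdf_watermark_xobjs`)

For large PDFs (such as financial reports), a read-only scan first determines whether any watermark is present, which avoids unnecessary full processing:

def scan_pdf_watermark_xobjs(pdf_bytes: bytes, sample_pages: int = 3) -> bool:
    """
    Quickly scan the first N pages for watermark XObjects.
    
    Args:
        sample_pages: upper bound on pages scanned, default 3
            (bank statements usually carry the watermark on the first few pages)
    
    Returns:
        True if a watermark XObject was found
    """
    doc = fitz.open(stream=pdf_bytes, filetype="pdf")
    for i in range(min(sample_pages, len(doc))):
        # Check Form XObjects and Image XObjects
        ...
    return False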

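Stage Two: Image-Level Watermark Removal

When to use

Page level: scanned documents / images (pdf_type='ocr'); invoked in MinerUPreprocessor through WatermarkProcessor(scope="page").
Cell level: before the second-pass OCR of wired tables; raw_crop is passed through WatermarkProcessor(scope="cell"), configured independently of the page level.

The currently recommended strategy for bank statements is cell-first: set page-level watermark_removal.enabled: false and focus tuning on the cell-level cell_preprocess.watermark.

Grayscale Convention (required reading)

8-bit grayscale images in OpenCV / PIL share one convention:

Gray value	Appearance
0	Black (dark strokes, ink)
255	White (background, paper)
Intermediate values (e.g. 100-220)	Gray (light watermarks, faint strokes, scan noise)

On a typical bank statement scan:

Character strokes: low gray values (dark, usually far below threshold)
Paper background / light diagonal watermark: high gray values (bright, usually above threshold)

Note: some UIs or Photoshop may display values with 0 = white, but the code in this repository follows OpenCV: 0 = black, 255 = white.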

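Watermark Removal Methods (`method`)

method	Description	What to write in YAML
`threshold`	Global `gray > threshold → 255`; simple and fast	`method` + optional `threshold`
`masked`	Locates the watermark region with a mask, then processes it	`method` only (fine-grained parameters come from the preset)
`masked_adaptive`	Mask + adaptive threshold inside the mask	`method` only (fine-grained parameters come from the preset)

Preset defaults (presets.py, used when the YAML does not override them):

scope	Default `threshold`	Default `contrast_enhancement`
page	175	enabled
cell	155	disabled

merge_watermark_config(scope, user_cfg) merges the user's YAML with the presets in the table above; fine-grained parameters such as mask / hough / adaptive do not need to be written in the scenario YAML.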
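
To make the merge semantics concrete, here is a minimal sketch of a preset-plus-override merge. It is not the code in presets.py: the dict layout, the `_sketch` names, the `"enabled": False` default, and the recursive merge of nested sections are all assumptions; only the numeric and boolean preset values are taken from the tables in this document.

```python
import copy

# Preset values come from this document's tables; the structure is assumed.
PRESETS_SKETCH = {
    "page": {
        "enabled": False,
        "method": "threshold",
        "threshold": 175,
        "morph_close_kernel": 0,
        "detect_before_remove": True,
        "contrast_enhancement": {"enabled": True},
    },
    "cell": {
        "enabled": False,
        "method": "threshold",
        "threshold": 155,
        "morph_close_kernel": 0,
        "detect_before_remove": False,
        "contrast_enhancement": {"enabled": False},
    },
}


def _deep_update(base, override):
    """Copy override into base in place, descending into nested dicts."""
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            _deep_update(base[key], value)
        else:
            base[key] = value
    return base


def merge_watermark_config_sketch(scope, user_cfg=None):
    """Return the preset for `scope` with the user's keys layered on top."""
    if scope not in PRESETS_SKETCH:
        raise ValueError(f"unknown scope: {scope!r}")
    merged = copy.deepcopy(PRESETS_SKETCH[scope])  # never mutate the preset
    return _deep_update(merged, user_cfg or {})
```

With this shape, a scenario YAML that only sets `enabled: true` for the cell scope still ends up with `threshold: 155`, which is the behavior the preset table describes.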

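How the `threshold` Method Works

Watermarks on financial documents such as bank statements share these traits:

Light color: gray values mostly fall in 160-220 (between body text and white paper)
Diagonal angle: repeated text at 45° is common
Small share: a sparse, light texture relative to the whole page or cell

Core code (removal.py):

cleaned = gray.copy()
cleaned[gray > threshold] = 255   # pixels brighter than the threshold → white

Semantics: pixels with gray value ≤ threshold (dark text) are kept, and brighter pixels are painted white like paper, which weakens light watermarks.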

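Effect of Raising / Lowering `threshold`

Decision rule: a pixel turns white only when gray > threshold, so threshold is the cut-off for "how bright counts as background".

Action	Whitening strength	Pixels turned white	Effect on watermark	Effect on text / OCR
Lower threshold (e.g. 175→155)	Stronger	More mid-gray pixels (e.g. 156-175) also turn white	Removed more cleanly	Faint strokes and edges washed out by the watermark may be eaten away; with a cleaner background, det sometimes finds a whole line more easily
Raise threshold (e.g. 155→175)	Weaker	Only brighter pixels turn white	Diagonal texture and light-gray noise tend to remain	More of each stroke is kept; leftover interference can cause fragmented det boxes and short, wrong text with high scores

Mnemonic:

threshold ↓ → removes light tones more aggressively → whiter background, faint text at risk
threshold ↑ → more conservative → watermark tends to remain, but dark text is safer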
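
A small numeric example may help fix the direction in memory. The sketch below applies the same `gray > threshold → 255` rule to one row of made-up gray values at the two preset thresholds; the sample values and the `whiten_above` name are illustrative only.

```python
import numpy as np


def whiten_above(gray, threshold):
    """Paint every pixel brighter than `threshold` white; keep the rest."""
    cleaned = gray.copy()
    cleaned[gray > threshold] = 255
    return cleaned


# One row of sample pixels: dark stroke, faint stroke, mid-gray values, background.
gray = np.array([[40, 150, 160, 170, 180, 230]], dtype=np.uint8)

at_155 = whiten_above(gray, 155)  # cell preset: more aggressive
at_175 = whiten_above(gray, 175)  # page preset: more conservative

print(at_155.tolist())  # [[40, 150, 255, 255, 255, 255]]
print(at_175.tolist())  # [[40, 150, 160, 170, 255, 255]]
```

The pixels at 160 and 170 are the ones that differ: at 155 they are whitened (cleaner background, but a faint stroke at that level would be lost), while at 175 they survive (safer for strokes, but a watermark at that level remains).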

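Tuning advice (for a single cell, run cell_sweep.py on the original *_raw.png; do not remove the watermark a second time from an already preprocessed debug image):

Sweep 155-175 first, and judge by whether the OCR text is complete and whether the det boxes are stable.
Do not rely on the rec score alone: when threshold is too high, a short but wrong text can come back with a high score (for example only "折取款").
Cell-level and page-level threshold may differ (presets: page=175, cell=155); write each into the YAML according to its own sweep results.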

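Watermark Detection (`detect_watermark`)

Detection uses a two-stage strategy:

Midtone detection: measure the share of pixels with gray values between 100 and 220

Diagonal verification: use the Hough line transform to check for diagonal texture

import cv2
import numpy as np


def detect_watermark(image, midtone_low=100, midtone_high=220, ratio_threshold=0.03):
    """
    Detect whether the image contains a light, diagonal text watermark.
    
    Steps:
    1. Extract midtone pixels (100-220) and compute their share
    2. If the share exceeds 3%, run the diagonal check
    3. Apply Canny edge detection + Hough line transform
    4. Count lines at 30-60°
    """
    gray = to_grayscale(image)  # helper not shown here
    
    # Step 1: midtone detection
    midtone_mask = (gray > midtone_low) & (gray < midtone_high)
    if midtone_mask.sum() / gray.size < ratio_threshold:
        return False
    
    # Step 2: diagonal verification (Canny needs an 8-bit image, not a bool mask)
    edges = cv2.Canny(midtone_mask.astype(np.uint8) * 255, 50, 150)
    lines = cv2.HoughLines(edges, rho=1, theta=np.pi / 180, threshold=80)
    
    # Count diagonal (30-60°) lines; helper not shown here
    diagonal_count = count_diagonal_lines(lines, angle_range=(30, 60))
    return diagonal_count >= 2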
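
`count_diagonal_lines` is referenced above but not shown. The following is one plausible sketch, not the repository's implementation. It assumes the input has the shape returned by `cv2.HoughLines` (a sequence of entries whose first element is a `(rho, theta)` pair, or `None` when nothing is found), that `theta` is the angle of the line's normal in radians, and that "30-60°" refers to the line's inclination folded into 0-90° so that both diagonal directions count.

```python
import math


def count_diagonal_lines(lines, angle_range=(30, 60)):
    """Count Hough lines whose inclination falls inside angle_range (degrees)."""
    if lines is None:  # cv2.HoughLines returns None when no line is found
        return 0
    low, high = angle_range
    count = 0
    for entry in lines:
        rho, theta = entry[0]
        # theta describes the normal; the line itself is rotated by 90 degrees.
        direction = (math.degrees(theta) - 90.0) % 180.0
        # Fold 0-180 into 0-90 so that 45 and 135 degrees are treated alike.
        inclination = min(direction, 180.0 - direction)
        if low <= inclination <= high:
            count += 1
    return count
```

Under these assumptions, horizontal and vertical table rulings (`theta` of π/2 and 0) are ignored, which matters for wired tables where ruling lines dominate the Hough output.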

Watermark Removal API

Recommended (same interface for page level and cell level):

from ocr_utils.watermark import WatermarkProcessor, merge_watermark_config

processor = WatermarkProcessor.from_user_config(
    {"enabled": True, "method": "threshold", "threshold": 155},
    scope="cell",  # or "page"
)
cleaned_bgr, stages = processor.process(cell_bgr_image, force=True)
# stages may contain "wm", "contrast", etc., for use in the debug JSON

Low-level (compatible with legacy code):

from ocr_utils.watermark import merge_watermark_config, remove_watermark_from_image_rgb

out = remove_watermark_from_image_rgb(
    image,
    threshold=175,
    watermark_removal_cfg=merge_watermark_config("page", {"method": "threshold"}),
)

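Parameters (`method: threshold`)

Parameter	page preset	cell preset	Description
`threshold`	175	155	See "Raising / Lowering" above; larger is more conservative, smaller whitens more
`morph_close_kernel`	0	0	Kernel size for morphological closing; 0 = off (recommended, since closing on a non-binary image tends to introduce noise)
`detect_before_remove`	true	false	Page level can detect first and then remove; cell level usually processes directly with `force=True`
`contrast_enhancement`	on by default	off by default	`text_restore` after watermark removal; off by default at cell level, enable only when needed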
Stage Three: Cell-Level Second-Pass OCR Preprocessing

Flow

Table image raw_crop
  → WatermarkProcessor(cell)     # wm
  → optional denoise / contrast  # YAML switches
  → upscale (light.upscale_min_side, e.g. 192)
  → det line splitting / whole as fallback OCR
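
The upscale step is described only by its parameter name. One reasonable reading, offered here as an assumption rather than the actual behavior of text_filling.py, is that a crop whose shorter side is below `upscale_min_side` is enlarged with its aspect ratio preserved until the shorter side reaches that value, and larger crops are left alone. The function name below is hypothetical.

```python
def upscale_target_size(width, height, min_side=192):
    """Return the (width, height) after enlarging so the short side >= min_side."""
    shortest = min(width, height)
    if shortest <= 0:
        raise ValueError("width and height must be positive")
    if shortest >= min_side:
        return width, height  # already large enough; never downscale
    scale = min_side / shortest
    # Round to whole pixels; the short side lands exactly on min_side.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(upscale_target_size(300, 48))   # (1200, 192)
print(upscale_target_size(400, 200))  # (400, 200)
```

A typical bank statement cell is wide and short, so the height is usually the side that triggers the enlargement.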

Debug output (tablecell_ocr/):

File	Meaning
`cellNNN__.png`	Preprocessed image that is sent to OCR
`cellNNN___raw.png`	Original crop before watermark removal (input for `cell_sweep` tuning)
`cellNNN__.json`	Contains `preprocess_stages`, `debug_images`, `lines`/`whole`, etc.

Parameter Exploration Tool

cd ocr_tools/cell_preprocess_lab

# Sweep a single cell (*_raw.png is preferred automatically)
python cell_sweep.py /path/to/cell219_empty_empty_raw.png \
  -o ./output/cell219_sweep -t "ATM存折取款"

# Batch over a tablecell_ocr directory
python cell_sweep.py /path/to/tablecell_ocr/ -o ./sweep_out --quick

The report sweep_report.json records, for each combination, text, score (weighted recognition score), and boxes[] (per-box scores).
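
Because a high score can accompany a truncated text, it can help to rank sweep results by text correctness first and score second. The sketch below works on an in-memory list of entries; the exact layout of sweep_report.json is not documented here, so the assumption that each entry is a dict with `text` and `score` keys, plus a hypothetical `threshold` key identifying the combination, is illustrative only.

```python
def rank_sweep_entries(entries, expected_text=None):
    """Sort entries so exact text matches come first, then by descending score."""
    def sort_key(entry):
        exact = expected_text is not None and entry.get("text") == expected_text
        return (exact, entry.get("score", 0.0))
    return sorted(entries, key=sort_key, reverse=True)


# Made-up entries that mirror the failure mode described in this document.
entries = [
    {"threshold": 175, "text": "折取款", "score": 0.98},
    {"threshold": 155, "text": "ATM存折取款", "score": 0.93},
]

best = rank_sweep_entries(entries, expected_text="ATM存折取款")[0]
print(best["threshold"])  # 155
```

Without `expected_text`, the ranking falls back to score alone and would pick the truncated result, which is exactly the trap the tuning advice warns about.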

Configuration

Full Configuration Example (`bank_statement_yusys_local.yaml`)

input:
  txt_pdf_watermark_removal:
    enabled: true
    sample_pages: 3

preprocessor:
  order: orient_first
  watermark_removal:
    enabled: false              # cell-first: page level can stay off
    detect_before_remove: true
    method: threshold
    threshold: 175              # page-level preset default is 175
    contrast_enhancement:
      enabled: false

table_recognition_wired:
  second_pass_ocr:
    suspicious_short_min_chars: 4
    cell_preprocess:
      watermark:
        enabled: true
        method: threshold
        threshold: 155          # best written explicitly; if omitted, the cell preset 155 applies
      denoise:
        enabled: false
      contrast:
        enabled: false          # Pass1: optional text_restore
      light:
        upscale_min_side: 192
    enhance_retry:
      enabled: false            # Pass2 enhanced retry (sibling of cell_preprocess)

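Configuration Reference

Config path	Description
`input.txt_pdf_watermark_removal.*`	PDF XObject watermark removal
`preprocessor.watermark_removal.*`	Page-level `WatermarkProcessor(scope=page)`
`preprocessor.watermark_removal.method`	`threshold` | `masked` | `masked_adaptive`
`preprocessor.watermark_removal.threshold`	Used by the `threshold` method only; see "Raising / Lowering"
`second_pass_ocr.cell_preprocess.watermark.*`	Cell-level `WatermarkProcessor(scope=cell)`
`second_pass_ocr.cell_preprocess.light.upscale_min_side`	Enlarges the shortest side after watermark removal
`second_pass_ocr.enhance_retry`	Pass2 preprocessing (a sibling of `cell_preprocess`, not a child of it)

Notes:

morph_close_kernel is already 0 in the preset, so it generally does not need to appear in the YAML.
Configure the cell-level threshold explicitly after a sweep; do not assume it matches the page level.
A stage runs only when enabled: true; the page-level and cell-level switches are independent.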

Trigger Conditions

Stage One Trigger Conditions

# pipeline_manager_v2.py: process_document()

if is_pdf:
    wm_cfg = config.get('input', {}).get('txt_pdf_watermark_removal', {})
    if wm_cfg.get('enabled', False):                    # condition 1
        if scan_pdf_watermark_xobjs(pdf_bytes, sample_pages=3):  # condition 2
            cleaned = remove_txt_pdf_watermark(pdf_bytes)

Trigger conditions:

The file is a PDF
enabled: true
The scan finds a watermark XObject

Stage Two Trigger Conditions

# pipeline_manager_v2.py: _process_single_page()

if pdf_type == 'ocr':  # condition 1: scanned documents only
    detection_image, angle = self.preprocessor.process(original_image)

# mineru_adapter.py: MinerUPreprocessor.process()

processor = WatermarkProcessor.from_user_config(wm_cfg, scope="page")
if processor.enabled:
    image, _ = processor.process(image)  # applies detect_before_remove + method internally

Trigger conditions:

The PDF type is ocr (scanned document)
preprocessor.watermark_removal.enabled: true

Cell-level second-pass OCR (text_filling.py): when the table body triggers second-pass OCR, raw_crop is passed to _preprocess_cell_for_ocr → WatermarkProcessor(scope="cell").
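
The three sets of trigger conditions can be summarized in one small decision function. This is a sketch for reasoning about configurations, not pipeline code: the function name, the returned stage labels (`"pdf"`, `"page"`, `"cell"`), and the two boolean inputs standing in for the scan result and the second-pass trigger are all hypothetical. It also assumes that plain image input is classified with `pdf_type == "ocr"`.

```python
def plan_watermark_stages(is_pdf, pdf_type, config,
                          pdf_scan_hit=False, second_pass_triggered=False):
    """Return the watermark stages that would run for the given inputs."""
    stages = []

    # Stage one: PDF file, switch enabled, and the pre-scan found a watermark.
    pdf_cfg = config.get("input", {}).get("txt_pdf_watermark_removal", {})
    if is_pdf and pdf_cfg.get("enabled", False) and pdf_scan_hit:
        stages.append("pdf")

    # Stage two: scanned input and the page-level switch enabled.
    page_cfg = config.get("preprocessor", {}).get("watermark_removal", {})
    if pdf_type == "ocr" and page_cfg.get("enabled", False):
        stages.append("page")

    # Cell level: second-pass OCR was triggered and the cell switch is enabled.
    cell_cfg = (
        config.get("table_recognition_wired", {})
        .get("second_pass_ocr", {})
        .get("cell_preprocess", {})
        .get("watermark", {})
    )
    if second_pass_triggered and cell_cfg.get("enabled", False):
        stages.append("cell")

    return stages
```

With a cell-first configuration (page-level switch off, cell-level switch on), a scanned page that triggers second-pass OCR yields only `["cell"]`.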

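Comparison of Levels

Dimension	PDF level	Page-level image	Cell-level image
Target	Text-based PDF XObjects	Full-page rendered image	Cell crop
Config	`input.txt_pdf_*`	`preprocessor.watermark_removal`	`second_pass_ocr.cell_preprocess.watermark`
Default threshold preset	n/a	175	155
Preserves the PDF text layer	✅	n/a	n/a
When it runs	Before rendering	Before Layout/OCR	Before in-cell second-pass OCR
Dependency	PyMuPDF	OpenCV	OpenCV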

Code Integration

Pipeline Integration

# pipeline_manager_v2.py

from ocr_utils.watermark import (
    scan_pdf_watermark_xobjs,
    remove_txt_pdf_watermark,
)

class EnhancedDocPipeline:
    def process_document(self, doc_path):
        # Stage one: PDF-level watermark removal
        _pdf_bytes_override = None
        if is_pdf and config['input']['txt_pdf_watermark_removal']['enabled']:
            raw_bytes = doc_path.read_bytes()
            if scan_pdf_watermark_xobjs(raw_bytes):
                _pdf_bytes_override = remove_txt_pdf_watermark(raw_bytes)
        
        # Render the PDF (using the watermark-free bytes when available)
        images, pdf_type, pdf_doc = PDFUtils.load_and_classify_document(
            doc_path, pdf_bytes=_pdf_bytes_override
        )
        
        # Process page by page
        for page_idx, original_image in enumerate(images):
            # Stage two: image-level watermark removal (inside preprocessor.process)
            if pdf_type == 'ocr':
                detection_image, angle = self.preprocessor.process(original_image)
            
            # Layout detection, OCR...

Preprocessor Integration

# models/adapters/mineru_adapter.py

from ocr_utils.watermark import WatermarkProcessor

class MinerUPreprocessor:
    def process(self, image):
        wm_cfg = self.config.get("watermark_removal") or {}
        processor = WatermarkProcessor.from_user_config(wm_cfg, scope="page")
        if processor.enabled:
            image, _ = processor.process(image)
        # Orientation correction ...
        return image, angle

Cell-Level Second-Pass OCR Integration

# models/adapters/wired_table/text_filling.py

self._cell_wm_processor = WatermarkProcessor.from_user_config(wm_user, scope="cell")

cell_img, stages = self._preprocess_cell_for_ocr(raw_crop, mode="light")
# Example stages: ["wm", "upscale"] or ["wm", "contrast", "upscale"]

Usage Examples

Command Line

# Process a watermarked bank statement PDF
python main_v2.py -i bank_statement.pdf -c config/bank_statement_yusys_v4.yaml --scene bank_statement

# Already enabled in the config file:
# input.txt_pdf_watermark_removal.enabled: true
# preprocessor.watermark_removal.enabled: true

Python API

from core.pipeline_manager_v2 import EnhancedDocPipeline

# Use a YAML file that contains the watermark removal configuration
with EnhancedDocPipeline("config/bank_statement_yusys_v4.yaml") as pipeline:
    results = pipeline.process_document("document.pdf")

Debugging and Verification

Log Output

# Stage one logs
logger.info(f"🧹 文字型 PDF 原生去水印完成（{doc_path.name}）")
logger.debug(f"  [Form XObject] 清空水印 xref={xref}, name={name}")
logger.debug(f"  [Image XObject] 替换水印图像 xref={img_xref}")

# Stage two logs
logger.info(f"🧹 Watermark removed (threshold={threshold})")

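Visual Verification

In debug mode, the result can be verified by comparing images before and after watermark removal:

# Enable debug mode
python main_v2.py -i doc.pdf -c config.yaml --scene bank_statement --debug

# Output files:
# {doc}_pdf_page_001.png  - rendered page image (after watermark removal)
# {doc}_page_001_layout.png - Layout visualization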

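Notes

Three complementary levels: the PDF level, page level, and cell level can each be switched on or off independently; for bank statements cell-first is recommended (page-level wm off, cell-level wm on).
Grayscale direction: 0 = black, 255 = white; gray > threshold → 255 means pixels brighter than the threshold are painted white.
threshold direction: raising it is more conservative (watermark tends to remain, less damage to text); lowering it is more aggressive (cleaner background, faint strokes may be eaten away). Tune the page level and cell level separately.
Do not confuse it with the det threshold: ocr_recognition.det_threshold filters OCR detection boxes and is unrelated to the watermark removal threshold.
Tuning input: cell_sweep.py should be given *_raw.png (the original crop); do not sweep an already preprocessed cell*_empty_empty.png, because that removes the watermark twice and distorts the conclusions.
Morphology: morph_close_kernel=0 in the preset; closing is not recommended on non-binary images.
Dependencies: the PDF level requires PyMuPDF; the image levels require OpenCV.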

References

ocr_utils/watermark/: implementation package (presets / removal / processor / pdf)
ocr_utils/watermark_utils.py: backward-compatible re-export
ocr_tools/cell_preprocess_lab/cell_sweep.py: cell-level parameter sweep
models/adapters/mineru_adapter.py: page-level preprocessing
models/adapters/wired_table/text_filling.py: cell-level second-pass OCR
config/bank_statement_yusys_local.yaml: scenario configuration example
