PaddleOCR-VL Table Text Loss and the OTSL Runtime Patch

Related documents: paddleocr_vl 1.6->GGUF.md, llama.cpp配置说明.md

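1. Background and Symptoms

A PaddleOCR-VL-1.6 model is deployed in GGUF form on a locally compiled llama.cpp (llama-server). When universal_doc_parser parses bank statement images through the bank_statement_yusys_paddleocr_local pipeline, the tables in the output JSON (for example 陈3_微信图_page_001.json) contain only the table structure (the <tr>/<td> skeleton), and every cell's text is empty.

Key observations:

The llama-server log shows that the model did generate a large number of tokens containing text, so the model is not failing to produce output.

The raw model output (_PredictResult.text) is OTSL with the text present, for example: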

交易明细对应时间段<fcel>2023-08-12 00:00:00至2024-08-11 23:59:59<lcel><lcel>...<nl><fcel>具体交易明细<lcel>...<nl><fcel>交易单号<fcel>交易时间<fcel>...

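In other words, the text is complete at the model output stage and is lost during post-processing, when the OTSL is converted to HTML.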

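2. Root Cause Analysis

2.1 OTSL and the conversion entry point

PaddleOCR-VL represents tables in OTSL (Optimized Table Structure Language). Its structural tokens are <nl> (row break), <fcel> (a cell with content), <ecel> (an empty cell), and <lcel>/<ucel>/<xcel> (column span / row span / span in both directions).

Post-processing is handled by the third-party library mineru_vl_utils. The OTSL→HTML conversion function is mineru_vl_utils/post_process/otsl2html.py:convert_otsl_to_html.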

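2.2 Direct cause of the text loss

convert_otsl_to_html internally calls, in order:

otsl_extract_tokens_and_text(otsl_content) → splits the input into tokens and mixed_texts;
otsl_parse_texts(mixed_texts, tokens) → uses text_idx to fill the text back into each TableCell.

The backfill logic in otsl_parse_texts assumes that mixed_texts begins with a structural token. In PaddleOCR-VL's output, however, the first cell of the whole table lacks its leading <fcel> token (the example above starts directly with the plain text 「交易明细对应时间段」).

As a result, text_idx is misaligned from the very first step and never recovers. No subsequent cell picks up its text, so every cell in table_cells ends up with an empty string as its text, and only the table skeleton remains.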
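To check whether a given raw output hits this precondition, the input can be split into structural tokens and text and its first element inspected. The sketch below is a toy splitter written for this note, not the library's otsl_extract_tokens_and_text; the names split_otsl and starts_with_struct_token are hypothetical, and the token set is assumed to be the one listed in Section 2.1.

```python
import re

# Structural tokens listed in Section 2.1 (assumed complete for this check).
# The capturing group makes re.split keep the tokens in the result.
_STRUCT = re.compile(r"(<fcel>|<ecel>|<lcel>|<ucel>|<xcel>|<nl>)")


def split_otsl(otsl):
    """Split OTSL into structural tokens and text fragments, in order."""
    return [part for part in _STRUCT.split(otsl) if part]


def starts_with_struct_token(otsl):
    """True if the first non-blank element is a structural token."""
    parts = split_otsl(otsl.lstrip())
    return bool(parts) and _STRUCT.fullmatch(parts[0]) is not None


if __name__ == "__main__":
    raw = "交易明细对应时间段<fcel>X<lcel><nl>"
    print(split_otsl(raw))
    print(starts_with_struct_token(raw))
```

For an output shaped like the example in Section 1, the first element is plain text, so starts_with_struct_token returns False; that is the case the patch described later has to normalize.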

2.3 Full call chain

adapter.content_extract()                       # mineru_adapter.py:436
  → MinerUClient.content_extract()              # mineru_client.py:832
      blocks[0].content = output.text           # ← raw OTSL, text still present
      → helper.post_process()                   # :845
        → post_process() → simple_process()     # post_process/__init__.py:150
          → convert_otsl_to_html(content)       # __init__.py:95  ← text is lost here

Key conclusion: the text is already gone inside convert_otsl_to_html, so post-processing the final HTML cannot recover it (the empty <td> elements no longer hold any text). The fix has to run before this function executes.

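3. Choosing an Approach

| Approach | Notes | Verdict |
| --- | --- | --- |
| Edit `chat_template.jinja` | The template only formats the input prompt and has no control over the model's output | ❌ Ineffective |
| Post-process the final HTML to restore text | The text is already lost during conversion, so there is no source to restore from | ❌ Not feasible |
| Edit the `site-packages` source directly | Lost on upgrade or reinstall; team members fall out of sync | ❌ Temporary only |
| Fork and `pip install -e` our own branch | Highest maintenance cost, and requires changing a third-party project | ⚠️ Too heavy |
| Runtime monkey patch (adopted) | No third-party source changes, versioned with this repository, can be toggled, survives upgrades | ✅ Adopted |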

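Why the monkey patch targets `post_process.convert_otsl_to_html`

post_process/__init__.py begins with from .otsl2html import convert_otsl_to_html, and its internal simple_process / _convert_pure_table_content_to_html look the function up at call time as a module-level global of mineru_vl_utils.post_process.

So replacing the name mineru_vl_utils.post_process.convert_otsl_to_html is enough to intercept every internal call in the library (__init__.py:72 and :95) without touching any source. In this repository, mineru_adapter.py only calls content_extract / batch_content_extract and never imports the function itself, so no other namespace needs to be patched.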
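The lookup behavior this relies on can be reproduced without the library. The sketch below builds two throwaway modules (fake_pkg and fake_pkg.otsl2html, both hypothetical stand-ins, as are the function names) and shows that rebinding the name in the importing module's namespace changes what that module's functions call, while the defining module is left untouched.

```python
import types

# Stand-in for otsl2html.py: the module that defines the function.
impl = types.ModuleType("fake_pkg.otsl2html")
exec("def convert(s):\n    return 'orig:' + s\n", impl.__dict__)

# Stand-in for post_process/__init__.py: it binds the name once at import
# time, and its own functions resolve that name in this module's globals
# every time they are called.
pkg = types.ModuleType("fake_pkg")
pkg.convert = impl.convert  # same effect as `from .otsl2html import convert`
exec("def simple_process(s):\n    return convert(s)\n", pkg.__dict__)

before = pkg.simple_process("x")

# Rebind the name in the importing module only.
original = pkg.convert
pkg.convert = lambda s: "patched:" + original(s)

after = pkg.simple_process("x")   # now routed through the replacement
untouched = impl.convert("x")     # the defining module still has the original

if __name__ == "__main__":
    print(before, after, untouched)
```

The same mechanics explain the caveat about other namespaces: any caller that had run its own from-import of the function would hold a separate binding and would have to be patched separately.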

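4. Final Implementation

4.1 Patch module

A new file, ocr_tools/universal_doc_parser/models/adapters/_mineru_vl_patches.py, holds the core logic: before calling the original convert_otsl_to_html, if the OTSL starts with plain text (neither <table nor a structural token), prepend a leading <fcel>: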

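def _make_otsl_normalizer(orig_convert):
    # _OTSL_STRUCT_TOKENS is defined at module level (not shown here); it is
    # expected to be a tuple, since str.startswith accepts a tuple of prefixes.
    def _normalize_then_convert(otsl_content):
        if isinstance(otsl_content, str):
            stripped = otsl_content.lstrip()
            # Plain text first: neither HTML (<table) nor a structural token.
            if (stripped
                    and not stripped.startswith("<table")
                    and not stripped.startswith(_OTSL_STRUCT_TOKENS)):
                otsl_content = "<fcel>" + stripped
        return orig_convert(otsl_content)
    # Expose the original callable, following the functools.wraps convention.
    _normalize_then_convert.__wrapped__ = orig_convert
    return _normalize_then_convert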

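The patch is applied through apply_once(), which has these properties:

Idempotent: a module-level _applied flag ensures the patch is only installed on the first call.
Fails loudly: if the upstream interface is renamed or convert_otsl_to_html cannot be found, it raises RuntimeError, so the patch cannot silently stop working and let text loss return.
Double coverage: it replaces both post_process.convert_otsl_to_html (the one that matters) and otsl2html.convert_otsl_to_html (as a fallback).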
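These three properties can be sketched as follows. This is an illustration under stated assumptions, not the contents of _mineru_vl_patches.py: the real apply_once() takes no arguments, and the wrap, targets and import_module parameters are added here only so the sketch can be exercised against stand-in modules.

```python
import importlib

_applied = False  # module-level flag that makes apply_once() idempotent
_FUNC = "convert_otsl_to_html"
_TARGETS = (
    "mineru_vl_utils.post_process",            # namespace the library's callers use
    "mineru_vl_utils.post_process.otsl2html",  # defining module, patched as a fallback
)


def apply_once(wrap, targets=_TARGETS, import_module=importlib.import_module):
    """Install wrap(original) over _FUNC in every target module, at most once.

    Returns True when this call installed the patch and False when it was
    already in place. Raises RuntimeError if any target cannot be patched.
    (Hypothetical signature; see the assumptions stated above.)
    """
    global _applied
    if _applied:
        return False
    modules = []
    # Validate every target before changing anything, so a failure
    # never leaves the library half-patched.
    for name in targets:
        try:
            module = import_module(name)
        except ImportError as exc:
            raise RuntimeError(f"cannot import {name}") from exc
        if not callable(getattr(module, _FUNC, None)):
            raise RuntimeError(f"{name}.{_FUNC} not found; upstream API may have changed")
        modules.append(module)
    # Wrap the original once so both namespaces share one patched callable.
    patched = wrap(getattr(modules[0], _FUNC))
    for module in modules:
        setattr(module, _FUNC, patched)
    _applied = True
    return True
```

Validating all targets before the first setattr is a design choice of this sketch: a rename in only one of the two modules then produces a clean RuntimeError instead of a partially applied patch.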

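4.2 Call site (placed in `__init__` to cover both the mineru and paddle paths)

⚠️ Production actually runs PaddleVLRecognizer (module: paddle). It inherits from MinerUVLRecognizer but overrides initialize() (constructing MinerUClient(...) directly, without calling the parent's initialize). A patch call placed only in MinerUVLRecognizer.initialize() would therefore never run on the paddle path.

Solution: put the patch call in MinerUVLRecognizer.__init__. PaddleVLRecognizer.__init__ reaches it through super().__init__(config), so all three paths, mineru / paddle / glmocr (all of which inherit from this base class), apply the patch when the recognizer is created, before any content_extract call.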

class MinerUVLRecognizer(BaseVLRecognizer):
    def __init__(self, config):
        super().__init__(config)
        if not MINERU_AVAILABLE:
            raise ImportError("MinerU components not available")
        self.vlm_model = None
        self.max_image_size = config.get('max_image_size', 1568)
        self.resize_mode = config.get('resize_mode', 'max')

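        # Apply the mineru_vl_utils runtime patch (paddle overrides initialize, but its __init__ reaches this point via super)
        try:
            from ._mineru_vl_patches import apply_once as _apply_mineru_vl_patches
            _apply_mineru_vl_patches()
        except Exception as e:
            # A patch failure must not block recognizer creation: fall back to default behavior, but warn explicitly.
            # The message reads: "failed to apply mineru_vl_utils patch (falling back to default behavior; tables may lose text)".
            logger.warning(f"应用 mineru_vl_utils 补丁失败（退回默认行为，表格可能丢字）: {e}")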

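On failure semantics: apply_once() itself fails loudly (an upstream rename, for example, raises RuntimeError). The adapter call site then wraps it in try/except and downgrades the exception to logger.warning, so that a fix-up patch cannot take down recognizer creation as a whole.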

5. Verification

Load the patch in the mineru environment and verify it against a real OTSL fragment:

conda run -n mineru python -c "
import mineru_vl_utils.post_process as pp
import importlib.util
spec = importlib.util.spec_from_file_location('_mineru_vl_patches', '_mineru_vl_patches.py')
m = importlib.util.module_from_spec(spec); spec.loader.exec_module(m)
print('apply_once ->', m.apply_once())          # True
print('apply_once again ->', m.apply_once())    # False (idempotent)
otsl = '交易明细对应时间段<fcel>X<lcel><nl><fcel>具体交易明细<lcel><nl>'
html = pp.convert_otsl_to_html(otsl)
print('首格文字保留:', '交易明细对应时间段' in html)   # True (the label means: first-cell text preserved)
print(html[:120])
"

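Output (excerpt; the first line is a log message reporting that the mineru_vl_utils patch normalizing the leading <fcel> of the table's first cell has been applied):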

已应用 mineru_vl_utils 补丁：OTSL 整表首格 <fcel> 归一化
apply_once -> True
apply_once again -> False
首格文字保留: True
<table><tr><td>交易明细对应时间段</td><td colspan="2">X</td></tr>...

The first cell's text, 交易明细对应时间段, is preserved correctly, which confirms the fix.

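6. Maintenance Notes

Do not edit the site-packages / mineru-vl-utils source again. The temporary changes have been reverted to the original state; all fixes live in _mineru_vl_patches.py and are versioned with this repository.
After upgrading mineru_vl_utils, rerun the verification script in Section 5. If upstream has fixed this issue, consider removing the patch; if convert_otsl_to_html has been renamed, apply_once() raises RuntimeError to flag it.
Add any new runtime fixes for third-party code to _mineru_vl_patches.py and chain them through apply_once(), keeping patches centralized, switchable, and traceable.
If other paths, such as the Layout detector, later trigger OTSL conversion directly, the patch still applies because it is process-wide; it is enough for apply_once() to have been called when that path initializes (the call is idempotent and safe to repeat).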
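The post-upgrade check can also be expressed as a small helper that reports both whether the patch is installed and whether the library would now keep the text on its own. This is a sketch, assuming the wrapper exposes the original callable as __wrapped__ as in Section 4.1; audit_patch and its result keys are hypothetical names, not part of the repository.

```python
SAMPLE = "交易明细对应时间段<fcel>X<lcel><nl><fcel>具体交易明细<lcel><nl>"
FIRST_CELL = "交易明细对应时间段"


def audit_patch(convert, sample=SAMPLE, expected=FIRST_CELL):
    """Report patch status for a convert_otsl_to_html-like callable."""
    original = getattr(convert, "__wrapped__", None)
    unpatched = original if original is not None else convert
    return {
        # The normalizer from Section 4.1 sets __wrapped__ on its wrapper.
        "patched": original is not None,
        # What callers currently get for OTSL without a leading <fcel>.
        "text_preserved": expected in convert(sample),
        # Whether the library keeps the text on its own, without the patch.
        "upstream_fixed": expected in unpatched(sample),
    }
```

In the real environment it would be called on mineru_vl_utils.post_process.convert_otsl_to_html after apply_once() has run. A result with upstream_fixed set to True is the signal that the patch may be removable.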

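7. Files Involved

| File | Change |
| --- | --- |
| `models/adapters/_mineru_vl_patches.py` | Added: runtime patch module |
| `models/adapters/mineru_adapter.py` | `MinerUVLRecognizer.__init__` now calls `apply_once()` (covers the paddle inheritance path) |
| `models/adapters/paddle_vl_adapter.py` | No change needed: `PaddleVLRecognizer` applies the patch automatically via `super().__init__` |
| `site-packages/.../otsl2html.py` | Reverted to the original state (temporary changes removed) |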
