fix: add environment variable configurations for Chinese formula parsing and table merging features

myhloli 3 weeks ago
parent
commit
b5922086cb
4 changed files with 34 additions and 2 deletions
  1. README.md (+12 -0)
  2. README_zh-CN.md (+1 -0)
  3. docs/en/usage/cli_tools.md (+12 -2)
  4. docs/zh/usage/cli_tools.md (+9 -0)

+ 12 - 0
README.md

@@ -44,6 +44,18 @@
 </div>
 
 # Changelog
+- 2025/10/24 2.6.0 Release
+  - `pipeline` backend optimizations
+    - Added experimental support for Chinese formulas, which can be enabled with `export MINERU_FORMULA_CH_SUPPORT=1`. This feature may slightly reduce MFR speed and cause recognition of some long formulas to fail, so enable it only when Chinese formulas need to be parsed. To disable it, set the environment variable to `0`.
+    - `OCR` speed significantly improved by 200%~300%, thanks to the optimization solution provided by @cjsdurj
+    - `OCR` models updated to `ppocr-v5` version for Cyrillic, Arabic, Devanagari, Telugu (te), and Tamil (ta) languages, with accuracy improved by over 40% compared to previous models
+  - `vlm` backend optimizations
+    - `table_caption` and `table_footnote` matching logic optimized, improving caption and footnote matching accuracy and giving a more sensible reading order when a page contains multiple consecutive tables
+    - Optimized CPU resource usage under high concurrency when using the `vllm` backend, reducing server load
+    - Adapted to `vllm` version 0.11.0
+  - General optimizations
+    - Cross-page table merging improved: added support for merging cross-page continuation tables, which improves merge results in multi-column merge scenarios
+    - Added environment variable `MINERU_TABLE_MERGE_ENABLE` to control the table merging feature. Table merging is enabled by default and can be disabled by setting this variable to `0`
 
 - 2025/09/26 2.5.4 released
   - 🎉🎉 The MinerU2.5 [Technical Report](https://arxiv.org/abs/2509.22186) is now available! We welcome you to read it for a comprehensive overview of its model architecture, training strategy, data engineering and evaluation results.
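
The two environment variables introduced in the 2.6.0 changelog entry can also be set per invocation instead of exporting them in the shell. The following is a minimal Python sketch under stated assumptions: `build_mineru_env` and `run_mineru` are hypothetical helper names, and the command shape `mineru -p <input> -o <output> -b pipeline` is assumed here rather than taken from this commit. Only the values `1` and `0` mentioned in the changelog are used.

```python
import os
import subprocess


def build_mineru_env(formula_ch_support=False, table_merge=True, base_env=None):
    """Return a copy of the environment with the two changelog flags applied.

    Hypothetical helper for illustration; it is not part of MinerU.
    """
    env = dict(os.environ if base_env is None else base_env)
    # The changelog documents `1` to enable and `0` to disable this flag.
    env["MINERU_FORMULA_CH_SUPPORT"] = "1" if formula_ch_support else "0"
    if table_merge:
        # Table merging is on by default, so leave the variable unset.
        env.pop("MINERU_TABLE_MERGE_ENABLE", None)
    else:
        # The changelog documents `0` as the value that disables merging.
        env["MINERU_TABLE_MERGE_ENABLE"] = "0"
    return env


def run_mineru(input_path, output_dir, **flags):
    """Run the CLI with the flags applied (assumed command shape)."""
    cmd = ["mineru", "-p", input_path, "-o", output_dir, "-b", "pipeline"]
    return subprocess.run(cmd, env=build_mineru_env(**flags), check=True)
```

Passing the environment to `subprocess.run` keeps the setting local to one run, which fits the changelog's advice to enable Chinese formula support only when it is needed.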

+ 1 - 0
README_zh-CN.md

@@ -52,6 +52,7 @@
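  - `vlm` backend optimizations
    - `table_caption` and `table_footnote` matching logic optimized, improving the accuracy of table caption and footnote matching and the reading order when a page contains multiple consecutive tables
    - Optimized CPU resource usage under high concurrency when using the `vllm` backend, reducing server load
+    - Adapted to `vllm` version 0.11.0
  - General optimizations
    - Cross-page table merging improved: added support for merging cross-page continuation tables, improving merge results in multi-column merge scenarios
    - Added environment variable configuration option `MINERU_TABLE_MERGE_ENABLE` for the table merging feature. Table merging is enabled by default and can be disabled by setting this variable to `0`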

+ 12 - 2
docs/en/usage/cli_tools.md

@@ -87,6 +87,16 @@ Here are the environment variables and their descriptions:
     * Used to enable formula parsing
     * defaults to `true`, can be set to `false` through environment variables to disable formula parsing.
   
-- `MINERU_TABLE_ENABLE`: 
+- `MINERU_FORMULA_CH_SUPPORT`:
+    * Used to enable Chinese formula parsing optimization (experimental feature)
+    * Defaults to `false`; set it to `true` via the environment variable to enable Chinese formula parsing optimization.
+    * Only effective for the `pipeline` backend.
+  
+- `MINERU_TABLE_ENABLE`:
     * Used to enable table parsing
-    * defaults to `true`, can be set to `false` through environment variables to disable table parsing.
+    * Defaults to `true`; set it to `false` via the environment variable to disable table parsing.
+
+- `MINERU_TABLE_MERGE_ENABLE`:
+    * Used to enable table merging functionality
+    * Defaults to `true`; set it to `false` via the environment variable to disable table merging functionality.
+
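
The README changelog gives `1`/`0` as the values for these switches, while the entries above give `true`/`false`. As a minimal sketch of how a boolean switch with a default could be read so that both spellings work, here is a hypothetical `env_flag` helper. It is an assumption for illustration only and does not describe how MinerU actually parses these variables.

```python
import os

# Spellings taken from the README (`1`/`0`) and the CLI docs (`true`/`false`).
_TRUE_VALUES = {"1", "true"}
_FALSE_VALUES = {"0", "false"}


def env_flag(name, default, environ=None):
    """Read a boolean switch from the environment, falling back to `default`.

    Hypothetical helper; unrecognized values also fall back to the default.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    return default


# Defaults as documented: Chinese formula support off, table merging on.
formula_ch = env_flag("MINERU_FORMULA_CH_SUPPORT", default=False)
table_merge = env_flag("MINERU_TABLE_MERGE_ENABLE", default=True)
```

Falling back to the documented default on an unrecognized value is one possible design choice; a stricter variant could raise an error instead.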

+ 9 - 0
docs/zh/usage/cli_tools.md

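@@ -81,7 +81,16 @@ Some parameters of the MinerU command-line tool have environment variable configurations with the same function,
 - `MINERU_FORMULA_ENABLE`:
     * Used to enable formula parsing
     * Defaults to `true`; can be set to `false` via environment variable to disable formula parsing.
+
+- `MINERU_FORMULA_CH_SUPPORT`:
+    * Used to enable Chinese formula parsing optimization (experimental feature)
+    * Defaults to `false`; can be set to `true` via environment variable to enable Chinese formula parsing optimization.
+    * Only effective for the `pipeline` backend.
  
 - `MINERU_TABLE_ENABLE`:
     * Used to enable table parsing
     * Defaults to `true`; can be set to `false` via environment variable to disable table parsing.
+
+- `MINERU_TABLE_MERGE_ENABLE`:
+    * Used to enable table merging functionality
+    * Defaults to `true`; can be set to `false` via environment variable to disable table merging functionality.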