
Merge pull request #3819 from myhloli/dev

update docs
Xiaomeng Zhao, 3 weeks ago
Commit
dd92c5b723

+ 12 - 0
README.md

@@ -44,6 +44,18 @@
 </div>
 
 # Changelog
+- 2025/10/24 2.6.0 Release
+  - `pipeline` backend optimizations
+    - Added experimental support for Chinese formulas, which can be enabled by setting the environment variable `export MINERU_FORMULA_CH_SUPPORT=1`. This feature may cause a slight decrease in MFR speed and failures in recognizing some long formulas. It is recommended to enable it only when parsing Chinese formulas is needed. To disable this feature, set the environment variable to `0`.
+    - `OCR` speed significantly improved by 200%~300%, thanks to the optimization solution provided by @cjsdurj
+    - `OCR` models updated to `ppocr-v5` version for Cyrillic, Arabic, Devanagari, Telugu (te), and Tamil (ta) languages, with accuracy improved by over 40% compared to previous models
+  - `vlm` backend optimizations
+    - Optimized the `table_caption` and `table_footnote` matching logic, improving caption and footnote matching accuracy and producing a more sensible reading order on pages with multiple consecutive tables
+    - Optimized CPU resource usage during high concurrency when using `vllm` backend, reducing server pressure
+    - Adapted to `vllm` version 0.11.0
+  - General optimizations
+    - Improved cross-page table merging and added support for merging continuation tables across pages, which gives better results in multi-column merge scenarios
+    - Added environment variable configuration option `MINERU_TABLE_MERGE_ENABLE` for table merging feature. Table merging is enabled by default and can be disabled by setting this variable to `0`
 
 - 2025/09/26 2.5.4 released
   - 🎉🎉 The MinerU2.5 [Technical Report](https://arxiv.org/abs/2509.22186) is now available! We welcome you to read it for a comprehensive overview of its model architecture, training strategy, data engineering and evaluation results.

+ 13 - 1
README_zh-CN.md

@@ -44,7 +44,19 @@
 </div>
 
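 # Changelog
-
+- 2025/10/24 2.6.0 released
+  - `pipeline` backend optimizations
+    - Added experimental support for Chinese formulas, enabled by setting the environment variable `export MINERU_FORMULA_CH_SUPPORT=1`. This feature may slightly reduce MFR speed and cause some long formulas to fail recognition, so it is recommended only when Chinese formulas need to be parsed. To disable it, set the environment variable to `0`.
+    - `OCR` speed improved significantly, by 200%~300%, thanks to the optimization contributed by @cjsdurj
+    - `OCR` models for Cyrillic (cyrillic), Arabic (arabic), Devanagari (devanagari), Telugu (te), and Tamil (ta) updated to `ppocr-v5`, with accuracy improved by more than 40% over the previous generation
+  - `vlm` backend optimizations
+    - Optimized the `table_caption` and `table_footnote` matching logic, improving caption and footnote matching accuracy and producing a more sensible reading order on pages with multiple consecutive tables
+    - Reduced CPU usage under high concurrency when using the `vllm` backend, lowering server load
+    - Adapted to `vllm` version 0.11.0
+  - General optimizations
+    - Improved cross-page table merging and added support for merging continuation tables across pages, which gives better results in multi-column merge scenarios
+    - Added the environment variable `MINERU_TABLE_MERGE_ENABLE` for the table merging feature. Table merging is enabled by default and can be disabled by setting this variable to `0`
+    
 - 2025/09/26 2.5.4 released
   - 🎉🎉 The MinerU2.5 [Technical Report](https://arxiv.org/abs/2509.22186) is now available! Read it for a comprehensive overview of the model architecture, training strategy, data engineering, and evaluation results.
   - Fixed an issue where some `pdf` files were identified as `ai` files and could not be parsed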

+ 12 - 2
docs/en/usage/cli_tools.md

@@ -87,6 +87,16 @@ Here are the environment variables and their descriptions:
     * Used to enable formula parsing
     * defaults to `true`, can be set to `false` through environment variables to disable formula parsing.
   
-- `MINERU_TABLE_ENABLE`: 
+- `MINERU_FORMULA_CH_SUPPORT`:
+    * Used to enable Chinese formula parsing optimization (experimental feature)
+    * Default is `false`, can be set to `true` via environment variable to enable Chinese formula parsing optimization.
+    * Only effective for `pipeline` backend.
+  
+- `MINERU_TABLE_ENABLE`:
     * Used to enable table parsing
-    * defaults to `true`, can be set to `false` through environment variables to disable table parsing.
+    * Default is `true`, can be set to `false` via environment variable to disable table parsing.
+
+- `MINERU_TABLE_MERGE_ENABLE`:
+    * Used to enable table merging functionality
+    * Default is `true`, can be set to `false` via environment variable to disable table merging functionality.
+
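
The two switches documented in this hunk can also be set programmatically before launching MinerU. The following is a minimal sketch, assuming a hypothetical helper `build_mineru_env` (not part of MinerU) that only builds an environment mapping. The variable names and the `true`/`false` values come from the documentation above.

```python
import os


def build_mineru_env(formula_ch_support: bool = False, table_merge: bool = True) -> dict:
    """Hypothetical helper: copy the current environment and set the two new switches."""
    env = dict(os.environ)
    # Defaults mirror the documented ones: Chinese formula support off, table merging on.
    env["MINERU_FORMULA_CH_SUPPORT"] = "true" if formula_ch_support else "false"
    env["MINERU_TABLE_MERGE_ENABLE"] = "true" if table_merge else "false"
    return env


env = build_mineru_env(formula_ch_support=True, table_merge=False)
print(env["MINERU_FORMULA_CH_SUPPORT"], env["MINERU_TABLE_MERGE_ENABLE"])
```

The returned mapping can be passed as the `env=` argument of `subprocess.run` when starting the command-line tool. Setting the variables before the process starts matters for `MINERU_FORMULA_CH_SUPPORT`, because the `model_init.py` change in this commit reads it at module level.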

+ 9 - 0
docs/zh/usage/cli_tools.md

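@@ -81,7 +81,16 @@ Some parameters of the MinerU command-line tool have equivalent environment variable configurations,
 - `MINERU_FORMULA_ENABLE`:
     * Used to enable formula parsing
     * Default is `true`, can be set to `false` via environment variable to disable formula parsing.
+
+- `MINERU_FORMULA_CH_SUPPORT`:
+    * Used to enable Chinese formula parsing optimization (experimental feature)
+    * Default is `false`, can be set to `true` via environment variable to enable Chinese formula parsing optimization.
+    * Only effective for the `pipeline` backend.
   
 - `MINERU_TABLE_ENABLE`:
     * Used to enable table parsing
     * Default is `true`, can be set to `false` via environment variable to disable table parsing.
+
+- `MINERU_TABLE_MERGE_ENABLE`:
+    * Used to enable table merging
+    * Default is `true`, can be set to `false` via environment variable to disable table merging.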

+ 7 - 3
mineru/backend/pipeline/model_init.py

@@ -17,10 +17,14 @@ from ...model.table.rec.unet_table.main import UnetTableModel
 from ...utils.enum_class import ModelPath
 from ...utils.models_download_utils import auto_download_and_get_model_root_path
 
-MFR_MODEL = os.getenv('MINERU_MFR_MODEL', None)
-if MFR_MODEL is None:
+MFR_MODEL = os.getenv('MINERU_FORMULA_CH_SUPPORT', 'False')
+if MFR_MODEL.lower() in ['true', '1', 'yes']:
+    MFR_MODEL = "pp_formulanet_plus_m"
+elif MFR_MODEL.lower() in ['false', '0', 'no']:
+    MFR_MODEL = "unimernet_small"
+else:
+    logger.warning(f"Invalid MINERU_FORMULA_CH_SUPPORT value: {MFR_MODEL}, set to default 'False'")
     MFR_MODEL = "unimernet_small"
-    # MFR_MODEL = "pp_formulanet_plus_m"
 
 
 def img_orientation_cls_model_init():
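
The selection logic in this hunk can be exercised in isolation. Below is a minimal sketch, assuming a hypothetical helper `select_mfr_model` (not part of MinerU) that takes the raw string instead of reading the environment. The accepted values and the two model names are copied from the diff.

```python
import logging

logger = logging.getLogger(__name__)

TRUTHY = {"true", "1", "yes"}
FALSY = {"false", "0", "no"}


def select_mfr_model(raw_value: str = "False") -> str:
    """Hypothetical helper mirroring the MINERU_FORMULA_CH_SUPPORT branch in the diff."""
    value = raw_value.lower()
    if value in TRUTHY:
        return "pp_formulanet_plus_m"
    if value not in FALSY:
        # Unrecognized values fall back to the default model, as in the diff.
        logger.warning("Invalid MINERU_FORMULA_CH_SUPPORT value: %s, using default", raw_value)
    return "unimernet_small"


for candidate in ("1", "YES", "0", "maybe"):
    print(candidate, "->", select_mfr_model(candidate))
```

As in the diff, the value is lowercased but not stripped, so a value with surrounding whitespace such as `" 1"` is treated as invalid and falls back to `unimernet_small`.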

+ 2 - 9
mineru/backend/pipeline/model_json_to_middle_json.py

@@ -5,6 +5,7 @@ import time
 from loguru import logger
 from tqdm import tqdm
 
+from mineru.backend.utils import cross_page_table_merge
 from mineru.utils.config_reader import get_device, get_llm_aided_config, get_formula_enable
 from mineru.backend.pipeline.model_init import AtomModelSingleton
 from mineru.backend.pipeline.para_split import para_split
@@ -20,7 +21,6 @@ from mineru.utils.ocr_utils import OcrConfidence
 from mineru.utils.span_block_fix import fill_spans_in_blocks, fix_discarded_block, fix_block_spans
 from mineru.utils.span_pre_proc import remove_outside_spans, remove_overlaps_low_confidence_spans, \
     remove_overlaps_min_spans, txt_spans_extract
-from mineru.utils.table_merge import merge_table
 from mineru.version import __version__
 from mineru.utils.hash_utils import bytes_md5
 
@@ -231,14 +231,7 @@ def result_to_middle_json(model_list, images_list, pdf_doc, image_writer, lang=N
     para_split(middle_json["pdf_info"])
 
     """Cross-page table merging"""
-    is_merge_table = os.getenv('MINERU_MERGE_TABLE', 'true')
-    if is_merge_table.lower() == 'true':
-        merge_table(middle_json["pdf_info"])
-    elif is_merge_table.lower() == 'false':
-        pass
-    else:
-        logger.warning(f'unknown MINERU_MERGE_TABLE config: {is_merge_table}, pass')
-        pass
+    cross_page_table_merge(middle_json["pdf_info"])
 
     """LLM optimization"""
     llm_aided_config = get_llm_aided_config()

+ 24 - 0
mineru/backend/utils.py

@@ -0,0 +1,24 @@
+import os
+
+from loguru import logger
+
+from mineru.utils.table_merge import merge_table
+
+
+def cross_page_table_merge(pdf_info: list[dict]):
+    """Merge tables that span across multiple pages in a PDF document.
+
+    Args:
+        pdf_info (list[dict]): A list of dictionaries containing information about each page in the PDF.
+
+    Returns:
+        None
+    """
+    is_merge_table = os.getenv('MINERU_TABLE_MERGE_ENABLE', 'true')
+    if is_merge_table.lower() in ['true', '1', 'yes']:
+        merge_table(pdf_info)
+    elif is_merge_table.lower() in ['false', '0', 'no']:
+        pass
+    else:
+        logger.warning(f'unknown MINERU_TABLE_MERGE_ENABLE config: {is_merge_table}, pass')
+        pass
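
One detail of `cross_page_table_merge` is easy to miss: an unrecognized value skips merging (with a warning) even though the default is enabled. Below is a minimal sketch of that gate, assuming a hypothetical stand-in `merge_if_enabled` that takes the merge function as a parameter so it runs without MinerU installed. Only the variable name and accepted values come from the diff.

```python
import os
from typing import Callable


def merge_if_enabled(pdf_info: list, merge_fn: Callable[[list], None]) -> bool:
    """Hypothetical stand-in for cross_page_table_merge; returns True if merge_fn ran."""
    flag = os.getenv("MINERU_TABLE_MERGE_ENABLE", "true").lower()
    if flag in ("true", "1", "yes"):
        merge_fn(pdf_info)
        return True
    # Explicitly disabled, or unrecognized (the diff logs a warning and skips).
    return False


calls = []
for setting in ("1", "no", "maybe"):
    os.environ["MINERU_TABLE_MERGE_ENABLE"] = setting
    print(setting, merge_if_enabled([], calls.append))
```

In this sketch only the `"1"` setting runs the merge function; `"no"` and `"maybe"` both skip it.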

+ 2 - 9
mineru/backend/vlm/model_output_to_middle_json.py

@@ -5,13 +5,13 @@ import cv2
 import numpy as np
 from loguru import logger
 
+from mineru.backend.utils import cross_page_table_merge
 from mineru.backend.vlm.vlm_magic_model import MagicModel
 from mineru.utils.config_reader import get_table_enable, get_llm_aided_config
 from mineru.utils.cut_image import cut_image_and_table
 from mineru.utils.enum_class import ContentType
 from mineru.utils.hash_utils import bytes_md5
 from mineru.utils.pdf_image_tools import get_crop_img
-from mineru.utils.table_merge import merge_table
 from mineru.version import __version__
 
 
@@ -110,14 +110,7 @@ def result_to_middle_json(model_output_blocks_list, images_list, pdf_doc, image_
     """Cross-page table merging"""
     table_enable = get_table_enable(os.getenv('MINERU_VLM_TABLE_ENABLE', 'True').lower() == 'true')
     if table_enable:
-        is_merge_table = os.getenv('MINERU_MERGE_TABLE', 'true')
-        if is_merge_table.lower() == 'true':
-            merge_table(middle_json["pdf_info"])
-        elif is_merge_table.lower() == 'false':
-            pass
-        else:
-            logger.warning(f'unknown MINERU_MERGE_TABLE config: {is_merge_table}, pass')
-            pass
+        cross_page_table_merge(middle_json["pdf_info"])
 
     """LLM-optimized heading levels"""
     if heading_level_import_success:

+ 24 - 21
mineru/cli/common.py

@@ -44,34 +44,37 @@ def prepare_env(output_dir, pdf_file_name, parse_method):
 
 
 def convert_pdf_bytes_to_bytes_by_pypdfium2(pdf_bytes, start_page_id=0, end_page_id=None):
+    try:
+        # Load the PDF from bytes
+        pdf = pdfium.PdfDocument(pdf_bytes)
 
-    # Load the PDF from bytes
-    pdf = pdfium.PdfDocument(pdf_bytes)
-
-    # Determine the end page
-    end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else len(pdf) - 1
-    if end_page_id > len(pdf) - 1:
-        logger.warning("end_page_id is out of range, use pdf_docs length")
-        end_page_id = len(pdf) - 1
+        # Determine the end page
+        end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else len(pdf) - 1
+        if end_page_id > len(pdf) - 1:
+            logger.warning("end_page_id is out of range, use pdf_docs length")
+            end_page_id = len(pdf) - 1
 
-    # Create a new PDF document
-    output_pdf = pdfium.PdfDocument.new()
+        # Create a new PDF document
+        output_pdf = pdfium.PdfDocument.new()
 
-    # Select the page indices to import
-    page_indices = list(range(start_page_id, end_page_id + 1))
+        # Select the page indices to import
+        page_indices = list(range(start_page_id, end_page_id + 1))
 
-    # Import pages from the original PDF into the new PDF
-    output_pdf.import_pages(pdf, page_indices)
+        # Import pages from the original PDF into the new PDF
+        output_pdf.import_pages(pdf, page_indices)
 
-    # Save the new PDF to an in-memory buffer
-    output_buffer = io.BytesIO()
-    output_pdf.save(output_buffer)
+        # Save the new PDF to an in-memory buffer
+        output_buffer = io.BytesIO()
+        output_pdf.save(output_buffer)
 
-    # Get the bytes
-    output_bytes = output_buffer.getvalue()
+        # Get the bytes
+        output_bytes = output_buffer.getvalue()
 
-    pdf.close()  # Close the original PDF document to release resources
-    output_pdf.close()  # Close the new PDF document to release resources
+        pdf.close()  # Close the original PDF document to release resources
+        output_pdf.close()  # Close the new PDF document to release resources
+    except Exception as e:
+        logger.warning(f"Error in converting PDF bytes: {e}, Using original PDF bytes.")
+        output_bytes = pdf_bytes
 
     return output_bytes
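
The page-range handling inside the new `try` block can be checked without pypdfium2. Below is a minimal sketch, assuming a hypothetical helper `resolve_page_indices` (not part of MinerU) that takes the page count as a plain integer. The clamping rules are the ones shown in the diff.

```python
from typing import List, Optional


def resolve_page_indices(start_page_id: int, end_page_id: Optional[int], page_count: int) -> List[int]:
    """Hypothetical helper: same end-page clamping as the diff, without a PDF object."""
    last = page_count - 1
    # A missing or negative end page means "through the last page".
    if end_page_id is None or end_page_id < 0:
        end_page_id = last
    # An end page past the document is clamped to the last page.
    if end_page_id > last:
        end_page_id = last
    return list(range(start_page_id, end_page_id + 1))


print(resolve_page_indices(0, None, 5))
print(resolve_page_indices(1, 10, 5))
```

A start page greater than the resolved end page yields an empty list in this sketch. How pypdfium2 handles an empty index list is not shown in the diff.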
 

+ 1 - 1
mineru/model/utils/pytorchocr/modeling/backbones/rec_lcnetv3.py

@@ -256,7 +256,7 @@ class LearnableRepLayer(nn.Module):
                 input_dim = self.in_channels // self.groups
                 kernel_value = torch.zeros(
                     (self.in_channels, input_dim, self.kernel_size, self.kernel_size),
-                    dtype=branch.weight.dtype,  device= branch.weight.device,
+                    dtype=branch.weight.dtype,  device=branch.weight.device,
                 )
                 for i in range(self.in_channels):
                     kernel_value[