
Merge pull request #969 from opendatalab/release-0.9.3

Release 0.9.3
Xiaomeng Zhao · 1 year ago · commit 845a3ff067
100 changed files with 1773 additions and 1018 deletions
  1. .gitignore (+3 -0)
  2. README.md (+5 -2)
  3. README_ja-JP.md (+3 -3)
  4. README_zh-CN.md (+6 -5)
  5. demo/magic_pdf_parse_main.py (+8 -3)
  6. magic-pdf.template.json (+1 -1)
  7. magic_pdf/dict2md/ocr_mkcontent.py (+1 -1)
  8. magic_pdf/libs/Constants.py (+3 -1)
  9. magic_pdf/libs/config_reader.py (+1 -1)
  10. magic_pdf/libs/draw_bbox.py (+10 -4)
  11. magic_pdf/model/pdf_extract_kit.py (+42 -297)
  12. magic_pdf/model/pek_sub_modules/post_process.py (+0 -36)
  13. magic_pdf/model/pek_sub_modules/self_modify.py (+0 -388)
  14. magic_pdf/model/sub_modules/__init__.py (+0 -0)
  15. magic_pdf/model/sub_modules/layout/__init__.py (+0 -0)
  16. magic_pdf/model/sub_modules/layout/doclayout_yolo/DocLayoutYOLO.py (+21 -0)
  17. magic_pdf/model/sub_modules/layout/doclayout_yolo/__init__.py (+0 -0)
  18. magic_pdf/model/sub_modules/layout/layoutlmv3/__init__.py (+0 -0)
  19. magic_pdf/model/sub_modules/layout/layoutlmv3/backbone.py (+0 -0)
  20. magic_pdf/model/sub_modules/layout/layoutlmv3/beit.py (+0 -0)
  21. magic_pdf/model/sub_modules/layout/layoutlmv3/deit.py (+0 -0)
  22. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/__init__.py (+0 -0)
  23. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/__init__.py (+0 -0)
  24. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/cord.py (+0 -0)
  25. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/data_collator.py (+0 -0)
  26. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/funsd.py (+0 -0)
  27. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/image_utils.py (+0 -0)
  28. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/xfund.py (+0 -0)
  29. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/__init__.py (+0 -0)
  30. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/__init__.py (+0 -0)
  31. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/configuration_layoutlmv3.py (+0 -0)
  32. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/modeling_layoutlmv3.py (+0 -0)
  33. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3.py (+0 -0)
  34. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3_fast.py (+0 -0)
  35. magic_pdf/model/sub_modules/layout/layoutlmv3/model_init.py (+0 -0)
  36. magic_pdf/model/sub_modules/layout/layoutlmv3/rcnn_vl.py (+0 -0)
  37. magic_pdf/model/sub_modules/layout/layoutlmv3/visualizer.py (+0 -0)
  38. magic_pdf/model/sub_modules/mfd/__init__.py (+0 -0)
  39. magic_pdf/model/sub_modules/mfd/yolov8/YOLOv8.py (+12 -0)
  40. magic_pdf/model/sub_modules/mfd/yolov8/__init__.py (+0 -0)
  41. magic_pdf/model/sub_modules/mfr/__init__.py (+0 -0)
  42. magic_pdf/model/sub_modules/mfr/unimernet/Unimernet.py (+98 -0)
  43. magic_pdf/model/sub_modules/mfr/unimernet/__init__.py (+0 -0)
  44. magic_pdf/model/sub_modules/model_init.py (+144 -0)
  45. magic_pdf/model/sub_modules/model_utils.py (+51 -0)
  46. magic_pdf/model/sub_modules/ocr/__init__.py (+0 -0)
  47. magic_pdf/model/sub_modules/ocr/paddleocr/__init__.py (+0 -0)
  48. magic_pdf/model/sub_modules/ocr/paddleocr/ocr_utils.py (+259 -0)
  49. magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_273_mod.py (+168 -0)
  50. magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_291_mod.py (+213 -0)
  51. magic_pdf/model/sub_modules/reading_oreder/__init__.py (+0 -0)
  52. magic_pdf/model/sub_modules/reading_oreder/layoutreader/__init__.py (+0 -0)
  53. magic_pdf/model/sub_modules/reading_oreder/layoutreader/helpers.py (+0 -0)
  54. magic_pdf/model/sub_modules/reading_oreder/layoutreader/xycut.py (+242 -0)
  55. magic_pdf/model/sub_modules/table/__init__.py (+0 -0)
  56. magic_pdf/model/sub_modules/table/rapidtable/__init__.py (+0 -0)
  57. magic_pdf/model/sub_modules/table/rapidtable/rapid_table.py (+14 -0)
  58. magic_pdf/model/sub_modules/table/structeqtable/__init__.py (+0 -0)
  59. magic_pdf/model/sub_modules/table/structeqtable/struct_eqtable.py (+3 -11)
  60. magic_pdf/model/sub_modules/table/table_utils.py (+11 -0)
  61. magic_pdf/model/sub_modules/table/tablemaster/__init__.py (+0 -0)
  62. magic_pdf/model/sub_modules/table/tablemaster/tablemaster_paddle.py (+1 -1)
  63. magic_pdf/para/para_split_v3.py (+13 -15)
  64. magic_pdf/pdf_parse_union_core_v2.py (+56 -19)
  65. magic_pdf/resources/model_config/model_configs.yaml (+2 -1)
  66. magic_pdf/tools/common.py (+47 -3)
  67. next_docs/README.md (+16 -0)
  68. next_docs/README_zh-CN.md (+16 -0)
  69. next_docs/en/_static/image/ReadTheDocs.svg (+13 -0)
  70. next_docs/en/additional_notes/changelog.rst (+0 -26)
  71. next_docs/en/additional_notes/faq.rst (+12 -0)
  72. next_docs/en/additional_notes/known_issues.rst (+15 -14)
  73. next_docs/en/api.rst (+0 -1)
  74. next_docs/en/api/classes.rst (+0 -14)
  75. next_docs/en/api/utils.rst (+0 -1)
  76. next_docs/en/conf.py (+1 -1)
  77. next_docs/en/index.rst (+23 -22)
  78. next_docs/en/projects.rst (+0 -13)
  79. next_docs/en/user_guide/data/data_reader_writer.rst (+5 -1)
  80. next_docs/en/user_guide/data/dataset.rst (+1 -1)
  81. next_docs/en/user_guide/data/io.rst (+1 -1)
  82. next_docs/en/user_guide/data/read_api.rst (+6 -1)
  83. next_docs/en/user_guide/install/boost_with_cuda.rst (+45 -41)
  84. next_docs/en/user_guide/install/install.rst (+81 -78)
  85. next_docs/en/user_guide/quick_start/command_line.rst (+4 -1)
  86. next_docs/en/user_guide/quick_start/extract_text.rst (+0 -10)
  87. next_docs/zh_cn/_static/image/MinerU-logo-hq.png (binary)
  88. next_docs/zh_cn/_static/image/MinerU-logo.png (binary)
  89. next_docs/zh_cn/_static/image/ReadTheDocs.svg (+13 -0)
  90. next_docs/zh_cn/_static/image/datalab_logo.png (binary)
  91. next_docs/zh_cn/_static/image/flowchart_en.png (binary)
  92. next_docs/zh_cn/_static/image/flowchart_zh_cn.png (binary)
  93. next_docs/zh_cn/_static/image/layout_example.png (binary)
  94. next_docs/zh_cn/_static/image/poly.png (binary)
  95. next_docs/zh_cn/_static/image/project_panorama_en.png (binary)
  96. next_docs/zh_cn/_static/image/project_panorama_zh_cn.png (binary)
  97. next_docs/zh_cn/_static/image/spans_example.png (binary)
  98. next_docs/zh_cn/_static/image/web_demo_1.png (binary)
  99. next_docs/zh_cn/additional_notes/faq.rst (+72 -0)
  100. next_docs/zh_cn/additional_notes/glossary.rst (+11 -0)

+ 3 - 0
.gitignore

@@ -48,3 +48,6 @@ debug_utils/
 
 # sphinx docs
 _build/
+
+
+output/

+ 5 - 2
README.md

@@ -42,6 +42,7 @@
 </div>
 
 # Changelog
+- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.
 - 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition functionality.
 - 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
   - Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading order sorting, ensuring high accuracy in various layouts.
@@ -246,7 +247,7 @@ You can modify certain configurations in this file to enable or disable features
         "enable": true  // The formula recognition feature is enabled by default. If you need to disable it, please change the value here to "false".
     },
     "table-config": {
-        "model": "tablemaster",  // When using structEqTable, please change to "struct_eqtable".
+        "model": "rapid_table",  // When using structEqTable, please change to "struct_eqtable".
         "enable": false, // The table recognition feature is disabled by default. If you need to enable it, please change the value here to "true".
         "max_time": 400
     }
@@ -261,7 +262,7 @@ If your device supports CUDA and meets the GPU requirements of the mainline envi
 - [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
 - Quick Deployment with Docker
 > [!IMPORTANT]
-> Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
+> Docker requires a GPU with at least 8GB of VRAM, and all acceleration features are enabled by default.
 >
 > Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker.
 > 
@@ -421,7 +422,9 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
 # Acknowledgments
 
 - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
+- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
 - [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
+- [RapidTable](https://github.com/RapidAI/RapidTable)
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
 - [layoutreader](https://github.com/ppaanngggg/layoutreader)
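
For reference, a minimal sketch of the resulting table-config block from magic-pdf.json with table recognition switched on (the shipped default keeps "enable": false; the inline comments are assumptions drawn from the README text above):

    # table-config as of 0.9.3, expressed as a Python dict
    table_config = {
        "model": "rapid_table",  # alternatives in this release: "struct_eqtable", "tablemaster"
        "enable": True,          # table recognition ships disabled; set True to turn it on
        "max_time": 400,         # per-table budget in seconds; the pipeline logs a warning when exceeded
    }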

+ 3 - 3
README_ja-JP.md

@@ -1,3 +1,5 @@
+> [!Warning]
+> This document is outdated. Please refer to the latest version of the documentation: [ENGLISH](README.md).
 <div id="top">
 
 <p align="center">
@@ -18,9 +20,7 @@
 <a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>
 
 
-<div align="center" style="color: red; background-color: #ffdddd; padding: 10px; border: 1px solid red; border-radius: 5px;">
-  <strong>NOTE:</strong> This document is outdated. Please refer to the latest version of the documentation.
-</div>
+
 
 
 [English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)

+ 6 - 5
README_zh-CN.md

@@ -42,7 +42,7 @@
 </div>
 
 # Changelog
-
+- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage
 - 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition
 - 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
   - Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading order sorting, ensuring high accuracy in various layouts
@@ -188,13 +188,13 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
         <td rowspan="2">GPU hardware support list</td>
         <td colspan="2">Minimum requirement: 8 GB+ VRAM</td>
         <td colspan="2">3060ti/3070/4060<br>
-        8 GB VRAM enables layout, formula recognition and OCR acceleration</td>
+        8 GB VRAM enables all acceleration features (tables limited to rapid_table)</td>
         <td rowspan="2">None</td>
     </tr>
     <tr>
         <td colspan="2">Recommended: 10 GB+ VRAM</td>
         <td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070tisuper/4080/4090<br>
-        10 GB VRAM or more can simultaneously enable layout, formula recognition, OCR acceleration and table recognition acceleration<br>
+        10 GB VRAM or more enables all acceleration features<br>
         </td>
     </tr>
 </table>
@@ -251,7 +251,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
         "enable": true  // Formula recognition is enabled by default; to disable it, change this value to "false"
     },
     "table-config": {
-        "model": "tablemaster",  // To use structEqTable, change this to "struct_eqtable"
+        "model": "rapid_table",  // To use structEqTable, change this to "struct_eqtable"
         "enable": false, // Table recognition is disabled by default; to enable it, change this value to "true"
         "max_time": 400
     }
@@ -266,7 +266,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
 - [Windows10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
 - Quick deployment with Docker
 > [!IMPORTANT]
-> Docker requires a GPU with at least 16GB of VRAM; all acceleration features are enabled by default
+> Docker requires a GPU with at least 8GB of VRAM; all acceleration features are enabled by default
 > 
 > Before running this Docker, you can use the following command to check whether your device supports CUDA acceleration in Docker
 > 
@@ -431,6 +431,7 @@ TODO
 - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
 - [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
 - [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
+- [RapidTable](https://github.com/RapidAI/RapidTable)
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
 - [layoutreader](https://github.com/ppaanngggg/layoutreader)

+ 8 - 3
demo/magic_pdf_parse_main.py

@@ -19,9 +19,10 @@ def json_md_dump(
         pdf_name,
         content_list,
         md_content,
+        orig_model_list,
 ):
     # Write the model results to model.json
-    orig_model_list = copy.deepcopy(pipe.model_list)
+
     md_writer.write(
         content=json.dumps(orig_model_list, ensure_ascii=False, indent=4),
         path=f"{pdf_name}_model.json"
@@ -87,9 +88,12 @@ def pdf_parse_main(
 
         pdf_bytes = open(pdf_path, "rb").read()  # read the binary data of the pdf file
 
+        orig_model_list = []
+
         if model_json_path:
             # Read the raw JSON data (a list) of a PDF that has already been parsed by the model
             model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
+            orig_model_list = copy.deepcopy(model_json)
         else:
             model_json = []
 
@@ -115,8 +119,9 @@ def pdf_parse_main(
         pipe.pipe_classify()
 
         # If no model data was passed in, parse with the built-in models
-        if not model_json:
+        if len(model_json) == 0:
             pipe.pipe_analyze()  # parse
+            orig_model_list = copy.deepcopy(pipe.model_list)
 
         # Run the parse
         pipe.pipe_parse()
@@ -126,7 +131,7 @@ def pdf_parse_main(
         md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none")
 
         if is_json_md_dump:
-            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content)
+            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content, orig_model_list)
 
         if is_draw_visualization_bbox:
             draw_visualization_bbox(pipe.pdf_mid_data['pdf_info'], pdf_bytes, output_path, pdf_name)
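
The net effect in this demo: json_md_dump no longer reaches into pipe.model_list itself; the caller decides where the model output comes from and passes it in. A minimal sketch of that sourcing logic (the load_model_list helper and its parameterization are hypothetical, condensed from the diff above):

    import copy
    import json

    def load_model_list(model_json_path, pipe=None):
        # Prefer a user-supplied model JSON; otherwise fall back to the
        # list the pipeline produced after pipe_analyze().
        if model_json_path:
            with open(model_json_path, "r", encoding="utf-8") as f:
                return copy.deepcopy(json.load(f))
        return copy.deepcopy(pipe.model_list) if pipe else []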

+ 1 - 1
magic-pdf.template.json

@@ -15,7 +15,7 @@
         "enable": true
     },
     "table-config": {
-        "model": "tablemaster",
+        "model": "rapid_table",
         "enable": false,
         "max_time": 400
     },

+ 1 - 1
magic_pdf/dict2md/ocr_mkcontent.py

@@ -168,7 +168,7 @@ def merge_para_with_text(para_block):
                         # If the previous line ends with a hyphen, don't append a trailing space
                         if __is_hyphen_at_line_end(content):
                             para_text += content[:-1]
-                        elif len(content) == 1 and content not in ['A', 'I', 'a', 'i']:
+                        elif len(content) == 1 and content not in ['A', 'I', 'a', 'i'] and not content.isdigit():
                             para_text += content
                         else:  # In Western-text contexts, contents need to be space-separated
                             para_text += f"{content} "
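
With the added isdigit() guard, single digits now take the normal word path and keep their trailing space; previously they fell into the no-space branch for single characters. A condensed sketch of the rule (hyphen handling omitted):

    def join(contents):
        para_text = ""
        for content in contents:
            if len(content) == 1 and content not in ['A', 'I', 'a', 'i'] and not content.isdigit():
                para_text += content          # lone symbols get no trailing space
            else:
                para_text += f"{content} "    # words, A/I/a/i, and now digits do
        return para_text

    print(join(["Page", "1", "of", "2"]))
    # after this change: "Page 1 of 2 "  (before: "Page 1of 2")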

+ 3 - 1
magic_pdf/libs/Constants.py

@@ -50,4 +50,6 @@ class MODEL_NAME:
 
     YOLO_V8_MFD = "yolo_v8_mfd"
 
-    UniMerNet_v2_Small = "unimernet_small"
+    UniMerNet_v2_Small = "unimernet_small"
+
+    RAPID_TABLE = "rapid_table"

+ 1 - 1
magic_pdf/libs/config_reader.py

@@ -92,7 +92,7 @@ def get_table_recog_config():
     table_config = config.get('table-config')
     if table_config is None:
         logger.warning(f"'table-config' not found in {CONFIG_FILE_NAME}, use 'False' as default")
-        return json.loads(f'{{"model": "{MODEL_NAME.TABLE_MASTER}","enable": false, "max_time": 400}}')
+        return json.loads(f'{{"model": "{MODEL_NAME.RAPID_TABLE}","enable": false, "max_time": 400}}')
     else:
         return table_config
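
A quick check of what the new fallback evaluates to (MODEL_NAME.RAPID_TABLE is the constant added in Constants.py above; the snippet inlines it for illustration):

    import json

    RAPID_TABLE = "rapid_table"  # stands in for MODEL_NAME.RAPID_TABLE
    default = json.loads(f'{{"model": "{RAPID_TABLE}","enable": false, "max_time": 400}}')
    print(default)  # {'model': 'rapid_table', 'enable': False, 'max_time': 400}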
 
 

+ 10 - 4
magic_pdf/libs/draw_bbox.py

@@ -369,10 +369,16 @@ def draw_line_sort_bbox(pdf_info, pdf_bytes, out_path, filename):
             if block['type'] in [BlockType.Image, BlockType.Table]:
                 for sub_block in block['blocks']:
                     if sub_block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
-                        for line in sub_block['virtual_lines']:
-                            bbox = line['bbox']
-                            index = line['index']
-                            page_line_list.append({'index': index, 'bbox': bbox})
+                        if len(sub_block['virtual_lines']) > 0 and sub_block['virtual_lines'][0].get('index', None) is not None:
+                            for line in sub_block['virtual_lines']:
+                                bbox = line['bbox']
+                                index = line['index']
+                                page_line_list.append({'index': index, 'bbox': bbox})
+                        else:
+                            for line in sub_block['lines']:
+                                bbox = line['bbox']
+                                index = line['index']
+                                page_line_list.append({'index': index, 'bbox': bbox})
                     elif sub_block['type'] in [BlockType.ImageCaption, BlockType.TableCaption, BlockType.ImageFootnote, BlockType.TableFootnote]:
                         for line in sub_block['lines']:
                             bbox = line['bbox']
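
Condensed, the new logic prefers virtual_lines only when they actually carry a reading-order index, and otherwise falls back to the block's real lines. A sketch of that selection as a standalone helper (the pick_sorted_lines name is hypothetical):

    def pick_sorted_lines(sub_block):
        # Use virtual_lines when present and indexed; otherwise use the
        # real lines, which always carry an 'index'.
        vlines = sub_block.get('virtual_lines', [])
        if len(vlines) > 0 and vlines[0].get('index', None) is not None:
            return vlines
        return sub_block['lines']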

+ 42 - 297
magic_pdf/model/pdf_extract_kit.py

@@ -1,195 +1,28 @@
+import numpy as np
+import torch
 from loguru import logger
 import os
 import time
-from pathlib import Path
-import shutil
-from magic_pdf.libs.Constants import *
-from magic_pdf.libs.clean_memory import clean_memory
-from magic_pdf.model.model_list import AtomicModel
+import cv2
+import yaml
+from PIL import Image
 
 os.environ['NO_ALBUMENTATIONS_UPDATE'] = '1'  # stop albumentations from checking for updates
 os.environ['YOLO_VERBOSE'] = 'False'  # disable yolo logger
+
 try:
-    import cv2
-    import yaml
-    import argparse
-    import numpy as np
-    import torch
     import torchtext
 
     if torchtext.__version__ >= "0.18.0":
         torchtext.disable_torchtext_deprecation_warning()
-    from PIL import Image
-    from torchvision import transforms
-    from torch.utils.data import Dataset, DataLoader
-    from ultralytics import YOLO
-    from unimernet.common.config import Config
-    import unimernet.tasks as tasks
-    from unimernet.processors import load_processor
-    from doclayout_yolo import YOLOv10
-
-except ImportError as e:
-    logger.exception(e)
-    logger.error(
-        'Required dependency not installed, please install by \n'
-        '"pip install magic-pdf[full] --extra-index-url https://myhloli.github.io/wheels/"')
-    exit(1)
-
-from magic_pdf.model.pek_sub_modules.layoutlmv3.model_init import Layoutlmv3_Predictor
-from magic_pdf.model.pek_sub_modules.post_process import latex_rm_whitespace
-from magic_pdf.model.pek_sub_modules.self_modify import ModifiedPaddleOCR
-from magic_pdf.model.pek_sub_modules.structeqtable.StructTableModel import StructTableModel
-from magic_pdf.model.ppTableModel import ppTableModel
-
-
-def table_model_init(table_model_type, model_path, max_time, _device_='cpu'):
-    if table_model_type == MODEL_NAME.STRUCT_EQTABLE:
-        table_model = StructTableModel(model_path, max_time=max_time)
-    elif table_model_type == MODEL_NAME.TABLE_MASTER:
-        config = {
-            "model_dir": model_path,
-            "device": _device_
-        }
-        table_model = ppTableModel(config)
-    else:
-        logger.error("table model type not allow")
-        exit(1)
-    return table_model
-
-
-def mfd_model_init(weight):
-    mfd_model = YOLO(weight)
-    return mfd_model
-
-
-def mfr_model_init(weight_dir, cfg_path, _device_='cpu'):
-    args = argparse.Namespace(cfg_path=cfg_path, options=None)
-    cfg = Config(args)
-    cfg.config.model.pretrained = os.path.join(weight_dir, "pytorch_model.pth")
-    cfg.config.model.model_config.model_name = weight_dir
-    cfg.config.model.tokenizer_config.path = weight_dir
-    task = tasks.setup_task(cfg)
-    model = task.build_model(cfg)
-    model.to(_device_)
-    model.eval()
-    vis_processor = load_processor('formula_image_eval', cfg.config.datasets.formula_rec_eval.vis_processor.eval)
-    mfr_transform = transforms.Compose([vis_processor, ])
-    return [model, mfr_transform]
-
-
-def layout_model_init(weight, config_file, device):
-    model = Layoutlmv3_Predictor(weight, config_file, device)
-    return model
-
-
-def doclayout_yolo_model_init(weight):
-    model = YOLOv10(weight)
-    return model
-
-
-def ocr_model_init(show_log: bool = False, det_db_box_thresh=0.3, lang=None, use_dilation=True, det_db_unclip_ratio=1.8):
-    if lang is not None:
-        model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=det_db_box_thresh, lang=lang, use_dilation=use_dilation, det_db_unclip_ratio=det_db_unclip_ratio)
-    else:
-        model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=det_db_box_thresh, use_dilation=use_dilation, det_db_unclip_ratio=det_db_unclip_ratio)
-    return model
-
-
-class MathDataset(Dataset):
-    def __init__(self, image_paths, transform=None):
-        self.image_paths = image_paths
-        self.transform = transform
-
-    def __len__(self):
-        return len(self.image_paths)
-
-    def __getitem__(self, idx):
-        # if not pil image, then convert to pil image
-        if isinstance(self.image_paths[idx], str):
-            raw_image = Image.open(self.image_paths[idx])
-        else:
-            raw_image = self.image_paths[idx]
-        if self.transform:
-            image = self.transform(raw_image)
-            return image
-
-
-class AtomModelSingleton:
-    _instance = None
-    _models = {}
-
-    def __new__(cls, *args, **kwargs):
-        if cls._instance is None:
-            cls._instance = super().__new__(cls)
-        return cls._instance
-
-    def get_atom_model(self, atom_model_name: str, **kwargs):
-        lang = kwargs.get("lang", None)
-        layout_model_name = kwargs.get("layout_model_name", None)
-        key = (atom_model_name, layout_model_name, lang)
-        if key not in self._models:
-            self._models[key] = atom_model_init(model_name=atom_model_name, **kwargs)
-        return self._models[key]
-
-
-def atom_model_init(model_name: str, **kwargs):
-
-    if model_name == AtomicModel.Layout:
-        if kwargs.get("layout_model_name") == MODEL_NAME.LAYOUTLMv3:
-            atom_model = layout_model_init(
-                kwargs.get("layout_weights"),
-                kwargs.get("layout_config_file"),
-                kwargs.get("device")
-            )
-        elif kwargs.get("layout_model_name") == MODEL_NAME.DocLayout_YOLO:
-            atom_model = doclayout_yolo_model_init(
-                kwargs.get("doclayout_yolo_weights"),
-            )
-    elif model_name == AtomicModel.MFD:
-        atom_model = mfd_model_init(
-            kwargs.get("mfd_weights")
-        )
-    elif model_name == AtomicModel.MFR:
-        atom_model = mfr_model_init(
-            kwargs.get("mfr_weight_dir"),
-            kwargs.get("mfr_cfg_path"),
-            kwargs.get("device")
-        )
-    elif model_name == AtomicModel.OCR:
-        atom_model = ocr_model_init(
-            kwargs.get("ocr_show_log"),
-            kwargs.get("det_db_box_thresh"),
-            kwargs.get("lang")
-        )
-    elif model_name == AtomicModel.Table:
-        atom_model = table_model_init(
-            kwargs.get("table_model_name"),
-            kwargs.get("table_model_path"),
-            kwargs.get("table_max_time"),
-            kwargs.get("device")
-        )
-    else:
-        logger.error("model name not allow")
-        exit(1)
-
-    return atom_model
-
+except ImportError:
+    pass
 
-#  Unified crop img logic
-def crop_img(input_res, input_pil_img, crop_paste_x=0, crop_paste_y=0):
-    crop_xmin, crop_ymin = int(input_res['poly'][0]), int(input_res['poly'][1])
-    crop_xmax, crop_ymax = int(input_res['poly'][4]), int(input_res['poly'][5])
-    # Create a white background with an additional width and height of 50
-    crop_new_width = crop_xmax - crop_xmin + crop_paste_x * 2
-    crop_new_height = crop_ymax - crop_ymin + crop_paste_y * 2
-    return_image = Image.new('RGB', (crop_new_width, crop_new_height), 'white')
-
-    # Crop image
-    crop_box = (crop_xmin, crop_ymin, crop_xmax, crop_ymax)
-    cropped_img = input_pil_img.crop(crop_box)
-    return_image.paste(cropped_img, (crop_paste_x, crop_paste_y))
-    return_list = [crop_paste_x, crop_paste_y, crop_xmin, crop_ymin, crop_xmax, crop_ymax, crop_new_width, crop_new_height]
-    return return_image, return_list
+from magic_pdf.libs.Constants import *
+from magic_pdf.model.model_list import AtomicModel
+from magic_pdf.model.sub_modules.model_init import AtomModelSingleton
+from magic_pdf.model.sub_modules.model_utils import get_res_list_from_layout_res, crop_img, clean_vram
+from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import get_adjusted_mfdetrec_res, get_ocr_result_list
 
 
 class CustomPEKModel:
@@ -226,7 +59,7 @@ class CustomPEKModel:
         self.table_config = kwargs.get("table_config")
         self.apply_table = self.table_config.get("enable", False)
         self.table_max_time = self.table_config.get("max_time", TABLE_MAX_TIME_VALUE)
-        self.table_model_name = self.table_config.get("model", MODEL_NAME.TABLE_MASTER)
+        self.table_model_name = self.table_config.get("model", MODEL_NAME.RAPID_TABLE)
 
         # OCR config
         self.apply_ocr = ocr
@@ -235,7 +68,8 @@ class CustomPEKModel:
         logger.info(
             "DocAnalysis init, this may take some times, layout_model: {}, apply_formula: {}, apply_ocr: {}, "
             "apply_table: {}, table_model: {}, lang: {}".format(
-                self.layout_model_name, self.apply_formula, self.apply_ocr, self.apply_table, self.table_model_name, self.lang
+                self.layout_model_name, self.apply_formula, self.apply_ocr, self.apply_table, self.table_model_name,
+                self.lang
             )
         )
         # Initialize the parsing scheme
@@ -248,17 +82,17 @@ class CustomPEKModel:
 
         # Initialize formula recognition
         if self.apply_formula:
-
             # Initialize the formula detection model
             self.mfd_model = atom_model_manager.get_atom_model(
                 atom_model_name=AtomicModel.MFD,
-                mfd_weights=str(os.path.join(models_dir, self.configs["weights"][self.mfd_model_name]))
+                mfd_weights=str(os.path.join(models_dir, self.configs["weights"][self.mfd_model_name])),
+                device=self.device
             )
 
             # Initialize the formula parsing model
             mfr_weight_dir = str(os.path.join(models_dir, self.configs["weights"][self.mfr_model_name]))
             mfr_cfg_path = str(os.path.join(model_config_dir, "UniMERNet", "demo.yaml"))
-            self.mfr_model, self.mfr_transform = atom_model_manager.get_atom_model(
+            self.mfr_model = atom_model_manager.get_atom_model(
                 atom_model_name=AtomicModel.MFR,
                 mfr_weight_dir=mfr_weight_dir,
                 mfr_cfg_path=mfr_cfg_path,
@@ -278,7 +112,8 @@ class CustomPEKModel:
             self.layout_model = atom_model_manager.get_atom_model(
                 atom_model_name=AtomicModel.Layout,
                 layout_model_name=MODEL_NAME.DocLayout_YOLO,
-                doclayout_yolo_weights=str(os.path.join(models_dir, self.configs['weights'][self.layout_model_name]))
+                doclayout_yolo_weights=str(os.path.join(models_dir, self.configs['weights'][self.layout_model_name])),
+                device=self.device
             )
         # Initialize OCR
         if self.apply_ocr:
@@ -305,26 +140,15 @@ class CustomPEKModel:
 
         page_start = time.time()
 
-        latex_filling_list = []
-        mf_image_list = []
-
         # Layout detection
         layout_start = time.time()
+        layout_res = []
         if self.layout_model_name == MODEL_NAME.LAYOUTLMv3:
             # layoutlmv3
             layout_res = self.layout_model(image, ignore_catids=[])
         elif self.layout_model_name == MODEL_NAME.DocLayout_YOLO:
             # doclayout_yolo
-            layout_res = []
-            doclayout_yolo_res = self.layout_model.predict(image, imgsz=1024, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
-            for xyxy, conf, cla in zip(doclayout_yolo_res.boxes.xyxy.cpu(), doclayout_yolo_res.boxes.conf.cpu(), doclayout_yolo_res.boxes.cls.cpu()):
-                xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
-                new_item = {
-                    'category_id': int(cla.item()),
-                    'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
-                    'score': round(float(conf.item()), 3),
-                }
-                layout_res.append(new_item)
+            layout_res = self.layout_model.predict(image)
         layout_cost = round(time.time() - layout_start, 2)
         logger.info(f"layout detection time: {layout_cost}")
 
@@ -333,59 +157,21 @@ class CustomPEKModel:
         if self.apply_formula:
             # Formula detection
             mfd_start = time.time()
-            mfd_res = self.mfd_model.predict(image, imgsz=1888, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
+            mfd_res = self.mfd_model.predict(image)
             logger.info(f"mfd time: {round(time.time() - mfd_start, 2)}")
-            for xyxy, conf, cla in zip(mfd_res.boxes.xyxy.cpu(), mfd_res.boxes.conf.cpu(), mfd_res.boxes.cls.cpu()):
-                xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
-                new_item = {
-                    'category_id': 13 + int(cla.item()),
-                    'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
-                    'score': round(float(conf.item()), 2),
-                    'latex': '',
-                }
-                layout_res.append(new_item)
-                latex_filling_list.append(new_item)
-                bbox_img = pil_img.crop((xmin, ymin, xmax, ymax))
-                mf_image_list.append(bbox_img)
 
             # Formula recognition
             mfr_start = time.time()
-            dataset = MathDataset(mf_image_list, transform=self.mfr_transform)
-            dataloader = DataLoader(dataset, batch_size=64, num_workers=0)
-            mfr_res = []
-            for mf_img in dataloader:
-                mf_img = mf_img.to(self.device)
-                with torch.no_grad():
-                    output = self.mfr_model.generate({'image': mf_img})
-                mfr_res.extend(output['pred_str'])
-            for res, latex in zip(latex_filling_list, mfr_res):
-                res['latex'] = latex_rm_whitespace(latex)
+            formula_list = self.mfr_model.predict(mfd_res, image)
+            layout_res.extend(formula_list)
             mfr_cost = round(time.time() - mfr_start, 2)
-            logger.info(f"formula nums: {len(mf_image_list)}, mfr time: {mfr_cost}")
-
-        # Select regions for OCR / formula regions / table regions
-        ocr_res_list = []
-        table_res_list = []
-        single_page_mfdetrec_res = []
-        for res in layout_res:
-            if int(res['category_id']) in [13, 14]:
-                single_page_mfdetrec_res.append({
-                    "bbox": [int(res['poly'][0]), int(res['poly'][1]),
-                             int(res['poly'][4]), int(res['poly'][5])],
-                })
-            elif int(res['category_id']) in [0, 1, 2, 4, 6, 7]:
-                ocr_res_list.append(res)
-            elif int(res['category_id']) in [5]:
-                table_res_list.append(res)
-
-        if torch.cuda.is_available() and self.device != 'cpu':
-            properties = torch.cuda.get_device_properties(self.device)
-            total_memory = properties.total_memory / (1024 ** 3)  # convert bytes to GB
-            if total_memory <= 10:
-                gc_start = time.time()
-                clean_memory()
-                gc_time = round(time.time() - gc_start, 2)
-                logger.info(f"gc time: {gc_time}")
+            logger.info(f"formula nums: {len(formula_list)}, mfr time: {mfr_cost}")
+
+        # Free VRAM
+        clean_vram(self.device, vram_threshold=8)
+
+        # Extract the OCR, table and formula regions from layout_res
+        ocr_res_list, table_res_list, single_page_mfdetrec_res = get_res_list_from_layout_res(layout_res)
 
         # OCR recognition
         if self.apply_ocr:
@@ -393,23 +179,7 @@ class CustomPEKModel:
             # Process each area that requires OCR processing
             for res in ocr_res_list:
                 new_image, useful_list = crop_img(res, pil_img, crop_paste_x=50, crop_paste_y=50)
-                paste_x, paste_y, xmin, ymin, xmax, ymax, new_width, new_height = useful_list
-                # Adjust the coordinates of the formula area
-                adjusted_mfdetrec_res = []
-                for mf_res in single_page_mfdetrec_res:
-                    mf_xmin, mf_ymin, mf_xmax, mf_ymax = mf_res["bbox"]
-                    # Adjust the coordinates of the formula area to the coordinates relative to the cropping area
-                    x0 = mf_xmin - xmin + paste_x
-                    y0 = mf_ymin - ymin + paste_y
-                    x1 = mf_xmax - xmin + paste_x
-                    y1 = mf_ymax - ymin + paste_y
-                    # Filter formula blocks outside the graph
-                    if any([x1 < 0, y1 < 0]) or any([x0 > new_width, y0 > new_height]):
-                        continue
-                    else:
-                        adjusted_mfdetrec_res.append({
-                            "bbox": [x0, y0, x1, y1],
-                        })
+                adjusted_mfdetrec_res = get_adjusted_mfdetrec_res(single_page_mfdetrec_res, useful_list)
 
                 # OCR recognition
                 new_image = cv2.cvtColor(np.asarray(new_image), cv2.COLOR_RGB2BGR)
@@ -417,22 +187,8 @@ class CustomPEKModel:
 
                 # Integration results
                 if ocr_res:
-                    for box_ocr_res in ocr_res:
-                        p1, p2, p3, p4 = box_ocr_res[0]
-                        text, score = box_ocr_res[1]
-
-                        # Convert the coordinates back to the original coordinate system
-                        p1 = [p1[0] - paste_x + xmin, p1[1] - paste_y + ymin]
-                        p2 = [p2[0] - paste_x + xmin, p2[1] - paste_y + ymin]
-                        p3 = [p3[0] - paste_x + xmin, p3[1] - paste_y + ymin]
-                        p4 = [p4[0] - paste_x + xmin, p4[1] - paste_y + ymin]
-
-                        layout_res.append({
-                            'category_id': 15,
-                            'poly': p1 + p2 + p3 + p4,
-                            'score': round(score, 2),
-                            'text': text,
-                        })
+                    ocr_result_list = get_ocr_result_list(ocr_res, useful_list)
+                    layout_res.extend(ocr_result_list)
 
             ocr_cost = round(time.time() - ocr_start, 2)
             logger.info(f"ocr time: {ocr_cost}")
@@ -443,41 +199,30 @@ class CustomPEKModel:
             for res in table_res_list:
                 new_image, _ = crop_img(res, pil_img)
                 single_table_start_time = time.time()
-                # logger.info("------------------table recognition processing begins-----------------")
-                latex_code = None
                 html_code = None
                 if self.table_model_name == MODEL_NAME.STRUCT_EQTABLE:
                     with torch.no_grad():
                         table_result = self.table_model.predict(new_image, "html")
                         if len(table_result) > 0:
                             html_code = table_result[0]
-                else:
+                elif self.table_model_name == MODEL_NAME.TABLE_MASTER:
                     html_code = self.table_model.img2html(new_image)
-
+                elif self.table_model_name == MODEL_NAME.RAPID_TABLE:
+                    html_code, table_cell_bboxes, elapse = self.table_model.predict(new_image)
                 run_time = time.time() - single_table_start_time
-                # logger.info(f"------------table recognition processing ends within {run_time}s-----")
                 if run_time > self.table_max_time:
-                    logger.warning(f"------------table recognition processing exceeds max time {self.table_max_time}s----------")
+                    logger.warning(f"table recognition processing exceeds max time {self.table_max_time}s")
                 # Check whether a valid result came back
-
-                if latex_code:
-                    expected_ending = latex_code.strip().endswith('end{tabular}') or latex_code.strip().endswith('end{table}')
-                    if expected_ending:
-                        res["latex"] = latex_code
-                    else:
-                        logger.warning(f"table recognition processing fails, not found expected LaTeX table end")
-                elif html_code:
+                if html_code:
                     expected_ending = html_code.strip().endswith('</html>') or html_code.strip().endswith('</table>')
                     if expected_ending:
                         res["html"] = html_code
                     else:
                         logger.warning(f"table recognition processing fails, not found expected HTML table end")
                 else:
-                    logger.warning(f"table recognition processing fails, not get latex or html return")
+                    logger.warning(f"table recognition processing fails, not get html return")
             logger.info(f"table time: {round(time.time() - table_start, 2)}")
 
         logger.info(f"-----page total time: {round(time.time() - page_start, 2)}-----")
 
         return layout_res
-
-
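
The atom-model plumbing removed above did not disappear: the new imports pull AtomModelSingleton from magic_pdf.model.sub_modules.model_init and the crop/VRAM/OCR helpers from model_utils and ocr_utils. For reference, the caching pattern condensed from the deleted code (the relocated version may differ in detail; atom_model_init is stubbed here for illustration):

    def atom_model_init(model_name: str, **kwargs):
        # Stub for illustration; the real initializer builds the layout,
        # MFD, MFR, OCR or table model selected by model_name.
        return object()

    class AtomModelSingleton:
        _instance = None
        _models = {}

        def __new__(cls, *args, **kwargs):
            if cls._instance is None:
                cls._instance = super().__new__(cls)
            return cls._instance

        def get_atom_model(self, atom_model_name: str, **kwargs):
            # Models are cached per (model name, layout model, language) key,
            # so repeated pages reuse the same loaded weights.
            key = (atom_model_name, kwargs.get("layout_model_name"), kwargs.get("lang"))
            if key not in self._models:
                self._models[key] = atom_model_init(model_name=atom_model_name, **kwargs)
            return self._models[key]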

+ 0 - 36
magic_pdf/model/pek_sub_modules/post_process.py

@@ -1,36 +0,0 @@
-import re
-
-def layout_rm_equation(layout_res):
-    rm_idxs = []
-    for idx, ele in enumerate(layout_res['layout_dets']):
-        if ele['category_id'] == 10:
-            rm_idxs.append(idx)
-    
-    for idx in rm_idxs[::-1]:
-        del layout_res['layout_dets'][idx]
-    return layout_res
-
-
-def get_croped_image(image_pil, bbox):
-    x_min, y_min, x_max, y_max = bbox
-    croped_img = image_pil.crop((x_min, y_min, x_max, y_max))
-    return croped_img
-
-
-def latex_rm_whitespace(s: str):
-    """Remove unnecessary whitespace from LaTeX code.
-    """
-    text_reg = r'(\\(operatorname|mathrm|text|mathbf)\s?\*? {.*?})'
-    letter = '[a-zA-Z]'
-    noletter = '[\W_^\d]'
-    names = [x[0].replace(' ', '') for x in re.findall(text_reg, s)]
-    s = re.sub(text_reg, lambda match: str(names.pop(0)), s)
-    news = s
-    while True:
-        s = news
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', news)
-        news = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', news)
-        if news == s:
-            break
-    return s
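
A worked example of what the deleted latex_rm_whitespace did, reusing its regexes (where this logic lives after the refactor is not shown in this diff):

    import re

    letter, noletter = r'[a-zA-Z]', r'[\W_^\d]'
    s, prev = r"\mathrm{softmax} ( x _ { i } )", None
    while prev != s:  # collapse whitespace to a fixed point, as the loop above did
        prev = s
        s = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
        s = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', s)
        s = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', s)
    print(s)  # \mathrm{softmax}(x_{i})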

+ 0 - 388
magic_pdf/model/pek_sub_modules/self_modify.py

@@ -1,388 +0,0 @@
-import time
-import copy
-import base64
-import cv2
-import numpy as np
-from io import BytesIO
-from PIL import Image
-
-from paddleocr import PaddleOCR
-from paddleocr.ppocr.utils.logging import get_logger
-from paddleocr.ppocr.utils.utility import check_and_read, alpha_to_color, binarize_img
-from paddleocr.tools.infer.utility import draw_ocr_box_txt, get_rotate_crop_image, get_minarea_rect_crop
-
-from magic_pdf.libs.boxbase import __is_overlaps_y_exceeds_threshold
-from magic_pdf.pre_proc.ocr_dict_merge import merge_spans_to_line
-
-logger = get_logger()
-
-
-def img_decode(content: bytes):
-    np_arr = np.frombuffer(content, dtype=np.uint8)
-    return cv2.imdecode(np_arr, cv2.IMREAD_UNCHANGED)
-
-
-def check_img(img):
-    if isinstance(img, bytes):
-        img = img_decode(img)
-    if isinstance(img, str):
-        image_file = img
-        img, flag_gif, flag_pdf = check_and_read(image_file)
-        if not flag_gif and not flag_pdf:
-            with open(image_file, 'rb') as f:
-                img_str = f.read()
-                img = img_decode(img_str)
-            if img is None:
-                try:
-                    buf = BytesIO()
-                    image = BytesIO(img_str)
-                    im = Image.open(image)
-                    rgb = im.convert('RGB')
-                    rgb.save(buf, 'jpeg')
-                    buf.seek(0)
-                    image_bytes = buf.read()
-                    data_base64 = str(base64.b64encode(image_bytes),
-                                      encoding="utf-8")
-                    image_decode = base64.b64decode(data_base64)
-                    img_array = np.frombuffer(image_decode, np.uint8)
-                    img = cv2.imdecode(img_array, cv2.IMREAD_COLOR)
-                except:
-                    logger.error("error in loading image:{}".format(image_file))
-                    return None
-        if img is None:
-            logger.error("error in loading image:{}".format(image_file))
-            return None
-    if isinstance(img, np.ndarray) and len(img.shape) == 2:
-        img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
-
-    return img
-
-
-def sorted_boxes(dt_boxes):
-    """
-    Sort text boxes in order from top to bottom, left to right
-    args:
-        dt_boxes(array):detected text boxes with shape [4, 2]
-    return:
-        sorted boxes(array) with shape [4, 2]
-    """
-    num_boxes = dt_boxes.shape[0]
-    sorted_boxes = sorted(dt_boxes, key=lambda x: (x[0][1], x[0][0]))
-    _boxes = list(sorted_boxes)
-
-    for i in range(num_boxes - 1):
-        for j in range(i, -1, -1):
-            if abs(_boxes[j + 1][0][1] - _boxes[j][0][1]) < 10 and \
-                    (_boxes[j + 1][0][0] < _boxes[j][0][0]):
-                tmp = _boxes[j]
-                _boxes[j] = _boxes[j + 1]
-                _boxes[j + 1] = tmp
-            else:
-                break
-    return _boxes
-
-
-def bbox_to_points(bbox):
-    """Convert a bbox into an array of four corner points."""
-    x0, y0, x1, y1 = bbox
-    return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]]).astype('float32')
-
-
-def points_to_bbox(points):
-    """Convert an array of four corner points into a bbox."""
-    x0, y0 = points[0]
-    x1, _ = points[1]
-    _, y1 = points[2]
-    return [x0, y0, x1, y1]
-
-
-def merge_intervals(intervals):
-    # Sort the intervals based on the start value
-    intervals.sort(key=lambda x: x[0])
-
-    merged = []
-    for interval in intervals:
-        # If the list of merged intervals is empty or if the current
-        # interval does not overlap with the previous, simply append it.
-        if not merged or merged[-1][1] < interval[0]:
-            merged.append(interval)
-        else:
-            # Otherwise, there is overlap, so we merge the current and previous intervals.
-            merged[-1][1] = max(merged[-1][1], interval[1])
-
-    return merged
-
-
-def remove_intervals(original, masks):
-    # Merge all mask intervals
-    merged_masks = merge_intervals(masks)
-
-    result = []
-    original_start, original_end = original
-
-    for mask in merged_masks:
-        mask_start, mask_end = mask
-
-        # If the mask starts after the original range, ignore it
-        if mask_start > original_end:
-            continue
-
-        # If the mask ends before the original range starts, ignore it
-        if mask_end < original_start:
-            continue
-
-        # Remove the masked part from the original range
-        if original_start < mask_start:
-            result.append([original_start, mask_start - 1])
-
-        original_start = max(mask_end + 1, original_start)
-
-    # Add the remaining part of the original range, if any
-    if original_start <= original_end:
-        result.append([original_start, original_end])
-
-    return result
-
-
-def update_det_boxes(dt_boxes, mfd_res):
-    new_dt_boxes = []
-    for text_box in dt_boxes:
-        text_bbox = points_to_bbox(text_box)
-        masks_list = []
-        for mf_box in mfd_res:
-            mf_bbox = mf_box['bbox']
-            if __is_overlaps_y_exceeds_threshold(text_bbox, mf_bbox):
-                masks_list.append([mf_bbox[0], mf_bbox[2]])
-        text_x_range = [text_bbox[0], text_bbox[2]]
-        text_remove_mask_range = remove_intervals(text_x_range, masks_list)
-        temp_dt_box = []
-        for text_remove_mask in text_remove_mask_range:
-            temp_dt_box.append(bbox_to_points([text_remove_mask[0], text_bbox[1], text_remove_mask[1], text_bbox[3]]))
-        if len(temp_dt_box) > 0:
-            new_dt_boxes.extend(temp_dt_box)
-    return new_dt_boxes
-
-
-def merge_overlapping_spans(spans):
-    """
-    Merges overlapping spans on the same line.
-
-    :param spans: A list of span coordinates [(x1, y1, x2, y2), ...]
-    :return: A list of merged spans
-    """
-    # Return an empty list if the input spans list is empty
-    if not spans:
-        return []
-
-    # Sort spans by their starting x-coordinate
-    spans.sort(key=lambda x: x[0])
-
-    # Initialize the list of merged spans
-    merged = []
-    for span in spans:
-        # Unpack span coordinates
-        x1, y1, x2, y2 = span
-        # If the merged list is empty or there's no horizontal overlap, add the span directly
-        if not merged or merged[-1][2] < x1:
-            merged.append(span)
-        else:
-            # If there is horizontal overlap, merge the current span with the previous one
-            last_span = merged.pop()
-            # Update the merged span's top-left corner to the smaller (x1, y1) and bottom-right to the larger (x2, y2)
-            x1 = min(last_span[0], x1)
-            y1 = min(last_span[1], y1)
-            x2 = max(last_span[2], x2)
-            y2 = max(last_span[3], y2)
-            # Add the merged span back to the list
-            merged.append((x1, y1, x2, y2))
-
-    # Return the list of merged spans
-    return merged
-
-
-def merge_det_boxes(dt_boxes):
-    """
-    Merge detection boxes.
-
-    This function takes a list of detected bounding boxes, each represented by four corner points.
-    The goal is to merge these bounding boxes into larger text regions.
-
-    Parameters:
-    dt_boxes (list): A list containing multiple text detection boxes, where each box is defined by four corner points.
-
-    Returns:
-    list: A list containing the merged text regions, where each region is represented by four corner points.
-    """
-    # Convert the detection boxes into a dictionary format with bounding boxes and type
-    dt_boxes_dict_list = []
-    for text_box in dt_boxes:
-        text_bbox = points_to_bbox(text_box)
-        text_box_dict = {
-            'bbox': text_bbox,
-            'type': 'text',
-        }
-        dt_boxes_dict_list.append(text_box_dict)
-
-    # Merge adjacent text regions into lines
-    lines = merge_spans_to_line(dt_boxes_dict_list)
-
-    # Initialize a new list for storing the merged text regions
-    new_dt_boxes = []
-    for line in lines:
-        line_bbox_list = []
-        for span in line:
-            line_bbox_list.append(span['bbox'])
-
-        # Merge overlapping text regions within the same line
-        merged_spans = merge_overlapping_spans(line_bbox_list)
-
-        # Convert the merged text regions back to point format and add them to the new detection box list
-        for span in merged_spans:
-            new_dt_boxes.append(bbox_to_points(span))
-
-    return new_dt_boxes
-
-
-class ModifiedPaddleOCR(PaddleOCR):
-    def ocr(self, img, det=True, rec=True, cls=True, bin=False, inv=False, mfd_res=None, alpha_color=(255, 255, 255)):
-        """
-        OCR with PaddleOCR
-        args:
-            img: img for OCR, support ndarray, img_path and list or ndarray
-            det: use text detection or not. If False, only rec will be exec. Default is True
-            rec: use text recognition or not. If False, only det will be exec. Default is True
-            cls: use angle classifier or not. Default is True. If True, the text with rotation of 180 degrees can be recognized. If no text is rotated by 180 degrees, use cls=False to get better performance. Text with rotation of 90 or 270 degrees can be recognized even if cls=False.
-            bin: binarize image to black and white. Default is False.
-            inv: invert image colors. Default is False.
-            alpha_color: set RGB color Tuple for transparent parts replacement. Default is pure white.
-        """
-        assert isinstance(img, (np.ndarray, list, str, bytes))
-        if isinstance(img, list) and det == True:
-            logger.error('When input a list of images, det must be false')
-            exit(0)
-        if cls == True and self.use_angle_cls == False:
-            pass
-            # logger.warning(
-            #     'Since the angle classifier is not initialized, it will not be used during the forward process'
-            # )
-
-        img = check_img(img)
-        # for infer pdf file
-        if isinstance(img, list):
-            if self.page_num > len(img) or self.page_num == 0:
-                self.page_num = len(img)
-            imgs = img[:self.page_num]
-        else:
-            imgs = [img]
-
-        def preprocess_image(_image):
-            _image = alpha_to_color(_image, alpha_color)
-            if inv:
-                _image = cv2.bitwise_not(_image)
-            if bin:
-                _image = binarize_img(_image)
-            return _image
-
-        if det and rec:
-            ocr_res = []
-            for idx, img in enumerate(imgs):
-                img = preprocess_image(img)
-                dt_boxes, rec_res, _ = self.__call__(img, cls, mfd_res=mfd_res)
-                if not dt_boxes and not rec_res:
-                    ocr_res.append(None)
-                    continue
-                tmp_res = [[box.tolist(), res]
-                           for box, res in zip(dt_boxes, rec_res)]
-                ocr_res.append(tmp_res)
-            return ocr_res
-        elif det and not rec:
-            ocr_res = []
-            for idx, img in enumerate(imgs):
-                img = preprocess_image(img)
-                dt_boxes, elapse = self.text_detector(img)
-                if not dt_boxes:
-                    ocr_res.append(None)
-                    continue
-                tmp_res = [box.tolist() for box in dt_boxes]
-                ocr_res.append(tmp_res)
-            return ocr_res
-        else:
-            ocr_res = []
-            cls_res = []
-            for idx, img in enumerate(imgs):
-                if not isinstance(img, list):
-                    img = preprocess_image(img)
-                    img = [img]
-                if self.use_angle_cls and cls:
-                    img, cls_res_tmp, elapse = self.text_classifier(img)
-                    if not rec:
-                        cls_res.append(cls_res_tmp)
-                rec_res, elapse = self.text_recognizer(img)
-                ocr_res.append(rec_res)
-            if not rec:
-                return cls_res
-            return ocr_res
-
-    def __call__(self, img, cls=True, mfd_res=None):
-        time_dict = {'det': 0, 'rec': 0, 'cls': 0, 'all': 0}
-
-        if img is None:
-            logger.debug("no valid image provided")
-            return None, None, time_dict
-
-        start = time.time()
-        ori_im = img.copy()
-        dt_boxes, elapse = self.text_detector(img)
-        time_dict['det'] = elapse
-
-        if dt_boxes is None:
-            logger.debug("no dt_boxes found, elapsed : {}".format(elapse))
-            end = time.time()
-            time_dict['all'] = end - start
-            return None, None, time_dict
-        else:
-            logger.debug("dt_boxes num : {}, elapsed : {}".format(
-                len(dt_boxes), elapse))
-        img_crop_list = []
-
-        dt_boxes = sorted_boxes(dt_boxes)
-
-        dt_boxes = merge_det_boxes(dt_boxes)
-
-        if mfd_res:
-            bef = time.time()
-            dt_boxes = update_det_boxes(dt_boxes, mfd_res)
-            aft = time.time()
-            logger.debug("split text box by formula, new dt_boxes num : {}, elapsed : {}".format(
-                len(dt_boxes), aft - bef))
-
-        for bno in range(len(dt_boxes)):
-            tmp_box = copy.deepcopy(dt_boxes[bno])
-            if self.args.det_box_type == "quad":
-                img_crop = get_rotate_crop_image(ori_im, tmp_box)
-            else:
-                img_crop = get_minarea_rect_crop(ori_im, tmp_box)
-            img_crop_list.append(img_crop)
-        if self.use_angle_cls and cls:
-            img_crop_list, angle_list, elapse = self.text_classifier(
-                img_crop_list)
-            time_dict['cls'] = elapse
-            logger.debug("cls num  : {}, elapsed : {}".format(
-                len(img_crop_list), elapse))
-
-        rec_res, elapse = self.text_recognizer(img_crop_list)
-        time_dict['rec'] = elapse
-        logger.debug("rec_res num  : {}, elapsed : {}".format(
-            len(rec_res), elapse))
-        if self.args.save_crop_res:
-            self.draw_crop_rec_res(self.args.crop_res_save_dir, img_crop_list,
-                                   rec_res)
-        filter_boxes, filter_rec_res = [], []
-        for box, rec_result in zip(dt_boxes, rec_res):
-            text, score = rec_result
-            if score >= self.drop_score:
-                filter_boxes.append(box)
-                filter_rec_res.append(rec_result)
-        end = time.time()
-        time_dict['all'] = end - start
-        return filter_boxes, filter_rec_res, time_dict

+ 0 - 0
magic_pdf/model/pek_sub_modules/__init__.py → magic_pdf/model/sub_modules/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/__init__.py → magic_pdf/model/sub_modules/layout/__init__.py


+ 21 - 0
magic_pdf/model/sub_modules/layout/doclayout_yolo/DocLayoutYOLO.py

@@ -0,0 +1,21 @@
+from doclayout_yolo import YOLOv10
+
+
+class DocLayoutYOLOModel(object):
+    def __init__(self, weight, device):
+        self.model = YOLOv10(weight)
+        self.device = device
+
+    def predict(self, image):
+        layout_res = []
+        doclayout_yolo_res = self.model.predict(image, imgsz=1024, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
+        for xyxy, conf, cla in zip(doclayout_yolo_res.boxes.xyxy.cpu(), doclayout_yolo_res.boxes.conf.cpu(),
+                                   doclayout_yolo_res.boxes.cls.cpu()):
+            xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
+            new_item = {
+                'category_id': int(cla.item()),
+                'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
+                'score': round(float(conf.item()), 3),
+            }
+            layout_res.append(new_item)
+        return layout_res
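
A minimal usage sketch for the new wrapper (weight path and page image are hypothetical placeholders; requires the doclayout_yolo package):

import cv2
from magic_pdf.model.sub_modules.layout.doclayout_yolo.DocLayoutYOLO import DocLayoutYOLOModel

# hypothetical checkpoint path and page image
model = DocLayoutYOLOModel(weight="models/doclayout_yolo_ft.pt", device="cuda")
image = cv2.imread("page_0.png")
for region in model.predict(image):
    # each item carries a category_id, an 8-value poly (four corners), and a score
    print(region["category_id"], region["score"], region["poly"])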

+ 0 - 0
magic_pdf/model/pek_sub_modules/structeqtable/__init__.py → magic_pdf/model/sub_modules/layout/doclayout_yolo/__init__.py


+ 0 - 0
magic_pdf/model/v3/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/backbone.py → magic_pdf/model/sub_modules/layout/layoutlmv3/backbone.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/beit.py → magic_pdf/model/sub_modules/layout/layoutlmv3/beit.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/deit.py → magic_pdf/model/sub_modules/layout/layoutlmv3/deit.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/cord.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/cord.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/data_collator.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/data_collator.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/funsd.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/funsd.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/image_utils.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/image_utils.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/xfund.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/xfund.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/configuration_layoutlmv3.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/configuration_layoutlmv3.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/modeling_layoutlmv3.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/modeling_layoutlmv3.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3_fast.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3_fast.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/model_init.py → magic_pdf/model/sub_modules/layout/layoutlmv3/model_init.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/rcnn_vl.py → magic_pdf/model/sub_modules/layout/layoutlmv3/rcnn_vl.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/visualizer.py → magic_pdf/model/sub_modules/layout/layoutlmv3/visualizer.py


+ 0 - 0
tests/test_data/__init__.py → magic_pdf/model/sub_modules/mfd/__init__.py


+ 12 - 0
magic_pdf/model/sub_modules/mfd/yolov8/YOLOv8.py

@@ -0,0 +1,12 @@
+from ultralytics import YOLO
+
+
+class YOLOv8MFDModel(object):
+    def __init__(self, weight, device='cpu'):
+        self.mfd_model = YOLO(weight)
+        self.device = device
+
+    def predict(self, image):
+        mfd_res = self.mfd_model.predict(image, imgsz=1888, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
+        return mfd_res
+
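
A minimal sketch of the MFD wrapper (weight path and image hypothetical); it returns the raw ultralytics result object, whose boxes tensor is consumed by the formula recognizer below:

import cv2
from magic_pdf.model.sub_modules.mfd.yolov8.YOLOv8 import YOLOv8MFDModel

mfd = YOLOv8MFDModel(weight="models/yolo_v8_mfd.pt", device="cuda")  # hypothetical path
mfd_res = mfd.predict(cv2.imread("page_0.png"))  # hypothetical page image
print(mfd_res.boxes.xyxy.shape)  # one row per detected formula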

+ 0 - 0
tests/test_data/data_reader_writer/__init__.py → magic_pdf/model/sub_modules/mfd/yolov8/__init__.py


+ 0 - 0
tests/test_data/io/__init__.py → magic_pdf/model/sub_modules/mfr/__init__.py


+ 98 - 0
magic_pdf/model/sub_modules/mfr/unimernet/Unimernet.py

@@ -0,0 +1,98 @@
+import os
+import argparse
+import re
+
+from PIL import Image
+import torch
+from torch.utils.data import Dataset, DataLoader
+from torchvision import transforms
+from unimernet.common.config import Config
+import unimernet.tasks as tasks
+from unimernet.processors import load_processor
+
+
+class MathDataset(Dataset):
+    def __init__(self, image_paths, transform=None):
+        self.image_paths = image_paths
+        self.transform = transform
+
+    def __len__(self):
+        return len(self.image_paths)
+
+    def __getitem__(self, idx):
+        # open from disk when given a path; otherwise assume it is already a PIL image
+        if isinstance(self.image_paths[idx], str):
+            raw_image = Image.open(self.image_paths[idx])
+        else:
+            raw_image = self.image_paths[idx]
+        if self.transform:
+            image = self.transform(raw_image)
+            return image
+
+
+def latex_rm_whitespace(s: str):
+    """Remove unnecessary whitespace from LaTeX code.
+    """
+    text_reg = r'(\\(operatorname|mathrm|text|mathbf)\s?\*? {.*?})'
+    letter = '[a-zA-Z]'
+    noletter = r'[\W_^\d]'  # raw string, so \W and \d are not treated as invalid escapes
+    names = [x[0].replace(' ', '') for x in re.findall(text_reg, s)]
+    s = re.sub(text_reg, lambda match: str(names.pop(0)), s)
+    news = s
+    while True:
+        s = news
+        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
+        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', news)
+        news = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', news)
+        if news == s:
+            break
+    return s
+
+
+class UnimernetModel(object):
+    def __init__(self, weight_dir, cfg_path, _device_='cpu'):
+
+        args = argparse.Namespace(cfg_path=cfg_path, options=None)
+        cfg = Config(args)
+        cfg.config.model.pretrained = os.path.join(weight_dir, "pytorch_model.pth")
+        cfg.config.model.model_config.model_name = weight_dir
+        cfg.config.model.tokenizer_config.path = weight_dir
+        task = tasks.setup_task(cfg)
+        self.model = task.build_model(cfg)
+        self.device = _device_
+        self.model.to(_device_)
+        self.model.eval()
+        vis_processor = load_processor('formula_image_eval', cfg.config.datasets.formula_rec_eval.vis_processor.eval)
+        self.mfr_transform = transforms.Compose([vis_processor, ])
+
+    def predict(self, mfd_res, image):
+
+        formula_list = []
+        mf_image_list = []
+        for xyxy, conf, cla in zip(mfd_res.boxes.xyxy.cpu(), mfd_res.boxes.conf.cpu(), mfd_res.boxes.cls.cpu()):
+            xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
+            new_item = {
+                'category_id': 13 + int(cla.item()),
+                'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
+                'score': round(float(conf.item()), 2),
+                'latex': '',
+            }
+            formula_list.append(new_item)
+            pil_img = Image.fromarray(image)
+            bbox_img = pil_img.crop((xmin, ymin, xmax, ymax))
+            mf_image_list.append(bbox_img)
+
+        dataset = MathDataset(mf_image_list, transform=self.mfr_transform)
+        dataloader = DataLoader(dataset, batch_size=64, num_workers=0)
+        mfr_res = []
+        for mf_img in dataloader:
+            mf_img = mf_img.to(self.device)
+            with torch.no_grad():
+                output = self.model.generate({'image': mf_img})
+            mfr_res.extend(output['pred_str'])
+        for res, latex in zip(formula_list, mfr_res):
+            res['latex'] = latex_rm_whitespace(latex)
+        return formula_list
+
+
+
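
A minimal sketch chaining MFD detection into UnimernetModel (all paths hypothetical; weight_dir must contain pytorch_model.pth plus the tokenizer/config files referenced above):

import cv2
from magic_pdf.model.sub_modules.mfd.yolov8.YOLOv8 import YOLOv8MFDModel
from magic_pdf.model.sub_modules.mfr.unimernet.Unimernet import UnimernetModel

image = cv2.imread("page_0.png")  # hypothetical page image
mfd_res = YOLOv8MFDModel("models/yolo_v8_mfd.pt", "cuda").predict(image)
mfr = UnimernetModel("models/unimernet", "configs/unimernet.yaml", _device_="cuda")
for formula in mfr.predict(mfd_res, image):
    print(formula["score"], formula["latex"])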

+ 0 - 0
tests/test_model/__init__.py → magic_pdf/model/sub_modules/mfr/unimernet/__init__.py


+ 144 - 0
magic_pdf/model/sub_modules/model_init.py

@@ -0,0 +1,144 @@
+from loguru import logger
+
+from magic_pdf.libs.Constants import MODEL_NAME
+from magic_pdf.model.model_list import AtomicModel
+from magic_pdf.model.sub_modules.layout.doclayout_yolo.DocLayoutYOLO import DocLayoutYOLOModel
+from magic_pdf.model.sub_modules.layout.layoutlmv3.model_init import Layoutlmv3_Predictor
+from magic_pdf.model.sub_modules.mfd.yolov8.YOLOv8 import YOLOv8MFDModel
+
+from magic_pdf.model.sub_modules.mfr.unimernet.Unimernet import UnimernetModel
+from magic_pdf.model.sub_modules.ocr.paddleocr.ppocr_273_mod import ModifiedPaddleOCR
+# from magic_pdf.model.sub_modules.ocr.paddleocr.ppocr_291_mod import ModifiedPaddleOCR
+from magic_pdf.model.sub_modules.table.structeqtable.struct_eqtable import StructTableModel
+from magic_pdf.model.sub_modules.table.tablemaster.tablemaster_paddle import TableMasterPaddleModel
+from magic_pdf.model.sub_modules.table.rapidtable.rapid_table import RapidTableModel
+
+
+def table_model_init(table_model_type, model_path, max_time, _device_='cpu'):
+    if table_model_type == MODEL_NAME.STRUCT_EQTABLE:
+        table_model = StructTableModel(model_path, max_new_tokens=2048, max_time=max_time)
+    elif table_model_type == MODEL_NAME.TABLE_MASTER:
+        config = {
+            "model_dir": model_path,
+            "device": _device_
+        }
+        table_model = TableMasterPaddleModel(config)
+    elif table_model_type == MODEL_NAME.RAPID_TABLE:
+        table_model = RapidTableModel()
+    else:
+        logger.error("table model type not allowed")
+        exit(1)
+
+    return table_model
+
+
+def mfd_model_init(weight, device='cpu'):
+    mfd_model = YOLOv8MFDModel(weight, device)
+    return mfd_model
+
+
+def mfr_model_init(weight_dir, cfg_path, device='cpu'):
+    mfr_model = UnimernetModel(weight_dir, cfg_path, device)
+    return mfr_model
+
+
+def layout_model_init(weight, config_file, device):
+    model = Layoutlmv3_Predictor(weight, config_file, device)
+    return model
+
+
+def doclayout_yolo_model_init(weight, device='cpu'):
+    model = DocLayoutYOLOModel(weight, device)
+    return model
+
+
+def ocr_model_init(show_log: bool = False,
+                   det_db_box_thresh=0.3,
+                   lang=None,
+                   use_dilation=True,
+                   det_db_unclip_ratio=1.8,
+                   ):
+    if lang is not None:
+        model = ModifiedPaddleOCR(
+            show_log=show_log,
+            det_db_box_thresh=det_db_box_thresh,
+            lang=lang,
+            use_dilation=use_dilation,
+            det_db_unclip_ratio=det_db_unclip_ratio,
+        )
+    else:
+        model = ModifiedPaddleOCR(
+            show_log=show_log,
+            det_db_box_thresh=det_db_box_thresh,
+            use_dilation=use_dilation,
+            det_db_unclip_ratio=det_db_unclip_ratio,
+            # use_angle_cls=True,
+        )
+    return model
+
+
+class AtomModelSingleton:
+    _instance = None
+    _models = {}
+
+    def __new__(cls, *args, **kwargs):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
+
+    def get_atom_model(self, atom_model_name: str, **kwargs):
+        lang = kwargs.get("lang", None)
+        layout_model_name = kwargs.get("layout_model_name", None)
+        key = (atom_model_name, layout_model_name, lang)
+        if key not in self._models:
+            self._models[key] = atom_model_init(model_name=atom_model_name, **kwargs)
+        return self._models[key]
+
+
+def atom_model_init(model_name: str, **kwargs):
+    atom_model = None
+    if model_name == AtomicModel.Layout:
+        if kwargs.get("layout_model_name") == MODEL_NAME.LAYOUTLMv3:
+            atom_model = layout_model_init(
+                kwargs.get("layout_weights"),
+                kwargs.get("layout_config_file"),
+                kwargs.get("device")
+            )
+        elif kwargs.get("layout_model_name") == MODEL_NAME.DocLayout_YOLO:
+            atom_model = doclayout_yolo_model_init(
+                kwargs.get("doclayout_yolo_weights"),
+                kwargs.get("device")
+            )
+    elif model_name == AtomicModel.MFD:
+        atom_model = mfd_model_init(
+            kwargs.get("mfd_weights"),
+            kwargs.get("device")
+        )
+    elif model_name == AtomicModel.MFR:
+        atom_model = mfr_model_init(
+            kwargs.get("mfr_weight_dir"),
+            kwargs.get("mfr_cfg_path"),
+            kwargs.get("device")
+        )
+    elif model_name == AtomicModel.OCR:
+        atom_model = ocr_model_init(
+            kwargs.get("ocr_show_log"),
+            kwargs.get("det_db_box_thresh"),
+            kwargs.get("lang")
+        )
+    elif model_name == AtomicModel.Table:
+        atom_model = table_model_init(
+            kwargs.get("table_model_name"),
+            kwargs.get("table_model_path"),
+            kwargs.get("table_max_time"),
+            kwargs.get("device")
+        )
+    else:
+        logger.error("model name not allowed")
+        exit(1)
+
+    if atom_model is None:
+        logger.error("model init failed")
+        exit(1)
+    else:
+        return atom_model
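
A minimal sketch of the cached entry point; keyword names follow atom_model_init above, and the threshold/lang values are hypothetical:

from magic_pdf.model.model_list import AtomicModel
from magic_pdf.model.sub_modules.model_init import AtomModelSingleton

singleton = AtomModelSingleton()
ocr_model = singleton.get_atom_model(
    atom_model_name=AtomicModel.OCR,
    ocr_show_log=False,
    det_db_box_thresh=0.3,
    lang="en",
)
# a second call with the same (model name, layout model name, lang) key returns the cached instance
assert ocr_model is singleton.get_atom_model(
    atom_model_name=AtomicModel.OCR,
    ocr_show_log=False,
    det_db_box_thresh=0.3,
    lang="en",
)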

+ 51 - 0
magic_pdf/model/sub_modules/model_utils.py

@@ -0,0 +1,51 @@
+import time
+
+import torch
+from PIL import Image
+from loguru import logger
+
+from magic_pdf.libs.clean_memory import clean_memory
+
+
+def crop_img(input_res, input_pil_img, crop_paste_x=0, crop_paste_y=0):
+    crop_xmin, crop_ymin = int(input_res['poly'][0]), int(input_res['poly'][1])
+    crop_xmax, crop_ymax = int(input_res['poly'][4]), int(input_res['poly'][5])
+    # Create a white background padded by crop_paste_x / crop_paste_y on each side
+    crop_new_width = crop_xmax - crop_xmin + crop_paste_x * 2
+    crop_new_height = crop_ymax - crop_ymin + crop_paste_y * 2
+    return_image = Image.new('RGB', (crop_new_width, crop_new_height), 'white')
+
+    # Crop image
+    crop_box = (crop_xmin, crop_ymin, crop_xmax, crop_ymax)
+    cropped_img = input_pil_img.crop(crop_box)
+    return_image.paste(cropped_img, (crop_paste_x, crop_paste_y))
+    return_list = [crop_paste_x, crop_paste_y, crop_xmin, crop_ymin, crop_xmax, crop_ymax, crop_new_width, crop_new_height]
+    return return_image, return_list
+
+
+# Select regions for OCR / formula regions / table regions
+def get_res_list_from_layout_res(layout_res):
+    ocr_res_list = []
+    table_res_list = []
+    single_page_mfdetrec_res = []
+    for res in layout_res:
+        if int(res['category_id']) in [13, 14]:
+            single_page_mfdetrec_res.append({
+                "bbox": [int(res['poly'][0]), int(res['poly'][1]),
+                         int(res['poly'][4]), int(res['poly'][5])],
+            })
+        elif int(res['category_id']) in [0, 1, 2, 4, 6, 7]:
+            ocr_res_list.append(res)
+        elif int(res['category_id']) in [5]:
+            table_res_list.append(res)
+    return ocr_res_list, table_res_list, single_page_mfdetrec_res
+
+
+def clean_vram(device, vram_threshold=8):
+    if torch.cuda.is_available() and device != 'cpu':
+        total_memory = torch.cuda.get_device_properties(device).total_memory / (1024 ** 3)  # convert bytes to GB
+        if total_memory <= vram_threshold:
+            gc_start = time.time()
+            clean_memory()
+            gc_time = round(time.time() - gc_start, 2)
+            logger.info(f"gc time: {gc_time}")
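
A minimal sketch of the helpers above (layout result and page image hypothetical): get_res_list_from_layout_res routes regions by category, and crop_img pastes each crop onto a padded white canvas before OCR:

from PIL import Image
from magic_pdf.model.sub_modules.model_utils import crop_img, get_res_list_from_layout_res

page_img = Image.open("page_0.png")  # hypothetical rendered page
layout_res = [
    {"category_id": 1, "poly": [100, 100, 400, 100, 400, 200, 100, 200], "score": 0.9},   # text
    {"category_id": 14, "poly": [120, 300, 380, 300, 380, 340, 120, 340], "score": 0.8},  # formula
]
ocr_res_list, table_res_list, mfd_res = get_res_list_from_layout_res(layout_res)
for res in ocr_res_list:
    # pad the crop with 25 px of white on each side
    crop, useful_list = crop_img(res, page_img, crop_paste_x=25, crop_paste_y=25)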

+ 0 - 0
tests/test_tools/__init__.py → magic_pdf/model/sub_modules/ocr/__init__.py


+ 0 - 0
tests/assets/more_para_test_samples/gift_files.txt → magic_pdf/model/sub_modules/ocr/paddleocr/__init__.py


+ 259 - 0
magic_pdf/model/sub_modules/ocr/paddleocr/ocr_utils.py

@@ -0,0 +1,259 @@
+import math
+
+import numpy as np
+from loguru import logger
+
+from magic_pdf.libs.boxbase import __is_overlaps_y_exceeds_threshold
+from magic_pdf.pre_proc.ocr_dict_merge import merge_spans_to_line
+
+
+def bbox_to_points(bbox):
+    """Convert a bbox into an array of its four corner points."""
+    x0, y0, x1, y1 = bbox
+    return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]]).astype('float32')
+
+
+def points_to_bbox(points):
+    """Convert an array of four corner points into bbox format."""
+    x0, y0 = points[0]
+    x1, _ = points[1]
+    _, y1 = points[2]
+    return [x0, y0, x1, y1]
+
+
+def merge_intervals(intervals):
+    # Sort the intervals based on the start value
+    intervals.sort(key=lambda x: x[0])
+
+    merged = []
+    for interval in intervals:
+        # If the list of merged intervals is empty or if the current
+        # interval does not overlap with the previous, simply append it.
+        if not merged or merged[-1][1] < interval[0]:
+            merged.append(interval)
+        else:
+            # Otherwise, there is overlap, so we merge the current and previous intervals.
+            merged[-1][1] = max(merged[-1][1], interval[1])
+
+    return merged
+
+
+def remove_intervals(original, masks):
+    # Merge all mask intervals
+    merged_masks = merge_intervals(masks)
+
+    result = []
+    original_start, original_end = original
+
+    for mask in merged_masks:
+        mask_start, mask_end = mask
+
+        # If the mask starts after the original range, ignore it
+        if mask_start > original_end:
+            continue
+
+        # If the mask ends before the original range starts, ignore it
+        if mask_end < original_start:
+            continue
+
+        # Remove the masked part from the original range
+        if original_start < mask_start:
+            result.append([original_start, mask_start - 1])
+
+        original_start = max(mask_end + 1, original_start)
+
+    # Add the remaining part of the original range, if any
+    if original_start <= original_end:
+        result.append([original_start, original_end])
+
+    return result
+
+
+def update_det_boxes(dt_boxes, mfd_res):
+    new_dt_boxes = []
+    for text_box in dt_boxes:
+        text_bbox = points_to_bbox(text_box)
+        masks_list = []
+        for mf_box in mfd_res:
+            mf_bbox = mf_box['bbox']
+            if __is_overlaps_y_exceeds_threshold(text_bbox, mf_bbox):
+                masks_list.append([mf_bbox[0], mf_bbox[2]])
+        text_x_range = [text_bbox[0], text_bbox[2]]
+        text_remove_mask_range = remove_intervals(text_x_range, masks_list)
+        temp_dt_box = []
+        for text_remove_mask in text_remove_mask_range:
+            temp_dt_box.append(bbox_to_points([text_remove_mask[0], text_bbox[1], text_remove_mask[1], text_bbox[3]]))
+        if len(temp_dt_box) > 0:
+            new_dt_boxes.extend(temp_dt_box)
+    return new_dt_boxes
+
+
+def merge_overlapping_spans(spans):
+    """
+    Merges overlapping spans on the same line.
+
+    :param spans: A list of span coordinates [(x1, y1, x2, y2), ...]
+    :return: A list of merged spans
+    """
+    # Return an empty list if the input spans list is empty
+    if not spans:
+        return []
+
+    # Sort spans by their starting x-coordinate
+    spans.sort(key=lambda x: x[0])
+
+    # Initialize the list of merged spans
+    merged = []
+    for span in spans:
+        # Unpack span coordinates
+        x1, y1, x2, y2 = span
+        # If the merged list is empty or there's no horizontal overlap, add the span directly
+        if not merged or merged[-1][2] < x1:
+            merged.append(span)
+        else:
+            # If there is horizontal overlap, merge the current span with the previous one
+            last_span = merged.pop()
+            # Update the merged span's top-left corner to the smaller (x1, y1) and bottom-right to the larger (x2, y2)
+            x1 = min(last_span[0], x1)
+            y1 = min(last_span[1], y1)
+            x2 = max(last_span[2], x2)
+            y2 = max(last_span[3], y2)
+            # Add the merged span back to the list
+            merged.append((x1, y1, x2, y2))
+
+    # Return the list of merged spans
+    return merged
+
+
+def merge_det_boxes(dt_boxes):
+    """
+    Merge detection boxes.
+
+    This function takes a list of detected bounding boxes, each represented by four corner points.
+    The goal is to merge these bounding boxes into larger text regions.
+
+    Parameters:
+    dt_boxes (list): A list containing multiple text detection boxes, where each box is defined by four corner points.
+
+    Returns:
+    list: A list containing the merged text regions, where each region is represented by four corner points.
+    """
+    # Convert the detection boxes into a dictionary format with bounding boxes and type
+    dt_boxes_dict_list = []
+    angle_boxes_list = []
+    for text_box in dt_boxes:
+        text_bbox = points_to_bbox(text_box)
+        if text_bbox[2] <= text_bbox[0] or text_bbox[3] <= text_bbox[1]:
+            angle_boxes_list.append(text_box)
+            continue
+        text_box_dict = {
+            'bbox': text_bbox,
+            'type': 'text',
+        }
+        dt_boxes_dict_list.append(text_box_dict)
+
+    # Merge adjacent text regions into lines
+    lines = merge_spans_to_line(dt_boxes_dict_list)
+
+    # Initialize a new list for storing the merged text regions
+    new_dt_boxes = []
+    for line in lines:
+        line_bbox_list = []
+        for span in line:
+            line_bbox_list.append(span['bbox'])
+
+        # Merge overlapping text regions within the same line
+        merged_spans = merge_overlapping_spans(line_bbox_list)
+
+        # Convert the merged text regions back to point format and add them to the new detection box list
+        for span in merged_spans:
+            new_dt_boxes.append(bbox_to_points(span))
+
+    new_dt_boxes.extend(angle_boxes_list)
+
+    return new_dt_boxes
+
+
+def get_adjusted_mfdetrec_res(single_page_mfdetrec_res, useful_list):
+    paste_x, paste_y, xmin, ymin, xmax, ymax, new_width, new_height = useful_list
+    # Adjust the coordinates of the formula area
+    adjusted_mfdetrec_res = []
+    for mf_res in single_page_mfdetrec_res:
+        mf_xmin, mf_ymin, mf_xmax, mf_ymax = mf_res["bbox"]
+        # Adjust the coordinates of the formula area to the coordinates relative to the cropping area
+        x0 = mf_xmin - xmin + paste_x
+        y0 = mf_ymin - ymin + paste_y
+        x1 = mf_xmax - xmin + paste_x
+        y1 = mf_ymax - ymin + paste_y
+        # Filter formula blocks outside the graph
+        if any([x1 < 0, y1 < 0]) or any([x0 > new_width, y0 > new_height]):
+            continue
+        else:
+            adjusted_mfdetrec_res.append({
+                "bbox": [x0, y0, x1, y1],
+            })
+    return adjusted_mfdetrec_res
+
+
+def get_ocr_result_list(ocr_res, useful_list):
+    paste_x, paste_y, xmin, ymin, xmax, ymax, new_width, new_height = useful_list
+    ocr_result_list = []
+    for box_ocr_res in ocr_res:
+
+        p1, p2, p3, p4 = box_ocr_res[0]
+        text, score = box_ocr_res[1]
+        average_angle_degrees = calculate_angle_degrees(box_ocr_res[0])
+        if average_angle_degrees > 0.5:
+            # logger.info(f"average_angle_degrees: {average_angle_degrees}, text: {text}")
+            # the box tilts more than 0.5 degrees from the x-axis, so rectify its boundary
+            # compute the geometric center
+            x_center = sum(point[0] for point in box_ocr_res[0]) / 4
+            y_center = sum(point[1] for point in box_ocr_res[0]) / 4
+            new_height = ((p4[1] - p1[1]) + (p3[1] - p2[1])) / 2
+            new_width = p3[0] - p1[0]
+            p1 = [x_center - new_width / 2, y_center - new_height / 2]
+            p2 = [x_center + new_width / 2, y_center - new_height / 2]
+            p3 = [x_center + new_width / 2, y_center + new_height / 2]
+            p4 = [x_center - new_width / 2, y_center + new_height / 2]
+
+        # Convert the coordinates back to the original coordinate system
+        p1 = [p1[0] - paste_x + xmin, p1[1] - paste_y + ymin]
+        p2 = [p2[0] - paste_x + xmin, p2[1] - paste_y + ymin]
+        p3 = [p3[0] - paste_x + xmin, p3[1] - paste_y + ymin]
+        p4 = [p4[0] - paste_x + xmin, p4[1] - paste_y + ymin]
+
+        ocr_result_list.append({
+            'category_id': 15,
+            'poly': p1 + p2 + p3 + p4,
+            'score': float(round(score, 2)),
+            'text': text,
+        })
+
+    return ocr_result_list
+
+
+def calculate_angle_degrees(poly):
+    # Endpoints of the two diagonals
+    diagonal1 = (poly[0], poly[2])
+    diagonal2 = (poly[1], poly[3])
+
+    # Slope of a diagonal
+    def slope(p1, p2):
+        return (p2[1] - p1[1]) / (p2[0] - p1[0]) if p2[0] != p1[0] else float('inf')
+
+    slope1 = slope(diagonal1[0], diagonal1[1])
+    slope2 = slope(diagonal2[0], diagonal2[1])
+
+    # Angle between each diagonal and the x-axis, in radians
+    angle1_radians = math.atan(slope1)
+    angle2_radians = math.atan(slope2)
+
+    # Convert radians to degrees
+    angle1_degrees = math.degrees(angle1_radians)
+    angle2_degrees = math.degrees(angle2_radians)
+
+    # Average the two diagonals' angles against the x-axis
+    average_angle_degrees = abs((angle1_degrees + angle2_degrees) / 2)
+    # logger.info(f"average_angle_degrees: {average_angle_degrees}")
+    return average_angle_degrees
+
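
A worked example of the interval arithmetic above (coordinates hypothetical): remove_intervals cuts the masked x-ranges out of a text line, and update_det_boxes applies that per detected formula:

from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import (
    bbox_to_points, points_to_bbox, remove_intervals, update_det_boxes)

# a text line spanning x in [0, 100], with formulas masking [20, 30] and [60, 80]
print(remove_intervals([0, 100], [[20, 30], [60, 80]]))
# -> [[0, 19], [31, 59], [81, 100]]

text_box = bbox_to_points([0, 10, 100, 30])
mfd_res = [{'bbox': [20, 8, 30, 32]}, {'bbox': [60, 8, 80, 32]}]
# each surviving x-interval becomes its own quad at the line's original height
for quad in update_det_boxes([text_box], mfd_res):
    print(points_to_bbox(quad))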

+ 168 - 0
magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_273_mod.py

@@ -0,0 +1,168 @@
+import copy
+import time
+
+import cv2
+import numpy as np
+from paddleocr import PaddleOCR
+from paddleocr.paddleocr import check_img, logger
+from paddleocr.ppocr.utils.utility import alpha_to_color, binarize_img
+from paddleocr.tools.infer.predict_system import sorted_boxes
+from paddleocr.tools.infer.utility import get_rotate_crop_image, get_minarea_rect_crop
+
+from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import update_det_boxes, merge_det_boxes
+
+
+class ModifiedPaddleOCR(PaddleOCR):
+    def ocr(self,
+            img,
+            det=True,
+            rec=True,
+            cls=True,
+            bin=False,
+            inv=False,
+            alpha_color=(255, 255, 255),
+            mfd_res=None,
+            ):
+        """
+        OCR with PaddleOCR
+        args:
+            img: image for OCR; supports ndarray, img_path, and list of ndarray
+            det: use text detection or not. If False, only recognition is executed. Default is True
+            rec: use text recognition or not. If False, only detection is executed. Default is True
+            cls: use the angle classifier or not. Default is True. If True, text rotated by 180 degrees can be recognized. If no text is rotated by 180 degrees, use cls=False for better performance. Text rotated by 90 or 270 degrees can be recognized even with cls=False.
+            bin: binarize image to black and white. Default is False.
+            inv: invert image colors. Default is False.
+            alpha_color: set RGB color Tuple for transparent parts replacement. Default is pure white.
+        """
+        assert isinstance(img, (np.ndarray, list, str, bytes))
+        if isinstance(img, list) and det == True:
+            logger.error('When input a list of images, det must be false')
+            exit(0)
+        if cls == True and self.use_angle_cls == False:
+            pass
+            # logger.warning(
+            #     'Since the angle classifier is not initialized, it will not be used during the forward process'
+            # )
+
+        img = check_img(img)
+        # for infer pdf file
+        if isinstance(img, list):
+            if self.page_num > len(img) or self.page_num == 0:
+                self.page_num = len(img)
+            imgs = img[:self.page_num]
+        else:
+            imgs = [img]
+
+        def preprocess_image(_image):
+            _image = alpha_to_color(_image, alpha_color)
+            if inv:
+                _image = cv2.bitwise_not(_image)
+            if bin:
+                _image = binarize_img(_image)
+            return _image
+
+        if det and rec:
+            ocr_res = []
+            for idx, img in enumerate(imgs):
+                img = preprocess_image(img)
+                dt_boxes, rec_res, _ = self.__call__(img, cls, mfd_res=mfd_res)
+                if not dt_boxes and not rec_res:
+                    ocr_res.append(None)
+                    continue
+                tmp_res = [[box.tolist(), res]
+                           for box, res in zip(dt_boxes, rec_res)]
+                ocr_res.append(tmp_res)
+            return ocr_res
+        elif det and not rec:
+            ocr_res = []
+            for idx, img in enumerate(imgs):
+                img = preprocess_image(img)
+                dt_boxes, elapse = self.text_detector(img)
+                if not dt_boxes:
+                    ocr_res.append(None)
+                    continue
+                tmp_res = [box.tolist() for box in dt_boxes]
+                ocr_res.append(tmp_res)
+            return ocr_res
+        else:
+            ocr_res = []
+            cls_res = []
+            for idx, img in enumerate(imgs):
+                if not isinstance(img, list):
+                    img = preprocess_image(img)
+                    img = [img]
+                if self.use_angle_cls and cls:
+                    img, cls_res_tmp, elapse = self.text_classifier(img)
+                    if not rec:
+                        cls_res.append(cls_res_tmp)
+                rec_res, elapse = self.text_recognizer(img)
+                ocr_res.append(rec_res)
+            if not rec:
+                return cls_res
+            return ocr_res
+
+    def __call__(self, img, cls=True, mfd_res=None):
+        time_dict = {'det': 0, 'rec': 0, 'cls': 0, 'all': 0}
+
+        if img is None:
+            logger.debug("no valid image provided")
+            return None, None, time_dict
+
+        start = time.time()
+        ori_im = img.copy()
+        dt_boxes, elapse = self.text_detector(img)
+        time_dict['det'] = elapse
+
+        if dt_boxes is None:
+            logger.debug("no dt_boxes found, elapsed : {}".format(elapse))
+            end = time.time()
+            time_dict['all'] = end - start
+            return None, None, time_dict
+        else:
+            logger.debug("dt_boxes num : {}, elapsed : {}".format(
+                len(dt_boxes), elapse))
+        img_crop_list = []
+
+        dt_boxes = sorted_boxes(dt_boxes)
+
+        # @todo merging is currently done at the bbox level, which handles tilted text lines poorly; rework it to merge at the poly level
+        # dt_boxes = merge_det_boxes(dt_boxes)
+
+        if mfd_res:
+            bef = time.time()
+            dt_boxes = update_det_boxes(dt_boxes, mfd_res)
+            aft = time.time()
+            logger.debug("split text box by formula, new dt_boxes num : {}, elapsed : {}".format(
+                len(dt_boxes), aft - bef))
+
+        for bno in range(len(dt_boxes)):
+            tmp_box = copy.deepcopy(dt_boxes[bno])
+            if self.args.det_box_type == "quad":
+                img_crop = get_rotate_crop_image(ori_im, tmp_box)
+            else:
+                img_crop = get_minarea_rect_crop(ori_im, tmp_box)
+            img_crop_list.append(img_crop)
+        if self.use_angle_cls and cls:
+            img_crop_list, angle_list, elapse = self.text_classifier(
+                img_crop_list)
+            time_dict['cls'] = elapse
+            logger.debug("cls num  : {}, elapsed : {}".format(
+                len(img_crop_list), elapse))
+
+        rec_res, elapse = self.text_recognizer(img_crop_list)
+        time_dict['rec'] = elapse
+        logger.debug("rec_res num  : {}, elapsed : {}".format(
+            len(rec_res), elapse))
+        if self.args.save_crop_res:
+            self.draw_crop_rec_res(self.args.crop_res_save_dir, img_crop_list,
+                                   rec_res)
+        filter_boxes, filter_rec_res = [], []
+        for box, rec_result in zip(dt_boxes, rec_res):
+            text, score = rec_result
+            if score >= self.drop_score:
+                filter_boxes.append(box)
+                filter_rec_res.append(rec_result)
+        end = time.time()
+        time_dict['all'] = end - start
+        return filter_boxes, filter_rec_res, time_dict
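
A minimal sketch of the patched call (image path and formula bbox hypothetical); mfd_res here is the list-of-dicts form consumed by update_det_boxes, not the raw ultralytics object:

import cv2
from magic_pdf.model.sub_modules.ocr.paddleocr.ppocr_273_mod import ModifiedPaddleOCR

ocr = ModifiedPaddleOCR(show_log=False, det_db_box_thresh=0.3)
img = cv2.imread("page_0.png")
mfd_res = [{'bbox': [120, 300, 380, 340]}]  # formula region to cut out of text lines
for page in ocr.ocr(img, mfd_res=mfd_res):
    if page is None:
        continue
    for box, (text, score) in page:
        print(round(score, 2), text)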

+ 213 - 0
magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_291_mod.py

@@ -0,0 +1,213 @@
+import copy
+import time
+
+
+import cv2
+import numpy as np
+from paddleocr import PaddleOCR
+from paddleocr.paddleocr import check_img, logger
+from paddleocr.ppocr.utils.utility import alpha_to_color, binarize_img
+from paddleocr.tools.infer.predict_system import sorted_boxes
+from paddleocr.tools.infer.utility import slice_generator, merge_fragmented, get_rotate_crop_image, \
+    get_minarea_rect_crop
+
+from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import update_det_boxes
+
+
+class ModifiedPaddleOCR(PaddleOCR):
+
+    def ocr(
+        self,
+        img,
+        det=True,
+        rec=True,
+        cls=True,
+        bin=False,
+        inv=False,
+        alpha_color=(255, 255, 255),
+        slice={},
+        mfd_res=None,
+    ):
+        """
+        OCR with PaddleOCR
+
+        Args:
+            img: Image for OCR. It can be an ndarray, img_path, or a list of ndarrays.
+            det: Use text detection or not. If False, only text recognition will be executed. Default is True.
+            rec: Use text recognition or not. If False, only text detection will be executed. Default is True.
+            cls: Use angle classifier or not. Default is True. If True, the text with a rotation of 180 degrees can be recognized. If no text is rotated by 180 degrees, use cls=False to get better performance.
+            bin: Binarize image to black and white. Default is False.
+            inv: Invert image colors. Default is False.
+            alpha_color: Set RGB color Tuple for transparent parts replacement. Default is pure white.
+            slice: Use sliding window inference for large images. Both det and rec must be True. Requires int values for slice["horizontal_stride"], slice["vertical_stride"], slice["merge_x_thres"], slice["merge_y_thres"] (See doc/doc_en/slice_en.md). Default is {}.
+
+        Returns:
+            If both det and rec are True, returns a list of OCR results for each image. Each OCR result is a list of bounding boxes and recognized text for each detected text region.
+            If det is True and rec is False, returns a list of detected bounding boxes for each image.
+            If det is False and rec is True, returns a list of recognized text for each image.
+            If both det and rec are False, returns a list of angle classification results for each image.
+
+        Raises:
+            AssertionError: If the input image is not of type ndarray, list, str, or bytes.
+            SystemExit: If det is True and the input is a list of images.
+
+        Note:
+            - If the angle classifier is not initialized (use_angle_cls=False), it will not be used during the forward process.
+            - For PDF files, if the input is a list of images and the page_num is specified, only the first page_num images will be processed.
+            - The preprocess_image function is used to preprocess the input image by applying alpha color replacement, inversion, and binarization if specified.
+        """
+        assert isinstance(img, (np.ndarray, list, str, bytes))
+        if isinstance(img, list) and det == True:
+            logger.error("When input a list of images, det must be false")
+            exit(0)
+        if cls == True and self.use_angle_cls == False:
+            logger.warning(
+                "Since the angle classifier is not initialized, it will not be used during the forward process"
+            )
+
+        img, flag_gif, flag_pdf = check_img(img, alpha_color)
+        # for infer pdf file
+        if isinstance(img, list) and flag_pdf:
+            if self.page_num > len(img) or self.page_num == 0:
+                imgs = img
+            else:
+                imgs = img[: self.page_num]
+        else:
+            imgs = [img]
+
+        def preprocess_image(_image):
+            _image = alpha_to_color(_image, alpha_color)
+            if inv:
+                _image = cv2.bitwise_not(_image)
+            if bin:
+                _image = binarize_img(_image)
+            return _image
+
+        if det and rec:
+            ocr_res = []
+            for img in imgs:
+                img = preprocess_image(img)
+                dt_boxes, rec_res, _ = self.__call__(img, cls, slice, mfd_res=mfd_res)
+                if not dt_boxes and not rec_res:
+                    ocr_res.append(None)
+                    continue
+                tmp_res = [[box.tolist(), res] for box, res in zip(dt_boxes, rec_res)]
+                ocr_res.append(tmp_res)
+            return ocr_res
+        elif det and not rec:
+            ocr_res = []
+            for img in imgs:
+                img = preprocess_image(img)
+                dt_boxes, elapse = self.text_detector(img)
+                if dt_boxes.size == 0:
+                    ocr_res.append(None)
+                    continue
+                tmp_res = [box.tolist() for box in dt_boxes]
+                ocr_res.append(tmp_res)
+            return ocr_res
+        else:
+            ocr_res = []
+            cls_res = []
+            for img in imgs:
+                if not isinstance(img, list):
+                    img = preprocess_image(img)
+                    img = [img]
+                if self.use_angle_cls and cls:
+                    img, cls_res_tmp, elapse = self.text_classifier(img)
+                    if not rec:
+                        cls_res.append(cls_res_tmp)
+                rec_res, elapse = self.text_recognizer(img)
+                ocr_res.append(rec_res)
+            if not rec:
+                return cls_res
+            return ocr_res
+
+    def __call__(self, img, cls=True, slice={}, mfd_res=None):
+        time_dict = {"det": 0, "rec": 0, "cls": 0, "all": 0}
+
+        if img is None:
+            logger.debug("no valid image provided")
+            return None, None, time_dict
+
+        start = time.time()
+        ori_im = img.copy()
+        if slice:
+            slice_gen = slice_generator(
+                img,
+                horizontal_stride=slice["horizontal_stride"],
+                vertical_stride=slice["vertical_stride"],
+            )
+            elapsed = []
+            dt_slice_boxes = []
+            for slice_crop, v_start, h_start in slice_gen:
+                dt_boxes, elapse = self.text_detector(slice_crop, use_slice=True)
+                if dt_boxes.size:
+                    dt_boxes[:, :, 0] += h_start
+                    dt_boxes[:, :, 1] += v_start
+                    dt_slice_boxes.append(dt_boxes)
+                    elapsed.append(elapse)
+            dt_boxes = np.concatenate(dt_slice_boxes)
+
+            dt_boxes = merge_fragmented(
+                boxes=dt_boxes,
+                x_threshold=slice["merge_x_thres"],
+                y_threshold=slice["merge_y_thres"],
+            )
+            elapse = sum(elapsed)
+        else:
+            dt_boxes, elapse = self.text_detector(img)
+
+        time_dict["det"] = elapse
+
+        if dt_boxes is None:
+            logger.debug("no dt_boxes found, elapsed : {}".format(elapse))
+            end = time.time()
+            time_dict["all"] = end - start
+            return None, None, time_dict
+        else:
+            logger.debug(
+                "dt_boxes num : {}, elapsed : {}".format(len(dt_boxes), elapse)
+            )
+        img_crop_list = []
+
+        dt_boxes = sorted_boxes(dt_boxes)
+
+        if mfd_res:
+            bef = time.time()
+            dt_boxes = update_det_boxes(dt_boxes, mfd_res)
+            aft = time.time()
+            logger.debug("split text box by formula, new dt_boxes num : {}, elapsed : {}".format(
+                len(dt_boxes), aft - bef))
+
+        for bno in range(len(dt_boxes)):
+            tmp_box = copy.deepcopy(dt_boxes[bno])
+            if self.args.det_box_type == "quad":
+                img_crop = get_rotate_crop_image(ori_im, tmp_box)
+            else:
+                img_crop = get_minarea_rect_crop(ori_im, tmp_box)
+            img_crop_list.append(img_crop)
+        if self.use_angle_cls and cls:
+            img_crop_list, angle_list, elapse = self.text_classifier(img_crop_list)
+            time_dict["cls"] = elapse
+            logger.debug(
+                "cls num  : {}, elapsed : {}".format(len(img_crop_list), elapse)
+            )
+        if len(img_crop_list) > 1000:
+            logger.debug(
+                f"rec crops num: {len(img_crop_list)}, time and memory cost may be large."
+            )
+
+        rec_res, elapse = self.text_recognizer(img_crop_list)
+        time_dict["rec"] = elapse
+        logger.debug("rec_res num  : {}, elapsed : {}".format(len(rec_res), elapse))
+        if self.args.save_crop_res:
+            self.draw_crop_rec_res(self.args.crop_res_save_dir, img_crop_list, rec_res)
+        filter_boxes, filter_rec_res = [], []
+        for box, rec_result in zip(dt_boxes, rec_res):
+            text, score = rec_result[0], rec_result[1]
+            if score >= self.drop_score:
+                filter_boxes.append(box)
+                filter_rec_res.append(rec_result)
+        end = time.time()
+        time_dict["all"] = end - start
+        return filter_boxes, filter_rec_res, time_dict
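
The 2.9.1 variant additionally supports PaddleOCR's sliding-window path for very large pages; a minimal sketch (strides and merge thresholds follow PaddleOCR's slice documentation, image hypothetical):

import cv2
from magic_pdf.model.sub_modules.ocr.paddleocr.ppocr_291_mod import ModifiedPaddleOCR

ocr = ModifiedPaddleOCR(show_log=False)
result = ocr.ocr(
    cv2.imread("long_page.png"),
    slice={"horizontal_stride": 300, "vertical_stride": 500,
           "merge_x_thres": 50, "merge_y_thres": 35},
)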

+ 0 - 0
tests/assets/more_para_test_samples/zlib_files.txt → magic_pdf/model/sub_modules/reading_oreder/__init__.py


+ 0 - 0
magic_pdf/model/sub_modules/reading_oreder/layoutreader/__init__.py


+ 0 - 0
magic_pdf/model/v3/helpers.py → magic_pdf/model/sub_modules/reading_oreder/layoutreader/helpers.py


+ 242 - 0
magic_pdf/model/sub_modules/reading_oreder/layoutreader/xycut.py

@@ -0,0 +1,242 @@
+from typing import List
+import cv2
+import numpy as np
+
+
+def projection_by_bboxes(boxes: np.array, axis: int) -> np.ndarray:
+    """
+    Build a per-pixel projection histogram from a set of bboxes.
+
+    Args:
+        boxes: [N, 4]
+        axis: 0 - project x-coordinates onto the horizontal axis; 1 - project y-coordinates onto the vertical axis
+
+    Returns:
+        1D projection histogram whose length is the maximum coordinate along the projection axis (the image's true side length is not needed, since we only look for gaps between text boxes)
+
+    """
+    assert axis in [0, 1]
+    length = np.max(boxes[:, axis::2])
+    res = np.zeros(length, dtype=int)
+    # TODO: how to remove for loop?
+    for start, end in boxes[:, axis::2]:
+        res[start:end] += 1
+    return res
+
+
+# from: https://dothinking.github.io/2021-06-19-%E9%80%92%E5%BD%92%E6%8A%95%E5%BD%B1%E5%88%86%E5%89%B2%E7%AE%97%E6%B3%95/#:~:text=%E9%80%92%E5%BD%92%E6%8A%95%E5%BD%B1%E5%88%86%E5%89%B2%EF%BC%88Recursive%20XY,%EF%BC%8C%E5%8F%AF%E4%BB%A5%E5%88%92%E5%88%86%E6%AE%B5%E8%90%BD%E3%80%81%E8%A1%8C%E3%80%82
+def split_projection_profile(arr_values: np.array, min_value: float, min_gap: float):
+    """Split projection profile:
+
+    ```
+                              ┌──┐
+         arr_values           │  │       ┌─┐───
+             ┌──┐             │  │       │ │ |
+             │  │             │  │ ┌───┐ │ │min_value
+             │  │<- min_gap ->│  │ │   │ │ │ |
+         ────┴──┴─────────────┴──┴─┴───┴─┴─┴─┴───
+         0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
+    ```
+
+    Args:
+        arr_values (np.array): 1-d array representing the projection profile.
+        min_value (float): Ignore the profile if `arr_value` is less than `min_value`.
+        min_gap (float): Ignore the gap if less than this value.
+
+    Returns:
+        tuple: Start indexes and end indexes of split groups.
+    """
+    # all indexes with projection height exceeding the threshold
+    arr_index = np.where(arr_values > min_value)[0]
+    if not len(arr_index):
+        return
+
+    # find zero intervals between adjacent projections
+    # |  |                    ||
+    # ||||<- zero-interval -> |||||
+    arr_diff = arr_index[1:] - arr_index[0:-1]
+    arr_diff_index = np.where(arr_diff > min_gap)[0]
+    arr_zero_intvl_start = arr_index[arr_diff_index]
+    arr_zero_intvl_end = arr_index[arr_diff_index + 1]
+
+    # convert to index of projection range:
+    # the start index of zero interval is the end index of projection
+    arr_start = np.insert(arr_zero_intvl_end, 0, arr_index[0])
+    arr_end = np.append(arr_zero_intvl_start, arr_index[-1])
+    arr_end += 1  # end index will be excluded as index slice
+
+    return arr_start, arr_end
+
+
+def recursive_xy_cut(boxes: np.ndarray, indices: List[int], res: List[int]):
+    """
+
+    Args:
+        boxes: (N, 4)
+        indices: indices of the boxes in the original data, carried unchanged through the recursion
+        res: accumulates the output (box indices in reading order)
+
+    """
+    # project onto the y-axis
+    assert len(boxes) == len(indices)
+
+    _indices = boxes[:, 1].argsort()
+    y_sorted_boxes = boxes[_indices]
+    y_sorted_indices = indices[_indices]
+
+    # debug_vis(y_sorted_boxes, y_sorted_indices)
+
+    y_projection = projection_by_bboxes(boxes=y_sorted_boxes, axis=1)
+    pos_y = split_projection_profile(y_projection, 0, 1)
+    if not pos_y:
+        return
+
+    arr_y0, arr_y1 = pos_y
+    for r0, r1 in zip(arr_y0, arr_y1):
+        # [r0, r1] is a horizontal band that contains boxes after the y-cut; each band is then cut along x
+        _indices = (r0 <= y_sorted_boxes[:, 1]) & (y_sorted_boxes[:, 1] < r1)
+
+        y_sorted_boxes_chunk = y_sorted_boxes[_indices]
+        y_sorted_indices_chunk = y_sorted_indices[_indices]
+
+        _indices = y_sorted_boxes_chunk[:, 0].argsort()
+        x_sorted_boxes_chunk = y_sorted_boxes_chunk[_indices]
+        x_sorted_indices_chunk = y_sorted_indices_chunk[_indices]
+
+        # project onto the x-axis
+        x_projection = projection_by_bboxes(boxes=x_sorted_boxes_chunk, axis=0)
+        pos_x = split_projection_profile(x_projection, 0, 1)
+        if not pos_x:
+            continue
+
+        arr_x0, arr_x1 = pos_x
+        if len(arr_x0) == 1:
+            # cannot split further along x
+            res.extend(x_sorted_indices_chunk)
+            continue
+
+        # splittable along x, so recurse into each chunk
+        for c0, c1 in zip(arr_x0, arr_x1):
+            _indices = (c0 <= x_sorted_boxes_chunk[:, 0]) & (
+                x_sorted_boxes_chunk[:, 0] < c1
+            )
+            recursive_xy_cut(
+                x_sorted_boxes_chunk[_indices], x_sorted_indices_chunk[_indices], res
+            )
+
+
+def points_to_bbox(points):
+    assert len(points) == 8
+
+    # [x1,y1,x2,y2,x3,y3,x4,y4]
+    left = min(points[::2])
+    right = max(points[::2])
+    top = min(points[1::2])
+    bottom = max(points[1::2])
+
+    left = max(left, 0)
+    top = max(top, 0)
+    right = max(right, 0)
+    bottom = max(bottom, 0)
+    return [left, top, right, bottom]
+
+
+def bbox2points(bbox):
+    left, top, right, bottom = bbox
+    return [left, top, right, top, right, bottom, left, bottom]
+
+
+def vis_polygon(img, points, thickness=2, color=None):
+    br2bl_color = color
+    tl2tr_color = color
+    tr2br_color = color
+    bl2tl_color = color
+    cv2.line(
+        img,
+        (points[0][0], points[0][1]),
+        (points[1][0], points[1][1]),
+        color=tl2tr_color,
+        thickness=thickness,
+    )
+
+    cv2.line(
+        img,
+        (points[1][0], points[1][1]),
+        (points[2][0], points[2][1]),
+        color=tr2br_color,
+        thickness=thickness,
+    )
+
+    cv2.line(
+        img,
+        (points[2][0], points[2][1]),
+        (points[3][0], points[3][1]),
+        color=br2bl_color,
+        thickness=thickness,
+    )
+
+    cv2.line(
+        img,
+        (points[3][0], points[3][1]),
+        (points[0][0], points[0][1]),
+        color=bl2tl_color,
+        thickness=thickness,
+    )
+    return img
+
+
+def vis_points(
+    img: np.ndarray, points, texts: List[str] = None, color=(0, 200, 0)
+) -> np.ndarray:
+    """
+
+    Args:
+        img:
+        points: [N, 8]  8: x1,y1,x2,y2,x3,y3,x4,y4
+        texts:
+        color:
+
+    Returns:
+
+    """
+    points = np.array(points)
+    if texts is not None:
+        assert len(texts) == points.shape[0]
+
+    for i, _points in enumerate(points):
+        vis_polygon(img, _points.reshape(-1, 2), thickness=2, color=color)
+        bbox = points_to_bbox(_points)
+        left, top, right, bottom = bbox
+        cx = (left + right) // 2
+        cy = (top + bottom) // 2
+
+        txt = texts[i]
+        font = cv2.FONT_HERSHEY_SIMPLEX
+        cat_size = cv2.getTextSize(txt, font, 0.5, 2)[0]
+
+        img = cv2.rectangle(
+            img,
+            (cx - 5 * len(txt), cy - cat_size[1] - 5),
+            (cx - 5 * len(txt) + cat_size[0], cy - 5),
+            color,
+            -1,
+        )
+
+        img = cv2.putText(
+            img,
+            txt,
+            (cx - 5 * len(txt), cy - 5),
+            font,
+            0.5,
+            (255, 255, 255),
+            thickness=1,
+            lineType=cv2.LINE_AA,
+        )
+
+    return img
+
+
+def vis_polygons_with_index(image, points):
+    texts = [str(i) for i in range(len(points))]
+    res_img = vis_points(image.copy(), points, texts)
+    return res_img
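
A worked example of recursive_xy_cut recovering reading order for a two-column page with a footer (boxes hypothetical, in [x0, y0, x1, y1] pixels):

import numpy as np
from magic_pdf.model.sub_modules.reading_oreder.layoutreader.xycut import recursive_xy_cut

boxes = np.asarray([
    [300, 50, 560, 400],   # right column
    [20, 50, 280, 400],    # left column
    [20, 420, 560, 500],   # full-width footer
])
order = []
recursive_xy_cut(boxes, np.arange(len(boxes)), order)
print([int(i) for i in order])  # -> [1, 0, 2]: left column, right column, footer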

+ 0 - 0
magic_pdf/model/sub_modules/table/__init__.py


+ 0 - 0
magic_pdf/model/sub_modules/table/rapidtable/__init__.py


+ 14 - 0
magic_pdf/model/sub_modules/table/rapidtable/rapid_table.py

@@ -0,0 +1,14 @@
+import numpy as np
+from rapid_table import RapidTable
+from rapidocr_paddle import RapidOCR
+
+
+class RapidTableModel(object):
+    def __init__(self):
+        self.table_model = RapidTable()
+        self.ocr_engine = RapidOCR(det_use_cuda=True, cls_use_cuda=True, rec_use_cuda=True)
+
+    def predict(self, image):
+        ocr_result, _ = self.ocr_engine(np.asarray(image))
+        html_code, table_cell_bboxes, elapse = self.table_model(np.asarray(image), ocr_result)
+        return html_code, table_cell_bboxes, elapse
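
A minimal sketch (image path hypothetical); requires the rapid_table and rapidocr_paddle packages plus CUDA for the det/cls/rec flags set in __init__:

from PIL import Image
from magic_pdf.model.sub_modules.table.rapidtable.rapid_table import RapidTableModel

table_model = RapidTableModel()
html_code, table_cell_bboxes, elapse = table_model.predict(Image.open("table.png"))
print(elapse)
print(html_code[:80])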

+ 0 - 0
magic_pdf/model/sub_modules/table/structeqtable/__init__.py


+ 3 - 11
magic_pdf/model/pek_sub_modules/structeqtable/StructTableModel.py → magic_pdf/model/sub_modules/table/structeqtable/struct_eqtable.py

@@ -1,8 +1,8 @@
-import re
-
 import torch
 from struct_eqtable import build_model
 
+from magic_pdf.model.sub_modules.table.table_utils import minify_html
+
 
 class StructTableModel:
     def __init__(self, model_path, max_new_tokens=1024, max_time=60):
@@ -31,15 +31,7 @@ class StructTableModel:
         )
 
         if output_format == "html":
-            results = [self.minify_html(html) for html in results]
+            results = [minify_html(html) for html in results]
 
         return results
 
-    def minify_html(self, html):
-        # Collapse runs of whitespace into a single space
-        html = re.sub(r'\s+', ' ', html)
-        # Strip whitespace around closing angle brackets
-        html = re.sub(r'\s*>\s*', '>', html)
-        # Strip whitespace around opening angle brackets
-        html = re.sub(r'\s*<\s*', '<', html)
-        return html.strip()

+ 11 - 0
magic_pdf/model/sub_modules/table/table_utils.py

@@ -0,0 +1,11 @@
+import re
+
+
+def minify_html(html):
+    # Collapse runs of whitespace into a single space
+    html = re.sub(r'\s+', ' ', html)
+    # Strip whitespace around closing angle brackets
+    html = re.sub(r'\s*>\s*', '>', html)
+    # Strip whitespace around opening angle brackets
+    html = re.sub(r'\s*<\s*', '<', html)
+    return html.strip()
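A quick illustration of what ``minify_html`` does to model-generated table HTML (the input string is made up):

.. code:: python

    html = '''
    <table>
        <tr> <td> 1 </td> </tr>
    </table>
    '''
    print(minify_html(html))
    # -> <table><tr><td>1</td></tr></table>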

+ 0 - 0
magic_pdf/model/sub_modules/table/tablemaster/__init__.py


+ 1 - 1
magic_pdf/model/ppTableModel.py → magic_pdf/model/sub_modules/table/tablemaster/tablemaster_paddle.py

@@ -7,7 +7,7 @@ from PIL import Image
 import numpy as np
 
 
-class ppTableModel(object):
+class TableMasterPaddleModel(object):
     """
         This class is responsible for converting image of table into HTML format using a pre-trained model.
 

+ 13 - 15
magic_pdf/para/para_split_v3.py

@@ -77,14 +77,12 @@ def __is_list_or_index_block(block):
 
         # First line not flush left but flush right, last line flush left but not flush right (the first line may be allowed not to reach the right edge)
         if (first_line['bbox'][0] - block['bbox_fs'][0] > line_height / 2 and
-                # block['bbox_fs'][2] - first_line['bbox'][2] < line_height and
                 abs(last_line['bbox'][0] - block['bbox_fs'][0]) < line_height / 2 and
                 block['bbox_fs'][2] - last_line['bbox'][2] > line_height
         ):
             multiple_para_flag = True
 
         for line in block['lines']:
-
             line_mid_x = (line['bbox'][0] + line['bbox'][2]) / 2
             block_mid_x = (block['bbox_fs'][0] + block['bbox_fs'][2]) / 2
             if (
@@ -102,13 +100,13 @@ def __is_list_or_index_block(block):
                 if span_type == ContentType.Text:
                     line_text += span['content'].strip()
 
+            # Append every line's text, including empty lines, so the list stays aligned with block['lines']
             lines_text_list.append(line_text)
 
             # Count lines flush with the left edge; a line counts as flush when abs(block['bbox_fs'][0] - line['bbox'][0]) < line_height/2
             if abs(block['bbox_fs'][0] - line['bbox'][0]) < line_height / 2:
                 left_close_num += 1
             elif line['bbox'][0] - block['bbox_fs'][0] > line_height:
-                # logger.info(f"{line_text}, {block['bbox_fs']}, {line['bbox']}")
                 left_not_close_num += 1
 
             # Check whether the line is flush with the right edge
@@ -117,7 +115,6 @@ def __is_list_or_index_block(block):
             else:
                 # When the line is not flush right, check whether the gap is large; an eyeballed threshold of ~0.3 of the block width is used
                 closed_area = 0.26 * block_weight
-                # closed_area = 5 * line_height
                 if block['bbox_fs'][2] - line['bbox'][2] > closed_area:
                     right_not_close_num += 1
 
@@ -128,6 +125,7 @@ def __is_list_or_index_block(block):
         num_start_count = 0
         num_end_count = 0
         flag_end_count = 0
+
         if len(lines_text_list) > 0:
             for line_text in lines_text_list:
                 if len(line_text) > 0:
@@ -138,11 +136,10 @@ def __is_list_or_index_block(block):
                     if line_text[-1].isdigit():
                         num_end_count += 1
 
-            if flag_end_count / len(lines_text_list) >= 0.8:
-                line_end_flag = True
-
             if num_start_count / len(lines_text_list) >= 0.8 or num_end_count / len(lines_text_list) >= 0.8:
                 line_num_flag = True
+            if flag_end_count / len(lines_text_list) >= 0.8:
+                line_end_flag = True
 
         # Some tables of contents are not flush on the right; for now a block counts as an index when either the left or the right side is fully flush and the digit rule holds
         if ((left_close_num / len(block['lines']) >= 0.8 or right_close_num / len(block['lines']) >= 0.8)
@@ -176,7 +173,7 @@ def __is_list_or_index_block(block):
                 # Most line items carry an end marker in this case; split items by the end marker
                 elif line_end_flag:
                     for i, line in enumerate(block['lines']):
-                        if lines_text_list[i][-1] in LIST_END_FLAG:
+                        if len(lines_text_list[i]) > 0 and lines_text_list[i][-1] in LIST_END_FLAG:
                             line[ListLineTag.IS_LIST_END_LINE] = True
                             if i + 1 < len(block['lines']):
                                 block['lines'][i + 1][ListLineTag.IS_LIST_START_LINE] = True
@@ -187,17 +184,18 @@ def __is_list_or_index_block(block):
                         if line_start_flag:
                             line[ListLineTag.IS_LIST_START_LINE] = True
                             line_start_flag = False
-                        # elif abs(block['bbox_fs'][2] - line['bbox'][2]) > line_height:
+
                         if abs(block['bbox_fs'][2] - line['bbox'][2]) > 0.1 * block_weight:
                             line[ListLineTag.IS_LIST_END_LINE] = True
                             line_start_flag = True
-            # A special indented ordered list: start lines are not flush left and begin with a digit, end lines end with an IS_LIST_END_LINE marker and their count matches the start lines
-            elif num_start_count >= 2 and num_start_count == flag_end_count:  # keep it simple for now and ignore the not-flush-left case
+            # A special indented ordered list: start lines are not flush left and begin with a digit, end lines end with an IS_LIST_END_FLAG marker and their count matches the start lines
+            elif num_start_count >= 2 and num_start_count == flag_end_count:
                 for i, line in enumerate(block['lines']):
-                    if lines_text_list[i][0].isdigit():
-                        line[ListLineTag.IS_LIST_START_LINE] = True
-                    if lines_text_list[i][-1] in LIST_END_FLAG:
-                        line[ListLineTag.IS_LIST_END_LINE] = True
+                    if len(lines_text_list[i]) > 0:
+                        if lines_text_list[i][0].isdigit():
+                            line[ListLineTag.IS_LIST_START_LINE] = True
+                        if lines_text_list[i][-1] in LIST_END_FLAG:
+                            line[ListLineTag.IS_LIST_END_LINE] = True
             else:
                 # Handle the normal indented list
                 for line in block['lines']:

+ 56 - 19
magic_pdf/pdf_parse_union_core_v2.py

@@ -30,8 +30,8 @@ from magic_pdf.pre_proc.equations_replace import (
 from magic_pdf.pre_proc.ocr_detect_all_bboxes import \
     ocr_prepare_bboxes_for_layout_split_v2
 from magic_pdf.pre_proc.ocr_dict_merge import (fill_spans_in_blocks,
-                                               fix_block_spans,
-                                               fix_discarded_block, fix_block_spans_v2)
+                                               fix_discarded_block,
+                                               fix_block_spans_v2)
 from magic_pdf.pre_proc.ocr_span_list_modify import (
     get_qa_need_list_v2, remove_overlaps_low_confidence_spans,
     remove_overlaps_min_spans)
@@ -164,8 +164,8 @@ class ModelSingleton:
 
 
 def do_predict(boxes: List[List[int]], model) -> List[int]:
-    from magic_pdf.model.v3.helpers import (boxes2inputs, parse_logits,
-                                            prepare_inputs)
+    from magic_pdf.model.sub_modules.reading_oreder.layoutreader.helpers import (boxes2inputs, parse_logits,
+                                                                                 prepare_inputs)
 
     inputs = boxes2inputs(boxes)
     inputs = prepare_inputs(inputs, model)
@@ -174,23 +174,57 @@ def do_predict(boxes: List[List[int]], model) -> List[int]:
 
 
 def cal_block_index(fix_blocks, sorted_bboxes):
-    for block in fix_blocks:
 
-        line_index_list = []
-        if len(block['lines']) == 0:
-            block['index'] = sorted_bboxes.index(block['bbox'])
-        else:
+    if sorted_bboxes is not None:
+        # Sort with layoutreader
+        for block in fix_blocks:
+            line_index_list = []
+            if len(block['lines']) == 0:
+                block['index'] = sorted_bboxes.index(block['bbox'])
+            else:
+                for line in block['lines']:
+                    line['index'] = sorted_bboxes.index(line['bbox'])
+                    line_index_list.append(line['index'])
+                median_value = statistics.median(line_index_list)
+                block['index'] = median_value
+
+            # Drop the virtual line info from image/table body blocks and backfill it from real_lines
+            if block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
+                block['virtual_lines'] = copy.deepcopy(block['lines'])
+                block['lines'] = copy.deepcopy(block['real_lines'])
+                del block['real_lines']
+    else:
+        # Sort with xycut
+        block_bboxes = []
+        for block in fix_blocks:
+            block_bboxes.append(block['bbox'])
+
+            # Drop the virtual line info from image/table body blocks and backfill it from real_lines
+            if block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
+                block['virtual_lines'] = copy.deepcopy(block['lines'])
+                block['lines'] = copy.deepcopy(block['real_lines'])
+                del block['real_lines']
+
+        import numpy as np
+        from magic_pdf.model.sub_modules.reading_oreder.layoutreader.xycut import recursive_xy_cut
+
+        random_boxes = np.array(block_bboxes)
+        np.random.shuffle(random_boxes)
+        res = []
+        recursive_xy_cut(np.asarray(random_boxes).astype(int), np.arange(len(block_bboxes)), res)
+        assert len(res) == len(block_bboxes)
+        sorted_boxes = random_boxes[np.array(res)].tolist()
+
+        for i, block in enumerate(fix_blocks):
+            block['index'] = sorted_boxes.index(block['bbox'])
+
+        # Generate line indices
+        sorted_blocks = sorted(fix_blocks, key=lambda b: b['index'])
+        line_index = 1
+        for block in sorted_blocks:
             for line in block['lines']:
-                line['index'] = sorted_bboxes.index(line['bbox'])
-                line_index_list.append(line['index'])
-            median_value = statistics.median(line_index_list)
-            block['index'] = median_value
-
-        # Drop the virtual line info from image/table body blocks and backfill it from real_lines
-        if block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
-            block['virtual_lines'] = copy.deepcopy(block['lines'])
-            block['lines'] = copy.deepcopy(block['real_lines'])
-            del block['real_lines']
+                line['index'] = line_index
+                line_index += 1
 
     return fix_blocks
 
@@ -264,6 +298,9 @@ def sort_lines_by_model(fix_blocks, page_w, page_h, line_height):
                 block['lines'].append({'bbox': line, 'spans': []})
             page_line_list.extend(lines)
 
+    if len(page_line_list) > 200:  # layoutreader supports at most 512 lines
+        return None
+
     # Sort with layoutreader
     x_scale = 1000.0 / page_w
     y_scale = 1000.0 / page_h
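The xycut fallback used by ``cal_block_index`` can be exercised on its own. A minimal sketch with hypothetical block bboxes, deliberately shuffled to mirror the code above (the module path keeps the repository's own spelling):

.. code:: python

    import numpy as np
    from magic_pdf.model.sub_modules.reading_oreder.layoutreader.xycut import recursive_xy_cut

    # Four block bboxes [x0, y0, x1, y1] in shuffled order (made-up coordinates)
    boxes = np.array([[250, 300, 400, 350],
                      [50, 100, 200, 150],
                      [250, 100, 400, 150],
                      [50, 300, 200, 350]])
    res = []
    recursive_xy_cut(boxes.astype(int), np.arange(len(boxes)), res)
    reading_order = boxes[np.array(res)].tolist()  # expected: top row left-to-right, then bottom row
    print(reading_order)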

+ 2 - 1
magic_pdf/resources/model_config/model_configs.yaml

@@ -4,4 +4,5 @@ weights:
   yolo_v8_mfd: MFD/YOLO/yolo_v8_ft.pt
   unimernet_small: MFR/unimernet_small
   struct_eqtable: TabRec/StructEqTable
-  tablemaster: TabRec/TableMaster
+  tablemaster: TabRec/TableMaster
+  rapid_table: TabRec/RapidTable
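Each ``weights`` entry resolves to a path under the configured models directory. A minimal sketch of how the new ``rapid_table`` entry maps to a local path; the ``/tmp/models`` value mirrors the ``models-dir`` example from the FAQ, and this loading code is an illustrative assumption rather than the project's own loader:

.. code:: python

    import os
    import yaml

    with open('magic_pdf/resources/model_config/model_configs.yaml') as f:
        configs = yaml.safe_load(f)
    rapid_table_dir = os.path.join('/tmp/models', configs['weights']['rapid_table'])
    print(rapid_table_dir)  # /tmp/models/TabRec/RapidTable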

+ 47 - 3
magic_pdf/tools/common.py

@@ -14,6 +14,9 @@ from magic_pdf.pipe.TXTPipe import TXTPipe
 from magic_pdf.pipe.UNIPipe import UNIPipe
 from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
 from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
+import fitz
+# from io import BytesIO
+# from pypdf import PdfReader, PdfWriter
 
 
 def prepare_env(output_dir, pdf_file_name, method):
@@ -26,6 +29,42 @@ def prepare_env(output_dir, pdf_file_name, method):
     return local_image_dir, local_md_dir
 
 
+# def convert_pdf_bytes_to_bytes_by_pypdf(pdf_bytes, start_page_id=0, end_page_id=None):
+#     # Wrap the byte data in a BytesIO object
+#     pdf_file = BytesIO(pdf_bytes)
+#     # Read the PDF from the bytes
+#     reader = PdfReader(pdf_file)
+#     # Create a new PDF writer
+#     writer = PdfWriter()
+#     # Add the pages in the requested range to the new PDF writer
+#     end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else len(reader.pages) - 1
+#     if end_page_id > len(reader.pages) - 1:
+#         logger.warning("end_page_id is out of range, use pdf_docs length")
+#         end_page_id = len(reader.pages) - 1
+#     for i, page in enumerate(reader.pages):
+#         if start_page_id <= i <= end_page_id:
+#             writer.add_page(page)
+#     # Create a byte buffer to hold the output PDF
+#     output_buffer = BytesIO()
+#     # Write the PDF into the byte buffer
+#     writer.write(output_buffer)
+#     # Get the buffer's contents
+#     converted_pdf_bytes = output_buffer.getvalue()
+#     return converted_pdf_bytes
+
+
+def convert_pdf_bytes_to_bytes_by_pymupdf(pdf_bytes, start_page_id=0, end_page_id=None):
+    document = fitz.open("pdf", pdf_bytes)
+    output_document = fitz.open()
+    end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else len(document) - 1
+    if end_page_id > len(document) - 1:
+        logger.warning("end_page_id is out of range, use pdf_docs length")
+        end_page_id = len(document) - 1
+    output_document.insert_pdf(document, from_page=start_page_id, to_page=end_page_id)
+    output_bytes = output_document.tobytes()
+    return output_bytes
+
+
 def do_parse(
     output_dir,
     pdf_file_name,
@@ -55,6 +94,8 @@ def do_parse(
         f_draw_model_bbox = True
         f_draw_line_sort_bbox = True
 
+    pdf_bytes = convert_pdf_bytes_to_bytes_by_pymupdf(pdf_bytes, start_page_id, end_page_id)
+
     orig_model_list = copy.deepcopy(model_list)
     local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name,
                                                 parse_method)
@@ -66,15 +107,18 @@ def do_parse(
     if parse_method == 'auto':
         jso_useful_key = {'_pdf_type': '', 'model_list': model_list}
         pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer, is_debug=True,
-                       start_page_id=start_page_id, end_page_id=end_page_id, lang=lang,
+                       # start_page_id=start_page_id, end_page_id=end_page_id,
+                       lang=lang,
                        layout_model=layout_model, formula_enable=formula_enable, table_enable=table_enable)
     elif parse_method == 'txt':
         pipe = TXTPipe(pdf_bytes, model_list, image_writer, is_debug=True,
-                       start_page_id=start_page_id, end_page_id=end_page_id, lang=lang,
+                       # start_page_id=start_page_id, end_page_id=end_page_id,
+                       lang=lang,
                        layout_model=layout_model, formula_enable=formula_enable, table_enable=table_enable)
     elif parse_method == 'ocr':
         pipe = OCRPipe(pdf_bytes, model_list, image_writer, is_debug=True,
-                       start_page_id=start_page_id, end_page_id=end_page_id, lang=lang,
+                       # start_page_id=start_page_id, end_page_id=end_page_id,
+                       lang=lang,
                        layout_model=layout_model, formula_enable=formula_enable, table_enable=table_enable)
     else:
         logger.error('unknown parse method')
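Since the pipes no longer receive ``start_page_id``/``end_page_id``, page selection now happens up front through the helper added above. A minimal sketch with hypothetical file names:

.. code:: python

    # Keep only pages 0-4 of the input PDF before handing it to a pipe
    with open('input.pdf', 'rb') as f:
        pdf_bytes = f.read()
    clipped = convert_pdf_bytes_to_bytes_by_pymupdf(pdf_bytes, start_page_id=0, end_page_id=4)
    with open('clipped.pdf', 'wb') as f:
        f.write(clipped)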

File diff suppressed because it is too large
+ 16 - 0
next_docs/README.md


File diff suppressed because it is too large
+ 16 - 0
next_docs/README_zh-CN.md


File diff suppressed because it is too large
+ 13 - 0
next_docs/en/_static/image/ReadTheDocs.svg


+ 0 - 26
next_docs/en/additional_notes/changelog.rst

@@ -1,26 +0,0 @@
-
-
-Changelog
-=========
-
--  2024/09/27 Version 0.8.1 released, Fixed some bugs, and providing a
-   `localized deployment version <projects/web_demo/README.md>`__ of the
-   `online
-   demo <https://opendatalab.com/OpenSourceTools/Extractor/PDF/>`__ and
-   the `front-end interface <projects/web/README.md>`__.
--  2024/09/09: Version 0.8.0 released, supporting fast deployment with
-   Dockerfile, and launching demos on Huggingface and Modelscope.
--  2024/08/30: Version 0.7.1 released, add paddle tablemaster table
-   recognition option
--  2024/08/09: Version 0.7.0b1 released, simplified installation
-   process, added table recognition functionality
--  2024/08/01: Version 0.6.2b1 released, optimized dependency conflict
-   issues and installation documentation
--  2024/07/05: Initial open-source release
-
-
-.. warning::
-
-   fix ``localized deployment version`` and ``front-end interface``
-
-

+ 12 - 0
next_docs/en/additional_notes/faq.rst

@@ -74,3 +74,15 @@ CUDA version used by Paddle needs to be upgraded.
    pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
 
 Reference: https://github.com/opendatalab/MinerU/issues/558
+
+
+7. On some Linux servers, the program immediately reports an error ``Illegal instruction (core dumped)``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This might be because the server's CPU does not support the AVX/AVX2
+instruction set, or because the CPU supports it but the instruction set
+has been disabled by the system administrator. You can try asking the
+administrator to lift the restriction, or switch to a different server.
+
+References: https://github.com/opendatalab/MinerU/issues/591 ,
+https://github.com/opendatalab/MinerU/issues/736
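To check up front whether a Linux server's CPU advertises AVX/AVX2, the flags in ``/proc/cpuinfo`` can be inspected; a minimal sketch, not part of MinerU itself:

.. code:: python

    import re

    with open('/proc/cpuinfo') as f:
        cpuinfo = f.read()
    print('AVX: ', bool(re.search(r'\bavx\b', cpuinfo)))
    print('AVX2:', bool(re.search(r'\bavx2\b', cpuinfo)))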

+ 15 - 14
next_docs/en/additional_notes/known_issues.rst

@@ -1,19 +1,20 @@
 Known Issues
 ============
 
--  Reading order is based on the model’s sorting of text distribution in
-   space, which may become disordered under extremely complex layouts.
+-  Reading order is determined by the model based on the spatial
+   distribution of readable content, and may be out of order in some
+   areas under extremely complex layouts.
 -  Vertical text is not supported.
--  Tables of contents and lists are recognized through rules; a few
-   uncommon list formats may not be identified.
--  Only one level of headings is supported; hierarchical heading levels
-   are currently not supported.
+-  Tables of contents and lists are recognized through rules, and some
+   uncommon list formats may not be recognized.
+-  Only one level of headings is supported; hierarchical headings are
+   not currently supported.
 -  Code blocks are not yet supported in the layout model.
--  Comic books, art books, elementary school textbooks, and exercise
-   books are not well-parsed yet
--  Enabling OCR may produce better results in PDFs with a high density
-   of formulas
--  If you are processing PDFs with a large number of formulas, it is
-   strongly recommended to enable the OCR function. When using PyMuPDF
-   to extract text, overlapping text lines can occur, leading to
-   inaccurate formula insertion positions.
+-  Comic books, art albums, primary school textbooks, and exercises
+   cannot be parsed well.
+-  Table recognition may result in row/column recognition errors in
+   complex tables.
+-  OCR recognition may produce inaccurate characters in PDFs of
+   lesser-known languages (e.g., diacritical marks in Latin script,
+   easily confused characters in Arabic script).
+-  Some formulas may not render correctly in Markdown.

+ 0 - 1
next_docs/en/api.rst

@@ -7,4 +7,3 @@
    api/read_api
    api/schemas
    api/io
-   api/classes

+ 0 - 14
next_docs/en/api/classes.rst

@@ -1,14 +0,0 @@
-Class Hierarchy
-===============
-
-.. inheritance-diagram:: magic_pdf.data.io.base magic_pdf.data.io.http magic_pdf.data.io.s3
-   :parts: 2
-
-
-.. inheritance-diagram:: magic_pdf.data.dataset
-   :parts: 2
-
-
-.. inheritance-diagram:: magic_pdf.data.data_reader_writer.base magic_pdf.data.data_reader_writer.filebase magic_pdf.data.data_reader_writer.multi_bucket_s3
-   :parts: 2
-

+ 0 - 1
next_docs/en/api/utils.rst

@@ -1 +0,0 @@
-

+ 1 - 1
next_docs/en/conf.py

@@ -95,7 +95,7 @@ language = 'en'
 html_theme = 'sphinx_book_theme'
 html_logo = '_static/image/logo.png'
 html_theme_options = {
-    'path_to_docs': 'docs/en',
+    'path_to_docs': 'next_docs/en',
     'repository_url': 'https://github.com/opendatalab/MinerU',
     'use_repository_button': True,
 }

+ 23 - 22
next_docs/en/index.rst

@@ -46,20 +46,29 @@ the relevant PDF**.
 Key Features
 ------------
 
--  Removes elements such as headers, footers, footnotes, and page
-   numbers while maintaining semantic continuity
--  Outputs text in a human-readable order from multi-column documents
--  Retains the original structure of the document, including titles,
-   paragraphs, and lists
--  Extracts images, image captions, tables, and table captions
--  Automatically recognizes formulas in the document and converts them
-   to LaTeX
--  Automatically recognizes tables in the document and converts them to
-   LaTeX
--  Automatically detects and enables OCR for corrupted PDFs
--  Supports both CPU and GPU environments
--  Supports Windows, Linux, and Mac platforms
-
+-  Remove headers, footers, footnotes, page numbers, etc., to ensure
+   semantic coherence.
+-  Output text in human-readable order, suitable for single-column,
+   multi-column, and complex layouts.
+-  Preserve the structure of the original document, including headings,
+   paragraphs, lists, etc.
+-  Extract images, image descriptions, tables, table titles, and
+   footnotes.
+-  Automatically recognize and convert formulas in the document to LaTeX
+   format.
+-  Automatically recognize and convert tables in the document to LaTeX
+   or HTML format.
+-  Automatically detect scanned PDFs and garbled PDFs and enable OCR
+   functionality.
+-  OCR supports detection and recognition of 84 languages.
+-  Supports multiple output formats, such as multimodal and NLP
+   Markdown, JSON sorted by reading order, and rich intermediate
+   formats.
+-  Supports various visualization results, including layout
+   visualization and span visualization, for efficient confirmation of
+   output quality.
+-  Supports both CPU and GPU environments.
+-  Compatible with Windows, Linux, and Mac platforms.
 
 User Guide
 -------------
@@ -91,14 +100,6 @@ Additional Notes
 
    additional_notes/known_issues
    additional_notes/faq
-   additional_notes/changelog
    additional_notes/glossary
 
 
-Projects 
----------
-.. toctree::
-   :maxdepth: 1
-   :caption: Projects
-
-   projects

+ 0 - 13
next_docs/en/projects.rst

@@ -1,13 +0,0 @@
-
-
-
-llama_index_rag 
-===============
-
-
-gradio_app
-============
-
-
-other projects
-===============

+ 5 - 1
next_docs/en/user_guide/data/data_reader_writer.rst

@@ -87,6 +87,8 @@ Read Examples
 
 .. code:: python
 
+    from magic_pdf.data.data_reader_writer import *
+
     # file based related 
     file_based_reader1 = FileBasedDataReader('')
 
@@ -142,6 +144,8 @@ Write Examples
 
 .. code:: python
 
+    from magic_pdf.data.data_reader_writer import *
+
     # file based related 
     file_based_writer1 = FileBasedDataWriter('')
 
@@ -201,4 +205,4 @@ Write Examples
     s3_writer1.write('s3://test_bucket/efg', '123'.encode())
 
 
-Check :doc:`../../api/classes` for more intuitions or check :doc:`../../api/data_reader_writer` for more details
+Check :doc:`../../api/data_reader_writer` for more details

+ 1 - 1
next_docs/en/user_guide/data/dataset.rst

@@ -36,5 +36,5 @@ Extract chars via third-party library, currently we use ``pymupdf``.
 
 
 
-Check :doc:`../../api/classes` for more intuitions or check :doc:`../../api/dataset` for more details
+Check :doc:`../../api/dataset` for more details
 

+ 1 - 1
next_docs/en/user_guide/data/io.rst

@@ -21,5 +21,5 @@ if MinerU has not provided suitable classes. It is easy to implement new cla
         def write(self, path: str, data: bytes) -> None:
             pass
 
-Check :doc:`../../api/classes` for more intuitions or check :doc:`../../api/io` for more details
+Check :doc:`../../api/io` for more details
 

+ 6 - 1
next_docs/en/user_guide/data/read_api.rst

@@ -18,6 +18,8 @@ Read the content from jsonl which may be located on local machine or remote s3. if y
 
 .. code:: python
 
+    from magic_pdf.data.io.read_api import *
+
     # read jsonl from local machine 
     datasets = read_jsonl("tt.jsonl", None)
 
@@ -33,6 +35,8 @@ Read pdf from path or directory.
 
 .. code:: python
 
+    from magic_pdf.data.io.read_api import *
+
     # read pdf path
     datasets = read_local_pdfs("tt.pdf")
 
@@ -47,10 +51,11 @@ Read images from path or directory
 
 .. code:: python 
 
+    from magic_pdf.data.io.read_api import *
+
     # read from image path 
     datasets = read_local_images("tt.png")
 
-
     # read files from directory that endswith suffix in suffixes array 
     datasets = read_local_images("images/", suffixes=["png", "jpg"])
 

+ 45 - 41
next_docs/en/user_guide/install/boost_with_cuda.rst

@@ -9,16 +9,18 @@ appropriate guide based on your system:
 
 -  :ref:`ubuntu_22_04_lts_section`
 -  :ref:`windows_10_or_11_section`
+-  Quick Deployment with Docker
 
--  Quick Deployment with Docker > Docker requires a GPU with at least
-   16GB of VRAM, and all acceleration features are enabled by default.
+.. admonition:: Important
+   :class: tip
 
-.. note:: 
+   Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
 
-   Before running this Docker, you can use the following command to
-   check if your device supports CUDA acceleration on Docker. 
+   Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker. 
 
-   bash  docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
+   .. code-block:: bash
+
+      docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
 
 .. code:: sh
 
@@ -42,8 +44,9 @@ Ubuntu 22.04 LTS
 If you see information similar to the following, it means that the
 NVIDIA drivers are already installed, and you can skip Step 2.
 
-Notice:``CUDA Version`` should be >= 12.1, If the displayed version
-number is less than 12.1, please upgrade the driver.
+.. note::
+
+   ``CUDA Version`` should be >= 12.1. If the displayed version number is less than 12.1, please upgrade the driver.
 
 .. code:: text
 
@@ -105,8 +108,10 @@ Specify Python version 3.10.
 
    pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
 
-❗ After installation, make sure to check the version of ``magic-pdf``
-using the following command:
+.. admonition:: Important
+    :class: tip
+
+    ❗ After installation, make sure to check the version of ``magic-pdf`` using the following command:
 
 .. code:: sh
 
@@ -127,7 +132,10 @@ the script will automatically generate a ``magic-pdf.json`` file in the
 user directory and configure the default model path. You can find the
 ``magic-pdf.json`` file in your user directory.
 
-   The user directory for Linux is “/home/username”.
+.. admonition:: TIP
+    :class: tip
+
+    The user directory for Linux is “/home/username”.
 
 8. First Run
 ~~~~~~~~~~~~
@@ -137,7 +145,7 @@ Download a sample file from the repository and test it.
 .. code:: sh
 
    wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf
-   magic-pdf -p small_ocr.pdf
+   magic-pdf -p small_ocr.pdf -o ./output
 
 9. Test CUDA Acceleration
 ~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -145,10 +153,6 @@ Download a sample file from the repository and test it.
 If your graphics card has at least **8GB** of VRAM, follow these steps
 to test CUDA acceleration:
 
-   ❗ Due to the extremely limited nature of 8GB VRAM for running this
-   application, you need to close all other programs using VRAM to
-   ensure that 8GB of VRAM is available when running this application.
-
 1. Modify the value of ``"device-mode"`` in the ``magic-pdf.json``
    configuration file located in your home directory.
 
@@ -162,7 +166,7 @@ to test CUDA acceleration:
 
    .. code:: sh
 
-      magic-pdf -p small_ocr.pdf
+      magic-pdf -p small_ocr.pdf -o ./output
 
 10. Enable CUDA Acceleration for OCR
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -178,7 +182,9 @@ to test CUDA acceleration:
 
    .. code:: sh
 
-      magic-pdf -p small_ocr.pdf
+      magic-pdf -p small_ocr.pdf -o ./output
+
+
 
 .. _windows_10_or_11_section:
 
@@ -218,16 +224,16 @@ Python version must be 3.10.
 
    pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
 
-..
+.. admonition:: Important
+    :class: tip
 
-   ❗️After installation, verify the version of ``magic-pdf``:
+    ❗️After installation, verify the version of ``magic-pdf``:
 
-   .. code:: bash
+    .. code:: bash
 
       magic-pdf --version
 
-   If the version number is less than 0.7.0, please report it in the
-   issues section.
+    If the version number is less than 0.7.0, please report it in the issues section.
 
 5. Download Models
 ~~~~~~~~~~~~~~~~~~
@@ -242,7 +248,10 @@ the script will automatically generate a ``magic-pdf.json`` file in the
 user directory and configure the default model path. You can find the
 ``magic-pdf.json`` file in your 【user directory】 .
 
-   The user directory for Windows is “C:/Users/username”.
+.. admonition:: Tip
+    :class: tip
+
+    The user directory for Windows is “C:/Users/username”.
 
 7. First Run
 ~~~~~~~~~~~~
@@ -252,7 +261,7 @@ Download a sample file from the repository and test it.
 .. code:: powershell
 
      wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf -O small_ocr.pdf
-     magic-pdf -p small_ocr.pdf
+     magic-pdf -p small_ocr.pdf -o ./output
 
 8. Test CUDA Acceleration
 ~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -260,27 +269,23 @@ Download a sample file from the repository and test it.
 If your graphics card has at least 8GB of VRAM, follow these steps to
 test CUDA-accelerated parsing performance.
 
-   ❗ Due to the extremely limited nature of 8GB VRAM for running this
-   application, you need to close all other programs using VRAM to
-   ensure that 8GB of VRAM is available when running this application.
-
-1. **Overwrite the installation of torch and torchvision** supporting
-   CUDA.
+1. **Overwrite the installation of torch and torchvision** supporting CUDA.
 
-   ::
+.. code:: sh
 
-      pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
+   pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
 
-   ..
+.. admonition:: Important
+    :class: tip
 
-      ❗️Ensure the following versions are specified in the command:
+    ❗️Ensure the following versions are specified in the command:
 
-      ::
+ 
+    .. code:: sh
 
          torch==2.3.1 torchvision==0.18.1
 
-      These are the highest versions we support. Installing higher
-      versions without specifying them will cause the program to fail.
+    These are the highest versions we support. Installing higher versions without specifying them will cause the program to fail.
 
 2. **Modify the value of ``"device-mode"``** in the ``magic-pdf.json``
    configuration file located in your user directory.
@@ -295,7 +300,7 @@ test CUDA-accelerated parsing performance.
 
    ::
 
-      magic-pdf -p small_ocr.pdf
+      magic-pdf -p small_ocr.pdf -o ./output
 
 9. Enable CUDA Acceleration for OCR
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -311,5 +316,4 @@ test CUDA-accelerated parsing performance.
 
    ::
 
-      magic-pdf -p small_ocr.pdf
-
+      magic-pdf -p small_ocr.pdf -o ./output

+ 81 - 78
next_docs/en/user_guide/install/install.rst

@@ -1,87 +1,90 @@
 
 Install 
 ===============================================================
-If you encounter any installation issues, please first consult the FAQ.
-If the parsing results are not as expected, refer to the Known Issues.
-There are three different ways to experience MinerU
-
-Pre-installation Notice—Hardware and Software Environment Support
-------------------------------------------------------------------
-
-To ensure the stability and reliability of the project, we only optimize
-and test for specific hardware and software environments during
-development. This ensures that users deploying and running the project
-on recommended system configurations will get the best performance with
-the fewest compatibility issues.
-
-By focusing resources on the mainline environment, our team can more
-efficiently resolve potential bugs and develop new features.
-
-In non-mainline environments, due to the diversity of hardware and
-software configurations, as well as third-party dependency compatibility
-issues, we cannot guarantee 100% project availability. Therefore, for
-users who wish to use this project in non-recommended environments, we
-suggest carefully reading the documentation and FAQ first. Most issues
-already have corresponding solutions in the FAQ. We also encourage
-community feedback to help us gradually expand support.
+If you encounter any installation issues, please first consult the :doc:`../../additional_notes/faq`.
+If the parsing results are not as expected, refer to the :doc:`../../additional_notes/known_issues`.
+
+
+.. admonition:: Warning
+    :class: tip
+
+    **Pre-installation Notice—Hardware and Software Environment Support**
+
+    To ensure the stability and reliability of the project, we only optimize
+    and test for specific hardware and software environments during
+    development. This ensures that users deploying and running the project
+    on recommended system configurations will get the best performance with
+    the fewest compatibility issues.
+
+    By focusing resources on the mainline environment, our team can more
+    efficiently resolve potential bugs and develop new features.
+
+    In non-mainline environments, due to the diversity of hardware and
+    software configurations, as well as third-party dependency compatibility
+    issues, we cannot guarantee 100% project availability. Therefore, for
+    users who wish to use this project in non-recommended environments, we
+    suggest carefully reading the documentation and FAQ first. Most issues
+    already have corresponding solutions in the FAQ. We also encourage
+    community feedback to help us gradually expand support.
 
 .. raw:: html
 
-   <style>
-      table, th, td {
-      border: 1px solid black;
-      border-collapse: collapse;
-      }
-   </style>
-   <table>
-    <tr>
-        <td colspan="3" rowspan="2">Operating System</td>
-    </tr>
-    <tr>
-        <td>Ubuntu 22.04 LTS</td>
-        <td>Windows 10 / 11</td>
-        <td>macOS 11+</td>
-    </tr>
-    <tr>
-        <td colspan="3">CPU</td>
-        <td>x86_64</td>
-        <td>x86_64</td>
-        <td>x86_64 / arm64</td>
-    </tr>
-    <tr>
-        <td colspan="3">Memory</td>
-        <td colspan="3">16GB or more, recommended 32GB+</td>
-    </tr>
-    <tr>
-        <td colspan="3">Python Version</td>
-        <td colspan="3">3.10</td>
-    </tr>
-    <tr>
-        <td colspan="3">Nvidia Driver Version</td>
-        <td>latest (Proprietary Driver)</td>
-        <td>latest</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td colspan="3">CUDA Environment</td>
-        <td>Automatic installation [12.1 (pytorch) + 11.8 (paddle)]</td>
-        <td>11.8 (manual installation) + cuDNN v8.7.0 (manual installation)</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td rowspan="2">GPU Hardware Support List</td>
-        <td colspan="2">Minimum Requirement 8G+ VRAM</td>
-        <td colspan="2">3060ti/3070/3080/3080ti/4060/4070/4070ti<br>
-        8G VRAM enables layout, formula recognition acceleration and OCR acceleration</td>
-        <td rowspan="2">None</td>
-    </tr>
-    <tr>
-        <td colspan="2">Recommended Configuration 16G+ VRAM</td>
-        <td colspan="2">3090/3090ti/4070ti super/4080/4090<br>
-        16G VRAM or more can enable layout, formula recognition, OCR acceleration and table recognition acceleration simultaneously
-        </td>
-    </tr>
-   </table>
+    <style>
+        table, th, td {
+        border: 1px solid black;
+        border-collapse: collapse;
+        }
+    </style>
+    <table>
+        <tr>
+            <td colspan="3" rowspan="2">Operating System</td>
+        </tr>
+        <tr>
+            <td>Ubuntu 22.04 LTS</td>
+            <td>Windows 10 / 11</td>
+            <td>macOS 11+</td>
+        </tr>
+        <tr>
+            <td colspan="3">CPU</td>
+            <td>x86_64 (ARM Linux not supported)</td>
+            <td>x86_64 (ARM Windows not supported)</td>
+            <td>x86_64 / arm64</td>
+        </tr>
+        <tr>
+            <td colspan="3">Memory</td>
+            <td colspan="3">16GB or more, recommended 32GB+</td>
+        </tr>
+        <tr>
+            <td colspan="3">Python Version</td>
+            <td colspan="3">3.10(Please make sure to create a Python 3.10 virtual environment using conda)</td>
+        </tr>
+        <tr>
+            <td colspan="3">Nvidia Driver Version</td>
+            <td>latest (Proprietary Driver)</td>
+            <td>latest</td>
+            <td>None</td>
+        </tr>
+        <tr>
+            <td colspan="3">CUDA Environment</td>
+            <td>Automatic installation [12.1 (pytorch) + 11.8 (paddle)]</td>
+            <td>11.8 (manual installation) + cuDNN v8.7.0 (manual installation)</td>
+            <td>None</td>
+        </tr>
+        <tr>
+            <td rowspan="2">GPU Hardware Support List</td>
+            <td colspan="2">Minimum Requirement 8G+ VRAM</td>
+            <td colspan="2">3060ti/3070/4060<br>
+            8G VRAM enables layout, formula recognition acceleration and OCR acceleration</td>
+            <td rowspan="2">None</td>
+        </tr>
+        <tr>
+            <td colspan="2">Recommended Configuration 10G+ VRAM</td>
+            <td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070tisuper/4080/4090<br>
+            10G VRAM or more can enable layout, formula recognition, OCR acceleration and table recognition acceleration simultaneously
+            </td>
+        </tr>
+    </table>
+
 
 
 Create an environment

+ 4 - 1
next_docs/en/user_guide/quick_start/command_line.rst

@@ -55,5 +55,8 @@ directory. The output file list is as follows:
    ├── some_pdf_spans.pdf                   # smallest granularity bbox position information diagram
    └── some_pdf_content_list.json           # Rich text JSON arranged in reading order
 
-For more information about the output files, please refer to the :doc:`../tutorial/output_file_description`
+.. admonition:: Tip
+   :class: tip
+
+   For more information about the output files, please refer to the :doc:`../tutorial/output_file_description`
 

+ 0 - 10
next_docs/en/user_guide/quick_start/extract_text.rst

@@ -1,10 +0,0 @@
-
-
-Extract Content from Pdf
-========================
-
-.. code:: python
-
-    from magic_pdf.data.read_api import read_local_pdfs
-    from magic_pdf.pdf_parse_union_core_v2 import pdf_parse_union
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze

BIN
next_docs/zh_cn/_static/image/MinerU-logo-hq.png


BIN
next_docs/zh_cn/_static/image/MinerU-logo.png


File diff suppressed because it is too large
+ 13 - 0
next_docs/zh_cn/_static/image/ReadTheDocs.svg


BIN
next_docs/zh_cn/_static/image/datalab_logo.png


BIN
next_docs/zh_cn/_static/image/flowchart_en.png


BIN
next_docs/zh_cn/_static/image/flowchart_zh_cn.png


BIN
next_docs/zh_cn/_static/image/layout_example.png


BIN
next_docs/zh_cn/_static/image/poly.png


BIN
next_docs/zh_cn/_static/image/project_panorama_en.png


BIN
next_docs/zh_cn/_static/image/project_panorama_zh_cn.png


BIN
next_docs/zh_cn/_static/image/spans_example.png


BIN
next_docs/zh_cn/_static/image/web_demo_1.png


+ 72 - 0
next_docs/zh_cn/additional_notes/faq.rst

@@ -0,0 +1,72 @@
+Frequently Asked Questions
+==========================
+
+1. On newer versions of macOS, installing with pip install magic-pdf[full] fails with zsh: no matches found: magic-pdf[full]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On macOS, the default shell has switched from Bash to Z shell, and Z shell applies special handling to certain kinds of string matching, which can cause the no matches found error. You can disable the globbing feature on the command line and then retry the install command:
+
+.. code:: bash
+
+   setopt no_nomatch
+   pip install magic-pdf[full]
+
+2. Encountering _pickle.UnpicklingError: invalid load key, 'v'. during use
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This may be caused by incompletely downloaded model files; try re-downloading the model files. Reference: https://github.com/opendatalab/MinerU/issues/143
+
+3. Where should the model files be downloaded / how should models-dir be configured
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The model file path is configured in "magic-pdf.json" via
+
+.. code:: json
+
+   {
+     "models-dir": "/tmp/models"
+   }
+
+This must be an absolute path, not a relative one; you can obtain the absolute path by running "pwd" inside the models directory.
+Reference: https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874
+
+4. Error ``ImportError: libGL.so.1: cannot open shared object file: No such file or directory`` in Ubuntu 22.04 on WSL2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Ubuntu 22.04 on WSL2 is missing the ``libgl`` library; install it with the following command:
+
+.. code:: bash
+
+   sudo apt-get install libgl1-mesa-glx
+
+Reference: https://github.com/opendatalab/MinerU/issues/388
+
+5. Error ``ModuleNotFoundError: No module named 'fairscale'``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Uninstall the module and reinstall it:
+
+.. code:: bash
+
+   pip uninstall fairscale
+   pip install fairscale
+
+Reference: https://github.com/opendatalab/MinerU/issues/411
+
+6. On some newer devices such as the H100, text parsed with CUDA-accelerated OCR is garbled
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+CUDA 11 has poor compatibility with newer GPUs; upgrade the CUDA version used by Paddle:
+
+.. code:: bash
+
+   pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
+
+Reference: https://github.com/opendatalab/MinerU/issues/558
+
+7. On some Linux servers, the program immediately fails with ``Illegal instruction (core dumped)``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This may be because the server's CPU does not support the AVX/AVX2 instruction set, or the CPU supports it but the instruction set has been disabled by the administrator. Try asking the administrator to lift the restriction, or switch to a different server.
+
+Reference: https://github.com/opendatalab/MinerU/issues/591 , https://github.com/opendatalab/MinerU/issues/736

+ 11 - 0
next_docs/zh_cn/additional_notes/glossary.rst

@@ -0,0 +1,11 @@
+
+
+Glossary
+===========
+
+1. jsonl 
+    TODO: add description
+
+2. magic-pdf.json
+    TODO: add description
+

Some files were not shown because too many files changed in this diff