
Merge pull request #969 from opendatalab/release-0.9.3

Release 0.9.3
Xiaomeng Zhao committed 1 year ago (commit 845a3ff067)
100 changed files with 1773 additions and 1018 deletions
  1. .gitignore (+3 -0)
  2. README.md (+5 -2)
  3. README_ja-JP.md (+3 -3)
  4. README_zh-CN.md (+6 -5)
  5. demo/magic_pdf_parse_main.py (+8 -3)
  6. magic-pdf.template.json (+1 -1)
  7. magic_pdf/dict2md/ocr_mkcontent.py (+1 -1)
  8. magic_pdf/libs/Constants.py (+3 -1)
  9. magic_pdf/libs/config_reader.py (+1 -1)
  10. magic_pdf/libs/draw_bbox.py (+10 -4)
  11. magic_pdf/model/pdf_extract_kit.py (+42 -297)
  12. magic_pdf/model/pek_sub_modules/post_process.py (+0 -36)
  13. magic_pdf/model/pek_sub_modules/self_modify.py (+0 -388)
  14. magic_pdf/model/sub_modules/__init__.py (+0 -0)
  15. magic_pdf/model/sub_modules/layout/__init__.py (+0 -0)
  16. magic_pdf/model/sub_modules/layout/doclayout_yolo/DocLayoutYOLO.py (+21 -0)
  17. magic_pdf/model/sub_modules/layout/doclayout_yolo/__init__.py (+0 -0)
  18. magic_pdf/model/sub_modules/layout/layoutlmv3/__init__.py (+0 -0)
  19. magic_pdf/model/sub_modules/layout/layoutlmv3/backbone.py (+0 -0)
  20. magic_pdf/model/sub_modules/layout/layoutlmv3/beit.py (+0 -0)
  21. magic_pdf/model/sub_modules/layout/layoutlmv3/deit.py (+0 -0)
  22. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/__init__.py (+0 -0)
  23. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/__init__.py (+0 -0)
  24. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/cord.py (+0 -0)
  25. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/data_collator.py (+0 -0)
  26. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/funsd.py (+0 -0)
  27. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/image_utils.py (+0 -0)
  28. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/xfund.py (+0 -0)
  29. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/__init__.py (+0 -0)
  30. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/__init__.py (+0 -0)
  31. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/configuration_layoutlmv3.py (+0 -0)
  32. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/modeling_layoutlmv3.py (+0 -0)
  33. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3.py (+0 -0)
  34. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3_fast.py (+0 -0)
  35. magic_pdf/model/sub_modules/layout/layoutlmv3/model_init.py (+0 -0)
  36. magic_pdf/model/sub_modules/layout/layoutlmv3/rcnn_vl.py (+0 -0)
  37. magic_pdf/model/sub_modules/layout/layoutlmv3/visualizer.py (+0 -0)
  38. magic_pdf/model/sub_modules/mfd/__init__.py (+0 -0)
  39. magic_pdf/model/sub_modules/mfd/yolov8/YOLOv8.py (+12 -0)
  40. magic_pdf/model/sub_modules/mfd/yolov8/__init__.py (+0 -0)
  41. magic_pdf/model/sub_modules/mfr/__init__.py (+0 -0)
  42. magic_pdf/model/sub_modules/mfr/unimernet/Unimernet.py (+98 -0)
  43. magic_pdf/model/sub_modules/mfr/unimernet/__init__.py (+0 -0)
  44. magic_pdf/model/sub_modules/model_init.py (+144 -0)
  45. magic_pdf/model/sub_modules/model_utils.py (+51 -0)
  46. magic_pdf/model/sub_modules/ocr/__init__.py (+0 -0)
  47. magic_pdf/model/sub_modules/ocr/paddleocr/__init__.py (+0 -0)
  48. magic_pdf/model/sub_modules/ocr/paddleocr/ocr_utils.py (+259 -0)
  49. magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_273_mod.py (+168 -0)
  50. magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_291_mod.py (+213 -0)
  51. magic_pdf/model/sub_modules/reading_oreder/__init__.py (+0 -0)
  52. magic_pdf/model/sub_modules/reading_oreder/layoutreader/__init__.py (+0 -0)
  53. magic_pdf/model/sub_modules/reading_oreder/layoutreader/helpers.py (+0 -0)
  54. magic_pdf/model/sub_modules/reading_oreder/layoutreader/xycut.py (+242 -0)
  55. magic_pdf/model/sub_modules/table/__init__.py (+0 -0)
  56. magic_pdf/model/sub_modules/table/rapidtable/__init__.py (+0 -0)
  57. magic_pdf/model/sub_modules/table/rapidtable/rapid_table.py (+14 -0)
  58. magic_pdf/model/sub_modules/table/structeqtable/__init__.py (+0 -0)
  59. magic_pdf/model/sub_modules/table/structeqtable/struct_eqtable.py (+3 -11)
  60. magic_pdf/model/sub_modules/table/table_utils.py (+11 -0)
  61. magic_pdf/model/sub_modules/table/tablemaster/__init__.py (+0 -0)
  62. magic_pdf/model/sub_modules/table/tablemaster/tablemaster_paddle.py (+1 -1)
  63. magic_pdf/para/para_split_v3.py (+13 -15)
  64. magic_pdf/pdf_parse_union_core_v2.py (+56 -19)
  65. magic_pdf/resources/model_config/model_configs.yaml (+2 -1)
  66. magic_pdf/tools/common.py (+47 -3)
  67. next_docs/README.md (+16 -0)
  68. next_docs/README_zh-CN.md (+16 -0)
  69. next_docs/en/_static/image/ReadTheDocs.svg (+13 -0)
  70. next_docs/en/additional_notes/changelog.rst (+0 -26)
  71. next_docs/en/additional_notes/faq.rst (+12 -0)
  72. next_docs/en/additional_notes/known_issues.rst (+15 -14)
  73. next_docs/en/api.rst (+0 -1)
  74. next_docs/en/api/classes.rst (+0 -14)
  75. next_docs/en/api/utils.rst (+0 -1)
  76. next_docs/en/conf.py (+1 -1)
  77. next_docs/en/index.rst (+23 -22)
  78. next_docs/en/projects.rst (+0 -13)
  79. next_docs/en/user_guide/data/data_reader_writer.rst (+5 -1)
  80. next_docs/en/user_guide/data/dataset.rst (+1 -1)
  81. next_docs/en/user_guide/data/io.rst (+1 -1)
  82. next_docs/en/user_guide/data/read_api.rst (+6 -1)
  83. next_docs/en/user_guide/install/boost_with_cuda.rst (+45 -41)
  84. next_docs/en/user_guide/install/install.rst (+81 -78)
  85. next_docs/en/user_guide/quick_start/command_line.rst (+4 -1)
  86. next_docs/en/user_guide/quick_start/extract_text.rst (+0 -10)
  87. next_docs/zh_cn/_static/image/MinerU-logo-hq.png (BIN)
  88. next_docs/zh_cn/_static/image/MinerU-logo.png (BIN)
  89. next_docs/zh_cn/_static/image/ReadTheDocs.svg (+13 -0)
  90. next_docs/zh_cn/_static/image/datalab_logo.png (BIN)
  91. next_docs/zh_cn/_static/image/flowchart_en.png (BIN)
  92. next_docs/zh_cn/_static/image/flowchart_zh_cn.png (BIN)
  93. next_docs/zh_cn/_static/image/layout_example.png (BIN)
  94. next_docs/zh_cn/_static/image/poly.png (BIN)
  95. next_docs/zh_cn/_static/image/project_panorama_en.png (BIN)
  96. next_docs/zh_cn/_static/image/project_panorama_zh_cn.png (BIN)
  97. next_docs/zh_cn/_static/image/spans_example.png (BIN)
  98. next_docs/zh_cn/_static/image/web_demo_1.png (BIN)
  99. next_docs/zh_cn/additional_notes/faq.rst (+72 -0)
  100. next_docs/zh_cn/additional_notes/glossary.rst (+11 -0)

+ 3 - 0
.gitignore

@@ -48,3 +48,6 @@ debug_utils/
 
 # sphinx docs
 _build/
+
+
+output/

+ 5 - 2
README.md

@@ -42,6 +42,7 @@
 </div>
 
 # Changelog
+- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.
 - 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition functionality.
 - 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
   - Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading order sorting, ensuring high accuracy in various layouts.
@@ -246,7 +247,7 @@ You can modify certain configurations in this file to enable or disable features
         "enable": true  // The formula recognition feature is enabled by default. If you need to disable it, please change the value here to "false".
     },
     "table-config": {
-        "model": "tablemaster",  // When using structEqTable, please change to "struct_eqtable".
+        "model": "rapid_table",  // When using structEqTable, please change to "struct_eqtable".
         "enable": false, // The table recognition feature is disabled by default. If you need to enable it, please change the value here to "true".
         "max_time": 400
     }
@@ -261,7 +262,7 @@ If your device supports CUDA and meets the GPU requirements of the mainline envi
 - [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
 - Quick Deployment with Docker
 > [!IMPORTANT]
-> Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
+> Docker requires a GPU with at least 8GB of VRAM, and all acceleration features are enabled by default.
 >
 > Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker.
 > 
@@ -421,7 +422,9 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
 # Acknowledgments
 
 - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
+- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
 - [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
+- [RapidTable](https://github.com/RapidAI/RapidTable)
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
 - [layoutreader](https://github.com/ppaanngggg/layoutreader)

+ 3 - 3
README_ja-JP.md

@@ -1,3 +1,5 @@
+> [!Warning]
+> This document is already outdated. Please refer to the latest documentation: [ENGLISH](README.md).
 <div id="top">
 
 <p align="center">
@@ -18,9 +20,7 @@
 <a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>
 
 
-<div align="center" style="color: red; background-color: #ffdddd; padding: 10px; border: 1px solid red; border-radius: 5px;">
-  <strong>NOTE:</strong> This document is already outdated. Please refer to the latest documentation.
-</div>
+
 
 
 [English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)

+ 6 - 5
README_zh-CN.md

@@ -42,7 +42,7 @@
 </div>
 
 # Changelog
-
+- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition; single-table parsing is more than 10x faster, with higher accuracy and lower VRAM usage.
 - 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition.
 - 2024/10/31 0.9.0 released. A major new version with extensive code refactoring that fixes numerous issues, improves performance, lowers hardware requirements, and enhances usability:
   - Refactored the sorting module to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading-order sorting, ensuring high accuracy across all kinds of layouts.
@@ -188,13 +188,13 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
         <td rowspan="2">GPU硬件支持列表</td>
         <td colspan="2">最低要求 8G+显存</td>
         <td colspan="2">3060ti/3070/4060<br>
-        8G显存可开启layout、公式识别和ocr加速</td>
+        8G显存可开启全部加速功能(表格仅限rapid_table)</td>
         <td rowspan="2">None</td>
     </tr>
     <tr>
         <td colspan="2">推荐配置 10G+显存</td>
         <td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070tisuper/4080/4090<br>
-        10G显存及以上可以同时开启layout、公式识别和ocr加速和表格识别加速<br>
+        10G显存及以上可开启全部加速功能<br>
         </td>
     </tr>
 </table>
@@ -251,7 +251,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
         "enable": true  // 公式识别功能默认是开启的,如果需要关闭请修改此处的值为"false"
     },
     "table-config": {
-        "model": "tablemaster",  // 使用structEqTable请修改为"struct_eqtable"
+        "model": "rapid_table",  // 使用structEqTable请修改为"struct_eqtable"
         "enable": false, // 表格识别功能默认是关闭的,如果需要开启请修改此处的值为"true"
         "max_time": 400
     }
@@ -266,7 +266,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
 - [Windows10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
 - Quick deployment with Docker
 > [!IMPORTANT]
-> Docker requires a GPU with at least 16GB of VRAM; all acceleration features are enabled by default
+> Docker requires a GPU with at least 8GB of VRAM; all acceleration features are enabled by default
 > 
 > Before running this Docker image, you can use the following command to check whether your device supports CUDA acceleration in Docker
 > 
@@ -431,6 +431,7 @@ TODO
 - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
 - [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
 - [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
+- [RapidTable](https://github.com/RapidAI/RapidTable)
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
 - [layoutreader](https://github.com/ppaanngggg/layoutreader)

+ 8 - 3
demo/magic_pdf_parse_main.py

@@ -19,9 +19,10 @@ def json_md_dump(
         pdf_name,
         content_list,
         md_content,
+        orig_model_list,
 ):
     # Write the model results to model.json
-    orig_model_list = copy.deepcopy(pipe.model_list)
+
     md_writer.write(
         content=json.dumps(orig_model_list, ensure_ascii=False, indent=4),
         path=f"{pdf_name}_model.json"
@@ -87,9 +88,12 @@ def pdf_parse_main(
 
        pdf_bytes = open(pdf_path, "rb").read()  # read the PDF file's binary data
 
+        orig_model_list = []
+
         if model_json_path:
            # read the raw JSON data (a list) of a PDF already parsed by the model
             model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
+            orig_model_list = copy.deepcopy(model_json)
         else:
             model_json = []
 
@@ -115,8 +119,9 @@ def pdf_parse_main(
         pipe.pipe_classify()
 
        # if no model data was provided, parse with the built-in model
-        if not model_json:
+        if len(model_json) == 0:
            pipe.pipe_analyze()  # run analysis
+            orig_model_list = copy.deepcopy(pipe.model_list)
 
        # run parsing
         pipe.pipe_parse()
@@ -126,7 +131,7 @@ def pdf_parse_main(
         md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none")
 
         if is_json_md_dump:
-            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content)
+            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content, orig_model_list)
 
         if is_draw_visualization_bbox:
             draw_visualization_bbox(pipe.pdf_mid_data['pdf_info'], pdf_bytes, output_path, pdf_name)
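
A condensed sketch of the demo change, assembled from the hunks above with unchanged steps elided: orig_model_list is now captured in whichever branch actually produced the model data, then passed explicitly to json_md_dump instead of being re-derived from pipe inside it.

```python
# Sketch condensed from the hunks above, not the verbatim file:
orig_model_list = []
if model_json_path:
    model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
    orig_model_list = copy.deepcopy(model_json)       # external model data
else:
    model_json = []
# ... classification ...
if len(model_json) == 0:
    pipe.pipe_analyze()                               # built-in model parsing
    orig_model_list = copy.deepcopy(pipe.model_list)  # snapshot taken here, presumably before later stages can mutate it
# ... parsing and markdown generation ...
json_md_dump(pipe, md_writer, pdf_name, content_list, md_content, orig_model_list)
```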

+ 1 - 1
magic-pdf.template.json

@@ -15,7 +15,7 @@
         "enable": true
     },
     "table-config": {
-        "model": "tablemaster",
+        "model": "rapid_table",
         "enable": false,
         "max_time": 400
     },

+ 1 - 1
magic_pdf/dict2md/ocr_mkcontent.py

@@ -168,7 +168,7 @@ def merge_para_with_text(para_block):
                        # if the previous line ends with a hyphen, no trailing space should be added
                         if __is_hyphen_at_line_end(content):
                             para_text += content[:-1]
-                        elif len(content) == 1 and content not in ['A', 'I', 'a', 'i']:
+                        elif len(content) == 1 and content not in ['A', 'I', 'a', 'i'] and not content.isdigit():
                             para_text += content
                        else:  # in Western-text contexts, content pieces are separated by a space
                             para_text += f"{content} "
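
For context, the adjusted spacing rule isolated into a standalone sketch (the helper name join_content is hypothetical; the condition is taken verbatim from the hunk): lone characters are glued to the paragraph without a trailing space, except the one-letter words A/I/a/i and, new in this change, lone digits.

```python
def join_content(para_text: str, content: str) -> str:
    # Lone characters are appended without a trailing space, except the
    # one-letter words A/I/a/i and (new in this change) lone digits,
    # which keep the normal Western-text space separator.
    if len(content) == 1 and content not in ['A', 'I', 'a', 'i'] and not content.isdigit():
        return para_text + content
    return para_text + f"{content} "
```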

+ 3 - 1
magic_pdf/libs/Constants.py

@@ -50,4 +50,6 @@ class MODEL_NAME:
 
     YOLO_V8_MFD = "yolo_v8_mfd"
 
-    UniMerNet_v2_Small = "unimernet_small"
+    UniMerNet_v2_Small = "unimernet_small"
+
+    RAPID_TABLE = "rapid_table"

+ 1 - 1
magic_pdf/libs/config_reader.py

@@ -92,7 +92,7 @@ def get_table_recog_config():
     table_config = config.get('table-config')
     if table_config is None:
         logger.warning(f"'table-config' not found in {CONFIG_FILE_NAME}, use 'False' as default")
-        return json.loads(f'{{"model": "{MODEL_NAME.TABLE_MASTER}","enable": false, "max_time": 400}}')
+        return json.loads(f'{{"model": "{MODEL_NAME.RAPID_TABLE}","enable": false, "max_time": 400}}')
     else:
         return table_config
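
Illustratively, the new fallback deserializes to the dict below (assuming MODEL_NAME.RAPID_TABLE == "rapid_table", per the Constants.py hunk above):

```python
# What get_table_recog_config() now returns when 'table-config' is missing:
default_table_config = {
    "model": "rapid_table",  # previously "tablemaster"
    "enable": False,
    "max_time": 400,
}
```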
 

+ 10 - 4
magic_pdf/libs/draw_bbox.py

@@ -369,10 +369,16 @@ def draw_line_sort_bbox(pdf_info, pdf_bytes, out_path, filename):
             if block['type'] in [BlockType.Image, BlockType.Table]:
                 for sub_block in block['blocks']:
                     if sub_block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
-                        for line in sub_block['virtual_lines']:
-                            bbox = line['bbox']
-                            index = line['index']
-                            page_line_list.append({'index': index, 'bbox': bbox})
+                        if len(sub_block['virtual_lines']) > 0 and sub_block['virtual_lines'][0].get('index', None) is not None:
+                            for line in sub_block['virtual_lines']:
+                                bbox = line['bbox']
+                                index = line['index']
+                                page_line_list.append({'index': index, 'bbox': bbox})
+                        else:
+                            for line in sub_block['lines']:
+                                bbox = line['bbox']
+                                index = line['index']
+                                page_line_list.append({'index': index, 'bbox': bbox})
                     elif sub_block['type'] in [BlockType.ImageCaption, BlockType.TableCaption, BlockType.ImageFootnote, BlockType.TableFootnote]:
                         for line in sub_block['lines']:
                             bbox = line['bbox']
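
The hunk's fallback logic, isolated into a sketch (the helper name pick_lines is hypothetical): synthesized virtual_lines are used only when they exist and carry a sort index; otherwise the block's real lines are drawn instead.

```python
def pick_lines(sub_block):
    # Prefer virtual_lines only when present and indexed; otherwise
    # fall back to the block's real lines.
    virtual_lines = sub_block.get('virtual_lines', [])
    if len(virtual_lines) > 0 and virtual_lines[0].get('index', None) is not None:
        return virtual_lines
    return sub_block['lines']
```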

+ 42 - 297
magic_pdf/model/pdf_extract_kit.py

@@ -1,195 +1,28 @@
+import numpy as np
+import torch
 from loguru import logger
 import os
 import time
-from pathlib import Path
-import shutil
-from magic_pdf.libs.Constants import *
-from magic_pdf.libs.clean_memory import clean_memory
-from magic_pdf.model.model_list import AtomicModel
+import cv2
+import yaml
+from PIL import Image
 
 os.environ['NO_ALBUMENTATIONS_UPDATE'] = '1'  # disable albumentations update check
 os.environ['YOLO_VERBOSE'] = 'False'  # disable yolo logger
+
 try:
-    import cv2
-    import yaml
-    import argparse
-    import numpy as np
-    import torch
     import torchtext
 
     if torchtext.__version__ >= "0.18.0":
         torchtext.disable_torchtext_deprecation_warning()
-    from PIL import Image
-    from torchvision import transforms
-    from torch.utils.data import Dataset, DataLoader
-    from ultralytics import YOLO
-    from unimernet.common.config import Config
-    import unimernet.tasks as tasks
-    from unimernet.processors import load_processor
-    from doclayout_yolo import YOLOv10
-
-except ImportError as e:
-    logger.exception(e)
-    logger.error(
-        'Required dependency not installed, please install by \n'
-        '"pip install magic-pdf[full] --extra-index-url https://myhloli.github.io/wheels/"')
-    exit(1)
-
-from magic_pdf.model.pek_sub_modules.layoutlmv3.model_init import Layoutlmv3_Predictor
-from magic_pdf.model.pek_sub_modules.post_process import latex_rm_whitespace
-from magic_pdf.model.pek_sub_modules.self_modify import ModifiedPaddleOCR
-from magic_pdf.model.pek_sub_modules.structeqtable.StructTableModel import StructTableModel
-from magic_pdf.model.ppTableModel import ppTableModel
-
-
-def table_model_init(table_model_type, model_path, max_time, _device_='cpu'):
-    if table_model_type == MODEL_NAME.STRUCT_EQTABLE:
-        table_model = StructTableModel(model_path, max_time=max_time)
-    elif table_model_type == MODEL_NAME.TABLE_MASTER:
-        config = {
-            "model_dir": model_path,
-            "device": _device_
-        }
-        table_model = ppTableModel(config)
-    else:
-        logger.error("table model type not allow")
-        exit(1)
-    return table_model
-
-
-def mfd_model_init(weight):
-    mfd_model = YOLO(weight)
-    return mfd_model
-
-
-def mfr_model_init(weight_dir, cfg_path, _device_='cpu'):
-    args = argparse.Namespace(cfg_path=cfg_path, options=None)
-    cfg = Config(args)
-    cfg.config.model.pretrained = os.path.join(weight_dir, "pytorch_model.pth")
-    cfg.config.model.model_config.model_name = weight_dir
-    cfg.config.model.tokenizer_config.path = weight_dir
-    task = tasks.setup_task(cfg)
-    model = task.build_model(cfg)
-    model.to(_device_)
-    model.eval()
-    vis_processor = load_processor('formula_image_eval', cfg.config.datasets.formula_rec_eval.vis_processor.eval)
-    mfr_transform = transforms.Compose([vis_processor, ])
-    return [model, mfr_transform]
-
-
-def layout_model_init(weight, config_file, device):
-    model = Layoutlmv3_Predictor(weight, config_file, device)
-    return model
-
-
-def doclayout_yolo_model_init(weight):
-    model = YOLOv10(weight)
-    return model
-
-
-def ocr_model_init(show_log: bool = False, det_db_box_thresh=0.3, lang=None, use_dilation=True, det_db_unclip_ratio=1.8):
-    if lang is not None:
-        model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=det_db_box_thresh, lang=lang, use_dilation=use_dilation, det_db_unclip_ratio=det_db_unclip_ratio)
-    else:
-        model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=det_db_box_thresh, use_dilation=use_dilation, det_db_unclip_ratio=det_db_unclip_ratio)
-    return model
-
-
-class MathDataset(Dataset):
-    def __init__(self, image_paths, transform=None):
-        self.image_paths = image_paths
-        self.transform = transform
-
-    def __len__(self):
-        return len(self.image_paths)
-
-    def __getitem__(self, idx):
-        # if not pil image, then convert to pil image
-        if isinstance(self.image_paths[idx], str):
-            raw_image = Image.open(self.image_paths[idx])
-        else:
-            raw_image = self.image_paths[idx]
-        if self.transform:
-            image = self.transform(raw_image)
-            return image
-
-
-class AtomModelSingleton:
-    _instance = None
-    _models = {}
-
-    def __new__(cls, *args, **kwargs):
-        if cls._instance is None:
-            cls._instance = super().__new__(cls)
-        return cls._instance
-
-    def get_atom_model(self, atom_model_name: str, **kwargs):
-        lang = kwargs.get("lang", None)
-        layout_model_name = kwargs.get("layout_model_name", None)
-        key = (atom_model_name, layout_model_name, lang)
-        if key not in self._models:
-            self._models[key] = atom_model_init(model_name=atom_model_name, **kwargs)
-        return self._models[key]
-
-
-def atom_model_init(model_name: str, **kwargs):
-
-    if model_name == AtomicModel.Layout:
-        if kwargs.get("layout_model_name") == MODEL_NAME.LAYOUTLMv3:
-            atom_model = layout_model_init(
-                kwargs.get("layout_weights"),
-                kwargs.get("layout_config_file"),
-                kwargs.get("device")
-            )
-        elif kwargs.get("layout_model_name") == MODEL_NAME.DocLayout_YOLO:
-            atom_model = doclayout_yolo_model_init(
-                kwargs.get("doclayout_yolo_weights"),
-            )
-    elif model_name == AtomicModel.MFD:
-        atom_model = mfd_model_init(
-            kwargs.get("mfd_weights")
-        )
-    elif model_name == AtomicModel.MFR:
-        atom_model = mfr_model_init(
-            kwargs.get("mfr_weight_dir"),
-            kwargs.get("mfr_cfg_path"),
-            kwargs.get("device")
-        )
-    elif model_name == AtomicModel.OCR:
-        atom_model = ocr_model_init(
-            kwargs.get("ocr_show_log"),
-            kwargs.get("det_db_box_thresh"),
-            kwargs.get("lang")
-        )
-    elif model_name == AtomicModel.Table:
-        atom_model = table_model_init(
-            kwargs.get("table_model_name"),
-            kwargs.get("table_model_path"),
-            kwargs.get("table_max_time"),
-            kwargs.get("device")
-        )
-    else:
-        logger.error("model name not allow")
-        exit(1)
-
-    return atom_model
-
+except ImportError:
+    pass
 
-#  Unified crop img logic
-def crop_img(input_res, input_pil_img, crop_paste_x=0, crop_paste_y=0):
-    crop_xmin, crop_ymin = int(input_res['poly'][0]), int(input_res['poly'][1])
-    crop_xmax, crop_ymax = int(input_res['poly'][4]), int(input_res['poly'][5])
-    # Create a white background with an additional width and height of 50
-    crop_new_width = crop_xmax - crop_xmin + crop_paste_x * 2
-    crop_new_height = crop_ymax - crop_ymin + crop_paste_y * 2
-    return_image = Image.new('RGB', (crop_new_width, crop_new_height), 'white')
-
-    # Crop image
-    crop_box = (crop_xmin, crop_ymin, crop_xmax, crop_ymax)
-    cropped_img = input_pil_img.crop(crop_box)
-    return_image.paste(cropped_img, (crop_paste_x, crop_paste_y))
-    return_list = [crop_paste_x, crop_paste_y, crop_xmin, crop_ymin, crop_xmax, crop_ymax, crop_new_width, crop_new_height]
-    return return_image, return_list
+from magic_pdf.libs.Constants import *
+from magic_pdf.model.model_list import AtomicModel
+from magic_pdf.model.sub_modules.model_init import AtomModelSingleton
+from magic_pdf.model.sub_modules.model_utils import get_res_list_from_layout_res, crop_img, clean_vram
+from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import get_adjusted_mfdetrec_res, get_ocr_result_list
 
 
 class CustomPEKModel:
@@ -226,7 +59,7 @@ class CustomPEKModel:
         self.table_config = kwargs.get("table_config")
         self.apply_table = self.table_config.get("enable", False)
         self.table_max_time = self.table_config.get("max_time", TABLE_MAX_TIME_VALUE)
-        self.table_model_name = self.table_config.get("model", MODEL_NAME.TABLE_MASTER)
+        self.table_model_name = self.table_config.get("model", MODEL_NAME.RAPID_TABLE)
 
         # ocr config
         self.apply_ocr = ocr
@@ -235,7 +68,8 @@ class CustomPEKModel:
         logger.info(
             "DocAnalysis init, this may take some times, layout_model: {}, apply_formula: {}, apply_ocr: {}, "
             "apply_table: {}, table_model: {}, lang: {}".format(
-                self.layout_model_name, self.apply_formula, self.apply_ocr, self.apply_table, self.table_model_name, self.lang
+                self.layout_model_name, self.apply_formula, self.apply_ocr, self.apply_table, self.table_model_name,
+                self.lang
             )
         )
         # initialize the parsing pipeline
@@ -248,17 +82,17 @@ class CustomPEKModel:
 
         # initialize formula recognition
         if self.apply_formula:
-
             # initialize the formula detection model
             self.mfd_model = atom_model_manager.get_atom_model(
                 atom_model_name=AtomicModel.MFD,
-                mfd_weights=str(os.path.join(models_dir, self.configs["weights"][self.mfd_model_name]))
+                mfd_weights=str(os.path.join(models_dir, self.configs["weights"][self.mfd_model_name])),
+                device=self.device
             )
 
             # initialize the formula recognition (MFR) model
             mfr_weight_dir = str(os.path.join(models_dir, self.configs["weights"][self.mfr_model_name]))
             mfr_cfg_path = str(os.path.join(model_config_dir, "UniMERNet", "demo.yaml"))
-            self.mfr_model, self.mfr_transform = atom_model_manager.get_atom_model(
+            self.mfr_model = atom_model_manager.get_atom_model(
                 atom_model_name=AtomicModel.MFR,
                 mfr_weight_dir=mfr_weight_dir,
                 mfr_cfg_path=mfr_cfg_path,
@@ -278,7 +112,8 @@ class CustomPEKModel:
             self.layout_model = atom_model_manager.get_atom_model(
                 atom_model_name=AtomicModel.Layout,
                 layout_model_name=MODEL_NAME.DocLayout_YOLO,
-                doclayout_yolo_weights=str(os.path.join(models_dir, self.configs['weights'][self.layout_model_name]))
+                doclayout_yolo_weights=str(os.path.join(models_dir, self.configs['weights'][self.layout_model_name])),
+                device=self.device
             )
         # initialize OCR
         if self.apply_ocr:
@@ -305,26 +140,15 @@ class CustomPEKModel:
 
         page_start = time.time()
 
-        latex_filling_list = []
-        mf_image_list = []
-
         # layout detection
         layout_start = time.time()
+        layout_res = []
         if self.layout_model_name == MODEL_NAME.LAYOUTLMv3:
             # layoutlmv3
             layout_res = self.layout_model(image, ignore_catids=[])
         elif self.layout_model_name == MODEL_NAME.DocLayout_YOLO:
             # doclayout_yolo
-            layout_res = []
-            doclayout_yolo_res = self.layout_model.predict(image, imgsz=1024, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
-            for xyxy, conf, cla in zip(doclayout_yolo_res.boxes.xyxy.cpu(), doclayout_yolo_res.boxes.conf.cpu(), doclayout_yolo_res.boxes.cls.cpu()):
-                xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
-                new_item = {
-                    'category_id': int(cla.item()),
-                    'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
-                    'score': round(float(conf.item()), 3),
-                }
-                layout_res.append(new_item)
+            layout_res = self.layout_model.predict(image)
         layout_cost = round(time.time() - layout_start, 2)
         logger.info(f"layout detection time: {layout_cost}")
 
@@ -333,59 +157,21 @@ class CustomPEKModel:
         if self.apply_formula:
             # formula detection
             mfd_start = time.time()
-            mfd_res = self.mfd_model.predict(image, imgsz=1888, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
+            mfd_res = self.mfd_model.predict(image)
             logger.info(f"mfd time: {round(time.time() - mfd_start, 2)}")
-            for xyxy, conf, cla in zip(mfd_res.boxes.xyxy.cpu(), mfd_res.boxes.conf.cpu(), mfd_res.boxes.cls.cpu()):
-                xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
-                new_item = {
-                    'category_id': 13 + int(cla.item()),
-                    'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
-                    'score': round(float(conf.item()), 2),
-                    'latex': '',
-                }
-                layout_res.append(new_item)
-                latex_filling_list.append(new_item)
-                bbox_img = pil_img.crop((xmin, ymin, xmax, ymax))
-                mf_image_list.append(bbox_img)
 
             # formula recognition
             mfr_start = time.time()
-            dataset = MathDataset(mf_image_list, transform=self.mfr_transform)
-            dataloader = DataLoader(dataset, batch_size=64, num_workers=0)
-            mfr_res = []
-            for mf_img in dataloader:
-                mf_img = mf_img.to(self.device)
-                with torch.no_grad():
-                    output = self.mfr_model.generate({'image': mf_img})
-                mfr_res.extend(output['pred_str'])
-            for res, latex in zip(latex_filling_list, mfr_res):
-                res['latex'] = latex_rm_whitespace(latex)
+            formula_list = self.mfr_model.predict(mfd_res, image)
+            layout_res.extend(formula_list)
             mfr_cost = round(time.time() - mfr_start, 2)
-            logger.info(f"formula nums: {len(mf_image_list)}, mfr time: {mfr_cost}")
-
-        # Select regions for OCR / formula regions / table regions
-        ocr_res_list = []
-        table_res_list = []
-        single_page_mfdetrec_res = []
-        for res in layout_res:
-            if int(res['category_id']) in [13, 14]:
-                single_page_mfdetrec_res.append({
-                    "bbox": [int(res['poly'][0]), int(res['poly'][1]),
-                             int(res['poly'][4]), int(res['poly'][5])],
-                })
-            elif int(res['category_id']) in [0, 1, 2, 4, 6, 7]:
-                ocr_res_list.append(res)
-            elif int(res['category_id']) in [5]:
-                table_res_list.append(res)
-
-        if torch.cuda.is_available() and self.device != 'cpu':
-            properties = torch.cuda.get_device_properties(self.device)
-            total_memory = properties.total_memory / (1024 ** 3)  # convert bytes to GB
-            if total_memory <= 10:
-                gc_start = time.time()
-                clean_memory()
-                gc_time = round(time.time() - gc_start, 2)
-                logger.info(f"gc time: {gc_time}")
+            logger.info(f"formula nums: {len(formula_list)}, mfr time: {mfr_cost}")
+
+        # free GPU memory
+        clean_vram(self.device, vram_threshold=8)
+
+        # extract OCR, table, and formula regions from layout_res
+        ocr_res_list, table_res_list, single_page_mfdetrec_res = get_res_list_from_layout_res(layout_res)
 
         # OCR recognition
         if self.apply_ocr:
@@ -393,23 +179,7 @@ class CustomPEKModel:
             # Process each area that requires OCR processing
             for res in ocr_res_list:
                 new_image, useful_list = crop_img(res, pil_img, crop_paste_x=50, crop_paste_y=50)
-                paste_x, paste_y, xmin, ymin, xmax, ymax, new_width, new_height = useful_list
-                # Adjust the coordinates of the formula area
-                adjusted_mfdetrec_res = []
-                for mf_res in single_page_mfdetrec_res:
-                    mf_xmin, mf_ymin, mf_xmax, mf_ymax = mf_res["bbox"]
-                    # Adjust the coordinates of the formula area to the coordinates relative to the cropping area
-                    x0 = mf_xmin - xmin + paste_x
-                    y0 = mf_ymin - ymin + paste_y
-                    x1 = mf_xmax - xmin + paste_x
-                    y1 = mf_ymax - ymin + paste_y
-                    # Filter formula blocks outside the graph
-                    if any([x1 < 0, y1 < 0]) or any([x0 > new_width, y0 > new_height]):
-                        continue
-                    else:
-                        adjusted_mfdetrec_res.append({
-                            "bbox": [x0, y0, x1, y1],
-                        })
+                adjusted_mfdetrec_res = get_adjusted_mfdetrec_res(single_page_mfdetrec_res, useful_list)
 
                 # OCR recognition
                 new_image = cv2.cvtColor(np.asarray(new_image), cv2.COLOR_RGB2BGR)
@@ -417,22 +187,8 @@ class CustomPEKModel:
 
                 # Integration results
                 if ocr_res:
-                    for box_ocr_res in ocr_res:
-                        p1, p2, p3, p4 = box_ocr_res[0]
-                        text, score = box_ocr_res[1]
-
-                        # Convert the coordinates back to the original coordinate system
-                        p1 = [p1[0] - paste_x + xmin, p1[1] - paste_y + ymin]
-                        p2 = [p2[0] - paste_x + xmin, p2[1] - paste_y + ymin]
-                        p3 = [p3[0] - paste_x + xmin, p3[1] - paste_y + ymin]
-                        p4 = [p4[0] - paste_x + xmin, p4[1] - paste_y + ymin]
-
-                        layout_res.append({
-                            'category_id': 15,
-                            'poly': p1 + p2 + p3 + p4,
-                            'score': round(score, 2),
-                            'text': text,
-                        })
+                    ocr_result_list = get_ocr_result_list(ocr_res, useful_list)
+                    layout_res.extend(ocr_result_list)
 
             ocr_cost = round(time.time() - ocr_start, 2)
             logger.info(f"ocr time: {ocr_cost}")
@@ -443,41 +199,30 @@ class CustomPEKModel:
             for res in table_res_list:
                 new_image, _ = crop_img(res, pil_img)
                 single_table_start_time = time.time()
-                # logger.info("------------------table recognition processing begins-----------------")
-                latex_code = None
                 html_code = None
                 if self.table_model_name == MODEL_NAME.STRUCT_EQTABLE:
                     with torch.no_grad():
                         table_result = self.table_model.predict(new_image, "html")
                         if len(table_result) > 0:
                             html_code = table_result[0]
-                else:
+                elif self.table_model_name == MODEL_NAME.TABLE_MASTER:
                     html_code = self.table_model.img2html(new_image)
-
+                elif self.table_model_name == MODEL_NAME.RAPID_TABLE:
+                    html_code, table_cell_bboxes, elapse = self.table_model.predict(new_image)
                 run_time = time.time() - single_table_start_time
-                # logger.info(f"------------table recognition processing ends within {run_time}s-----")
                 if run_time > self.table_max_time:
-                    logger.warning(f"------------table recognition processing exceeds max time {self.table_max_time}s----------")
+                    logger.warning(f"table recognition processing exceeds max time {self.table_max_time}s")
                 # check whether a valid result was returned
-
-                if latex_code:
-                    expected_ending = latex_code.strip().endswith('end{tabular}') or latex_code.strip().endswith('end{table}')
-                    if expected_ending:
-                        res["latex"] = latex_code
-                    else:
-                        logger.warning(f"table recognition processing fails, not found expected LaTeX table end")
-                elif html_code:
+                if html_code:
                     expected_ending = html_code.strip().endswith('</html>') or html_code.strip().endswith('</table>')
                     if expected_ending:
                         res["html"] = html_code
                     else:
                         logger.warning(f"table recognition processing fails, not found expected HTML table end")
                 else:
-                    logger.warning(f"table recognition processing fails, not get latex or html return")
+                    logger.warning(f"table recognition processing fails, not get html return")
             logger.info(f"table time: {round(time.time() - table_start, 2)}")
 
         logger.info(f"-----page total time: {round(time.time() - page_start, 2)}-----")
 
         return layout_res
-
-

+ 0 - 36
magic_pdf/model/pek_sub_modules/post_process.py

@@ -1,36 +0,0 @@
-import re
-
-def layout_rm_equation(layout_res):
-    rm_idxs = []
-    for idx, ele in enumerate(layout_res['layout_dets']):
-        if ele['category_id'] == 10:
-            rm_idxs.append(idx)
-    
-    for idx in rm_idxs[::-1]:
-        del layout_res['layout_dets'][idx]
-    return layout_res
-
-
-def get_croped_image(image_pil, bbox):
-    x_min, y_min, x_max, y_max = bbox
-    croped_img = image_pil.crop((x_min, y_min, x_max, y_max))
-    return croped_img
-
-
-def latex_rm_whitespace(s: str):
-    """Remove unnecessary whitespace from LaTeX code.
-    """
-    text_reg = r'(\\(operatorname|mathrm|text|mathbf)\s?\*? {.*?})'
-    letter = '[a-zA-Z]'
-    noletter = '[\W_^\d]'
-    names = [x[0].replace(' ', '') for x in re.findall(text_reg, s)]
-    s = re.sub(text_reg, lambda match: str(names.pop(0)), s)
-    news = s
-    while True:
-        s = news
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', news)
-        news = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', news)
-        if news == s:
-            break
-    return s

+ 0 - 388
magic_pdf/model/pek_sub_modules/self_modify.py

@@ -1,388 +0,0 @@
-import time
-import copy
-import base64
-import cv2
-import numpy as np
-from io import BytesIO
-from PIL import Image
-
-from paddleocr import PaddleOCR
-from paddleocr.ppocr.utils.logging import get_logger
-from paddleocr.ppocr.utils.utility import check_and_read, alpha_to_color, binarize_img
-from paddleocr.tools.infer.utility import draw_ocr_box_txt, get_rotate_crop_image, get_minarea_rect_crop
-
-from magic_pdf.libs.boxbase import __is_overlaps_y_exceeds_threshold
-from magic_pdf.pre_proc.ocr_dict_merge import merge_spans_to_line
-
-logger = get_logger()
-
-
-def img_decode(content: bytes):
-    np_arr = np.frombuffer(content, dtype=np.uint8)
-    return cv2.imdecode(np_arr, cv2.IMREAD_UNCHANGED)
-
-
-def check_img(img):
-    if isinstance(img, bytes):
-        img = img_decode(img)
-    if isinstance(img, str):
-        image_file = img
-        img, flag_gif, flag_pdf = check_and_read(image_file)
-        if not flag_gif and not flag_pdf:
-            with open(image_file, 'rb') as f:
-                img_str = f.read()
-                img = img_decode(img_str)
-            if img is None:
-                try:
-                    buf = BytesIO()
-                    image = BytesIO(img_str)
-                    im = Image.open(image)
-                    rgb = im.convert('RGB')
-                    rgb.save(buf, 'jpeg')
-                    buf.seek(0)
-                    image_bytes = buf.read()
-                    data_base64 = str(base64.b64encode(image_bytes),
-                                      encoding="utf-8")
-                    image_decode = base64.b64decode(data_base64)
-                    img_array = np.frombuffer(image_decode, np.uint8)
-                    img = cv2.imdecode(img_array, cv2.IMREAD_COLOR)
-                except:
-                    logger.error("error in loading image:{}".format(image_file))
-                    return None
-        if img is None:
-            logger.error("error in loading image:{}".format(image_file))
-            return None
-    if isinstance(img, np.ndarray) and len(img.shape) == 2:
-        img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
-
-    return img
-
-
-def sorted_boxes(dt_boxes):
-    """
-    Sort text boxes in order from top to bottom, left to right
-    args:
-        dt_boxes(array):detected text boxes with shape [4, 2]
-    return:
-        sorted boxes(array) with shape [4, 2]
-    """
-    num_boxes = dt_boxes.shape[0]
-    sorted_boxes = sorted(dt_boxes, key=lambda x: (x[0][1], x[0][0]))
-    _boxes = list(sorted_boxes)
-
-    for i in range(num_boxes - 1):
-        for j in range(i, -1, -1):
-            if abs(_boxes[j + 1][0][1] - _boxes[j][0][1]) < 10 and \
-                    (_boxes[j + 1][0][0] < _boxes[j][0][0]):
-                tmp = _boxes[j]
-                _boxes[j] = _boxes[j + 1]
-                _boxes[j + 1] = tmp
-            else:
-                break
-    return _boxes
-
-
-def bbox_to_points(bbox):
-    """ 将bbox格式转换为四个顶点的数组 """
-    x0, y0, x1, y1 = bbox
-    return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]]).astype('float32')
-
-
-def points_to_bbox(points):
-    """ 将四个顶点的数组转换为bbox格式 """
-    x0, y0 = points[0]
-    x1, _ = points[1]
-    _, y1 = points[2]
-    return [x0, y0, x1, y1]
-
-
-def merge_intervals(intervals):
-    # Sort the intervals based on the start value
-    intervals.sort(key=lambda x: x[0])
-
-    merged = []
-    for interval in intervals:
-        # If the list of merged intervals is empty or if the current
-        # interval does not overlap with the previous, simply append it.
-        if not merged or merged[-1][1] < interval[0]:
-            merged.append(interval)
-        else:
-            # Otherwise, there is overlap, so we merge the current and previous intervals.
-            merged[-1][1] = max(merged[-1][1], interval[1])
-
-    return merged
-
-
-def remove_intervals(original, masks):
-    # Merge all mask intervals
-    merged_masks = merge_intervals(masks)
-
-    result = []
-    original_start, original_end = original
-
-    for mask in merged_masks:
-        mask_start, mask_end = mask
-
-        # If the mask starts after the original range, ignore it
-        if mask_start > original_end:
-            continue
-
-        # If the mask ends before the original range starts, ignore it
-        if mask_end < original_start:
-            continue
-
-        # Remove the masked part from the original range
-        if original_start < mask_start:
-            result.append([original_start, mask_start - 1])
-
-        original_start = max(mask_end + 1, original_start)
-
-    # Add the remaining part of the original range, if any
-    if original_start <= original_end:
-        result.append([original_start, original_end])
-
-    return result
-
-
-def update_det_boxes(dt_boxes, mfd_res):
-    new_dt_boxes = []
-    for text_box in dt_boxes:
-        text_bbox = points_to_bbox(text_box)
-        masks_list = []
-        for mf_box in mfd_res:
-            mf_bbox = mf_box['bbox']
-            if __is_overlaps_y_exceeds_threshold(text_bbox, mf_bbox):
-                masks_list.append([mf_bbox[0], mf_bbox[2]])
-        text_x_range = [text_bbox[0], text_bbox[2]]
-        text_remove_mask_range = remove_intervals(text_x_range, masks_list)
-        temp_dt_box = []
-        for text_remove_mask in text_remove_mask_range:
-            temp_dt_box.append(bbox_to_points([text_remove_mask[0], text_bbox[1], text_remove_mask[1], text_bbox[3]]))
-        if len(temp_dt_box) > 0:
-            new_dt_boxes.extend(temp_dt_box)
-    return new_dt_boxes
-
-
-def merge_overlapping_spans(spans):
-    """
-    Merges overlapping spans on the same line.
-
-    :param spans: A list of span coordinates [(x1, y1, x2, y2), ...]
-    :return: A list of merged spans
-    """
-    # Return an empty list if the input spans list is empty
-    if not spans:
-        return []
-
-    # Sort spans by their starting x-coordinate
-    spans.sort(key=lambda x: x[0])
-
-    # Initialize the list of merged spans
-    merged = []
-    for span in spans:
-        # Unpack span coordinates
-        x1, y1, x2, y2 = span
-        # If the merged list is empty or there's no horizontal overlap, add the span directly
-        if not merged or merged[-1][2] < x1:
-            merged.append(span)
-        else:
-            # If there is horizontal overlap, merge the current span with the previous one
-            last_span = merged.pop()
-            # Update the merged span's top-left corner to the smaller (x1, y1) and bottom-right to the larger (x2, y2)
-            x1 = min(last_span[0], x1)
-            y1 = min(last_span[1], y1)
-            x2 = max(last_span[2], x2)
-            y2 = max(last_span[3], y2)
-            # Add the merged span back to the list
-            merged.append((x1, y1, x2, y2))
-
-    # Return the list of merged spans
-    return merged
-
-
-def merge_det_boxes(dt_boxes):
-    """
-    Merge detection boxes.
-
-    This function takes a list of detected bounding boxes, each represented by four corner points.
-    The goal is to merge these bounding boxes into larger text regions.
-
-    Parameters:
-    dt_boxes (list): A list containing multiple text detection boxes, where each box is defined by four corner points.
-
-    Returns:
-    list: A list containing the merged text regions, where each region is represented by four corner points.
-    """
-    # Convert the detection boxes into a dictionary format with bounding boxes and type
-    dt_boxes_dict_list = []
-    for text_box in dt_boxes:
-        text_bbox = points_to_bbox(text_box)
-        text_box_dict = {
-            'bbox': text_bbox,
-            'type': 'text',
-        }
-        dt_boxes_dict_list.append(text_box_dict)
-
-    # Merge adjacent text regions into lines
-    lines = merge_spans_to_line(dt_boxes_dict_list)
-
-    # Initialize a new list for storing the merged text regions
-    new_dt_boxes = []
-    for line in lines:
-        line_bbox_list = []
-        for span in line:
-            line_bbox_list.append(span['bbox'])
-
-        # Merge overlapping text regions within the same line
-        merged_spans = merge_overlapping_spans(line_bbox_list)
-
-        # Convert the merged text regions back to point format and add them to the new detection box list
-        for span in merged_spans:
-            new_dt_boxes.append(bbox_to_points(span))
-
-    return new_dt_boxes
-
-
-class ModifiedPaddleOCR(PaddleOCR):
-    def ocr(self, img, det=True, rec=True, cls=True, bin=False, inv=False, mfd_res=None, alpha_color=(255, 255, 255)):
-        """
-        OCR with PaddleOCR
-        args:
-            img: img for OCR, support ndarray, img_path and list or ndarray
-            det: use text detection or not. If False, only rec will be exec. Default is True
-            rec: use text recognition or not. If False, only det will be exec. Default is True
-            cls: use angle classifier or not. Default is True. If True, the text with rotation of 180 degrees can be recognized. If no text is rotated by 180 degrees, use cls=False to get better performance. Text with rotation of 90 or 270 degrees can be recognized even if cls=False.
-            bin: binarize image to black and white. Default is False.
-            inv: invert image colors. Default is False.
-            alpha_color: set RGB color Tuple for transparent parts replacement. Default is pure white.
-        """
-        assert isinstance(img, (np.ndarray, list, str, bytes))
-        if isinstance(img, list) and det == True:
-            logger.error('When input a list of images, det must be false')
-            exit(0)
-        if cls == True and self.use_angle_cls == False:
-            pass
-            # logger.warning(
-            #     'Since the angle classifier is not initialized, it will not be used during the forward process'
-            # )
-
-        img = check_img(img)
-        # for infer pdf file
-        if isinstance(img, list):
-            if self.page_num > len(img) or self.page_num == 0:
-                self.page_num = len(img)
-            imgs = img[:self.page_num]
-        else:
-            imgs = [img]
-
-        def preprocess_image(_image):
-            _image = alpha_to_color(_image, alpha_color)
-            if inv:
-                _image = cv2.bitwise_not(_image)
-            if bin:
-                _image = binarize_img(_image)
-            return _image
-
-        if det and rec:
-            ocr_res = []
-            for idx, img in enumerate(imgs):
-                img = preprocess_image(img)
-                dt_boxes, rec_res, _ = self.__call__(img, cls, mfd_res=mfd_res)
-                if not dt_boxes and not rec_res:
-                    ocr_res.append(None)
-                    continue
-                tmp_res = [[box.tolist(), res]
-                           for box, res in zip(dt_boxes, rec_res)]
-                ocr_res.append(tmp_res)
-            return ocr_res
-        elif det and not rec:
-            ocr_res = []
-            for idx, img in enumerate(imgs):
-                img = preprocess_image(img)
-                dt_boxes, elapse = self.text_detector(img)
-                if not dt_boxes:
-                    ocr_res.append(None)
-                    continue
-                tmp_res = [box.tolist() for box in dt_boxes]
-                ocr_res.append(tmp_res)
-            return ocr_res
-        else:
-            ocr_res = []
-            cls_res = []
-            for idx, img in enumerate(imgs):
-                if not isinstance(img, list):
-                    img = preprocess_image(img)
-                    img = [img]
-                if self.use_angle_cls and cls:
-                    img, cls_res_tmp, elapse = self.text_classifier(img)
-                    if not rec:
-                        cls_res.append(cls_res_tmp)
-                rec_res, elapse = self.text_recognizer(img)
-                ocr_res.append(rec_res)
-            if not rec:
-                return cls_res
-            return ocr_res
-
-    def __call__(self, img, cls=True, mfd_res=None):
-        time_dict = {'det': 0, 'rec': 0, 'cls': 0, 'all': 0}
-
-        if img is None:
-            logger.debug("no valid image provided")
-            return None, None, time_dict
-
-        start = time.time()
-        ori_im = img.copy()
-        dt_boxes, elapse = self.text_detector(img)
-        time_dict['det'] = elapse
-
-        if dt_boxes is None:
-            logger.debug("no dt_boxes found, elapsed : {}".format(elapse))
-            end = time.time()
-            time_dict['all'] = end - start
-            return None, None, time_dict
-        else:
-            logger.debug("dt_boxes num : {}, elapsed : {}".format(
-                len(dt_boxes), elapse))
-        img_crop_list = []
-
-        dt_boxes = sorted_boxes(dt_boxes)
-
-        dt_boxes = merge_det_boxes(dt_boxes)
-
-        if mfd_res:
-            bef = time.time()
-            dt_boxes = update_det_boxes(dt_boxes, mfd_res)
-            aft = time.time()
-            logger.debug("split text box by formula, new dt_boxes num : {}, elapsed : {}".format(
-                len(dt_boxes), aft - bef))
-
-        for bno in range(len(dt_boxes)):
-            tmp_box = copy.deepcopy(dt_boxes[bno])
-            if self.args.det_box_type == "quad":
-                img_crop = get_rotate_crop_image(ori_im, tmp_box)
-            else:
-                img_crop = get_minarea_rect_crop(ori_im, tmp_box)
-            img_crop_list.append(img_crop)
-        if self.use_angle_cls and cls:
-            img_crop_list, angle_list, elapse = self.text_classifier(
-                img_crop_list)
-            time_dict['cls'] = elapse
-            logger.debug("cls num  : {}, elapsed : {}".format(
-                len(img_crop_list), elapse))
-
-        rec_res, elapse = self.text_recognizer(img_crop_list)
-        time_dict['rec'] = elapse
-        logger.debug("rec_res num  : {}, elapsed : {}".format(
-            len(rec_res), elapse))
-        if self.args.save_crop_res:
-            self.draw_crop_rec_res(self.args.crop_res_save_dir, img_crop_list,
-                                   rec_res)
-        filter_boxes, filter_rec_res = [], []
-        for box, rec_result in zip(dt_boxes, rec_res):
-            text, score = rec_result
-            if score >= self.drop_score:
-                filter_boxes.append(box)
-                filter_rec_res.append(rec_result)
-        end = time.time()
-        time_dict['all'] = end - start
-        return filter_boxes, filter_rec_res, time_dict

+ 0 - 0
magic_pdf/model/pek_sub_modules/__init__.py → magic_pdf/model/sub_modules/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/__init__.py → magic_pdf/model/sub_modules/layout/__init__.py


+ 21 - 0
magic_pdf/model/sub_modules/layout/doclayout_yolo/DocLayoutYOLO.py

@@ -0,0 +1,21 @@
+from doclayout_yolo import YOLOv10
+
+
+class DocLayoutYOLOModel(object):
+    def __init__(self, weight, device):
+        self.model = YOLOv10(weight)
+        self.device = device
+
+    def predict(self, image):
+        layout_res = []
+        doclayout_yolo_res = self.model.predict(image, imgsz=1024, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
+        for xyxy, conf, cla in zip(doclayout_yolo_res.boxes.xyxy.cpu(), doclayout_yolo_res.boxes.conf.cpu(),
+                                   doclayout_yolo_res.boxes.cls.cpu()):
+            xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
+            new_item = {
+                'category_id': int(cla.item()),
+                'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
+                'score': round(float(conf.item()), 3),
+            }
+            layout_res.append(new_item)
+        return layout_res
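
A minimal usage sketch for the new wrapper (the weights path and input image are placeholders):

```python
from PIL import Image

from magic_pdf.model.sub_modules.layout.doclayout_yolo.DocLayoutYOLO import DocLayoutYOLOModel

image = Image.open("page.png")  # placeholder input
model = DocLayoutYOLOModel(weight="models/doclayout_yolo.pt", device="cuda")  # placeholder path
layout_res = model.predict(image)
# Each item: {'category_id': int, 'poly': [x0, y0, x1, y0, x1, y1, x0, y1], 'score': float}
```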

+ 0 - 0
magic_pdf/model/pek_sub_modules/structeqtable/__init__.py → magic_pdf/model/sub_modules/layout/doclayout_yolo/__init__.py


+ 0 - 0
magic_pdf/model/v3/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/backbone.py → magic_pdf/model/sub_modules/layout/layoutlmv3/backbone.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/beit.py → magic_pdf/model/sub_modules/layout/layoutlmv3/beit.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/deit.py → magic_pdf/model/sub_modules/layout/layoutlmv3/deit.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/cord.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/cord.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/data_collator.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/data_collator.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/funsd.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/funsd.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/image_utils.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/image_utils.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/xfund.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/xfund.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/configuration_layoutlmv3.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/configuration_layoutlmv3.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/modeling_layoutlmv3.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/modeling_layoutlmv3.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3_fast.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3_fast.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/model_init.py → magic_pdf/model/sub_modules/layout/layoutlmv3/model_init.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/rcnn_vl.py → magic_pdf/model/sub_modules/layout/layoutlmv3/rcnn_vl.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/visualizer.py → magic_pdf/model/sub_modules/layout/layoutlmv3/visualizer.py


+ 0 - 0
tests/test_data/__init__.py → magic_pdf/model/sub_modules/mfd/__init__.py


+ 12 - 0
magic_pdf/model/sub_modules/mfd/yolov8/YOLOv8.py

@@ -0,0 +1,12 @@
+from ultralytics import YOLO
+
+
+class YOLOv8MFDModel(object):
+    def __init__(self, weight, device='cpu'):
+        self.mfd_model = YOLO(weight)
+        self.device = device
+
+    def predict(self, image):
+        mfd_res = self.mfd_model.predict(image, imgsz=1888, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
+        return mfd_res
+
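
Unlike DocLayoutYOLOModel above, this wrapper returns the raw ultralytics result object; its .boxes are unpacked downstream by UnimernetModel.predict. A minimal sketch, with a placeholder weight path:

    import cv2
    from magic_pdf.model.sub_modules.mfd.yolov8.YOLOv8 import YOLOv8MFDModel

    mfd = YOLOv8MFDModel("models/yolo_v8_ft.pt", device="cpu")  # placeholder path
    mfd_res = mfd.predict(cv2.imread("page_0.png"))
    print(len(mfd_res.boxes))  # number of detected formula regions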

+ 0 - 0
tests/test_data/data_reader_writer/__init__.py → magic_pdf/model/sub_modules/mfd/yolov8/__init__.py


+ 0 - 0
tests/test_data/io/__init__.py → magic_pdf/model/sub_modules/mfr/__init__.py


+ 98 - 0
magic_pdf/model/sub_modules/mfr/unimernet/Unimernet.py

@@ -0,0 +1,98 @@
+import os
+import argparse
+import re
+
+from PIL import Image
+import torch
+from torch.utils.data import Dataset, DataLoader
+from torchvision import transforms
+from unimernet.common.config import Config
+import unimernet.tasks as tasks
+from unimernet.processors import load_processor
+
+
+class MathDataset(Dataset):
+    def __init__(self, image_paths, transform=None):
+        self.image_paths = image_paths
+        self.transform = transform
+
+    def __len__(self):
+        return len(self.image_paths)
+
+    def __getitem__(self, idx):
+        # if given a path rather than a PIL image, open it
+        if isinstance(self.image_paths[idx], str):
+            raw_image = Image.open(self.image_paths[idx])
+        else:
+            raw_image = self.image_paths[idx]
+        if self.transform:
+            image = self.transform(raw_image)
+            return image
+
+
+def latex_rm_whitespace(s: str):
+    """Remove unnecessary whitespace from LaTeX code.
+    """
+    text_reg = r'(\\(operatorname|mathrm|text|mathbf)\s?\*? {.*?})'
+    letter = '[a-zA-Z]'
+    noletter = r'[\W_^\d]'
+    names = [x[0].replace(' ', '') for x in re.findall(text_reg, s)]
+    s = re.sub(text_reg, lambda match: str(names.pop(0)), s)
+    news = s
+    while True:
+        s = news
+        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
+        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', news)
+        news = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', news)
+        if news == s:
+            break
+    return s
+
+
+class UnimernetModel(object):
+    def __init__(self, weight_dir, cfg_path, _device_='cpu'):
+
+        args = argparse.Namespace(cfg_path=cfg_path, options=None)
+        cfg = Config(args)
+        cfg.config.model.pretrained = os.path.join(weight_dir, "pytorch_model.pth")
+        cfg.config.model.model_config.model_name = weight_dir
+        cfg.config.model.tokenizer_config.path = weight_dir
+        task = tasks.setup_task(cfg)
+        self.model = task.build_model(cfg)
+        self.device = _device_
+        self.model.to(_device_)
+        self.model.eval()
+        vis_processor = load_processor('formula_image_eval', cfg.config.datasets.formula_rec_eval.vis_processor.eval)
+        self.mfr_transform = transforms.Compose([vis_processor, ])
+
+    def predict(self, mfd_res, image):
+
+        formula_list = []
+        mf_image_list = []
+        for xyxy, conf, cla in zip(mfd_res.boxes.xyxy.cpu(), mfd_res.boxes.conf.cpu(), mfd_res.boxes.cls.cpu()):
+            xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
+            new_item = {
+                'category_id': 13 + int(cla.item()),
+                'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
+                'score': round(float(conf.item()), 2),
+                'latex': '',
+            }
+            formula_list.append(new_item)
+            pil_img = Image.fromarray(image)
+            bbox_img = pil_img.crop((xmin, ymin, xmax, ymax))
+            mf_image_list.append(bbox_img)
+
+        dataset = MathDataset(mf_image_list, transform=self.mfr_transform)
+        dataloader = DataLoader(dataset, batch_size=64, num_workers=0)
+        mfr_res = []
+        for mf_img in dataloader:
+            mf_img = mf_img.to(self.device)
+            with torch.no_grad():
+                output = self.model.generate({'image': mf_img})
+            mfr_res.extend(output['pred_str'])
+        for res, latex in zip(formula_list, mfr_res):
+            res['latex'] = latex_rm_whitespace(latex)
+        return formula_list
+
+
+
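
The two formula modules are meant to be chained: YOLOv8MFDModel finds formula regions and UnimernetModel crops each detection and decodes it to LaTeX. A hedged sketch of that hand-off (weight and config paths are placeholders; the real locations come from model_configs.yaml):

    import cv2
    from magic_pdf.model.sub_modules.mfd.yolov8.YOLOv8 import YOLOv8MFDModel
    from magic_pdf.model.sub_modules.mfr.unimernet.Unimernet import UnimernetModel

    mfd = YOLOv8MFDModel("models/yolo_v8_ft.pt", device="cuda")
    mfr = UnimernetModel("models/unimernet_small", "configs/unimernet.yaml", _device_="cuda")

    image = cv2.cvtColor(cv2.imread("page_0.png"), cv2.COLOR_BGR2RGB)
    mfd_res = mfd.predict(image)            # raw ultralytics result with .boxes
    formulas = mfr.predict(mfd_res, image)  # list of dicts, each with a 'latex' field
    print(formulas[0]['latex'] if formulas else 'no formulas found')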

+ 0 - 0
tests/test_model/__init__.py → magic_pdf/model/sub_modules/mfr/unimernet/__init__.py


+ 144 - 0
magic_pdf/model/sub_modules/model_init.py

@@ -0,0 +1,144 @@
+from loguru import logger
+
+from magic_pdf.libs.Constants import MODEL_NAME
+from magic_pdf.model.model_list import AtomicModel
+from magic_pdf.model.sub_modules.layout.doclayout_yolo.DocLayoutYOLO import DocLayoutYOLOModel
+from magic_pdf.model.sub_modules.layout.layoutlmv3.model_init import Layoutlmv3_Predictor
+from magic_pdf.model.sub_modules.mfd.yolov8.YOLOv8 import YOLOv8MFDModel
+
+from magic_pdf.model.sub_modules.mfr.unimernet.Unimernet import UnimernetModel
+from magic_pdf.model.sub_modules.ocr.paddleocr.ppocr_273_mod import ModifiedPaddleOCR
+# from magic_pdf.model.sub_modules.ocr.paddleocr.ppocr_291_mod import ModifiedPaddleOCR
+from magic_pdf.model.sub_modules.table.structeqtable.struct_eqtable import StructTableModel
+from magic_pdf.model.sub_modules.table.tablemaster.tablemaster_paddle import TableMasterPaddleModel
+from magic_pdf.model.sub_modules.table.rapidtable.rapid_table import RapidTableModel
+
+
+def table_model_init(table_model_type, model_path, max_time, _device_='cpu'):
+    if table_model_type == MODEL_NAME.STRUCT_EQTABLE:
+        table_model = StructTableModel(model_path, max_new_tokens=2048, max_time=max_time)
+    elif table_model_type == MODEL_NAME.TABLE_MASTER:
+        config = {
+            "model_dir": model_path,
+            "device": _device_
+        }
+        table_model = TableMasterPaddleModel(config)
+    elif table_model_type == MODEL_NAME.RAPID_TABLE:
+        table_model = RapidTableModel()
+    else:
+        logger.error("table model type not allow")
+        exit(1)
+
+    return table_model
+
+
+def mfd_model_init(weight, device='cpu'):
+    mfd_model = YOLOv8MFDModel(weight, device)
+    return mfd_model
+
+
+def mfr_model_init(weight_dir, cfg_path, device='cpu'):
+    mfr_model = UnimernetModel(weight_dir, cfg_path, device)
+    return mfr_model
+
+
+def layout_model_init(weight, config_file, device):
+    model = Layoutlmv3_Predictor(weight, config_file, device)
+    return model
+
+
+def doclayout_yolo_model_init(weight, device='cpu'):
+    model = DocLayoutYOLOModel(weight, device)
+    return model
+
+
+def ocr_model_init(show_log: bool = False,
+                   det_db_box_thresh=0.3,
+                   lang=None,
+                   use_dilation=True,
+                   det_db_unclip_ratio=1.8,
+                   ):
+    if lang is not None:
+        model = ModifiedPaddleOCR(
+            show_log=show_log,
+            det_db_box_thresh=det_db_box_thresh,
+            lang=lang,
+            use_dilation=use_dilation,
+            det_db_unclip_ratio=det_db_unclip_ratio,
+        )
+    else:
+        model = ModifiedPaddleOCR(
+            show_log=show_log,
+            det_db_box_thresh=det_db_box_thresh,
+            use_dilation=use_dilation,
+            det_db_unclip_ratio=det_db_unclip_ratio,
+            # use_angle_cls=True,
+        )
+    return model
+
+
+class AtomModelSingleton:
+    _instance = None
+    _models = {}
+
+    def __new__(cls, *args, **kwargs):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
+
+    def get_atom_model(self, atom_model_name: str, **kwargs):
+        lang = kwargs.get("lang", None)
+        layout_model_name = kwargs.get("layout_model_name", None)
+        key = (atom_model_name, layout_model_name, lang)
+        if key not in self._models:
+            self._models[key] = atom_model_init(model_name=atom_model_name, **kwargs)
+        return self._models[key]
+
+
+def atom_model_init(model_name: str, **kwargs):
+    atom_model = None
+    if model_name == AtomicModel.Layout:
+        if kwargs.get("layout_model_name") == MODEL_NAME.LAYOUTLMv3:
+            atom_model = layout_model_init(
+                kwargs.get("layout_weights"),
+                kwargs.get("layout_config_file"),
+                kwargs.get("device")
+            )
+        elif kwargs.get("layout_model_name") == MODEL_NAME.DocLayout_YOLO:
+            atom_model = doclayout_yolo_model_init(
+                kwargs.get("doclayout_yolo_weights"),
+                kwargs.get("device")
+            )
+    elif model_name == AtomicModel.MFD:
+        atom_model = mfd_model_init(
+            kwargs.get("mfd_weights"),
+            kwargs.get("device")
+        )
+    elif model_name == AtomicModel.MFR:
+        atom_model = mfr_model_init(
+            kwargs.get("mfr_weight_dir"),
+            kwargs.get("mfr_cfg_path"),
+            kwargs.get("device")
+        )
+    elif model_name == AtomicModel.OCR:
+        atom_model = ocr_model_init(
+            kwargs.get("ocr_show_log"),
+            kwargs.get("det_db_box_thresh"),
+            kwargs.get("lang")
+        )
+    elif model_name == AtomicModel.Table:
+        atom_model = table_model_init(
+            kwargs.get("table_model_name"),
+            kwargs.get("table_model_path"),
+            kwargs.get("table_max_time"),
+            kwargs.get("device")
+        )
+    else:
+        logger.error("model name not allow")
+        exit(1)
+
+    if atom_model is None:
+        logger.error("model init failed")
+        exit(1)
+    else:
+        return atom_model
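
AtomModelSingleton caches one instance per (model name, layout model name, lang) key, so repeated calls across pages reuse already-loaded weights. A sketch of fetching the OCR model through it (keyword names follow atom_model_init above; values are illustrative):

    from magic_pdf.model.model_list import AtomicModel
    from magic_pdf.model.sub_modules.model_init import AtomModelSingleton

    manager = AtomModelSingleton()
    ocr = manager.get_atom_model(
        AtomicModel.OCR,
        ocr_show_log=False,
        det_db_box_thresh=0.3,
        lang='en',
    )
    # a second call with the same name/lang returns the cached instance
    assert ocr is manager.get_atom_model(AtomicModel.OCR, ocr_show_log=False, det_db_box_thresh=0.3, lang='en')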

+ 51 - 0
magic_pdf/model/sub_modules/model_utils.py

@@ -0,0 +1,51 @@
+import time
+
+import torch
+from PIL import Image
+from loguru import logger
+
+from magic_pdf.libs.clean_memory import clean_memory
+
+
+def crop_img(input_res, input_pil_img, crop_paste_x=0, crop_paste_y=0):
+    crop_xmin, crop_ymin = int(input_res['poly'][0]), int(input_res['poly'][1])
+    crop_xmax, crop_ymax = int(input_res['poly'][4]), int(input_res['poly'][5])
+    # Create a white background enlarged by the paste margins on each side (callers pass 50 px)
+    crop_new_width = crop_xmax - crop_xmin + crop_paste_x * 2
+    crop_new_height = crop_ymax - crop_ymin + crop_paste_y * 2
+    return_image = Image.new('RGB', (crop_new_width, crop_new_height), 'white')
+
+    # Crop image
+    crop_box = (crop_xmin, crop_ymin, crop_xmax, crop_ymax)
+    cropped_img = input_pil_img.crop(crop_box)
+    return_image.paste(cropped_img, (crop_paste_x, crop_paste_y))
+    return_list = [crop_paste_x, crop_paste_y, crop_xmin, crop_ymin, crop_xmax, crop_ymax, crop_new_width, crop_new_height]
+    return return_image, return_list
+
+
+# Select regions for OCR / formula regions / table regions
+def get_res_list_from_layout_res(layout_res):
+    ocr_res_list = []
+    table_res_list = []
+    single_page_mfdetrec_res = []
+    for res in layout_res:
+        if int(res['category_id']) in [13, 14]:
+            single_page_mfdetrec_res.append({
+                "bbox": [int(res['poly'][0]), int(res['poly'][1]),
+                         int(res['poly'][4]), int(res['poly'][5])],
+            })
+        elif int(res['category_id']) in [0, 1, 2, 4, 6, 7]:
+            ocr_res_list.append(res)
+        elif int(res['category_id']) in [5]:
+            table_res_list.append(res)
+    return ocr_res_list, table_res_list, single_page_mfdetrec_res
+
+
+def clean_vram(device, vram_threshold=8):
+    if torch.cuda.is_available() and device != 'cpu':
+        total_memory = torch.cuda.get_device_properties(device).total_memory / (1024 ** 3)  # convert bytes to GB
+        if total_memory <= vram_threshold:
+            gc_start = time.time()
+            clean_memory()
+            gc_time = round(time.time() - gc_start, 2)
+            logger.info(f"gc time: {gc_time}")

+ 0 - 0
tests/test_tools/__init__.py → magic_pdf/model/sub_modules/ocr/__init__.py


+ 0 - 0
tests/assets/more_para_test_samples/gift_files.txt → magic_pdf/model/sub_modules/ocr/paddleocr/__init__.py


+ 259 - 0
magic_pdf/model/sub_modules/ocr/paddleocr/ocr_utils.py

@@ -0,0 +1,259 @@
+import math
+
+import numpy as np
+from loguru import logger
+
+from magic_pdf.libs.boxbase import __is_overlaps_y_exceeds_threshold
+from magic_pdf.pre_proc.ocr_dict_merge import merge_spans_to_line
+
+
+def bbox_to_points(bbox):
+    """ 将bbox格式转换为四个顶点的数组 """
+    x0, y0, x1, y1 = bbox
+    return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]]).astype('float32')
+
+
+def points_to_bbox(points):
+    """ 将四个顶点的数组转换为bbox格式 """
+    x0, y0 = points[0]
+    x1, _ = points[1]
+    _, y1 = points[2]
+    return [x0, y0, x1, y1]
+
+
+def merge_intervals(intervals):
+    # Sort the intervals based on the start value
+    intervals.sort(key=lambda x: x[0])
+
+    merged = []
+    for interval in intervals:
+        # If the list of merged intervals is empty or if the current
+        # interval does not overlap with the previous, simply append it.
+        if not merged or merged[-1][1] < interval[0]:
+            merged.append(interval)
+        else:
+            # Otherwise, there is overlap, so we merge the current and previous intervals.
+            merged[-1][1] = max(merged[-1][1], interval[1])
+
+    return merged
+
+
+def remove_intervals(original, masks):
+    # Merge all mask intervals
+    merged_masks = merge_intervals(masks)
+
+    result = []
+    original_start, original_end = original
+
+    for mask in merged_masks:
+        mask_start, mask_end = mask
+
+        # If the mask starts after the original range, ignore it
+        if mask_start > original_end:
+            continue
+
+        # If the mask ends before the original range starts, ignore it
+        if mask_end < original_start:
+            continue
+
+        # Remove the masked part from the original range
+        if original_start < mask_start:
+            result.append([original_start, mask_start - 1])
+
+        original_start = max(mask_end + 1, original_start)
+
+    # Add the remaining part of the original range, if any
+    if original_start <= original_end:
+        result.append([original_start, original_end])
+
+    return result
+
+
+def update_det_boxes(dt_boxes, mfd_res):
+    new_dt_boxes = []
+    for text_box in dt_boxes:
+        text_bbox = points_to_bbox(text_box)
+        masks_list = []
+        for mf_box in mfd_res:
+            mf_bbox = mf_box['bbox']
+            if __is_overlaps_y_exceeds_threshold(text_bbox, mf_bbox):
+                masks_list.append([mf_bbox[0], mf_bbox[2]])
+        text_x_range = [text_bbox[0], text_bbox[2]]
+        text_remove_mask_range = remove_intervals(text_x_range, masks_list)
+        temp_dt_box = []
+        for text_remove_mask in text_remove_mask_range:
+            temp_dt_box.append(bbox_to_points([text_remove_mask[0], text_bbox[1], text_remove_mask[1], text_bbox[3]]))
+        if len(temp_dt_box) > 0:
+            new_dt_boxes.extend(temp_dt_box)
+    return new_dt_boxes
+
+
+def merge_overlapping_spans(spans):
+    """
+    Merges overlapping spans on the same line.
+
+    :param spans: A list of span coordinates [(x1, y1, x2, y2), ...]
+    :return: A list of merged spans
+    """
+    # Return an empty list if the input spans list is empty
+    if not spans:
+        return []
+
+    # Sort spans by their starting x-coordinate
+    spans.sort(key=lambda x: x[0])
+
+    # Initialize the list of merged spans
+    merged = []
+    for span in spans:
+        # Unpack span coordinates
+        x1, y1, x2, y2 = span
+        # If the merged list is empty or there's no horizontal overlap, add the span directly
+        if not merged or merged[-1][2] < x1:
+            merged.append(span)
+        else:
+            # If there is horizontal overlap, merge the current span with the previous one
+            last_span = merged.pop()
+            # Update the merged span's top-left corner to the smaller (x1, y1) and bottom-right to the larger (x2, y2)
+            x1 = min(last_span[0], x1)
+            y1 = min(last_span[1], y1)
+            x2 = max(last_span[2], x2)
+            y2 = max(last_span[3], y2)
+            # Add the merged span back to the list
+            merged.append((x1, y1, x2, y2))
+
+    # Return the list of merged spans
+    return merged
+
+
+def merge_det_boxes(dt_boxes):
+    """
+    Merge detection boxes.
+
+    This function takes a list of detected bounding boxes, each represented by four corner points.
+    The goal is to merge these bounding boxes into larger text regions.
+
+    Parameters:
+    dt_boxes (list): A list containing multiple text detection boxes, where each box is defined by four corner points.
+
+    Returns:
+    list: A list containing the merged text regions, where each region is represented by four corner points.
+    """
+    # Convert the detection boxes into a dictionary format with bounding boxes and type
+    dt_boxes_dict_list = []
+    angle_boxes_list = []
+    for text_box in dt_boxes:
+        text_bbox = points_to_bbox(text_box)
+        if text_bbox[2] <= text_bbox[0] or text_bbox[3] <= text_bbox[1]:
+            angle_boxes_list.append(text_box)
+            continue
+        text_box_dict = {
+            'bbox': text_bbox,
+            'type': 'text',
+        }
+        dt_boxes_dict_list.append(text_box_dict)
+
+    # Merge adjacent text regions into lines
+    lines = merge_spans_to_line(dt_boxes_dict_list)
+
+    # Initialize a new list for storing the merged text regions
+    new_dt_boxes = []
+    for line in lines:
+        line_bbox_list = []
+        for span in line:
+            line_bbox_list.append(span['bbox'])
+
+        # Merge overlapping text regions within the same line
+        merged_spans = merge_overlapping_spans(line_bbox_list)
+
+        # Convert the merged text regions back to point format and add them to the new detection box list
+        for span in merged_spans:
+            new_dt_boxes.append(bbox_to_points(span))
+
+    new_dt_boxes.extend(angle_boxes_list)
+
+    return new_dt_boxes
+
+
+def get_adjusted_mfdetrec_res(single_page_mfdetrec_res, useful_list):
+    paste_x, paste_y, xmin, ymin, xmax, ymax, new_width, new_height = useful_list
+    # Adjust the coordinates of the formula area
+    adjusted_mfdetrec_res = []
+    for mf_res in single_page_mfdetrec_res:
+        mf_xmin, mf_ymin, mf_xmax, mf_ymax = mf_res["bbox"]
+        # Adjust the coordinates of the formula area to the coordinates relative to the cropping area
+        x0 = mf_xmin - xmin + paste_x
+        y0 = mf_ymin - ymin + paste_y
+        x1 = mf_xmax - xmin + paste_x
+        y1 = mf_ymax - ymin + paste_y
+        # Filter formula blocks outside the graph
+        if any([x1 < 0, y1 < 0]) or any([x0 > new_width, y0 > new_height]):
+            continue
+        else:
+            adjusted_mfdetrec_res.append({
+                "bbox": [x0, y0, x1, y1],
+            })
+    return adjusted_mfdetrec_res
+
+
+def get_ocr_result_list(ocr_res, useful_list):
+    paste_x, paste_y, xmin, ymin, xmax, ymax, new_width, new_height = useful_list
+    ocr_result_list = []
+    for box_ocr_res in ocr_res:
+
+        p1, p2, p3, p4 = box_ocr_res[0]
+        text, score = box_ocr_res[1]
+        average_angle_degrees = calculate_angle_degrees(box_ocr_res[0])
+        if average_angle_degrees > 0.5:
+            # logger.info(f"average_angle_degrees: {average_angle_degrees}, text: {text}")
+            # the box tilts more than 0.5 degrees from the x-axis, so rectify its bounds
+            # compute the geometric center
+            x_center = sum(point[0] for point in box_ocr_res[0]) / 4
+            y_center = sum(point[1] for point in box_ocr_res[0]) / 4
+            new_height = ((p4[1] - p1[1]) + (p3[1] - p2[1])) / 2
+            new_width = p3[0] - p1[0]
+            p1 = [x_center - new_width / 2, y_center - new_height / 2]
+            p2 = [x_center + new_width / 2, y_center - new_height / 2]
+            p3 = [x_center + new_width / 2, y_center + new_height / 2]
+            p4 = [x_center - new_width / 2, y_center + new_height / 2]
+
+        # Convert the coordinates back to the original coordinate system
+        p1 = [p1[0] - paste_x + xmin, p1[1] - paste_y + ymin]
+        p2 = [p2[0] - paste_x + xmin, p2[1] - paste_y + ymin]
+        p3 = [p3[0] - paste_x + xmin, p3[1] - paste_y + ymin]
+        p4 = [p4[0] - paste_x + xmin, p4[1] - paste_y + ymin]
+
+        ocr_result_list.append({
+            'category_id': 15,
+            'poly': p1 + p2 + p3 + p4,
+            'score': float(round(score, 2)),
+            'text': text,
+        })
+
+    return ocr_result_list
+
+
+def calculate_angle_degrees(poly):
+    # endpoints of the two diagonals
+    diagonal1 = (poly[0], poly[2])
+    diagonal2 = (poly[1], poly[3])
+
+    # slope of each diagonal
+    def slope(p1, p2):
+        return (p2[1] - p1[1]) / (p2[0] - p1[0]) if p2[0] != p1[0] else float('inf')
+
+    slope1 = slope(diagonal1[0], diagonal1[1])
+    slope2 = slope(diagonal2[0], diagonal2[1])
+
+    # angles between the diagonals and the x-axis, in radians
+    angle1_radians = math.atan(slope1)
+    angle2_radians = math.atan(slope2)
+
+    # convert radians to degrees
+    angle1_degrees = math.degrees(angle1_radians)
+    angle2_degrees = math.degrees(angle2_radians)
+
+    # average the two diagonals' angles to the x-axis
+    average_angle_degrees = abs((angle1_degrees + angle2_degrees) / 2)
+    # logger.info(f"average_angle_degrees: {average_angle_degrees}")
+    return average_angle_degrees
+
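
update_det_boxes relies on remove_intervals to cut formula x-ranges out of each text box before recognition. A worked example of that interval arithmetic:

    from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import remove_intervals

    # a text line spans x = 0..100; two formulas mask x = 20..30 and x = 60..80
    print(remove_intervals([0, 100], [[20, 30], [60, 80]]))
    # -> [[0, 19], [31, 59], [81, 100]]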

+ 168 - 0
magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_273_mod.py

@@ -0,0 +1,168 @@
+import copy
+import time
+
+import cv2
+import numpy as np
+from paddleocr import PaddleOCR
+from paddleocr.paddleocr import check_img, logger
+from paddleocr.ppocr.utils.utility import alpha_to_color, binarize_img
+from paddleocr.tools.infer.predict_system import sorted_boxes
+from paddleocr.tools.infer.utility import get_rotate_crop_image, get_minarea_rect_crop
+
+from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import update_det_boxes, merge_det_boxes
+
+
+class ModifiedPaddleOCR(PaddleOCR):
+    def ocr(self,
+            img,
+            det=True,
+            rec=True,
+            cls=True,
+            bin=False,
+            inv=False,
+            alpha_color=(255, 255, 255),
+            mfd_res=None,
+            ):
+        """
+        OCR with PaddleOCR
+        args:
+            img: img for OCR, support ndarray, img_path and list or ndarray
+            det: use text detection or not. If False, only rec will be exec. Default is True
+            rec: use text recognition or not. If False, only det will be exec. Default is True
+            cls: use angle classifier or not. Default is True. If True, the text with rotation of 180 degrees can be recognized. If no text is rotated by 180 degrees, use cls=False to get better performance. Text with rotation of 90 or 270 degrees can be recognized even if cls=False.
+            bin: binarize image to black and white. Default is False.
+            inv: invert image colors. Default is False.
+            alpha_color: set RGB color Tuple for transparent parts replacement. Default is pure white.
+        """
+        assert isinstance(img, (np.ndarray, list, str, bytes))
+        if isinstance(img, list) and det == True:
+            logger.error('When input a list of images, det must be false')
+            exit(0)
+        if cls == True and self.use_angle_cls == False:
+            pass
+            # logger.warning(
+            #     'Since the angle classifier is not initialized, it will not be used during the forward process'
+            # )
+
+        img = check_img(img)
+        # for infer pdf file
+        if isinstance(img, list):
+            if self.page_num > len(img) or self.page_num == 0:
+                self.page_num = len(img)
+            imgs = img[:self.page_num]
+        else:
+            imgs = [img]
+
+        def preprocess_image(_image):
+            _image = alpha_to_color(_image, alpha_color)
+            if inv:
+                _image = cv2.bitwise_not(_image)
+            if bin:
+                _image = binarize_img(_image)
+            return _image
+
+        if det and rec:
+            ocr_res = []
+            for idx, img in enumerate(imgs):
+                img = preprocess_image(img)
+                dt_boxes, rec_res, _ = self.__call__(img, cls, mfd_res=mfd_res)
+                if not dt_boxes and not rec_res:
+                    ocr_res.append(None)
+                    continue
+                tmp_res = [[box.tolist(), res]
+                           for box, res in zip(dt_boxes, rec_res)]
+                ocr_res.append(tmp_res)
+            return ocr_res
+        elif det and not rec:
+            ocr_res = []
+            for idx, img in enumerate(imgs):
+                img = preprocess_image(img)
+                dt_boxes, elapse = self.text_detector(img)
+                if not dt_boxes:
+                    ocr_res.append(None)
+                    continue
+                tmp_res = [box.tolist() for box in dt_boxes]
+                ocr_res.append(tmp_res)
+            return ocr_res
+        else:
+            ocr_res = []
+            cls_res = []
+            for idx, img in enumerate(imgs):
+                if not isinstance(img, list):
+                    img = preprocess_image(img)
+                    img = [img]
+                if self.use_angle_cls and cls:
+                    img, cls_res_tmp, elapse = self.text_classifier(img)
+                    if not rec:
+                        cls_res.append(cls_res_tmp)
+                rec_res, elapse = self.text_recognizer(img)
+                ocr_res.append(rec_res)
+            if not rec:
+                return cls_res
+            return ocr_res
+
+    def __call__(self, img, cls=True, mfd_res=None):
+        time_dict = {'det': 0, 'rec': 0, 'cls': 0, 'all': 0}
+
+        if img is None:
+            logger.debug("no valid image provided")
+            return None, None, time_dict
+
+        start = time.time()
+        ori_im = img.copy()
+        dt_boxes, elapse = self.text_detector(img)
+        time_dict['det'] = elapse
+
+        if dt_boxes is None:
+            logger.debug("no dt_boxes found, elapsed : {}".format(elapse))
+            end = time.time()
+            time_dict['all'] = end - start
+            return None, None, time_dict
+        else:
+            logger.debug("dt_boxes num : {}, elapsed : {}".format(
+                len(dt_boxes), elapse))
+        img_crop_list = []
+
+        dt_boxes = sorted_boxes(dt_boxes)
+
+        # @todo merging currently happens at the bbox level, which handles skewed text lines poorly; it should be reworked into a poly-aware merge
+        # dt_boxes = merge_det_boxes(dt_boxes)
+
+
+        if mfd_res:
+            bef = time.time()
+            dt_boxes = update_det_boxes(dt_boxes, mfd_res)
+            aft = time.time()
+            logger.debug("split text box by formula, new dt_boxes num : {}, elapsed : {}".format(
+                len(dt_boxes), aft - bef))
+
+        for bno in range(len(dt_boxes)):
+            tmp_box = copy.deepcopy(dt_boxes[bno])
+            if self.args.det_box_type == "quad":
+                img_crop = get_rotate_crop_image(ori_im, tmp_box)
+            else:
+                img_crop = get_minarea_rect_crop(ori_im, tmp_box)
+            img_crop_list.append(img_crop)
+        if self.use_angle_cls and cls:
+            img_crop_list, angle_list, elapse = self.text_classifier(
+                img_crop_list)
+            time_dict['cls'] = elapse
+            logger.debug("cls num  : {}, elapsed : {}".format(
+                len(img_crop_list), elapse))
+
+        rec_res, elapse = self.text_recognizer(img_crop_list)
+        time_dict['rec'] = elapse
+        logger.debug("rec_res num  : {}, elapsed : {}".format(
+            len(rec_res), elapse))
+        if self.args.save_crop_res:
+            self.draw_crop_rec_res(self.args.crop_res_save_dir, img_crop_list,
+                                   rec_res)
+        filter_boxes, filter_rec_res = [], []
+        for box, rec_result in zip(dt_boxes, rec_res):
+            text, score = rec_result
+            if score >= self.drop_score:
+                filter_boxes.append(box)
+                filter_rec_res.append(rec_result)
+        end = time.time()
+        time_dict['all'] = end - start
+        return filter_boxes, filter_rec_res, time_dict
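
The key change versus upstream PaddleOCR is the extra mfd_res argument, which splits detected text boxes around formula regions before recognition. A hedged call sketch (the image path and formula bbox are illustrative):

    import cv2
    from magic_pdf.model.sub_modules.ocr.paddleocr.ppocr_273_mod import ModifiedPaddleOCR

    ocr = ModifiedPaddleOCR(show_log=False, det_db_box_thresh=0.3)
    img = cv2.imread('line_with_formula.png')
    # each mfd entry only needs a 'bbox' in [x0, y0, x1, y1] form (see update_det_boxes)
    mfd_res = [{'bbox': [120, 10, 240, 42]}]
    result = ocr.ocr(img, mfd_res=mfd_res)[0]  # list of [quad_points, (text, score)]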

+ 213 - 0
magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_291_mod.py

@@ -0,0 +1,213 @@
+import copy
+import time
+
+
+import cv2
+import numpy as np
+from paddleocr import PaddleOCR
+from paddleocr.paddleocr import check_img, logger
+from paddleocr.ppocr.utils.utility import alpha_to_color, binarize_img
+from paddleocr.tools.infer.predict_system import sorted_boxes
+from paddleocr.tools.infer.utility import slice_generator, merge_fragmented, get_rotate_crop_image, \
+    get_minarea_rect_crop
+
+from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import update_det_boxes
+
+
+class ModifiedPaddleOCR(PaddleOCR):
+
+    def ocr(
+        self,
+        img,
+        det=True,
+        rec=True,
+        cls=True,
+        bin=False,
+        inv=False,
+        alpha_color=(255, 255, 255),
+        slice={},
+        mfd_res=None,
+    ):
+        """
+        OCR with PaddleOCR
+
+        Args:
+            img: Image for OCR. It can be an ndarray, img_path, or a list of ndarrays.
+            det: Use text detection or not. If False, only text recognition will be executed. Default is True.
+            rec: Use text recognition or not. If False, only text detection will be executed. Default is True.
+            cls: Use angle classifier or not. Default is True. If True, the text with a rotation of 180 degrees can be recognized. If no text is rotated by 180 degrees, use cls=False to get better performance.
+            bin: Binarize image to black and white. Default is False.
+            inv: Invert image colors. Default is False.
+            alpha_color: Set RGB color Tuple for transparent parts replacement. Default is pure white.
+            slice: Use sliding window inference for large images. Both det and rec must be True. Requires int values for slice["horizontal_stride"], slice["vertical_stride"], slice["merge_x_thres"], slice["merge_y_thres"] (See doc/doc_en/slice_en.md). Default is {}.
+
+        Returns:
+            If both det and rec are True, returns a list of OCR results for each image. Each OCR result is a list of bounding boxes and recognized text for each detected text region.
+            If det is True and rec is False, returns a list of detected bounding boxes for each image.
+            If det is False and rec is True, returns a list of recognized text for each image.
+            If both det and rec are False, returns a list of angle classification results for each image.
+
+        Raises:
+            AssertionError: If the input image is not of type ndarray, list, str, or bytes.
+            SystemExit: If det is True and the input is a list of images.
+
+        Note:
+            - If the angle classifier is not initialized (use_angle_cls=False), it will not be used during the forward process.
+            - For PDF files, if the input is a list of images and the page_num is specified, only the first page_num images will be processed.
+            - The preprocess_image function is used to preprocess the input image by applying alpha color replacement, inversion, and binarization if specified.
+        """
+        assert isinstance(img, (np.ndarray, list, str, bytes))
+        if isinstance(img, list) and det == True:
+            logger.error("When input a list of images, det must be false")
+            exit(0)
+        if cls == True and self.use_angle_cls == False:
+            logger.warning(
+                "Since the angle classifier is not initialized, it will not be used during the forward process"
+            )
+
+        img, flag_gif, flag_pdf = check_img(img, alpha_color)
+        # for infer pdf file
+        if isinstance(img, list) and flag_pdf:
+            if self.page_num > len(img) or self.page_num == 0:
+                imgs = img
+            else:
+                imgs = img[: self.page_num]
+        else:
+            imgs = [img]
+
+        def preprocess_image(_image):
+            _image = alpha_to_color(_image, alpha_color)
+            if inv:
+                _image = cv2.bitwise_not(_image)
+            if bin:
+                _image = binarize_img(_image)
+            return _image
+
+        if det and rec:
+            ocr_res = []
+            for img in imgs:
+                img = preprocess_image(img)
+                dt_boxes, rec_res, _ = self.__call__(img, cls, slice, mfd_res=mfd_res)
+                if not dt_boxes and not rec_res:
+                    ocr_res.append(None)
+                    continue
+                tmp_res = [[box.tolist(), res] for box, res in zip(dt_boxes, rec_res)]
+                ocr_res.append(tmp_res)
+            return ocr_res
+        elif det and not rec:
+            ocr_res = []
+            for img in imgs:
+                img = preprocess_image(img)
+                dt_boxes, elapse = self.text_detector(img)
+                if dt_boxes.size == 0:
+                    ocr_res.append(None)
+                    continue
+                tmp_res = [box.tolist() for box in dt_boxes]
+                ocr_res.append(tmp_res)
+            return ocr_res
+        else:
+            ocr_res = []
+            cls_res = []
+            for img in imgs:
+                if not isinstance(img, list):
+                    img = preprocess_image(img)
+                    img = [img]
+                if self.use_angle_cls and cls:
+                    img, cls_res_tmp, elapse = self.text_classifier(img)
+                    if not rec:
+                        cls_res.append(cls_res_tmp)
+                rec_res, elapse = self.text_recognizer(img)
+                ocr_res.append(rec_res)
+            if not rec:
+                return cls_res
+            return ocr_res
+
+    def __call__(self, img, cls=True, slice={}, mfd_res=None):
+        time_dict = {"det": 0, "rec": 0, "cls": 0, "all": 0}
+
+        if img is None:
+            logger.debug("no valid image provided")
+            return None, None, time_dict
+
+        start = time.time()
+        ori_im = img.copy()
+        if slice:
+            slice_gen = slice_generator(
+                img,
+                horizontal_stride=slice["horizontal_stride"],
+                vertical_stride=slice["vertical_stride"],
+            )
+            elapsed = []
+            dt_slice_boxes = []
+            for slice_crop, v_start, h_start in slice_gen:
+                dt_boxes, elapse = self.text_detector(slice_crop, use_slice=True)
+                if dt_boxes.size:
+                    dt_boxes[:, :, 0] += h_start
+                    dt_boxes[:, :, 1] += v_start
+                    dt_slice_boxes.append(dt_boxes)
+                    elapsed.append(elapse)
+            dt_boxes = np.concatenate(dt_slice_boxes)
+
+            dt_boxes = merge_fragmented(
+                boxes=dt_boxes,
+                x_threshold=slice["merge_x_thres"],
+                y_threshold=slice["merge_y_thres"],
+            )
+            elapse = sum(elapsed)
+        else:
+            dt_boxes, elapse = self.text_detector(img)
+
+        time_dict["det"] = elapse
+
+        if dt_boxes is None:
+            logger.debug("no dt_boxes found, elapsed : {}".format(elapse))
+            end = time.time()
+            time_dict["all"] = end - start
+            return None, None, time_dict
+        else:
+            logger.debug(
+                "dt_boxes num : {}, elapsed : {}".format(len(dt_boxes), elapse)
+            )
+        img_crop_list = []
+
+        dt_boxes = sorted_boxes(dt_boxes)
+
+        if mfd_res:
+            bef = time.time()
+            dt_boxes = update_det_boxes(dt_boxes, mfd_res)
+            aft = time.time()
+            logger.debug("split text box by formula, new dt_boxes num : {}, elapsed : {}".format(
+                len(dt_boxes), aft - bef))
+
+        for bno in range(len(dt_boxes)):
+            tmp_box = copy.deepcopy(dt_boxes[bno])
+            if self.args.det_box_type == "quad":
+                img_crop = get_rotate_crop_image(ori_im, tmp_box)
+            else:
+                img_crop = get_minarea_rect_crop(ori_im, tmp_box)
+            img_crop_list.append(img_crop)
+        if self.use_angle_cls and cls:
+            img_crop_list, angle_list, elapse = self.text_classifier(img_crop_list)
+            time_dict["cls"] = elapse
+            logger.debug(
+                "cls num  : {}, elapsed : {}".format(len(img_crop_list), elapse)
+            )
+        if len(img_crop_list) > 1000:
+            logger.debug(
+                f"rec crops num: {len(img_crop_list)}, time and memory cost may be large."
+            )
+
+        rec_res, elapse = self.text_recognizer(img_crop_list)
+        time_dict["rec"] = elapse
+        logger.debug("rec_res num  : {}, elapsed : {}".format(len(rec_res), elapse))
+        if self.args.save_crop_res:
+            self.draw_crop_rec_res(self.args.crop_res_save_dir, img_crop_list, rec_res)
+        filter_boxes, filter_rec_res = [], []
+        for box, rec_result in zip(dt_boxes, rec_res):
+            text, score = rec_result[0], rec_result[1]
+            if score >= self.drop_score:
+                filter_boxes.append(box)
+                filter_rec_res.append(rec_result)
+        end = time.time()
+        time_dict["all"] = end - start
+        return filter_boxes, filter_rec_res, time_dict

+ 0 - 0
tests/assets/more_para_test_samples/zlib_files.txt → magic_pdf/model/sub_modules/reading_oreder/__init__.py


+ 0 - 0
magic_pdf/model/sub_modules/reading_oreder/layoutreader/__init__.py


+ 0 - 0
magic_pdf/model/v3/helpers.py → magic_pdf/model/sub_modules/reading_oreder/layoutreader/helpers.py


+ 242 - 0
magic_pdf/model/sub_modules/reading_oreder/layoutreader/xycut.py

@@ -0,0 +1,242 @@
+from typing import List
+import cv2
+import numpy as np
+
+
+def projection_by_bboxes(boxes: np.array, axis: int) -> np.ndarray:
+    """
+    Build a projection histogram from a set of bboxes, output per pixel.
+
+    Args:
+        boxes: [N, 4]
+        axis: 0 - project x coordinates onto the horizontal axis, 1 - project y coordinates onto the vertical axis
+
+    Returns:
+        1D projection histogram whose length is the maximum coordinate along the projection axis (the page's real size is not needed, since we only look for gaps between text boxes)
+
+    """
+    assert axis in [0, 1]
+    length = np.max(boxes[:, axis::2])
+    res = np.zeros(length, dtype=int)
+    # TODO: how to remove for loop?
+    for start, end in boxes[:, axis::2]:
+        res[start:end] += 1
+    return res
+
+
+# from: https://dothinking.github.io/2021-06-19-%E9%80%92%E5%BD%92%E6%8A%95%E5%BD%B1%E5%88%86%E5%89%B2%E7%AE%97%E6%B3%95/#:~:text=%E9%80%92%E5%BD%92%E6%8A%95%E5%BD%B1%E5%88%86%E5%89%B2%EF%BC%88Recursive%20XY,%EF%BC%8C%E5%8F%AF%E4%BB%A5%E5%88%92%E5%88%86%E6%AE%B5%E8%90%BD%E3%80%81%E8%A1%8C%E3%80%82
+def split_projection_profile(arr_values: np.array, min_value: float, min_gap: float):
+    """Split projection profile:
+
+    ```
+                              ┌──┐
+         arr_values           │  │       ┌─┐───
+             ┌──┐             │  │       │ │ |
+             │  │             │  │ ┌───┐ │ │min_value
+             │  │<- min_gap ->│  │ │   │ │ │ |
+         ────┴──┴─────────────┴──┴─┴───┴─┴─┴─┴───
+         0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
+    ```
+
+    Args:
+        arr_values (np.array): 1-d array representing the projection profile.
+        min_value (float): Ignore the profile if `arr_value` is less than `min_value`.
+        min_gap (float): Ignore the gap if less than this value.
+
+    Returns:
+        tuple: Start indexes and end indexes of split groups.
+    """
+    # all indexes with projection height exceeding the threshold
+    arr_index = np.where(arr_values > min_value)[0]
+    if not len(arr_index):
+        return
+
+    # find zero intervals between adjacent projections
+    # |  |                    ||
+    # ||||<- zero-interval -> |||||
+    arr_diff = arr_index[1:] - arr_index[0:-1]
+    arr_diff_index = np.where(arr_diff > min_gap)[0]
+    arr_zero_intvl_start = arr_index[arr_diff_index]
+    arr_zero_intvl_end = arr_index[arr_diff_index + 1]
+
+    # convert to index of projection range:
+    # the start index of zero interval is the end index of projection
+    arr_start = np.insert(arr_zero_intvl_end, 0, arr_index[0])
+    arr_end = np.append(arr_zero_intvl_start, arr_index[-1])
+    arr_end += 1  # end index will be excluded as index slice
+
+    return arr_start, arr_end
+
+
+def recursive_xy_cut(boxes: np.ndarray, indices: List[int], res: List[int]):
+    """
+
+    Args:
+        boxes: (N, 4)
+        indices: indices of the boxes within the original data, carried through the recursion
+        res: list that collects the output
+
+    """
+    # project onto the y-axis
+    assert len(boxes) == len(indices)
+
+    _indices = boxes[:, 1].argsort()
+    y_sorted_boxes = boxes[_indices]
+    y_sorted_indices = indices[_indices]
+
+    # debug_vis(y_sorted_boxes, y_sorted_indices)
+
+    y_projection = projection_by_bboxes(boxes=y_sorted_boxes, axis=1)
+    pos_y = split_projection_profile(y_projection, 0, 1)
+    if not pos_y:
+        return
+
+    arr_y0, arr_y1 = pos_y
+    for r0, r1 in zip(arr_y0, arr_y1):
+        # [r0, r1] is a horizontal band that contains bboxes; each band is split vertically next
+        _indices = (r0 <= y_sorted_boxes[:, 1]) & (y_sorted_boxes[:, 1] < r1)
+
+        y_sorted_boxes_chunk = y_sorted_boxes[_indices]
+        y_sorted_indices_chunk = y_sorted_indices[_indices]
+
+        _indices = y_sorted_boxes_chunk[:, 0].argsort()
+        x_sorted_boxes_chunk = y_sorted_boxes_chunk[_indices]
+        x_sorted_indices_chunk = y_sorted_indices_chunk[_indices]
+
+        # project onto the x-axis
+        x_projection = projection_by_bboxes(boxes=x_sorted_boxes_chunk, axis=0)
+        pos_x = split_projection_profile(x_projection, 0, 1)
+        if not pos_x:
+            continue
+
+        arr_x0, arr_x1 = pos_x
+        if len(arr_x0) == 1:
+            # cannot split any further along x
+            res.extend(x_sorted_indices_chunk)
+            continue
+
+        # splittable along x: recurse into each part
+        for c0, c1 in zip(arr_x0, arr_x1):
+            _indices = (c0 <= x_sorted_boxes_chunk[:, 0]) & (
+                x_sorted_boxes_chunk[:, 0] < c1
+            )
+            recursive_xy_cut(
+                x_sorted_boxes_chunk[_indices], x_sorted_indices_chunk[_indices], res
+            )
+
+
+def points_to_bbox(points):
+    assert len(points) == 8
+
+    # [x1,y1,x2,y2,x3,y3,x4,y4]
+    left = min(points[::2])
+    right = max(points[::2])
+    top = min(points[1::2])
+    bottom = max(points[1::2])
+
+    left = max(left, 0)
+    top = max(top, 0)
+    right = max(right, 0)
+    bottom = max(bottom, 0)
+    return [left, top, right, bottom]
+
+
+def bbox2points(bbox):
+    left, top, right, bottom = bbox
+    return [left, top, right, top, right, bottom, left, bottom]
+
+
+def vis_polygon(img, points, thickness=2, color=None):
+    br2bl_color = color
+    tl2tr_color = color
+    tr2br_color = color
+    bl2tl_color = color
+    cv2.line(
+        img,
+        (points[0][0], points[0][1]),
+        (points[1][0], points[1][1]),
+        color=tl2tr_color,
+        thickness=thickness,
+    )
+
+    cv2.line(
+        img,
+        (points[1][0], points[1][1]),
+        (points[2][0], points[2][1]),
+        color=tr2br_color,
+        thickness=thickness,
+    )
+
+    cv2.line(
+        img,
+        (points[2][0], points[2][1]),
+        (points[3][0], points[3][1]),
+        color=br2bl_color,
+        thickness=thickness,
+    )
+
+    cv2.line(
+        img,
+        (points[3][0], points[3][1]),
+        (points[0][0], points[0][1]),
+        color=bl2tl_color,
+        thickness=thickness,
+    )
+    return img
+
+
+def vis_points(
+    img: np.ndarray, points, texts: List[str] = None, color=(0, 200, 0)
+) -> np.ndarray:
+    """
+
+    Args:
+        img:
+        points: [N, 8]  8: x1,y1,x2,y2,x3,y3,x3,y4
+        texts:
+        color:
+
+    Returns:
+
+    """
+    points = np.array(points)
+    if texts is not None:
+        assert len(texts) == points.shape[0]
+
+    for i, _points in enumerate(points):
+        vis_polygon(img, _points.reshape(-1, 2), thickness=2, color=color)
+        bbox = points_to_bbox(_points)
+        left, top, right, bottom = bbox
+        cx = (left + right) // 2
+        cy = (top + bottom) // 2
+
+        txt = texts[i]
+        font = cv2.FONT_HERSHEY_SIMPLEX
+        cat_size = cv2.getTextSize(txt, font, 0.5, 2)[0]
+
+        img = cv2.rectangle(
+            img,
+            (cx - 5 * len(txt), cy - cat_size[1] - 5),
+            (cx - 5 * len(txt) + cat_size[0], cy - 5),
+            color,
+            -1,
+        )
+
+        img = cv2.putText(
+            img,
+            txt,
+            (cx - 5 * len(txt), cy - 5),
+            font,
+            0.5,
+            (255, 255, 255),
+            thickness=1,
+            lineType=cv2.LINE_AA,
+        )
+
+    return img
+
+
+def vis_polygons_with_index(image, points):
+    texts = [str(i) for i in range(len(points))]
+    res_img = vis_points(image.copy(), points, texts)
+    return res_img
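
recursive_xy_cut fills res in place with a reading order expressed as indices into the original box array. A small sketch with two columns (the directory name "reading_oreder" follows the spelling committed in this PR):

    import numpy as np
    from magic_pdf.model.sub_modules.reading_oreder.layoutreader.xycut import recursive_xy_cut

    # two boxes stacked in a left column, one box in a right column
    boxes = np.array([[0, 0, 100, 50], [0, 60, 100, 110], [120, 0, 220, 110]])
    res = []
    recursive_xy_cut(boxes, np.arange(len(boxes)), res)
    print([int(i) for i in res])  # column-major reading order: [0, 1, 2]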

+ 0 - 0
magic_pdf/model/sub_modules/table/__init__.py


+ 0 - 0
magic_pdf/model/sub_modules/table/rapidtable/__init__.py


+ 14 - 0
magic_pdf/model/sub_modules/table/rapidtable/rapid_table.py

@@ -0,0 +1,14 @@
+import numpy as np
+from rapid_table import RapidTable
+from rapidocr_paddle import RapidOCR
+
+
+class RapidTableModel(object):
+    def __init__(self):
+        self.table_model = RapidTable()
+        self.ocr_engine = RapidOCR(det_use_cuda=True, cls_use_cuda=True, rec_use_cuda=True)
+
+    def predict(self, image):
+        ocr_result, _ = self.ocr_engine(np.asarray(image))
+        html_code, table_cell_bboxes, elapse = self.table_model(np.asarray(image), ocr_result)
+        return html_code, table_cell_bboxes, elapse
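
RapidTableModel wires rapidocr_paddle output straight into rapid_table and returns the HTML, the cell boxes, and the elapsed time. A minimal sketch (the image path is illustrative):

    from PIL import Image
    from magic_pdf.model.sub_modules.table.rapidtable.rapid_table import RapidTableModel

    table_model = RapidTableModel()
    html, cell_bboxes, elapse = table_model.predict(Image.open('table.png'))
    print(html)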

+ 0 - 0
magic_pdf/model/sub_modules/table/structeqtable/__init__.py


+ 3 - 11
magic_pdf/model/pek_sub_modules/structeqtable/StructTableModel.py → magic_pdf/model/sub_modules/table/structeqtable/struct_eqtable.py

@@ -1,8 +1,8 @@
-import re
-
 import torch
 from struct_eqtable import build_model
 
+from magic_pdf.model.sub_modules.table.table_utils import minify_html
+
 
 class StructTableModel:
     def __init__(self, model_path, max_new_tokens=1024, max_time=60):
@@ -31,15 +31,7 @@ class StructTableModel:
         )
 
         if output_format == "html":
-            results = [self.minify_html(html) for html in results]
+            results = [minify_html(html) for html in results]
 
         return results
 
-    def minify_html(self, html):
-        # collapse runs of whitespace
-        html = re.sub(r'\s+', ' ', html)
-        # strip whitespace around '>'
-        html = re.sub(r'\s*>\s*', '>', html)
-        # strip whitespace around '<'
-        html = re.sub(r'\s*<\s*', '<', html)
-        return html.strip()

+ 11 - 0
magic_pdf/model/sub_modules/table/table_utils.py

@@ -0,0 +1,11 @@
+import re
+
+
+def minify_html(html):
+    # collapse runs of whitespace
+    html = re.sub(r'\s+', ' ', html)
+    # strip whitespace around '>'
+    html = re.sub(r'\s*>\s*', '>', html)
+    # strip whitespace around '<'
+    html = re.sub(r'\s*<\s*', '<', html)
+    return html.strip()
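
minify_html collapses whitespace runs and strips the spaces around tag brackets, keeping the table HTML emitted by StructTableModel compact. For example:

    from magic_pdf.model.sub_modules.table.table_utils import minify_html

    html = '<table>\n  <tr>\n    <td> 1 </td>\n  </tr>\n</table>'
    print(minify_html(html))  # -> '<table><tr><td>1</td></tr></table>'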

+ 0 - 0
magic_pdf/model/sub_modules/table/tablemaster/__init__.py


+ 1 - 1
magic_pdf/model/ppTableModel.py → magic_pdf/model/sub_modules/table/tablemaster/tablemaster_paddle.py

@@ -7,7 +7,7 @@ from PIL import Image
 import numpy as np
 
 
-class ppTableModel(object):
+class TableMasterPaddleModel(object):
     """
         This class is responsible for converting image of table into HTML format using a pre-trained model.
 

+ 13 - 15
magic_pdf/para/para_split_v3.py

@@ -77,14 +77,12 @@ def __is_list_or_index_block(block):
 
        # if the first line is not flush left but is flush right, and the last line is flush left but not flush right (the first line may be allowed to fall short on the right)
         if (first_line['bbox'][0] - block['bbox_fs'][0] > line_height / 2 and
-                # block['bbox_fs'][2] - first_line['bbox'][2] < line_height and
                 abs(last_line['bbox'][0] - block['bbox_fs'][0]) < line_height / 2 and
                 block['bbox_fs'][2] - last_line['bbox'][2] > line_height
         ):
             multiple_para_flag = True
 
         for line in block['lines']:
-
             line_mid_x = (line['bbox'][0] + line['bbox'][2]) / 2
             block_mid_x = (block['bbox_fs'][0] + block['bbox_fs'][2]) / 2
             if (
@@ -102,13 +100,13 @@ def __is_list_or_index_block(block):
                 if span_type == ContentType.Text:
                     line_text += span['content'].strip()
 
+            # append every line's text, empty lines included, to keep the list the same length as block['lines']
             lines_text_list.append(line_text)
 
            # count whether more than 2 lines are flush left; "flush" means abs(block['bbox_fs'][0] - line['bbox'][0]) < line_height/2
             if abs(block['bbox_fs'][0] - line['bbox'][0]) < line_height / 2:
                 left_close_num += 1
             elif line['bbox'][0] - block['bbox_fs'][0] > line_height:
-                # logger.info(f"{line_text}, {block['bbox_fs']}, {line['bbox']}")
                 left_not_close_num += 1
 
            # check whether the right side is flush
@@ -117,7 +115,6 @@ def __is_list_or_index_block(block):
             else:
                # when the right side is not flush, require a sizeable gap; ~0.26 of the block width is used as an eyeballed threshold
                 closed_area = 0.26 * block_weight
-                # closed_area = 5 * line_height
                 if block['bbox_fs'][2] - line['bbox'][2] > closed_area:
                     right_not_close_num += 1
 
@@ -128,6 +125,7 @@ def __is_list_or_index_block(block):
         num_start_count = 0
         num_end_count = 0
         flag_end_count = 0
+
         if len(lines_text_list) > 0:
             for line_text in lines_text_list:
                 if len(line_text) > 0:
@@ -138,11 +136,10 @@ def __is_list_or_index_block(block):
                     if line_text[-1].isdigit():
                         num_end_count += 1
 
-            if flag_end_count / len(lines_text_list) >= 0.8:
-                line_end_flag = True
-
             if num_start_count / len(lines_text_list) >= 0.8 or num_end_count / len(lines_text_list) >= 0.8:
                 line_num_flag = True
+            if flag_end_count / len(lines_text_list) >= 0.8:
+                line_end_flag = True
 
        # some tables of contents are not flush on the right; for now a block counts as an index if either the left or right side is fully flush and the numeric pattern holds
         if ((left_close_num / len(block['lines']) >= 0.8 or right_close_num / len(block['lines']) >= 0.8)
@@ -176,7 +173,7 @@ def __is_list_or_index_block(block):
                # the case where most line items carry an end flag: split items by that end flag
                 elif line_end_flag:
                     for i, line in enumerate(block['lines']):
-                        if lines_text_list[i][-1] in LIST_END_FLAG:
+                        if len(lines_text_list[i]) > 0 and lines_text_list[i][-1] in LIST_END_FLAG:
                             line[ListLineTag.IS_LIST_END_LINE] = True
                             if i + 1 < len(block['lines']):
                                 block['lines'][i + 1][ListLineTag.IS_LIST_START_LINE] = True
@@ -187,17 +184,18 @@ def __is_list_or_index_block(block):
                         if line_start_flag:
                             line[ListLineTag.IS_LIST_START_LINE] = True
                             line_start_flag = False
-                        # elif abs(block['bbox_fs'][2] - line['bbox'][2]) > line_height:
+
                         if abs(block['bbox_fs'][2] - line['bbox'][2]) > 0.1 * block_weight:
                             line[ListLineTag.IS_LIST_END_LINE] = True
                             line_start_flag = True
-            # a special indented ordered list: start lines are not flush left and begin with a digit; end lines end with IS_LIST_END_LINE and their count matches the start lines
-            elif num_start_count >= 2 and num_start_count == flag_end_count:  # keep it simple: ignore the not-flush-left case for now
+            # a special indented ordered list: start lines are not flush left and begin with a digit; end lines end with IS_LIST_END_FLAG and their count matches the start lines
+            elif num_start_count >= 2 and num_start_count == flag_end_count:
                 for i, line in enumerate(block['lines']):
-                    if lines_text_list[i][0].isdigit():
-                        line[ListLineTag.IS_LIST_START_LINE] = True
-                    if lines_text_list[i][-1] in LIST_END_FLAG:
-                        line[ListLineTag.IS_LIST_END_LINE] = True
+                    if len(lines_text_list[i]) > 0:
+                        if lines_text_list[i][0].isdigit():
+                            line[ListLineTag.IS_LIST_START_LINE] = True
+                        if lines_text_list[i][-1] in LIST_END_FLAG:
+                            line[ListLineTag.IS_LIST_END_LINE] = True
             else:
                # normal handling for indented lists
                 for line in block['lines']:

+ 56 - 19
magic_pdf/pdf_parse_union_core_v2.py

@@ -30,8 +30,8 @@ from magic_pdf.pre_proc.equations_replace import (
 from magic_pdf.pre_proc.ocr_detect_all_bboxes import \
     ocr_prepare_bboxes_for_layout_split_v2
 from magic_pdf.pre_proc.ocr_dict_merge import (fill_spans_in_blocks,
-                                               fix_block_spans,
-                                               fix_discarded_block, fix_block_spans_v2)
+                                               fix_discarded_block,
+                                               fix_block_spans_v2)
 from magic_pdf.pre_proc.ocr_span_list_modify import (
     get_qa_need_list_v2, remove_overlaps_low_confidence_spans,
     remove_overlaps_min_spans)
@@ -164,8 +164,8 @@ class ModelSingleton:
 
 
 def do_predict(boxes: List[List[int]], model) -> List[int]:
-    from magic_pdf.model.v3.helpers import (boxes2inputs, parse_logits,
-                                            prepare_inputs)
+    from magic_pdf.model.sub_modules.reading_oreder.layoutreader.helpers import (boxes2inputs, parse_logits,
+                                                                                 prepare_inputs)
 
     inputs = boxes2inputs(boxes)
     inputs = prepare_inputs(inputs, model)
@@ -174,23 +174,57 @@ def do_predict(boxes: List[List[int]], model) -> List[int]:
 
 
 def cal_block_index(fix_blocks, sorted_bboxes):
-    for block in fix_blocks:
 
-        line_index_list = []
-        if len(block['lines']) == 0:
-            block['index'] = sorted_bboxes.index(block['bbox'])
-        else:
+    if sorted_bboxes is not None:
+        # sort with layoutreader
+        for block in fix_blocks:
+            line_index_list = []
+            if len(block['lines']) == 0:
+                block['index'] = sorted_bboxes.index(block['bbox'])
+            else:
+                for line in block['lines']:
+                    line['index'] = sorted_bboxes.index(line['bbox'])
+                    line_index_list.append(line['index'])
+                median_value = statistics.median(line_index_list)
+                block['index'] = median_value
+
+            # drop the virtual line info from image/table body blocks and backfill it with real_lines
+            if block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
+                block['virtual_lines'] = copy.deepcopy(block['lines'])
+                block['lines'] = copy.deepcopy(block['real_lines'])
+                del block['real_lines']
+    else:
+        # fall back to xycut sorting
+        block_bboxes = []
+        for block in fix_blocks:
+            block_bboxes.append(block['bbox'])
+
+            # drop the virtual line info from image/table body blocks and backfill it with real_lines
+            if block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
+                block['virtual_lines'] = copy.deepcopy(block['lines'])
+                block['lines'] = copy.deepcopy(block['real_lines'])
+                del block['real_lines']
+
+        import numpy as np
+        from magic_pdf.model.sub_modules.reading_oreder.layoutreader.xycut import recursive_xy_cut
+
+        random_boxes = np.array(block_bboxes)
+        np.random.shuffle(random_boxes)
+        res = []
+        recursive_xy_cut(np.asarray(random_boxes).astype(int), np.arange(len(block_bboxes)), res)
+        assert len(res) == len(block_bboxes)
+        sorted_boxes = random_boxes[np.array(res)].tolist()
+
+        for i, block in enumerate(fix_blocks):
+            block['index'] = sorted_boxes.index(block['bbox'])
+
+        # Generate line indexes
+        sorted_blocks = sorted(fix_blocks, key=lambda b: b['index'])
+        line_index = 1
+        for block in sorted_blocks:
             for line in block['lines']:
-                line['index'] = sorted_bboxes.index(line['bbox'])
-                line_index_list.append(line['index'])
-            median_value = statistics.median(line_index_list)
-            block['index'] = median_value
-
-        # 删除图表body block中的虚拟line信息, 并用real_lines信息回填
-        if block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
-            block['virtual_lines'] = copy.deepcopy(block['lines'])
-            block['lines'] = copy.deepcopy(block['real_lines'])
-            del block['real_lines']
+                line['index'] = line_index
+                line_index += 1
 
     return fix_blocks
 
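As a rough illustration of the xycut fallback introduced above, the sketch below reorders a few toy blocks with the same ``recursive_xy_cut`` call shape used in the hunk (the import path is taken verbatim from the diff, including its spelling; the exact output order is an assumption about the library's behavior):

.. code:: python

    import numpy as np

    from magic_pdf.model.sub_modules.reading_oreder.layoutreader.xycut import recursive_xy_cut

    # Toy block bboxes as [x0, y0, x1, y1]: a two-column, two-row page
    boxes = np.array([
        [500, 100, 900, 200],   # top-right
        [100, 100, 450, 200],   # top-left
        [100, 300, 450, 400],   # bottom-left
        [500, 300, 900, 400],   # bottom-right
    ])

    res = []
    recursive_xy_cut(boxes.astype(int), np.arange(len(boxes)), res)
    # res holds the input indices in reading order; apply it to recover sorted boxes
    print(boxes[np.array(res)].tolist())
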
@@ -264,6 +298,9 @@ def sort_lines_by_model(fix_blocks, page_w, page_h, line_height):
                 block['lines'].append({'bbox': line, 'spans': []})
             page_line_list.extend(lines)
 
+    if len(page_line_list) > 200:  # layoutreader supports at most 512 lines; return None above 200 to fall back to xycut sorting
+        return None
+
     # Sort with layoutreader
     x_scale = 1000.0 / page_w
     y_scale = 1000.0 / page_h

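The ``x_scale``/``y_scale`` lines in the last hunk map page coordinates into the 0-1000 grid that layoutreader consumes; a minimal sketch of that mapping (the rounding and clamping details are assumptions, not the pipeline's exact code):

.. code:: python

    def scale_bbox_for_layoutreader(bbox, page_w, page_h):
        # Normalize a page-space bbox into layoutreader's 0-1000 coordinate grid
        x_scale = 1000.0 / page_w
        y_scale = 1000.0 / page_h
        x0, y0, x1, y1 = bbox
        return [
            max(0, min(1000, round(x0 * x_scale))),
            max(0, min(1000, round(y0 * y_scale))),
            max(0, min(1000, round(x1 * x_scale))),
            max(0, min(1000, round(y1 * y_scale))),
        ]

    # A text line on a US Letter page (612 x 792 pt)
    print(scale_bbox_for_layoutreader([72, 90, 540, 110], page_w=612, page_h=792))
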
+ 2 - 1
magic_pdf/resources/model_config/model_configs.yaml

@@ -4,4 +4,5 @@ weights:
   yolo_v8_mfd: MFD/YOLO/yolo_v8_ft.pt
   unimernet_small: MFR/unimernet_small
   struct_eqtable: TabRec/StructEqTable
-  tablemaster: TabRec/TableMaster
+  tablemaster: TabRec/TableMaster
+  rapid_table: TabRec/RapidTable

+ 47 - 3
magic_pdf/tools/common.py

@@ -14,6 +14,9 @@ from magic_pdf.pipe.TXTPipe import TXTPipe
 from magic_pdf.pipe.UNIPipe import UNIPipe
 from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
 from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
+import fitz
+# from io import BytesIO
+# from pypdf import PdfReader, PdfWriter
 
 
 def prepare_env(output_dir, pdf_file_name, method):
@@ -26,6 +29,42 @@ def prepare_env(output_dir, pdf_file_name, method):
     return local_image_dir, local_md_dir
 
 
+# def convert_pdf_bytes_to_bytes_by_pypdf(pdf_bytes, start_page_id=0, end_page_id=None):
+#     # Wrap the byte data in a BytesIO object
+#     pdf_file = BytesIO(pdf_bytes)
+#     # Read the PDF from the bytes
+#     reader = PdfReader(pdf_file)
+#     # Create a new PDF writer
+#     writer = PdfWriter()
+#     # Add the selected pages to the new PDF writer
+#     end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else len(reader.pages) - 1
+#     if end_page_id > len(reader.pages) - 1:
+#         logger.warning("end_page_id is out of range, use pdf_docs length")
+#         end_page_id = len(reader.pages) - 1
+#     for i, page in enumerate(reader.pages):
+#         if start_page_id <= i <= end_page_id:
+#             writer.add_page(page)
+#     # Create a byte buffer to hold the output PDF data
+#     output_buffer = BytesIO()
+#     # Write the PDF into the byte buffer
+#     writer.write(output_buffer)
+#     # Get the contents of the byte buffer
+#     converted_pdf_bytes = output_buffer.getvalue()
+#     return converted_pdf_bytes
+
+
+def convert_pdf_bytes_to_bytes_by_pymupdf(pdf_bytes, start_page_id=0, end_page_id=None):
+    document = fitz.open("pdf", pdf_bytes)
+    output_document = fitz.open()
+    end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else len(document) - 1
+    if end_page_id > len(document) - 1:
+        logger.warning("end_page_id is out of range, use pdf_docs length")
+        end_page_id = len(document) - 1
+    output_document.insert_pdf(document, from_page=start_page_id, to_page=end_page_id)
+    output_bytes = output_document.tobytes()
+    return output_bytes
+
+
 def do_parse(
     output_dir,
     pdf_file_name,
@@ -55,6 +94,8 @@ def do_parse(
         f_draw_model_bbox = True
         f_draw_line_sort_bbox = True
 
+    pdf_bytes = convert_pdf_bytes_to_bytes_by_pymupdf(pdf_bytes, start_page_id, end_page_id)
+
     orig_model_list = copy.deepcopy(model_list)
     local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name,
                                                 parse_method)
@@ -66,15 +107,18 @@ def do_parse(
     if parse_method == 'auto':
         jso_useful_key = {'_pdf_type': '', 'model_list': model_list}
         pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer, is_debug=True,
-                       start_page_id=start_page_id, end_page_id=end_page_id, lang=lang,
+                       # start_page_id=start_page_id, end_page_id=end_page_id,
+                       lang=lang,
                        layout_model=layout_model, formula_enable=formula_enable, table_enable=table_enable)
     elif parse_method == 'txt':
         pipe = TXTPipe(pdf_bytes, model_list, image_writer, is_debug=True,
-                       start_page_id=start_page_id, end_page_id=end_page_id, lang=lang,
+                       # start_page_id=start_page_id, end_page_id=end_page_id,
+                       lang=lang,
                        layout_model=layout_model, formula_enable=formula_enable, table_enable=table_enable)
     elif parse_method == 'ocr':
         pipe = OCRPipe(pdf_bytes, model_list, image_writer, is_debug=True,
-                       start_page_id=start_page_id, end_page_id=end_page_id, lang=lang,
+                       # start_page_id=start_page_id, end_page_id=end_page_id,
+                       lang=lang,
                        layout_model=layout_model, formula_enable=formula_enable, table_enable=table_enable)
     else:
         logger.error('unknown parse method')

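Since ``do_parse`` now clips the page range up front, a quick usage sketch of the new helper (the input file name is a placeholder):

.. code:: python

    import fitz  # PyMuPDF

    from magic_pdf.tools.common import convert_pdf_bytes_to_bytes_by_pymupdf

    with open('input.pdf', 'rb') as f:  # placeholder path
        pdf_bytes = f.read()

    # Keep pages 0..4; end_page_id=None (or a negative value) keeps through the last page
    clipped = convert_pdf_bytes_to_bytes_by_pymupdf(pdf_bytes, start_page_id=0, end_page_id=4)
    print(len(fitz.open('pdf', clipped)))  # 5, assuming the source has at least 5 pages
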
File diff suppressed because it is too large
+ 16 - 0
next_docs/README.md


File diff suppressed because it is too large
+ 16 - 0
next_docs/README_zh-CN.md


File diff suppressed because it is too large
+ 13 - 0
next_docs/en/_static/image/ReadTheDocs.svg


+ 0 - 26
next_docs/en/additional_notes/changelog.rst

@@ -1,26 +0,0 @@
-
-
-Changelog
-=========
-
--  2024/09/27 Version 0.8.1 released, Fixed some bugs, and providing a
-   `localized deployment version <projects/web_demo/README.md>`__ of the
-   `online
-   demo <https://opendatalab.com/OpenSourceTools/Extractor/PDF/>`__ and
-   the `front-end interface <projects/web/README.md>`__.
--  2024/09/09: Version 0.8.0 released, supporting fast deployment with
-   Dockerfile, and launching demos on Huggingface and Modelscope.
--  2024/08/30: Version 0.7.1 released, add paddle tablemaster table
-   recognition option
--  2024/08/09: Version 0.7.0b1 released, simplified installation
-   process, added table recognition functionality
--  2024/08/01: Version 0.6.2b1 released, optimized dependency conflict
-   issues and installation documentation
--  2024/07/05: Initial open-source release
-
-
-.. warning::
-
-   fix ``localized deployment version`` and ``front-end interface``
-
-

+ 12 - 0
next_docs/en/additional_notes/faq.rst

@@ -74,3 +74,15 @@ CUDA version used by Paddle needs to be upgraded.
    pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
 
 Reference: https://github.com/opendatalab/MinerU/issues/558
+
+
+7. On some Linux servers, the program immediately reports an error ``Illegal instruction (core dumped)``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This might be because the server's CPU does not support the AVX/AVX2
+instruction set, or the CPU supports it but the instruction set has
+been disabled by the system administrator. You can try asking the
+administrator to lift the restriction, or switch to a different server.
+
+References: https://github.com/opendatalab/MinerU/issues/591 ,
+https://github.com/opendatalab/MinerU/issues/736

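Before contacting the administrator, you can check for AVX/AVX2 support yourself; a minimal sketch reading ``/proc/cpuinfo`` (Linux-only, and only meaningful on x86_64):

.. code:: python

    def cpu_has_avx():
        # The 'flags' line of /proc/cpuinfo lists the instruction-set
        # extensions the kernel reports for the CPU.
        with open('/proc/cpuinfo') as f:
            for line in f:
                if line.startswith('flags'):
                    flags = set(line.split(':', 1)[1].split())
                    return {'avx': 'avx' in flags, 'avx2': 'avx2' in flags}
        return {'avx': False, 'avx2': False}

    print(cpu_has_avx())
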
+ 15 - 14
next_docs/en/additional_notes/known_issues.rst

@@ -1,19 +1,20 @@
 Known Issues
 ============
 
--  Reading order is based on the model’s sorting of text distribution in
-   space, which may become disordered under extremely complex layouts.
+-  Reading order is determined by the model based on the spatial
+   distribution of readable content, and may be out of order in some
+   areas under extremely complex layouts.
 -  Vertical text is not supported.
--  Tables of contents and lists are recognized through rules; a few
-   uncommon list formats may not be identified.
--  Only one level of headings is supported; hierarchical heading levels
-   are currently not supported.
+-  Tables of contents and lists are recognized through rules, and some
+   uncommon list formats may not be recognized.
+-  Only one level of headings is supported; hierarchical headings are
+   not currently supported.
 -  Code blocks are not yet supported in the layout model.
--  Comic books, art books, elementary school textbooks, and exercise
-   books are not well-parsed yet
--  Enabling OCR may produce better results in PDFs with a high density
-   of formulas
--  If you are processing PDFs with a large number of formulas, it is
-   strongly recommended to enable the OCR function. When using PyMuPDF
-   to extract text, overlapping text lines can occur, leading to
-   inaccurate formula insertion positions.
+-  Comic books, art albums, primary school textbooks, and exercises
+   cannot be parsed well.
+-  Table recognition may result in row/column recognition errors in
+   complex tables.
+-  OCR recognition may produce inaccurate characters in PDFs of
+   lesser-known languages (e.g., diacritical marks in Latin script,
+   easily confused characters in Arabic script).
+-  Some formulas may not render correctly in Markdown.

+ 0 - 1
next_docs/en/api.rst

@@ -7,4 +7,3 @@
    api/read_api
    api/schemas
    api/io
-   api/classes

+ 0 - 14
next_docs/en/api/classes.rst

@@ -1,14 +0,0 @@
-Class Hierarchy
-===============
-
-.. inheritance-diagram:: magic_pdf.data.io.base magic_pdf.data.io.http magic_pdf.data.io.s3
-   :parts: 2
-
-
-.. inheritance-diagram:: magic_pdf.data.dataset
-   :parts: 2
-
-
-.. inheritance-diagram:: magic_pdf.data.data_reader_writer.base magic_pdf.data.data_reader_writer.filebase magic_pdf.data.data_reader_writer.multi_bucket_s3
-   :parts: 2
-

+ 0 - 1
next_docs/en/api/utils.rst

@@ -1 +0,0 @@
-

+ 1 - 1
next_docs/en/conf.py

@@ -95,7 +95,7 @@ language = 'en'
 html_theme = 'sphinx_book_theme'
 html_logo = '_static/image/logo.png'
 html_theme_options = {
-    'path_to_docs': 'docs/en',
+    'path_to_docs': 'next_docs/en',
     'repository_url': 'https://github.com/opendatalab/MinerU',
     'use_repository_button': True,
 }

+ 23 - 22
next_docs/en/index.rst

@@ -46,20 +46,29 @@ the relevant PDF**.
 Key Features
 ------------
 
--  Removes elements such as headers, footers, footnotes, and page
-   numbers while maintaining semantic continuity
--  Outputs text in a human-readable order from multi-column documents
--  Retains the original structure of the document, including titles,
-   paragraphs, and lists
--  Extracts images, image captions, tables, and table captions
--  Automatically recognizes formulas in the document and converts them
-   to LaTeX
--  Automatically recognizes tables in the document and converts them to
-   LaTeX
--  Automatically detects and enables OCR for corrupted PDFs
--  Supports both CPU and GPU environments
--  Supports Windows, Linux, and Mac platforms
-
+-  Remove headers, footers, footnotes, page numbers, etc., to ensure
+   semantic coherence.
+-  Output text in human-readable order, suitable for single-column,
+   multi-column, and complex layouts.
+-  Preserve the structure of the original document, including headings,
+   paragraphs, lists, etc.
+-  Extract images, image descriptions, tables, table titles, and
+   footnotes.
+-  Automatically recognize and convert formulas in the document to LaTeX
+   format.
+-  Automatically recognize and convert tables in the document to LaTeX
+   or HTML format.
+-  Automatically detect scanned PDFs and garbled PDFs and enable OCR
+   functionality.
+-  OCR supports detection and recognition of 84 languages.
+-  Supports multiple output formats, such as multimodal and NLP
+   Markdown, JSON sorted by reading order, and rich intermediate
+   formats.
+-  Supports various visualization results, including layout
+   visualization and span visualization, for efficient confirmation of
+   output quality.
+-  Supports both CPU and GPU environments.
+-  Compatible with Windows, Linux, and Mac platforms.
 
 User Guide
 -------------
@@ -91,14 +100,6 @@ Additional Notes
 
    additional_notes/known_issues
    additional_notes/faq
-   additional_notes/changelog
    additional_notes/glossary
 
 
-Projects 
----------
-.. toctree::
-   :maxdepth: 1
-   :caption: Projects
-
-   projects

+ 0 - 13
next_docs/en/projects.rst

@@ -1,13 +0,0 @@
-
-
-
-llama_index_rag 
-===============
-
-
-gradio_app
-============
-
-
-other projects
-===============

+ 5 - 1
next_docs/en/user_guide/data/data_reader_writer.rst

@@ -87,6 +87,8 @@ Read Examples
 
 .. code:: python
 
+    from magic_pdf.data.data_reader_writer import *
+
     # file based related 
     file_based_reader1 = FileBasedDataReader('')
 
@@ -142,6 +144,8 @@ Write Examples
 
 .. code:: python
 
+    from magic_pdf.data.data_reader_writer import *
+
     # file based related 
     file_based_writer1 = FileBasedDataWriter('')
 
@@ -201,4 +205,4 @@ Write Examples
     s3_writer1.write('s3://test_bucket/efg', '123'.encode())
 
 
-Check :doc:`../../api/classes` for more intuitions or check :doc:`../../api/data_reader_writer` for more details
+Check :doc:`../../api/data_reader_writer` for more details

+ 1 - 1
next_docs/en/user_guide/data/dataset.rst

@@ -36,5 +36,5 @@ Extract chars via a third-party library; currently we use ``pymupdf``.
 
 
 
-Check :doc:`../../api/classes` for more intuitions or check :doc:`../../api/dataset` for more details
+Check :doc:`../../api/dataset` for more details
 

+ 1 - 1
next_docs/en/user_guide/data/io.rst

@@ -21,5 +21,5 @@ if MinerU has not provided the suitable classes. It is easy to implement new cla
         def write(self, path: str, data: bytes) -> None:
             pass
 
-Check :doc:`../../api/classes` for more intuitions or check :doc:`../../api/io` for more details
+Check :doc:`../../api/io` for more details
 

+ 6 - 1
next_docs/en/user_guide/data/read_api.rst

@@ -18,6 +18,8 @@ Read the content from jsonl which may be located on local machine or remote s3. if y
 
 .. code:: python
 
+    from magic_pdf.data.io.read_api import *
+
     # read jsonl from local machine 
     datasets = read_jsonl("tt.jsonl", None)
 
@@ -33,6 +35,8 @@ Read pdf from path or directory.
 
 .. code:: python
 
+    from magic_pdf.data.io.read_api import *
+
     # read pdf path
     datasets = read_local_pdfs("tt.pdf")
 
@@ -47,10 +51,11 @@ Read images from path or directory
 
 .. code:: python 
 
+    from magic_pdf.data.io.read_api import *
+
     # read from image path 
     datasets = read_local_images("tt.png")
 
-
     # read files from directory that endswith suffix in suffixes array 
     datasets = read_local_images("images/", suffixes=["png", "jpg"])
 

+ 45 - 41
next_docs/en/user_guide/install/boost_with_cuda.rst

@@ -9,16 +9,18 @@ appropriate guide based on your system:
 
 -  :ref:`ubuntu_22_04_lts_section`
 -  :ref:`windows_10_or_11_section`
+-  Quick Deployment with Docker
 
--  Quick Deployment with Docker > Docker requires a GPU with at least
-   16GB of VRAM, and all acceleration features are enabled by default.
+.. admonition:: Important
+   :class: tip
 
-.. note:: 
+   Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
 
-   Before running this Docker, you can use the following command to
-   check if your device supports CUDA acceleration on Docker. 
+   Before running this Docker image, you can use the following command to check whether your device supports CUDA acceleration under Docker.
 
-   bash  docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
+   .. code-block:: bash
+
+      docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
 
 .. code:: sh
 
@@ -42,8 +44,9 @@ Ubuntu 22.04 LTS
 If you see information similar to the following, it means that the
 NVIDIA drivers are already installed, and you can skip Step 2.
 
-Notice:``CUDA Version`` should be >= 12.1, If the displayed version
-number is less than 12.1, please upgrade the driver.
+.. note::
+
+   ``CUDA Version`` should be >= 12.1. If the displayed version number is lower than 12.1, please upgrade the driver.
 
 .. code:: text
 
@@ -105,8 +108,10 @@ Specify Python version 3.10.
 
    pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
 
-❗ After installation, make sure to check the version of ``magic-pdf``
-using the following command:
+.. admonition:: Important
+    :class: tip
+
+    ❗ After installation, make sure to check the version of ``magic-pdf`` using the following command:
 
 .. code:: sh
 
@@ -127,7 +132,10 @@ the script will automatically generate a ``magic-pdf.json`` file in the
 user directory and configure the default model path. You can find the
 ``magic-pdf.json`` file in your user directory.
 
-   The user directory for Linux is “/home/username”.
+.. admonition:: TIP
+    :class: tip
+
+    The user directory for Linux is “/home/username”.
 
 8. First Run
 ~~~~~~~~~~~~
@@ -137,7 +145,7 @@ Download a sample file from the repository and test it.
 .. code:: sh
 
    wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf
-   magic-pdf -p small_ocr.pdf
+   magic-pdf -p small_ocr.pdf -o ./output
 
 9. Test CUDA Acceleration
 ~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -145,10 +153,6 @@ Download a sample file from the repository and test it.
 If your graphics card has at least **8GB** of VRAM, follow these steps
 to test CUDA acceleration:
 
-   ❗ Due to the extremely limited nature of 8GB VRAM for running this
-   application, you need to close all other programs using VRAM to
-   ensure that 8GB of VRAM is available when running this application.
-
 1. Modify the value of ``"device-mode"`` in the ``magic-pdf.json``
    configuration file located in your home directory.
 
@@ -162,7 +166,7 @@ to test CUDA acceleration:
 
    .. code:: sh
 
-      magic-pdf -p small_ocr.pdf
+      magic-pdf -p small_ocr.pdf -o ./output
 
 10. Enable CUDA Acceleration for OCR
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -178,7 +182,9 @@ to test CUDA acceleration:
 
    .. code:: sh
 
-      magic-pdf -p small_ocr.pdf
+      magic-pdf -p small_ocr.pdf -o ./output
+
+
 
 .. _windows_10_or_11_section:
 
@@ -218,16 +224,16 @@ Python version must be 3.10.
 
    pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
 
-..
+.. admonition:: Important
+    :class: tip
 
-   ❗️After installation, verify the version of ``magic-pdf``:
+    ❗️After installation, verify the version of ``magic-pdf``:
 
-   .. code:: bash
+    .. code:: bash
 
       magic-pdf --version
 
-   If the version number is less than 0.7.0, please report it in the
-   issues section.
+    If the version number is less than 0.7.0, please report it in the issues section.
 
 5. Download Models
 ~~~~~~~~~~~~~~~~~~
@@ -242,7 +248,10 @@ the script will automatically generate a ``magic-pdf.json`` file in the
 user directory and configure the default model path. You can find the
 ``magic-pdf.json`` file in your 【user directory】 .
 
-   The user directory for Windows is “C:/Users/username”.
+.. admonition:: Tip
+    :class: tip
+
+    The user directory for Windows is “C:/Users/username”.
 
 7. First Run
 ~~~~~~~~~~~~
@@ -252,7 +261,7 @@ Download a sample file from the repository and test it.
 .. code:: powershell
 
      wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf -O small_ocr.pdf
-     magic-pdf -p small_ocr.pdf
+     magic-pdf -p small_ocr.pdf -o ./output
 
 8. Test CUDA Acceleration
 ~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -260,27 +269,23 @@ Download a sample file from the repository and test it.
 If your graphics card has at least 8GB of VRAM, follow these steps to
 test CUDA-accelerated parsing performance.
 
-   ❗ Due to the extremely limited nature of 8GB VRAM for running this
-   application, you need to close all other programs using VRAM to
-   ensure that 8GB of VRAM is available when running this application.
-
-1. **Overwrite the installation of torch and torchvision** supporting
-   CUDA.
+1. **Overwrite the installation of torch and torchvision** supporting CUDA.
 
-   ::
+.. code:: sh
 
-      pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
+   pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
 
-   ..
+.. admonition:: Important
+    :class: tip
 
-      ❗️Ensure the following versions are specified in the command:
+    ❗️Ensure the following versions are specified in the command:
 
-      ::
+ 
+    .. code:: sh
 
          torch==2.3.1 torchvision==0.18.1
 
-      These are the highest versions we support. Installing higher
-      versions without specifying them will cause the program to fail.
+    These are the highest versions we support. Installing higher versions without specifying them will cause the program to fail.
 
 2. **Modify the value of ``"device-mode"``** in the ``magic-pdf.json``
    configuration file located in your user directory.
@@ -295,7 +300,7 @@ test CUDA-accelerated parsing performance.
 
    ::
 
-      magic-pdf -p small_ocr.pdf
+      magic-pdf -p small_ocr.pdf -o ./output
 
 9. Enable CUDA Acceleration for OCR
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -311,5 +316,4 @@ test CUDA-accelerated parsing performance.
 
    ::
 
-      magic-pdf -p small_ocr.pdf
-
+      magic-pdf -p small_ocr.pdf -o ./output

+ 81 - 78
next_docs/en/user_guide/install/install.rst

@@ -1,87 +1,90 @@
 
 Install 
 ===============================================================
-If you encounter any installation issues, please first consult the FAQ.
-If the parsing results are not as expected, refer to the Known Issues.
-There are three different ways to experience MinerU
-
-Pre-installation Notice—Hardware and Software Environment Support
-------------------------------------------------------------------
-
-To ensure the stability and reliability of the project, we only optimize
-and test for specific hardware and software environments during
-development. This ensures that users deploying and running the project
-on recommended system configurations will get the best performance with
-the fewest compatibility issues.
-
-By focusing resources on the mainline environment, our team can more
-efficiently resolve potential bugs and develop new features.
-
-In non-mainline environments, due to the diversity of hardware and
-software configurations, as well as third-party dependency compatibility
-issues, we cannot guarantee 100% project availability. Therefore, for
-users who wish to use this project in non-recommended environments, we
-suggest carefully reading the documentation and FAQ first. Most issues
-already have corresponding solutions in the FAQ. We also encourage
-community feedback to help us gradually expand support.
+If you encounter any installation issues, please first consult the :doc:`../../additional_notes/faq`.
+If the parsing results are not as expected, refer to the :doc:`../../additional_notes/known_issues`.
+
+
+.. admonition:: Warning
+    :class: tip
+
+    **Pre-installation Notice—Hardware and Software Environment Support**
+
+    To ensure the stability and reliability of the project, we only optimize
+    and test for specific hardware and software environments during
+    development. This ensures that users deploying and running the project
+    on recommended system configurations will get the best performance with
+    the fewest compatibility issues.
+
+    By focusing resources on the mainline environment, our team can more
+    efficiently resolve potential bugs and develop new features.
+
+    In non-mainline environments, due to the diversity of hardware and
+    software configurations, as well as third-party dependency compatibility
+    issues, we cannot guarantee 100% project availability. Therefore, for
+    users who wish to use this project in non-recommended environments, we
+    suggest carefully reading the documentation and FAQ first. Most issues
+    already have corresponding solutions in the FAQ. We also encourage
+    community feedback to help us gradually expand support.
 
 .. raw:: html
 
-   <style>
-      table, th, td {
-      border: 1px solid black;
-      border-collapse: collapse;
-      }
-   </style>
-   <table>
-    <tr>
-        <td colspan="3" rowspan="2">Operating System</td>
-    </tr>
-    <tr>
-        <td>Ubuntu 22.04 LTS</td>
-        <td>Windows 10 / 11</td>
-        <td>macOS 11+</td>
-    </tr>
-    <tr>
-        <td colspan="3">CPU</td>
-        <td>x86_64</td>
-        <td>x86_64</td>
-        <td>x86_64 / arm64</td>
-    </tr>
-    <tr>
-        <td colspan="3">Memory</td>
-        <td colspan="3">16GB or more, recommended 32GB+</td>
-    </tr>
-    <tr>
-        <td colspan="3">Python Version</td>
-        <td colspan="3">3.10</td>
-    </tr>
-    <tr>
-        <td colspan="3">Nvidia Driver Version</td>
-        <td>latest (Proprietary Driver)</td>
-        <td>latest</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td colspan="3">CUDA Environment</td>
-        <td>Automatic installation [12.1 (pytorch) + 11.8 (paddle)]</td>
-        <td>11.8 (manual installation) + cuDNN v8.7.0 (manual installation)</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td rowspan="2">GPU Hardware Support List</td>
-        <td colspan="2">Minimum Requirement 8G+ VRAM</td>
-        <td colspan="2">3060ti/3070/3080/3080ti/4060/4070/4070ti<br>
-        8G VRAM enables layout, formula recognition acceleration and OCR acceleration</td>
-        <td rowspan="2">None</td>
-    </tr>
-    <tr>
-        <td colspan="2">Recommended Configuration 16G+ VRAM</td>
-        <td colspan="2">3090/3090ti/4070ti super/4080/4090<br>
-        16G VRAM or more can enable layout, formula recognition, OCR acceleration and table recognition acceleration simultaneously
-        </td>
-    </tr>
-   </table>
+    <style>
+        table, th, td {
+        border: 1px solid black;
+        border-collapse: collapse;
+        }
+    </style>
+    <table>
+        <tr>
+            <td colspan="3" rowspan="2">Operating System</td>
+        </tr>
+        <tr>
+            <td>Ubuntu 22.04 LTS</td>
+            <td>Windows 10 / 11</td>
+            <td>macOS 11+</td>
+        </tr>
+        <tr>
+            <td colspan="3">CPU</td>
+            <td>x86_64 (ARM Linux not supported)</td>
+            <td>x86_64 (ARM Windows not supported)</td>
+            <td>x86_64 / arm64</td>
+        </tr>
+        <tr>
+            <td colspan="3">Memory</td>
+            <td colspan="3">16GB or more, recommended 32GB+</td>
+        </tr>
+        <tr>
+            <td colspan="3">Python Version</td>
+            <td colspan="3">3.10(Please make sure to create a Python 3.10 virtual environment using conda)</td>
+        </tr>
+        <tr>
+            <td colspan="3">Nvidia Driver Version</td>
+            <td>latest (Proprietary Driver)</td>
+            <td>latest</td>
+            <td>None</td>
+        </tr>
+        <tr>
+            <td colspan="3">CUDA Environment</td>
+            <td>Automatic installation [12.1 (pytorch) + 11.8 (paddle)]</td>
+            <td>11.8 (manual installation) + cuDNN v8.7.0 (manual installation)</td>
+            <td>None</td>
+        </tr>
+        <tr>
+            <td rowspan="2">GPU Hardware Support List</td>
+            <td colspan="2">Minimum Requirement 8G+ VRAM</td>
+            <td colspan="2">3060ti/3070/4060<br>
+            8G VRAM enables layout and formula recognition acceleration, as well as OCR acceleration</td>
+            <td rowspan="2">None</td>
+        </tr>
+        <tr>
+            <td colspan="2">Recommended Configuration 10G+ VRAM</td>
+            <td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070tisuper/4080/4090<br>
+            10G VRAM or more enables layout, formula recognition, OCR, and table recognition acceleration simultaneously
+            </td>
+        </tr>
+    </table>
+
 
 
 Create an environment

+ 4 - 1
next_docs/en/user_guide/quick_start/command_line.rst

@@ -55,5 +55,8 @@ directory. The output file list is as follows:
    ├── some_pdf_spans.pdf                   # smallest granularity bbox position information diagram
    └── some_pdf_content_list.json           # Rich text JSON arranged in reading order
 
-For more information about the output files, please refer to the :doc:`../tutorial/output_file_description`
+.. admonition:: Tip
+   :class: tip
+
+   For more information about the output files, please refer to the :doc:`../tutorial/output_file_description`
 

+ 0 - 10
next_docs/en/user_guide/quick_start/extract_text.rst

@@ -1,10 +0,0 @@
-
-
-Extract Content from Pdf
-========================
-
-.. code:: python
-
-    from magic_pdf.data.read_api import read_local_pdfs
-    from magic_pdf.pdf_parse_union_core_v2 import pdf_parse_union
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze

BIN
next_docs/zh_cn/_static/image/MinerU-logo-hq.png


BIN
next_docs/zh_cn/_static/image/MinerU-logo.png


File diff suppressed because it is too large
+ 13 - 0
next_docs/zh_cn/_static/image/ReadTheDocs.svg


BIN
next_docs/zh_cn/_static/image/datalab_logo.png


BIN
next_docs/zh_cn/_static/image/flowchart_en.png


BIN
next_docs/zh_cn/_static/image/flowchart_zh_cn.png


BIN
next_docs/zh_cn/_static/image/layout_example.png


BIN
next_docs/zh_cn/_static/image/poly.png


BIN
next_docs/zh_cn/_static/image/project_panorama_en.png


BIN
next_docs/zh_cn/_static/image/project_panorama_zh_cn.png


BIN
next_docs/zh_cn/_static/image/spans_example.png


BIN
next_docs/zh_cn/_static/image/web_demo_1.png


+ 72 - 0
next_docs/zh_cn/additional_notes/faq.rst

@@ -0,0 +1,72 @@
+Frequently Asked Questions
+==========================
+
+1. Installing with pip install magic-pdf[full] on newer versions of macOS fails with zsh: no matches found: magic-pdf[full]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On macOS, the default shell has switched from Bash to Z shell, and Z
+shell has special handling for certain kinds of string matching, which
+can cause the "no matches found" error. You can disable the globbing
+feature on the command line and then retry the installation command:
+
+.. code:: bash
+
+   setopt no_nomatch
+   pip install magic-pdf[full]
+
+2. Encountering the error _pickle.UnpicklingError: invalid load key, 'v'. during use
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This may be caused by an incomplete download of the model files. Try
+re-downloading the model files. Reference: https://github.com/opendatalab/MinerU/issues/143
+
+3. Where should the model files be downloaded / how should models-dir be configured
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The model file path is configured in "magic-pdf.json" via
+
+.. code:: json
+
+   {
+     "models-dir": "/tmp/models"
+   }
+
+This path is an absolute path, not a relative one; the absolute path
+can be obtained by running the "pwd" command inside the models directory.
+Reference: https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874
+
+4. Encountering the error ``ImportError: libGL.so.1: cannot open shared object file: No such file or directory`` on Ubuntu 22.04 under WSL2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Ubuntu 22.04 under WSL2 is missing the ``libgl`` library; install it
+with the following command to resolve the issue:
+
+.. code:: bash
+
+   sudo apt-get install libgl1-mesa-glx
+
+Reference: https://github.com/opendatalab/MinerU/issues/388
+
+5. Encountering the error ``ModuleNotFoundError: No module named 'fairscale'``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Uninstall the module and reinstall it:
+
+.. code:: bash
+
+   pip uninstall fairscale
+   pip install fairscale
+
+Reference: https://github.com/opendatalab/MinerU/issues/411
+
+6. On some newer devices such as the H100, text parsed with CUDA-accelerated OCR is garbled.
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+CUDA 11 has poor compatibility with newer graphics cards, so the CUDA
+version used by Paddle needs to be upgraded:
+
+.. code:: bash
+
+   pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
+
+Reference: https://github.com/opendatalab/MinerU/issues/558
+
+7. On some Linux servers, the program immediately reports the error ``Illegal instruction (core dumped)``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This may be because the server's CPU does not support the AVX/AVX2
+instruction set, or the CPU supports it but the instruction set has
+been disabled by the system administrator. You can try asking the
+administrator to lift the restriction, or switch to a different server.
+
+References: https://github.com/opendatalab/MinerU/issues/591 ,
+https://github.com/opendatalab/MinerU/issues/736

+ 11 - 0
next_docs/zh_cn/additional_notes/glossary.rst

@@ -0,0 +1,11 @@
+
+
+Glossary
+===========
+
+1. jsonl 
+    TODO: add description
+
+2. magic-pdf.json
+    TODO: add description
+

Some files were not shown because too many files changed in this diff