
Merge pull request #969 from opendatalab/release-0.9.3

Release 0.9.3
Xiaomeng Zhao · 1 year ago · commit 845a3ff067
100 changed files with 1773 additions and 1018 deletions
  1. .gitignore (+3 -0)
  2. README.md (+5 -2)
  3. README_ja-JP.md (+3 -3)
  4. README_zh-CN.md (+6 -5)
  5. demo/magic_pdf_parse_main.py (+8 -3)
  6. magic-pdf.template.json (+1 -1)
  7. magic_pdf/dict2md/ocr_mkcontent.py (+1 -1)
  8. magic_pdf/libs/Constants.py (+3 -1)
  9. magic_pdf/libs/config_reader.py (+1 -1)
  10. magic_pdf/libs/draw_bbox.py (+10 -4)
  11. magic_pdf/model/pdf_extract_kit.py (+42 -297)
  12. magic_pdf/model/pek_sub_modules/post_process.py (+0 -36)
  13. magic_pdf/model/pek_sub_modules/self_modify.py (+0 -388)
  14. magic_pdf/model/sub_modules/__init__.py (+0 -0)
  15. magic_pdf/model/sub_modules/layout/__init__.py (+0 -0)
  16. magic_pdf/model/sub_modules/layout/doclayout_yolo/DocLayoutYOLO.py (+21 -0)
  17. magic_pdf/model/sub_modules/layout/doclayout_yolo/__init__.py (+0 -0)
  18. magic_pdf/model/sub_modules/layout/layoutlmv3/__init__.py (+0 -0)
  19. magic_pdf/model/sub_modules/layout/layoutlmv3/backbone.py (+0 -0)
  20. magic_pdf/model/sub_modules/layout/layoutlmv3/beit.py (+0 -0)
  21. magic_pdf/model/sub_modules/layout/layoutlmv3/deit.py (+0 -0)
  22. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/__init__.py (+0 -0)
  23. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/__init__.py (+0 -0)
  24. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/cord.py (+0 -0)
  25. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/data_collator.py (+0 -0)
  26. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/funsd.py (+0 -0)
  27. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/image_utils.py (+0 -0)
  28. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/xfund.py (+0 -0)
  29. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/__init__.py (+0 -0)
  30. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/__init__.py (+0 -0)
  31. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/configuration_layoutlmv3.py (+0 -0)
  32. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/modeling_layoutlmv3.py (+0 -0)
  33. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3.py (+0 -0)
  34. magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3_fast.py (+0 -0)
  35. magic_pdf/model/sub_modules/layout/layoutlmv3/model_init.py (+0 -0)
  36. magic_pdf/model/sub_modules/layout/layoutlmv3/rcnn_vl.py (+0 -0)
  37. magic_pdf/model/sub_modules/layout/layoutlmv3/visualizer.py (+0 -0)
  38. magic_pdf/model/sub_modules/mfd/__init__.py (+0 -0)
  39. magic_pdf/model/sub_modules/mfd/yolov8/YOLOv8.py (+12 -0)
  40. magic_pdf/model/sub_modules/mfd/yolov8/__init__.py (+0 -0)
  41. magic_pdf/model/sub_modules/mfr/__init__.py (+0 -0)
  42. magic_pdf/model/sub_modules/mfr/unimernet/Unimernet.py (+98 -0)
  43. magic_pdf/model/sub_modules/mfr/unimernet/__init__.py (+0 -0)
  44. magic_pdf/model/sub_modules/model_init.py (+144 -0)
  45. magic_pdf/model/sub_modules/model_utils.py (+51 -0)
  46. magic_pdf/model/sub_modules/ocr/__init__.py (+0 -0)
  47. magic_pdf/model/sub_modules/ocr/paddleocr/__init__.py (+0 -0)
  48. magic_pdf/model/sub_modules/ocr/paddleocr/ocr_utils.py (+259 -0)
  49. magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_273_mod.py (+168 -0)
  50. magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_291_mod.py (+213 -0)
  51. magic_pdf/model/sub_modules/reading_oreder/__init__.py (+0 -0)
  52. magic_pdf/model/sub_modules/reading_oreder/layoutreader/__init__.py (+0 -0)
  53. magic_pdf/model/sub_modules/reading_oreder/layoutreader/helpers.py (+0 -0)
  54. magic_pdf/model/sub_modules/reading_oreder/layoutreader/xycut.py (+242 -0)
  55. magic_pdf/model/sub_modules/table/__init__.py (+0 -0)
  56. magic_pdf/model/sub_modules/table/rapidtable/__init__.py (+0 -0)
  57. magic_pdf/model/sub_modules/table/rapidtable/rapid_table.py (+14 -0)
  58. magic_pdf/model/sub_modules/table/structeqtable/__init__.py (+0 -0)
  59. magic_pdf/model/sub_modules/table/structeqtable/struct_eqtable.py (+3 -11)
  60. magic_pdf/model/sub_modules/table/table_utils.py (+11 -0)
  61. magic_pdf/model/sub_modules/table/tablemaster/__init__.py (+0 -0)
  62. magic_pdf/model/sub_modules/table/tablemaster/tablemaster_paddle.py (+1 -1)
  63. magic_pdf/para/para_split_v3.py (+13 -15)
  64. magic_pdf/pdf_parse_union_core_v2.py (+56 -19)
  65. magic_pdf/resources/model_config/model_configs.yaml (+2 -1)
  66. magic_pdf/tools/common.py (+47 -3)
  67. next_docs/README.md (+16 -0)
  68. next_docs/README_zh-CN.md (+16 -0)
  69. next_docs/en/_static/image/ReadTheDocs.svg (+13 -0)
  70. next_docs/en/additional_notes/changelog.rst (+0 -26)
  71. next_docs/en/additional_notes/faq.rst (+12 -0)
  72. next_docs/en/additional_notes/known_issues.rst (+15 -14)
  73. next_docs/en/api.rst (+0 -1)
  74. next_docs/en/api/classes.rst (+0 -14)
  75. next_docs/en/api/utils.rst (+0 -1)
  76. next_docs/en/conf.py (+1 -1)
  77. next_docs/en/index.rst (+23 -22)
  78. next_docs/en/projects.rst (+0 -13)
  79. next_docs/en/user_guide/data/data_reader_writer.rst (+5 -1)
  80. next_docs/en/user_guide/data/dataset.rst (+1 -1)
  81. next_docs/en/user_guide/data/io.rst (+1 -1)
  82. next_docs/en/user_guide/data/read_api.rst (+6 -1)
  83. next_docs/en/user_guide/install/boost_with_cuda.rst (+45 -41)
  84. next_docs/en/user_guide/install/install.rst (+81 -78)
  85. next_docs/en/user_guide/quick_start/command_line.rst (+4 -1)
  86. next_docs/en/user_guide/quick_start/extract_text.rst (+0 -10)
  87. next_docs/zh_cn/_static/image/MinerU-logo-hq.png (binary)
  88. next_docs/zh_cn/_static/image/MinerU-logo.png (binary)
  89. next_docs/zh_cn/_static/image/ReadTheDocs.svg (+13 -0)
  90. next_docs/zh_cn/_static/image/datalab_logo.png (binary)
  91. next_docs/zh_cn/_static/image/flowchart_en.png (binary)
  92. next_docs/zh_cn/_static/image/flowchart_zh_cn.png (binary)
  93. next_docs/zh_cn/_static/image/layout_example.png (binary)
  94. next_docs/zh_cn/_static/image/poly.png (binary)
  95. next_docs/zh_cn/_static/image/project_panorama_en.png (binary)
  96. next_docs/zh_cn/_static/image/project_panorama_zh_cn.png (binary)
  97. next_docs/zh_cn/_static/image/spans_example.png (binary)
  98. next_docs/zh_cn/_static/image/web_demo_1.png (binary)
  99. next_docs/zh_cn/additional_notes/faq.rst (+72 -0)
  100. next_docs/zh_cn/additional_notes/glossary.rst (+11 -0)

+ 3 - 0
.gitignore

@@ -48,3 +48,6 @@ debug_utils/
 
 # sphinx docs
 _build/
+
+
+output/

+ 5 - 2
README.md

@@ -42,6 +42,7 @@
 </div>
 
 # Changelog
+- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage.
 - 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition functionality.
 - 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
   - Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading order sorting, ensuring high accuracy in various layouts.
@@ -246,7 +247,7 @@ You can modify certain configurations in this file to enable or disable features
         "enable": true  // The formula recognition feature is enabled by default. If you need to disable it, please change the value here to "false".
     },
     "table-config": {
-        "model": "tablemaster",  // When using structEqTable, please change to "struct_eqtable".
+        "model": "rapid_table",  // When using structEqTable, please change to "struct_eqtable".
         "enable": false, // The table recognition feature is disabled by default. If you need to enable it, please change the value here to "true".
         "max_time": 400
     }
@@ -261,7 +262,7 @@ If your device supports CUDA and meets the GPU requirements of the mainline envi
 - [Windows 10/11 + GPU](docs/README_Windows_CUDA_Acceleration_en_US.md)
 - Quick Deployment with Docker
 > [!IMPORTANT]
-> Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
+> Docker requires a GPU with at least 8GB of VRAM, and all acceleration features are enabled by default.
 >
 > Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker.
 > 
@@ -421,7 +422,9 @@ This project currently uses PyMuPDF to achieve advanced functionality. However,
 # Acknowledgments
 
 - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
+- [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
 - [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
+- [RapidTable](https://github.com/RapidAI/RapidTable)
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
 - [layoutreader](https://github.com/ppaanngggg/layoutreader)
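
For reference, a minimal sketch of the resulting table-config block from magic-pdf.json with table recognition switched on (the shipped default keeps "enable": false; the inline comments are assumptions drawn from the README text above):

    # table-config as of 0.9.3, expressed as a Python dict
    table_config = {
        "model": "rapid_table",  # alternatives in this release: "struct_eqtable", "tablemaster"
        "enable": True,          # table recognition ships disabled; set True to turn it on
        "max_time": 400,         # per-table budget in seconds; the pipeline logs a warning when exceeded
    }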

+ 3 - 3
README_ja-JP.md

@@ -1,3 +1,5 @@
+> [!Warning]
+> This document is outdated. Please refer to the latest version of the documentation: [ENGLISH](README.md).
 <div id="top">
 
 <p align="center">
@@ -18,9 +20,7 @@
 <a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>
 
 
-<div align="center" style="color: red; background-color: #ffdddd; padding: 10px; border: 1px solid red; border-radius: 5px;">
-  <strong>NOTE:</strong> This document is outdated. Please refer to the latest version of the documentation.
-</div>
+
 
 
 [English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)

+ 6 - 5
README_zh-CN.md

@@ -42,7 +42,7 @@
 </div>
 
 # Changelog
-
+- 2024/11/15 0.9.3 released. Integrated [RapidTable](https://github.com/RapidAI/RapidTable) for table recognition, improving single-table parsing speed by more than 10 times, with higher accuracy and lower GPU memory usage
 - 2024/11/06 0.9.2 released. Integrated the [StructTable-InternVL2-1B](https://huggingface.co/U4R/StructTable-InternVL2-1B) model for table recognition
 - 2024/10/31 0.9.0 released. This is a major new version with extensive code refactoring, addressing numerous issues, improving performance, reducing hardware requirements, and enhancing usability:
   - Refactored the sorting module code to use [layoutreader](https://github.com/ppaanngggg/layoutreader) for reading order sorting, ensuring high accuracy in various layouts
@@ -188,13 +188,13 @@ https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
         <td rowspan="2">GPU hardware support list</td>
         <td colspan="2">Minimum requirement: 8 GB+ VRAM</td>
         <td colspan="2">3060ti/3070/4060<br>
-        8 GB VRAM enables layout, formula recognition and OCR acceleration</td>
+        8 GB VRAM enables all acceleration features (tables limited to rapid_table)</td>
         <td rowspan="2">None</td>
     </tr>
     <tr>
         <td colspan="2">Recommended: 10 GB+ VRAM</td>
         <td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070tisuper/4080/4090<br>
-        10 GB VRAM or more can simultaneously enable layout, formula recognition, OCR acceleration and table recognition acceleration<br>
+        10 GB VRAM or more enables all acceleration features<br>
         </td>
     </tr>
 </table>
@@ -251,7 +251,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
         "enable": true  // Formula recognition is enabled by default; to disable it, change this value to "false"
     },
     "table-config": {
-        "model": "tablemaster",  // To use structEqTable, change this to "struct_eqtable"
+        "model": "rapid_table",  // To use structEqTable, change this to "struct_eqtable"
         "enable": false, // Table recognition is disabled by default; to enable it, change this value to "true"
         "max_time": 400
     }
@@ -266,7 +266,7 @@ pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com -i h
 - [Windows10/11 + GPU](docs/README_Windows_CUDA_Acceleration_zh_CN.md)
 - Quick deployment with Docker
 > [!IMPORTANT]
-> Docker requires a GPU with at least 16GB of VRAM; all acceleration features are enabled by default
+> Docker requires a GPU with at least 8GB of VRAM; all acceleration features are enabled by default
 > 
 > Before running this Docker, you can use the following command to check whether your device supports CUDA acceleration in Docker
 > 
@@ -431,6 +431,7 @@ TODO
 - [PDF-Extract-Kit](https://github.com/opendatalab/PDF-Extract-Kit)
 - [DocLayout-YOLO](https://github.com/opendatalab/DocLayout-YOLO)
 - [StructEqTable](https://github.com/UniModal4Reasoning/StructEqTable-Deploy)
+- [RapidTable](https://github.com/RapidAI/RapidTable)
 - [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
 - [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
 - [layoutreader](https://github.com/ppaanngggg/layoutreader)

+ 8 - 3
demo/magic_pdf_parse_main.py

@@ -19,9 +19,10 @@ def json_md_dump(
         pdf_name,
         content_list,
         md_content,
+        orig_model_list,
 ):
     # Write the model results to model.json
-    orig_model_list = copy.deepcopy(pipe.model_list)
+
     md_writer.write(
         content=json.dumps(orig_model_list, ensure_ascii=False, indent=4),
         path=f"{pdf_name}_model.json"
@@ -87,9 +88,12 @@ def pdf_parse_main(
 
         pdf_bytes = open(pdf_path, "rb").read()  # read the binary data of the pdf file
 
+        orig_model_list = []
+
         if model_json_path:
             # Read the raw JSON data (a list) of a PDF that has already been parsed by the model
             model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
+            orig_model_list = copy.deepcopy(model_json)
         else:
             model_json = []
 
@@ -115,8 +119,9 @@ def pdf_parse_main(
         pipe.pipe_classify()
 
         # If no model data was passed in, parse with the built-in models
-        if not model_json:
+        if len(model_json) == 0:
             pipe.pipe_analyze()  # parse
+            orig_model_list = copy.deepcopy(pipe.model_list)
 
         # Run the parse
         pipe.pipe_parse()
@@ -126,7 +131,7 @@ def pdf_parse_main(
         md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none")
 
         if is_json_md_dump:
-            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content)
+            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content, orig_model_list)
 
         if is_draw_visualization_bbox:
             draw_visualization_bbox(pipe.pdf_mid_data['pdf_info'], pdf_bytes, output_path, pdf_name)
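
The net effect in this demo: json_md_dump no longer reaches into pipe.model_list itself; the caller decides where the model output comes from and passes it in. A minimal sketch of that sourcing logic (the load_model_list helper and its parameterization are hypothetical, condensed from the diff above):

    import copy
    import json

    def load_model_list(model_json_path, pipe=None):
        # Prefer a user-supplied model JSON; otherwise fall back to the
        # list the pipeline produced after pipe_analyze().
        if model_json_path:
            with open(model_json_path, "r", encoding="utf-8") as f:
                return copy.deepcopy(json.load(f))
        return copy.deepcopy(pipe.model_list) if pipe else []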

+ 1 - 1
magic-pdf.template.json

@@ -15,7 +15,7 @@
         "enable": true
     },
     "table-config": {
-        "model": "tablemaster",
+        "model": "rapid_table",
         "enable": false,
         "max_time": 400
     },

+ 1 - 1
magic_pdf/dict2md/ocr_mkcontent.py

@@ -168,7 +168,7 @@ def merge_para_with_text(para_block):
                         # If the previous line ends with a hyphen, don't append a trailing space
                         if __is_hyphen_at_line_end(content):
                             para_text += content[:-1]
-                        elif len(content) == 1 and content not in ['A', 'I', 'a', 'i']:
+                        elif len(content) == 1 and content not in ['A', 'I', 'a', 'i'] and not content.isdigit():
                             para_text += content
                         else:  # In Western-text contexts, contents need to be space-separated
                             para_text += f"{content} "
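
With the added isdigit() guard, single digits now take the normal word path and keep their trailing space; previously they fell into the no-space branch for single characters. A condensed sketch of the rule (hyphen handling omitted):

    def join(contents):
        para_text = ""
        for content in contents:
            if len(content) == 1 and content not in ['A', 'I', 'a', 'i'] and not content.isdigit():
                para_text += content          # lone symbols get no trailing space
            else:
                para_text += f"{content} "    # words, A/I/a/i, and now digits do
        return para_text

    print(join(["Page", "1", "of", "2"]))
    # after this change: "Page 1 of 2 "  (before: "Page 1of 2")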

+ 3 - 1
magic_pdf/libs/Constants.py

@@ -50,4 +50,6 @@ class MODEL_NAME:
 
     YOLO_V8_MFD = "yolo_v8_mfd"
 
-    UniMerNet_v2_Small = "unimernet_small"
+    UniMerNet_v2_Small = "unimernet_small"
+
+    RAPID_TABLE = "rapid_table"

+ 1 - 1
magic_pdf/libs/config_reader.py

@@ -92,7 +92,7 @@ def get_table_recog_config():
     table_config = config.get('table-config')
     if table_config is None:
         logger.warning(f"'table-config' not found in {CONFIG_FILE_NAME}, use 'False' as default")
-        return json.loads(f'{{"model": "{MODEL_NAME.TABLE_MASTER}","enable": false, "max_time": 400}}')
+        return json.loads(f'{{"model": "{MODEL_NAME.RAPID_TABLE}","enable": false, "max_time": 400}}')
     else:
         return table_config
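
A quick check of what the new fallback evaluates to (MODEL_NAME.RAPID_TABLE is the constant added in Constants.py above; the snippet inlines it for illustration):

    import json

    RAPID_TABLE = "rapid_table"  # stands in for MODEL_NAME.RAPID_TABLE
    default = json.loads(f'{{"model": "{RAPID_TABLE}","enable": false, "max_time": 400}}')
    print(default)  # {'model': 'rapid_table', 'enable': False, 'max_time': 400}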
 
 

+ 10 - 4
magic_pdf/libs/draw_bbox.py

@@ -369,10 +369,16 @@ def draw_line_sort_bbox(pdf_info, pdf_bytes, out_path, filename):
             if block['type'] in [BlockType.Image, BlockType.Table]:
                 for sub_block in block['blocks']:
                     if sub_block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
-                        for line in sub_block['virtual_lines']:
-                            bbox = line['bbox']
-                            index = line['index']
-                            page_line_list.append({'index': index, 'bbox': bbox})
+                        if len(sub_block['virtual_lines']) > 0 and sub_block['virtual_lines'][0].get('index', None) is not None:
+                            for line in sub_block['virtual_lines']:
+                                bbox = line['bbox']
+                                index = line['index']
+                                page_line_list.append({'index': index, 'bbox': bbox})
+                        else:
+                            for line in sub_block['lines']:
+                                bbox = line['bbox']
+                                index = line['index']
+                                page_line_list.append({'index': index, 'bbox': bbox})
                     elif sub_block['type'] in [BlockType.ImageCaption, BlockType.TableCaption, BlockType.ImageFootnote, BlockType.TableFootnote]:
                         for line in sub_block['lines']:
                             bbox = line['bbox']
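
Condensed, the new logic prefers virtual_lines only when they actually carry a reading-order index, and otherwise falls back to the block's real lines. A sketch of that selection as a standalone helper (the pick_sorted_lines name is hypothetical):

    def pick_sorted_lines(sub_block):
        # Use virtual_lines when present and indexed; otherwise use the
        # real lines, which always carry an 'index'.
        vlines = sub_block.get('virtual_lines', [])
        if len(vlines) > 0 and vlines[0].get('index', None) is not None:
            return vlines
        return sub_block['lines']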

+ 42 - 297
magic_pdf/model/pdf_extract_kit.py

@@ -1,195 +1,28 @@
+import numpy as np
+import torch
 from loguru import logger
 import os
 import time
-from pathlib import Path
-import shutil
-from magic_pdf.libs.Constants import *
-from magic_pdf.libs.clean_memory import clean_memory
-from magic_pdf.model.model_list import AtomicModel
+import cv2
+import yaml
+from PIL import Image
 
 os.environ['NO_ALBUMENTATIONS_UPDATE'] = '1'  # stop albumentations from checking for updates
 os.environ['YOLO_VERBOSE'] = 'False'  # disable yolo logger
+
 try:
-    import cv2
-    import yaml
-    import argparse
-    import numpy as np
-    import torch
     import torchtext
 
     if torchtext.__version__ >= "0.18.0":
         torchtext.disable_torchtext_deprecation_warning()
-    from PIL import Image
-    from torchvision import transforms
-    from torch.utils.data import Dataset, DataLoader
-    from ultralytics import YOLO
-    from unimernet.common.config import Config
-    import unimernet.tasks as tasks
-    from unimernet.processors import load_processor
-    from doclayout_yolo import YOLOv10
-
-except ImportError as e:
-    logger.exception(e)
-    logger.error(
-        'Required dependency not installed, please install by \n'
-        '"pip install magic-pdf[full] --extra-index-url https://myhloli.github.io/wheels/"')
-    exit(1)
-
-from magic_pdf.model.pek_sub_modules.layoutlmv3.model_init import Layoutlmv3_Predictor
-from magic_pdf.model.pek_sub_modules.post_process import latex_rm_whitespace
-from magic_pdf.model.pek_sub_modules.self_modify import ModifiedPaddleOCR
-from magic_pdf.model.pek_sub_modules.structeqtable.StructTableModel import StructTableModel
-from magic_pdf.model.ppTableModel import ppTableModel
-
-
-def table_model_init(table_model_type, model_path, max_time, _device_='cpu'):
-    if table_model_type == MODEL_NAME.STRUCT_EQTABLE:
-        table_model = StructTableModel(model_path, max_time=max_time)
-    elif table_model_type == MODEL_NAME.TABLE_MASTER:
-        config = {
-            "model_dir": model_path,
-            "device": _device_
-        }
-        table_model = ppTableModel(config)
-    else:
-        logger.error("table model type not allow")
-        exit(1)
-    return table_model
-
-
-def mfd_model_init(weight):
-    mfd_model = YOLO(weight)
-    return mfd_model
-
-
-def mfr_model_init(weight_dir, cfg_path, _device_='cpu'):
-    args = argparse.Namespace(cfg_path=cfg_path, options=None)
-    cfg = Config(args)
-    cfg.config.model.pretrained = os.path.join(weight_dir, "pytorch_model.pth")
-    cfg.config.model.model_config.model_name = weight_dir
-    cfg.config.model.tokenizer_config.path = weight_dir
-    task = tasks.setup_task(cfg)
-    model = task.build_model(cfg)
-    model.to(_device_)
-    model.eval()
-    vis_processor = load_processor('formula_image_eval', cfg.config.datasets.formula_rec_eval.vis_processor.eval)
-    mfr_transform = transforms.Compose([vis_processor, ])
-    return [model, mfr_transform]
-
-
-def layout_model_init(weight, config_file, device):
-    model = Layoutlmv3_Predictor(weight, config_file, device)
-    return model
-
-
-def doclayout_yolo_model_init(weight):
-    model = YOLOv10(weight)
-    return model
-
-
-def ocr_model_init(show_log: bool = False, det_db_box_thresh=0.3, lang=None, use_dilation=True, det_db_unclip_ratio=1.8):
-    if lang is not None:
-        model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=det_db_box_thresh, lang=lang, use_dilation=use_dilation, det_db_unclip_ratio=det_db_unclip_ratio)
-    else:
-        model = ModifiedPaddleOCR(show_log=show_log, det_db_box_thresh=det_db_box_thresh, use_dilation=use_dilation, det_db_unclip_ratio=det_db_unclip_ratio)
-    return model
-
-
-class MathDataset(Dataset):
-    def __init__(self, image_paths, transform=None):
-        self.image_paths = image_paths
-        self.transform = transform
-
-    def __len__(self):
-        return len(self.image_paths)
-
-    def __getitem__(self, idx):
-        # if not pil image, then convert to pil image
-        if isinstance(self.image_paths[idx], str):
-            raw_image = Image.open(self.image_paths[idx])
-        else:
-            raw_image = self.image_paths[idx]
-        if self.transform:
-            image = self.transform(raw_image)
-            return image
-
-
-class AtomModelSingleton:
-    _instance = None
-    _models = {}
-
-    def __new__(cls, *args, **kwargs):
-        if cls._instance is None:
-            cls._instance = super().__new__(cls)
-        return cls._instance
-
-    def get_atom_model(self, atom_model_name: str, **kwargs):
-        lang = kwargs.get("lang", None)
-        layout_model_name = kwargs.get("layout_model_name", None)
-        key = (atom_model_name, layout_model_name, lang)
-        if key not in self._models:
-            self._models[key] = atom_model_init(model_name=atom_model_name, **kwargs)
-        return self._models[key]
-
-
-def atom_model_init(model_name: str, **kwargs):
-
-    if model_name == AtomicModel.Layout:
-        if kwargs.get("layout_model_name") == MODEL_NAME.LAYOUTLMv3:
-            atom_model = layout_model_init(
-                kwargs.get("layout_weights"),
-                kwargs.get("layout_config_file"),
-                kwargs.get("device")
-            )
-        elif kwargs.get("layout_model_name") == MODEL_NAME.DocLayout_YOLO:
-            atom_model = doclayout_yolo_model_init(
-                kwargs.get("doclayout_yolo_weights"),
-            )
-    elif model_name == AtomicModel.MFD:
-        atom_model = mfd_model_init(
-            kwargs.get("mfd_weights")
-        )
-    elif model_name == AtomicModel.MFR:
-        atom_model = mfr_model_init(
-            kwargs.get("mfr_weight_dir"),
-            kwargs.get("mfr_cfg_path"),
-            kwargs.get("device")
-        )
-    elif model_name == AtomicModel.OCR:
-        atom_model = ocr_model_init(
-            kwargs.get("ocr_show_log"),
-            kwargs.get("det_db_box_thresh"),
-            kwargs.get("lang")
-        )
-    elif model_name == AtomicModel.Table:
-        atom_model = table_model_init(
-            kwargs.get("table_model_name"),
-            kwargs.get("table_model_path"),
-            kwargs.get("table_max_time"),
-            kwargs.get("device")
-        )
-    else:
-        logger.error("model name not allow")
-        exit(1)
-
-    return atom_model
-
+except ImportError:
+    pass
 
-#  Unified crop img logic
-def crop_img(input_res, input_pil_img, crop_paste_x=0, crop_paste_y=0):
-    crop_xmin, crop_ymin = int(input_res['poly'][0]), int(input_res['poly'][1])
-    crop_xmax, crop_ymax = int(input_res['poly'][4]), int(input_res['poly'][5])
-    # Create a white background with an additional width and height of 50
-    crop_new_width = crop_xmax - crop_xmin + crop_paste_x * 2
-    crop_new_height = crop_ymax - crop_ymin + crop_paste_y * 2
-    return_image = Image.new('RGB', (crop_new_width, crop_new_height), 'white')
-
-    # Crop image
-    crop_box = (crop_xmin, crop_ymin, crop_xmax, crop_ymax)
-    cropped_img = input_pil_img.crop(crop_box)
-    return_image.paste(cropped_img, (crop_paste_x, crop_paste_y))
-    return_list = [crop_paste_x, crop_paste_y, crop_xmin, crop_ymin, crop_xmax, crop_ymax, crop_new_width, crop_new_height]
-    return return_image, return_list
+from magic_pdf.libs.Constants import *
+from magic_pdf.model.model_list import AtomicModel
+from magic_pdf.model.sub_modules.model_init import AtomModelSingleton
+from magic_pdf.model.sub_modules.model_utils import get_res_list_from_layout_res, crop_img, clean_vram
+from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import get_adjusted_mfdetrec_res, get_ocr_result_list
 
 
 class CustomPEKModel:
@@ -226,7 +59,7 @@ class CustomPEKModel:
         self.table_config = kwargs.get("table_config")
         self.apply_table = self.table_config.get("enable", False)
         self.table_max_time = self.table_config.get("max_time", TABLE_MAX_TIME_VALUE)
-        self.table_model_name = self.table_config.get("model", MODEL_NAME.TABLE_MASTER)
+        self.table_model_name = self.table_config.get("model", MODEL_NAME.RAPID_TABLE)
 
         # OCR config
         self.apply_ocr = ocr
@@ -235,7 +68,8 @@ class CustomPEKModel:
         logger.info(
             "DocAnalysis init, this may take some times, layout_model: {}, apply_formula: {}, apply_ocr: {}, "
             "apply_table: {}, table_model: {}, lang: {}".format(
-                self.layout_model_name, self.apply_formula, self.apply_ocr, self.apply_table, self.table_model_name, self.lang
+                self.layout_model_name, self.apply_formula, self.apply_ocr, self.apply_table, self.table_model_name,
+                self.lang
             )
         )
         # Initialize the parsing scheme
@@ -248,17 +82,17 @@ class CustomPEKModel:
 
         # Initialize formula recognition
         if self.apply_formula:
-
             # Initialize the formula detection model
             self.mfd_model = atom_model_manager.get_atom_model(
                 atom_model_name=AtomicModel.MFD,
-                mfd_weights=str(os.path.join(models_dir, self.configs["weights"][self.mfd_model_name]))
+                mfd_weights=str(os.path.join(models_dir, self.configs["weights"][self.mfd_model_name])),
+                device=self.device
             )
 
             # Initialize the formula parsing model
             mfr_weight_dir = str(os.path.join(models_dir, self.configs["weights"][self.mfr_model_name]))
             mfr_cfg_path = str(os.path.join(model_config_dir, "UniMERNet", "demo.yaml"))
-            self.mfr_model, self.mfr_transform = atom_model_manager.get_atom_model(
+            self.mfr_model = atom_model_manager.get_atom_model(
                 atom_model_name=AtomicModel.MFR,
                 mfr_weight_dir=mfr_weight_dir,
                 mfr_cfg_path=mfr_cfg_path,
@@ -278,7 +112,8 @@ class CustomPEKModel:
             self.layout_model = atom_model_manager.get_atom_model(
                 atom_model_name=AtomicModel.Layout,
                 layout_model_name=MODEL_NAME.DocLayout_YOLO,
-                doclayout_yolo_weights=str(os.path.join(models_dir, self.configs['weights'][self.layout_model_name]))
+                doclayout_yolo_weights=str(os.path.join(models_dir, self.configs['weights'][self.layout_model_name])),
+                device=self.device
             )
         # Initialize OCR
         if self.apply_ocr:
@@ -305,26 +140,15 @@ class CustomPEKModel:
 
         page_start = time.time()
 
-        latex_filling_list = []
-        mf_image_list = []
-
         # Layout detection
         layout_start = time.time()
+        layout_res = []
         if self.layout_model_name == MODEL_NAME.LAYOUTLMv3:
             # layoutlmv3
             layout_res = self.layout_model(image, ignore_catids=[])
         elif self.layout_model_name == MODEL_NAME.DocLayout_YOLO:
             # doclayout_yolo
-            layout_res = []
-            doclayout_yolo_res = self.layout_model.predict(image, imgsz=1024, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
-            for xyxy, conf, cla in zip(doclayout_yolo_res.boxes.xyxy.cpu(), doclayout_yolo_res.boxes.conf.cpu(), doclayout_yolo_res.boxes.cls.cpu()):
-                xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
-                new_item = {
-                    'category_id': int(cla.item()),
-                    'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
-                    'score': round(float(conf.item()), 3),
-                }
-                layout_res.append(new_item)
+            layout_res = self.layout_model.predict(image)
         layout_cost = round(time.time() - layout_start, 2)
         logger.info(f"layout detection time: {layout_cost}")
 
@@ -333,59 +157,21 @@ class CustomPEKModel:
         if self.apply_formula:
             # Formula detection
             mfd_start = time.time()
-            mfd_res = self.mfd_model.predict(image, imgsz=1888, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
+            mfd_res = self.mfd_model.predict(image)
             logger.info(f"mfd time: {round(time.time() - mfd_start, 2)}")
-            for xyxy, conf, cla in zip(mfd_res.boxes.xyxy.cpu(), mfd_res.boxes.conf.cpu(), mfd_res.boxes.cls.cpu()):
-                xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
-                new_item = {
-                    'category_id': 13 + int(cla.item()),
-                    'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
-                    'score': round(float(conf.item()), 2),
-                    'latex': '',
-                }
-                layout_res.append(new_item)
-                latex_filling_list.append(new_item)
-                bbox_img = pil_img.crop((xmin, ymin, xmax, ymax))
-                mf_image_list.append(bbox_img)
 
             # Formula recognition
             mfr_start = time.time()
-            dataset = MathDataset(mf_image_list, transform=self.mfr_transform)
-            dataloader = DataLoader(dataset, batch_size=64, num_workers=0)
-            mfr_res = []
-            for mf_img in dataloader:
-                mf_img = mf_img.to(self.device)
-                with torch.no_grad():
-                    output = self.mfr_model.generate({'image': mf_img})
-                mfr_res.extend(output['pred_str'])
-            for res, latex in zip(latex_filling_list, mfr_res):
-                res['latex'] = latex_rm_whitespace(latex)
+            formula_list = self.mfr_model.predict(mfd_res, image)
+            layout_res.extend(formula_list)
             mfr_cost = round(time.time() - mfr_start, 2)
-            logger.info(f"formula nums: {len(mf_image_list)}, mfr time: {mfr_cost}")
-
-        # Select regions for OCR / formula regions / table regions
-        ocr_res_list = []
-        table_res_list = []
-        single_page_mfdetrec_res = []
-        for res in layout_res:
-            if int(res['category_id']) in [13, 14]:
-                single_page_mfdetrec_res.append({
-                    "bbox": [int(res['poly'][0]), int(res['poly'][1]),
-                             int(res['poly'][4]), int(res['poly'][5])],
-                })
-            elif int(res['category_id']) in [0, 1, 2, 4, 6, 7]:
-                ocr_res_list.append(res)
-            elif int(res['category_id']) in [5]:
-                table_res_list.append(res)
-
-        if torch.cuda.is_available() and self.device != 'cpu':
-            properties = torch.cuda.get_device_properties(self.device)
-            total_memory = properties.total_memory / (1024 ** 3)  # convert bytes to GB
-            if total_memory <= 10:
-                gc_start = time.time()
-                clean_memory()
-                gc_time = round(time.time() - gc_start, 2)
-                logger.info(f"gc time: {gc_time}")
+            logger.info(f"formula nums: {len(formula_list)}, mfr time: {mfr_cost}")
+
+        # Free VRAM
+        clean_vram(self.device, vram_threshold=8)
+
+        # Extract the OCR, table and formula regions from layout_res
+        ocr_res_list, table_res_list, single_page_mfdetrec_res = get_res_list_from_layout_res(layout_res)
 
         # OCR recognition
         if self.apply_ocr:
@@ -393,23 +179,7 @@ class CustomPEKModel:
             # Process each area that requires OCR processing
             for res in ocr_res_list:
                 new_image, useful_list = crop_img(res, pil_img, crop_paste_x=50, crop_paste_y=50)
-                paste_x, paste_y, xmin, ymin, xmax, ymax, new_width, new_height = useful_list
-                # Adjust the coordinates of the formula area
-                adjusted_mfdetrec_res = []
-                for mf_res in single_page_mfdetrec_res:
-                    mf_xmin, mf_ymin, mf_xmax, mf_ymax = mf_res["bbox"]
-                    # Adjust the coordinates of the formula area to the coordinates relative to the cropping area
-                    x0 = mf_xmin - xmin + paste_x
-                    y0 = mf_ymin - ymin + paste_y
-                    x1 = mf_xmax - xmin + paste_x
-                    y1 = mf_ymax - ymin + paste_y
-                    # Filter formula blocks outside the graph
-                    if any([x1 < 0, y1 < 0]) or any([x0 > new_width, y0 > new_height]):
-                        continue
-                    else:
-                        adjusted_mfdetrec_res.append({
-                            "bbox": [x0, y0, x1, y1],
-                        })
+                adjusted_mfdetrec_res = get_adjusted_mfdetrec_res(single_page_mfdetrec_res, useful_list)
 
                 # OCR recognition
                 new_image = cv2.cvtColor(np.asarray(new_image), cv2.COLOR_RGB2BGR)
@@ -417,22 +187,8 @@ class CustomPEKModel:
 
                 # Integration results
                 if ocr_res:
-                    for box_ocr_res in ocr_res:
-                        p1, p2, p3, p4 = box_ocr_res[0]
-                        text, score = box_ocr_res[1]
-
-                        # Convert the coordinates back to the original coordinate system
-                        p1 = [p1[0] - paste_x + xmin, p1[1] - paste_y + ymin]
-                        p2 = [p2[0] - paste_x + xmin, p2[1] - paste_y + ymin]
-                        p3 = [p3[0] - paste_x + xmin, p3[1] - paste_y + ymin]
-                        p4 = [p4[0] - paste_x + xmin, p4[1] - paste_y + ymin]
-
-                        layout_res.append({
-                            'category_id': 15,
-                            'poly': p1 + p2 + p3 + p4,
-                            'score': round(score, 2),
-                            'text': text,
-                        })
+                    ocr_result_list = get_ocr_result_list(ocr_res, useful_list)
+                    layout_res.extend(ocr_result_list)
 
             ocr_cost = round(time.time() - ocr_start, 2)
             logger.info(f"ocr time: {ocr_cost}")
@@ -443,41 +199,30 @@ class CustomPEKModel:
             for res in table_res_list:
                 new_image, _ = crop_img(res, pil_img)
                 single_table_start_time = time.time()
-                # logger.info("------------------table recognition processing begins-----------------")
-                latex_code = None
                 html_code = None
                 if self.table_model_name == MODEL_NAME.STRUCT_EQTABLE:
                     with torch.no_grad():
                         table_result = self.table_model.predict(new_image, "html")
                         if len(table_result) > 0:
                             html_code = table_result[0]
-                else:
+                elif self.table_model_name == MODEL_NAME.TABLE_MASTER:
                     html_code = self.table_model.img2html(new_image)
-
+                elif self.table_model_name == MODEL_NAME.RAPID_TABLE:
+                    html_code, table_cell_bboxes, elapse = self.table_model.predict(new_image)
                 run_time = time.time() - single_table_start_time
-                # logger.info(f"------------table recognition processing ends within {run_time}s-----")
                 if run_time > self.table_max_time:
-                    logger.warning(f"------------table recognition processing exceeds max time {self.table_max_time}s----------")
+                    logger.warning(f"table recognition processing exceeds max time {self.table_max_time}s")
                 # Check whether a valid result came back
-
-                if latex_code:
-                    expected_ending = latex_code.strip().endswith('end{tabular}') or latex_code.strip().endswith('end{table}')
-                    if expected_ending:
-                        res["latex"] = latex_code
-                    else:
-                        logger.warning(f"table recognition processing fails, not found expected LaTeX table end")
-                elif html_code:
+                if html_code:
                     expected_ending = html_code.strip().endswith('</html>') or html_code.strip().endswith('</table>')
                     if expected_ending:
                         res["html"] = html_code
                     else:
                         logger.warning(f"table recognition processing fails, not found expected HTML table end")
                 else:
-                    logger.warning(f"table recognition processing fails, not get latex or html return")
+                    logger.warning(f"table recognition processing fails, not get html return")
             logger.info(f"table time: {round(time.time() - table_start, 2)}")
 
         logger.info(f"-----page total time: {round(time.time() - page_start, 2)}-----")
 
         return layout_res
-
-
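
The atom-model plumbing removed above did not disappear: the new imports pull AtomModelSingleton from magic_pdf.model.sub_modules.model_init and the crop/VRAM/OCR helpers from model_utils and ocr_utils. For reference, the caching pattern condensed from the deleted code (the relocated version may differ in detail; atom_model_init is stubbed here for illustration):

    def atom_model_init(model_name: str, **kwargs):
        # Stub for illustration; the real initializer builds the layout,
        # MFD, MFR, OCR or table model selected by model_name.
        return object()

    class AtomModelSingleton:
        _instance = None
        _models = {}

        def __new__(cls, *args, **kwargs):
            if cls._instance is None:
                cls._instance = super().__new__(cls)
            return cls._instance

        def get_atom_model(self, atom_model_name: str, **kwargs):
            # Models are cached per (model name, layout model, language) key,
            # so repeated pages reuse the same loaded weights.
            key = (atom_model_name, kwargs.get("layout_model_name"), kwargs.get("lang"))
            if key not in self._models:
                self._models[key] = atom_model_init(model_name=atom_model_name, **kwargs)
            return self._models[key]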

+ 0 - 36
magic_pdf/model/pek_sub_modules/post_process.py

@@ -1,36 +0,0 @@
-import re
-
-def layout_rm_equation(layout_res):
-    rm_idxs = []
-    for idx, ele in enumerate(layout_res['layout_dets']):
-        if ele['category_id'] == 10:
-            rm_idxs.append(idx)
-    
-    for idx in rm_idxs[::-1]:
-        del layout_res['layout_dets'][idx]
-    return layout_res
-
-
-def get_croped_image(image_pil, bbox):
-    x_min, y_min, x_max, y_max = bbox
-    croped_img = image_pil.crop((x_min, y_min, x_max, y_max))
-    return croped_img
-
-
-def latex_rm_whitespace(s: str):
-    """Remove unnecessary whitespace from LaTeX code.
-    """
-    text_reg = r'(\\(operatorname|mathrm|text|mathbf)\s?\*? {.*?})'
-    letter = '[a-zA-Z]'
-    noletter = '[\W_^\d]'
-    names = [x[0].replace(' ', '') for x in re.findall(text_reg, s)]
-    s = re.sub(text_reg, lambda match: str(names.pop(0)), s)
-    news = s
-    while True:
-        s = news
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', news)
-        news = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', news)
-        if news == s:
-            break
-    return s
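
A worked example of what the deleted latex_rm_whitespace did, reusing its regexes (where this logic lives after the refactor is not shown in this diff):

    import re

    letter, noletter = r'[a-zA-Z]', r'[\W_^\d]'
    s, prev = r"\mathrm{softmax} ( x _ { i } )", None
    while prev != s:  # collapse whitespace to a fixed point, as the loop above did
        prev = s
        s = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
        s = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', s)
        s = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', s)
    print(s)  # \mathrm{softmax}(x_{i})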

+ 0 - 388
magic_pdf/model/pek_sub_modules/self_modify.py

@@ -1,388 +0,0 @@
-import time
-import copy
-import base64
-import cv2
-import numpy as np
-from io import BytesIO
-from PIL import Image
-
-from paddleocr import PaddleOCR
-from paddleocr.ppocr.utils.logging import get_logger
-from paddleocr.ppocr.utils.utility import check_and_read, alpha_to_color, binarize_img
-from paddleocr.tools.infer.utility import draw_ocr_box_txt, get_rotate_crop_image, get_minarea_rect_crop
-
-from magic_pdf.libs.boxbase import __is_overlaps_y_exceeds_threshold
-from magic_pdf.pre_proc.ocr_dict_merge import merge_spans_to_line
-
-logger = get_logger()
-
-
-def img_decode(content: bytes):
-    np_arr = np.frombuffer(content, dtype=np.uint8)
-    return cv2.imdecode(np_arr, cv2.IMREAD_UNCHANGED)
-
-
-def check_img(img):
-    if isinstance(img, bytes):
-        img = img_decode(img)
-    if isinstance(img, str):
-        image_file = img
-        img, flag_gif, flag_pdf = check_and_read(image_file)
-        if not flag_gif and not flag_pdf:
-            with open(image_file, 'rb') as f:
-                img_str = f.read()
-                img = img_decode(img_str)
-            if img is None:
-                try:
-                    buf = BytesIO()
-                    image = BytesIO(img_str)
-                    im = Image.open(image)
-                    rgb = im.convert('RGB')
-                    rgb.save(buf, 'jpeg')
-                    buf.seek(0)
-                    image_bytes = buf.read()
-                    data_base64 = str(base64.b64encode(image_bytes),
-                                      encoding="utf-8")
-                    image_decode = base64.b64decode(data_base64)
-                    img_array = np.frombuffer(image_decode, np.uint8)
-                    img = cv2.imdecode(img_array, cv2.IMREAD_COLOR)
-                except:
-                    logger.error("error in loading image:{}".format(image_file))
-                    return None
-        if img is None:
-            logger.error("error in loading image:{}".format(image_file))
-            return None
-    if isinstance(img, np.ndarray) and len(img.shape) == 2:
-        img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
-
-    return img
-
-
-def sorted_boxes(dt_boxes):
-    """
-    Sort text boxes in order from top to bottom, left to right
-    args:
-        dt_boxes(array):detected text boxes with shape [4, 2]
-    return:
-        sorted boxes(array) with shape [4, 2]
-    """
-    num_boxes = dt_boxes.shape[0]
-    sorted_boxes = sorted(dt_boxes, key=lambda x: (x[0][1], x[0][0]))
-    _boxes = list(sorted_boxes)
-
-    for i in range(num_boxes - 1):
-        for j in range(i, -1, -1):
-            if abs(_boxes[j + 1][0][1] - _boxes[j][0][1]) < 10 and \
-                    (_boxes[j + 1][0][0] < _boxes[j][0][0]):
-                tmp = _boxes[j]
-                _boxes[j] = _boxes[j + 1]
-                _boxes[j + 1] = tmp
-            else:
-                break
-    return _boxes
-
-
-def bbox_to_points(bbox):
-    """Convert a bbox into an array of four corner points."""
-    x0, y0, x1, y1 = bbox
-    return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]]).astype('float32')
-
-
-def points_to_bbox(points):
-    """Convert an array of four corner points into a bbox."""
-    x0, y0 = points[0]
-    x1, _ = points[1]
-    _, y1 = points[2]
-    return [x0, y0, x1, y1]
-
-
-def merge_intervals(intervals):
-    # Sort the intervals based on the start value
-    intervals.sort(key=lambda x: x[0])
-
-    merged = []
-    for interval in intervals:
-        # If the list of merged intervals is empty or if the current
-        # interval does not overlap with the previous, simply append it.
-        if not merged or merged[-1][1] < interval[0]:
-            merged.append(interval)
-        else:
-            # Otherwise, there is overlap, so we merge the current and previous intervals.
-            merged[-1][1] = max(merged[-1][1], interval[1])
-
-    return merged
-
-
-def remove_intervals(original, masks):
-    # Merge all mask intervals
-    merged_masks = merge_intervals(masks)
-
-    result = []
-    original_start, original_end = original
-
-    for mask in merged_masks:
-        mask_start, mask_end = mask
-
-        # If the mask starts after the original range, ignore it
-        if mask_start > original_end:
-            continue
-
-        # If the mask ends before the original range starts, ignore it
-        if mask_end < original_start:
-            continue
-
-        # Remove the masked part from the original range
-        if original_start < mask_start:
-            result.append([original_start, mask_start - 1])
-
-        original_start = max(mask_end + 1, original_start)
-
-    # Add the remaining part of the original range, if any
-    if original_start <= original_end:
-        result.append([original_start, original_end])
-
-    return result
-
-
-def update_det_boxes(dt_boxes, mfd_res):
-    new_dt_boxes = []
-    for text_box in dt_boxes:
-        text_bbox = points_to_bbox(text_box)
-        masks_list = []
-        for mf_box in mfd_res:
-            mf_bbox = mf_box['bbox']
-            if __is_overlaps_y_exceeds_threshold(text_bbox, mf_bbox):
-                masks_list.append([mf_bbox[0], mf_bbox[2]])
-        text_x_range = [text_bbox[0], text_bbox[2]]
-        text_remove_mask_range = remove_intervals(text_x_range, masks_list)
-        temp_dt_box = []
-        for text_remove_mask in text_remove_mask_range:
-            temp_dt_box.append(bbox_to_points([text_remove_mask[0], text_bbox[1], text_remove_mask[1], text_bbox[3]]))
-        if len(temp_dt_box) > 0:
-            new_dt_boxes.extend(temp_dt_box)
-    return new_dt_boxes
-
-
-def merge_overlapping_spans(spans):
-    """
-    Merges overlapping spans on the same line.
-
-    :param spans: A list of span coordinates [(x1, y1, x2, y2), ...]
-    :return: A list of merged spans
-    """
-    # Return an empty list if the input spans list is empty
-    if not spans:
-        return []
-
-    # Sort spans by their starting x-coordinate
-    spans.sort(key=lambda x: x[0])
-
-    # Initialize the list of merged spans
-    merged = []
-    for span in spans:
-        # Unpack span coordinates
-        x1, y1, x2, y2 = span
-        # If the merged list is empty or there's no horizontal overlap, add the span directly
-        if not merged or merged[-1][2] < x1:
-            merged.append(span)
-        else:
-            # If there is horizontal overlap, merge the current span with the previous one
-            last_span = merged.pop()
-            # Update the merged span's top-left corner to the smaller (x1, y1) and bottom-right to the larger (x2, y2)
-            x1 = min(last_span[0], x1)
-            y1 = min(last_span[1], y1)
-            x2 = max(last_span[2], x2)
-            y2 = max(last_span[3], y2)
-            # Add the merged span back to the list
-            merged.append((x1, y1, x2, y2))
-
-    # Return the list of merged spans
-    return merged
-
-
-def merge_det_boxes(dt_boxes):
-    """
-    Merge detection boxes.
-
-    This function takes a list of detected bounding boxes, each represented by four corner points.
-    The goal is to merge these bounding boxes into larger text regions.
-
-    Parameters:
-    dt_boxes (list): A list containing multiple text detection boxes, where each box is defined by four corner points.
-
-    Returns:
-    list: A list containing the merged text regions, where each region is represented by four corner points.
-    """
-    # Convert the detection boxes into a dictionary format with bounding boxes and type
-    dt_boxes_dict_list = []
-    for text_box in dt_boxes:
-        text_bbox = points_to_bbox(text_box)
-        text_box_dict = {
-            'bbox': text_bbox,
-            'type': 'text',
-        }
-        dt_boxes_dict_list.append(text_box_dict)
-
-    # Merge adjacent text regions into lines
-    lines = merge_spans_to_line(dt_boxes_dict_list)
-
-    # Initialize a new list for storing the merged text regions
-    new_dt_boxes = []
-    for line in lines:
-        line_bbox_list = []
-        for span in line:
-            line_bbox_list.append(span['bbox'])
-
-        # Merge overlapping text regions within the same line
-        merged_spans = merge_overlapping_spans(line_bbox_list)
-
-        # Convert the merged text regions back to point format and add them to the new detection box list
-        for span in merged_spans:
-            new_dt_boxes.append(bbox_to_points(span))
-
-    return new_dt_boxes
-
-
-class ModifiedPaddleOCR(PaddleOCR):
-    def ocr(self, img, det=True, rec=True, cls=True, bin=False, inv=False, mfd_res=None, alpha_color=(255, 255, 255)):
-        """
-        OCR with PaddleOCR
-        args:
-            img: img for OCR, support ndarray, img_path and list or ndarray
-            det: use text detection or not. If False, only rec will be exec. Default is True
-            rec: use text recognition or not. If False, only det will be exec. Default is True
-            cls: use angle classifier or not. Default is True. If True, the text with rotation of 180 degrees can be recognized. If no text is rotated by 180 degrees, use cls=False to get better performance. Text with rotation of 90 or 270 degrees can be recognized even if cls=False.
-            bin: binarize image to black and white. Default is False.
-            inv: invert image colors. Default is False.
-            alpha_color: set RGB color Tuple for transparent parts replacement. Default is pure white.
-        """
-        assert isinstance(img, (np.ndarray, list, str, bytes))
-        if isinstance(img, list) and det == True:
-            logger.error('When input a list of images, det must be false')
-            exit(0)
-        if cls == True and self.use_angle_cls == False:
-            pass
-            # logger.warning(
-            #     'Since the angle classifier is not initialized, it will not be used during the forward process'
-            # )
-
-        img = check_img(img)
-        # for infer pdf file
-        if isinstance(img, list):
-            if self.page_num > len(img) or self.page_num == 0:
-                self.page_num = len(img)
-            imgs = img[:self.page_num]
-        else:
-            imgs = [img]
-
-        def preprocess_image(_image):
-            _image = alpha_to_color(_image, alpha_color)
-            if inv:
-                _image = cv2.bitwise_not(_image)
-            if bin:
-                _image = binarize_img(_image)
-            return _image
-
-        if det and rec:
-            ocr_res = []
-            for idx, img in enumerate(imgs):
-                img = preprocess_image(img)
-                dt_boxes, rec_res, _ = self.__call__(img, cls, mfd_res=mfd_res)
-                if not dt_boxes and not rec_res:
-                    ocr_res.append(None)
-                    continue
-                tmp_res = [[box.tolist(), res]
-                           for box, res in zip(dt_boxes, rec_res)]
-                ocr_res.append(tmp_res)
-            return ocr_res
-        elif det and not rec:
-            ocr_res = []
-            for idx, img in enumerate(imgs):
-                img = preprocess_image(img)
-                dt_boxes, elapse = self.text_detector(img)
-                if not dt_boxes:
-                    ocr_res.append(None)
-                    continue
-                tmp_res = [box.tolist() for box in dt_boxes]
-                ocr_res.append(tmp_res)
-            return ocr_res
-        else:
-            ocr_res = []
-            cls_res = []
-            for idx, img in enumerate(imgs):
-                if not isinstance(img, list):
-                    img = preprocess_image(img)
-                    img = [img]
-                if self.use_angle_cls and cls:
-                    img, cls_res_tmp, elapse = self.text_classifier(img)
-                    if not rec:
-                        cls_res.append(cls_res_tmp)
-                rec_res, elapse = self.text_recognizer(img)
-                ocr_res.append(rec_res)
-            if not rec:
-                return cls_res
-            return ocr_res
-
-    def __call__(self, img, cls=True, mfd_res=None):
-        time_dict = {'det': 0, 'rec': 0, 'cls': 0, 'all': 0}
-
-        if img is None:
-            logger.debug("no valid image provided")
-            return None, None, time_dict
-
-        start = time.time()
-        ori_im = img.copy()
-        dt_boxes, elapse = self.text_detector(img)
-        time_dict['det'] = elapse
-
-        if dt_boxes is None:
-            logger.debug("no dt_boxes found, elapsed : {}".format(elapse))
-            end = time.time()
-            time_dict['all'] = end - start
-            return None, None, time_dict
-        else:
-            logger.debug("dt_boxes num : {}, elapsed : {}".format(
-                len(dt_boxes), elapse))
-        img_crop_list = []
-
-        dt_boxes = sorted_boxes(dt_boxes)
-
-        dt_boxes = merge_det_boxes(dt_boxes)
-
-        if mfd_res:
-            bef = time.time()
-            dt_boxes = update_det_boxes(dt_boxes, mfd_res)
-            aft = time.time()
-            logger.debug("split text box by formula, new dt_boxes num : {}, elapsed : {}".format(
-                len(dt_boxes), aft - bef))
-
-        for bno in range(len(dt_boxes)):
-            tmp_box = copy.deepcopy(dt_boxes[bno])
-            if self.args.det_box_type == "quad":
-                img_crop = get_rotate_crop_image(ori_im, tmp_box)
-            else:
-                img_crop = get_minarea_rect_crop(ori_im, tmp_box)
-            img_crop_list.append(img_crop)
-        if self.use_angle_cls and cls:
-            img_crop_list, angle_list, elapse = self.text_classifier(
-                img_crop_list)
-            time_dict['cls'] = elapse
-            logger.debug("cls num  : {}, elapsed : {}".format(
-                len(img_crop_list), elapse))
-
-        rec_res, elapse = self.text_recognizer(img_crop_list)
-        time_dict['rec'] = elapse
-        logger.debug("rec_res num  : {}, elapsed : {}".format(
-            len(rec_res), elapse))
-        if self.args.save_crop_res:
-            self.draw_crop_rec_res(self.args.crop_res_save_dir, img_crop_list,
-                                   rec_res)
-        filter_boxes, filter_rec_res = [], []
-        for box, rec_result in zip(dt_boxes, rec_res):
-            text, score = rec_result
-            if score >= self.drop_score:
-                filter_boxes.append(box)
-                filter_rec_res.append(rec_result)
-        end = time.time()
-        time_dict['all'] = end - start
-        return filter_boxes, filter_rec_res, time_dict

+ 0 - 0
magic_pdf/model/pek_sub_modules/__init__.py → magic_pdf/model/sub_modules/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/__init__.py → magic_pdf/model/sub_modules/layout/__init__.py


+ 21 - 0
magic_pdf/model/sub_modules/layout/doclayout_yolo/DocLayoutYOLO.py

@@ -0,0 +1,21 @@
+from doclayout_yolo import YOLOv10
+
+
+class DocLayoutYOLOModel(object):
+    def __init__(self, weight, device):
+        self.model = YOLOv10(weight)
+        self.device = device
+
+    def predict(self, image):
+        layout_res = []
+        doclayout_yolo_res = self.model.predict(image, imgsz=1024, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
+        for xyxy, conf, cla in zip(doclayout_yolo_res.boxes.xyxy.cpu(), doclayout_yolo_res.boxes.conf.cpu(),
+                                   doclayout_yolo_res.boxes.cls.cpu()):
+            xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
+            new_item = {
+                'category_id': int(cla.item()),
+                'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
+                'score': round(float(conf.item()), 3),
+            }
+            layout_res.append(new_item)
+        return layout_res
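
A minimal usage sketch for the new wrapper (weight path and page image are hypothetical placeholders; requires the doclayout_yolo package):

import cv2
from magic_pdf.model.sub_modules.layout.doclayout_yolo.DocLayoutYOLO import DocLayoutYOLOModel

# hypothetical checkpoint path and page image
model = DocLayoutYOLOModel(weight="models/doclayout_yolo_ft.pt", device="cuda")
image = cv2.imread("page_0.png")
for region in model.predict(image):
    # each item carries a category_id, an 8-value poly (four corners), and a score
    print(region["category_id"], region["score"], region["poly"])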

+ 0 - 0
magic_pdf/model/pek_sub_modules/structeqtable/__init__.py → magic_pdf/model/sub_modules/layout/doclayout_yolo/__init__.py


+ 0 - 0
magic_pdf/model/v3/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/backbone.py → magic_pdf/model/sub_modules/layout/layoutlmv3/backbone.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/beit.py → magic_pdf/model/sub_modules/layout/layoutlmv3/beit.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/deit.py → magic_pdf/model/sub_modules/layout/layoutlmv3/deit.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/cord.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/cord.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/data_collator.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/data_collator.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/funsd.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/funsd.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/image_utils.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/image_utils.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/data/xfund.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/data/xfund.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/__init__.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/__init__.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/configuration_layoutlmv3.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/configuration_layoutlmv3.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/modeling_layoutlmv3.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/modeling_layoutlmv3.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3_fast.py → magic_pdf/model/sub_modules/layout/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3_fast.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/model_init.py → magic_pdf/model/sub_modules/layout/layoutlmv3/model_init.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/rcnn_vl.py → magic_pdf/model/sub_modules/layout/layoutlmv3/rcnn_vl.py


+ 0 - 0
magic_pdf/model/pek_sub_modules/layoutlmv3/visualizer.py → magic_pdf/model/sub_modules/layout/layoutlmv3/visualizer.py


+ 0 - 0
tests/test_data/__init__.py → magic_pdf/model/sub_modules/mfd/__init__.py


+ 12 - 0
magic_pdf/model/sub_modules/mfd/yolov8/YOLOv8.py

@@ -0,0 +1,12 @@
+from ultralytics import YOLO
+
+
+class YOLOv8MFDModel(object):
+    def __init__(self, weight, device='cpu'):
+        self.mfd_model = YOLO(weight)
+        self.device = device
+
+    def predict(self, image):
+        mfd_res = self.mfd_model.predict(image, imgsz=1888, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
+        return mfd_res
+
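
A minimal sketch of the MFD wrapper (weight path and image hypothetical); it returns the raw ultralytics result object, whose boxes tensor is consumed by the formula recognizer below:

import cv2
from magic_pdf.model.sub_modules.mfd.yolov8.YOLOv8 import YOLOv8MFDModel

mfd = YOLOv8MFDModel(weight="models/yolo_v8_mfd.pt", device="cuda")  # hypothetical path
mfd_res = mfd.predict(cv2.imread("page_0.png"))  # hypothetical page image
print(mfd_res.boxes.xyxy.shape)  # one row per detected formula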

+ 0 - 0
tests/test_data/data_reader_writer/__init__.py → magic_pdf/model/sub_modules/mfd/yolov8/__init__.py


+ 0 - 0
tests/test_data/io/__init__.py → magic_pdf/model/sub_modules/mfr/__init__.py


+ 98 - 0
magic_pdf/model/sub_modules/mfr/unimernet/Unimernet.py

@@ -0,0 +1,98 @@
+import os
+import argparse
+import re
+
+from PIL import Image
+import torch
+from torch.utils.data import Dataset, DataLoader
+from torchvision import transforms
+from unimernet.common.config import Config
+import unimernet.tasks as tasks
+from unimernet.processors import load_processor
+
+
+class MathDataset(Dataset):
+    def __init__(self, image_paths, transform=None):
+        self.image_paths = image_paths
+        self.transform = transform
+
+    def __len__(self):
+        return len(self.image_paths)
+
+    def __getitem__(self, idx):
+        # open from disk when given a path; otherwise assume it is already a PIL image
+        if isinstance(self.image_paths[idx], str):
+            raw_image = Image.open(self.image_paths[idx])
+        else:
+            raw_image = self.image_paths[idx]
+        if self.transform:
+            image = self.transform(raw_image)
+            return image
+
+
+def latex_rm_whitespace(s: str):
+    """Remove unnecessary whitespace from LaTeX code.
+    """
+    text_reg = r'(\\(operatorname|mathrm|text|mathbf)\s?\*? {.*?})'
+    letter = '[a-zA-Z]'
+    noletter = r'[\W_^\d]'  # raw string, so \W and \d are not treated as invalid escapes
+    names = [x[0].replace(' ', '') for x in re.findall(text_reg, s)]
+    s = re.sub(text_reg, lambda match: str(names.pop(0)), s)
+    news = s
+    while True:
+        s = news
+        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
+        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', news)
+        news = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', news)
+        if news == s:
+            break
+    return s
+
+
+class UnimernetModel(object):
+    def __init__(self, weight_dir, cfg_path, _device_='cpu'):
+
+        args = argparse.Namespace(cfg_path=cfg_path, options=None)
+        cfg = Config(args)
+        cfg.config.model.pretrained = os.path.join(weight_dir, "pytorch_model.pth")
+        cfg.config.model.model_config.model_name = weight_dir
+        cfg.config.model.tokenizer_config.path = weight_dir
+        task = tasks.setup_task(cfg)
+        self.model = task.build_model(cfg)
+        self.device = _device_
+        self.model.to(_device_)
+        self.model.eval()
+        vis_processor = load_processor('formula_image_eval', cfg.config.datasets.formula_rec_eval.vis_processor.eval)
+        self.mfr_transform = transforms.Compose([vis_processor, ])
+
+    def predict(self, mfd_res, image):
+
+        formula_list = []
+        mf_image_list = []
+        for xyxy, conf, cla in zip(mfd_res.boxes.xyxy.cpu(), mfd_res.boxes.conf.cpu(), mfd_res.boxes.cls.cpu()):
+            xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
+            new_item = {
+                'category_id': 13 + int(cla.item()),
+                'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
+                'score': round(float(conf.item()), 2),
+                'latex': '',
+            }
+            formula_list.append(new_item)
+            pil_img = Image.fromarray(image)
+            bbox_img = pil_img.crop((xmin, ymin, xmax, ymax))
+            mf_image_list.append(bbox_img)
+
+        dataset = MathDataset(mf_image_list, transform=self.mfr_transform)
+        dataloader = DataLoader(dataset, batch_size=64, num_workers=0)
+        mfr_res = []
+        for mf_img in dataloader:
+            mf_img = mf_img.to(self.device)
+            with torch.no_grad():
+                output = self.model.generate({'image': mf_img})
+            mfr_res.extend(output['pred_str'])
+        for res, latex in zip(formula_list, mfr_res):
+            res['latex'] = latex_rm_whitespace(latex)
+        return formula_list
+
+
+
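
A minimal sketch chaining MFD detection into UnimernetModel (all paths hypothetical; weight_dir must contain pytorch_model.pth plus the tokenizer/config files referenced above):

import cv2
from magic_pdf.model.sub_modules.mfd.yolov8.YOLOv8 import YOLOv8MFDModel
from magic_pdf.model.sub_modules.mfr.unimernet.Unimernet import UnimernetModel

image = cv2.imread("page_0.png")  # hypothetical page image
mfd_res = YOLOv8MFDModel("models/yolo_v8_mfd.pt", "cuda").predict(image)
mfr = UnimernetModel("models/unimernet", "configs/unimernet.yaml", _device_="cuda")
for formula in mfr.predict(mfd_res, image):
    print(formula["score"], formula["latex"])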

+ 0 - 0
tests/test_model/__init__.py → magic_pdf/model/sub_modules/mfr/unimernet/__init__.py


+ 144 - 0
magic_pdf/model/sub_modules/model_init.py

@@ -0,0 +1,144 @@
+from loguru import logger
+
+from magic_pdf.libs.Constants import MODEL_NAME
+from magic_pdf.model.model_list import AtomicModel
+from magic_pdf.model.sub_modules.layout.doclayout_yolo.DocLayoutYOLO import DocLayoutYOLOModel
+from magic_pdf.model.sub_modules.layout.layoutlmv3.model_init import Layoutlmv3_Predictor
+from magic_pdf.model.sub_modules.mfd.yolov8.YOLOv8 import YOLOv8MFDModel
+
+from magic_pdf.model.sub_modules.mfr.unimernet.Unimernet import UnimernetModel
+from magic_pdf.model.sub_modules.ocr.paddleocr.ppocr_273_mod import ModifiedPaddleOCR
+# from magic_pdf.model.sub_modules.ocr.paddleocr.ppocr_291_mod import ModifiedPaddleOCR
+from magic_pdf.model.sub_modules.table.structeqtable.struct_eqtable import StructTableModel
+from magic_pdf.model.sub_modules.table.tablemaster.tablemaster_paddle import TableMasterPaddleModel
+from magic_pdf.model.sub_modules.table.rapidtable.rapid_table import RapidTableModel
+
+
+def table_model_init(table_model_type, model_path, max_time, _device_='cpu'):
+    if table_model_type == MODEL_NAME.STRUCT_EQTABLE:
+        table_model = StructTableModel(model_path, max_new_tokens=2048, max_time=max_time)
+    elif table_model_type == MODEL_NAME.TABLE_MASTER:
+        config = {
+            "model_dir": model_path,
+            "device": _device_
+        }
+        table_model = TableMasterPaddleModel(config)
+    elif table_model_type == MODEL_NAME.RAPID_TABLE:
+        table_model = RapidTableModel()
+    else:
+        logger.error("table model type not allowed")
+        exit(1)
+
+    return table_model
+
+
+def mfd_model_init(weight, device='cpu'):
+    mfd_model = YOLOv8MFDModel(weight, device)
+    return mfd_model
+
+
+def mfr_model_init(weight_dir, cfg_path, device='cpu'):
+    mfr_model = UnimernetModel(weight_dir, cfg_path, device)
+    return mfr_model
+
+
+def layout_model_init(weight, config_file, device):
+    model = Layoutlmv3_Predictor(weight, config_file, device)
+    return model
+
+
+def doclayout_yolo_model_init(weight, device='cpu'):
+    model = DocLayoutYOLOModel(weight, device)
+    return model
+
+
+def ocr_model_init(show_log: bool = False,
+                   det_db_box_thresh=0.3,
+                   lang=None,
+                   use_dilation=True,
+                   det_db_unclip_ratio=1.8,
+                   ):
+    if lang is not None:
+        model = ModifiedPaddleOCR(
+            show_log=show_log,
+            det_db_box_thresh=det_db_box_thresh,
+            lang=lang,
+            use_dilation=use_dilation,
+            det_db_unclip_ratio=det_db_unclip_ratio,
+        )
+    else:
+        model = ModifiedPaddleOCR(
+            show_log=show_log,
+            det_db_box_thresh=det_db_box_thresh,
+            use_dilation=use_dilation,
+            det_db_unclip_ratio=det_db_unclip_ratio,
+            # use_angle_cls=True,
+        )
+    return model
+
+
+class AtomModelSingleton:
+    _instance = None
+    _models = {}
+
+    def __new__(cls, *args, **kwargs):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
+
+    def get_atom_model(self, atom_model_name: str, **kwargs):
+        lang = kwargs.get("lang", None)
+        layout_model_name = kwargs.get("layout_model_name", None)
+        key = (atom_model_name, layout_model_name, lang)
+        if key not in self._models:
+            self._models[key] = atom_model_init(model_name=atom_model_name, **kwargs)
+        return self._models[key]
+
+
+def atom_model_init(model_name: str, **kwargs):
+    atom_model = None
+    if model_name == AtomicModel.Layout:
+        if kwargs.get("layout_model_name") == MODEL_NAME.LAYOUTLMv3:
+            atom_model = layout_model_init(
+                kwargs.get("layout_weights"),
+                kwargs.get("layout_config_file"),
+                kwargs.get("device")
+            )
+        elif kwargs.get("layout_model_name") == MODEL_NAME.DocLayout_YOLO:
+            atom_model = doclayout_yolo_model_init(
+                kwargs.get("doclayout_yolo_weights"),
+                kwargs.get("device")
+            )
+    elif model_name == AtomicModel.MFD:
+        atom_model = mfd_model_init(
+            kwargs.get("mfd_weights"),
+            kwargs.get("device")
+        )
+    elif model_name == AtomicModel.MFR:
+        atom_model = mfr_model_init(
+            kwargs.get("mfr_weight_dir"),
+            kwargs.get("mfr_cfg_path"),
+            kwargs.get("device")
+        )
+    elif model_name == AtomicModel.OCR:
+        atom_model = ocr_model_init(
+            kwargs.get("ocr_show_log"),
+            kwargs.get("det_db_box_thresh"),
+            kwargs.get("lang")
+        )
+    elif model_name == AtomicModel.Table:
+        atom_model = table_model_init(
+            kwargs.get("table_model_name"),
+            kwargs.get("table_model_path"),
+            kwargs.get("table_max_time"),
+            kwargs.get("device")
+        )
+    else:
+        logger.error("model name not allowed")
+        exit(1)
+
+    if atom_model is None:
+        logger.error("model init failed")
+        exit(1)
+    else:
+        return atom_model
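
A minimal sketch of the cached entry point; keyword names follow atom_model_init above, and the threshold/lang values are hypothetical:

from magic_pdf.model.model_list import AtomicModel
from magic_pdf.model.sub_modules.model_init import AtomModelSingleton

singleton = AtomModelSingleton()
ocr_model = singleton.get_atom_model(
    atom_model_name=AtomicModel.OCR,
    ocr_show_log=False,
    det_db_box_thresh=0.3,
    lang="en",
)
# a second call with the same (model name, layout model name, lang) key returns the cached instance
assert ocr_model is singleton.get_atom_model(
    atom_model_name=AtomicModel.OCR,
    ocr_show_log=False,
    det_db_box_thresh=0.3,
    lang="en",
)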

+ 51 - 0
magic_pdf/model/sub_modules/model_utils.py

@@ -0,0 +1,51 @@
+import time
+
+import torch
+from PIL import Image
+from loguru import logger
+
+from magic_pdf.libs.clean_memory import clean_memory
+
+
+def crop_img(input_res, input_pil_img, crop_paste_x=0, crop_paste_y=0):
+    crop_xmin, crop_ymin = int(input_res['poly'][0]), int(input_res['poly'][1])
+    crop_xmax, crop_ymax = int(input_res['poly'][4]), int(input_res['poly'][5])
+    # Create a white background padded by crop_paste_x / crop_paste_y on each side
+    crop_new_width = crop_xmax - crop_xmin + crop_paste_x * 2
+    crop_new_height = crop_ymax - crop_ymin + crop_paste_y * 2
+    return_image = Image.new('RGB', (crop_new_width, crop_new_height), 'white')
+
+    # Crop image
+    crop_box = (crop_xmin, crop_ymin, crop_xmax, crop_ymax)
+    cropped_img = input_pil_img.crop(crop_box)
+    return_image.paste(cropped_img, (crop_paste_x, crop_paste_y))
+    return_list = [crop_paste_x, crop_paste_y, crop_xmin, crop_ymin, crop_xmax, crop_ymax, crop_new_width, crop_new_height]
+    return return_image, return_list
+
+
+# Select regions for OCR / formula regions / table regions
+def get_res_list_from_layout_res(layout_res):
+    ocr_res_list = []
+    table_res_list = []
+    single_page_mfdetrec_res = []
+    for res in layout_res:
+        if int(res['category_id']) in [13, 14]:
+            single_page_mfdetrec_res.append({
+                "bbox": [int(res['poly'][0]), int(res['poly'][1]),
+                         int(res['poly'][4]), int(res['poly'][5])],
+            })
+        elif int(res['category_id']) in [0, 1, 2, 4, 6, 7]:
+            ocr_res_list.append(res)
+        elif int(res['category_id']) in [5]:
+            table_res_list.append(res)
+    return ocr_res_list, table_res_list, single_page_mfdetrec_res
+
+
+def clean_vram(device, vram_threshold=8):
+    if torch.cuda.is_available() and device != 'cpu':
+        total_memory = torch.cuda.get_device_properties(device).total_memory / (1024 ** 3)  # convert bytes to GB
+        if total_memory <= vram_threshold:
+            gc_start = time.time()
+            clean_memory()
+            gc_time = round(time.time() - gc_start, 2)
+            logger.info(f"gc time: {gc_time}")
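
A minimal sketch of the helpers above (layout result and page image hypothetical): get_res_list_from_layout_res routes regions by category, and crop_img pastes each crop onto a padded white canvas before OCR:

from PIL import Image
from magic_pdf.model.sub_modules.model_utils import crop_img, get_res_list_from_layout_res

page_img = Image.open("page_0.png")  # hypothetical rendered page
layout_res = [
    {"category_id": 1, "poly": [100, 100, 400, 100, 400, 200, 100, 200], "score": 0.9},   # text
    {"category_id": 14, "poly": [120, 300, 380, 300, 380, 340, 120, 340], "score": 0.8},  # formula
]
ocr_res_list, table_res_list, mfd_res = get_res_list_from_layout_res(layout_res)
for res in ocr_res_list:
    # pad the crop with 25 px of white on each side
    crop, useful_list = crop_img(res, page_img, crop_paste_x=25, crop_paste_y=25)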

+ 0 - 0
tests/test_tools/__init__.py → magic_pdf/model/sub_modules/ocr/__init__.py


+ 0 - 0
tests/assets/more_para_test_samples/gift_files.txt → magic_pdf/model/sub_modules/ocr/paddleocr/__init__.py


+ 259 - 0
magic_pdf/model/sub_modules/ocr/paddleocr/ocr_utils.py

@@ -0,0 +1,259 @@
+import math
+
+import numpy as np
+from loguru import logger
+
+from magic_pdf.libs.boxbase import __is_overlaps_y_exceeds_threshold
+from magic_pdf.pre_proc.ocr_dict_merge import merge_spans_to_line
+
+
+def bbox_to_points(bbox):
+    """Convert a bbox into an array of its four corner points."""
+    x0, y0, x1, y1 = bbox
+    return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]]).astype('float32')
+
+
+def points_to_bbox(points):
+    """Convert an array of four corner points into bbox format."""
+    x0, y0 = points[0]
+    x1, _ = points[1]
+    _, y1 = points[2]
+    return [x0, y0, x1, y1]
+
+
+def merge_intervals(intervals):
+    # Sort the intervals based on the start value
+    intervals.sort(key=lambda x: x[0])
+
+    merged = []
+    for interval in intervals:
+        # If the list of merged intervals is empty or if the current
+        # interval does not overlap with the previous, simply append it.
+        if not merged or merged[-1][1] < interval[0]:
+            merged.append(interval)
+        else:
+            # Otherwise, there is overlap, so we merge the current and previous intervals.
+            merged[-1][1] = max(merged[-1][1], interval[1])
+
+    return merged
+
+
+def remove_intervals(original, masks):
+    # Merge all mask intervals
+    merged_masks = merge_intervals(masks)
+
+    result = []
+    original_start, original_end = original
+
+    for mask in merged_masks:
+        mask_start, mask_end = mask
+
+        # If the mask starts after the original range, ignore it
+        if mask_start > original_end:
+            continue
+
+        # If the mask ends before the original range starts, ignore it
+        if mask_end < original_start:
+            continue
+
+        # Remove the masked part from the original range
+        if original_start < mask_start:
+            result.append([original_start, mask_start - 1])
+
+        original_start = max(mask_end + 1, original_start)
+
+    # Add the remaining part of the original range, if any
+    if original_start <= original_end:
+        result.append([original_start, original_end])
+
+    return result
+
+
+def update_det_boxes(dt_boxes, mfd_res):
+    new_dt_boxes = []
+    for text_box in dt_boxes:
+        text_bbox = points_to_bbox(text_box)
+        masks_list = []
+        for mf_box in mfd_res:
+            mf_bbox = mf_box['bbox']
+            if __is_overlaps_y_exceeds_threshold(text_bbox, mf_bbox):
+                masks_list.append([mf_bbox[0], mf_bbox[2]])
+        text_x_range = [text_bbox[0], text_bbox[2]]
+        text_remove_mask_range = remove_intervals(text_x_range, masks_list)
+        temp_dt_box = []
+        for text_remove_mask in text_remove_mask_range:
+            temp_dt_box.append(bbox_to_points([text_remove_mask[0], text_bbox[1], text_remove_mask[1], text_bbox[3]]))
+        if len(temp_dt_box) > 0:
+            new_dt_boxes.extend(temp_dt_box)
+    return new_dt_boxes
+
+
+def merge_overlapping_spans(spans):
+    """
+    Merges overlapping spans on the same line.
+
+    :param spans: A list of span coordinates [(x1, y1, x2, y2), ...]
+    :return: A list of merged spans
+    """
+    # Return an empty list if the input spans list is empty
+    if not spans:
+        return []
+
+    # Sort spans by their starting x-coordinate
+    spans.sort(key=lambda x: x[0])
+
+    # Initialize the list of merged spans
+    merged = []
+    for span in spans:
+        # Unpack span coordinates
+        x1, y1, x2, y2 = span
+        # If the merged list is empty or there's no horizontal overlap, add the span directly
+        if not merged or merged[-1][2] < x1:
+            merged.append(span)
+        else:
+            # If there is horizontal overlap, merge the current span with the previous one
+            last_span = merged.pop()
+            # Update the merged span's top-left corner to the smaller (x1, y1) and bottom-right to the larger (x2, y2)
+            x1 = min(last_span[0], x1)
+            y1 = min(last_span[1], y1)
+            x2 = max(last_span[2], x2)
+            y2 = max(last_span[3], y2)
+            # Add the merged span back to the list
+            merged.append((x1, y1, x2, y2))
+
+    # Return the list of merged spans
+    return merged
+
+
+def merge_det_boxes(dt_boxes):
+    """
+    Merge detection boxes.
+
+    This function takes a list of detected bounding boxes, each represented by four corner points.
+    The goal is to merge these bounding boxes into larger text regions.
+
+    Parameters:
+    dt_boxes (list): A list containing multiple text detection boxes, where each box is defined by four corner points.
+
+    Returns:
+    list: A list containing the merged text regions, where each region is represented by four corner points.
+    """
+    # Convert the detection boxes into a dictionary format with bounding boxes and type
+    dt_boxes_dict_list = []
+    angle_boxes_list = []
+    for text_box in dt_boxes:
+        text_bbox = points_to_bbox(text_box)
+        if text_bbox[2] <= text_bbox[0] or text_bbox[3] <= text_bbox[1]:
+            angle_boxes_list.append(text_box)
+            continue
+        text_box_dict = {
+            'bbox': text_bbox,
+            'type': 'text',
+        }
+        dt_boxes_dict_list.append(text_box_dict)
+
+    # Merge adjacent text regions into lines
+    lines = merge_spans_to_line(dt_boxes_dict_list)
+
+    # Initialize a new list for storing the merged text regions
+    new_dt_boxes = []
+    for line in lines:
+        line_bbox_list = []
+        for span in line:
+            line_bbox_list.append(span['bbox'])
+
+        # Merge overlapping text regions within the same line
+        merged_spans = merge_overlapping_spans(line_bbox_list)
+
+        # Convert the merged text regions back to point format and add them to the new detection box list
+        for span in merged_spans:
+            new_dt_boxes.append(bbox_to_points(span))
+
+    new_dt_boxes.extend(angle_boxes_list)
+
+    return new_dt_boxes
+
+
+def get_adjusted_mfdetrec_res(single_page_mfdetrec_res, useful_list):
+    paste_x, paste_y, xmin, ymin, xmax, ymax, new_width, new_height = useful_list
+    # Adjust the coordinates of the formula area
+    adjusted_mfdetrec_res = []
+    for mf_res in single_page_mfdetrec_res:
+        mf_xmin, mf_ymin, mf_xmax, mf_ymax = mf_res["bbox"]
+        # Adjust the coordinates of the formula area to the coordinates relative to the cropping area
+        x0 = mf_xmin - xmin + paste_x
+        y0 = mf_ymin - ymin + paste_y
+        x1 = mf_xmax - xmin + paste_x
+        y1 = mf_ymax - ymin + paste_y
+        # Filter formula blocks outside the graph
+        if any([x1 < 0, y1 < 0]) or any([x0 > new_width, y0 > new_height]):
+            continue
+        else:
+            adjusted_mfdetrec_res.append({
+                "bbox": [x0, y0, x1, y1],
+            })
+    return adjusted_mfdetrec_res
+
+
+def get_ocr_result_list(ocr_res, useful_list):
+    paste_x, paste_y, xmin, ymin, xmax, ymax, new_width, new_height = useful_list
+    ocr_result_list = []
+    for box_ocr_res in ocr_res:
+
+        p1, p2, p3, p4 = box_ocr_res[0]
+        text, score = box_ocr_res[1]
+        average_angle_degrees = calculate_angle_degrees(box_ocr_res[0])
+        if average_angle_degrees > 0.5:
+            # logger.info(f"average_angle_degrees: {average_angle_degrees}, text: {text}")
+            # the box tilts more than 0.5 degrees from the x-axis, so rectify its boundary
+            # compute the geometric center
+            x_center = sum(point[0] for point in box_ocr_res[0]) / 4
+            y_center = sum(point[1] for point in box_ocr_res[0]) / 4
+            new_height = ((p4[1] - p1[1]) + (p3[1] - p2[1])) / 2
+            new_width = p3[0] - p1[0]
+            p1 = [x_center - new_width / 2, y_center - new_height / 2]
+            p2 = [x_center + new_width / 2, y_center - new_height / 2]
+            p3 = [x_center + new_width / 2, y_center + new_height / 2]
+            p4 = [x_center - new_width / 2, y_center + new_height / 2]
+
+        # Convert the coordinates back to the original coordinate system
+        p1 = [p1[0] - paste_x + xmin, p1[1] - paste_y + ymin]
+        p2 = [p2[0] - paste_x + xmin, p2[1] - paste_y + ymin]
+        p3 = [p3[0] - paste_x + xmin, p3[1] - paste_y + ymin]
+        p4 = [p4[0] - paste_x + xmin, p4[1] - paste_y + ymin]
+
+        ocr_result_list.append({
+            'category_id': 15,
+            'poly': p1 + p2 + p3 + p4,
+            'score': float(round(score, 2)),
+            'text': text,
+        })
+
+    return ocr_result_list
+
+
+def calculate_angle_degrees(poly):
+    # Endpoints of the two diagonals
+    diagonal1 = (poly[0], poly[2])
+    diagonal2 = (poly[1], poly[3])
+
+    # Slope of a diagonal
+    def slope(p1, p2):
+        return (p2[1] - p1[1]) / (p2[0] - p1[0]) if p2[0] != p1[0] else float('inf')
+
+    slope1 = slope(diagonal1[0], diagonal1[1])
+    slope2 = slope(diagonal2[0], diagonal2[1])
+
+    # Angle between each diagonal and the x-axis, in radians
+    angle1_radians = math.atan(slope1)
+    angle2_radians = math.atan(slope2)
+
+    # Convert radians to degrees
+    angle1_degrees = math.degrees(angle1_radians)
+    angle2_degrees = math.degrees(angle2_radians)
+
+    # Average the two diagonals' angles against the x-axis
+    average_angle_degrees = abs((angle1_degrees + angle2_degrees) / 2)
+    # logger.info(f"average_angle_degrees: {average_angle_degrees}")
+    return average_angle_degrees
+
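
A worked example of the interval arithmetic above (coordinates hypothetical): remove_intervals cuts the masked x-ranges out of a text line, and update_det_boxes applies that per detected formula:

from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import (
    bbox_to_points, points_to_bbox, remove_intervals, update_det_boxes)

# a text line spanning x in [0, 100], with formulas masking [20, 30] and [60, 80]
print(remove_intervals([0, 100], [[20, 30], [60, 80]]))
# -> [[0, 19], [31, 59], [81, 100]]

text_box = bbox_to_points([0, 10, 100, 30])
mfd_res = [{'bbox': [20, 8, 30, 32]}, {'bbox': [60, 8, 80, 32]}]
# each surviving x-interval becomes its own quad at the line's original height
for quad in update_det_boxes([text_box], mfd_res):
    print(points_to_bbox(quad))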

+ 168 - 0
magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_273_mod.py

@@ -0,0 +1,168 @@
+import copy
+import time
+
+import cv2
+import numpy as np
+from paddleocr import PaddleOCR
+from paddleocr.paddleocr import check_img, logger
+from paddleocr.ppocr.utils.utility import alpha_to_color, binarize_img
+from paddleocr.tools.infer.predict_system import sorted_boxes
+from paddleocr.tools.infer.utility import get_rotate_crop_image, get_minarea_rect_crop
+
+from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import update_det_boxes, merge_det_boxes
+
+
+class ModifiedPaddleOCR(PaddleOCR):
+    def ocr(self,
+            img,
+            det=True,
+            rec=True,
+            cls=True,
+            bin=False,
+            inv=False,
+            alpha_color=(255, 255, 255),
+            mfd_res=None,
+            ):
+        """
+        OCR with PaddleOCR
+        args:
+            img: image for OCR; supports ndarray, img_path, and list of ndarray
+            det: use text detection or not. If False, only recognition is executed. Default is True
+            rec: use text recognition or not. If False, only detection is executed. Default is True
+            cls: use the angle classifier or not. Default is True. If True, text rotated by 180 degrees can be recognized. If no text is rotated by 180 degrees, use cls=False for better performance. Text rotated by 90 or 270 degrees can be recognized even with cls=False.
+            bin: binarize image to black and white. Default is False.
+            inv: invert image colors. Default is False.
+            alpha_color: set RGB color Tuple for transparent parts replacement. Default is pure white.
+        """
+        assert isinstance(img, (np.ndarray, list, str, bytes))
+        if isinstance(img, list) and det == True:
+            logger.error('When input a list of images, det must be false')
+            exit(0)
+        if cls == True and self.use_angle_cls == False:
+            pass
+            # logger.warning(
+            #     'Since the angle classifier is not initialized, it will not be used during the forward process'
+            # )
+
+        img = check_img(img)
+        # for infer pdf file
+        if isinstance(img, list):
+            if self.page_num > len(img) or self.page_num == 0:
+                self.page_num = len(img)
+            imgs = img[:self.page_num]
+        else:
+            imgs = [img]
+
+        def preprocess_image(_image):
+            _image = alpha_to_color(_image, alpha_color)
+            if inv:
+                _image = cv2.bitwise_not(_image)
+            if bin:
+                _image = binarize_img(_image)
+            return _image
+
+        if det and rec:
+            ocr_res = []
+            for idx, img in enumerate(imgs):
+                img = preprocess_image(img)
+                dt_boxes, rec_res, _ = self.__call__(img, cls, mfd_res=mfd_res)
+                if not dt_boxes and not rec_res:
+                    ocr_res.append(None)
+                    continue
+                tmp_res = [[box.tolist(), res]
+                           for box, res in zip(dt_boxes, rec_res)]
+                ocr_res.append(tmp_res)
+            return ocr_res
+        elif det and not rec:
+            ocr_res = []
+            for idx, img in enumerate(imgs):
+                img = preprocess_image(img)
+                dt_boxes, elapse = self.text_detector(img)
+                if not dt_boxes:
+                    ocr_res.append(None)
+                    continue
+                tmp_res = [box.tolist() for box in dt_boxes]
+                ocr_res.append(tmp_res)
+            return ocr_res
+        else:
+            ocr_res = []
+            cls_res = []
+            for idx, img in enumerate(imgs):
+                if not isinstance(img, list):
+                    img = preprocess_image(img)
+                    img = [img]
+                if self.use_angle_cls and cls:
+                    img, cls_res_tmp, elapse = self.text_classifier(img)
+                    if not rec:
+                        cls_res.append(cls_res_tmp)
+                rec_res, elapse = self.text_recognizer(img)
+                ocr_res.append(rec_res)
+            if not rec:
+                return cls_res
+            return ocr_res
+
+    def __call__(self, img, cls=True, mfd_res=None):
+        time_dict = {'det': 0, 'rec': 0, 'cls': 0, 'all': 0}
+
+        if img is None:
+            logger.debug("no valid image provided")
+            return None, None, time_dict
+
+        start = time.time()
+        ori_im = img.copy()
+        dt_boxes, elapse = self.text_detector(img)
+        time_dict['det'] = elapse
+
+        if dt_boxes is None:
+            logger.debug("no dt_boxes found, elapsed : {}".format(elapse))
+            end = time.time()
+            time_dict['all'] = end - start
+            return None, None, time_dict
+        else:
+            logger.debug("dt_boxes num : {}, elapsed : {}".format(
+                len(dt_boxes), elapse))
+        img_crop_list = []
+
+        dt_boxes = sorted_boxes(dt_boxes)
+
+        # @todo merging is currently done at the bbox level, which handles tilted text lines poorly; rework it to merge at the poly level
+        # dt_boxes = merge_det_boxes(dt_boxes)
+
+        if mfd_res:
+            bef = time.time()
+            dt_boxes = update_det_boxes(dt_boxes, mfd_res)
+            aft = time.time()
+            logger.debug("split text box by formula, new dt_boxes num : {}, elapsed : {}".format(
+                len(dt_boxes), aft - bef))
+
+        for bno in range(len(dt_boxes)):
+            tmp_box = copy.deepcopy(dt_boxes[bno])
+            if self.args.det_box_type == "quad":
+                img_crop = get_rotate_crop_image(ori_im, tmp_box)
+            else:
+                img_crop = get_minarea_rect_crop(ori_im, tmp_box)
+            img_crop_list.append(img_crop)
+        if self.use_angle_cls and cls:
+            img_crop_list, angle_list, elapse = self.text_classifier(
+                img_crop_list)
+            time_dict['cls'] = elapse
+            logger.debug("cls num  : {}, elapsed : {}".format(
+                len(img_crop_list), elapse))
+
+        rec_res, elapse = self.text_recognizer(img_crop_list)
+        time_dict['rec'] = elapse
+        logger.debug("rec_res num  : {}, elapsed : {}".format(
+            len(rec_res), elapse))
+        if self.args.save_crop_res:
+            self.draw_crop_rec_res(self.args.crop_res_save_dir, img_crop_list,
+                                   rec_res)
+        filter_boxes, filter_rec_res = [], []
+        for box, rec_result in zip(dt_boxes, rec_res):
+            text, score = rec_result
+            if score >= self.drop_score:
+                filter_boxes.append(box)
+                filter_rec_res.append(rec_result)
+        end = time.time()
+        time_dict['all'] = end - start
+        return filter_boxes, filter_rec_res, time_dict
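
A minimal sketch of the patched call (image path and formula bbox hypothetical); mfd_res here is the list-of-dicts form consumed by update_det_boxes, not the raw ultralytics object:

import cv2
from magic_pdf.model.sub_modules.ocr.paddleocr.ppocr_273_mod import ModifiedPaddleOCR

ocr = ModifiedPaddleOCR(show_log=False, det_db_box_thresh=0.3)
img = cv2.imread("page_0.png")
mfd_res = [{'bbox': [120, 300, 380, 340]}]  # formula region to cut out of text lines
for page in ocr.ocr(img, mfd_res=mfd_res):
    if page is None:
        continue
    for box, (text, score) in page:
        print(round(score, 2), text)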

+ 213 - 0
magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_291_mod.py

@@ -0,0 +1,213 @@
+import copy
+import time
+
+
+import cv2
+import numpy as np
+from paddleocr import PaddleOCR
+from paddleocr.paddleocr import check_img, logger
+from paddleocr.ppocr.utils.utility import alpha_to_color, binarize_img
+from paddleocr.tools.infer.predict_system import sorted_boxes
+from paddleocr.tools.infer.utility import slice_generator, merge_fragmented, get_rotate_crop_image, \
+    get_minarea_rect_crop
+
+from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import update_det_boxes
+
+
+class ModifiedPaddleOCR(PaddleOCR):
+
+    def ocr(
+        self,
+        img,
+        det=True,
+        rec=True,
+        cls=True,
+        bin=False,
+        inv=False,
+        alpha_color=(255, 255, 255),
+        slice={},
+        mfd_res=None,
+    ):
+        """
+        OCR with PaddleOCR
+
+        Args:
+            img: Image for OCR. It can be an ndarray, img_path, or a list of ndarrays.
+            det: Use text detection or not. If False, only text recognition will be executed. Default is True.
+            rec: Use text recognition or not. If False, only text detection will be executed. Default is True.
+            cls: Use angle classifier or not. Default is True. If True, the text with a rotation of 180 degrees can be recognized. If no text is rotated by 180 degrees, use cls=False to get better performance.
+            bin: Binarize image to black and white. Default is False.
+            inv: Invert image colors. Default is False.
+            alpha_color: Set RGB color Tuple for transparent parts replacement. Default is pure white.
+            slice: Use sliding window inference for large images. Both det and rec must be True. Requires int values for slice["horizontal_stride"], slice["vertical_stride"], slice["merge_x_thres"], slice["merge_y_thres"] (See doc/doc_en/slice_en.md). Default is {}.
+
+        Returns:
+            If both det and rec are True, returns a list of OCR results for each image. Each OCR result is a list of bounding boxes and recognized text for each detected text region.
+            If det is True and rec is False, returns a list of detected bounding boxes for each image.
+            If det is False and rec is True, returns a list of recognized text for each image.
+            If both det and rec are False, returns a list of angle classification results for each image.
+
+        Raises:
+            AssertionError: If the input image is not of type ndarray, list, str, or bytes.
+            SystemExit: If det is True and the input is a list of images.
+
+        Note:
+            - If the angle classifier is not initialized (use_angle_cls=False), it will not be used during the forward process.
+            - For PDF files, if the input is a list of images and the page_num is specified, only the first page_num images will be processed.
+            - The preprocess_image function is used to preprocess the input image by applying alpha color replacement, inversion, and binarization if specified.
+        """
+        assert isinstance(img, (np.ndarray, list, str, bytes))
+        if isinstance(img, list) and det == True:
+            logger.error("When input a list of images, det must be false")
+            exit(0)
+        if cls == True and self.use_angle_cls == False:
+            logger.warning(
+                "Since the angle classifier is not initialized, it will not be used during the forward process"
+            )
+
+        img, flag_gif, flag_pdf = check_img(img, alpha_color)
+        # for infer pdf file
+        if isinstance(img, list) and flag_pdf:
+            if self.page_num > len(img) or self.page_num == 0:
+                imgs = img
+            else:
+                imgs = img[: self.page_num]
+        else:
+            imgs = [img]
+
+        def preprocess_image(_image):
+            _image = alpha_to_color(_image, alpha_color)
+            if inv:
+                _image = cv2.bitwise_not(_image)
+            if bin:
+                _image = binarize_img(_image)
+            return _image
+
+        if det and rec:
+            ocr_res = []
+            for img in imgs:
+                img = preprocess_image(img)
+                dt_boxes, rec_res, _ = self.__call__(img, cls, slice, mfd_res=mfd_res)
+                if not dt_boxes and not rec_res:
+                    ocr_res.append(None)
+                    continue
+                tmp_res = [[box.tolist(), res] for box, res in zip(dt_boxes, rec_res)]
+                ocr_res.append(tmp_res)
+            return ocr_res
+        elif det and not rec:
+            ocr_res = []
+            for img in imgs:
+                img = preprocess_image(img)
+                dt_boxes, elapse = self.text_detector(img)
+                if dt_boxes.size == 0:
+                    ocr_res.append(None)
+                    continue
+                tmp_res = [box.tolist() for box in dt_boxes]
+                ocr_res.append(tmp_res)
+            return ocr_res
+        else:
+            ocr_res = []
+            cls_res = []
+            for img in imgs:
+                if not isinstance(img, list):
+                    img = preprocess_image(img)
+                    img = [img]
+                if self.use_angle_cls and cls:
+                    img, cls_res_tmp, elapse = self.text_classifier(img)
+                    if not rec:
+                        cls_res.append(cls_res_tmp)
+                rec_res, elapse = self.text_recognizer(img)
+                ocr_res.append(rec_res)
+            if not rec:
+                return cls_res
+            return ocr_res
+
+    def __call__(self, img, cls=True, slice={}, mfd_res=None):
+        time_dict = {"det": 0, "rec": 0, "cls": 0, "all": 0}
+
+        if img is None:
+            logger.debug("no valid image provided")
+            return None, None, time_dict
+
+        start = time.time()
+        ori_im = img.copy()
+        if slice:
+            slice_gen = slice_generator(
+                img,
+                horizontal_stride=slice["horizontal_stride"],
+                vertical_stride=slice["vertical_stride"],
+            )
+            elapsed = []
+            dt_slice_boxes = []
+            for slice_crop, v_start, h_start in slice_gen:
+                dt_boxes, elapse = self.text_detector(slice_crop, use_slice=True)
+                if dt_boxes.size:
+                    dt_boxes[:, :, 0] += h_start
+                    dt_boxes[:, :, 1] += v_start
+                    dt_slice_boxes.append(dt_boxes)
+                    elapsed.append(elapse)
+            dt_boxes = np.concatenate(dt_slice_boxes)
+
+            dt_boxes = merge_fragmented(
+                boxes=dt_boxes,
+                x_threshold=slice["merge_x_thres"],
+                y_threshold=slice["merge_y_thres"],
+            )
+            elapse = sum(elapsed)
+        else:
+            dt_boxes, elapse = self.text_detector(img)
+
+        time_dict["det"] = elapse
+
+        if dt_boxes is None:
+            logger.debug("no dt_boxes found, elapsed : {}".format(elapse))
+            end = time.time()
+            time_dict["all"] = end - start
+            return None, None, time_dict
+        else:
+            logger.debug(
+                "dt_boxes num : {}, elapsed : {}".format(len(dt_boxes), elapse)
+            )
+        img_crop_list = []
+
+        dt_boxes = sorted_boxes(dt_boxes)
+
+        if mfd_res:
+            bef = time.time()
+            dt_boxes = update_det_boxes(dt_boxes, mfd_res)
+            aft = time.time()
+            logger.debug("split text box by formula, new dt_boxes num : {}, elapsed : {}".format(
+                len(dt_boxes), aft - bef))
+
+        for bno in range(len(dt_boxes)):
+            tmp_box = copy.deepcopy(dt_boxes[bno])
+            if self.args.det_box_type == "quad":
+                img_crop = get_rotate_crop_image(ori_im, tmp_box)
+            else:
+                img_crop = get_minarea_rect_crop(ori_im, tmp_box)
+            img_crop_list.append(img_crop)
+        if self.use_angle_cls and cls:
+            img_crop_list, angle_list, elapse = self.text_classifier(img_crop_list)
+            time_dict["cls"] = elapse
+            logger.debug(
+                "cls num  : {}, elapsed : {}".format(len(img_crop_list), elapse)
+            )
+        if len(img_crop_list) > 1000:
+            logger.debug(
+                f"rec crops num: {len(img_crop_list)}, time and memory cost may be large."
+            )
+
+        rec_res, elapse = self.text_recognizer(img_crop_list)
+        time_dict["rec"] = elapse
+        logger.debug("rec_res num  : {}, elapsed : {}".format(len(rec_res), elapse))
+        if self.args.save_crop_res:
+            self.draw_crop_rec_res(self.args.crop_res_save_dir, img_crop_list, rec_res)
+        filter_boxes, filter_rec_res = [], []
+        for box, rec_result in zip(dt_boxes, rec_res):
+            text, score = rec_result[0], rec_result[1]
+            if score >= self.drop_score:
+                filter_boxes.append(box)
+                filter_rec_res.append(rec_result)
+        end = time.time()
+        time_dict["all"] = end - start
+        return filter_boxes, filter_rec_res, time_dict
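
The 2.9.1 variant additionally supports PaddleOCR's sliding-window path for very large pages; a minimal sketch (strides and merge thresholds follow PaddleOCR's slice documentation, image hypothetical):

import cv2
from magic_pdf.model.sub_modules.ocr.paddleocr.ppocr_291_mod import ModifiedPaddleOCR

ocr = ModifiedPaddleOCR(show_log=False)
result = ocr.ocr(
    cv2.imread("long_page.png"),
    slice={"horizontal_stride": 300, "vertical_stride": 500,
           "merge_x_thres": 50, "merge_y_thres": 35},
)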

+ 0 - 0
tests/assets/more_para_test_samples/zlib_files.txt → magic_pdf/model/sub_modules/reading_oreder/__init__.py


+ 0 - 0
magic_pdf/model/sub_modules/reading_oreder/layoutreader/__init__.py


+ 0 - 0
magic_pdf/model/v3/helpers.py → magic_pdf/model/sub_modules/reading_oreder/layoutreader/helpers.py


+ 242 - 0
magic_pdf/model/sub_modules/reading_oreder/layoutreader/xycut.py

@@ -0,0 +1,242 @@
+from typing import List
+import cv2
+import numpy as np
+
+
+def projection_by_bboxes(boxes: np.array, axis: int) -> np.ndarray:
+    """
+    Build a per-pixel projection histogram from a set of bboxes.
+
+    Args:
+        boxes: [N, 4]
+        axis: 0 - project x-coordinates onto the horizontal axis; 1 - project y-coordinates onto the vertical axis
+
+    Returns:
+        1D projection histogram whose length is the maximum coordinate along the projection axis (the image's true side length is not needed, since we only look for gaps between text boxes)
+
+    """
+    assert axis in [0, 1]
+    length = np.max(boxes[:, axis::2])
+    res = np.zeros(length, dtype=int)
+    # TODO: how to remove for loop?
+    for start, end in boxes[:, axis::2]:
+        res[start:end] += 1
+    return res
+
+
+# from: https://dothinking.github.io/2021-06-19-%E9%80%92%E5%BD%92%E6%8A%95%E5%BD%B1%E5%88%86%E5%89%B2%E7%AE%97%E6%B3%95/#:~:text=%E9%80%92%E5%BD%92%E6%8A%95%E5%BD%B1%E5%88%86%E5%89%B2%EF%BC%88Recursive%20XY,%EF%BC%8C%E5%8F%AF%E4%BB%A5%E5%88%92%E5%88%86%E6%AE%B5%E8%90%BD%E3%80%81%E8%A1%8C%E3%80%82
+def split_projection_profile(arr_values: np.array, min_value: float, min_gap: float):
+    """Split projection profile:
+
+    ```
+                              ┌──┐
+         arr_values           │  │       ┌─┐───
+             ┌──┐             │  │       │ │ |
+             │  │             │  │ ┌───┐ │ │min_value
+             │  │<- min_gap ->│  │ │   │ │ │ |
+         ────┴──┴─────────────┴──┴─┴───┴─┴─┴─┴───
+         0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
+    ```
+
+    Args:
+        arr_values (np.array): 1-d array representing the projection profile.
+        min_value (float): Ignore the profile if `arr_value` is less than `min_value`.
+        min_gap (float): Ignore the gap if less than this value.
+
+    Returns:
+        tuple: Start indexes and end indexes of split groups.
+    """
+    # all indexes with projection height exceeding the threshold
+    arr_index = np.where(arr_values > min_value)[0]
+    if not len(arr_index):
+        return
+
+    # find zero intervals between adjacent projections
+    # |  |                    ||
+    # ||||<- zero-interval -> |||||
+    arr_diff = arr_index[1:] - arr_index[0:-1]
+    arr_diff_index = np.where(arr_diff > min_gap)[0]
+    arr_zero_intvl_start = arr_index[arr_diff_index]
+    arr_zero_intvl_end = arr_index[arr_diff_index + 1]
+
+    # convert to index of projection range:
+    # the start index of zero interval is the end index of projection
+    arr_start = np.insert(arr_zero_intvl_end, 0, arr_index[0])
+    arr_end = np.append(arr_zero_intvl_start, arr_index[-1])
+    arr_end += 1  # end index will be excluded as index slice
+
+    return arr_start, arr_end
+
+
+def recursive_xy_cut(boxes: np.ndarray, indices: List[int], res: List[int]):
+    """
+
+    Args:
+        boxes: (N, 4)
+        indices: indices of the boxes in the original data, carried unchanged through the recursion
+        res: accumulates the output (box indices in reading order)
+
+    """
+    # project onto the y-axis
+    assert len(boxes) == len(indices)
+
+    _indices = boxes[:, 1].argsort()
+    y_sorted_boxes = boxes[_indices]
+    y_sorted_indices = indices[_indices]
+
+    # debug_vis(y_sorted_boxes, y_sorted_indices)
+
+    y_projection = projection_by_bboxes(boxes=y_sorted_boxes, axis=1)
+    pos_y = split_projection_profile(y_projection, 0, 1)
+    if not pos_y:
+        return
+
+    arr_y0, arr_y1 = pos_y
+    for r0, r1 in zip(arr_y0, arr_y1):
+        # [r0, r1] is a horizontal band that contains boxes after the y-cut; each band is then cut along x
+        _indices = (r0 <= y_sorted_boxes[:, 1]) & (y_sorted_boxes[:, 1] < r1)
+
+        y_sorted_boxes_chunk = y_sorted_boxes[_indices]
+        y_sorted_indices_chunk = y_sorted_indices[_indices]
+
+        _indices = y_sorted_boxes_chunk[:, 0].argsort()
+        x_sorted_boxes_chunk = y_sorted_boxes_chunk[_indices]
+        x_sorted_indices_chunk = y_sorted_indices_chunk[_indices]
+
+        # project onto the x-axis
+        x_projection = projection_by_bboxes(boxes=x_sorted_boxes_chunk, axis=0)
+        pos_x = split_projection_profile(x_projection, 0, 1)
+        if not pos_x:
+            continue
+
+        arr_x0, arr_x1 = pos_x
+        if len(arr_x0) == 1:
+            # cannot split further along x
+            res.extend(x_sorted_indices_chunk)
+            continue
+
+        # splittable along x, so recurse into each chunk
+        for c0, c1 in zip(arr_x0, arr_x1):
+            _indices = (c0 <= x_sorted_boxes_chunk[:, 0]) & (
+                x_sorted_boxes_chunk[:, 0] < c1
+            )
+            recursive_xy_cut(
+                x_sorted_boxes_chunk[_indices], x_sorted_indices_chunk[_indices], res
+            )
+
+
+def points_to_bbox(points):
+    assert len(points) == 8
+
+    # [x1,y1,x2,y2,x3,y3,x4,y4]
+    left = min(points[::2])
+    right = max(points[::2])
+    top = min(points[1::2])
+    bottom = max(points[1::2])
+
+    left = max(left, 0)
+    top = max(top, 0)
+    right = max(right, 0)
+    bottom = max(bottom, 0)
+    return [left, top, right, bottom]
+
+
+def bbox2points(bbox):
+    left, top, right, bottom = bbox
+    return [left, top, right, top, right, bottom, left, bottom]
+
+
+def vis_polygon(img, points, thickness=2, color=None):
+    br2bl_color = color
+    tl2tr_color = color
+    tr2br_color = color
+    bl2tl_color = color
+    cv2.line(
+        img,
+        (points[0][0], points[0][1]),
+        (points[1][0], points[1][1]),
+        color=tl2tr_color,
+        thickness=thickness,
+    )
+
+    cv2.line(
+        img,
+        (points[1][0], points[1][1]),
+        (points[2][0], points[2][1]),
+        color=tr2br_color,
+        thickness=thickness,
+    )
+
+    cv2.line(
+        img,
+        (points[2][0], points[2][1]),
+        (points[3][0], points[3][1]),
+        color=br2bl_color,
+        thickness=thickness,
+    )
+
+    cv2.line(
+        img,
+        (points[3][0], points[3][1]),
+        (points[0][0], points[0][1]),
+        color=bl2tl_color,
+        thickness=thickness,
+    )
+    return img
+
+
+def vis_points(
+    img: np.ndarray, points, texts: List[str] = None, color=(0, 200, 0)
+) -> np.ndarray:
+    """
+
+    Args:
+        img:
+        points: [N, 8]  8: x1,y1,x2,y2,x3,y3,x4,y4
+        texts:
+        color:
+
+    Returns:
+
+    """
+    points = np.array(points)
+    if texts is not None:
+        assert len(texts) == points.shape[0]
+
+    for i, _points in enumerate(points):
+        vis_polygon(img, _points.reshape(-1, 2), thickness=2, color=color)
+        bbox = points_to_bbox(_points)
+        left, top, right, bottom = bbox
+        cx = (left + right) // 2
+        cy = (top + bottom) // 2
+
+        txt = texts[i]
+        font = cv2.FONT_HERSHEY_SIMPLEX
+        cat_size = cv2.getTextSize(txt, font, 0.5, 2)[0]
+
+        img = cv2.rectangle(
+            img,
+            (cx - 5 * len(txt), cy - cat_size[1] - 5),
+            (cx - 5 * len(txt) + cat_size[0], cy - 5),
+            color,
+            -1,
+        )
+
+        img = cv2.putText(
+            img,
+            txt,
+            (cx - 5 * len(txt), cy - 5),
+            font,
+            0.5,
+            (255, 255, 255),
+            thickness=1,
+            lineType=cv2.LINE_AA,
+        )
+
+    return img
+
+
+def vis_polygons_with_index(image, points):
+    texts = [str(i) for i in range(len(points))]
+    res_img = vis_points(image.copy(), points, texts)
+    return res_img
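
A worked example of recursive_xy_cut recovering reading order for a two-column page with a footer (boxes hypothetical, in [x0, y0, x1, y1] pixels):

import numpy as np
from magic_pdf.model.sub_modules.reading_oreder.layoutreader.xycut import recursive_xy_cut

boxes = np.asarray([
    [300, 50, 560, 400],   # right column
    [20, 50, 280, 400],    # left column
    [20, 420, 560, 500],   # full-width footer
])
order = []
recursive_xy_cut(boxes, np.arange(len(boxes)), order)
print([int(i) for i in order])  # -> [1, 0, 2]: left column, right column, footer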

+ 0 - 0
magic_pdf/model/sub_modules/table/__init__.py


+ 0 - 0
magic_pdf/model/sub_modules/table/rapidtable/__init__.py


+ 14 - 0
magic_pdf/model/sub_modules/table/rapidtable/rapid_table.py

@@ -0,0 +1,14 @@
+import numpy as np
+from rapid_table import RapidTable
+from rapidocr_paddle import RapidOCR
+
+
+class RapidTableModel(object):
+    def __init__(self):
+        self.table_model = RapidTable()
+        self.ocr_engine = RapidOCR(det_use_cuda=True, cls_use_cuda=True, rec_use_cuda=True)
+
+    def predict(self, image):
+        ocr_result, _ = self.ocr_engine(np.asarray(image))
+        html_code, table_cell_bboxes, elapse = self.table_model(np.asarray(image), ocr_result)
+        return html_code, table_cell_bboxes, elapse
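
A minimal sketch (image path hypothetical); requires the rapid_table and rapidocr_paddle packages plus CUDA for the det/cls/rec flags set in __init__:

from PIL import Image
from magic_pdf.model.sub_modules.table.rapidtable.rapid_table import RapidTableModel

table_model = RapidTableModel()
html_code, table_cell_bboxes, elapse = table_model.predict(Image.open("table.png"))
print(elapse)
print(html_code[:80])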

+ 0 - 0
magic_pdf/model/sub_modules/table/structeqtable/__init__.py


+ 3 - 11
magic_pdf/model/pek_sub_modules/structeqtable/StructTableModel.py → magic_pdf/model/sub_modules/table/structeqtable/struct_eqtable.py

@@ -1,8 +1,8 @@
-import re
-
 import torch
 from struct_eqtable import build_model
 
+from magic_pdf.model.sub_modules.table.table_utils import minify_html
+
 
 class StructTableModel:
     def __init__(self, model_path, max_new_tokens=1024, max_time=60):
@@ -31,15 +31,7 @@ class StructTableModel:
         )
 
         if output_format == "html":
-            results = [self.minify_html(html) for html in results]
+            results = [minify_html(html) for html in results]
 
         return results
 
-    def minify_html(self, html):
-        # Collapse runs of whitespace into a single space
-        html = re.sub(r'\s+', ' ', html)
-        # Strip whitespace around closing angle brackets
-        html = re.sub(r'\s*>\s*', '>', html)
-        # Strip whitespace around opening angle brackets
-        html = re.sub(r'\s*<\s*', '<', html)
-        return html.strip()

+ 11 - 0
magic_pdf/model/sub_modules/table/table_utils.py

@@ -0,0 +1,11 @@
+import re
+
+
+def minify_html(html):
+    # Collapse runs of whitespace into a single space
+    html = re.sub(r'\s+', ' ', html)
+    # Strip whitespace around closing angle brackets
+    html = re.sub(r'\s*>\s*', '>', html)
+    # Strip whitespace around opening angle brackets
+    html = re.sub(r'\s*<\s*', '<', html)
+    return html.strip()
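A quick illustration of what ``minify_html`` does to model-generated table HTML (the input string is made up):

.. code:: python

    html = '''
    <table>
        <tr> <td> 1 </td> </tr>
    </table>
    '''
    print(minify_html(html))
    # -> <table><tr><td>1</td></tr></table>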

+ 0 - 0
magic_pdf/model/sub_modules/table/tablemaster/__init__.py


+ 1 - 1
magic_pdf/model/ppTableModel.py → magic_pdf/model/sub_modules/table/tablemaster/tablemaster_paddle.py

@@ -7,7 +7,7 @@ from PIL import Image
 import numpy as np
 
 
-class ppTableModel(object):
+class TableMasterPaddleModel(object):
     """
         This class is responsible for converting image of table into HTML format using a pre-trained model.
 

+ 13 - 15
magic_pdf/para/para_split_v3.py

@@ -77,14 +77,12 @@ def __is_list_or_index_block(block):
 
         # First line not flush left but flush right, last line flush left but not flush right (the first line may be allowed not to reach the right edge)
         if (first_line['bbox'][0] - block['bbox_fs'][0] > line_height / 2 and
-                # block['bbox_fs'][2] - first_line['bbox'][2] < line_height and
                 abs(last_line['bbox'][0] - block['bbox_fs'][0]) < line_height / 2 and
                 block['bbox_fs'][2] - last_line['bbox'][2] > line_height
         ):
             multiple_para_flag = True
 
         for line in block['lines']:
-
             line_mid_x = (line['bbox'][0] + line['bbox'][2]) / 2
             block_mid_x = (block['bbox_fs'][0] + block['bbox_fs'][2]) / 2
             if (
@@ -102,13 +100,13 @@ def __is_list_or_index_block(block):
                 if span_type == ContentType.Text:
                     line_text += span['content'].strip()
 
+            # Append every line's text, including empty lines, so the list stays aligned with block['lines']
             lines_text_list.append(line_text)
 
             # Count lines flush with the left edge; a line counts as flush when abs(block['bbox_fs'][0] - line['bbox'][0]) < line_height/2
             if abs(block['bbox_fs'][0] - line['bbox'][0]) < line_height / 2:
                 left_close_num += 1
             elif line['bbox'][0] - block['bbox_fs'][0] > line_height:
-                # logger.info(f"{line_text}, {block['bbox_fs']}, {line['bbox']}")
                 left_not_close_num += 1
 
             # Check whether the line is flush with the right edge
@@ -117,7 +115,6 @@ def __is_list_or_index_block(block):
             else:
                 # When the line is not flush right, check whether the gap is large; an eyeballed threshold of ~0.3 of the block width is used
                 closed_area = 0.26 * block_weight
-                # closed_area = 5 * line_height
                 if block['bbox_fs'][2] - line['bbox'][2] > closed_area:
                     right_not_close_num += 1
 
@@ -128,6 +125,7 @@ def __is_list_or_index_block(block):
         num_start_count = 0
         num_end_count = 0
         flag_end_count = 0
+
         if len(lines_text_list) > 0:
             for line_text in lines_text_list:
                 if len(line_text) > 0:
@@ -138,11 +136,10 @@ def __is_list_or_index_block(block):
                     if line_text[-1].isdigit():
                         num_end_count += 1
 
-            if flag_end_count / len(lines_text_list) >= 0.8:
-                line_end_flag = True
-
             if num_start_count / len(lines_text_list) >= 0.8 or num_end_count / len(lines_text_list) >= 0.8:
                 line_num_flag = True
+            if flag_end_count / len(lines_text_list) >= 0.8:
+                line_end_flag = True
 
         # Some tables of contents are not flush on the right; for now a block counts as an index when either the left or the right side is fully flush and the digit rule holds
         if ((left_close_num / len(block['lines']) >= 0.8 or right_close_num / len(block['lines']) >= 0.8)
@@ -176,7 +173,7 @@ def __is_list_or_index_block(block):
                 # Most line items carry an end marker in this case; split items by the end marker
                 elif line_end_flag:
                     for i, line in enumerate(block['lines']):
-                        if lines_text_list[i][-1] in LIST_END_FLAG:
+                        if len(lines_text_list[i]) > 0 and lines_text_list[i][-1] in LIST_END_FLAG:
                             line[ListLineTag.IS_LIST_END_LINE] = True
                             if i + 1 < len(block['lines']):
                                 block['lines'][i + 1][ListLineTag.IS_LIST_START_LINE] = True
@@ -187,17 +184,18 @@ def __is_list_or_index_block(block):
                         if line_start_flag:
                             line[ListLineTag.IS_LIST_START_LINE] = True
                             line_start_flag = False
-                        # elif abs(block['bbox_fs'][2] - line['bbox'][2]) > line_height:
+
                         if abs(block['bbox_fs'][2] - line['bbox'][2]) > 0.1 * block_weight:
                             line[ListLineTag.IS_LIST_END_LINE] = True
                             line_start_flag = True
-            # A special indented ordered list: start lines are not flush left and begin with a digit, end lines end with an IS_LIST_END_LINE marker and their count matches the start lines
-            elif num_start_count >= 2 and num_start_count == flag_end_count:  # keep it simple for now and ignore the not-flush-left case
+            # A special indented ordered list: start lines are not flush left and begin with a digit, end lines end with an IS_LIST_END_FLAG marker and their count matches the start lines
+            elif num_start_count >= 2 and num_start_count == flag_end_count:
                 for i, line in enumerate(block['lines']):
-                    if lines_text_list[i][0].isdigit():
-                        line[ListLineTag.IS_LIST_START_LINE] = True
-                    if lines_text_list[i][-1] in LIST_END_FLAG:
-                        line[ListLineTag.IS_LIST_END_LINE] = True
+                    if len(lines_text_list[i]) > 0:
+                        if lines_text_list[i][0].isdigit():
+                            line[ListLineTag.IS_LIST_START_LINE] = True
+                        if lines_text_list[i][-1] in LIST_END_FLAG:
+                            line[ListLineTag.IS_LIST_END_LINE] = True
             else:
                 # Handle the normal indented list
                 for line in block['lines']:

+ 56 - 19
magic_pdf/pdf_parse_union_core_v2.py

@@ -30,8 +30,8 @@ from magic_pdf.pre_proc.equations_replace import (
 from magic_pdf.pre_proc.ocr_detect_all_bboxes import \
     ocr_prepare_bboxes_for_layout_split_v2
 from magic_pdf.pre_proc.ocr_dict_merge import (fill_spans_in_blocks,
-                                               fix_block_spans,
-                                               fix_discarded_block, fix_block_spans_v2)
+                                               fix_discarded_block,
+                                               fix_block_spans_v2)
 from magic_pdf.pre_proc.ocr_span_list_modify import (
     get_qa_need_list_v2, remove_overlaps_low_confidence_spans,
     remove_overlaps_min_spans)
@@ -164,8 +164,8 @@ class ModelSingleton:
 
 
 def do_predict(boxes: List[List[int]], model) -> List[int]:
-    from magic_pdf.model.v3.helpers import (boxes2inputs, parse_logits,
-                                            prepare_inputs)
+    from magic_pdf.model.sub_modules.reading_oreder.layoutreader.helpers import (boxes2inputs, parse_logits,
+                                                                                 prepare_inputs)
 
     inputs = boxes2inputs(boxes)
     inputs = prepare_inputs(inputs, model)
@@ -174,23 +174,57 @@ def do_predict(boxes: List[List[int]], model) -> List[int]:
 
 
 def cal_block_index(fix_blocks, sorted_bboxes):
-    for block in fix_blocks:
 
-        line_index_list = []
-        if len(block['lines']) == 0:
-            block['index'] = sorted_bboxes.index(block['bbox'])
-        else:
+    if sorted_bboxes is not None:
+        # Sort with layoutreader
+        for block in fix_blocks:
+            line_index_list = []
+            if len(block['lines']) == 0:
+                block['index'] = sorted_bboxes.index(block['bbox'])
+            else:
+                for line in block['lines']:
+                    line['index'] = sorted_bboxes.index(line['bbox'])
+                    line_index_list.append(line['index'])
+                median_value = statistics.median(line_index_list)
+                block['index'] = median_value
+
+            # Drop the virtual line info from image/table body blocks and backfill it from real_lines
+            if block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
+                block['virtual_lines'] = copy.deepcopy(block['lines'])
+                block['lines'] = copy.deepcopy(block['real_lines'])
+                del block['real_lines']
+    else:
+        # Sort with xycut
+        block_bboxes = []
+        for block in fix_blocks:
+            block_bboxes.append(block['bbox'])
+
+            # Drop the virtual line info from image/table body blocks and backfill it from real_lines
+            if block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
+                block['virtual_lines'] = copy.deepcopy(block['lines'])
+                block['lines'] = copy.deepcopy(block['real_lines'])
+                del block['real_lines']
+
+        import numpy as np
+        from magic_pdf.model.sub_modules.reading_oreder.layoutreader.xycut import recursive_xy_cut
+
+        random_boxes = np.array(block_bboxes)
+        np.random.shuffle(random_boxes)
+        res = []
+        recursive_xy_cut(np.asarray(random_boxes).astype(int), np.arange(len(block_bboxes)), res)
+        assert len(res) == len(block_bboxes)
+        sorted_boxes = random_boxes[np.array(res)].tolist()
+
+        for i, block in enumerate(fix_blocks):
+            block['index'] = sorted_boxes.index(block['bbox'])
+
+        # Generate line indices
+        sorted_blocks = sorted(fix_blocks, key=lambda b: b['index'])
+        line_index = 1
+        for block in sorted_blocks:
             for line in block['lines']:
-                line['index'] = sorted_bboxes.index(line['bbox'])
-                line_index_list.append(line['index'])
-            median_value = statistics.median(line_index_list)
-            block['index'] = median_value
-
-        # Drop the virtual line info from image/table body blocks and backfill it from real_lines
-        if block['type'] in [BlockType.ImageBody, BlockType.TableBody]:
-            block['virtual_lines'] = copy.deepcopy(block['lines'])
-            block['lines'] = copy.deepcopy(block['real_lines'])
-            del block['real_lines']
+                line['index'] = line_index
+                line_index += 1
 
     return fix_blocks
 
@@ -264,6 +298,9 @@ def sort_lines_by_model(fix_blocks, page_w, page_h, line_height):
                 block['lines'].append({'bbox': line, 'spans': []})
             page_line_list.extend(lines)
 
+    if len(page_line_list) > 200:  # layoutreader supports at most 512 lines
+        return None
+
     # Sort with layoutreader
     x_scale = 1000.0 / page_w
     y_scale = 1000.0 / page_h
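The xycut fallback used by ``cal_block_index`` can be exercised on its own. A minimal sketch with hypothetical block bboxes, deliberately shuffled to mirror the code above (the module path keeps the repository's own spelling):

.. code:: python

    import numpy as np
    from magic_pdf.model.sub_modules.reading_oreder.layoutreader.xycut import recursive_xy_cut

    # Four block bboxes [x0, y0, x1, y1] in shuffled order (made-up coordinates)
    boxes = np.array([[250, 300, 400, 350],
                      [50, 100, 200, 150],
                      [250, 100, 400, 150],
                      [50, 300, 200, 350]])
    res = []
    recursive_xy_cut(boxes.astype(int), np.arange(len(boxes)), res)
    reading_order = boxes[np.array(res)].tolist()  # expected: top row left-to-right, then bottom row
    print(reading_order)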

+ 2 - 1
magic_pdf/resources/model_config/model_configs.yaml

@@ -4,4 +4,5 @@ weights:
   yolo_v8_mfd: MFD/YOLO/yolo_v8_ft.pt
   unimernet_small: MFR/unimernet_small
   struct_eqtable: TabRec/StructEqTable
-  tablemaster: TabRec/TableMaster
+  tablemaster: TabRec/TableMaster
+  rapid_table: TabRec/RapidTable
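Each ``weights`` entry resolves to a path under the configured models directory. A minimal sketch of how the new ``rapid_table`` entry maps to a local path; the ``/tmp/models`` value mirrors the ``models-dir`` example from the FAQ, and this loading code is an illustrative assumption rather than the project's own loader:

.. code:: python

    import os
    import yaml

    with open('magic_pdf/resources/model_config/model_configs.yaml') as f:
        configs = yaml.safe_load(f)
    rapid_table_dir = os.path.join('/tmp/models', configs['weights']['rapid_table'])
    print(rapid_table_dir)  # /tmp/models/TabRec/RapidTable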

+ 47 - 3
magic_pdf/tools/common.py

@@ -14,6 +14,9 @@ from magic_pdf.pipe.TXTPipe import TXTPipe
 from magic_pdf.pipe.UNIPipe import UNIPipe
 from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
 from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
+import fitz
+# from io import BytesIO
+# from pypdf import PdfReader, PdfWriter
 
 
 def prepare_env(output_dir, pdf_file_name, method):
@@ -26,6 +29,42 @@ def prepare_env(output_dir, pdf_file_name, method):
     return local_image_dir, local_md_dir
 
 
+# def convert_pdf_bytes_to_bytes_by_pypdf(pdf_bytes, start_page_id=0, end_page_id=None):
+#     # Wrap the byte data in a BytesIO object
+#     pdf_file = BytesIO(pdf_bytes)
+#     # Read the PDF from the bytes
+#     reader = PdfReader(pdf_file)
+#     # Create a new PDF writer
+#     writer = PdfWriter()
+#     # Add the pages in the requested range to the new PDF writer
+#     end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else len(reader.pages) - 1
+#     if end_page_id > len(reader.pages) - 1:
+#         logger.warning("end_page_id is out of range, use pdf_docs length")
+#         end_page_id = len(reader.pages) - 1
+#     for i, page in enumerate(reader.pages):
+#         if start_page_id <= i <= end_page_id:
+#             writer.add_page(page)
+#     # Create a byte buffer to hold the output PDF
+#     output_buffer = BytesIO()
+#     # Write the PDF into the byte buffer
+#     writer.write(output_buffer)
+#     # Get the buffer's contents
+#     converted_pdf_bytes = output_buffer.getvalue()
+#     return converted_pdf_bytes
+
+
+def convert_pdf_bytes_to_bytes_by_pymupdf(pdf_bytes, start_page_id=0, end_page_id=None):
+    document = fitz.open("pdf", pdf_bytes)
+    output_document = fitz.open()
+    end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else len(document) - 1
+    if end_page_id > len(document) - 1:
+        logger.warning("end_page_id is out of range, use pdf_docs length")
+        end_page_id = len(document) - 1
+    output_document.insert_pdf(document, from_page=start_page_id, to_page=end_page_id)
+    output_bytes = output_document.tobytes()
+    return output_bytes
+
+
 def do_parse(
     output_dir,
     pdf_file_name,
@@ -55,6 +94,8 @@ def do_parse(
         f_draw_model_bbox = True
         f_draw_line_sort_bbox = True
 
+    pdf_bytes = convert_pdf_bytes_to_bytes_by_pymupdf(pdf_bytes, start_page_id, end_page_id)
+
     orig_model_list = copy.deepcopy(model_list)
     local_image_dir, local_md_dir = prepare_env(output_dir, pdf_file_name,
                                                 parse_method)
@@ -66,15 +107,18 @@ def do_parse(
     if parse_method == 'auto':
         jso_useful_key = {'_pdf_type': '', 'model_list': model_list}
         pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer, is_debug=True,
-                       start_page_id=start_page_id, end_page_id=end_page_id, lang=lang,
+                       # start_page_id=start_page_id, end_page_id=end_page_id,
+                       lang=lang,
                        layout_model=layout_model, formula_enable=formula_enable, table_enable=table_enable)
     elif parse_method == 'txt':
         pipe = TXTPipe(pdf_bytes, model_list, image_writer, is_debug=True,
-                       start_page_id=start_page_id, end_page_id=end_page_id, lang=lang,
+                       # start_page_id=start_page_id, end_page_id=end_page_id,
+                       lang=lang,
                        layout_model=layout_model, formula_enable=formula_enable, table_enable=table_enable)
     elif parse_method == 'ocr':
         pipe = OCRPipe(pdf_bytes, model_list, image_writer, is_debug=True,
-                       start_page_id=start_page_id, end_page_id=end_page_id, lang=lang,
+                       # start_page_id=start_page_id, end_page_id=end_page_id,
+                       lang=lang,
                        layout_model=layout_model, formula_enable=formula_enable, table_enable=table_enable)
     else:
         logger.error('unknown parse method')
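Since the pipes no longer receive ``start_page_id``/``end_page_id``, page selection now happens up front through the helper added above. A minimal sketch with hypothetical file names:

.. code:: python

    # Keep only pages 0-4 of the input PDF before handing it to a pipe
    with open('input.pdf', 'rb') as f:
        pdf_bytes = f.read()
    clipped = convert_pdf_bytes_to_bytes_by_pymupdf(pdf_bytes, start_page_id=0, end_page_id=4)
    with open('clipped.pdf', 'wb') as f:
        f.write(clipped)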

File diff suppressed because it is too large
+ 16 - 0
next_docs/README.md


File diff suppressed because it is too large
+ 16 - 0
next_docs/README_zh-CN.md


File diff suppressed because it is too large
+ 13 - 0
next_docs/en/_static/image/ReadTheDocs.svg


+ 0 - 26
next_docs/en/additional_notes/changelog.rst

@@ -1,26 +0,0 @@
-
-
-Changelog
-=========
-
--  2024/09/27 Version 0.8.1 released, Fixed some bugs, and providing a
-   `localized deployment version <projects/web_demo/README.md>`__ of the
-   `online
-   demo <https://opendatalab.com/OpenSourceTools/Extractor/PDF/>`__ and
-   the `front-end interface <projects/web/README.md>`__.
--  2024/09/09: Version 0.8.0 released, supporting fast deployment with
-   Dockerfile, and launching demos on Huggingface and Modelscope.
--  2024/08/30: Version 0.7.1 released, add paddle tablemaster table
-   recognition option
--  2024/08/09: Version 0.7.0b1 released, simplified installation
-   process, added table recognition functionality
--  2024/08/01: Version 0.6.2b1 released, optimized dependency conflict
-   issues and installation documentation
--  2024/07/05: Initial open-source release
-
-
-.. warning::
-
-   fix ``localized deployment version`` and ``front-end interface``
-
-

+ 12 - 0
next_docs/en/additional_notes/faq.rst

@@ -74,3 +74,15 @@ CUDA version used by Paddle needs to be upgraded.
    pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
 
 Reference: https://github.com/opendatalab/MinerU/issues/558
+
+
+7. On some Linux servers, the program immediately reports an error ``Illegal instruction (core dumped)``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This might be because the server's CPU does not support the AVX/AVX2
+instruction set, or because the CPU supports it but the instruction set
+has been disabled by the system administrator. You can try asking the
+administrator to lift the restriction, or switch to a different server.
+
+References: https://github.com/opendatalab/MinerU/issues/591 ,
+https://github.com/opendatalab/MinerU/issues/736
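To check up front whether a Linux server's CPU advertises AVX/AVX2, the flags in ``/proc/cpuinfo`` can be inspected; a minimal sketch, not part of MinerU itself:

.. code:: python

    import re

    with open('/proc/cpuinfo') as f:
        cpuinfo = f.read()
    print('AVX: ', bool(re.search(r'\bavx\b', cpuinfo)))
    print('AVX2:', bool(re.search(r'\bavx2\b', cpuinfo)))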

+ 15 - 14
next_docs/en/additional_notes/known_issues.rst

@@ -1,19 +1,20 @@
 Known Issues
 ============
 
--  Reading order is based on the model’s sorting of text distribution in
-   space, which may become disordered under extremely complex layouts.
+-  Reading order is determined by the model based on the spatial
+   distribution of readable content, and may be out of order in some
+   areas under extremely complex layouts.
 -  Vertical text is not supported.
--  Tables of contents and lists are recognized through rules; a few
-   uncommon list formats may not be identified.
--  Only one level of headings is supported; hierarchical heading levels
-   are currently not supported.
+-  Tables of contents and lists are recognized through rules, and some
+   uncommon list formats may not be recognized.
+-  Only one level of headings is supported; hierarchical headings are
+   not currently supported.
 -  Code blocks are not yet supported in the layout model.
--  Comic books, art books, elementary school textbooks, and exercise
-   books are not well-parsed yet
--  Enabling OCR may produce better results in PDFs with a high density
-   of formulas
--  If you are processing PDFs with a large number of formulas, it is
-   strongly recommended to enable the OCR function. When using PyMuPDF
-   to extract text, overlapping text lines can occur, leading to
-   inaccurate formula insertion positions.
+-  Comic books, art albums, primary school textbooks, and exercises
+   cannot be parsed well.
+-  Table recognition may result in row/column recognition errors in
+   complex tables.
+-  OCR recognition may produce inaccurate characters in PDFs of
+   lesser-known languages (e.g., diacritical marks in Latin script,
+   easily confused characters in Arabic script).
+-  Some formulas may not render correctly in Markdown.

+ 0 - 1
next_docs/en/api.rst

@@ -7,4 +7,3 @@
    api/read_api
    api/schemas
    api/io
-   api/classes

+ 0 - 14
next_docs/en/api/classes.rst

@@ -1,14 +0,0 @@
-Class Hierarchy
-===============
-
-.. inheritance-diagram:: magic_pdf.data.io.base magic_pdf.data.io.http magic_pdf.data.io.s3
-   :parts: 2
-
-
-.. inheritance-diagram:: magic_pdf.data.dataset
-   :parts: 2
-
-
-.. inheritance-diagram:: magic_pdf.data.data_reader_writer.base magic_pdf.data.data_reader_writer.filebase magic_pdf.data.data_reader_writer.multi_bucket_s3
-   :parts: 2
-

+ 0 - 1
next_docs/en/api/utils.rst

@@ -1 +0,0 @@
-

+ 1 - 1
next_docs/en/conf.py

@@ -95,7 +95,7 @@ language = 'en'
 html_theme = 'sphinx_book_theme'
 html_logo = '_static/image/logo.png'
 html_theme_options = {
-    'path_to_docs': 'docs/en',
+    'path_to_docs': 'next_docs/en',
     'repository_url': 'https://github.com/opendatalab/MinerU',
     'use_repository_button': True,
 }

+ 23 - 22
next_docs/en/index.rst

@@ -46,20 +46,29 @@ the relevant PDF**.
 Key Features
 ------------
 
--  Removes elements such as headers, footers, footnotes, and page
-   numbers while maintaining semantic continuity
--  Outputs text in a human-readable order from multi-column documents
--  Retains the original structure of the document, including titles,
-   paragraphs, and lists
--  Extracts images, image captions, tables, and table captions
--  Automatically recognizes formulas in the document and converts them
-   to LaTeX
--  Automatically recognizes tables in the document and converts them to
-   LaTeX
--  Automatically detects and enables OCR for corrupted PDFs
--  Supports both CPU and GPU environments
--  Supports Windows, Linux, and Mac platforms
-
+-  Remove headers, footers, footnotes, page numbers, etc., to ensure
+   semantic coherence.
+-  Output text in human-readable order, suitable for single-column,
+   multi-column, and complex layouts.
+-  Preserve the structure of the original document, including headings,
+   paragraphs, lists, etc.
+-  Extract images, image descriptions, tables, table titles, and
+   footnotes.
+-  Automatically recognize and convert formulas in the document to LaTeX
+   format.
+-  Automatically recognize and convert tables in the document to LaTeX
+   or HTML format.
+-  Automatically detect scanned PDFs and garbled PDFs and enable OCR
+   functionality.
+-  OCR supports detection and recognition of 84 languages.
+-  Supports multiple output formats, such as multimodal and NLP
+   Markdown, JSON sorted by reading order, and rich intermediate
+   formats.
+-  Supports various visualization results, including layout
+   visualization and span visualization, for efficient confirmation of
+   output quality.
+-  Supports both CPU and GPU environments.
+-  Compatible with Windows, Linux, and Mac platforms.
 
 User Guide
 -------------
@@ -91,14 +100,6 @@ Additional Notes
 
    additional_notes/known_issues
    additional_notes/faq
-   additional_notes/changelog
    additional_notes/glossary
 
 
-Projects 
----------
-.. toctree::
-   :maxdepth: 1
-   :caption: Projects
-
-   projects

+ 0 - 13
next_docs/en/projects.rst

@@ -1,13 +0,0 @@
-
-
-
-llama_index_rag 
-===============
-
-
-gradio_app
-============
-
-
-other projects
-===============

+ 5 - 1
next_docs/en/user_guide/data/data_reader_writer.rst

@@ -87,6 +87,8 @@ Read Examples
 
 .. code:: python
 
+    from magic_pdf.data.data_reader_writer import *
+
     # file based related 
     file_based_reader1 = FileBasedDataReader('')
 
@@ -142,6 +144,8 @@ Write Examples
 
 .. code:: python
 
+    from magic_pdf.data.data_reader_writer import *
+
     # file based related 
     file_based_writer1 = FileBasedDataWriter('')
 
@@ -201,4 +205,4 @@ Write Examples
     s3_writer1.write('s3://test_bucket/efg', '123'.encode())
 
 
-Check :doc:`../../api/classes` for more intuitions or check :doc:`../../api/data_reader_writer` for more details
+Check :doc:`../../api/data_reader_writer` for more details

+ 1 - 1
next_docs/en/user_guide/data/dataset.rst

@@ -36,5 +36,5 @@ Extract chars via third-party library, currently we use ``pymupdf``.
 
 
 
-Check :doc:`../../api/classes` for more intuitions or check :doc:`../../api/dataset` for more details
+Check :doc:`../../api/dataset` for more details
 

+ 1 - 1
next_docs/en/user_guide/data/io.rst

@@ -21,5 +21,5 @@ if MinerU has not provided suitable classes. It is easy to implement new cla
         def write(self, path: str, data: bytes) -> None:
             pass
 
-Check :doc:`../../api/classes` for more intuitions or check :doc:`../../api/io` for more details
+Check :doc:`../../api/io` for more details
 

+ 6 - 1
next_docs/en/user_guide/data/read_api.rst

@@ -18,6 +18,8 @@ Read the content from jsonl which may be located on local machine or remote s3. if y
 
 .. code:: python
 
+    from magic_pdf.data.io.read_api import *
+
     # read jsonl from local machine 
     datasets = read_jsonl("tt.jsonl", None)
 
@@ -33,6 +35,8 @@ Read pdf from path or directory.
 
 .. code:: python
 
+    from magic_pdf.data.io.read_api import *
+
     # read pdf path
     datasets = read_local_pdfs("tt.pdf")
 
@@ -47,10 +51,11 @@ Read images from path or directory
 
 .. code:: python 
 
+    from magic_pdf.data.io.read_api import *
+
     # read from image path 
     datasets = read_local_images("tt.png")
 
-
     # read files from directory that endswith suffix in suffixes array 
     datasets = read_local_images("images/", suffixes=["png", "jpg"])
 

+ 45 - 41
next_docs/en/user_guide/install/boost_with_cuda.rst

@@ -9,16 +9,18 @@ appropriate guide based on your system:
 
 -  :ref:`ubuntu_22_04_lts_section`
 -  :ref:`windows_10_or_11_section`
+-  Quick Deployment with Docker
 
--  Quick Deployment with Docker > Docker requires a GPU with at least
-   16GB of VRAM, and all acceleration features are enabled by default.
+.. admonition:: Important
+   :class: tip
 
-.. note:: 
+   Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
 
-   Before running this Docker, you can use the following command to
-   check if your device supports CUDA acceleration on Docker. 
+   Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker. 
 
-   bash  docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
+   .. code-block:: bash
+
+      docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
 
 .. code:: sh
 
@@ -42,8 +44,9 @@ Ubuntu 22.04 LTS
 If you see information similar to the following, it means that the
 NVIDIA drivers are already installed, and you can skip Step 2.
 
-Notice:``CUDA Version`` should be >= 12.1, If the displayed version
-number is less than 12.1, please upgrade the driver.
+.. note::
+
+   ``CUDA Version`` should be >= 12.1. If the displayed version number is less than 12.1, please upgrade the driver.
 
 .. code:: text
 
@@ -105,8 +108,10 @@ Specify Python version 3.10.
 
    pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
 
-❗ After installation, make sure to check the version of ``magic-pdf``
-using the following command:
+.. admonition:: Important
+    :class: tip
+
+    ❗ After installation, make sure to check the version of ``magic-pdf`` using the following command:
 
 .. code:: sh
 
@@ -127,7 +132,10 @@ the script will automatically generate a ``magic-pdf.json`` file in the
 user directory and configure the default model path. You can find the
 ``magic-pdf.json`` file in your user directory.
 
-   The user directory for Linux is “/home/username”.
+.. admonition:: TIP
+    :class: tip
+
+    The user directory for Linux is “/home/username”.
 
 8. First Run
 ~~~~~~~~~~~~
@@ -137,7 +145,7 @@ Download a sample file from the repository and test it.
 .. code:: sh
 
    wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf
-   magic-pdf -p small_ocr.pdf
+   magic-pdf -p small_ocr.pdf -o ./output
 
 9. Test CUDA Acceleration
 ~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -145,10 +153,6 @@ Download a sample file from the repository and test it.
 If your graphics card has at least **8GB** of VRAM, follow these steps
 to test CUDA acceleration:
 
-   ❗ Due to the extremely limited nature of 8GB VRAM for running this
-   application, you need to close all other programs using VRAM to
-   ensure that 8GB of VRAM is available when running this application.
-
 1. Modify the value of ``"device-mode"`` in the ``magic-pdf.json``
    configuration file located in your home directory.
 
@@ -162,7 +166,7 @@ to test CUDA acceleration:
 
    .. code:: sh
 
-      magic-pdf -p small_ocr.pdf
+      magic-pdf -p small_ocr.pdf -o ./output
 
 10. Enable CUDA Acceleration for OCR
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -178,7 +182,9 @@ to test CUDA acceleration:
 
    .. code:: sh
 
-      magic-pdf -p small_ocr.pdf
+      magic-pdf -p small_ocr.pdf -o ./output
+
+
 
 .. _windows_10_or_11_section:
 
@@ -218,16 +224,16 @@ Python version must be 3.10.
 
    pip install -U magic-pdf[full] --extra-index-url https://wheels.myhloli.com
 
-..
+.. admonition:: Important
+    :class: tip
 
-   ❗️After installation, verify the version of ``magic-pdf``:
+    ❗️After installation, verify the version of ``magic-pdf``:
 
-   .. code:: bash
+    .. code:: bash
 
       magic-pdf --version
 
-   If the version number is less than 0.7.0, please report it in the
-   issues section.
+    If the version number is less than 0.7.0, please report it in the issues section.
 
 5. Download Models
 ~~~~~~~~~~~~~~~~~~
@@ -242,7 +248,10 @@ the script will automatically generate a ``magic-pdf.json`` file in the
 user directory and configure the default model path. You can find the
 ``magic-pdf.json`` file in your 【user directory】 .
 
-   The user directory for Windows is “C:/Users/username”.
+.. admonition:: Tip
+    :class: tip
+
+    The user directory for Windows is “C:/Users/username”.
 
 7. First Run
 ~~~~~~~~~~~~
@@ -252,7 +261,7 @@ Download a sample file from the repository and test it.
 .. code:: powershell
 
      wget https://github.com/opendatalab/MinerU/raw/master/demo/small_ocr.pdf -O small_ocr.pdf
-     magic-pdf -p small_ocr.pdf
+     magic-pdf -p small_ocr.pdf -o ./output
 
 8. Test CUDA Acceleration
 ~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -260,27 +269,23 @@ Download a sample file from the repository and test it.
 If your graphics card has at least 8GB of VRAM, follow these steps to
 test CUDA-accelerated parsing performance.
 
-   ❗ Due to the extremely limited nature of 8GB VRAM for running this
-   application, you need to close all other programs using VRAM to
-   ensure that 8GB of VRAM is available when running this application.
-
-1. **Overwrite the installation of torch and torchvision** supporting
-   CUDA.
+1. **Overwrite the installation of torch and torchvision** supporting CUDA.
 
-   ::
+.. code:: sh
 
-      pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
+   pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
 
-   ..
+.. admonition:: Important
+    :class: tip
 
-      ❗️Ensure the following versions are specified in the command:
+    ❗️Ensure the following versions are specified in the command:
 
-      ::
+ 
+    .. code:: sh
 
          torch==2.3.1 torchvision==0.18.1
 
-      These are the highest versions we support. Installing higher
-      versions without specifying them will cause the program to fail.
+    These are the highest versions we support. Installing higher versions without specifying them will cause the program to fail.
 
 2. **Modify the value of ``"device-mode"``** in the ``magic-pdf.json``
    configuration file located in your user directory.
@@ -295,7 +300,7 @@ test CUDA-accelerated parsing performance.
 
    ::
 
-      magic-pdf -p small_ocr.pdf
+      magic-pdf -p small_ocr.pdf -o ./output
 
 9. Enable CUDA Acceleration for OCR
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
@@ -311,5 +316,4 @@ test CUDA-accelerated parsing performance.
 
    ::
 
-      magic-pdf -p small_ocr.pdf
-
+      magic-pdf -p small_ocr.pdf -o ./output

+ 81 - 78
next_docs/en/user_guide/install/install.rst

@@ -1,87 +1,90 @@
 
 Install 
 ===============================================================
-If you encounter any installation issues, please first consult the FAQ.
-If the parsing results are not as expected, refer to the Known Issues.
-There are three different ways to experience MinerU
-
-Pre-installation Notice—Hardware and Software Environment Support
-------------------------------------------------------------------
-
-To ensure the stability and reliability of the project, we only optimize
-and test for specific hardware and software environments during
-development. This ensures that users deploying and running the project
-on recommended system configurations will get the best performance with
-the fewest compatibility issues.
-
-By focusing resources on the mainline environment, our team can more
-efficiently resolve potential bugs and develop new features.
-
-In non-mainline environments, due to the diversity of hardware and
-software configurations, as well as third-party dependency compatibility
-issues, we cannot guarantee 100% project availability. Therefore, for
-users who wish to use this project in non-recommended environments, we
-suggest carefully reading the documentation and FAQ first. Most issues
-already have corresponding solutions in the FAQ. We also encourage
-community feedback to help us gradually expand support.
+If you encounter any installation issues, please first consult the :doc:`../../additional_notes/faq`.
+If the parsing results are not as expected, refer to the :doc:`../../additional_notes/known_issues`.
+
+
+.. admonition:: Warning
+    :class: tip
+
+    **Pre-installation Notice—Hardware and Software Environment Support**
+
+    To ensure the stability and reliability of the project, we only optimize
+    and test for specific hardware and software environments during
+    development. This ensures that users deploying and running the project
+    on recommended system configurations will get the best performance with
+    the fewest compatibility issues.
+
+    By focusing resources on the mainline environment, our team can more
+    efficiently resolve potential bugs and develop new features.
+
+    In non-mainline environments, due to the diversity of hardware and
+    software configurations, as well as third-party dependency compatibility
+    issues, we cannot guarantee 100% project availability. Therefore, for
+    users who wish to use this project in non-recommended environments, we
+    suggest carefully reading the documentation and FAQ first. Most issues
+    already have corresponding solutions in the FAQ. We also encourage
+    community feedback to help us gradually expand support.
 
 .. raw:: html
 
-   <style>
-      table, th, td {
-      border: 1px solid black;
-      border-collapse: collapse;
-      }
-   </style>
-   <table>
-    <tr>
-        <td colspan="3" rowspan="2">Operating System</td>
-    </tr>
-    <tr>
-        <td>Ubuntu 22.04 LTS</td>
-        <td>Windows 10 / 11</td>
-        <td>macOS 11+</td>
-    </tr>
-    <tr>
-        <td colspan="3">CPU</td>
-        <td>x86_64</td>
-        <td>x86_64</td>
-        <td>x86_64 / arm64</td>
-    </tr>
-    <tr>
-        <td colspan="3">Memory</td>
-        <td colspan="3">16GB or more, recommended 32GB+</td>
-    </tr>
-    <tr>
-        <td colspan="3">Python Version</td>
-        <td colspan="3">3.10</td>
-    </tr>
-    <tr>
-        <td colspan="3">Nvidia Driver Version</td>
-        <td>latest (Proprietary Driver)</td>
-        <td>latest</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td colspan="3">CUDA Environment</td>
-        <td>Automatic installation [12.1 (pytorch) + 11.8 (paddle)]</td>
-        <td>11.8 (manual installation) + cuDNN v8.7.0 (manual installation)</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td rowspan="2">GPU Hardware Support List</td>
-        <td colspan="2">Minimum Requirement 8G+ VRAM</td>
-        <td colspan="2">3060ti/3070/3080/3080ti/4060/4070/4070ti<br>
-        8G VRAM enables layout, formula recognition acceleration and OCR acceleration</td>
-        <td rowspan="2">None</td>
-    </tr>
-    <tr>
-        <td colspan="2">Recommended Configuration 16G+ VRAM</td>
-        <td colspan="2">3090/3090ti/4070ti super/4080/4090<br>
-        16G VRAM or more can enable layout, formula recognition, OCR acceleration and table recognition acceleration simultaneously
-        </td>
-    </tr>
-   </table>
+    <style>
+        table, th, td {
+        border: 1px solid black;
+        border-collapse: collapse;
+        }
+    </style>
+    <table>
+        <tr>
+            <td colspan="3" rowspan="2">Operating System</td>
+        </tr>
+        <tr>
+            <td>Ubuntu 22.04 LTS</td>
+            <td>Windows 10 / 11</td>
+            <td>macOS 11+</td>
+        </tr>
+        <tr>
+            <td colspan="3">CPU</td>
+            <td>x86_64 (ARM Linux not supported)</td>
+            <td>x86_64 (ARM Windows not supported)</td>
+            <td>x86_64 / arm64</td>
+        </tr>
+        <tr>
+            <td colspan="3">Memory</td>
+            <td colspan="3">16GB or more, recommended 32GB+</td>
+        </tr>
+        <tr>
+            <td colspan="3">Python Version</td>
+            <td colspan="3">3.10(Please make sure to create a Python 3.10 virtual environment using conda)</td>
+        </tr>
+        <tr>
+            <td colspan="3">Nvidia Driver Version</td>
+            <td>latest (Proprietary Driver)</td>
+            <td>latest</td>
+            <td>None</td>
+        </tr>
+        <tr>
+            <td colspan="3">CUDA Environment</td>
+            <td>Automatic installation [12.1 (pytorch) + 11.8 (paddle)]</td>
+            <td>11.8 (manual installation) + cuDNN v8.7.0 (manual installation)</td>
+            <td>None</td>
+        </tr>
+        <tr>
+            <td rowspan="2">GPU Hardware Support List</td>
+            <td colspan="2">Minimum Requirement 8G+ VRAM</td>
+            <td colspan="2">3060ti/3070/4060<br>
+            8G VRAM enables layout, formula recognition acceleration and OCR acceleration</td>
+            <td rowspan="2">None</td>
+        </tr>
+        <tr>
+            <td colspan="2">Recommended Configuration 10G+ VRAM</td>
+            <td colspan="2">3080/3080ti/3090/3090ti/4070/4070ti/4070tisuper/4080/4090<br>
+            10G VRAM or more can enable layout, formula recognition, OCR acceleration and table recognition acceleration simultaneously
+            </td>
+        </tr>
+    </table>
+
 
 
 Create an environment

+ 4 - 1
next_docs/en/user_guide/quick_start/command_line.rst

@@ -55,5 +55,8 @@ directory. The output file list is as follows:
    ├── some_pdf_spans.pdf                   # smallest granularity bbox position information diagram
    └── some_pdf_content_list.json           # Rich text JSON arranged in reading order
 
-For more information about the output files, please refer to the :doc:`../tutorial/output_file_description`
+.. admonition:: Tip
+   :class: tip
+
+   For more information about the output files, please refer to the :doc:`../tutorial/output_file_description`
 

+ 0 - 10
next_docs/en/user_guide/quick_start/extract_text.rst

@@ -1,10 +0,0 @@
-
-
-Extract Content from Pdf
-========================
-
-.. code:: python
-
-    from magic_pdf.data.read_api import read_local_pdfs
-    from magic_pdf.pdf_parse_union_core_v2 import pdf_parse_union
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze

BIN
next_docs/zh_cn/_static/image/MinerU-logo-hq.png


BIN
next_docs/zh_cn/_static/image/MinerU-logo.png


File diff suppressed because it is too large
+ 13 - 0
next_docs/zh_cn/_static/image/ReadTheDocs.svg


BIN
next_docs/zh_cn/_static/image/datalab_logo.png


BIN
next_docs/zh_cn/_static/image/flowchart_en.png


BIN
next_docs/zh_cn/_static/image/flowchart_zh_cn.png


BIN
next_docs/zh_cn/_static/image/layout_example.png


BIN
next_docs/zh_cn/_static/image/poly.png


BIN
next_docs/zh_cn/_static/image/project_panorama_en.png


BIN
next_docs/zh_cn/_static/image/project_panorama_zh_cn.png


BIN
next_docs/zh_cn/_static/image/spans_example.png


BIN
next_docs/zh_cn/_static/image/web_demo_1.png


+ 72 - 0
next_docs/zh_cn/additional_notes/faq.rst

@@ -0,0 +1,72 @@
+Frequently Asked Questions
+==========================
+
+1. On newer versions of macOS, installing with pip install magic-pdf[full] fails with zsh: no matches found: magic-pdf[full]
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On macOS, the default shell has switched from Bash to Z shell, and Z shell applies special handling to certain kinds of string matching, which can cause the no matches found error. You can disable the globbing feature on the command line and then retry the install command:
+
+.. code:: bash
+
+   setopt no_nomatch
+   pip install magic-pdf[full]
+
+2. Encountering _pickle.UnpicklingError: invalid load key, 'v'. during use
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This may be caused by incompletely downloaded model files; try re-downloading the model files. Reference: https://github.com/opendatalab/MinerU/issues/143
+
+3. Where should the model files be downloaded / how should models-dir be configured
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The model file path is configured in "magic-pdf.json" via
+
+.. code:: json
+
+   {
+     "models-dir": "/tmp/models"
+   }
+
+This must be an absolute path, not a relative one; you can obtain the absolute path by running "pwd" inside the models directory.
+Reference: https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874
+
+4. Error ``ImportError: libGL.so.1: cannot open shared object file: No such file or directory`` in Ubuntu 22.04 on WSL2
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Ubuntu 22.04 on WSL2 is missing the ``libgl`` library; install it with the following command:
+
+.. code:: bash
+
+   sudo apt-get install libgl1-mesa-glx
+
+Reference: https://github.com/opendatalab/MinerU/issues/388
+
+5. Error ``ModuleNotFoundError: No module named 'fairscale'``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Uninstall the module and reinstall it:
+
+.. code:: bash
+
+   pip uninstall fairscale
+   pip install fairscale
+
+Reference: https://github.com/opendatalab/MinerU/issues/411
+
+6. On some newer devices such as the H100, text parsed with CUDA-accelerated OCR is garbled
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+CUDA 11 has poor compatibility with newer GPUs; upgrade the CUDA version used by Paddle:
+
+.. code:: bash
+
+   pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
+
+Reference: https://github.com/opendatalab/MinerU/issues/558
+
+7. On some Linux servers, the program immediately fails with ``Illegal instruction (core dumped)``
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This may be because the server's CPU does not support the AVX/AVX2 instruction set, or the CPU supports it but the instruction set has been disabled by the administrator. Try asking the administrator to lift the restriction, or switch to a different server.
+
+Reference: https://github.com/opendatalab/MinerU/issues/591 , https://github.com/opendatalab/MinerU/issues/736

+ 11 - 0
next_docs/zh_cn/additional_notes/glossary.rst

@@ -0,0 +1,11 @@
+
+
+Glossary
+===========
+
+1. jsonl 
+    TODO: add description
+
+2. magic-pdf.json
+    TODO: add description
+

Some files were not shown because too many files changed in this diff