Merge pull request #1427 from opendatalab/release-1.0.0

Release 1.0.0
Xiaomeng Zhao 10 months ago
parent
commit
4bb543939e
100 changed files with 3082 additions and 2018 deletions
  1. .github/ISSUE_TEMPLATE/bug_report.yml (+ 2 - 1)
  2. README.md (+ 0 - 0)
  3. README_ja-JP.md (+ 0 - 327)
  4. README_zh-CN.md (+ 0 - 0)
  5. demo/demo.py (+ 16 - 2)
  6. demo/demo1.json (+ 0 - 0)
  7. demo/demo2.json (+ 0 - 0)
  8. demo/small_ocr.json (+ 0 - 0)
  9. docker/ascend_npu/Dockerfile (+ 55 - 0)
  10. docker/ascend_npu/requirements.txt (+ 5 - 3)
  11. docker/china/Dockerfile (+ 4 - 4)
  12. docker/china/requirements.txt (+ 25 - 0)
  13. docker/global/Dockerfile (+ 50 - 0)
  14. docker/global/requirements.txt (+ 25 - 0)
  15. docs/README_Ascend_NPU_Acceleration_zh_CN.md (+ 57 - 0)
  16. docs/images/MinerU-logo-hq.png (BIN)
  17. docs/images/MinerU-logo.png (BIN)
  18. magic-pdf.template.json (+ 23 - 3)
  19. magic_pdf/config/constants.py (+ 2 - 0)
  20. magic_pdf/config/exceptions.py (+ 7 - 0)
  21. magic_pdf/data/data_reader_writer/filebase.py (+ 1 - 1)
  22. magic_pdf/data/data_reader_writer/multi_bucket_s3.py (+ 8 - 6)
  23. magic_pdf/data/dataset.py (+ 13 - 1)
  24. magic_pdf/data/read_api.py (+ 59 - 12)
  25. magic_pdf/data/utils.py (+ 35 - 0)
  26. magic_pdf/dict2md/ocr_mkcontent.py (+ 14 - 13)
  27. magic_pdf/libs/clean_memory.py (+ 11 - 4)
  28. magic_pdf/libs/config_reader.py (+ 9 - 0)
  29. magic_pdf/libs/draw_bbox.py (+ 8 - 12)
  30. magic_pdf/libs/language.py (+ 3 - 0)
  31. magic_pdf/model/__init__.py (+ 1 - 125)
  32. magic_pdf/model/batch_analyze.py (+ 275 - 0)
  33. magic_pdf/model/doc_analyze_by_custom_model.py (+ 4 - 51)
  34. magic_pdf/model/magic_model.py (+ 4 - 435)
  35. magic_pdf/model/model_list.py (+ 1 - 0)
  36. magic_pdf/model/pdf_extract_kit.py (+ 33 - 22)
  37. magic_pdf/model/sub_modules/language_detection/__init__.py (+ 1 - 0)
  38. magic_pdf/model/sub_modules/language_detection/utils.py (+ 82 - 0)
  39. magic_pdf/model/sub_modules/language_detection/yolov11/YOLOv11.py (+ 139 - 0)
  40. magic_pdf/model/sub_modules/language_detection/yolov11/__init__.py (+ 1 - 0)
  41. magic_pdf/model/sub_modules/layout/doclayout_yolo/DocLayoutYOLO.py (+ 44 - 7)
  42. magic_pdf/model/sub_modules/mfd/yolov8/YOLOv8.py (+ 21 - 2)
  43. magic_pdf/model/sub_modules/mfr/unimernet/Unimernet.py (+ 70 - 27)
  44. magic_pdf/model/sub_modules/model_init.py (+ 30 - 4)
  45. magic_pdf/model/sub_modules/model_utils.py (+ 8 - 2)
  46. magic_pdf/model/sub_modules/ocr/paddleocr/ocr_utils.py (+ 51 - 1)
  47. magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_273_mod.py (+ 32 - 6)
  48. magic_pdf/model/sub_modules/table/rapidtable/rapid_table.py (+ 42 - 7)
  49. magic_pdf/operators/__init__.py (+ 94 - 0)
  50. magic_pdf/operators/models.py (+ 2 - 38)
  51. magic_pdf/operators/pipes.py (+ 70 - 17)
  52. magic_pdf/para/__init__.py (+ 0 - 0)
  53. magic_pdf/pdf_parse_by_ocr.py (+ 0 - 22)
  54. magic_pdf/pdf_parse_by_txt.py (+ 0 - 23)
  55. magic_pdf/pdf_parse_union_core_v2.py (+ 68 - 17)
  56. magic_pdf/pipe/OCRPipe.py (+ 0 - 80)
  57. magic_pdf/pipe/TXTPipe.py (+ 0 - 42)
  58. magic_pdf/pipe/UNIPipe.py (+ 0 - 150)
  59. magic_pdf/pipe/__init__.py (+ 0 - 0)
  60. magic_pdf/post_proc/__init__.py (+ 1 - 0)
  61. magic_pdf/post_proc/llm_aided.py (+ 133 - 0)
  62. magic_pdf/post_proc/para_split_v3.py (+ 0 - 0)
  63. magic_pdf/pre_proc/ocr_span_list_modify.py (+ 8 - 0)
  64. magic_pdf/pre_proc/remove_bbox_overlap.py (+ 1 - 1)
  65. magic_pdf/resources/yolov11-langdetect/yolo_v11_ft.pt (BIN)
  66. magic_pdf/rw/AbsReaderWriter.py (+ 0 - 17)
  67. magic_pdf/rw/DiskReaderWriter.py (+ 0 - 74)
  68. magic_pdf/rw/S3ReaderWriter.py (+ 0 - 142)
  69. magic_pdf/rw/__init__.py (+ 0 - 0)
  70. magic_pdf/tools/cli.py (+ 36 - 11)
  71. magic_pdf/tools/common.py (+ 28 - 18)
  72. magic_pdf/user_api.py (+ 0 - 144)
  73. magic_pdf/utils/office_to_pdf.py (+ 29 - 0)
  74. next_docs/README.md (+ 0 - 16)
  75. next_docs/README_zh-CN.md (+ 0 - 16)
  76. next_docs/en/_static/image/inference_result.png (BIN)
  77. next_docs/en/additional_notes/glossary.rst (+ 6 - 3)
  78. next_docs/en/api/model_operators.rst (+ 1 - 1)
  79. next_docs/en/api/pipe_operators.rst (+ 2 - 2)
  80. next_docs/en/index.rst (+ 6 - 0)
  81. next_docs/en/user_guide.rst (+ 3 - 1)
  82. next_docs/en/user_guide/data/data_reader_writer.rst (+ 90 - 62)
  83. next_docs/en/user_guide/data/read_api.rst (+ 52 - 9)
  84. next_docs/en/user_guide/inference_result.rst (+ 144 - 0)
  85. next_docs/en/user_guide/install.rst (+ 1 - 1)
  86. next_docs/en/user_guide/install/boost_with_cuda.rst (+ 0 - 18)
  87. next_docs/en/user_guide/install/config.rst (+ 168 - 0)
  88. next_docs/en/user_guide/install/install.rst (+ 32 - 3)
  89. next_docs/en/user_guide/pipe_result.rst (+ 335 - 0)
  90. next_docs/en/user_guide/quick_start.rst (+ 4 - 5)
  91. next_docs/en/user_guide/quick_start/convert_image.rst (+ 47 - 0)
  92. next_docs/en/user_guide/quick_start/convert_ms_office.rst (+ 60 - 0)
  93. next_docs/en/user_guide/quick_start/convert_pdf.rst (+ 56 - 0)
  94. next_docs/en/user_guide/tutorial.rst (+ 0 - 1)
  95. next_docs/en/user_guide/tutorial/pipeline.rst (+ 0 - 3)
  96. next_docs/en/user_guide/usage.rst (+ 12 - 0)
  97. next_docs/en/user_guide/usage/api.rst (+ 279 - 0)
  98. next_docs/en/user_guide/usage/command_line.rst (+ 18 - 3)
  99. next_docs/en/user_guide/usage/docker.rst (+ 24 - 0)
  100. next_docs/requirements.txt (+ 1 - 0)

+ 2 - 1
.github/ISSUE_TEMPLATE/bug_report.yml

@@ -78,10 +78,10 @@ body:
       #multiple: false
       options:
         -
-        - "0.7.x"
         - "0.8.x"
         - "0.9.x"
         - "0.10.x"
+        - "1.0.x"
     validations:
       required: true
 
@@ -94,5 +94,6 @@ body:
         -
         - cpu
         - cuda
+        - npu
     validations:
       required: true

File diff suppressed because it is too large
+ 0 - 0
README.md


+ 0 - 327
README_ja-JP.md

@@ -1,327 +0,0 @@
-> [!Warning]
-> This document is out of date. Please refer to the latest documentation: [ENGLISH](README.md).
-<div id="top">
-
-<p align="center">
-  <img src="docs/images/MinerU-logo.png" width="300px" style="vertical-align:middle;">
-</p>
-
-</div>
-<div align="center">
-
-[![stars](https://img.shields.io/github/stars/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
-[![forks](https://img.shields.io/github/forks/opendatalab/MinerU.svg)](https://github.com/opendatalab/MinerU)
-[![open issues](https://img.shields.io/github/issues-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
-[![issue resolution](https://img.shields.io/github/issues-closed-raw/opendatalab/MinerU)](https://github.com/opendatalab/MinerU/issues)
-[![PyPI version](https://badge.fury.io/py/magic-pdf.svg)](https://badge.fury.io/py/magic-pdf)
-[![Downloads](https://static.pepy.tech/badge/magic-pdf)](https://pepy.tech/project/magic-pdf)
-[![Downloads](https://static.pepy.tech/badge/magic-pdf/month)](https://pepy.tech/project/magic-pdf)
-
-<a href="https://trendshift.io/repositories/11174" target="_blank"><img src="https://trendshift.io/api/badge/repositories/11174" alt="opendatalab%2FMinerU | Trendshift" style="width: 200px; height: 55px;"/></a>
-
-
-
-
-
-[English](README.md) | [简体中文](README_zh-CN.md) | [日本語](README_ja-JP.md)
-
-</div>
-
-<div align="center">
-<p align="center">
-<a href="https://github.com/opendatalab/MinerU">MinerU: an end-to-end PDF parsing tool based on PDF-Extract-Kit that supports PDF-to-Markdown conversion.</a>🚀🚀🚀<br>
-<a href="https://github.com/opendatalab/PDF-Extract-Kit">PDF-Extract-Kit: a comprehensive toolkit for high-quality PDF content extraction</a>🔥🔥🔥
-</p>
-
-<p align="center">
-    👋 Join us on <a href="https://discord.gg/gPxmVeGC" target="_blank">Discord</a> and <a href="https://cdn.vansin.top/internlm/mineru.jpg" target="_blank">WeChat</a>
-</p>
-</div>
-
-# MinerU 
-
-
-## Introduction
-
-MinerU is a one-stop, open-source, high-quality data extraction tool that includes the following main features:
-
-- [Magic-PDF](#Magic-PDF)  PDF document extraction  
-- [Magic-Doc](#Magic-Doc)  Web page and e-book extraction
-
-
-# Magic-PDF
-
-
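-## Introduction
-
-Magic-PDF is a tool for converting PDF documents to Markdown. It can process files stored locally or on object storage that supports the S3 protocol.
-
-Its main features are:
-
-- Supports multiple front-end model inputs
-- Removes headers, footers, footnotes, and page numbers
-- Human-readable layout formatting
-- Preserves the structure and formatting of the original document, including headings, paragraphs, and lists
-- Extracts images and tables and displays them in the markdown
-- Converts formulas to LaTeX
-- Automatically detects and converts garbled PDFs
-- Supports CPU and GPU environments
-- Supports Windows, Linux, and macOS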
-
-
-https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
-
-
-
-## Project Panorama
-
-![Project Panorama](docs/images/project_panorama_en.png)
-
-
-## Flowchart
-
-![Flowchart](docs/images/flowchart_en.png)
-
-### Dependency Repositories
-
-- [PDF-Extract-Kit: a comprehensive toolkit for high-quality PDF content extraction](https://github.com/opendatalab/PDF-Extract-Kit) 🚀🚀🚀
-
-## Getting Started
-
-### Requirements
-
-- Python >= 3.9
-
-To avoid dependency conflicts, we recommend using a virtual environment; both venv and conda are suitable.
-Example:
-```bash
-conda create -n MinerU python=3.10
-conda activate MinerU
-```
-
-### Installation and Configuration
-
-#### 1. Install Magic-PDF
-
-**1. Install the dependencies**
-
-The full-featured package depends on detectron2, which must be compiled and installed.   
-If you need to compile it yourself, see https://github.com/facebookresearch/detectron2/issues/5114.  
-Alternatively, you can use our precompiled whl package directly (Python 3.10 only):
-
-```bash
-pip install detectron2 --extra-index-url https://wheels.myhloli.com
-```
-
-**2. Install the full-featured package with pip**
->Note: the package installed via pip supports CPU only and is best suited for quick tests.
->
->For CUDA/MPS acceleration, see [Acceleration with CUDA or MPS](#4-acceleration-with-cuda-or-mps).
-
-```bash
-pip install -U magic-pdf[full]
-```
-
-> ❗️❗️❗️
-> We have pre-released the 0.6.2 beta, which addresses many of the issues noted in our logs. However, this build has not yet gone through full QA testing and does not represent final release quality. If you run into problems, please report them promptly via issues or revert to version 0.6.1.
-> ```bash
-> pip install -U magic-pdf[full]
-> ```
-
-
-#### 2. Download the model weight files
-
-For details, see [how_to_download_models](docs/how_to_download_models_en.md).
-
-After downloading the model weights, move the 'models' directory to a location with ample disk space, preferably on an SSD.
-
-
-#### 3. Copy and edit the configuration file
-The [magic-pdf.template.json](magic-pdf.template.json) file is in the root directory of the repository.
-```bash
-cp magic-pdf.template.json ~/magic-pdf.json
-```
-In magic-pdf.json, set "models-dir" to the directory that contains the model weight files.
-
-```json
-{
-  "models-dir": "/tmp/models"
-}
-```
-
-
-#### 4. Acceleration with CUDA or MPS
-If you have an available Nvidia GPU or a Mac with Apple Silicon, you can use CUDA or MPS acceleration, respectively.
-##### CUDA
-
-You need to install the PyTorch version that matches your CUDA version.  
-This example installs the CUDA 11.8 build. For details, see https://pytorch.org/get-started/locally/.  
-```bash
-pip install --force-reinstall torch==2.3.1 torchvision==0.18.1 --index-url https://download.pytorch.org/whl/cu118
-```
-You also need to change the value of "device-mode" in the magic-pdf.json configuration file.  
-```json
-{
-  "device-mode":"cuda"
-}
-```
-
-##### MPS
-
-macOS users with M-series chips can use MPS to accelerate inference.  
-You need to change the value of "device-mode" in the magic-pdf.json configuration file.  
-```json
-{
-  "device-mode":"mps"
-}
-```
-
-
-### Usage
-
-#### 1. Command-line usage
-
-###### Simple
-
-```bash
-magic-pdf pdf-command --pdf "pdf_path" --inside_model true
-```
-After the program finishes, you can find the generated markdown files in the "/tmp/magic-pdf" directory.  
-The markdown directory contains the corresponding xxx_model.json file.   
-If you are doing secondary development on the post-processing pipeline, you can use the following command:  
-```bash
-magic-pdf pdf-command --pdf "pdf_path" --model "model_json_path"
-```
-This way, you do not need to re-run the model, which makes debugging easier.
-
-
-###### Details
-
-```bash
-magic-pdf --help
-```
-
-
-#### 2. Usage via the API
-
-###### Local
-```python
-image_writer = DiskReaderWriter(local_image_dir)
-image_dir = str(os.path.basename(local_image_dir))
-jso_useful_key = {"_pdf_type": "", "model_list": []}
-pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
-pipe.pipe_classify()
-pipe.pipe_parse()
-md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
-```
-
-###### Object storage
-```python
-s3pdf_cli = S3ReaderWriter(pdf_ak, pdf_sk, pdf_endpoint)
-image_dir = "s3://img_bucket/"
-s3image_cli = S3ReaderWriter(img_ak, img_sk, img_endpoint, parent_path=image_dir)
-pdf_bytes = s3pdf_cli.read(s3_pdf_path, mode=s3pdf_cli.MODE_BIN)
-jso_useful_key = {"_pdf_type": "", "model_list": []}
-pipe = UNIPipe(pdf_bytes, jso_useful_key, s3image_cli)
-pipe.pipe_classify()
-pipe.pipe_parse()
-md_content = pipe.pipe_mk_markdown(image_dir, drop_mode="none")
-```
-
-For a demo, see [demo.py](demo/demo.py).
-
-
-# Magic-Doc
-
-
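-## Introduction
-
-Magic-Doc is a tool for converting web pages and multi-format e-books to markdown.
-
-Its main features are:
-
-- Web page extraction
-  - Accurate cross-modal parsing of text, images, tables, and formula information.
-
-- E-book document extraction
-  - Supports a variety of document formats such as epub and mobi, with full support for text and images.
-
-- Language identification
-  - Accurately recognizes 176 languages.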
-
-https://github.com/opendatalab/MinerU/assets/11393164/a5a650e9-f4c0-463e-acc3-960967f1a1ca
-
-
-
-https://github.com/opendatalab/MinerU/assets/11393164/0f4a6fe9-6cca-4113-9fdc-a537749d764d
-
-
-
-https://github.com/opendatalab/MinerU/assets/11393164/20438a02-ce6c-4af8-9dde-d722a4e825b2
-
-
-
-
-## Project Repository
-
-- [Magic-Doc](https://github.com/InternLM/magic-doc)
-  An excellent web page and e-book extraction tool
-
-
-# Thanks to All Our Contributors
-
-<a href="https://github.com/opendatalab/MinerU/graphs/contributors">
-  <img src="https://contrib.rocks/image?repo=opendatalab/MinerU" />
-</a>
-
-
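-# License Information
-
-[LICENSE.md](LICENSE.md)
-
-This project currently uses PyMuPDF to provide advanced functionality. Because PyMuPDF is licensed under the AGPL, this may impose restrictions on certain use cases. In future versions, we plan to explore moving to a PDF processing library with a more permissive license to improve user-friendliness and flexibility.
-
-
-# Acknowledgments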
-
-- [PaddleOCR](https://github.com/PaddlePaddle/PaddleOCR)
-- [PyMuPDF](https://github.com/pymupdf/PyMuPDF)
-- [fast-langdetect](https://github.com/LlmKira/fast-langdetect)
-- [pdfminer.six](https://github.com/pdfminer/pdfminer.six)
-
-
-# Citation
-
-```bibtex
-@misc{wang2024mineruopensourcesolutionprecise,
-      title={MinerU: An Open-Source Solution for Precise Document Content Extraction}, 
-      author={Bin Wang and Chao Xu and Xiaomeng Zhao and Linke Ouyang and Fan Wu and Zhiyuan Zhao and Rui Xu and Kaiwen Liu and Yuan Qu and Fukai Shang and Bo Zhang and Liqun Wei and Zhihao Sui and Wei Li and Botian Shi and Yu Qiao and Dahua Lin and Conghui He},
-      year={2024},
-      eprint={2409.18839},
-      archivePrefix={arXiv},
-      primaryClass={cs.CV},
-      url={https://arxiv.org/abs/2409.18839}, 
-}
-
-@article{he2024opendatalab,
-  title={Opendatalab: Empowering general artificial intelligence with open datasets},
-  author={He, Conghui and Li, Wei and Jin, Zhenjiang and Xu, Chao and Wang, Bin and Lin, Dahua},
-  journal={arXiv preprint arXiv:2407.13773},
-  year={2024}
-}
-```
-
-# Star History
-
-<a>
- <picture>
-   <source media="(prefers-color-scheme: dark)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date&theme=dark" />
-   <source media="(prefers-color-scheme: light)" srcset="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
-   <img alt="Star History Chart" src="https://api.star-history.com/svg?repos=opendatalab/MinerU&type=Date" />
- </picture>
-</a>
-
-# Links
-- [LabelU (a lightweight multimodal data annotation tool)](https://github.com/opendatalab/labelU)
-- [LabelLLM (an open-source LLM dialogue annotation platform)](https://github.com/opendatalab/LabelLLM)
-- [PDF-Extract-Kit (a comprehensive toolkit for high-quality PDF content extraction)](https://github.com/opendatalab/PDF-Extract-Kit)

File diff suppressed because it is too large
+ 0 - 0
README_zh-CN.md


+ 16 - 2
demo/demo.py

@@ -19,7 +19,6 @@ os.makedirs(local_image_dir, exist_ok=True)
 image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
     local_md_dir
 )
-image_dir = str(os.path.basename(local_image_dir))
 
 # read bytes
 reader1 = FileBasedDataReader("")
@@ -45,14 +44,29 @@ else:
 ### draw model result on each page
 infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))
 
+### get model inference result
+model_inference_result = infer_result.get_infer_res()
+
 ### draw layout result on each page
 pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))
 
 ### draw spans result on each page
 pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))
 
+### get markdown content
+md_content = pipe_result.get_markdown(image_dir)
+
 ### dump markdown
 pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)
 
+### get content list content
+content_list_content = pipe_result.get_content_list(image_dir)
+
 ### dump content list
-pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
+pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
+
+### get middle json
+middle_json_content = pipe_result.get_middle_json()
+
+### dump middle json
+pipe_result.dump_middle_json(md_writer, f'{name_without_suff}_middle.json')

File diff suppressed because it is too large
+ 0 - 0
demo/demo1.json


File diff suppressed because it is too large
+ 0 - 0
demo/demo2.json


File diff suppressed because it is too large
+ 0 - 0
demo/small_ocr.json


+ 55 - 0
docker/ascend_npu/Dockerfile

@@ -0,0 +1,55 @@
+# Use the official Ubuntu base image
+FROM swr.cn-central-221.ovaijisuan.com/mindformers/mindformers1.2_mindspore2.3:20240722
+
+USER root
+
+# Set environment variables to non-interactive to avoid prompts during installation
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Update the package list and install necessary packages
+RUN apt-get update && \
+    apt-get install -y \
+        software-properties-common && \
+    add-apt-repository ppa:deadsnakes/ppa && \
+    apt-get update && \
+    apt-get install -y \
+        python3.10 \
+        python3.10-venv \
+        python3.10-distutils \
+        python3.10-dev \
+        python3-pip \
+        wget \
+        git \
+        libgl1 \
+        libglib2.0-0 \
+        && rm -rf /var/lib/apt/lists/*
+
+# Set Python 3.10 as the default python3
+RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1
+
+# Create a virtual environment for MinerU
+RUN python3 -m venv /opt/mineru_venv
+
+# Activate the virtual environment and install necessary Python packages
+RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
+    pip3 install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple && \
+    wget https://gitee.com/myhloli/MinerU/raw/master/docker/ascend_npu/requirements.txt -O requirements.txt && \
+    pip3 install -r requirements.txt --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple && \
+    wget https://gitee.com/ascend/pytorch/releases/download/v6.0.rc2-pytorch2.3.1/torch_npu-2.3.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl && \
+    pip install torch_npu-2.3.1-cp310-cp310-manylinux_2_17_aarch64.manylinux2014_aarch64.whl"
+
+# Copy the configuration file template and install magic-pdf latest
+RUN /bin/bash -c "wget https://gitee.com/myhloli/MinerU/raw/master/magic-pdf.template.json && \
+    cp magic-pdf.template.json /root/magic-pdf.json && \
+    source /opt/mineru_venv/bin/activate && \
+    pip3 install -U magic-pdf"
+
+# Download models and update the configuration file
+RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
+    pip3 install modelscope -i https://mirrors.aliyun.com/pypi/simple && \
+    wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models.py -O download_models.py && \
+    python3 download_models.py && \
+    sed -i 's|cpu|npu|g' /root/magic-pdf.json"
+
+# Set the entry point to activate the virtual environment and run the command line tool
+ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]

+ 5 - 3
requirements-docker.txt → docker/ascend_npu/requirements.txt

@@ -4,10 +4,10 @@ click>=8.1.7
 PyMuPDF>=1.24.9
 loguru>=0.6.0
 numpy>=1.21.6,<2.0.0
-fast-langdetect==0.2.0
+fast-langdetect>=0.2.3,<0.3.0
 scikit-learn>=1.0.2
 pdfminer.six==20231228
-unimernet==0.2.2
+unimernet==0.2.3
 torch>=2.2.2,<=2.3.1
 torchvision>=0.17.2,<=0.18.1
 matplotlib
@@ -19,6 +19,8 @@ einops
 accelerate
 doclayout_yolo==0.0.2
 rapidocr-paddle
-rapid_table
+rapidocr-onnxruntime
+rapid_table==0.3.0
 doclayout-yolo==0.0.2
+openai
 detectron2

+ 4 - 4
Dockerfile → docker/china/Dockerfile

@@ -29,9 +29,9 @@ RUN python3 -m venv /opt/mineru_venv
 
 # Activate the virtual environment and install necessary Python packages
 RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
-    pip3 install --upgrade pip && \
-    wget https://gitee.com/myhloli/MinerU/raw/master/requirements-docker.txt && \
-    pip3 install -r requirements-docker.txt --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple && \
+    pip3 install --upgrade pip -i https://mirrors.aliyun.com/pypi/simple && \
+    wget https://gitee.com/myhloli/MinerU/raw/master/docker/china/requirements.txt -O requirements.txt && \
+    pip3 install -r requirements.txt --extra-index-url https://wheels.myhloli.com -i https://mirrors.aliyun.com/pypi/simple && \
     pip3 install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/"
 
 # Copy the configuration file template and install magic-pdf latest
@@ -42,7 +42,7 @@ RUN /bin/bash -c "wget https://gitee.com/myhloli/MinerU/raw/master/magic-pdf.tem
 
 # Download models and update the configuration file
 RUN /bin/bash -c "pip3 install modelscope && \
-    wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models.py && \
+    wget https://gitee.com/myhloli/MinerU/raw/master/scripts/download_models.py -O download_models.py && \
     python3 download_models.py && \
     sed -i 's|cpu|cuda|g' /root/magic-pdf.json"
 

+ 25 - 0
docker/china/requirements.txt

@@ -0,0 +1,25 @@
+boto3>=1.28.43
+Brotli>=1.1.0
+click>=8.1.7
+PyMuPDF>=1.24.9
+loguru>=0.6.0
+numpy>=1.21.6,<2.0.0
+fast-langdetect>=0.2.3,<0.3.0
+scikit-learn>=1.0.2
+pdfminer.six==20231228
+unimernet==0.2.3
+torch>=2.2.2,<=2.3.1
+torchvision>=0.17.2,<=0.18.1
+matplotlib
+ultralytics>=8.3.48
+paddleocr==2.7.3
+struct-eqtable==0.3.2
+einops
+accelerate
+doclayout_yolo==0.0.2
+rapidocr-paddle
+rapidocr-onnxruntime
+rapid_table==0.3.0
+doclayout-yolo==0.0.2
+openai
+detectron2

+ 50 - 0
docker/global/Dockerfile

@@ -0,0 +1,50 @@
+# Use the official Ubuntu base image
+FROM ubuntu:22.04
+
+# Set environment variables to non-interactive to avoid prompts during installation
+ENV DEBIAN_FRONTEND=noninteractive
+
+# Update the package list and install necessary packages
+RUN apt-get update && \
+    apt-get install -y \
+        software-properties-common && \
+    add-apt-repository ppa:deadsnakes/ppa && \
+    apt-get update && \
+    apt-get install -y \
+        python3.10 \
+        python3.10-venv \
+        python3.10-distutils \
+        python3-pip \
+        wget \
+        git \
+        libgl1 \
+        libglib2.0-0 \
+        && rm -rf /var/lib/apt/lists/*
+
+# Set Python 3.10 as the default python3
+RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1
+
+# Create a virtual environment for MinerU
+RUN python3 -m venv /opt/mineru_venv
+
+# Activate the virtual environment and install necessary Python packages
+RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
+    pip3 install --upgrade pip && \
+    wget https://github.com/opendatalab/MinerU/raw/master/docker/global/requirements.txt -O requirements.txt && \
+    pip3 install -r requirements.txt --extra-index-url https://wheels.myhloli.com && \
+    pip3 install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/"
+
+# Copy the configuration file template and install magic-pdf latest
+RUN /bin/bash -c "wget https://github.com/opendatalab/MinerU/raw/master/magic-pdf.template.json && \
+    cp magic-pdf.template.json /root/magic-pdf.json && \
+    source /opt/mineru_venv/bin/activate && \
+    pip3 install -U magic-pdf"
+
+# Download models and update the configuration file
+RUN /bin/bash -c "pip3 install huggingface_hub && \
+    wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models.py && \
+    python3 download_models.py && \
+    sed -i 's|cpu|cuda|g' /root/magic-pdf.json"
+
+# Set the entry point to activate the virtual environment and run the command line tool
+ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\"", "--"]

+ 25 - 0
docker/global/requirements.txt

@@ -0,0 +1,25 @@
+boto3>=1.28.43
+Brotli>=1.1.0
+click>=8.1.7
+PyMuPDF>=1.24.9
+loguru>=0.6.0
+numpy>=1.21.6,<2.0.0
+fast-langdetect>=0.2.3,<0.3.0
+scikit-learn>=1.0.2
+pdfminer.six==20231228
+unimernet==0.2.3
+torch>=2.2.2,<=2.3.1
+torchvision>=0.17.2,<=0.18.1
+matplotlib
+ultralytics>=8.3.48
+paddleocr==2.7.3
+struct-eqtable==0.3.2
+einops
+accelerate
+doclayout_yolo==0.0.2
+rapidocr-paddle
+rapidocr-onnxruntime
+rapid_table==0.3.0
+doclayout-yolo==0.0.2
+openai
+detectron2

+ 57 - 0
docs/README_Ascend_NPU_Acceleration_zh_CN.md

@@ -0,0 +1,57 @@
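+# Ascend NPU Acceleration
+
+## Introduction
+
+This document describes how to use MinerU on Ascend NPU. Its contents have been tested on a `Huawei Atlas 800T A2` server.
+```
+CPU: Kunpeng 920 aarch64 2.6GHz
+NPU: Ascend 910B 64GB
+OS: openEuler 22.03 (LTS-SP3)
+```
+Because setting up an environment for Ascend NPU is fairly complex, we recommend running MinerU in a Docker container.
+
+Before running MinerU via Docker, make sure the host machine has drivers and firmware installed that support CANN 8.0.RC2.
+
+
+## Build the Image
+Make sure you have a stable network connection, then run the following commands to build the image.    
+```bash
+wget https://gitee.com/myhloli/MinerU/raw/master/docker/ascend_npu/Dockerfile -O Dockerfile
+docker build -t mineru_npu:latest .
+```
+If no errors occur during the build, the image has been built successfully.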
+
+
+## Run the Container
+
+```bash
+docker run --rm -it -u root --privileged=true \
+    --ipc=host \
+    --network=host \
+    --device=/dev/davinci0 \
+    --device=/dev/davinci1 \
+    --device=/dev/davinci2 \
+    --device=/dev/davinci3 \
+    --device=/dev/davinci4 \
+    --device=/dev/davinci5 \
+    --device=/dev/davinci6 \
+    --device=/dev/davinci7 \
+    --device=/dev/davinci_manager \
+    --device=/dev/devmm_svm \
+    --device=/dev/hisi_hdc \
+    -v /var/log/npu/:/usr/slog \
+    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
+    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
+    mineru_npu:latest \
+    /bin/bash -c "echo 'source /opt/mineru_venv/bin/activate' >> ~/.bashrc && exec bash"
+
+magic-pdf --help
+```
+
+
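+## Known Issues
+
+- paddleocr uses an embedded ONNX model and recognizes Chinese and English at a reasonable speed only under the default language configuration
+- When a custom lang parameter is set, paddleocr slows down noticeably
+- The layout model crashes intermittently when layoutlmv3 is used; the default doclayout_yolo model is recommended
+- Table parsing has only been adapted for the rapid_table model; other models may not work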

BIN
docs/images/MinerU-logo-hq.png


BIN
docs/images/MinerU-logo.png


+ 23 - 3
magic-pdf.template.json

@@ -7,7 +7,7 @@
     "layoutreader-model-dir":"/tmp/layoutreader",
     "device-mode":"cpu",
     "layout-config": {
-        "model": "layoutlmv3"
+        "model": "doclayout_yolo"
     },
     "formula-config": {
         "mfd_model": "yolo_v8_mfd",
@@ -16,8 +16,28 @@
     },
     "table-config": {
         "model": "rapid_table",
-        "enable": false,
+        "enable": true,
         "max_time": 400
     },
-    "config_version": "1.0.0"
+    "llm-aided-config": {
+        "formula_aided": {
+            "api_key": "your_api_key",
+            "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
+            "model": "qwen2.5-7b-instruct",
+            "enable": false
+        },
+        "text_aided": {
+            "api_key": "your_api_key",
+            "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
+            "model": "qwen2.5-7b-instruct",
+            "enable": false
+        },
+        "title_aided": {
+            "api_key": "your_api_key",
+            "base_url": "https://dashscope.aliyuncs.com/compatible-mode/v1",
+            "model": "qwen2.5-32b-instruct",
+            "enable": false
+        }
+    },
+    "config_version": "1.1.0"
 }
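
The new `llm-aided-config` block ships with every entry disabled. As a minimal sketch of how a caller might check which aids are switched on, the snippet below parses a trimmed copy of the template. The helper name `enabled_llm_aids` is hypothetical and is not part of magic_pdf; only the key names come from the diff above.

```python
import json

# Trimmed copy of the llm-aided-config block from magic-pdf.template.json.
TEMPLATE_FRAGMENT = """
{
    "llm-aided-config": {
        "formula_aided": {"api_key": "your_api_key", "model": "qwen2.5-7b-instruct", "enable": false},
        "text_aided": {"api_key": "your_api_key", "model": "qwen2.5-7b-instruct", "enable": false},
        "title_aided": {"api_key": "your_api_key", "model": "qwen2.5-32b-instruct", "enable": false}
    },
    "config_version": "1.1.0"
}
"""


def enabled_llm_aids(config: dict) -> list[str]:
    """Hypothetical helper: names of the llm-aided entries whose "enable" flag is true."""
    aided = config.get("llm-aided-config", {})
    return [name for name, opts in aided.items() if opts.get("enable", False)]


config = json.loads(TEMPLATE_FRAGMENT)
print(enabled_llm_aids(config))  # nothing is enabled in the template
```

Flipping one `enable` flag to `true` in `~/magic-pdf.json` would make that entry show up in the returned list.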

+ 2 - 0
magic_pdf/config/constants.py

@@ -52,6 +52,8 @@ class MODEL_NAME:
 
     RAPID_TABLE = 'rapid_table'
 
+    YOLO_V11_LangDetect = 'yolo_v11n_langdetect'
+
 
 PARSE_TYPE_TXT = 'txt'
 PARSE_TYPE_OCR = 'ocr'

+ 7 - 0
magic_pdf/config/exceptions.py

@@ -30,3 +30,10 @@ class EmptyData(Exception):
 
     def __str__(self):
         return f'Empty data: {self.msg}'
+
+class CUDA_NOT_AVAILABLE(Exception):
+    def __init__(self, msg):
+        self.msg = msg
+
+    def __str__(self):
+        return f'CUDA not available: {self.msg}'
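
The diff adds the exception class but this excerpt does not show where it is raised. The sketch below copies the class and pairs it with a hypothetical guard, `require_cuda`, purely to illustrate how such an exception could be used; the guard and its parameters are assumptions, not code from the commit.

```python
class CUDA_NOT_AVAILABLE(Exception):
    """Copied from the hunk above."""

    def __init__(self, msg):
        self.msg = msg

    def __str__(self):
        return f'CUDA not available: {self.msg}'


def require_cuda(cuda_available: bool, device: str) -> None:
    """Hypothetical guard: raise if a CUDA device is requested but none is usable."""
    if device.startswith('cuda') and not cuda_available:
        raise CUDA_NOT_AVAILABLE(f'device {device!r} requested')


try:
    require_cuda(cuda_available=False, device='cuda')
    message = ''
except CUDA_NOT_AVAILABLE as exc:
    message = str(exc)

print(message)
```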

+ 1 - 1
magic_pdf/data/data_reader_writer/filebase.py

@@ -55,7 +55,7 @@ class FileBasedDataWriter(DataWriter):
         if not os.path.isabs(fn_path) and len(self._parent_dir) > 0:
             fn_path = os.path.join(self._parent_dir, path)
 
-        if not os.path.exists(os.path.dirname(fn_path)):
+        if not os.path.exists(os.path.dirname(fn_path)) and os.path.dirname(fn_path) != "":
             os.makedirs(os.path.dirname(fn_path), exist_ok=True)
 
         with open(fn_path, 'wb') as f:
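
The extra `os.path.dirname(fn_path) != ""` check matters when the target is a bare file name: its directory component is the empty string, and `os.makedirs("")` fails. A small standalone sketch of the same guard follows; `ensure_parent_dir` is a hypothetical name used only for illustration.

```python
import os


def ensure_parent_dir(fn_path: str) -> None:
    """Create the parent directory of fn_path, skipping paths with no directory part."""
    parent = os.path.dirname(fn_path)
    if parent != "" and not os.path.exists(parent):
        os.makedirs(parent, exist_ok=True)


# A bare file name has an empty directory component.
assert os.path.dirname("out.md") == ""

# Without the guard, makedirs is called with "" and raises even with exist_ok=True.
try:
    os.makedirs("", exist_ok=True)
    raised = False
except FileNotFoundError:
    raised = True

ensure_parent_dir("out.md")  # no-op instead of an error
```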

+ 8 - 6
magic_pdf/data/data_reader_writer/multi_bucket_s3.py

@@ -1,4 +1,4 @@
-import os
+
 from magic_pdf.config.exceptions import InvalidConfig, InvalidParams
 from magic_pdf.data.data_reader_writer.base import DataReader, DataWriter
 from magic_pdf.data.io.s3 import S3Reader, S3Writer
@@ -22,10 +22,10 @@ class MultiS3Mixin:
         """
         if len(default_prefix) == 0:
             raise InvalidConfig('default_prefix must be provided')
-    
-        arr = default_prefix.strip("/").split("/")
+
+        arr = default_prefix.strip('/').split('/')
         self.default_bucket = arr[0]
-        self.default_prefix = "/".join(arr[1:])
+        self.default_prefix = '/'.join(arr[1:])
 
         found_default_bucket_config = False
         for conf in s3_configs:
@@ -103,7 +103,8 @@ class MultiBucketS3DataReader(DataReader, MultiS3Mixin):
             s3_reader = self.__get_s3_client(bucket_name)
         else:
             s3_reader = self.__get_s3_client(self.default_bucket)
-            path = os.path.join(self.default_prefix, path)
+            if self.default_prefix:
+                path = self.default_prefix + '/' + path
         return s3_reader.read_at(path, offset, limit)
 
 
@@ -139,5 +140,6 @@ class MultiBucketS3DataWriter(DataWriter, MultiS3Mixin):
             s3_writer = self.__get_s3_client(bucket_name)
         else:
             s3_writer = self.__get_s3_client(self.default_bucket)
-            path = os.path.join(self.default_prefix, path)
+            if self.default_prefix:
+                path = self.default_prefix + '/' + path
         return s3_writer.write(path, data)
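
The commit does not state why `os.path.join` was replaced. One plausible reason (an assumption) is that S3 keys always use `/`, whereas `os.path.join` uses the platform separator. The sketch below mirrors the new logic in a hypothetical helper, `join_s3_key`, and uses `ntpath` (the Windows implementation of `os.path`) to show the difference on any platform.

```python
import ntpath


def join_s3_key(prefix: str, path: str) -> str:
    """Hypothetical helper mirroring the diff: prepend the prefix with '/' only if it is non-empty."""
    if prefix:
        return prefix + '/' + path
    return path


# What os.path.join would produce on Windows for the same inputs.
windows_joined = ntpath.join('docs/pdf', 'a.pdf')

print(join_s3_key('docs/pdf', 'a.pdf'))  # forward slash regardless of platform
print(join_s3_key('', 'a.pdf'))          # empty prefix leaves the key unchanged
print(windows_joined)                    # contains a backslash
```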

+ 13 - 1
magic_pdf/data/dataset.py

@@ -3,6 +3,7 @@ from abc import ABC, abstractmethod
 from typing import Callable, Iterator
 
 import fitz
+from loguru import logger
 
 from magic_pdf.config.enums import SupportedPdfParseMethod
 from magic_pdf.data.schemas import PageInfo
@@ -133,7 +134,7 @@ class Dataset(ABC):
 
 
 class PymuDocDataset(Dataset):
-    def __init__(self, bits: bytes):
+    def __init__(self, bits: bytes, lang=None):
         """Initialize the dataset, which wraps the pymudoc documents.
 
         Args:
@@ -144,6 +145,15 @@ class PymuDocDataset(Dataset):
         self._data_bits = bits
         self._raw_data = bits
 
+        if lang == '':
+            self._lang = None
+        elif lang == 'auto':
+            from magic_pdf.model.sub_modules.language_detection.utils import auto_detect_lang
+            self._lang = auto_detect_lang(bits)
+            logger.info(f"lang: {lang}, detect_lang: {self._lang}")
+        else:
+            self._lang = lang
+            logger.info(f"lang: {lang}")
     def __len__(self) -> int:
         """The page number of the pdf."""
         return len(self._records)
@@ -197,6 +207,8 @@ class PymuDocDataset(Dataset):
         Returns:
             Any: return the result generated by proc
         """
+        if 'lang' in kwargs and self._lang is not None:
+            kwargs['lang'] = self._lang
         return proc(self, *args, **kwargs)
 
     def classify(self) -> SupportedPdfParseMethod:
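
The `lang` handling added to `PymuDocDataset` can be read as two small rules: normalise the constructor argument, then let a non-empty dataset language override any `lang` keyword passed through `apply`. The standalone sketch below restates those rules. `resolve_lang`, `apply_lang` and the `detect` callable are hypothetical stand-ins (the real code calls `auto_detect_lang(bits)`), and unlike the diff, `apply_lang` returns a copy rather than mutating `kwargs`.

```python
from typing import Callable, Optional


def resolve_lang(lang: Optional[str], detect: Callable[[], str]) -> Optional[str]:
    """'' means no language, 'auto' defers to a detector, anything else is kept as given."""
    if lang == '':
        return None
    if lang == 'auto':
        return detect()
    return lang


def apply_lang(kwargs: dict, dataset_lang: Optional[str]) -> dict:
    """Override an explicit 'lang' keyword only when the dataset carries a language."""
    if 'lang' in kwargs and dataset_lang is not None:
        return {**kwargs, 'lang': dataset_lang}
    return kwargs


detected = resolve_lang('auto', detect=lambda: 'ja')  # stand-in detector
print(apply_lang({'lang': 'en', 'ocr': True}, detected))
```

Note that a caller who omits `lang` from the keyword arguments is unaffected, since the override only fires when the key is already present.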

+ 59 - 12
magic_pdf/data/read_api.py

@@ -1,12 +1,14 @@
 import json
 import os
+import tempfile
+import shutil
 from pathlib import Path
 
 from magic_pdf.config.exceptions import EmptyData, InvalidParams
 from magic_pdf.data.data_reader_writer import (FileBasedDataReader,
                                                MultiBucketS3DataReader)
 from magic_pdf.data.dataset import ImageDataset, PymuDocDataset
-
+from magic_pdf.utils.office_to_pdf import convert_file_to_pdf, ConvertToPdfError
 
 def read_jsonl(
     s3_path_or_local: str, s3_client: MultiBucketS3DataReader | None = None
@@ -58,23 +60,68 @@ def read_local_pdfs(path: str) -> list[PymuDocDataset]:
         list[PymuDocDataset]: each pdf file will be converted to a PymuDocDataset
     """
     if os.path.isdir(path):
-        reader = FileBasedDataReader(path)
-        return [
-            PymuDocDataset(reader.read(doc_path.name))
-            for doc_path in Path(path).glob('*.pdf')
-        ]
+        reader = FileBasedDataReader()
+        ret = []
+        for root, _, files in os.walk(path):
+            for file in files:
+                suffix = file.split('.')
+                if suffix[-1] == 'pdf':
+                    ret.append( PymuDocDataset(reader.read(os.path.join(root, file))))
+        return ret
     else:
         reader = FileBasedDataReader()
         bits = reader.read(path)
         return [PymuDocDataset(bits)]
 
+def read_local_office(path: str) -> list[PymuDocDataset]:
+    """Read ms-office file (ppt, pptx, doc, docx) from path or directory.
 
-def read_local_images(path: str, suffixes: list[str]) -> list[ImageDataset]:
+    Args:
+        path (str): ms-office file or directory that contains ms-office files
+
+    Returns:
+        list[PymuDocDataset]: each ms-office file will be converted to a PymuDocDataset
+        
+    Raises:
+        ConvertToPdfError: Failed to convert ms-office file to pdf via libreoffice
+        FileNotFoundError: File not Found
+        Exception: Unknown Exception raised
+    """
+    suffixes = ['.ppt', '.pptx', '.doc', '.docx']
+    fns = []
+    ret = []
+    if os.path.isdir(path):
+        for root, _, files in os.walk(path):
+            for file in files:
+                suffix = Path(file).suffix
+                if suffix in suffixes:
+                    fns.append((os.path.join(root, file)))
+    else:
+        fns.append(path)
+        
+    reader = FileBasedDataReader()
+    temp_dir = tempfile.mkdtemp()
+    for fn in fns:
+        try:
+            convert_file_to_pdf(fn, temp_dir)
+        except ConvertToPdfError as e:
+            raise e
+        except FileNotFoundError as e:
+            raise e
+        except Exception as e:
+            raise e
+        fn_path = Path(fn)
+        pdf_fn = f"{temp_dir}/{fn_path.stem}.pdf"
+        ret.append(PymuDocDataset(reader.read(pdf_fn)))
+    shutil.rmtree(temp_dir)
+    return ret
+
+def read_local_images(path: str, suffixes: list[str]=['.png', '.jpg']) -> list[ImageDataset]:
     """Read images from path or directory.
 
     Args:
         path (str): image file path or directory that contains image files
-        suffixes (list[str]): the suffixes of the image files used to filter the files. Example: ['jpg', 'png']
+        suffixes (list[str]): the suffixes of the image files used to filter the files. Example: ['.jpg', '.png']
 
     Returns:
         list[ImageDataset]: each image file will be converted to an ImageDataset
@@ -82,12 +129,12 @@ def read_local_images(path: str, suffixes: list[str]) -> list[ImageDataset]:
     if os.path.isdir(path):
         imgs_bits = []
         s_suffixes = set(suffixes)
-        reader = FileBasedDataReader(path)
+        reader = FileBasedDataReader()
         for root, _, files in os.walk(path):
             for file in files:
-                suffix = file.split('.')
-                if suffix[-1] in s_suffixes:
-                    imgs_bits.append(reader.read(file))
+                suffix = Path(file).suffix
+                if suffix in s_suffixes:
+                    imgs_bits.append(reader.read(os.path.join(root, file)))
         return [ImageDataset(bits) for bits in imgs_bits]
     else:
         reader = FileBasedDataReader()
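
The switch from `file.split('.')[-1]` to `Path(file).suffix` is why the docstring example changed from `['jpg', 'png']` to `['.jpg', '.png']`: `Path.suffix` keeps the leading dot. A minimal sketch of the recursive lookup, assuming a hypothetical helper `find_files` and a throwaway directory layout:

```python
import os
import tempfile
from pathlib import Path


def find_files(root_dir: str, suffixes: list[str]) -> list[str]:
    """Walk root_dir recursively and keep files whose dotted suffix matches."""
    wanted = set(suffixes)
    found = []
    for root, _, files in os.walk(root_dir):
        for name in files:
            # Path('a.png').suffix is '.png', so the filter needs dotted suffixes.
            if Path(name).suffix in wanted:
                found.append(os.path.join(root, name))
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'sub'))
    for rel in ('a.png', os.path.join('sub', 'b.jpg'), 'notes.txt'):
        Path(tmp, rel).write_bytes(b'')
    dotted = [os.path.relpath(p, tmp) for p in find_files(tmp, ['.png', '.jpg'])]
    bare = find_files(tmp, ['png', 'jpg'])  # old-style suffixes match nothing
```

Callers that still pass suffixes without the dot would silently get an empty result under the new comparison.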

+ 35 - 0
magic_pdf/data/utils.py

@@ -1,6 +1,7 @@
 
 import fitz
 import numpy as np
+from loguru import logger
 
 from magic_pdf.utils.annotations import ImportPIL
 
@@ -30,3 +31,37 @@ def fitz_doc_to_image(doc, dpi=200) -> dict:
     img_dict = {'img': img, 'width': pm.width, 'height': pm.height}
 
     return img_dict
+
+@ImportPIL
+def load_images_from_pdf(pdf_bytes: bytes, dpi=200, start_page_id=0, end_page_id=None) -> list:
+    from PIL import Image
+    images = []
+    with fitz.open('pdf', pdf_bytes) as doc:
+        pdf_page_num = doc.page_count
+        end_page_id = (
+            end_page_id
+            if end_page_id is not None and end_page_id >= 0
+            else pdf_page_num - 1
+        )
+        if end_page_id > pdf_page_num - 1:
+            logger.warning('end_page_id is out of range, use images length')
+            end_page_id = pdf_page_num - 1
+
+        for index in range(0, doc.page_count):
+            if start_page_id <= index <= end_page_id:
+                page = doc[index]
+                mat = fitz.Matrix(dpi / 72, dpi / 72)
+                pm = page.get_pixmap(matrix=mat, alpha=False)
+
+                # If the width or height exceeds 4500 after scaling, do not scale further.
+                if pm.width > 4500 or pm.height > 4500:
+                    pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)
+
+                img = Image.frombytes('RGB', (pm.width, pm.height), pm.samples)
+                img = np.array(img)
+                img_dict = {'img': img, 'width': pm.width, 'height': pm.height}
+            else:
+                img_dict = {'img': [], 'width': 0, 'height': 0}
+
+            images.append(img_dict)
+    return images
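
The page-range clamping and the scale fallback in `load_images_from_pdf` can be checked without PyMuPDF. This is a rough sketch under stated assumptions: `clamp_end_page` and `render_scale` are hypothetical helpers, and `render_scale` estimates the pixel size from the page size in points, whereas the real code renders first and inspects the integer pixmap dimensions.

```python
def clamp_end_page(end_page_id, page_count: int) -> int:
    """None or a negative value means 'last page'; larger values are clamped."""
    if end_page_id is None or end_page_id < 0:
        return page_count - 1
    return min(end_page_id, page_count - 1)


def render_scale(width_pt: float, height_pt: float, dpi: int = 200,
                 max_side: int = 4500) -> float:
    """Return dpi / 72 unless the scaled page would exceed max_side pixels."""
    scale = dpi / 72
    if width_pt * scale > max_side or height_pt * scale > max_side:
        # Fall back to a 1:1 render, as the diff does with fitz.Matrix(1, 1).
        return 1.0
    return scale
```

For a US Letter page (612 x 792 points) the default 200 dpi scale is kept; a 2000-point-wide page would exceed 4500 pixels and falls back to 1.0.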

+ 14 - 13
magic_pdf/dict2md/ocr_mkcontent.py

@@ -7,7 +7,7 @@ from magic_pdf.config.ocr_content_type import BlockType, ContentType
 from magic_pdf.libs.commons import join_path
 from magic_pdf.libs.language import detect_lang
 from magic_pdf.libs.markdown_utils import ocr_escape_special_markdown_char
-from magic_pdf.para.para_split_v3 import ListLineTag
+from magic_pdf.post_proc.para_split_v3 import ListLineTag
 
 
 def __is_hyphen_at_line_end(line):
@@ -61,7 +61,8 @@ def ocr_mk_markdown_with_para_core_v2(paras_of_layout,
         if para_type in [BlockType.Text, BlockType.List, BlockType.Index]:
             para_text = merge_para_with_text(para_block)
         elif para_type == BlockType.Title:
-            para_text = f'# {merge_para_with_text(para_block)}'
+            title_level = get_title_level(para_block)
+            para_text = f'{"#" * title_level} {merge_para_with_text(para_block)}'
         elif para_type == BlockType.InterlineEquation:
             para_text = merge_para_with_text(para_block)
         elif para_type == BlockType.Image:
@@ -125,16 +126,6 @@ def detect_language(text):
         return 'empty'
 
 
-# Split ligature characters
-def __replace_ligatures(text: str):
-    text = re.sub(r'ﬁ', 'fi', text)  # replace the fi ligature
-    text = re.sub(r'ﬂ', 'fl', text)  # replace the fl ligature
-    text = re.sub(r'ﬀ', 'ff', text)  # replace the ff ligature
-    text = re.sub(r'ﬃ', 'ffi', text)  # replace the ffi ligature
-    text = re.sub(r'ﬄ', 'ffl', text)  # replace the ffl ligature
-    return text
-
-
 def merge_para_with_text(para_block):
     block_text = ''
     for line in para_block['lines']:
@@ -196,10 +187,11 @@ def para_to_standard_format_v2(para_block, img_buket_path, page_idx, drop_reason
             'text': merge_para_with_text(para_block),
         }
     elif para_type == BlockType.Title:
+        title_level = get_title_level(para_block)
         para_content = {
             'type': 'text',
             'text': merge_para_with_text(para_block),
-            'text_level': 1,
+            'text_level': title_level,
         }
     elif para_type == BlockType.InterlineEquation:
         para_content = {
@@ -299,3 +291,12 @@ def union_make(pdf_info_dict: list,
         return '\n\n'.join(output_content)
     elif make_mode == MakeMode.STANDARD_FORMAT:
         return output_content
+
+
+def get_title_level(block):
+    title_level = block.get('level', 1)
+    if title_level > 4:
+        title_level = 4
+    elif title_level < 1:
+        title_level = 1
+    return title_level
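
The effect of `get_title_level` on Markdown output can be seen with a small standalone version. This sketch uses an equivalent `max`/`min` clamp instead of the `if`/`elif` chain; `heading` is a hypothetical helper standing in for the f-string in `ocr_mk_markdown_with_para_core_v2`.

```python
def get_title_level(block: dict) -> int:
    """Clamp the block's 'level' to the range 1..4, defaulting to 1."""
    return max(1, min(4, block.get('level', 1)))


def heading(block: dict, text: str) -> str:
    """Prefix text with one '#' per title level."""
    return f"{'#' * get_title_level(block)} {text}"
```

A block without a `level` key still renders as `# ...`, so output for models that do not emit title levels is unchanged.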

+ 11 - 4
magic_pdf/libs/clean_memory.py

@@ -3,8 +3,15 @@ import torch
 import gc
 
 
-def clean_memory():
-    if torch.cuda.is_available():
-        torch.cuda.empty_cache()
-        torch.cuda.ipc_collect()
+def clean_memory(device='cuda'):
+    if device == 'cuda':
+        if torch.cuda.is_available():
+            torch.cuda.empty_cache()
+            torch.cuda.ipc_collect()
+    elif str(device).startswith("npu"):
+        import torch_npu
+        if torch_npu.npu.is_available():
+            torch_npu.npu.empty_cache()
+    elif str(device).startswith("mps"):
+        torch.mps.empty_cache()
     gc.collect()
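
The dispatch in the new `clean_memory(device)` can be traced without importing torch. This is a sketch only: `cache_backend` is a hypothetical helper that returns the name of the branch the diff would take instead of emptying any cache.

```python
def cache_backend(device) -> str:
    """Return which cache-clearing branch clean_memory(device) would take."""
    if device == 'cuda':
        # Exact string comparison, as in the diff.
        return 'cuda'
    name = str(device)
    if name.startswith('npu'):
        return 'npu'
    if name.startswith('mps'):
        return 'mps'
    # No accelerator branch matches; only gc.collect() would run.
    return 'none'
```

One thing the sketch makes visible: the CUDA branch uses `==` while the NPU and MPS branches use `startswith`, so an indexed device string such as `'cuda:0'`, if a caller ever passes one, would skip the CUDA cache cleanup.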

+ 9 - 0
magic_pdf/libs/config_reader.py

@@ -116,6 +116,15 @@ def get_formula_config():
     else:
         return formula_config
 
+def get_llm_aided_config():
+    config = read_config()
+    llm_aided_config = config.get('llm-aided-config')
+    if llm_aided_config is None:
+        logger.warning(f"'llm-aided-config' not found in {CONFIG_FILE_NAME}, use 'None' as default")
+        return None
+    else:
+        return llm_aided_config
+
 
 if __name__ == '__main__':
     ak, sk, endpoint = get_s3_config('llm-raw')

+ 8 - 12
magic_pdf/libs/draw_bbox.py

@@ -394,17 +394,13 @@ def draw_line_sort_bbox(pdf_info, pdf_bytes, out_path, filename):
     pdf_docs.save(f'{out_path}/{filename}')
 
 
-def draw_layout_sort_bbox(pdf_info, pdf_bytes, out_path, filename):
-    layout_bbox_list = []
-
-    for page in pdf_info:
-        page_block_list = []
-        for block in page['para_blocks']:
-            bbox = block['bbox']
-            page_block_list.append(bbox)
-        layout_bbox_list.append(page_block_list)
+def draw_char_bbox(pdf_bytes, out_path, filename):
     pdf_docs = fitz.open('pdf', pdf_bytes)
     for i, page in enumerate(pdf_docs):
-        draw_bbox_with_number(i, layout_bbox_list, page, [255, 0, 0], False)
-
-    pdf_docs.save(f'{out_path}/{filename}_layout_sort.pdf')
+        for block in page.get_text('rawdict', flags=fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_MEDIABOX_CLIP)['blocks']:
+            for line in block['lines']:
+                for span in line['spans']:
+                    for char in span['chars']:
+                        char_bbox = char['bbox']
+                        page.draw_rect(char_bbox, color=[1, 0, 0], fill=None, fill_opacity=1, width=0.3, overlay=True,)
+    pdf_docs.save(f'{out_path}/{filename}')

+ 3 - 0
magic_pdf/libs/language.py

@@ -16,11 +16,14 @@ def detect_lang(text: str) -> str:
 
     if len(text) == 0:
         return ""
+
+    text = text.replace("\n", "")
     try:
         lang_upper = detect_language(text)
     except:
         html_no_ctrl_chars = ''.join([l for l in text if unicodedata.category(l)[0] not in ['C', ]])
         lang_upper = detect_language(html_no_ctrl_chars)
+
     try:
         lang = lang_upper.lower()
     except:

+ 1 - 125
magic_pdf/model/__init__.py

@@ -1,126 +1,2 @@
-from typing import Callable
-
-from abc import ABC, abstractmethod
-
-from magic_pdf.data.data_reader_writer import DataWriter
-from magic_pdf.data.dataset import Dataset
-from magic_pdf.pipe.operators import PipeResult
-
-
 __use_inside_model__ = True
-__model_mode__ = "full"
-
-
-class InferenceResultBase(ABC):
-
-    @abstractmethod
-    def __init__(self, inference_results: list, dataset: Dataset):
-        """Initialized method.
-
-        Args:
-            inference_results (list): the inference result generated by model
-            dataset (Dataset): the dataset related with model inference result
-        """
-        self._infer_res = inference_results
-        self._dataset = dataset
-
-    @abstractmethod
-    def draw_model(self, file_path: str) -> None:
-        """Draw model inference result.
-
-        Args:
-            file_path (str): the output file path
-        """
-        pass
-
-    @abstractmethod
-    def dump_model(self, writer: DataWriter, file_path: str):
-        """Dump model inference result to file.
-
-        Args:
-            writer (DataWriter): writer handle
-            file_path (str): the location of target file
-        """
-        pass
-
-    @abstractmethod
-    def get_infer_res(self):
-        """Get the inference result.
-
-        Returns:
-            list: the inference result generated by model
-        """
-        pass
-
-    @abstractmethod
-    def apply(self, proc: Callable, *args, **kwargs):
-        """Apply callable method which.
-
-        Args:
-            proc (Callable): invoke proc as follows:
-                proc(inference_result, *args, **kwargs)
-
-        Returns:
-            Any: return the result generated by proc
-        """
-        pass
-
-    @abstractmethod
-    def pipe_auto_mode(
-        self,
-        imageWriter: DataWriter,
-        start_page_id=0,
-        end_page_id=None,
-        debug_mode=False,
-        lang=None,
-    ) -> PipeResult:
-        """Post-proc the model inference result.
-            step1: classify the dataset type
-            step2: based the result of step1, using `pipe_txt_mode` or `pipe_ocr_mode`
-
-        Args:
-            imageWriter (DataWriter): the image writer handle
-            start_page_id (int, optional): Defaults to 0. Lets the user select which pages to process
-            end_page_id (int, optional): Defaults to the last page index of the dataset. Lets the user select which pages to process
-            debug_mode (bool, optional): Defaults to False. Dumps more logs if enabled
-            lang (str, optional): Defaults to None.
-
-        Returns:
-            PipeResult: the result
-        """
-        pass
-
-    @abstractmethod
-    def pipe_txt_mode(
-        self,
-        imageWriter: DataWriter,
-        start_page_id=0,
-        end_page_id=None,
-        debug_mode=False,
-        lang=None,
-    ) -> PipeResult:
-        """Post-proc the model inference result, Extract the text using the
-        third library, such as `pymupdf`
-
-        Args:
-            imageWriter (DataWriter): the image writer handle
-            start_page_id (int, optional): Defaults to 0. Lets the user select which pages to process
-            end_page_id (int, optional): Defaults to the last page index of the dataset. Lets the user select which pages to process
-            debug_mode (bool, optional): Defaults to False. Dumps more logs if enabled
-            lang (str, optional): Defaults to None.
-
-        Returns:
-            PipeResult: the result
-        """
-        pass
-
-    @abstractmethod
-    def pipe_ocr_mode(
-        self,
-        imageWriter: DataWriter,
-        start_page_id=0,
-        end_page_id=None,
-        debug_mode=False,
-        lang=None,
-    ) -> PipeResult:
-        pass
+__model_mode__ = 'full'

+ 275 - 0
magic_pdf/model/batch_analyze.py

@@ -0,0 +1,275 @@
+import time
+
+import cv2
+import numpy as np
+import torch
+from loguru import logger
+from PIL import Image
+
+from magic_pdf.config.constants import MODEL_NAME
+from magic_pdf.config.exceptions import CUDA_NOT_AVAILABLE
+from magic_pdf.data.dataset import Dataset
+from magic_pdf.libs.clean_memory import clean_memory
+from magic_pdf.libs.config_reader import get_device
+from magic_pdf.model.doc_analyze_by_custom_model import ModelSingleton
+from magic_pdf.model.pdf_extract_kit import CustomPEKModel
+from magic_pdf.model.sub_modules.model_utils import (
+    clean_vram, crop_img, get_res_list_from_layout_res)
+from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import (
+    get_adjusted_mfdetrec_res, get_ocr_result_list)
+from magic_pdf.operators.models import InferenceResult
+
+YOLO_LAYOUT_BASE_BATCH_SIZE = 4
+MFD_BASE_BATCH_SIZE = 1
+MFR_BASE_BATCH_SIZE = 16
+
+
+class BatchAnalyze:
+    def __init__(self, model: CustomPEKModel, batch_ratio: int):
+        self.model = model
+        self.batch_ratio = batch_ratio
+
+    def __call__(self, images: list) -> list:
+        images_layout_res = []
+
+        layout_start_time = time.time()
+        if self.model.layout_model_name == MODEL_NAME.LAYOUTLMv3:
+            # layoutlmv3
+            for image in images:
+                layout_res = self.model.layout_model(image, ignore_catids=[])
+                images_layout_res.append(layout_res)
+        elif self.model.layout_model_name == MODEL_NAME.DocLayout_YOLO:
+            # doclayout_yolo
+            layout_images = []
+            modified_images = []
+            for image_index, image in enumerate(images):
+                pil_img = Image.fromarray(image)
+                width, height = pil_img.size
+                if height > width:
+                    input_res = {'poly': [0, 0, width, 0, width, height, 0, height]}
+                    new_image, useful_list = crop_img(
+                        input_res, pil_img, crop_paste_x=width // 2, crop_paste_y=0
+                    )
+                    layout_images.append(new_image)
+                    modified_images.append([image_index, useful_list])
+                else:
+                    layout_images.append(pil_img)
+
+            images_layout_res += self.model.layout_model.batch_predict(
+                layout_images, self.batch_ratio * YOLO_LAYOUT_BASE_BATCH_SIZE
+            )
+
+            for image_index, useful_list in modified_images:
+                for res in images_layout_res[image_index]:
+                    for i in range(len(res['poly'])):
+                        if i % 2 == 0:
+                            res['poly'][i] = (
+                                res['poly'][i] - useful_list[0] + useful_list[2]
+                            )
+                        else:
+                            res['poly'][i] = (
+                                res['poly'][i] - useful_list[1] + useful_list[3]
+                            )
+        logger.info(
+            f'layout time: {round(time.time() - layout_start_time, 2)}, image num: {len(images)}'
+        )
+
+        if self.model.apply_formula:
+            # formula detection
+            mfd_start_time = time.time()
+            images_mfd_res = self.model.mfd_model.batch_predict(
+                images, self.batch_ratio * MFD_BASE_BATCH_SIZE
+            )
+            logger.info(
+                f'mfd time: {round(time.time() - mfd_start_time, 2)}, image num: {len(images)}'
+            )
+
+            # formula recognition
+            mfr_start_time = time.time()
+            images_formula_list = self.model.mfr_model.batch_predict(
+                images_mfd_res,
+                images,
+                batch_size=self.batch_ratio * MFR_BASE_BATCH_SIZE,
+            )
+            for image_index in range(len(images)):
+                images_layout_res[image_index] += images_formula_list[image_index]
+            logger.info(
+                f'mfr time: {round(time.time() - mfr_start_time, 2)}, image num: {len(images)}'
+            )
+
+        # free GPU memory
+        clean_vram(self.model.device, vram_threshold=8)
+
+        ocr_time = 0
+        ocr_count = 0
+        table_time = 0
+        table_count = 0
+        # reference: magic_pdf/model/doc_analyze_by_custom_model.py:doc_analyze
+        for index in range(len(images)):
+            layout_res = images_layout_res[index]
+            pil_img = Image.fromarray(images[index])
+
+            ocr_res_list, table_res_list, single_page_mfdetrec_res = (
+                get_res_list_from_layout_res(layout_res)
+            )
+            # OCR recognition
+            ocr_start = time.time()
+            # Process each area that requires OCR processing
+            for res in ocr_res_list:
+                new_image, useful_list = crop_img(
+                    res, pil_img, crop_paste_x=50, crop_paste_y=50
+                )
+                adjusted_mfdetrec_res = get_adjusted_mfdetrec_res(
+                    single_page_mfdetrec_res, useful_list
+                )
+
+                # OCR recognition
+                new_image = cv2.cvtColor(np.asarray(new_image), cv2.COLOR_RGB2BGR)
+
+                if self.model.apply_ocr:
+                    ocr_res = self.model.ocr_model.ocr(
+                        new_image, mfd_res=adjusted_mfdetrec_res
+                    )[0]
+                else:
+                    ocr_res = self.model.ocr_model.ocr(
+                        new_image, mfd_res=adjusted_mfdetrec_res, rec=False
+                    )[0]
+
+                # Integration results
+                if ocr_res:
+                    ocr_result_list = get_ocr_result_list(ocr_res, useful_list)
+                    layout_res.extend(ocr_result_list)
+            ocr_time += time.time() - ocr_start
+            ocr_count += len(ocr_res_list)
+
+            # table recognition
+            if self.model.apply_table:
+                table_start = time.time()
+                for res in table_res_list:
+                    new_image, _ = crop_img(res, pil_img)
+                    single_table_start_time = time.time()
+                    html_code = None
+                    if self.model.table_model_name == MODEL_NAME.STRUCT_EQTABLE:
+                        with torch.no_grad():
+                            table_result = self.model.table_model.predict(
+                                new_image, 'html'
+                            )
+                            if len(table_result) > 0:
+                                html_code = table_result[0]
+                    elif self.model.table_model_name == MODEL_NAME.TABLE_MASTER:
+                        html_code = self.model.table_model.img2html(new_image)
+                    elif self.model.table_model_name == MODEL_NAME.RAPID_TABLE:
+                        html_code, table_cell_bboxes, elapse = (
+                            self.model.table_model.predict(new_image)
+                        )
+                    run_time = time.time() - single_table_start_time
+                    if run_time > self.model.table_max_time:
+                        logger.warning(
+                            f'table recognition processing exceeds max time {self.model.table_max_time}s'
+                        )
+                    # check whether a valid result was returned
+                    if html_code:
+                        expected_ending = html_code.strip().endswith(
+                            '</html>'
+                        ) or html_code.strip().endswith('</table>')
+                        if expected_ending:
+                            res['html'] = html_code
+                        else:
+                            logger.warning(
+                                'table recognition processing fails, not found expected HTML table end'
+                            )
+                    else:
+                        logger.warning(
+                            'table recognition processing fails, not get html return'
+                        )
+                table_time += time.time() - table_start
+                table_count += len(table_res_list)
+
+        if self.model.apply_ocr:
+            logger.info(f'ocr time: {round(ocr_time, 2)}, image num: {ocr_count}')
+        else:
+            logger.info(f'det time: {round(ocr_time, 2)}, image num: {ocr_count}')
+        if self.model.apply_table:
+            logger.info(f'table time: {round(table_time, 2)}, image num: {table_count}')
+
+        return images_layout_res
+
+
+def doc_batch_analyze(
+    dataset: Dataset,
+    ocr: bool = False,
+    show_log: bool = False,
+    start_page_id=0,
+    end_page_id=None,
+    lang=None,
+    layout_model=None,
+    formula_enable=None,
+    table_enable=None,
+    batch_ratio: int | None = None,
+) -> InferenceResult:
+    """Perform batch analysis on a document dataset.
+
+    Args:
+        dataset (Dataset): The dataset containing document pages to be analyzed.
+        ocr (bool, optional): Flag to enable OCR (Optical Character Recognition). Defaults to False.
+        show_log (bool, optional): Flag to enable logging. Defaults to False.
+        start_page_id (int, optional): The starting page ID for analysis. Defaults to 0.
+        end_page_id (int, optional): The ending page ID for analysis. Defaults to None, which means analyze till the last page.
+        lang (str, optional): Language for OCR. Defaults to None.
+        layout_model (optional): Layout model to be used for analysis. Defaults to None.
+        formula_enable (optional): Flag to enable formula detection. Defaults to None.
+        table_enable (optional): Flag to enable table detection. Defaults to None.
+        batch_ratio (int | None, optional): Ratio for batch processing. Defaults to None, which sets it to 1.
+
+    Raises:
+        CUDA_NOT_AVAILABLE: If CUDA is not available, raises an exception as batch analysis is not supported in CPU mode.
+
+    Returns:
+        InferenceResult: The result of the batch analysis containing the analyzed data and the dataset.
+    """
+
+    if not torch.cuda.is_available():
+        raise CUDA_NOT_AVAILABLE('batch analyze not support in CPU mode')
+
+    lang = None if lang == '' else lang
+    # TODO: auto detect batch size
+    batch_ratio = 1 if batch_ratio is None else batch_ratio
+    end_page_id = end_page_id if end_page_id is not None and end_page_id >= 0 else len(dataset) - 1
+
+    model_manager = ModelSingleton()
+    custom_model: CustomPEKModel = model_manager.get_model(
+        ocr, show_log, lang, layout_model, formula_enable, table_enable
+    )
+    batch_model = BatchAnalyze(model=custom_model, batch_ratio=batch_ratio)
+
+    model_json = []
+
+    # batch analyze
+    images = []
+    for index in range(len(dataset)):
+        if start_page_id <= index <= end_page_id:
+            page_data = dataset.get_page(index)
+            img_dict = page_data.get_image()
+            images.append(img_dict['img'])
+    analyze_result = batch_model(images)
+
+    for index in range(len(dataset)):
+        page_data = dataset.get_page(index)
+        img_dict = page_data.get_image()
+        page_width = img_dict['width']
+        page_height = img_dict['height']
+        if start_page_id <= index <= end_page_id:
+            result = analyze_result.pop(0)
+        else:
+            result = []
+
+        page_info = {'page_no': index, 'height': page_height, 'width': page_width}
+        page_dict = {'layout_dets': result, 'page_info': page_info}
+        model_json.append(page_dict)
+
+    # TODO: clean memory when gpu memory is not enough
+    clean_memory_start_time = time.time()
+    clean_memory(get_device())
+    logger.info(f'clean memory time: {round(time.time() - clean_memory_start_time, 2)}')
+
+    return InferenceResult(model_json, dataset)
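
Two bookkeeping steps in `batch_analyze.py` are easier to follow in isolation: mapping layout polygons back after `crop_img`, and redistributing the batched results over all pages. The sketch below is not the project's code. `shift_poly` and `assemble` are hypothetical helpers, and the reading of `useful_list[0..3]` as `(paste_x, paste_y, x_min, y_min)` is an assumption inferred from how the diff indexes it.

```python
def shift_poly(poly, paste_x, paste_y, x_min, y_min):
    """Undo the crop/paste offset: even indices are x values, odd indices are y."""
    return [
        v - paste_x + x_min if i % 2 == 0 else v - paste_y + y_min
        for i, v in enumerate(poly)
    ]


def assemble(page_sizes, results, start_page_id, end_page_id):
    """Give each in-range page the next batched result; other pages get []."""
    pending = list(results)  # copy so the caller's list is not consumed
    model_json = []
    for index, (width, height) in enumerate(page_sizes):
        in_range = start_page_id <= index <= end_page_id
        dets = pending.pop(0) if in_range else []
        page_info = {'page_no': index, 'height': height, 'width': width}
        model_json.append({'layout_dets': dets, 'page_info': page_info})
    return model_json


# A box detected on a canvas where the page was pasted 50 px to the right.
page_poly = shift_poly([60, 10, 160, 10, 160, 110, 60, 110], 50, 0, 0, 0)

# Four pages, results only for pages 1 and 2.
pages = assemble([(100, 200)] * 4, [['a'], ['b']], 1, 2)
```

`assemble` relies on the results list being ordered exactly like the in-range pages, which is also what the `analyze_result.pop(0)` call in the diff assumes.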

+ 4 - 51
magic_pdf/model/doc_analyze_by_custom_model.py

@@ -1,16 +1,13 @@
 import os
 import time
 
-import fitz
-import numpy as np
-from loguru import logger
-
 # disable paddle's signal handling
 import paddle
+from loguru import logger
+
 paddle.disable_signal_handler()
 
 os.environ['NO_ALBUMENTATIONS_UPDATE'] = '1'  # stop albumentations from checking for updates
-os.environ['YOLO_VERBOSE'] = 'False'  # disable yolo logger
 
 try:
     import torchtext
@@ -28,7 +25,7 @@ from magic_pdf.libs.config_reader import (get_device, get_formula_config,
                                           get_local_models_dir,
                                           get_table_recog_config)
 from magic_pdf.model.model_list import MODEL
-from magic_pdf.model.operators import InferenceResult
+from magic_pdf.operators.models import InferenceResult
 
 
 def dict_compare(d1, d2):
@@ -45,47 +42,6 @@ def remove_duplicates_dicts(lst):
     return unique_dicts
 
 
-def load_images_from_pdf(
-    pdf_bytes: bytes, dpi=200, start_page_id=0, end_page_id=None
-) -> list:
-    try:
-        from PIL import Image
-    except ImportError:
-        logger.error('Pillow not installed, please install by pip.')
-        exit(1)
-
-    images = []
-    with fitz.open('pdf', pdf_bytes) as doc:
-        pdf_page_num = doc.page_count
-        end_page_id = (
-            end_page_id
-            if end_page_id is not None and end_page_id >= 0
-            else pdf_page_num - 1
-        )
-        if end_page_id > pdf_page_num - 1:
-            logger.warning('end_page_id is out of range, use images length')
-            end_page_id = pdf_page_num - 1
-
-        for index in range(0, doc.page_count):
-            if start_page_id <= index <= end_page_id:
-                page = doc[index]
-                mat = fitz.Matrix(dpi / 72, dpi / 72)
-                pm = page.get_pixmap(matrix=mat, alpha=False)
-
-                # If the width or height exceeds 4500 after scaling, do not scale further.
-                if pm.width > 4500 or pm.height > 4500:
-                    pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)
-
-                img = Image.frombytes('RGB', (pm.width, pm.height), pm.samples)
-                img = np.array(img)
-                img_dict = {'img': img, 'width': pm.width, 'height': pm.height}
-            else:
-                img_dict = {'img': [], 'width': 0, 'height': 0}
-
-            images.append(img_dict)
-    return images
-
-
 class ModelSingleton:
     _instance = None
     _models = {}
@@ -198,9 +154,6 @@ def doc_analyze(
     table_enable=None,
 ) -> InferenceResult:
 
-    if lang == '':
-        lang = None
-
     model_manager = ModelSingleton()
     custom_model = model_manager.get_model(
         ocr, show_log, lang, layout_model, formula_enable, table_enable
@@ -230,7 +183,7 @@ def doc_analyze(
         model_json.append(page_dict)
 
     gc_start = time.time()
-    clean_memory()
+    clean_memory(get_device())
     gc_time = round(time.time() - gc_start, 2)
     logger.info(f'gc time: {gc_time}')
 

+ 4 - 435
magic_pdf/model/magic_model.py

@@ -3,12 +3,9 @@ import enum
 from magic_pdf.config.model_block_type import ModelBlockTypeEnum
 from magic_pdf.config.ocr_content_type import CategoryId, ContentType
 from magic_pdf.data.dataset import Dataset
-from magic_pdf.libs.boxbase import (_is_in, _is_part_overlap, bbox_distance,
-                                    bbox_relative_pos, box_area, calculate_iou,
-                                    calculate_overlap_area_in_bbox1_area_ratio,
-                                    get_overlap_area)
+from magic_pdf.libs.boxbase import (_is_in, bbox_distance, bbox_relative_pos,
+                                    calculate_iou)
 from magic_pdf.libs.coordinate_transform import get_scale_ratio
-from magic_pdf.libs.local_math import float_gt
 from magic_pdf.pre_proc.remove_bbox_overlap import _remove_overlap_between_bbox
 
 CAPATION_OVERLAP_AREA_RATIO = 0.6
@@ -208,393 +205,6 @@ class MagicModel:
                     keep[i] = False
         return [bboxes[i] for i in range(N) if keep[i]]
 
-    def __tie_up_category_by_distance(
-        self, page_no, subject_category_id, object_category_id
-    ):
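-        """Assume each subject has at most one object (several adjacent objects may be merged into a single object), and each object
-        can belong to only one subject."""
-        ret = []
-        MAX_DIS_OF_POINT = 10**9 + 7
-        """
-        The bboxes of the subject and the object are merged into one large bbox (named: merged bbox).
-        Select all subjects that overlap the merged bbox and whose overlap area is larger than the object's area.
-        Then compute the shortest distance between the selected subjects and the object.
-        """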
-
-        def search_overlap_between_boxes(subject_idx, object_idx):
-            idxes = [subject_idx, object_idx]
-            x0s = [all_bboxes[idx]['bbox'][0] for idx in idxes]
-            y0s = [all_bboxes[idx]['bbox'][1] for idx in idxes]
-            x1s = [all_bboxes[idx]['bbox'][2] for idx in idxes]
-            y1s = [all_bboxes[idx]['bbox'][3] for idx in idxes]
-
-            merged_bbox = [
-                min(x0s),
-                min(y0s),
-                max(x1s),
-                max(y1s),
-            ]
-            ratio = 0
-
-            other_objects = list(
-                map(
-                    lambda x: {'bbox': x['bbox'], 'score': x['score']},
-                    filter(
-                        lambda x: x['category_id']
-                        not in (object_category_id, subject_category_id),
-                        self.__model_list[page_no]['layout_dets'],
-                    ),
-                )
-            )
-            for other_object in other_objects:
-                ratio = max(
-                    ratio,
-                    get_overlap_area(merged_bbox, other_object['bbox'])
-                    * 1.0
-                    / box_area(all_bboxes[object_idx]['bbox']),
-                )
-                if ratio >= MERGE_BOX_OVERLAP_AREA_RATIO:
-                    break
-
-            return ratio
-
-        def may_find_other_nearest_bbox(subject_idx, object_idx):
-            ret = float('inf')
-
-            x0 = min(
-                all_bboxes[subject_idx]['bbox'][0], all_bboxes[object_idx]['bbox'][0]
-            )
-            y0 = min(
-                all_bboxes[subject_idx]['bbox'][1], all_bboxes[object_idx]['bbox'][1]
-            )
-            x1 = max(
-                all_bboxes[subject_idx]['bbox'][2], all_bboxes[object_idx]['bbox'][2]
-            )
-            y1 = max(
-                all_bboxes[subject_idx]['bbox'][3], all_bboxes[object_idx]['bbox'][3]
-            )
-
-            object_area = abs(
-                all_bboxes[object_idx]['bbox'][2] - all_bboxes[object_idx]['bbox'][0]
-            ) * abs(
-                all_bboxes[object_idx]['bbox'][3] - all_bboxes[object_idx]['bbox'][1]
-            )
-
-            for i in range(len(all_bboxes)):
-                if (
-                    i == subject_idx
-                    or all_bboxes[i]['category_id'] != subject_category_id
-                ):
-                    continue
-                if _is_part_overlap([x0, y0, x1, y1], all_bboxes[i]['bbox']) or _is_in(
-                    all_bboxes[i]['bbox'], [x0, y0, x1, y1]
-                ):
-
-                    i_area = abs(
-                        all_bboxes[i]['bbox'][2] - all_bboxes[i]['bbox'][0]
-                    ) * abs(all_bboxes[i]['bbox'][3] - all_bboxes[i]['bbox'][1])
-                    if i_area >= object_area:
-                        ret = min(float('inf'), dis[i][object_idx])
-
-            return ret
-
-        def expand_bbbox(idxes):
-            x0s = [all_bboxes[idx]['bbox'][0] for idx in idxes]
-            y0s = [all_bboxes[idx]['bbox'][1] for idx in idxes]
-            x1s = [all_bboxes[idx]['bbox'][2] for idx in idxes]
-            y1s = [all_bboxes[idx]['bbox'][3] for idx in idxes]
-            return min(x0s), min(y0s), max(x1s), max(y1s)
-
-        subjects = self.__reduct_overlap(
-            list(
-                map(
-                    lambda x: {'bbox': x['bbox'], 'score': x['score']},
-                    filter(
-                        lambda x: x['category_id'] == subject_category_id,
-                        self.__model_list[page_no]['layout_dets'],
-                    ),
-                )
-            )
-        )
-
-        objects = self.__reduct_overlap(
-            list(
-                map(
-                    lambda x: {'bbox': x['bbox'], 'score': x['score']},
-                    filter(
-                        lambda x: x['category_id'] == object_category_id,
-                        self.__model_list[page_no]['layout_dets'],
-                    ),
-                )
-            )
-        )
-        subject_object_relation_map = {}
-
-        subjects.sort(
-            key=lambda x: x['bbox'][0] ** 2 + x['bbox'][1] ** 2
-        )  # get the distance !
-
-        all_bboxes = []
-
-        for v in subjects:
-            all_bboxes.append(
-                {
-                    'category_id': subject_category_id,
-                    'bbox': v['bbox'],
-                    'score': v['score'],
-                }
-            )
-
-        for v in objects:
-            all_bboxes.append(
-                {
-                    'category_id': object_category_id,
-                    'bbox': v['bbox'],
-                    'score': v['score'],
-                }
-            )
-
-        N = len(all_bboxes)
-        dis = [[MAX_DIS_OF_POINT] * N for _ in range(N)]
-
-        for i in range(N):
-            for j in range(i):
-                if (
-                    all_bboxes[i]['category_id'] == subject_category_id
-                    and all_bboxes[j]['category_id'] == subject_category_id
-                ):
-                    continue
-
-                subject_idx, object_idx = i, j
-                if all_bboxes[j]['category_id'] == subject_category_id:
-                    subject_idx, object_idx = j, i
-
-                if (
-                    search_overlap_between_boxes(subject_idx, object_idx)
-                    >= MERGE_BOX_OVERLAP_AREA_RATIO
-                ):
-                    dis[i][j] = float('inf')
-                    dis[j][i] = dis[i][j]
-                    continue
-
-                dis[i][j] = self._bbox_distance(
-                    all_bboxes[subject_idx]['bbox'], all_bboxes[object_idx]['bbox']
-                )
-                dis[j][i] = dis[i][j]
-
-        used = set()
-        for i in range(N):
-            # find the object associated with the i-th subject
-            if all_bboxes[i]['category_id'] != subject_category_id:
-                continue
-            seen = set()
-            candidates = []
-            arr = []
-            for j in range(N):
-
-                pos_flag_count = sum(
-                    list(
-                        map(
-                            lambda x: 1 if x else 0,
-                            bbox_relative_pos(
-                                all_bboxes[i]['bbox'], all_bboxes[j]['bbox']
-                            ),
-                        )
-                    )
-                )
-                if pos_flag_count > 1:
-                    continue
-                if (
-                    all_bboxes[j]['category_id'] != object_category_id
-                    or j in used
-                    or dis[i][j] == MAX_DIS_OF_POINT
-                ):
-                    continue
-                left, right, _, _ = bbox_relative_pos(
-                    all_bboxes[i]['bbox'], all_bboxes[j]['bbox']
-                )  # the pos_flag_count check above guarantees that this logic is valid
-                if left or right:
-                    one_way_dis = all_bboxes[i]['bbox'][2] - all_bboxes[i]['bbox'][0]
-                else:
-                    one_way_dis = all_bboxes[i]['bbox'][3] - all_bboxes[i]['bbox'][1]
-                if dis[i][j] > one_way_dis:
-                    continue
-                arr.append((dis[i][j], j))
-
-            arr.sort(key=lambda x: x[0])
-            if len(arr) > 0:
-                """
-                bug: the object nearest to this subject may lie beyond another subject.
-                e.g. [this subject] [some subject] [the nearest object of subject]
-                """
-                if may_find_other_nearest_bbox(i, arr[0][1]) >= arr[0][0]:
-
-                    candidates.append(arr[0][1])
-                    seen.add(arr[0][1])
-
-            # the initial seed has been obtained
-            for j in set(candidates):
-                tmp = []
-                for k in range(i + 1, N):
-                    pos_flag_count = sum(
-                        list(
-                            map(
-                                lambda x: 1 if x else 0,
-                                bbox_relative_pos(
-                                    all_bboxes[j]['bbox'], all_bboxes[k]['bbox']
-                                ),
-                            )
-                        )
-                    )
-
-                    if pos_flag_count > 1:
-                        continue
-
-                    if (
-                        all_bboxes[k]['category_id'] != object_category_id
-                        or k in used
-                        or k in seen
-                        or dis[j][k] == MAX_DIS_OF_POINT
-                        or dis[j][k] > dis[i][j]
-                    ):
-                        continue
-
-                    is_nearest = True
-                    for ni in range(i + 1, N):
-                        if ni in (j, k) or ni in used or ni in seen:
-                            continue
-
-                        if not float_gt(dis[ni][k], dis[j][k]):
-                            is_nearest = False
-                            break
-
-                    if is_nearest:
-                        nx0, ny0, nx1, ny1 = expand_bbbox(list(seen) + [k])
-                        n_dis = bbox_distance(
-                            all_bboxes[i]['bbox'], [nx0, ny0, nx1, ny1]
-                        )
-                        if float_gt(dis[i][j], n_dis):
-                            continue
-                        tmp.append(k)
-                        seen.add(k)
-
-                candidates = tmp
-                if len(candidates) == 0:
-                    break
-
-            # At this point we have all captions nearest to a given figure, plus the captions nearest to those captions.
-            # First expand the bbox.
-            ox0, oy0, ox1, oy1 = expand_bbbox(list(seen) + [i])
-            ix0, iy0, ix1, iy1 = all_bboxes[i]['bbox']
-
-            # Split into 4 crop regions; compute the rectangular area occupied by the merged objects that fall in each region
-            caption_poses = [
-                [ox0, oy0, ix0, oy1],
-                [ox0, oy0, ox1, iy0],
-                [ox0, iy1, ox1, oy1],
-                [ix1, oy0, ox1, oy1],
-            ]
-
-            caption_areas = []
-            for bbox in caption_poses:
-                embed_arr = []
-                for idx in seen:
-                    if (
-                        calculate_overlap_area_in_bbox1_area_ratio(
-                            all_bboxes[idx]['bbox'], bbox
-                        )
-                        > CAPATION_OVERLAP_AREA_RATIO
-                    ):
-                        embed_arr.append(idx)
-
-                if len(embed_arr) > 0:
-                    embed_x0 = min([all_bboxes[idx]['bbox'][0] for idx in embed_arr])
-                    embed_y0 = min([all_bboxes[idx]['bbox'][1] for idx in embed_arr])
-                    embed_x1 = max([all_bboxes[idx]['bbox'][2] for idx in embed_arr])
-                    embed_y1 = max([all_bboxes[idx]['bbox'][3] for idx in embed_arr])
-                    caption_areas.append(
-                        int(abs(embed_x1 - embed_x0) * abs(embed_y1 - embed_y0))
-                    )
-                else:
-                    caption_areas.append(0)
-
-            subject_object_relation_map[i] = []
-            if max(caption_areas) > 0:
-                max_area_idx = caption_areas.index(max(caption_areas))
-                caption_bbox = caption_poses[max_area_idx]
-
-                for j in seen:
-                    if (
-                        calculate_overlap_area_in_bbox1_area_ratio(
-                            all_bboxes[j]['bbox'], caption_bbox
-                        )
-                        > CAPATION_OVERLAP_AREA_RATIO
-                    ):
-                        used.add(j)
-                        subject_object_relation_map[i].append(j)
-
-        for i in sorted(subject_object_relation_map.keys()):
-            result = {
-                'subject_body': all_bboxes[i]['bbox'],
-                'all': all_bboxes[i]['bbox'],
-                'score': all_bboxes[i]['score'],
-            }
-
-            if len(subject_object_relation_map[i]) > 0:
-                x0 = min(
-                    [all_bboxes[j]['bbox'][0] for j in subject_object_relation_map[i]]
-                )
-                y0 = min(
-                    [all_bboxes[j]['bbox'][1] for j in subject_object_relation_map[i]]
-                )
-                x1 = max(
-                    [all_bboxes[j]['bbox'][2] for j in subject_object_relation_map[i]]
-                )
-                y1 = max(
-                    [all_bboxes[j]['bbox'][3] for j in subject_object_relation_map[i]]
-                )
-                result['object_body'] = [x0, y0, x1, y1]
-                result['all'] = [
-                    min(x0, all_bboxes[i]['bbox'][0]),
-                    min(y0, all_bboxes[i]['bbox'][1]),
-                    max(x1, all_bboxes[i]['bbox'][2]),
-                    max(y1, all_bboxes[i]['bbox'][3]),
-                ]
-            ret.append(result)
-
-        total_subject_object_dis = 0
-        # sum the distances of the pairs that are already matched
-        for i in subject_object_relation_map.keys():
-            for j in subject_object_relation_map[i]:
-                total_subject_object_dis += bbox_distance(
-                    all_bboxes[i]['bbox'], all_bboxes[j]['bbox']
-                )
-
-        # add the distances between unmatched subjects and objects (approximate version)
-        with_caption_subject = set(
-            [
-                key
-                for key in subject_object_relation_map.keys()
-                if len(subject_object_relation_map[i]) > 0
-            ]
-        )
-        for i in range(N):
-            if all_bboxes[i]['category_id'] != object_category_id or i in used:
-                continue
-            candidates = []
-            for j in range(N):
-                if (
-                    all_bboxes[j]['category_id'] != subject_category_id
-                    or j in with_caption_subject
-                ):
-                    continue
-                candidates.append((dis[i][j], j))
-            if len(candidates) > 0:
-                candidates.sort(key=lambda x: x[0])
-                total_subject_object_dis += candidates[0][1]
-                with_caption_subject.add(j)
-        return ret, total_subject_object_dis
-
     def __tie_up_category_by_distance_v2(
         self,
         page_no: int,
@@ -879,52 +489,12 @@ class MagicModel:
         return ret
 
     def get_imgs(self, page_no: int):
-        with_captions, _ = self.__tie_up_category_by_distance(page_no, 3, 4)
-        with_footnotes, _ = self.__tie_up_category_by_distance(
-            page_no, 3, CategoryId.ImageFootnote
-        )
-        ret = []
-        N, M = len(with_captions), len(with_footnotes)
-        assert N == M
-        for i in range(N):
-            record = {
-                'score': with_captions[i]['score'],
-                'img_caption_bbox': with_captions[i].get('object_body', None),
-                'img_body_bbox': with_captions[i]['subject_body'],
-                'img_footnote_bbox': with_footnotes[i].get('object_body', None),
-            }
-
-            x0 = min(with_captions[i]['all'][0], with_footnotes[i]['all'][0])
-            y0 = min(with_captions[i]['all'][1], with_footnotes[i]['all'][1])
-            x1 = max(with_captions[i]['all'][2], with_footnotes[i]['all'][2])
-            y1 = max(with_captions[i]['all'][3], with_footnotes[i]['all'][3])
-            record['bbox'] = [x0, y0, x1, y1]
-            ret.append(record)
-        return ret
+        return self.get_imgs_v2(page_no)
 
     def get_tables(
         self, page_no: int
     ) -> list:  # 3 bboxes: caption, table body, table note
-        with_captions, _ = self.__tie_up_category_by_distance(page_no, 5, 6)
-        with_footnotes, _ = self.__tie_up_category_by_distance(page_no, 5, 7)
-        ret = []
-        N, M = len(with_captions), len(with_footnotes)
-        assert N == M
-        for i in range(N):
-            record = {
-                'score': with_captions[i]['score'],
-                'table_caption_bbox': with_captions[i].get('object_body', None),
-                'table_body_bbox': with_captions[i]['subject_body'],
-                'table_footnote_bbox': with_footnotes[i].get('object_body', None),
-            }
-
-            x0 = min(with_captions[i]['all'][0], with_footnotes[i]['all'][0])
-            y0 = min(with_captions[i]['all'][1], with_footnotes[i]['all'][1])
-            x1 = max(with_captions[i]['all'][2], with_footnotes[i]['all'][2])
-            y1 = max(with_captions[i]['all'][3], with_footnotes[i]['all'][3])
-            record['bbox'] = [x0, y0, x1, y1]
-            ret.append(record)
-        return ret
+        return self.get_tables_v2(page_no)
 
     def get_equations(self, page_no: int) -> list:  # has coordinates as well as text
         inline_equations = self.__get_blocks_by_type(
@@ -1043,4 +613,3 @@ class MagicModel:
 
     def get_model_list(self, page_no):
         return self.__model_list[page_no]
-
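
Note on the removed `__tie_up_category_by_distance`: before `get_imgs` and `get_tables` began delegating to the `_v2` methods, the old matcher rejected a subject/object pair when some other detection overlapped their merged bounding box too heavily. The sketch below illustrates only that one step and is not the project's code. `merge_bboxes`, `overlap_area`, and `overlap_ratio` are hypothetical stand-ins for the `boxbase` helpers (whose implementations are not shown in this diff), and boxes are assumed to be `[x0, y0, x1, y1]`.

```python
def merge_bboxes(*bboxes):
    """Smallest axis-aligned box that contains every input box."""
    return [
        min(b[0] for b in bboxes),
        min(b[1] for b in bboxes),
        max(b[2] for b in bboxes),
        max(b[3] for b in bboxes),
    ]


def overlap_area(a, b):
    """Intersection area of two boxes (hypothetical stand-in helper)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return w * h if w > 0 and h > 0 else 0


def overlap_ratio(subject, obj, other):
    """Overlap of `other` with the merged subject+object box,
    expressed relative to the area of the object box."""
    merged = merge_bboxes(subject, obj)
    obj_area = (obj[2] - obj[0]) * (obj[3] - obj[1])
    return overlap_area(merged, other) / obj_area


figure = [0, 0, 10, 10]    # subject
caption = [0, 12, 10, 14]  # object
stray = [2, 10, 8, 12]     # some other detection lying between them
ratio = overlap_ratio(figure, caption, stray)
```

In the removed code a pair whose ratio reached `MERGE_BOX_OVERLAP_AREA_RATIO` (defined outside this hunk) had its distance set to infinity, so it could never be matched.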

+ 1 - 0
magic_pdf/model/model_list.py

@@ -9,3 +9,4 @@ class AtomicModel:
     MFR = "mfr"
     OCR = "ocr"
     Table = "table"
+    LangDetect = "langdetect"

+ 33 - 22
magic_pdf/model/pdf_extract_kit.py

@@ -10,7 +10,6 @@ from loguru import logger
 from PIL import Image
 
 os.environ['NO_ALBUMENTATIONS_UPDATE'] = '1'  # stop albumentations from checking for updates
-os.environ['YOLO_VERBOSE'] = 'False'  # disable yolo logger
 
 try:
     import torchtext
@@ -88,6 +87,14 @@ class CustomPEKModel:
         )
         # initialize the parsing scheme
         self.device = kwargs.get('device', 'cpu')
+
+        if str(self.device).startswith("npu"):
+            import torch_npu
+            os.environ['FLAGS_npu_jit_compile'] = '0'
+            os.environ['FLAGS_use_stride_kernel'] = '0'
+        elif str(self.device).startswith("mps"):
+            os.environ['PYTORCH_ENABLE_MPS_FALLBACK'] = '1'
+
         logger.info('using device: {}'.format(self.device))
         models_dir = kwargs.get(
             'models_dir', os.path.join(root_dir, 'resources', 'models')
@@ -114,11 +121,12 @@ class CustomPEKModel:
                 os.path.join(models_dir, self.configs['weights'][self.mfr_model_name])
             )
             mfr_cfg_path = str(os.path.join(model_config_dir, 'UniMERNet', 'demo.yaml'))
+
             self.mfr_model = atom_model_manager.get_atom_model(
                 atom_model_name=AtomicModel.MFR,
                 mfr_weight_dir=mfr_weight_dir,
                 mfr_cfg_path=mfr_cfg_path,
-                device=self.device,
+                device='cpu' if str(self.device).startswith("mps") else self.device,
             )
 
         # initialize the layout model
@@ -165,12 +173,17 @@ class CustomPEKModel:
                 table_model_path=str(os.path.join(models_dir, table_model_dir)),
                 table_max_time=self.table_max_time,
                 device=self.device,
+                ocr_engine=self.ocr_model,
             )
 
         logger.info('DocAnalysis init done!')
 
     def __call__(self, image):
 
+        pil_img = Image.fromarray(image)
+        width, height = pil_img.size
+        # logger.info(f'width: {width}, height: {height}')
+
         # layout detection
         layout_start = time.time()
         layout_res = []
@@ -179,30 +192,28 @@ class CustomPEKModel:
             layout_res = self.layout_model(image, ignore_catids=[])
         elif self.layout_model_name == MODEL_NAME.DocLayout_YOLO:
             # doclayout_yolo
-            img_pil = Image.fromarray(image)
-            width, height = img_pil.size
-            # logger.info(f'width: {width}, height: {height}')
-            input_res = {"poly":[0,0,width,0,width,height,0,height]}
-            new_image, useful_list = crop_img(input_res, img_pil, crop_paste_x=width//2, crop_paste_y=0)
-            paste_x, paste_y, xmin, ymin, xmax, ymax, new_width, new_height = useful_list
-            layout_res = self.layout_model.predict(new_image)
-            for res in layout_res:
-                p1, p2, p3, p4, p5, p6, p7, p8 = res['poly']
-                p1 = p1 - paste_x + xmin
-                p2 = p2 - paste_y + ymin
-                p3 = p3 - paste_x + xmin
-                p4 = p4 - paste_y + ymin
-                p5 = p5 - paste_x + xmin
-                p6 = p6 - paste_y + ymin
-                p7 = p7 - paste_x + xmin
-                p8 = p8 - paste_y + ymin
-                res['poly'] = [p1, p2, p3, p4, p5, p6, p7, p8]
+            if height > width:
+                input_res = {"poly":[0,0,width,0,width,height,0,height]}
+                new_image, useful_list = crop_img(input_res, pil_img, crop_paste_x=width//2, crop_paste_y=0)
+                paste_x, paste_y, xmin, ymin, xmax, ymax, new_width, new_height = useful_list
+                layout_res = self.layout_model.predict(new_image)
+                for res in layout_res:
+                    p1, p2, p3, p4, p5, p6, p7, p8 = res['poly']
+                    p1 = p1 - paste_x + xmin
+                    p2 = p2 - paste_y + ymin
+                    p3 = p3 - paste_x + xmin
+                    p4 = p4 - paste_y + ymin
+                    p5 = p5 - paste_x + xmin
+                    p6 = p6 - paste_y + ymin
+                    p7 = p7 - paste_x + xmin
+                    p8 = p8 - paste_y + ymin
+                    res['poly'] = [p1, p2, p3, p4, p5, p6, p7, p8]
+            else:
+                layout_res = self.layout_model.predict(image)
 
         layout_cost = round(time.time() - layout_start, 2)
         logger.info(f'layout detection time: {layout_cost}')
 
-        pil_img = Image.fromarray(image)
-
         if self.apply_formula:
             # formula detection
             mfd_start = time.time()
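
Note on the portrait-page branch in `CustomPEKModel.__call__`: after prediction on the padded image, every x coordinate of a `poly` is shifted by `xmin - paste_x` and every y coordinate by `ymin - paste_y`. The eight unrolled assignments in the hunk can be read as the loop below. This is a minimal sketch: `shift_poly` is a hypothetical helper name, and the sample offsets are invented for illustration, since the values `crop_img` actually returns are not shown in this diff.

```python
def shift_poly(poly, paste_x, paste_y, xmin, ymin):
    """Map an 8-value polygon from padded-image coordinates back to page coordinates.

    Even indices are x values, odd indices are y values.
    """
    dx = xmin - paste_x
    dy = ymin - paste_y
    return [v + (dx if i % 2 == 0 else dy) for i, v in enumerate(poly)]


# Invented example: the page was pasted 100 px to the right on the padded canvas.
padded_poly = [110, 20, 210, 20, 210, 70, 110, 70]
page_poly = shift_poly(padded_poly, paste_x=100, paste_y=0, xmin=0, ymin=0)
```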

+ 1 - 0
magic_pdf/model/sub_modules/language_detection/__init__.py

@@ -0,0 +1 @@
+# Copyright (c) Opendatalab. All rights reserved.

+ 82 - 0
magic_pdf/model/sub_modules/language_detection/utils.py

@@ -0,0 +1,82 @@
+# Copyright (c) Opendatalab. All rights reserved.
+import os
+from pathlib import Path
+
+import yaml
+from PIL import Image
+
+os.environ['NO_ALBUMENTATIONS_UPDATE'] = '1'  # stop albumentations from checking for updates
+
+from magic_pdf.config.constants import MODEL_NAME
+from magic_pdf.data.utils import load_images_from_pdf
+from magic_pdf.libs.config_reader import get_local_models_dir, get_device
+from magic_pdf.libs.pdf_check import extract_pages
+from magic_pdf.model.model_list import AtomicModel
+from magic_pdf.model.sub_modules.model_init import AtomModelSingleton
+
+
+def get_model_config():
+    local_models_dir = get_local_models_dir()
+    device = get_device()
+    current_file_path = os.path.abspath(__file__)
+    root_dir = Path(current_file_path).parents[3]
+    model_config_dir = os.path.join(root_dir, 'resources', 'model_config')
+    config_path = os.path.join(model_config_dir, 'model_configs.yaml')
+    with open(config_path, 'r', encoding='utf-8') as f:
+        configs = yaml.load(f, Loader=yaml.FullLoader)
+    return root_dir, local_models_dir, device, configs
+
+
+def get_text_images(simple_images):
+    _, local_models_dir, device, configs = get_model_config()
+    atom_model_manager = AtomModelSingleton()
+    temp_layout_model = atom_model_manager.get_atom_model(
+        atom_model_name=AtomicModel.Layout,
+        layout_model_name=MODEL_NAME.DocLayout_YOLO,
+        doclayout_yolo_weights=str(
+            os.path.join(
+                local_models_dir, configs['weights'][MODEL_NAME.DocLayout_YOLO]
+            )
+        ),
+        device=device,
+    )
+    text_images = []
+    for simple_image in simple_images:
+        image = Image.fromarray(simple_image['img'])
+        layout_res = temp_layout_model.predict(image)
+        # crop each text block
+        for res in layout_res:
+            if res['category_id'] in [1]:
+                x1, y1, _, _, x2, y2, _, _ = res['poly']
+                # preliminary filtering (skip blocks whose width and height are both under 100)
+                if x2 - x1 < 100 and y2 - y1 < 100:
+                    continue
+                text_images.append(image.crop((x1, y1, x2, y2)))
+    return text_images
+
+
+def auto_detect_lang(pdf_bytes: bytes):
+    sample_docs = extract_pages(pdf_bytes)
+    sample_pdf_bytes = sample_docs.tobytes()
+    simple_images = load_images_from_pdf(sample_pdf_bytes, dpi=200)
+    text_images = get_text_images(simple_images)
+    langdetect_model = model_init(MODEL_NAME.YOLO_V11_LangDetect)
+    lang = langdetect_model.do_detect(text_images)
+    return lang
+
+
+def model_init(model_name: str):
+    atom_model_manager = AtomModelSingleton()
+
+    if model_name == MODEL_NAME.YOLO_V11_LangDetect:
+        root_dir, _, device, _ = get_model_config()
+        model = atom_model_manager.get_atom_model(
+            atom_model_name=AtomicModel.LangDetect,
+            langdetect_model_name=MODEL_NAME.YOLO_V11_LangDetect,
+            langdetect_model_weight=str(os.path.join(root_dir, 'resources', 'yolov11-langdetect', 'yolo_v11_ft.pt')),
+            device=device,
+        )
+    else:
+        raise ValueError(f"model_name {model_name} not found")
+    return model
+
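
Note on `get_text_images`: the selection rule it applies to layout detections can be isolated from the model and image handling. The sketch below assumes detections shaped like the dictionaries produced by `DocLayoutYOLOModel.predict` (`category_id` plus an 8-value `poly`). `text_crop_boxes` and its parameters are hypothetical names introduced only for illustration.

```python
def text_crop_boxes(layout_res, text_category_ids=(1,), min_side=100):
    """Return (x1, y1, x2, y2) crop boxes for text blocks worth sampling.

    A block is skipped only when BOTH its width and height are below
    `min_side`, mirroring the `and` condition in the hunk above.
    """
    boxes = []
    for res in layout_res:
        if res["category_id"] not in text_category_ids:
            continue
        # poly is [x1, y1, x2, y1, x2, y2, x1, y2]; corners 0 and 2 are enough
        x1, y1, _, _, x2, y2, _, _ = res["poly"]
        if x2 - x1 < min_side and y2 - y1 < min_side:
            continue
        boxes.append((x1, y1, x2, y2))
    return boxes


detections = [
    {"category_id": 1, "poly": [0, 0, 300, 0, 300, 50, 0, 50]},    # wide text line: kept
    {"category_id": 1, "poly": [0, 0, 50, 0, 50, 50, 0, 50]},      # tiny block: skipped
    {"category_id": 0, "poly": [0, 0, 500, 0, 500, 500, 0, 500]},  # not category 1: skipped
]
boxes = text_crop_boxes(detections)
```

Because the condition uses `and`, a long but thin line of text (wide, under 100 px tall) still passes the filter.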

+ 139 - 0
magic_pdf/model/sub_modules/language_detection/yolov11/YOLOv11.py

@@ -0,0 +1,139 @@
+# Copyright (c) Opendatalab. All rights reserved.
+from collections import Counter
+from uuid import uuid4
+
+import torch
+from PIL import Image
+from loguru import logger
+from ultralytics import YOLO
+
+language_dict = {
+    "ch": "中文简体",
+    "en": "英语",
+    "japan": "日语",
+    "korean": "韩语",
+    "fr": "法语",
+    "german": "德语",
+    "ar": "阿拉伯语",
+    "ru": "俄语"
+}
+
+
+def split_images(image, result_images=None):
+    """
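+    Split an image whose longer side exceeds 400 pixels along that side,
+    halving it recursively until every piece has a longer side of at most 400, and collect the pieces in result_images.
+    Crop regions that would extend past the image bounds are skipped, which avoids invalid black areas in the pieces.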
+    """
+    if result_images is None:
+        result_images = []
+
+    width, height = image.size
+    long_side = max(width, height)  # length of the longer side
+
+    if long_side <= 400:
+        result_images.append(image)
+        return result_images
+
+    new_long_side = long_side // 2
+    sub_images = []
+
+    if width >= height:  # width is the longer side
+        for x in range(0, width, new_long_side):
+            # skip the crop if its region would extend past the image bounds
+            if x + new_long_side > width:
+                continue
+            box = (x, 0, x + new_long_side, height)
+            sub_image = image.crop(box)
+            sub_images.append(sub_image)
+    else:  # height is the longer side
+        for y in range(0, height, new_long_side):
+            # skip the crop if its region would extend past the image bounds
+            if y + new_long_side > height:
+                continue
+            box = (0, y, width, y + new_long_side)
+            sub_image = image.crop(box)
+            sub_images.append(sub_image)
+
+    for sub_image in sub_images:
+        split_images(sub_image, result_images)
+
+    return result_images
+
+
+def resize_images_to_224(image):
+    """
+    If the width or height is below 224, pad the image onto a 224*224 black background; otherwise resize it to 224*224. The result is returned.
+    """
+    try:
+        width, height = image.size
+        if width < 224 or height < 224:
+            new_image = Image.new('RGB', (224, 224), (0, 0, 0))
+            paste_x = (224 - width) // 2
+            paste_y = (224 - height) // 2
+            new_image.paste(image, (paste_x, paste_y))
+            image = new_image
+        else:
+            image = image.resize((224, 224), Image.Resampling.LANCZOS)
+
+        # uuid = str(uuid4())
+        # image.save(f"/tmp/{uuid}.jpg")
+        return image
+    except Exception as e:
+        logger.exception(e)
+
+
+class YOLOv11LangDetModel(object):
+    def __init__(self, langdetect_model_weight, device):
+
+        self.model = YOLO(langdetect_model_weight)
+
+        if str(device).startswith("npu"):
+            self.device = torch.device(device)
+        else:
+            self.device = device
+    def do_detect(self, images: list):
+        all_images = []
+        for image in images:
+            width, height = image.size
+            # logger.info(f"image size: {width} x {height}")
+            if width < 100 and height < 100:
+                continue
+            temp_images = split_images(image)
+            for temp_image in temp_images:
+                all_images.append(resize_images_to_224(temp_image))
+
+        images_lang_res = self.batch_predict(all_images, batch_size=8)
+        # logger.info(f"images_lang_res: {images_lang_res}")
+        if len(images_lang_res) > 0:
+            count_dict = Counter(images_lang_res)
+            language = max(count_dict, key=count_dict.get)
+        else:
+            language = None
+        return language
+
+    def predict(self, image):
+        results = self.model.predict(image, verbose=False, device=self.device)
+        predicted_class_id = int(results[0].probs.top1)
+        predicted_class_name = self.model.names[predicted_class_id]
+        return predicted_class_name
+
+
+    def batch_predict(self, images: list, batch_size: int) -> list:
+        images_lang_res = []
+
+        for index in range(0, len(images), batch_size):
+            lang_res = [
+                image_res.cpu()
+                for image_res in self.model.predict(
+                    images[index: index + batch_size],
+                    verbose = False,
+                    device=self.device,
+                )
+            ]
+            for res in lang_res:
+                predicted_class_id = int(res.probs.top1)
+                predicted_class_name = self.model.names[predicted_class_id]
+                images_lang_res.append(predicted_class_name)
+
+        return images_lang_res
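
Note on `split_images` and `do_detect`: the recursion and the final vote can be followed without Pillow or a model. The sketch below tracks only tile sizes rather than pixels and takes predicted labels as plain strings. `split_sizes` and `majority_label` are hypothetical names, and the 400 px limit is copied from the hunk above.

```python
from collections import Counter


def split_sizes(width, height, limit=400):
    """Sizes of the tiles produced by halving the longer side recursively.

    A tile that would extend past the image edge is skipped, so a leftover
    strip (for example the last pixel of an odd-length side) is dropped.
    """
    long_side = max(width, height)
    if long_side <= limit:
        return [(width, height)]

    step = long_side // 2
    pieces = []
    if width >= height:
        for x in range(0, width, step):
            if x + step > width:
                continue  # tile would overrun the right edge
            pieces.extend(split_sizes(step, height, limit))
    else:
        for y in range(0, height, step):
            if y + step > height:
                continue  # tile would overrun the bottom edge
            pieces.extend(split_sizes(width, step, limit))
    return pieces


def majority_label(labels):
    """Most frequent label, or None when there are no predictions."""
    if not labels:
        return None
    counts = Counter(labels)
    return max(counts, key=counts.get)


tiles = split_sizes(1000, 300)
language = majority_label(["en", "ch", "en"])
```

Each tile is classified independently and the document language is whichever label appears most often, so a few misclassified tiles do not change the result.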

+ 1 - 0
magic_pdf/model/sub_modules/language_detection/yolov11/__init__.py

@@ -0,0 +1 @@
+# Copyright (c) Opendatalab. All rights reserved.

+ 44 - 7
magic_pdf/model/sub_modules/layout/doclayout_yolo/DocLayoutYOLO.py

@@ -8,14 +8,51 @@ class DocLayoutYOLOModel(object):
 
     def predict(self, image):
         layout_res = []
-        doclayout_yolo_res = self.model.predict(image, imgsz=1024, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
-        for xyxy, conf, cla in zip(doclayout_yolo_res.boxes.xyxy.cpu(), doclayout_yolo_res.boxes.conf.cpu(),
-                                   doclayout_yolo_res.boxes.cls.cpu()):
+        doclayout_yolo_res = self.model.predict(
+            image, imgsz=1024, conf=0.25, iou=0.45, verbose=False, device=self.device
+        )[0]
+        for xyxy, conf, cla in zip(
+            doclayout_yolo_res.boxes.xyxy.cpu(),
+            doclayout_yolo_res.boxes.conf.cpu(),
+            doclayout_yolo_res.boxes.cls.cpu(),
+        ):
             xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
             new_item = {
-                'category_id': int(cla.item()),
-                'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
-                'score': round(float(conf.item()), 3),
+                "category_id": int(cla.item()),
+                "poly": [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
+                "score": round(float(conf.item()), 3),
             }
             layout_res.append(new_item)
-        return layout_res
+        return layout_res
+
+    def batch_predict(self, images: list, batch_size: int) -> list:
+        images_layout_res = []
+        for index in range(0, len(images), batch_size):
+            doclayout_yolo_res = [
+                image_res.cpu()
+                for image_res in self.model.predict(
+                    images[index : index + batch_size],
+                    imgsz=1024,
+                    conf=0.25,
+                    iou=0.45,
+                    verbose=False,
+                    device=self.device,
+                )
+            ]
+            for image_res in doclayout_yolo_res:
+                layout_res = []
+                for xyxy, conf, cla in zip(
+                    image_res.boxes.xyxy,
+                    image_res.boxes.conf,
+                    image_res.boxes.cls,
+                ):
+                    xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
+                    new_item = {
+                        "category_id": int(cla.item()),
+                        "poly": [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
+                        "score": round(float(conf.item()), 3),
+                    }
+                    layout_res.append(new_item)
+                images_layout_res.append(layout_res)
+
+        return images_layout_res
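
Note on the layout result format: both `predict` and `batch_predict` turn each detected box into the same dictionary, expanding the two-corner `xyxy` box into a four-corner `poly` listed clockwise from the top-left. The sketch below shows that conversion on plain Python numbers, whereas the real code reads tensor elements with `.item()`. `xyxy_to_layout_item` is a hypothetical helper name.

```python
def xyxy_to_layout_item(xyxy, conf, cls):
    """Build one layout detection from a corner box, a confidence and a class id."""
    # int() truncates toward zero, as in the original int(p.item())
    xmin, ymin, xmax, ymax = [int(v) for v in xyxy]
    return {
        "category_id": int(cls),
        # top-left, top-right, bottom-right, bottom-left
        "poly": [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
        "score": round(float(conf), 3),
    }


item = xyxy_to_layout_item([10.7, 20.2, 110.9, 70.5], 0.87654, 1.0)
```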

+ 21 - 2
magic_pdf/model/sub_modules/mfd/yolov8/YOLOv8.py

@@ -2,11 +2,30 @@ from ultralytics import YOLO
 
 
 class YOLOv8MFDModel(object):
-    def __init__(self, weight, device='cpu'):
+    def __init__(self, weight, device="cpu"):
         self.mfd_model = YOLO(weight)
         self.device = device
 
     def predict(self, image):
-        mfd_res = self.mfd_model.predict(image, imgsz=1888, conf=0.25, iou=0.45, verbose=True, device=self.device)[0]
+        mfd_res = self.mfd_model.predict(
+            image, imgsz=1888, conf=0.25, iou=0.45, verbose=False, device=self.device
+        )[0]
         return mfd_res
 
+    def batch_predict(self, images: list, batch_size: int) -> list:
+        images_mfd_res = []
+        for index in range(0, len(images), batch_size):
+            mfd_res = [
+                image_res.cpu()
+                for image_res in self.mfd_model.predict(
+                    images[index : index + batch_size],
+                    imgsz=1888,
+                    conf=0.25,
+                    iou=0.45,
+                    verbose=False,
+                    device=self.device,
+                )
+            ]
+            for image_res in mfd_res:
+                images_mfd_res.append(image_res)
+        return images_mfd_res

+ 70 - 27
magic_pdf/model/sub_modules/mfr/unimernet/Unimernet.py

@@ -1,13 +1,13 @@
-import os
 import argparse
+import os
 import re
 
-from PIL import Image
 import torch
-from torch.utils.data import Dataset, DataLoader
+import unimernet.tasks as tasks
+from PIL import Image
+from torch.utils.data import DataLoader, Dataset
 from torchvision import transforms
 from unimernet.common.config import Config
-import unimernet.tasks as tasks
 from unimernet.processors import load_processor
 
 
@@ -31,27 +31,25 @@ class MathDataset(Dataset):
 
 
 def latex_rm_whitespace(s: str):
-    """Remove unnecessary whitespace from LaTeX code.
-    """
-    text_reg = r'(\\(operatorname|mathrm|text|mathbf)\s?\*? {.*?})'
-    letter = '[a-zA-Z]'
-    noletter = '[\W_^\d]'
-    names = [x[0].replace(' ', '') for x in re.findall(text_reg, s)]
+    """Remove unnecessary whitespace from LaTeX code."""
+    text_reg = r"(\\(operatorname|mathrm|text|mathbf)\s?\*? {.*?})"
+    letter = "[a-zA-Z]"
+    noletter = "[\W_^\d]"
+    names = [x[0].replace(" ", "") for x in re.findall(text_reg, s)]
     s = re.sub(text_reg, lambda match: str(names.pop(0)), s)
     news = s
     while True:
         s = news
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, noletter), r'\1\2', s)
-        news = re.sub(r'(?!\\ )(%s)\s+?(%s)' % (noletter, letter), r'\1\2', news)
-        news = re.sub(r'(%s)\s+?(%s)' % (letter, noletter), r'\1\2', news)
+        news = re.sub(r"(?!\\ )(%s)\s+?(%s)" % (noletter, noletter), r"\1\2", s)
+        news = re.sub(r"(?!\\ )(%s)\s+?(%s)" % (noletter, letter), r"\1\2", news)
+        news = re.sub(r"(%s)\s+?(%s)" % (letter, noletter), r"\1\2", news)
         if news == s:
             break
     return s
 
 
 class UnimernetModel(object):
-    def __init__(self, weight_dir, cfg_path, _device_='cpu'):
-
+    def __init__(self, weight_dir, cfg_path, _device_="cpu"):
         args = argparse.Namespace(cfg_path=cfg_path, options=None)
         cfg = Config(args)
         cfg.config.model.pretrained = os.path.join(weight_dir, "pytorch_model.pth")
@@ -62,20 +60,28 @@ class UnimernetModel(object):
         self.device = _device_
         self.model.to(_device_)
         self.model.eval()
-        vis_processor = load_processor('formula_image_eval', cfg.config.datasets.formula_rec_eval.vis_processor.eval)
-        self.mfr_transform = transforms.Compose([vis_processor, ])
+        vis_processor = load_processor(
+            "formula_image_eval",
+            cfg.config.datasets.formula_rec_eval.vis_processor.eval,
+        )
+        self.mfr_transform = transforms.Compose(
+            [
+                vis_processor,
+            ]
+        )
 
     def predict(self, mfd_res, image):
-
         formula_list = []
         mf_image_list = []
-        for xyxy, conf, cla in zip(mfd_res.boxes.xyxy.cpu(), mfd_res.boxes.conf.cpu(), mfd_res.boxes.cls.cpu()):
+        for xyxy, conf, cla in zip(
+            mfd_res.boxes.xyxy.cpu(), mfd_res.boxes.conf.cpu(), mfd_res.boxes.cls.cpu()
+        ):
             xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
             new_item = {
-                'category_id': 13 + int(cla.item()),
-                'poly': [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
-                'score': round(float(conf.item()), 2),
-                'latex': '',
+                "category_id": 13 + int(cla.item()),
+                "poly": [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
+                "score": round(float(conf.item()), 2),
+                "latex": "",
             }
             formula_list.append(new_item)
             pil_img = Image.fromarray(image)
@@ -88,11 +94,48 @@ class UnimernetModel(object):
         for mf_img in dataloader:
             mf_img = mf_img.to(self.device)
             with torch.no_grad():
-                output = self.model.generate({'image': mf_img})
-            mfr_res.extend(output['pred_str'])
+                output = self.model.generate({"image": mf_img})
+            mfr_res.extend(output["pred_str"])
         for res, latex in zip(formula_list, mfr_res):
-            res['latex'] = latex_rm_whitespace(latex)
+            res["latex"] = latex_rm_whitespace(latex)
         return formula_list
 
+    def batch_predict(
+        self, images_mfd_res: list, images: list, batch_size: int = 64
+    ) -> list:
+        images_formula_list = []
+        mf_image_list = []
+        backfill_list = []
+        for image_index in range(len(images_mfd_res)):
+            mfd_res = images_mfd_res[image_index]
+            pil_img = Image.fromarray(images[image_index])
+            formula_list = []
+
+            for xyxy, conf, cla in zip(
+                mfd_res.boxes.xyxy, mfd_res.boxes.conf, mfd_res.boxes.cls
+            ):
+                xmin, ymin, xmax, ymax = [int(p.item()) for p in xyxy]
+                new_item = {
+                    "category_id": 13 + int(cla.item()),
+                    "poly": [xmin, ymin, xmax, ymin, xmax, ymax, xmin, ymax],
+                    "score": round(float(conf.item()), 2),
+                    "latex": "",
+                }
+                formula_list.append(new_item)
+                bbox_img = pil_img.crop((xmin, ymin, xmax, ymax))
+                mf_image_list.append(bbox_img)
+
+            images_formula_list.append(formula_list)
+            backfill_list += formula_list
 
-
+        dataset = MathDataset(mf_image_list, transform=self.mfr_transform)
+        dataloader = DataLoader(dataset, batch_size=batch_size, num_workers=0)
+        mfr_res = []
+        for mf_img in dataloader:
+            mf_img = mf_img.to(self.device)
+            with torch.no_grad():
+                output = self.model.generate({"image": mf_img})
+            mfr_res.extend(output["pred_str"])
+        for res, latex in zip(backfill_list, mfr_res):
+            res["latex"] = latex_rm_whitespace(latex)
+        return images_formula_list
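
The new `batch_predict` collects formula crops from every page into one flat list, runs recognition once over that list, and writes the results back through `backfill_list`. This works because `backfill_list` and the per-page lists hold references to the same dict objects. A minimal sketch of the idea, with a hypothetical `recognize` function standing in for the UniMERNet model:

```python
def recognize(crops):
    # Hypothetical stand-in for the formula recognizer: one string per crop.
    return [f"latex({crop})" for crop in crops]


def batch_fill(per_image_crops):
    per_image_items = []
    flat_crops = []
    backfill = []
    for crops in per_image_crops:
        items = [{"latex": ""} for _ in crops]
        per_image_items.append(items)
        flat_crops.extend(crops)
        # The same dict objects are now referenced from both lists.
        backfill += items
    for item, latex in zip(backfill, recognize(flat_crops)):
        # Mutating the dict here is visible through per_image_items too.
        item["latex"] = latex
    return per_image_items


filled = batch_fill([["a", "b"], [], ["c"]])
```

Pages without formulas keep an empty list, so the outer list still has one entry per page.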

+ 30 - 4
magic_pdf/model/sub_modules/model_init.py

@@ -1,7 +1,9 @@
+import torch
 from loguru import logger
 
 from magic_pdf.config.constants import MODEL_NAME
 from magic_pdf.model.model_list import AtomicModel
+from magic_pdf.model.sub_modules.language_detection.yolov11.YOLOv11 import YOLOv11LangDetModel
 from magic_pdf.model.sub_modules.layout.doclayout_yolo.DocLayoutYOLO import \
     DocLayoutYOLOModel
 from magic_pdf.model.sub_modules.layout.layoutlmv3.model_init import \
@@ -19,7 +21,7 @@ from magic_pdf.model.sub_modules.table.tablemaster.tablemaster_paddle import \
     TableMasterPaddleModel
 
 
-def table_model_init(table_model_type, model_path, max_time, _device_='cpu'):
+def table_model_init(table_model_type, model_path, max_time, _device_='cpu', ocr_engine=None):
     if table_model_type == MODEL_NAME.STRUCT_EQTABLE:
         table_model = StructTableModel(model_path, max_new_tokens=2048, max_time=max_time)
     elif table_model_type == MODEL_NAME.TABLE_MASTER:
@@ -29,7 +31,7 @@ def table_model_init(table_model_type, model_path, max_time, _device_='cpu'):
         }
         table_model = TableMasterPaddleModel(config)
     elif table_model_type == MODEL_NAME.RAPID_TABLE:
-        table_model = RapidTableModel()
+        table_model = RapidTableModel(ocr_engine)
     else:
         logger.error('table model type not allow')
         exit(1)
@@ -38,6 +40,8 @@ def table_model_init(table_model_type, model_path, max_time, _device_='cpu'):
 
 
 def mfd_model_init(weight, device='cpu'):
+    if str(device).startswith("npu"):
+        device = torch.device(device)
     mfd_model = YOLOv8MFDModel(weight, device)
     return mfd_model
 
@@ -53,16 +57,26 @@ def layout_model_init(weight, config_file, device):
 
 
 def doclayout_yolo_model_init(weight, device='cpu'):
+    if str(device).startswith("npu"):
+        device = torch.device(device)
     model = DocLayoutYOLOModel(weight, device)
     return model
 
 
+def langdetect_model_init(langdetect_model_weight, device='cpu'):
+    if str(device).startswith("npu"):
+        device = torch.device(device)
+    model = YOLOv11LangDetModel(langdetect_model_weight, device)
+    return model
+
+
 def ocr_model_init(show_log: bool = False,
                    det_db_box_thresh=0.3,
                    lang=None,
                    use_dilation=True,
                    det_db_unclip_ratio=1.8,
                    ):
+
     if lang is not None and lang != '':
         model = ModifiedPaddleOCR(
             show_log=show_log,
@@ -77,7 +91,6 @@ def ocr_model_init(show_log: bool = False,
             det_db_box_thresh=det_db_box_thresh,
             use_dilation=use_dilation,
             det_db_unclip_ratio=det_db_unclip_ratio,
-            # use_angle_cls=True,
         )
     return model
 
@@ -124,6 +137,9 @@ def atom_model_init(model_name: str, **kwargs):
                 kwargs.get('doclayout_yolo_weights'),
                 kwargs.get('device')
             )
+        else:
+            logger.error('layout model name not allow')
+            exit(1)
     elif model_name == AtomicModel.MFD:
         atom_model = mfd_model_init(
             kwargs.get('mfd_weights'),
@@ -146,8 +162,18 @@ def atom_model_init(model_name: str, **kwargs):
             kwargs.get('table_model_name'),
             kwargs.get('table_model_path'),
             kwargs.get('table_max_time'),
-            kwargs.get('device')
+            kwargs.get('device'),
+            kwargs.get('ocr_engine')
         )
+    elif model_name == AtomicModel.LangDetect:
+        if kwargs.get('langdetect_model_name') == MODEL_NAME.YOLO_V11_LangDetect:
+            atom_model = langdetect_model_init(
+                kwargs.get('langdetect_model_weight'),
+                kwargs.get('device')
+            )
+        else:
+            logger.error('langdetect model name not allow')
+            exit(1)
     else:
         logger.error('model name not allow')
         exit(1)

+ 8 - 2
magic_pdf/model/sub_modules/model_utils.py

@@ -45,7 +45,7 @@ def clean_vram(device, vram_threshold=8):
     total_memory = get_vram(device)
     if total_memory and total_memory <= vram_threshold:
         gc_start = time.time()
-        clean_memory()
+        clean_memory(device)
         gc_time = round(time.time() - gc_start, 2)
         logger.info(f"gc time: {gc_time}")
 
@@ -54,4 +54,10 @@ def get_vram(device):
     if torch.cuda.is_available() and device != 'cpu':
         total_memory = torch.cuda.get_device_properties(device).total_memory / (1024 ** 3)  # convert bytes to GB
         return total_memory
-    return None
+    elif str(device).startswith("npu"):
+        import torch_npu
+        if torch_npu.npu.is_available():
+            total_memory = torch_npu.npu.get_device_properties(device).total_memory / (1024 ** 3)  # convert to GB
+            return total_memory
+    else:
+        return None

+ 51 - 1
magic_pdf/model/sub_modules/ocr/paddleocr/ocr_utils.py

@@ -303,4 +303,54 @@ def calculate_is_angle(poly):
         return False
     else:
         # logger.info((p3[1] - p1[1])/height)
-        return True
+        return True
+
+
+class ONNXModelSingleton:
+    _instance = None
+    _models = {}
+
+    def __new__(cls, *args, **kwargs):
+        if cls._instance is None:
+            cls._instance = super().__new__(cls)
+        return cls._instance
+
+    def get_onnx_model(self, **kwargs):
+
+        lang = kwargs.get('lang', None)
+        det_db_box_thresh = kwargs.get('det_db_box_thresh', 0.3)
+        use_dilation = kwargs.get('use_dilation', True)
+        det_db_unclip_ratio = kwargs.get('det_db_unclip_ratio', 1.8)
+        key = (lang, det_db_box_thresh, use_dilation, det_db_unclip_ratio)
+        if key not in self._models:
+            self._models[key] = onnx_model_init(key)
+        return self._models[key]
+
+def onnx_model_init(key):
+
+    import importlib.resources
+
+    resource_path = importlib.resources.path('rapidocr_onnxruntime.models','')
+
+    onnx_model = None
+    additional_ocr_params = {
+        "use_onnx": True,
+        "det_model_dir": f'{resource_path}/ch_PP-OCRv4_det_infer.onnx',
+        "rec_model_dir": f'{resource_path}/ch_PP-OCRv4_rec_infer.onnx',
+        "cls_model_dir": f'{resource_path}/ch_ppocr_mobile_v2.0_cls_infer.onnx',
+        "det_db_box_thresh": key[1],
+        "use_dilation": key[2],
+        "det_db_unclip_ratio": key[3],
+    }
+    # logger.info(f"additional_ocr_params: {additional_ocr_params}")
+    if key[0] is not None:
+        additional_ocr_params["lang"] = key[0]
+
+    from paddleocr import PaddleOCR
+    onnx_model = PaddleOCR(**additional_ocr_params)
+
+    if onnx_model is None:
+        logger.error('model init failed')
+        exit(1)
+    else:
+        return onnx_model
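
`ONNXModelSingleton` combines two ideas: `__new__` always returns the same instance, and a class-level dict caches one model per parameter tuple. The sketch below shows the pattern in isolation. `ModelCache`, `build_model`, and `build_calls` are hypothetical names used only for illustration, and the key is shortened to two parameters.

```python
class ModelCache:
    """Process-wide cache: one instance, one model per parameter tuple."""

    _instance = None
    _models = {}

    def __new__(cls, *args, **kwargs):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def get_model(self, **kwargs):
        key = (
            kwargs.get("lang", None),
            kwargs.get("det_db_box_thresh", 0.3),
        )
        # Build only on the first request for this key.
        if key not in self._models:
            self._models[key] = build_model(key)
        return self._models[key]


build_calls = []


def build_model(key):
    # Hypothetical stand-in for an expensive model constructor.
    build_calls.append(key)
    return {"lang": key[0], "thresh": key[1]}


first = ModelCache().get_model(lang="ch")
second = ModelCache().get_model(lang="ch")
other = ModelCache().get_model(lang="en")
```

Because the key includes the default values, a call that omits a parameter and a call that passes the default explicitly share one cached model.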

+ 32 - 6
magic_pdf/model/sub_modules/ocr/paddleocr/ppocr_273_mod.py

@@ -1,7 +1,9 @@
 import copy
+import platform
 import time
 import cv2
 import numpy as np
+import torch
 
 from paddleocr import PaddleOCR
 from ppocr.utils.logging import get_logger
@@ -9,12 +11,25 @@ from ppocr.utils.utility import alpha_to_color, binarize_img
 from tools.infer.predict_system import sorted_boxes
 from tools.infer.utility import get_rotate_crop_image, get_minarea_rect_crop
 
-from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import update_det_boxes, merge_det_boxes, check_img
+from magic_pdf.model.sub_modules.ocr.paddleocr.ocr_utils import update_det_boxes, merge_det_boxes, check_img, \
+    ONNXModelSingleton
 
 logger = get_logger()
 
 
 class ModifiedPaddleOCR(PaddleOCR):
+    def __init__(self, *args, **kwargs):
+
+        super().__init__(*args, **kwargs)
+        self.lang = kwargs.get('lang', 'ch')
+        # Use ONNX when the CPU architecture is ARM and CUDA is not available
+        if not torch.cuda.is_available() and platform.machine() in ['arm64', 'aarch64']:
+            self.use_onnx = True
+            onnx_model_manager = ONNXModelSingleton()
+            self.additional_ocr = onnx_model_manager.get_onnx_model(**kwargs)
+        else:
+            self.use_onnx = False
+
     def ocr(self,
             img,
             det=True,
@@ -79,7 +94,10 @@ class ModifiedPaddleOCR(PaddleOCR):
             ocr_res = []
             for img in imgs:
                 img = preprocess_image(img)
-                dt_boxes, elapse = self.text_detector(img)
+                if self.lang in ['ch'] and self.use_onnx:
+                    dt_boxes, elapse = self.additional_ocr.text_detector(img)
+                else:
+                    dt_boxes, elapse = self.text_detector(img)
                 if dt_boxes is None:
                     ocr_res.append(None)
                     continue
@@ -106,7 +124,10 @@ class ModifiedPaddleOCR(PaddleOCR):
                     img, cls_res_tmp, elapse = self.text_classifier(img)
                     if not rec:
                         cls_res.append(cls_res_tmp)
-                rec_res, elapse = self.text_recognizer(img)
+                if self.lang in ['ch'] and self.use_onnx:
+                    rec_res, elapse = self.additional_ocr.text_recognizer(img)
+                else:
+                    rec_res, elapse = self.text_recognizer(img)
                 ocr_res.append(rec_res)
             if not rec:
                 return cls_res
@@ -121,7 +142,10 @@ class ModifiedPaddleOCR(PaddleOCR):
 
         start = time.time()
         ori_im = img.copy()
-        dt_boxes, elapse = self.text_detector(img)
+        if self.lang in ['ch'] and self.use_onnx:
+            dt_boxes, elapse = self.additional_ocr.text_detector(img)
+        else:
+            dt_boxes, elapse = self.text_detector(img)
         time_dict['det'] = elapse
 
         if dt_boxes is None:
@@ -159,8 +183,10 @@ class ModifiedPaddleOCR(PaddleOCR):
             time_dict['cls'] = elapse
             logger.debug("cls num  : {}, elapsed : {}".format(
                 len(img_crop_list), elapse))
-
-        rec_res, elapse = self.text_recognizer(img_crop_list)
+        if self.lang in ['ch'] and self.use_onnx:
+            rec_res, elapse = self.additional_ocr.text_recognizer(img_crop_list)
+        else:
+            rec_res, elapse = self.text_recognizer(img_crop_list)
         time_dict['rec'] = elapse
         logger.debug("rec_res num  : {}, elapsed : {}".format(
             len(rec_res), elapse))
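
The ONNX path in `ModifiedPaddleOCR` is taken only when CUDA is unavailable and the machine reports an ARM architecture. The condition can be sketched as a pure function; `should_use_onnx` is a hypothetical helper, and the CUDA flag is passed in so the example does not need torch.

```python
import platform


def should_use_onnx(cuda_available, machine=None):
    # Fall back to the current machine when no architecture is given.
    if machine is None:
        machine = platform.machine()
    # Same test as the diff: no CUDA, and an ARM CPU.
    return (not cuda_available) and machine in ("arm64", "aarch64")


on_apple_silicon = should_use_onnx(False, "arm64")
on_x86_cpu = should_use_onnx(False, "x86_64")
```

In the diff, the detector and recognizer are additionally swapped only when `self.lang` is `'ch'`, so other languages keep the default Paddle models even on ARM.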

+ 42 - 7
magic_pdf/model/sub_modules/table/rapidtable/rapid_table.py

@@ -1,16 +1,51 @@
+import cv2
 import numpy as np
+import torch
+from loguru import logger
 from rapid_table import RapidTable
-from rapidocr_paddle import RapidOCR
 
 
 class RapidTableModel(object):
-    def __init__(self):
+    def __init__(self, ocr_engine):
         self.table_model = RapidTable()
-        self.ocr_engine = RapidOCR(det_use_cuda=True, cls_use_cuda=True, rec_use_cuda=True)
+        # if ocr_engine is None:
+        #     self.ocr_model_name = "RapidOCR"
+        #     if torch.cuda.is_available():
+        #         from rapidocr_paddle import RapidOCR
+        #         self.ocr_engine = RapidOCR(det_use_cuda=True, cls_use_cuda=True, rec_use_cuda=True)
+        #     else:
+        #         from rapidocr_onnxruntime import RapidOCR
+        #         self.ocr_engine = RapidOCR()
+        # else:
+        #     self.ocr_model_name = "PaddleOCR"
+        #     self.ocr_engine = ocr_engine
+
+        self.ocr_model_name = "RapidOCR"
+        if torch.cuda.is_available():
+            from rapidocr_paddle import RapidOCR
+            self.ocr_engine = RapidOCR(det_use_cuda=True, cls_use_cuda=True, rec_use_cuda=True)
+        else:
+            from rapidocr_onnxruntime import RapidOCR
+            self.ocr_engine = RapidOCR()
 
     def predict(self, image):
-        ocr_result, _ = self.ocr_engine(np.asarray(image))
-        if ocr_result is None:
+
+        if self.ocr_model_name == "RapidOCR":
+            ocr_result, _ = self.ocr_engine(np.asarray(image))
+        elif self.ocr_model_name == "PaddleOCR":
+            bgr_image = cv2.cvtColor(np.asarray(image), cv2.COLOR_RGB2BGR)
+            ocr_result = self.ocr_engine.ocr(bgr_image)[0]
+            if ocr_result:
+                ocr_result = [[item[0], item[1][0], item[1][1]] for item in ocr_result if
+                          len(item) == 2 and isinstance(item[1], tuple)]
+            else:
+                ocr_result = None
+        else:
+            logger.error("OCR model not supported")
+            ocr_result = None
+
+        if ocr_result:
+            html_code, table_cell_bboxes, elapse = self.table_model(np.asarray(image), ocr_result)
+            return html_code, table_cell_bboxes, elapse
+        else:
             return None, None, None
-        html_code, table_cell_bboxes, elapse = self.table_model(np.asarray(image), ocr_result)
-        return html_code, table_cell_bboxes, elapse
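
The `PaddleOCR` branch of `predict` reshapes each recognized line from `[box, (text, score)]` into the flat `[box, text, score]` triple that is passed to the table model. A minimal sketch on hand-written data; `to_rapid_format` is a hypothetical name, and the sample values are made up for illustration.

```python
def to_rapid_format(paddle_lines):
    # An empty or missing OCR result maps to None, as in the diff.
    if not paddle_lines:
        return None
    return [
        [item[0], item[1][0], item[1][1]]
        for item in paddle_lines
        # Skip entries that are not a (box, (text, score)) pair.
        if len(item) == 2 and isinstance(item[1], tuple)
    ]


paddle_lines = [
    [[[0, 0], [10, 0], [10, 5], [0, 5]], ("Total", 0.98)],
    [[[0, 6], [10, 6], [10, 11], [0, 11]], ("42", 0.91)],
    ["malformed"],
]
converted = to_rapid_format(paddle_lines)
```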

+ 94 - 0
magic_pdf/operators/__init__.py

@@ -0,0 +1,94 @@
+from abc import ABC, abstractmethod
+from typing import Callable
+
+from magic_pdf.data.data_reader_writer import DataWriter
+from magic_pdf.data.dataset import Dataset
+from magic_pdf.operators.pipes import PipeResult
+
+
+class InferenceResultBase(ABC):
+
+    @abstractmethod
+    def __init__(self, inference_results: list, dataset: Dataset):
+        """Initialize the inference result.
+
+        Args:
+            inference_results (list): the inference results generated by the model
+            dataset (Dataset): the dataset associated with the inference results
+        """
+        pass
+
+    @abstractmethod
+    def draw_model(self, file_path: str) -> None:
+        """Draw model inference result.
+
+        Args:
+            file_path (str): the output file path
+        """
+        pass
+
+    @abstractmethod
+    def dump_model(self, writer: DataWriter, file_path: str):
+        """Dump model inference result to file.
+
+        Args:
+            writer (DataWriter): writer handle
+            file_path (str): the location of target file
+        """
+        pass
+
+    @abstractmethod
+    def get_infer_res(self):
+        """Get the inference result.
+
+        Returns:
+            list: the inference result generated by model
+        """
+        pass
+
+    @abstractmethod
+    def apply(self, proc: Callable, *args, **kwargs):
+        """Apply a callable to the inference result.
+
+        Args:
+            proc (Callable): invoke proc as follows:
+                proc(inference_result, *args, **kwargs)
+
+        Returns:
+            Any: return the result generated by proc
+        """
+        pass
+
+    def pipe_txt_mode(
+        self,
+        imageWriter: DataWriter,
+        start_page_id=0,
+        end_page_id=None,
+        debug_mode=False,
+        lang=None,
+    ) -> PipeResult:
+        """Post-process the model inference result, extracting the text with a
+        third-party library such as `pymupdf`.
+
+        Args:
+            imageWriter (DataWriter): the image writer handle
+            start_page_id (int, optional): Defaults to 0. Index of the first page to process.
+            end_page_id (int, optional): Defaults to the last page index of the dataset. Index of the last page to process.
+            debug_mode (bool, optional): Defaults to False. Dumps more logs if enabled.
+            lang (str, optional): Defaults to None.
+
+        Returns:
+            PipeResult: the result
+        """
+        pass
+
+    @abstractmethod
+    def pipe_ocr_mode(
+        self,
+        imageWriter: DataWriter,
+        start_page_id=0,
+        end_page_id=None,
+        debug_mode=False,
+        lang=None,
+    ) -> PipeResult:
+        pass

+ 2 - 38
magic_pdf/model/operators.py → magic_pdf/operators/models.py

@@ -7,13 +7,11 @@ from magic_pdf.config.constants import PARSE_TYPE_OCR, PARSE_TYPE_TXT
 from magic_pdf.config.enums import SupportedPdfParseMethod
 from magic_pdf.data.data_reader_writer import DataWriter
 from magic_pdf.data.dataset import Dataset
-from magic_pdf.filter import classify
 from magic_pdf.libs.draw_bbox import draw_model_bbox
 from magic_pdf.libs.version import __version__
-from magic_pdf.model import InferenceResultBase
+from magic_pdf.operators.pipes import PipeResult
 from magic_pdf.pdf_parse_union_core_v2 import pdf_parse_union
-from magic_pdf.pipe.operators import PipeResult
-
+from magic_pdf.operators import InferenceResultBase
 
 class InferenceResult(InferenceResultBase):
     def __init__(self, inference_results: list, dataset: Dataset):
@@ -71,40 +69,6 @@ class InferenceResult(InferenceResultBase):
         """
         return proc(copy.deepcopy(self._infer_res), *args, **kwargs)
 
-    def pipe_auto_mode(
-        self,
-        imageWriter: DataWriter,
-        start_page_id=0,
-        end_page_id=None,
-        debug_mode=False,
-        lang=None,
-    ) -> PipeResult:
-        """Post-proc the model inference result.
-            step1: classify the dataset type
-            step2: based the result of step1, using `pipe_txt_mode` or `pipe_ocr_mode`
-
-        Args:
-            imageWriter (DataWriter): the image writer handle
-            start_page_id (int, optional): Defaults to 0. Let user select some pages He/She want to process
-            end_page_id (int, optional):  Defaults to the last page index of dataset. Let user select some pages He/She want to process
-            debug_mode (bool, optional): Defaults to False. will dump more log if enabled
-            lang (str, optional): Defaults to None.
-
-        Returns:
-            PipeResult: the result
-        """
-
-        pdf_proc_method = classify(self._dataset.data_bits())
-
-        if pdf_proc_method == SupportedPdfParseMethod.TXT:
-            return self.pipe_txt_mode(
-                imageWriter, start_page_id, end_page_id, debug_mode, lang
-            )
-        else:
-            return self.pipe_ocr_mode(
-                imageWriter, start_page_id, end_page_id, debug_mode, lang
-            )
-
     def pipe_txt_mode(
         self,
         imageWriter: DataWriter,

+ 70 - 17
magic_pdf/pipe/operators.py → magic_pdf/operators/pipes.py

@@ -1,7 +1,7 @@
+import copy
 import json
 import os
 from typing import Callable
-import copy
 
 from magic_pdf.config.make_content_config import DropMode, MakeMode
 from magic_pdf.data.data_reader_writer import DataWriter
@@ -23,12 +23,34 @@ class PipeResult:
         self._pipe_res = pipe_res
         self._dataset = dataset
 
+    def get_markdown(
+        self,
+        img_dir_or_bucket_prefix: str,
+        drop_mode=DropMode.NONE,
+        md_make_mode=MakeMode.MM_MD,
+    ) -> str:
+        """Get markdown content.
+
+        Args:
+            img_dir_or_bucket_prefix (str): The S3 bucket prefix or local directory used to store the figures
+            drop_mode (str, optional): Drop strategy applied when a page is corrupted or inappropriate. Defaults to DropMode.NONE.
+            md_make_mode (str, optional): The content type of the Markdown to generate. Defaults to MakeMode.MM_MD.
+
+        Returns:
+            str: return markdown content
+        """
+        pdf_info_list = self._pipe_res['pdf_info']
+        md_content = union_make(
+            pdf_info_list, md_make_mode, drop_mode, img_dir_or_bucket_prefix
+        )
+        return md_content
+
     def dump_md(
         self,
         writer: DataWriter,
         file_path: str,
         img_dir_or_bucket_prefix: str,
-        drop_mode=DropMode.WHOLE_PDF,
+        drop_mode=DropMode.NONE,
         md_make_mode=MakeMode.MM_MD,
     ):
         """Dump The Markdown.
@@ -37,36 +59,68 @@ class PipeResult:
             writer (DataWriter): File writer handle
             file_path (str): The file location of markdown
             img_dir_or_bucket_prefix (str): The s3 bucket prefix or local file directory which used to store the figure
-            drop_mode (str, optional): Drop strategy when some page which is corrupted or inappropriate. Defaults to DropMode.WHOLE_PDF.
+            drop_mode (str, optional): Drop strategy applied when a page is corrupted or inappropriate. Defaults to DropMode.NONE.
             md_make_mode (str, optional): The content Type of Markdown be made. Defaults to MakeMode.MM_MD.
         """
-        pdf_info_list = self._pipe_res['pdf_info']
-        md_content = union_make(
-            pdf_info_list, md_make_mode, drop_mode, img_dir_or_bucket_prefix
+
+        md_content = self.get_markdown(
+            img_dir_or_bucket_prefix, drop_mode=drop_mode, md_make_mode=md_make_mode
         )
         writer.write_string(file_path, md_content)
 
-    def dump_content_list(
-        self, writer: DataWriter, file_path: str, image_dir_or_bucket_prefix: str
-    ):
-        """Dump Content List.
+    def get_content_list(
+        self,
+        image_dir_or_bucket_prefix: str,
+        drop_mode=DropMode.NONE,
+    ) -> list:
+        """Get Content List.
 
         Args:
-            writer (DataWriter): File writer handle
-            file_path (str): The file location of content list
             image_dir_or_bucket_prefix (str): The s3 bucket prefix or local file directory which used to store the figure
+            drop_mode (str, optional): Drop strategy applied when a page is corrupted or inappropriate. Defaults to DropMode.NONE.
+
+        Returns:
+            list: the content list
         """
         pdf_info_list = self._pipe_res['pdf_info']
         content_list = union_make(
             pdf_info_list,
             MakeMode.STANDARD_FORMAT,
-            DropMode.NONE,
+            drop_mode,
             image_dir_or_bucket_prefix,
         )
+        return content_list
+
+    def dump_content_list(
+        self,
+        writer: DataWriter,
+        file_path: str,
+        image_dir_or_bucket_prefix: str,
+        drop_mode=DropMode.NONE,
+    ):
+        """Dump Content List.
+
+        Args:
+            writer (DataWriter): File writer handle
+            file_path (str): The file location of content list
+            image_dir_or_bucket_prefix (str): The S3 bucket prefix or local directory used to store the figures
+            drop_mode (str, optional): Drop strategy applied when a page is corrupted or inappropriate. Defaults to DropMode.NONE.
+        """
+        content_list = self.get_content_list(
+            image_dir_or_bucket_prefix, drop_mode=drop_mode,
+        )
         writer.write_string(
             file_path, json.dumps(content_list, ensure_ascii=False, indent=4)
         )
 
+    def get_middle_json(self) -> str:
+        """Get middle json.
+
+        Returns:
+            str: The content of middle json
+        """
+        return json.dumps(self._pipe_res, ensure_ascii=False, indent=4)
+
     def dump_middle_json(self, writer: DataWriter, file_path: str):
         """Dump the result of pipeline.
 
@@ -74,9 +128,8 @@ class PipeResult:
             writer (DataWriter): File writer handler
             file_path (str): The file location of middle json
         """
-        writer.write_string(
-            file_path, json.dumps(self._pipe_res, ensure_ascii=False, indent=4)
-        )
+        middle_json = self.get_middle_json()
+        writer.write_string(file_path, middle_json)
 
     def draw_layout(self, file_path: str) -> None:
         """Draw the layout.
@@ -123,7 +176,7 @@ class PipeResult:
         Returns:
             str: compress the pipeline result and return
         """
-        return JsonCompressor.compress_json(self.pdf_mid_data)
+        return JsonCompressor.compress_json(self._pipe_res)
 
     def apply(self, proc: Callable, *args, **kwargs):
         """Apply callable method which.

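The `PipeResult` changes above split each `dump_*` method into a `get_*` method that builds the content and a thin `dump_*` wrapper that writes it. A minimal sketch of that split for the middle JSON; `MemoryWriter` and `Result` are hypothetical stand-ins for `DataWriter` and `PipeResult`, and the sample data is made up.

```python
import json


class MemoryWriter:
    # Hypothetical in-memory stand-in for a DataWriter.
    def __init__(self):
        self.files = {}

    def write_string(self, path, data):
        self.files[path] = data


class Result:
    def __init__(self, pipe_res):
        self._pipe_res = pipe_res

    def get_middle_json(self):
        # Build the serialized content without touching any storage.
        return json.dumps(self._pipe_res, ensure_ascii=False, indent=4)

    def dump_middle_json(self, writer, file_path):
        # Writing is a thin wrapper around the getter.
        writer.write_string(file_path, self.get_middle_json())


result = Result({"pdf_info": [{"page_idx": 0}]})
writer = MemoryWriter()
result.dump_middle_json(writer, "middle.json")
```

Callers that only need the content in memory can use the getter directly and skip the writer.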
+ 0 - 0
magic_pdf/para/__init__.py


+ 0 - 22
magic_pdf/pdf_parse_by_ocr.py

@@ -1,22 +0,0 @@
-from magic_pdf.config.enums import SupportedPdfParseMethod
-from magic_pdf.data.dataset import Dataset
-from magic_pdf.pdf_parse_union_core_v2 import pdf_parse_union
-
-
-def parse_pdf_by_ocr(dataset: Dataset,
-                     model_list,
-                     imageWriter,
-                     start_page_id=0,
-                     end_page_id=None,
-                     debug_mode=False,
-                     lang=None,
-                     ):
-    return pdf_parse_union(model_list,
-                           dataset,
-                           imageWriter,
-                           SupportedPdfParseMethod.OCR,
-                           start_page_id=start_page_id,
-                           end_page_id=end_page_id,
-                           debug_mode=debug_mode,
-                           lang=lang,
-                           )

+ 0 - 23
magic_pdf/pdf_parse_by_txt.py

@@ -1,23 +0,0 @@
-from magic_pdf.config.enums import SupportedPdfParseMethod
-from magic_pdf.data.dataset import Dataset
-from magic_pdf.pdf_parse_union_core_v2 import pdf_parse_union
-
-
-def parse_pdf_by_txt(
-    dataset: Dataset,
-    model_list,
-    imageWriter,
-    start_page_id=0,
-    end_page_id=None,
-    debug_mode=False,
-    lang=None,
-):
-    return pdf_parse_union(model_list,
-                           dataset,
-                           imageWriter,
-                           SupportedPdfParseMethod.TXT,
-                           start_page_id=start_page_id,
-                           end_page_id=end_page_id,
-                           debug_mode=debug_mode,
-                           lang=lang,
-                           )

+ 68 - 17
magic_pdf/pdf_parse_union_core_v2.py

@@ -1,5 +1,6 @@
 import copy
 import os
+import re
 import statistics
 import time
 from typing import List
@@ -13,11 +14,12 @@ from magic_pdf.config.ocr_content_type import BlockType, ContentType
 from magic_pdf.data.dataset import Dataset, PageableData
 from magic_pdf.libs.boxbase import calculate_overlap_area_in_bbox1_area_ratio
 from magic_pdf.libs.clean_memory import clean_memory
-from magic_pdf.libs.config_reader import get_local_layoutreader_model_dir
+from magic_pdf.libs.config_reader import get_local_layoutreader_model_dir, get_llm_aided_config, get_device
 from magic_pdf.libs.convert_utils import dict_to_list
 from magic_pdf.libs.hash_utils import compute_md5
 from magic_pdf.libs.pdf_image_tools import cut_image_to_pil_image
 from magic_pdf.model.magic_model import MagicModel
+from magic_pdf.post_proc.llm_aided import llm_aided_formula, llm_aided_text, llm_aided_title
 
 try:
     import torchtext
@@ -28,15 +30,15 @@ except ImportError:
     pass
 
 from magic_pdf.model.sub_modules.model_init import AtomModelSingleton
-from magic_pdf.para.para_split_v3 import para_split
+from magic_pdf.post_proc.para_split_v3 import para_split
 from magic_pdf.pre_proc.construct_page_dict import ocr_construct_page_component_v2
 from magic_pdf.pre_proc.cut_image import ocr_cut_image_and_table
 from magic_pdf.pre_proc.ocr_detect_all_bboxes import ocr_prepare_bboxes_for_layout_split_v2
 from magic_pdf.pre_proc.ocr_dict_merge import fill_spans_in_blocks, fix_block_spans_v2, fix_discarded_block
-from magic_pdf.pre_proc.ocr_span_list_modify import get_qa_need_list_v2, remove_overlaps_low_confidence_spans, remove_overlaps_min_spans
+from magic_pdf.pre_proc.ocr_span_list_modify import get_qa_need_list_v2, remove_overlaps_low_confidence_spans, \
+    remove_overlaps_min_spans, check_chars_is_overlap_in_span
 
 os.environ['NO_ALBUMENTATIONS_UPDATE'] = '1'  # disable the albumentations update check
-os.environ['YOLO_VERBOSE'] = 'False'  # disable yolo logger
 
 
 def __replace_STX_ETX(text_str: str):
@@ -63,11 +65,22 @@ def __replace_0xfffd(text_str: str):
         return s
     return text_str
 
+
+# expand ligature characters
+def __replace_ligatures(text: str):
+    ligatures = {
+        'ﬁ': 'fi', 'ﬂ': 'fl', 'ﬀ': 'ff', 'ﬃ': 'ffi', 'ﬄ': 'ffl', 'ﬅ': 'ft', 'ﬆ': 'st'
+    }
+    return re.sub('|'.join(map(re.escape, ligatures.keys())), lambda m: ligatures[m.group()], text)
+
+
 def chars_to_content(span):
     # check whether the span has no chars
     if len(span['chars']) == 0:
         pass
         # span['content'] = ''
+    elif check_chars_is_overlap_in_span(span['chars']):
+        pass
     else:
         # first sort chars by the x coordinate of the center of char['bbox']
         span['chars'] = sorted(span['chars'], key=lambda x: (x['bbox'][0] + x['bbox'][2]) / 2)
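Note on `__replace_ligatures`: the helper builds one alternation pattern from the table keys and expands each match through a lookup. A minimal standalone sketch of the same technique follows; the names `LIGATURES` and `expand_ligatures` are illustrative and not part of this commit, and the table writes the keys as explicit escapes so they survive copy and paste. The `"ft"` expansion for the long-s-t ligature simply mirrors the mapping in the diff.

```python
import re

# Illustrative table (not the commit's object): each key is a single
# Unicode ligature code point, each value its plain-letter expansion.
LIGATURES = {
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb00": "ff",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "ft",  # long-s-t ligature; expansion copied from the diff
    "\ufb06": "st",
}

# Compile once; re.escape keeps the keys literal inside the alternation.
_LIGATURE_RE = re.compile("|".join(map(re.escape, LIGATURES)))


def expand_ligatures(text: str) -> str:
    """Replace every ligature code point with its expansion."""
    return _LIGATURE_RE.sub(lambda m: LIGATURES[m.group()], text)


print(expand_ligatures("\ufb01nal o\ufb03ce"))
```

Compiling the pattern at module level avoids rebuilding it on every span, which may matter when the function runs once per text span.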
@@ -78,11 +91,16 @@ def chars_to_content(span):
 
         content = ''
         for char in span['chars']:
-            # if the next char's x0 is more than one character width from the previous char's x1, insert a space between them
-            if char['bbox'][0] - span['chars'][span['chars'].index(char) - 1]['bbox'][2] > char_avg_width:
-                content += ' '
-            content += char['c']
 
+            # if the next char's x0 is more than 0.25 of a character width from the current char's x1, insert a space between them
+            char1 = char
+            char2 = span['chars'][span['chars'].index(char) + 1] if span['chars'].index(char) + 1 < len(span['chars']) else None
+            if char2 and char2['bbox'][0] - char1['bbox'][2] > char_avg_width * 0.25 and char['c'] != ' ' and char2['c'] != ' ':
+                content += f"{char['c']} "
+            else:
+                content += char['c']
+
+        content = __replace_ligatures(content)
         span['content'] = __replace_0xfffd(content)
 
     del span['chars']
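Note on the new spacing rule in `chars_to_content`: a space is emitted when the horizontal gap to the next char exceeds a quarter of the average char width and neither char is already a space. The sketch below restates that rule on plain dicts. The function name `join_chars`, the `gap_ratio` parameter, and the use of the mean bbox width as the "average char width" are assumptions made for illustration; the diff does not show how `char_avg_width` is computed.

```python
from statistics import mean


def join_chars(chars, gap_ratio=0.25):
    """Join chars left to right, adding a space where the gap is wide.

    Each char is a dict with "c" (the character) and "bbox" (x0, y0, x1, y1).
    """
    if not chars:
        return ""
    # Sort by the x coordinate of each bbox center.
    ordered = sorted(chars, key=lambda ch: (ch["bbox"][0] + ch["bbox"][2]) / 2)
    # Assumption: average width is the mean of the bbox widths.
    avg_width = mean(ch["bbox"][2] - ch["bbox"][0] for ch in ordered)

    parts = []
    # Pair every char with its right-hand neighbour (None for the last one).
    for cur, nxt in zip(ordered, ordered[1:] + [None]):
        parts.append(cur["c"])
        if nxt is None:
            continue
        gap = nxt["bbox"][0] - cur["bbox"][2]
        if gap > avg_width * gap_ratio and cur["c"] != " " and nxt["c"] != " ":
            parts.append(" ")
    return "".join(parts)


sample = [
    {"c": "c", "bbox": (30, 0, 40, 10)},
    {"c": "a", "bbox": (0, 0, 10, 10)},
    {"c": "b", "bbox": (10.5, 0, 20.5, 10)},
]
print(join_chars(sample))
```

Pairing neighbours with `zip` is a design choice of the sketch: it avoids looking each char up again with `list.index`, which returns the first equal element and scans the list on every call.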
@@ -98,6 +116,10 @@ def fill_char_in_spans(spans, all_chars):
     spans = sorted(spans, key=lambda x: x['bbox'][1])
 
     for char in all_chars:
+        # skip chars with an invalid bbox
+        x1, y1, x2, y2 = char['bbox']
+        if abs(x1 - x2) <= 0.01 or abs(y1 - y2) <= 0.01:
+            continue
         for span in spans:
             if calculate_char_in_span(char['bbox'], span['bbox'], char['c']):
                 span['chars'].append(char)
@@ -152,14 +174,16 @@ def calculate_char_in_span(char_bbox, span_bbox, char, span_height_radio=0.33):
 
 
 def txt_spans_extract_v2(pdf_page, spans, all_bboxes, all_discarded_blocks, lang):
+    # CIDs are represented as 0xfffd; ligatures are split
+    # text_blocks_raw = pdf_page.get_text('rawdict', flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_MEDIABOX_CLIP)['blocks']
 
-    text_blocks_raw = pdf_page.get_text('rawdict', flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_MEDIABOX_CLIP)['blocks']
-
+    # CIDs are represented as 0xfffd; ligatures are not split
+    text_blocks_raw = pdf_page.get_text('rawdict', flags=fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_MEDIABOX_CLIP)['blocks']
     all_pymu_chars = []
     for block in text_blocks_raw:
         for line in block['lines']:
             cosine, sine = line['dir']
-            if abs (cosine) < 0.9 or abs(sine) > 0.1:
+            if abs(cosine) < 0.9 or abs(sine) > 0.1:
                 continue
             for span in line['spans']:
                 all_pymu_chars.extend(span['chars'])
@@ -255,19 +279,23 @@ def txt_spans_extract_v2(pdf_page, spans, all_bboxes, all_discarded_blocks, lang
     return spans
 
 
-def replace_text_span(pymu_spans, ocr_spans):
-    return list(filter(lambda x: x['type'] != ContentType.Text, ocr_spans)) + pymu_spans
-
-
 def model_init(model_name: str):
     from transformers import LayoutLMv3ForTokenClassification
-
+    device = get_device()
     if torch.cuda.is_available():
         device = torch.device('cuda')
         if torch.cuda.is_bf16_supported():
             supports_bfloat16 = True
         else:
             supports_bfloat16 = False
+    elif str(device).startswith("npu"):
+        import torch_npu
+        if torch_npu.npu.is_available():
+            device = torch.device('npu')
+            supports_bfloat16 = False
+        else:
+            device = torch.device('cpu')
+            supports_bfloat16 = False
     else:
         device = torch.device('cpu')
         supports_bfloat16 = False
@@ -345,6 +373,8 @@ def cal_block_index(fix_blocks, sorted_bboxes):
         # sort using xycut
         block_bboxes = []
         for block in fix_blocks:
+            # clamp any negative value in block['bbox'] to 0
+            block['bbox'] = [max(0, x) for x in block['bbox']]
             block_bboxes.append(block['bbox'])
 
             # remove the virtual line info from image/table body blocks and backfill it from real_lines
@@ -738,6 +768,11 @@ def parse_page_core(
     """Reorder blocks"""
     sorted_blocks = sorted(fix_blocks, key=lambda b: b['index'])
 
+    """Reorder within blocks (sort multiple captions or footnotes inside image and table blocks)"""
+    for block in sorted_blocks:
+        if block['type'] in [BlockType.Image, BlockType.Table]:
+            block['blocks'] = sorted(block['blocks'], key=lambda b: b['index'])
+
     """Get the lists that QA needs externally"""
     images, tables, interline_equations = get_qa_need_list_v2(sorted_blocks)
 
@@ -819,13 +854,29 @@ def pdf_parse_union(
     """Paragraph splitting"""
     para_split(pdf_info_dict)
 
+    """LLM optimization"""
+    llm_aided_config = get_llm_aided_config()
+    if llm_aided_config is not None:
+        """Formula optimization"""
+        formula_aided_config = llm_aided_config.get('formula_aided', None)
+        if formula_aided_config is not None:
+            llm_aided_formula(pdf_info_dict, formula_aided_config)
+        """Text optimization"""
+        text_aided_config = llm_aided_config.get('text_aided', None)
+        if text_aided_config is not None:
+            llm_aided_text(pdf_info_dict, text_aided_config)
+        """Title optimization"""
+        title_aided_config = llm_aided_config.get('title_aided', None)
+        if title_aided_config is not None:
+            llm_aided_title(pdf_info_dict, title_aided_config)
+
     """Convert dict to list"""
     pdf_info_list = dict_to_list(pdf_info_dict)
     new_pdf_info_dict = {
         'pdf_info': pdf_info_list,
     }
 
-    clean_memory()
+    clean_memory(get_device())
 
     return new_pdf_info_dict
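Note on the LLM-aided step in `pdf_parse_union`: each of the three sections (`formula_aided`, `text_aided`, `title_aided`) runs only when its key is present in the config. The same gating can be expressed as a table-driven loop, sketched below. `run_llm_aided`, the `handlers` mapping, and every config value shown are hypothetical placeholders; only the three section names and the `api_key`, `base_url`, and `model` fields come from the diff.

```python
def run_llm_aided(pdf_info, llm_aided_config, handlers):
    """Call each handler whose config section exists; return the keys that ran."""
    ran = []
    if llm_aided_config is None:
        return ran
    for key, handler in handlers.items():
        section = llm_aided_config.get(key)
        if section is not None:
            handler(pdf_info, section)
            ran.append(key)
    return ran


# Placeholder config: only the title section is present, so only it runs.
config = {
    "title_aided": {
        "api_key": "PLACEHOLDER",
        "base_url": "https://example.invalid/v1",
        "model": "placeholder-model",
    }
}

calls = []
handlers = {
    "formula_aided": lambda info, cfg: calls.append("formula"),
    "text_aided": lambda info, cfg: calls.append("text"),
    "title_aided": lambda info, cfg: calls.append("title"),
}
print(run_llm_aided({}, config, handlers), calls)
```

Dict insertion order keeps the formula, text, title sequence used in the diff.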
 

+ 0 - 80
magic_pdf/pipe/OCRPipe.py

@@ -1,80 +0,0 @@
-from loguru import logger
-
-from magic_pdf.config.make_content_config import DropMode, MakeMode
-from magic_pdf.data.data_reader_writer import DataWriter
-from magic_pdf.data.dataset import Dataset
-from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-from magic_pdf.pipe.AbsPipe import AbsPipe
-from magic_pdf.user_api import parse_ocr_pdf
-
-
-class OCRPipe(AbsPipe):
-    def __init__(
-        self,
-        dataset: Dataset,
-        model_list: list,
-        image_writer: DataWriter,
-        is_debug: bool = False,
-        start_page_id=0,
-        end_page_id=None,
-        lang=None,
-        layout_model=None,
-        formula_enable=None,
-        table_enable=None,
-    ):
-        super().__init__(
-            dataset,
-            model_list,
-            image_writer,
-            is_debug,
-            start_page_id,
-            end_page_id,
-            lang,
-            layout_model,
-            formula_enable,
-            table_enable,
-        )
-
-    def pipe_classify(self):
-        pass
-
-    def pipe_analyze(self):
-        self.infer_res = doc_analyze(
-            self.dataset,
-            ocr=True,
-            start_page_id=self.start_page_id,
-            end_page_id=self.end_page_id,
-            lang=self.lang,
-            layout_model=self.layout_model,
-            formula_enable=self.formula_enable,
-            table_enable=self.table_enable,
-        )
-
-    def pipe_parse(self):
-        self.pdf_mid_data = parse_ocr_pdf(
-            self.dataset,
-            self.infer_res,
-            self.image_writer,
-            is_debug=self.is_debug,
-            start_page_id=self.start_page_id,
-            end_page_id=self.end_page_id,
-            lang=self.lang,
-            layout_model=self.layout_model,
-            formula_enable=self.formula_enable,
-            table_enable=self.table_enable,
-        )
-
-    def pipe_mk_uni_format(self, img_parent_path: str, drop_mode=DropMode.WHOLE_PDF):
-        result = super().pipe_mk_uni_format(img_parent_path, drop_mode)
-        logger.info('ocr_pipe mk content list finished')
-        return result
-
-    def pipe_mk_markdown(
-        self,
-        img_parent_path: str,
-        drop_mode=DropMode.WHOLE_PDF,
-        md_make_mode=MakeMode.MM_MD,
-    ):
-        result = super().pipe_mk_markdown(img_parent_path, drop_mode, md_make_mode)
-        logger.info(f'ocr_pipe mk {md_make_mode} finished')
-        return result

+ 0 - 42
magic_pdf/pipe/TXTPipe.py

@@ -1,42 +0,0 @@
-from loguru import logger
-
-from magic_pdf.config.make_content_config import DropMode, MakeMode
-from magic_pdf.data.data_reader_writer import DataWriter
-from magic_pdf.data.dataset import Dataset
-from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-from magic_pdf.pipe.AbsPipe import AbsPipe
-from magic_pdf.user_api import parse_txt_pdf
-
-
-class TXTPipe(AbsPipe):
-
-    def __init__(self, dataset: Dataset, model_list: list, image_writer: DataWriter, is_debug: bool = False,
-                 start_page_id=0, end_page_id=None, lang=None,
-                 layout_model=None, formula_enable=None, table_enable=None):
-        super().__init__(dataset, model_list, image_writer, is_debug, start_page_id, end_page_id, lang,
-                         layout_model, formula_enable, table_enable)
-
-    def pipe_classify(self):
-        pass
-
-    def pipe_analyze(self):
-        self.model_list = doc_analyze(self.dataset, ocr=False,
-                                      start_page_id=self.start_page_id, end_page_id=self.end_page_id,
-                                      lang=self.lang, layout_model=self.layout_model,
-                                      formula_enable=self.formula_enable, table_enable=self.table_enable)
-
-    def pipe_parse(self):
-        self.pdf_mid_data = parse_txt_pdf(self.dataset, self.model_list, self.image_writer, is_debug=self.is_debug,
-                                          start_page_id=self.start_page_id, end_page_id=self.end_page_id,
-                                          lang=self.lang, layout_model=self.layout_model,
-                                          formula_enable=self.formula_enable, table_enable=self.table_enable)
-
-    def pipe_mk_uni_format(self, img_parent_path: str, drop_mode=DropMode.WHOLE_PDF):
-        result = super().pipe_mk_uni_format(img_parent_path, drop_mode)
-        logger.info('txt_pipe mk content list finished')
-        return result
-
-    def pipe_mk_markdown(self, img_parent_path: str, drop_mode=DropMode.WHOLE_PDF, md_make_mode=MakeMode.MM_MD):
-        result = super().pipe_mk_markdown(img_parent_path, drop_mode, md_make_mode)
-        logger.info(f'txt_pipe mk {md_make_mode} finished')
-        return result

+ 0 - 150
magic_pdf/pipe/UNIPipe.py

@@ -1,150 +0,0 @@
-import json
-
-from loguru import logger
-
-from magic_pdf.config.make_content_config import DropMode, MakeMode
-from magic_pdf.data.data_reader_writer import DataWriter
-from magic_pdf.data.dataset import Dataset
-from magic_pdf.libs.commons import join_path
-from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-from magic_pdf.pipe.AbsPipe import AbsPipe
-from magic_pdf.user_api import parse_ocr_pdf, parse_union_pdf
-
-
-class UNIPipe(AbsPipe):
-
-    def __init__(
-        self,
-        dataset: Dataset,
-        jso_useful_key: dict,
-        image_writer: DataWriter,
-        is_debug: bool = False,
-        start_page_id=0,
-        end_page_id=None,
-        lang=None,
-        layout_model=None,
-        formula_enable=None,
-        table_enable=None,
-    ):
-        self.pdf_type = jso_useful_key['_pdf_type']
-        super().__init__(
-            dataset,
-            jso_useful_key['model_list'],
-            image_writer,
-            is_debug,
-            start_page_id,
-            end_page_id,
-            lang,
-            layout_model,
-            formula_enable,
-            table_enable,
-        )
-        if len(self.model_list) == 0:
-            self.input_model_is_empty = True
-        else:
-            self.input_model_is_empty = False
-
-    def pipe_classify(self):
-        self.pdf_type = AbsPipe.classify(self.pdf_bytes)
-
-    def pipe_analyze(self):
-        if self.pdf_type == self.PIP_TXT:
-            self.model_list = doc_analyze(
-                self.dataset,
-                ocr=False,
-                start_page_id=self.start_page_id,
-                end_page_id=self.end_page_id,
-                lang=self.lang,
-                layout_model=self.layout_model,
-                formula_enable=self.formula_enable,
-                table_enable=self.table_enable,
-            )
-        elif self.pdf_type == self.PIP_OCR:
-            self.model_list = doc_analyze(
-                self.dataset,
-                ocr=True,
-                start_page_id=self.start_page_id,
-                end_page_id=self.end_page_id,
-                lang=self.lang,
-                layout_model=self.layout_model,
-                formula_enable=self.formula_enable,
-                table_enable=self.table_enable,
-            )
-
-    def pipe_parse(self):
-        if self.pdf_type == self.PIP_TXT:
-            self.pdf_mid_data = parse_union_pdf(
-                self.dataset,
-                self.model_list,
-                self.image_writer,
-                is_debug=self.is_debug,
-                start_page_id=self.start_page_id,
-                end_page_id=self.end_page_id,
-                lang=self.lang,
-                layout_model=self.layout_model,
-                formula_enable=self.formula_enable,
-                table_enable=self.table_enable,
-            )
-        elif self.pdf_type == self.PIP_OCR:
-            self.pdf_mid_data = parse_ocr_pdf(
-                self.dataset,
-                self.model_list,
-                self.image_writer,
-                is_debug=self.is_debug,
-                start_page_id=self.start_page_id,
-                end_page_id=self.end_page_id,
-                lang=self.lang,
-            )
-
-    def pipe_mk_uni_format(
-        self, img_parent_path: str, drop_mode=DropMode.NONE_WITH_REASON
-    ):
-        result = super().pipe_mk_uni_format(img_parent_path, drop_mode)
-        logger.info('uni_pipe mk content list finished')
-        return result
-
-    def pipe_mk_markdown(
-        self,
-        img_parent_path: str,
-        drop_mode=DropMode.WHOLE_PDF,
-        md_make_mode=MakeMode.MM_MD,
-    ):
-        result = super().pipe_mk_markdown(img_parent_path, drop_mode, md_make_mode)
-        logger.info(f'uni_pipe mk {md_make_mode} finished')
-        return result
-
-
-if __name__ == '__main__':
-    # test
-    from magic_pdf.data.data_reader_writer import DataReader
-
-    drw = DataReader(r'D:/project/20231108code-clean')
-
-    pdf_file_path = r'linshixuqiu\19983-00.pdf'
-    model_file_path = r'linshixuqiu\19983-00.json'
-    pdf_bytes = drw.read(pdf_file_path)
-    model_json_txt = drw.read(model_file_path).decode()
-    model_list = json.loads(model_json_txt)
-    write_path = r'D:\project\20231108code-clean\linshixuqiu\19983-00'
-    img_bucket_path = 'imgs'
-    img_writer = DataWriter(join_path(write_path, img_bucket_path))
-
-    # pdf_type = UNIPipe.classify(pdf_bytes)
-    # jso_useful_key = {
-    #     "_pdf_type": pdf_type,
-    #     "model_list": model_list
-    # }
-
-    jso_useful_key = {'_pdf_type': '', 'model_list': model_list}
-    pipe = UNIPipe(pdf_bytes, jso_useful_key, img_writer)
-    pipe.pipe_classify()
-    pipe.pipe_parse()
-    md_content = pipe.pipe_mk_markdown(img_bucket_path)
-    content_list = pipe.pipe_mk_uni_format(img_bucket_path)
-
-    md_writer = DataWriter(write_path)
-    md_writer.write_string('19983-00.md', md_content)
-    md_writer.write_string(
-        '19983-00.json', json.dumps(pipe.pdf_mid_data, ensure_ascii=False, indent=4)
-    )
-    md_writer.write_string('19983-00.txt', str(content_list))

+ 0 - 0
magic_pdf/pipe/__init__.py


+ 1 - 0
magic_pdf/post_proc/__init__.py

@@ -0,0 +1 @@
+# Copyright (c) Opendatalab. All rights reserved.

+ 133 - 0
magic_pdf/post_proc/llm_aided.py

@@ -0,0 +1,133 @@
+# Copyright (c) Opendatalab. All rights reserved.
+import json
+from loguru import logger
+from magic_pdf.dict2md.ocr_mkcontent import merge_para_with_text
+from openai import OpenAI
+
+
+#@todo: some formulas end with "\", which escapes the "$" appended at the end; this also needs fixing
+formula_optimize_prompt = """请根据以下指南修正LaTeX公式的错误,确保公式能够渲染且符合原始内容:
+
+1. 修正渲染或编译错误:
+    - Some syntax errors such as mismatched/missing/extra tokens. Your task is to fix these syntax errors and make sure corrected results conform to latex math syntax principles.
+    - 包含KaTeX不支持的关键词等原因导致的无法编译或渲染的错误
+
+2. 保留原始信息:
+   - 保留原始公式中的所有重要信息
+   - 不要添加任何原始公式中没有的新信息
+
+IMPORTANT:请仅返回修正后的公式,不要包含任何介绍、解释或元数据。
+
+LaTeX recognition result:
+$FORMULA
+
+Your corrected result:
+"""
+
+text_optimize_prompt = f"""请根据以下指南修正OCR引起的错误,确保文本连贯并符合原始内容:
+
+1. 修正OCR引起的拼写错误和错误:
+   - 修正常见的OCR错误(例如,'rn' 被误读为 'm')
+   - 使用上下文和常识进行修正
+   - 只修正明显的错误,不要不必要的修改内容
+   - 不要添加额外的句号或其他不必要的标点符号
+
+2. 保持原始结构:
+   - 保留所有标题和子标题
+
+3. 保留原始内容:
+   - 保留原始文本中的所有重要信息
+   - 不要添加任何原始文本中没有的新信息
+   - 保留段落之间的换行符
+
+4. 保持连贯性:
+   - 确保内容与前文顺畅连接
+   - 适当处理在句子中间开始或结束的文本
+   
+5. 修正行内公式:
+   - 去除行内公式前后多余的空格
+   - 修正公式中的OCR错误
+   - 确保公式能够通过KaTeX渲染
+   
+6. 修正全角字符
+    - 修正全角标点符号为半角标点符号
+    - 修正全角字母为半角字母
+    - 修正全角数字为半角数字
+
+IMPORTANT:请仅返回修正后的文本,保留所有原始格式,包括换行符。不要包含任何介绍、解释或元数据。
+
+Previous context:
+
+Current chunk to process:
+
+Corrected text:
+"""
+
+def llm_aided_formula(pdf_info_dict, formula_aided_config):
+    pass
+
+def llm_aided_text(pdf_info_dict, text_aided_config):
+    pass
+
+def llm_aided_title(pdf_info_dict, title_aided_config):
+    client = OpenAI(
+        api_key=title_aided_config["api_key"],
+        base_url=title_aided_config["base_url"],
+    )
+    title_dict = {}
+    origin_title_list = []
+    i = 0
+    for page_num, page in pdf_info_dict.items():
+        blocks = page["para_blocks"]
+        for block in blocks:
+            if block["type"] == "title":
+                origin_title_list.append(block)
+                title_text = merge_para_with_text(block)
+                title_dict[f"{i}"] = title_text
+                i += 1
+    # logger.info(f"Title list: {title_dict}")
+
+    title_optimize_prompt = f"""输入的内容是一篇文档中所有标题组成的字典,请根据以下指南优化标题的结果,使结果符合正常文档的层次结构:
+
+1. 保留原始内容:
+    - 输入的字典中所有元素都是有效的,不能删除字典中的任何元素
+    - 请务必保证输出的字典中元素的数量和输入的数量一致
+
+2. 保持字典内key-value的对应关系不变
+
+3. 优化层次结构:
+    - 为每个标题元素添加适当的层次结构
+    - 标题层级应具有连续性,不能跳过某一层级
+    - 标题层级最多为4级,不要添加过多的层级
+    - 优化后的标题为一个整数,代表该标题的层级
+
+IMPORTANT: 
+请直接返回优化过的由标题层级组成的json,返回的json不需要格式化。
+
+Input title list:
+{title_dict}
+
+Corrected title list:
+"""
+
+    completion = client.chat.completions.create(
+        model=title_aided_config["model"],
+        messages=[
+            {'role': 'user', 'content': title_optimize_prompt}],
+        temperature=0.7,
+    )
+
+    json_completion = json.loads(completion.choices[0].message.content)
+
+    # logger.info(f"Title completion: {json_completion}")
+
+    # logger.info(f"len(json_completion): {len(json_completion)}, len(title_dict): {len(title_dict)}")
+    if len(json_completion) == len(title_dict):
+        try:
+            for i, origin_title_block in enumerate(origin_title_list):
+               origin_title_block["level"] = int(json_completion[str(i)])
+        except Exception as e:
+            logger.exception(e)
+    else:
+        logger.error("The number of titles in the optimized result is not equal to the number of titles in the input.")
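Note on `llm_aided_title`: the model reply is parsed with `json.loads` before the `try` block, so a reply that is not valid JSON would raise rather than be logged. One possible way to validate the reply before applying it is sketched below. `parse_title_levels` and its `max_level` parameter are hypothetical helpers, not part of this commit; the upper bound of 4 mirrors the limit stated in the title prompt.

```python
import json


def parse_title_levels(raw: str, expected_count: int, max_level: int = 4):
    """Return a list of integer title levels, or None if the reply is unusable."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # The reply must be a dict with one entry per input title.
    if not isinstance(data, dict) or len(data) != expected_count:
        return None
    levels = []
    for i in range(expected_count):
        try:
            level = int(data[str(i)])
        except (KeyError, TypeError, ValueError):
            return None
        # Reject levels outside the range the prompt asks for.
        if not 1 <= level <= max_level:
            return None
        levels.append(level)
    return levels


print(parse_title_levels('{"0": 1, "1": 2, "2": 2}', 3))
print(parse_title_levels("not json", 3))
```

Validating everything first means the title blocks are either all updated or all left alone, instead of being partially updated when an exception occurs midway through the loop.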
+

+ 0 - 0
magic_pdf/para/para_split_v3.py → magic_pdf/post_proc/para_split_v3.py


+ 8 - 0
magic_pdf/pre_proc/ocr_span_list_modify.py

@@ -33,6 +33,14 @@ def remove_overlaps_low_confidence_spans(spans):
     return spans, dropped_spans
 
 
+def check_chars_is_overlap_in_span(chars):
+    for i in range(len(chars)):
+        for j in range(i + 1, len(chars)):
+            if calculate_iou(chars[i]['bbox'], chars[j]['bbox']) > 0.9:
+                return True
+    return False
+
+
 def remove_overlaps_min_spans(spans):
     dropped_spans = []
     #  remove the smaller of the overlapping spans
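Note on `check_chars_is_overlap_in_span`: it compares every pair of chars and reports an overlap as soon as one pair has an IoU above 0.9. The sketch below shows the same pairwise test in a self-contained form. The `iou` function here is a stand-in written for illustration and is assumed, not confirmed, to behave like the project's `calculate_iou`; `any_chars_overlap` is likewise a hypothetical name.

```python
from itertools import combinations


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def any_chars_overlap(chars, threshold=0.9):
    """True if any two char bboxes overlap with IoU above the threshold."""
    return any(
        iou(p["bbox"], q["bbox"]) > threshold
        for p, q in combinations(chars, 2)
    )


stacked = [{"bbox": (0, 0, 10, 10)}, {"bbox": (0, 0, 10, 10)}]
side_by_side = [{"bbox": (0, 0, 10, 10)}, {"bbox": (10, 0, 20, 10)}]
print(any_chars_overlap(stacked), any_chars_overlap(side_by_side))
```

The check is quadratic in the number of chars per span, which is usually small; `any` stops at the first overlapping pair, matching the early return in the diff.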

+ 1 - 1
magic_pdf/pre_proc/remove_bbox_overlap.py

@@ -70,7 +70,7 @@ def _remove_overlap_between_bboxes(arr):
                     res[i] = None
                 else:
                     keeps[idx] = False
-                drop_reasons.append(drop_reasons)
+                drop_reasons.append(drop_reason)
         if keeps[idx]:
             res[idx] = v
     return res, drop_reasons

BIN
magic_pdf/resources/yolov11-langdetect/yolo_v11_ft.pt


+ 0 - 17
magic_pdf/rw/AbsReaderWriter.py

@@ -1,17 +0,0 @@
-from abc import ABC, abstractmethod
-
-
-class AbsReaderWriter(ABC):
-    MODE_TXT = "text"
-    MODE_BIN = "binary"
-    @abstractmethod
-    def read(self, path: str, mode=MODE_TXT):
-        raise NotImplementedError
-
-    @abstractmethod
-    def write(self, content: str, path: str, mode=MODE_TXT):
-        raise NotImplementedError
-
-    @abstractmethod
-    def read_offset(self, path: str, offset=0, limit=None) -> bytes:
-        raise NotImplementedError

+ 0 - 74
magic_pdf/rw/DiskReaderWriter.py

@@ -1,74 +0,0 @@
-import os
-from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
-from loguru import logger
-
-
-class DiskReaderWriter(AbsReaderWriter):
-    def __init__(self, parent_path, encoding="utf-8"):
-        self.path = parent_path
-        self.encoding = encoding
-
-    def read(self, path, mode=AbsReaderWriter.MODE_TXT):
-        if os.path.isabs(path):
-            abspath = path
-        else:
-            abspath = os.path.join(self.path, path)
-        if not os.path.exists(abspath):
-            logger.error(f"file {abspath} not exists")
-            raise Exception(f"file {abspath} no exists")
-        if mode == AbsReaderWriter.MODE_TXT:
-            with open(abspath, "r", encoding=self.encoding) as f:
-                return f.read()
-        elif mode == AbsReaderWriter.MODE_BIN:
-            with open(abspath, "rb") as f:
-                return f.read()
-        else:
-            raise ValueError("Invalid mode. Use 'text' or 'binary'.")
-
-    def write(self, content, path, mode=AbsReaderWriter.MODE_TXT):
-        if os.path.isabs(path):
-            abspath = path
-        else:
-            abspath = os.path.join(self.path, path)
-        directory_path = os.path.dirname(abspath)
-        if not os.path.exists(directory_path):
-            os.makedirs(directory_path)
-        if mode == AbsReaderWriter.MODE_TXT:
-            with open(abspath, "w", encoding=self.encoding, errors="replace") as f:
-                f.write(content)
-
-        elif mode == AbsReaderWriter.MODE_BIN:
-            with open(abspath, "wb") as f:
-                f.write(content)
-        else:
-            raise ValueError("Invalid mode. Use 'text' or 'binary'.")
-
-    def read_offset(self, path: str, offset=0, limit=None):
-        abspath = path
-        if not os.path.isabs(path):
-            abspath = os.path.join(self.path, path)
-        with open(abspath, "rb") as f:
-            f.seek(offset)
-            return f.read(limit)
-
-
-if __name__ == "__main__":
-    if 0:
-        file_path = "io/test/example.txt"
-        drw = DiskReaderWriter("D:\projects\papayfork\Magic-PDF\magic_pdf")
-
-        # write content to the file
-        drw.write(b"Hello, World!", path="io/test/example.txt", mode="binary")
-
-        # read content from the file
-        content = drw.read(path=file_path)
-        if content:
-            logger.info(f"从 {file_path} 读取的内容: {content}")
-    if 1:
-        drw = DiskReaderWriter("/opt/data/pdf/resources/test/io/")
-        content_bin = drw.read_offset("1.txt")
-        assert content_bin == b"ABCD!"
-
-        content_bin = drw.read_offset("1.txt", offset=1, limit=2)
-        assert content_bin == b"BC"
-

+ 0 - 142
magic_pdf/rw/S3ReaderWriter.py

@@ -1,142 +0,0 @@
-from magic_pdf.rw.AbsReaderWriter import AbsReaderWriter
-from magic_pdf.libs.commons import parse_bucket_key, join_path
-import boto3
-from loguru import logger
-from botocore.config import Config
-
-
-class S3ReaderWriter(AbsReaderWriter):
-    def __init__(
-        self,
-        ak: str,
-        sk: str,
-        endpoint_url: str,
-        addressing_style: str = "auto",
-        parent_path: str = "",
-    ):
-        self.client = self._get_client(ak, sk, endpoint_url, addressing_style)
-        self.path = parent_path
-
-    def _get_client(self, ak: str, sk: str, endpoint_url: str, addressing_style: str):
-        s3_client = boto3.client(
-            service_name="s3",
-            aws_access_key_id=ak,
-            aws_secret_access_key=sk,
-            endpoint_url=endpoint_url,
-            config=Config(
-                s3={"addressing_style": addressing_style},
-                retries={"max_attempts": 5, "mode": "standard"},
-            ),
-        )
-        return s3_client
-
-    def read(self, s3_relative_path, mode=AbsReaderWriter.MODE_TXT, encoding="utf-8"):
-        if s3_relative_path.startswith("s3://"):
-            s3_path = s3_relative_path
-        else:
-            s3_path = join_path(self.path, s3_relative_path)
-        bucket_name, key = parse_bucket_key(s3_path)
-        res = self.client.get_object(Bucket=bucket_name, Key=key)
-        body = res["Body"].read()
-        if mode == AbsReaderWriter.MODE_TXT:
-            data = body.decode(encoding)  # Decode bytes to text
-        elif mode == AbsReaderWriter.MODE_BIN:
-            data = body
-        else:
-            raise ValueError("Invalid mode. Use 'text' or 'binary'.")
-        return data
-
-    def write(self, content, s3_relative_path, mode=AbsReaderWriter.MODE_TXT, encoding="utf-8"):
-        if s3_relative_path.startswith("s3://"):
-            s3_path = s3_relative_path
-        else:
-            s3_path = join_path(self.path, s3_relative_path)
-        if mode == AbsReaderWriter.MODE_TXT:
-            body = content.encode(encoding)  # Encode text data as bytes
-        elif mode == AbsReaderWriter.MODE_BIN:
-            body = content
-        else:
-            raise ValueError("Invalid mode. Use 'text' or 'binary'.")
-        bucket_name, key = parse_bucket_key(s3_path)
-        self.client.put_object(Body=body, Bucket=bucket_name, Key=key)
-        logger.info(f"内容已写入 {s3_path} ")
-
-    def read_offset(self, path: str, offset=0, limit=None) -> bytes:
-        if path.startswith("s3://"):
-            s3_path = path
-        else:
-            s3_path = join_path(self.path, path)
-        bucket_name, key = parse_bucket_key(s3_path)
-
-        range_header = (
-            f"bytes={offset}-{offset+limit-1}" if limit else f"bytes={offset}-"
-        )
-        res = self.client.get_object(Bucket=bucket_name, Key=key, Range=range_header)
-        return res["Body"].read()
-
-
-if __name__ == "__main__":
-    if 0:
-        # Config the connection info
-        ak = ""
-        sk = ""
-        endpoint_url = ""
-        addressing_style = "auto"
-        bucket_name = ""
-        # Create an S3ReaderWriter object
-        s3_reader_writer = S3ReaderWriter(
-            ak, sk, endpoint_url, addressing_style, "s3://bucket_name/"
-        )
-
-        # Write text data to S3
-        text_data = "This is some text data"
-        s3_reader_writer.write(
-            text_data,
-            s3_relative_path=f"s3://{bucket_name}/ebook/test/test.json",
-            mode=AbsReaderWriter.MODE_TXT,
-        )
-
-        # Read text data from S3
-        text_data_read = s3_reader_writer.read(
-            s3_relative_path=f"s3://{bucket_name}/ebook/test/test.json", mode=AbsReaderWriter.MODE_TXT
-        )
-        logger.info(f"Read text data from S3: {text_data_read}")
-        # Write binary data to S3
-        binary_data = b"This is some binary data"
-        s3_reader_writer.write(
-            text_data,
-            s3_relative_path=f"s3://{bucket_name}/ebook/test/test.json",
-            mode=AbsReaderWriter.MODE_BIN,
-        )
-
-        # Read binary data from S3
-        binary_data_read = s3_reader_writer.read(
-            s3_relative_path=f"s3://{bucket_name}/ebook/test/test.json", mode=AbsReaderWriter.MODE_BIN
-        )
-        logger.info(f"Read binary data from S3: {binary_data_read}")
-
-        # Range Read text data from S3
-        binary_data_read = s3_reader_writer.read_offset(
-            path=f"s3://{bucket_name}/ebook/test/test.json", offset=0, limit=10
-        )
-        logger.info(f"Read binary data from S3: {binary_data_read}")
-    if 1:
-        import os
-        import json
-
-        ak = os.getenv("AK", "")
-        sk = os.getenv("SK", "")
-        endpoint_url = os.getenv("ENDPOINT", "")
-        bucket = os.getenv("S3_BUCKET", "")
-        prefix = os.getenv("S3_PREFIX", "")
-        key_basename = os.getenv("S3_KEY_BASENAME", "")
-        s3_reader_writer = S3ReaderWriter(
-            ak, sk, endpoint_url, "auto", f"s3://{bucket}/{prefix}"
-        )
-        content_bin = s3_reader_writer.read_offset(key_basename)
-        assert content_bin[:10] == b'{"track_id'
-        assert content_bin[-10:] == b'r":null}}\n'
-
-        content_bin = s3_reader_writer.read_offset(key_basename, offset=424, limit=426)
-        jso = json.dumps(content_bin.decode("utf-8"))
-        print(jso)

+ 0 - 0
magic_pdf/rw/__init__.py


+ 36 - 11
magic_pdf/tools/cli.py

@@ -1,13 +1,20 @@
 import os
-from pathlib import Path
-
+import shutil
+import tempfile
 import click
+import fitz
 from loguru import logger
+from pathlib import Path
 
 import magic_pdf.model as model_config
 from magic_pdf.data.data_reader_writer import FileBasedDataReader
 from magic_pdf.libs.version import __version__
 from magic_pdf.tools.common import do_parse, parse_pdf_methods
+from magic_pdf.utils.office_to_pdf import convert_file_to_pdf
+
+pdf_suffixes = ['.pdf']
+ms_office_suffixes = ['.ppt', '.pptx', '.doc', '.docx']
+image_suffixes = ['.png', '.jpeg', '.jpg']
 
 
 @click.command()
@@ -21,7 +28,7 @@ from magic_pdf.tools.common import do_parse, parse_pdf_methods
     'path',
     type=click.Path(exists=True),
     required=True,
-    help='local pdf filepath or directory',
+    help='local filepath or directory. support PDF, PPT, PPTX, DOC, DOCX, PNG, JPG files',
 )
 @click.option(
     '-o',
@@ -83,12 +90,27 @@ def cli(path, output_dir, method, lang, debug_able, start_page_id, end_page_id):
     model_config.__use_inside_model__ = True
     model_config.__model_mode__ = 'full'
     os.makedirs(output_dir, exist_ok=True)
+    temp_dir = tempfile.mkdtemp()
+    def read_fn(path: Path):
+        if path.suffix in ms_office_suffixes:
+            convert_file_to_pdf(str(path), temp_dir)
+            fn = os.path.join(temp_dir, f"{path.stem}.pdf")
+        elif path.suffix in image_suffixes:
+            with open(str(path), 'rb') as f:
+                bits = f.read()
+            pdf_bytes = fitz.open(stream=bits).convert_to_pdf()
+            fn = os.path.join(temp_dir, f"{path.stem}.pdf")
+            with open(fn, 'wb') as f:
+                f.write(pdf_bytes)
+        elif path.suffix in pdf_suffixes:
+            fn = str(path)
+        else:
+            raise Exception(f"Unknown file suffix: {path.suffix}")
+        
+        disk_rw = FileBasedDataReader(os.path.dirname(fn))
+        return disk_rw.read(os.path.basename(fn))
 
-    def read_fn(path):
-        disk_rw = FileBasedDataReader(os.path.dirname(path))
-        return disk_rw.read(os.path.basename(path))
-
-    def parse_doc(doc_path: str):
+    def parse_doc(doc_path: Path):
         try:
             file_name = str(Path(doc_path).stem)
             pdf_data = read_fn(doc_path)
@@ -108,10 +130,13 @@ def cli(path, output_dir, method, lang, debug_able, start_page_id, end_page_id):
             logger.exception(e)
 
     if os.path.isdir(path):
-        for doc_path in Path(path).glob('*.pdf'):
-            parse_doc(doc_path)
+        for doc_path in Path(path).glob('*'):
+            if doc_path.suffix in pdf_suffixes + image_suffixes + ms_office_suffixes:
+                parse_doc(doc_path)
     else:
-        parse_doc(path)
+        parse_doc(Path(path))
+
+    shutil.rmtree(temp_dir)
 
 
 if __name__ == '__main__':
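
Review note on the suffix checks in this hunk: `path.suffix in pdf_suffixes` is a case-sensitive comparison, so a file named `report.PDF` or `scan.JPG` would not match any of the three lists. The sketch below shows one way the dispatch could be normalized. `classify_suffix` is a hypothetical helper and is not part of this commit; the three lists are copied from the diff.

```python
from pathlib import Path

# Suffix lists as introduced in magic_pdf/tools/cli.py by this commit.
pdf_suffixes = ['.pdf']
ms_office_suffixes = ['.ppt', '.pptx', '.doc', '.docx']
image_suffixes = ['.png', '.jpeg', '.jpg']


def classify_suffix(path: Path) -> str:
    """Hypothetical helper: map a file path to 'pdf', 'office', 'image' or 'unknown'.

    The suffix is lower-cased first so that 'A.PDF' and 'a.pdf' are treated alike.
    """
    suffix = path.suffix.lower()
    if suffix in pdf_suffixes:
        return 'pdf'
    if suffix in ms_office_suffixes:
        return 'office'
    if suffix in image_suffixes:
        return 'image'
    return 'unknown'


kinds = [classify_suffix(Path(name)) for name in ("a.PDF", "deck.pptx", "scan.JPG", "notes.txt")]
```

Whether upper-case suffixes should be accepted is a design decision for the maintainers; the sketch only illustrates the difference in behaviour.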

+ 28 - 18
magic_pdf/tools/common.py

@@ -9,8 +9,9 @@ from magic_pdf.config.enums import SupportedPdfParseMethod
 from magic_pdf.config.make_content_config import DropMode, MakeMode
 from magic_pdf.data.data_reader_writer import FileBasedDataWriter
 from magic_pdf.data.dataset import PymuDocDataset
+from magic_pdf.libs.draw_bbox import draw_char_bbox
 from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-from magic_pdf.model.operators import InferenceResult
+from magic_pdf.operators.models import InferenceResult
 
 # from io import BytesIO
 # from pypdf import PdfReader, PdfWriter
@@ -83,6 +84,7 @@ def do_parse(
     f_make_md_mode=MakeMode.MM_MD,
     f_draw_model_bbox=False,
     f_draw_line_sort_bbox=False,
+    f_draw_char_bbox=False,
     start_page_id=0,
     end_page_id=None,
     lang=None,
@@ -94,9 +96,7 @@ def do_parse(
         logger.warning('debug mode is on')
         f_draw_model_bbox = True
         f_draw_line_sort_bbox = True
-
-    if lang == '':
-        lang = None
+        # f_draw_char_bbox = True
 
     pdf_bytes = convert_pdf_bytes_to_bytes_by_pymupdf(
         pdf_bytes, start_page_id, end_page_id
@@ -109,7 +109,7 @@ def do_parse(
     )
     image_dir = str(os.path.basename(local_image_dir))
 
-    ds = PymuDocDataset(pdf_bytes)
+    ds = PymuDocDataset(pdf_bytes, lang=lang)
 
     if len(model_list) == 0:
         if model_config.__use_inside_model__:
@@ -118,50 +118,50 @@ def do_parse(
                     infer_result = ds.apply(
                         doc_analyze,
                         ocr=False,
-                        lang=lang,
+                        lang=ds._lang,
                         layout_model=layout_model,
                         formula_enable=formula_enable,
                         table_enable=table_enable,
                     )
                     pipe_result = infer_result.pipe_txt_mode(
-                        image_writer, debug_mode=True, lang=lang
+                        image_writer, debug_mode=True, lang=ds._lang
                     )
                 else:
                     infer_result = ds.apply(
                         doc_analyze,
                         ocr=True,
-                        lang=lang,
+                        lang=ds._lang,
                         layout_model=layout_model,
                         formula_enable=formula_enable,
                         table_enable=table_enable,
                     )
                     pipe_result = infer_result.pipe_ocr_mode(
-                        image_writer, debug_mode=True, lang=lang
+                        image_writer, debug_mode=True, lang=ds._lang
                     )
 
             elif parse_method == 'txt':
                 infer_result = ds.apply(
                     doc_analyze,
                     ocr=False,
-                    lang=lang,
+                    lang=ds._lang,
                     layout_model=layout_model,
                     formula_enable=formula_enable,
                     table_enable=table_enable,
                 )
                 pipe_result = infer_result.pipe_txt_mode(
-                    image_writer, debug_mode=True, lang=lang
+                    image_writer, debug_mode=True, lang=ds._lang
                 )
             elif parse_method == 'ocr':
                 infer_result = ds.apply(
                     doc_analyze,
                     ocr=True,
-                    lang=lang,
+                    lang=ds._lang,
                     layout_model=layout_model,
                     formula_enable=formula_enable,
                     table_enable=table_enable,
                 )
                 pipe_result = infer_result.pipe_ocr_mode(
-                    image_writer, debug_mode=True, lang=lang
+                    image_writer, debug_mode=True, lang=ds._lang
                 )
             else:
                 logger.error('unknown parse method')
@@ -170,19 +170,26 @@ def do_parse(
             logger.error('need model list input')
             exit(2)
     else:
+
         infer_result = InferenceResult(model_list, ds)
         if parse_method == 'ocr':
             pipe_result = infer_result.pipe_ocr_mode(
-                image_writer, debug_mode=True, lang=lang
+                image_writer, debug_mode=True, lang=ds._lang
             )
         elif parse_method == 'txt':
             pipe_result = infer_result.pipe_txt_mode(
-                image_writer, debug_mode=True, lang=lang
+                image_writer, debug_mode=True, lang=ds._lang
             )
         else:
-            pipe_result = infer_result.pipe_auto_mode(
-                image_writer, debug_mode=True, lang=lang
-            )
+            if ds.classify() == SupportedPdfParseMethod.TXT:
+                pipe_result = infer_result.pipe_txt_mode(
+                        image_writer, debug_mode=True, lang=ds._lang
+                    )
+            else:
+                pipe_result = infer_result.pipe_ocr_mode(
+                        image_writer, debug_mode=True, lang=ds._lang
+                    )
+
 
     if f_draw_model_bbox:
         infer_result.draw_model(
@@ -201,6 +208,9 @@ def do_parse(
             os.path.join(local_md_dir, f'{pdf_file_name}_line_sort.pdf')
         )
 
+    if f_draw_char_bbox:
+        draw_char_bbox(pdf_bytes, local_md_dir, f'{pdf_file_name}_char_bbox.pdf')
+
     if f_dump_md:
         pipe_result.dump_md(
             md_writer,

+ 0 - 144
magic_pdf/user_api.py

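@@ -1,144 +0,0 @@
-"""User input: a model array in which each element represents one page, the S3 path of the PDF, and the S3 location where screenshots are saved.
-
-Then:
-    1) Based on the S3 path, call the Spark cluster API to get ak, sk and endpoint, and construct the s3PDFReader.
-    2) Based on the S3 address supplied by the user, call the Spark cluster API to get ak, sk and endpoint, and construct the s3ImageWriter.
-
-Everything else, such as constructing s3cli and obtaining ak and sk, is implemented in code-clean. Do not introduce reverse dependencies!!!
-"""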
-
-from loguru import logger
-
-from magic_pdf.data.data_reader_writer import DataWriter
-from magic_pdf.data.dataset import Dataset
-from magic_pdf.libs.version import __version__
-from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-from magic_pdf.pdf_parse_by_ocr import parse_pdf_by_ocr
-from magic_pdf.pdf_parse_by_txt import parse_pdf_by_txt
-from magic_pdf.config.constants import PARSE_TYPE_TXT, PARSE_TYPE_OCR
-
-
-def parse_txt_pdf(
-    dataset: Dataset,
-    model_list: list,
-    imageWriter: DataWriter,
-    is_debug=False,
-    start_page_id=0,
-    end_page_id=None,
-    lang=None,
-    *args,
-    **kwargs
-):
-    """Parse a text-based PDF."""
-    pdf_info_dict = parse_pdf_by_txt(
-        dataset,
-        model_list,
-        imageWriter,
-        start_page_id=start_page_id,
-        end_page_id=end_page_id,
-        debug_mode=is_debug,
-        lang=lang,
-    )
-
-    pdf_info_dict['_parse_type'] = PARSE_TYPE_TXT
-
-    pdf_info_dict['_version_name'] = __version__
-
-    if lang is not None:
-        pdf_info_dict['_lang'] = lang
-
-    return pdf_info_dict
-
-
-def parse_ocr_pdf(
-    dataset: Dataset,
-    model_list: list,
-    imageWriter: DataWriter,
-    is_debug=False,
-    start_page_id=0,
-    end_page_id=None,
-    lang=None,
-    *args,
-    **kwargs
-):
-    """Parse an OCR-based PDF."""
-    pdf_info_dict = parse_pdf_by_ocr(
-        dataset,
-        model_list,
-        imageWriter,
-        start_page_id=start_page_id,
-        end_page_id=end_page_id,
-        debug_mode=is_debug,
-        lang=lang,
-    )
-
-    pdf_info_dict['_parse_type'] = PARSE_TYPE_OCR
-
-    pdf_info_dict['_version_name'] = __version__
-
-    if lang is not None:
-        pdf_info_dict['_lang'] = lang
-
-    return pdf_info_dict
-
-
-def parse_union_pdf(
-    dataset: Dataset,
-    model_list: list,
-    imageWriter: DataWriter,
-    is_debug=False,
-    start_page_id=0,
-    end_page_id=None,
-    lang=None,
-    *args,
-    **kwargs
-):
-    """Parse a PDF that mixes OCR and text content, extracting everything."""
-
-    def parse_pdf(method):
-        try:
-            return method(
-                dataset,
-                model_list,
-                imageWriter,
-                start_page_id=start_page_id,
-                end_page_id=end_page_id,
-                debug_mode=is_debug,
-                lang=lang,
-            )
-        except Exception as e:
-            logger.exception(e)
-            return None
-
-    pdf_info_dict = parse_pdf(parse_pdf_by_txt)
-    if pdf_info_dict is None or pdf_info_dict.get('_need_drop', False):
-        logger.warning('parse_pdf_by_txt drop or error, switch to parse_pdf_by_ocr')
-        if len(model_list) == 0:
-            layout_model = kwargs.get('layout_model', None)
-            formula_enable = kwargs.get('formula_enable', None)
-            table_enable = kwargs.get('table_enable', None)
-            infer_res = doc_analyze(
-                dataset,
-                ocr=True,
-                start_page_id=start_page_id,
-                end_page_id=end_page_id,
-                lang=lang,
-                layout_model=layout_model,
-                formula_enable=formula_enable,
-                table_enable=table_enable,
-            )
-            model_list = infer_res.get_infer_res()
-        pdf_info_dict = parse_pdf(parse_pdf_by_ocr)
-        if pdf_info_dict is None:
-            raise Exception('Both parse_pdf_by_txt and parse_pdf_by_ocr failed.')
-        else:
-            pdf_info_dict['_parse_type'] = PARSE_TYPE_OCR
-    else:
-        pdf_info_dict['_parse_type'] = PARSE_TYPE_TXT
-
-    pdf_info_dict['_version_name'] = __version__
-
-    if lang is not None:
-        pdf_info_dict['_lang'] = lang
-
-    return pdf_info_dict

+ 29 - 0
magic_pdf/utils/office_to_pdf.py

@@ -0,0 +1,29 @@
+import os
+import subprocess
+from pathlib import Path
+
+
+class ConvertToPdfError(Exception):
+    def __init__(self, msg):
+        self.msg = msg
+        super().__init__(self.msg)
+
+
+def convert_file_to_pdf(input_path, output_dir):
+    if not os.path.isfile(input_path):
+        raise FileNotFoundError(f"The input file {input_path} does not exist.")
+
+    os.makedirs(output_dir, exist_ok=True)
+    
+    cmd = [
+        'soffice',
+        '--headless',
+        '--convert-to', 'pdf',
+        '--outdir', str(output_dir),
+        str(input_path)
+    ]
+    
+    process = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
+    
+    if process.returncode != 0:
+        raise ConvertToPdfError(process.stderr.decode())
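
Review note: a minimal sketch of how the command assembled in `convert_file_to_pdf` relates to the output file that `read_fn` in `cli.py` later opens. It only builds the argument list and the expected output path and does not invoke LibreOffice. `build_soffice_cmd` and `expected_pdf_path` are hypothetical helper names used for illustration, not functions from this commit.

```python
import os
from pathlib import Path


def build_soffice_cmd(input_path, output_dir):
    """Return the same argument list that convert_file_to_pdf passes to subprocess.run."""
    return [
        'soffice',
        '--headless',
        '--convert-to', 'pdf',
        '--outdir', str(output_dir),
        str(input_path),
    ]


def expected_pdf_path(input_path, output_dir):
    """Path where cli.py's read_fn looks for the result: <output_dir>/<stem>.pdf."""
    return os.path.join(str(output_dir), f"{Path(input_path).stem}.pdf")


cmd = build_soffice_cmd("slides/report.pptx", "/tmp/out")
pdf_path = expected_pdf_path("slides/report.pptx", "/tmp/out")
```

Since the conversion depends on a `soffice` executable being on `PATH`, a caller may want to check `shutil.which('soffice')` before converting.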

File diff suppressed because it is too large
+ 0 - 16
next_docs/README.md


File diff suppressed because it is too large
+ 0 - 16
next_docs/README_zh-CN.md


BIN
next_docs/en/_static/image/inference_result.png


+ 6 - 3
next_docs/en/additional_notes/glossary.rst

@@ -4,8 +4,11 @@ Glossary
 ===========
 
 1. jsonl 
-    TODO: add description
+    Newline-delimited (\n) text in which each line must be a valid, independent JSON object.
+    Currently, all functions shipped with **MinerU** assume that each JSON object contains a field named either **path** or **file_location**.
+
+
+2. magic-pdf.json 
+    TODO
 
-2. magic-pdf.json
-    TODO: add description
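
Review note: the jsonl glossary entry in this hunk can be illustrated with a short sketch. It reads newline-delimited JSON and pulls the file location from either the **path** or the **file_location** field. `parse_jsonl` and the sample locations are hypothetical and are not MinerU's own reader.

```python
import json


def parse_jsonl(text):
    """Hypothetical helper: return the file location found on each non-empty line."""
    locations = []
    for line_no, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # tolerate a trailing newline or blank lines
        obj = json.loads(line)  # each line must be valid JSON on its own
        if not isinstance(obj, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        location = obj.get("path") or obj.get("file_location")
        if location is None:
            raise ValueError(f"line {line_no}: expected a 'path' or 'file_location' field")
        locations.append(location)
    return locations


sample = '{"path": "s3://bucket_1/a.pdf"}\n{"file_location": "/data/b.pdf"}\n'
locations = parse_jsonl(sample)
```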
 

+ 1 - 1
next_docs/en/api/model_operators.rst

@@ -2,7 +2,7 @@
 Model Api
 ==========
 
-.. autoclass:: magic_pdf.model.InferenceResultBase
+.. autoclass:: magic_pdf.operators.InferenceResultBase
    :members:
    :inherited-members:
    :show-inheritance:

+ 2 - 2
next_docs/en/api/pipe_operators.rst

@@ -3,7 +3,7 @@
 Pipeline Api
 =============
 
-.. autoclass:: magic_pdf.pipe.operators.PipeResult
+.. autoclass:: magic_pdf.operators.pipes.PipeResult
    :members:
    :inherited-members:
-   :show-inheritance:
+   :show-inheritance:

+ 6 - 0
next_docs/en/index.rst

@@ -70,6 +70,12 @@ Key Features
 -  Supports both CPU and GPU environments.
 -  Compatible with Windows, Linux, and Mac platforms.
 
+
+.. tip::
+
+   Get started with MinerU by trying the `online demo <https://www.modelscope.cn/studios/OpenDataLab/MinerU>`_ or :doc:`installing it locally <user_guide/install/install>`.
+
+
 User Guide
 -------------
 .. toctree::

+ 3 - 1
next_docs/en/user_guide.rst

@@ -4,7 +4,9 @@
     :maxdepth: 2
 
     user_guide/install
+    user_guide/usage
     user_guide/quick_start
     user_guide/tutorial
     user_guide/data
-    
+    user_guide/inference_result
+    user_guide/pipe_result

+ 90 - 62
next_docs/en/user_guide/data/data_reader_writer.rst

@@ -87,56 +87,70 @@ Read Examples
 
 .. code:: python
 
+    import os 
     from magic_pdf.data.data_reader_writer import *
+    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
+    from magic_pdf.data.schemas import S3Config
 
-    # file based related 
+    # file based related
     file_based_reader1 = FileBasedDataReader('')
 
-    ## will read file abc 
-    file_based_reader1.read('abc') 
+    ## will read file abc
+    file_based_reader1.read('abc')
 
     file_based_reader2 = FileBasedDataReader('/tmp')
 
     ## will read /tmp/abc
     file_based_reader2.read('abc')
 
-    ## will read /var/logs/message.txt
-    file_based_reader2.read('/var/logs/message.txt')
+    ## will read /tmp/logs/message.txt
+    file_based_reader2.read('/tmp/logs/message.txt')
 
     # multi bucket s3 releated
-    multi_bucket_s3_reader1 = MultiBucketS3DataReader("test_bucket1/test_prefix", list[S3Config(
-            bucket_name=test_bucket1, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
+    bucket = "bucket"               # replace with real bucket
+    ak = "ak"                       # replace with real access key
+    sk = "sk"                       # replace with real secret key
+    endpoint_url = "endpoint_url"   # replace with real endpoint_url
+
+    bucket_2 = "bucket_2"               # replace with real bucket
+    ak_2 = "ak_2"                       # replace with real access key
+    sk_2 = "sk_2"                       # replace with real secret key 
+    endpoint_url_2 = "endpoint_url_2"   # replace with real endpoint_url
+
+    test_prefix = 'test/unittest'
+    multi_bucket_s3_reader1 = MultiBucketS3DataReader(f"{bucket}/{test_prefix}", [S3Config(
+            bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
         ),
         S3Config(
-            bucket_name=test_bucket_2,
+            bucket_name=bucket_2,
             access_key=ak_2,
             secret_key=sk_2,
             endpoint_url=endpoint_url_2,
         )])
-    
-    ## will read s3://test_bucket1/test_prefix/abc
+
+    ## will read s3://{bucket}/{test_prefix}/abc
     multi_bucket_s3_reader1.read('abc')
 
-    ## will read s3://test_bucket1/efg
-    multi_bucket_s3_reader1.read('s3://test_bucket1/efg')
+    ## will read s3://{bucket}/{test_prefix}/efg
+    multi_bucket_s3_reader1.read(f's3://{bucket}/{test_prefix}/efg')
 
-    ## will read s3://test_bucket2/abc
-    multi_bucket_s3_reader1.read('s3://test_bucket2/abc')
+    ## will read s3://{bucket_2}/{test_prefix}/abc
+    multi_bucket_s3_reader1.read(f's3://{bucket_2}/{test_prefix}/abc')
 
     # s3 related
     s3_reader1 = S3DataReader(
-        default_prefix_without_bucket = "test_prefix"
-        bucket: "test_bucket",
-        ak: "ak",
-        sk: "sk",
-        endpoint_url: "localhost"
+        test_prefix,
+        bucket,
+        ak,
+        sk,
+        endpoint_url
     )
 
-    ## will read s3://test_bucket/test_prefix/abc 
+    ## will read s3://{bucket}/{test_prefix}/abc
     s3_reader1.read('abc')
-   
-    ## will read s3://test_bucket/efg
-    s3_reader1.read('s3://test_bucket/efg')
+
+    ## will read s3://{bucket}/efg
+    s3_reader1.read(f's3://{bucket}/efg')
 
 
 Write Examples
@@ -144,65 +158,79 @@ Write Examples
 
 .. code:: python
 
+    import os
     from magic_pdf.data.data_reader_writer import *
+    from magic_pdf.data.data_reader_writer import MultiBucketS3DataWriter
+    from magic_pdf.data.schemas import S3Config
 
-    # file based related 
-    file_based_writer1 = FileBasedDataWriter('')
+    # file based related
+    file_based_writer1 = FileBasedDataWriter("")
 
     ## will write 123 to abc
-    file_based_writer1.write('abc', '123'.encode()) 
+    file_based_writer1.write("abc", "123".encode())
 
     ## will write 123 to abc
-    file_based_writer1.write_string('abc', '123') 
+    file_based_writer1.write_string("abc", "123")
 
-    file_based_writer2 = FileBasedDataWriter('/tmp')
+    file_based_writer2 = FileBasedDataWriter("/tmp")
 
     ## will write 123 to /tmp/abc
-    file_based_writer2.write_string('abc', '123')
+    file_based_writer2.write_string("abc", "123")
 
-    ## will write 123 to /var/logs/message.txt
-    file_based_writer2.write_string('/var/logs/message.txt', '123')
+    ## will write 123 to /tmp/logs/message.txt
+    file_based_writer2.write_string("/tmp/logs/message.txt", "123")
 
     # multi bucket s3 releated
-    multi_bucket_s3_writer1 = MultiBucketS3DataWriter("test_bucket1/test_prefix", list[S3Config(
-            bucket_name=test_bucket1, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
-        ),
-        S3Config(
-            bucket_name=test_bucket_2,
-            access_key=ak_2,
-            secret_key=sk_2,
-            endpoint_url=endpoint_url_2,
-        )])
-    
-    ## will write 123 to s3://test_bucket1/test_prefix/abc
-    multi_bucket_s3_writer1.write_string('abc', '123')
+    bucket = "bucket"               # replace with real bucket
+    ak = "ak"                       # replace with real access key
+    sk = "sk"                       # replace with real secret key
+    endpoint_url = "endpoint_url"   # replace with real endpoint_url
+
+    bucket_2 = "bucket_2"               # replace with real bucket
+    ak_2 = "ak_2"                       # replace with real access key
+    sk_2 = "sk_2"                       # replace with real secret key 
+    endpoint_url_2 = "endpoint_url_2"   # replace with real endpoint_url
+
+    test_prefix = "test/unittest"
+    multi_bucket_s3_writer1 = MultiBucketS3DataWriter(
+        f"{bucket}/{test_prefix}",
+        [
+            S3Config(
+                bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
+            ),
+            S3Config(
+                bucket_name=bucket_2,
+                access_key=ak_2,
+                secret_key=sk_2,
+                endpoint_url=endpoint_url_2,
+            ),
+        ],
+    )
 
-    ## will write 123 to s3://test_bucket1/test_prefix/abc
-    multi_bucket_s3_writer1.write('abc', '123'.encode())
+    ## will write 123 to s3://{bucket}/{test_prefix}/abc
+    multi_bucket_s3_writer1.write_string("abc", "123")
 
-    ## will write 123 to s3://test_bucket1/efg
-    multi_bucket_s3_writer1.write('s3://test_bucket1/efg', '123'.encode())
+    ## will write 123 to s3://{bucket}/{test_prefix}/abc
+    multi_bucket_s3_writer1.write("abc", "123".encode())
 
-    ## will write 123 to s3://test_bucket2/abc
-    multi_bucket_s3_writer1.write('s3://test_bucket2/abc', '123'.encode())
+    ## will write 123 to s3://{bucket}/{test_prefix}/efg
+    multi_bucket_s3_writer1.write(f"s3://{bucket}/{test_prefix}/efg", "123".encode())
+
+    ## will write 123 to s3://{bucket_2}/{test_prefix}/abc
+    multi_bucket_s3_writer1.write(f's3://{bucket_2}/{test_prefix}/abc', '123'.encode())
 
     # s3 related
-    s3_writer1 = S3DataWriter(
-        default_prefix_without_bucket = "test_prefix"
-        bucket: "test_bucket",
-        ak: "ak",
-        sk: "sk",
-        endpoint_url: "localhost"
-    )
+    s3_writer1 = S3DataWriter(test_prefix, bucket, ak, sk, endpoint_url)
+
+    ## will write 123 to s3://{bucket}/{test_prefix}/abc
+    s3_writer1.write("abc", "123".encode())
 
-    ## will write 123 to s3://test_bucket/test_prefix/abc 
-    s3_writer1.write('abc', '123'.encode())
+    ## will write 123 to s3://{bucket}/{test_prefix}/abc
+    s3_writer1.write_string("abc", "123")
 
-    ## will write 123 to s3://test_bucket/test_prefix/abc 
-    s3_writer1.write_string('abc', '123')
+    ## will write 123 to s3://{bucket}/efg
+    s3_writer1.write(f"s3://{bucket}/efg", "123".encode())
 
-    ## will write 123 to s3://test_bucket/efg
-    s3_writer1.write('s3://test_bucket/efg', '123'.encode())
 
 
 Check :doc:`../../api/data_reader_writer` for more details

+ 52 - 9
next_docs/en/user_guide/data/read_api.rst

@@ -18,24 +18,50 @@ Read the contet from jsonl which may located on local machine or remote s3. if y
 
 .. code:: python
 
-    from magic_pdf.data.io.read_api import *
+    from magic_pdf.data.read_api import *
+    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
+    from magic_pdf.data.schemas import S3Config
 
-    # read jsonl from local machine 
-    datasets = read_jsonl("tt.jsonl", None)
+    # read jsonl from local machine
+    datasets = read_jsonl("tt.jsonl", None)   # replace with real jsonl file
 
     # read jsonl from remote s3
-    datasets = read_jsonl("s3://bucket_1/tt.jsonl", s3_reader)
 
+    bucket = "bucket_1"                     # replace with real s3 bucket
+    ak = "access_key_1"                     # replace with real s3 access key
+    sk = "secret_key_1"                     # replace with real s3 secret key
+    endpoint_url = "endpoint_url_1"         # replace with real s3 endpoint url
+
+    bucket_2 = "bucket_2"                   # replace with real s3 bucket
+    ak_2 = "access_key_2"                   # replace with real s3 access key
+    sk_2 = "secret_key_2"                   # replace with real s3 secret key
+    endpoint_url_2 = "endpoint_url_2"       # replace with real s3 endpoint url
+
+    s3configs = [
+        S3Config(
+            bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
+        ),
+        S3Config(
+            bucket_name=bucket_2,
+            access_key=ak_2,
+            secret_key=sk_2,
+            endpoint_url=endpoint_url_2,
+        ),
+    ]
+
+    s3_reader = MultiBucketS3DataReader(bucket, s3configs)
+
+    datasets = read_jsonl(f"s3://bucket_1/tt.jsonl", s3_reader)  # replace with real s3 jsonl file
 
 read_local_pdfs
-^^^^^^^^^^^^^^^^
+^^^^^^^^^^^^^^^^^
 
 Read pdf from path or directory.
 
 
 .. code:: python
 
-    from magic_pdf.data.io.read_api import *
+    from magic_pdf.data.read_api import *
 
     # read pdf path
     datasets = read_local_pdfs("tt.pdf")
@@ -51,13 +77,30 @@ Read images from path or directory
 
 .. code:: python 
 
-    from magic_pdf.data.io.read_api import *
+    from magic_pdf.data.read_api import *
+
+    # read from image path 
+    datasets = read_local_images("tt.png")  # replace with real file path
+
+    # read files from directory that endswith suffix in suffixes array 
+    datasets = read_local_images("images/", suffixes=[".png", ".jpg"])  # replace with real directory 
+
+
+read_local_office
+^^^^^^^^^^^^^^^^^^^^
+Read MS-Office files from path or directory
+
+.. code:: python 
+
+    from magic_pdf.data.read_api import *
 
     # read from image path 
-    datasets = read_local_images("tt.png")
+    datasets = read_local_office("tt.doc")  # replace with real file path
 
     # read files from directory that endswith suffix in suffixes array 
-    datasets = read_local_images("images/", suffixes=["png", "jpg"])
+    datasets = read_local_office("docs/")  # replace with real directory 
+
+
 
 
 Check :doc:`../../api/read_api` for more details

+ 144 - 0
next_docs/en/user_guide/inference_result.rst

@@ -0,0 +1,144 @@
+
+Inference Result
+==================
+
+.. admonition:: Tip
+    :class: tip
+
+    Please first navigate to :doc:`tutorial/pipeline` to get an initial understanding of how the pipeline works; this will help in understanding the content of this section.
+
+The **InferenceResult** class is a container for model inference results and implements a series of methods that operate on them, such as draw_model and dump_model.
+Check out :doc:`../api/model_operators` for more details about **InferenceResult**.
+
+
+Model Inference Result
+-----------------------
+
+Structure Definition
+^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    from pydantic import BaseModel, Field
+    from enum import IntEnum
+
+    class CategoryType(IntEnum):
+            title = 0               # Title
+            plain_text = 1          # Text
+            abandon = 2             # Includes headers, footers, page numbers, and page annotations
+            figure = 3              # Image
+            figure_caption = 4      # Image description
+            table = 5               # Table
+            table_caption = 6       # Table description
+            table_footnote = 7      # Table footnote
+            isolate_formula = 8     # Block formula
+            formula_caption = 9     # Formula label
+
+            embedding = 13          # Inline formula
+            isolated = 14           # Block formula
+            text = 15               # OCR recognition result
+
+
+    class PageInfo(BaseModel):
+        page_no: int = Field(description="Page number, the first page is 0", ge=0)
+        height: int = Field(description="Page height", gt=0)
+        width: int = Field(description="Page width", ge=0)
+
+    class ObjectInferenceResult(BaseModel):
+        category_id: CategoryType = Field(description="Category", ge=0)
+        poly: list[float] = Field(description="Quadrilateral coordinates, representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively")
+        score: float = Field(description="Confidence of the inference result")
+        latex: str | None = Field(description="LaTeX parsing result", default=None)
+        html: str | None = Field(description="HTML parsing result", default=None)
+
+    class PageInferenceResults(BaseModel):
+            layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results", ge=0)
+            page_info: PageInfo = Field(description="Page metadata")
+
+
+Example
+^^^^^^^^^^^
+
+.. code:: json
+
+    [
+        {
+            "layout_dets": [
+                {
+                    "category_id": 2,
+                    "poly": [
+                        99.1906967163086,
+                        100.3119125366211,
+                        730.3707885742188,
+                        100.3119125366211,
+                        730.3707885742188,
+                        245.81326293945312,
+                        99.1906967163086,
+                        245.81326293945312
+                    ],
+                    "score": 0.9999997615814209
+                }
+            ],
+            "page_info": {
+                "page_no": 0,
+                "height": 2339,
+                "width": 1654
+            }
+        },
+        {
+            "layout_dets": [
+                {
+                    "category_id": 5,
+                    "poly": [
+                        99.13092803955078,
+                        2210.680419921875,
+                        497.3183898925781,
+                        2210.680419921875,
+                        497.3183898925781,
+                        2264.78076171875,
+                        99.13092803955078,
+                        2264.78076171875
+                    ],
+                    "score": 0.9999997019767761
+                }
+            ],
+            "page_info": {
+                "page_no": 1,
+                "height": 2339,
+                "width": 1654
+            }
+        }
+    ]
+
+The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3],
+representing the coordinates of the top-left, top-right, bottom-right,
+and bottom-left points respectively. |Poly Coordinate Diagram|
+
+
+
+Inference Result
+-------------------------
+
+
+.. code:: python
+
+    from magic_pdf.operators.models import InferenceResult
+    from magic_pdf.data.dataset import Dataset
+
+    dataset : Dataset = some_data_set    # not real dataset
+
+    # The inference results of all pages, ordered by page number, are stored in a list as the inference results of MinerU
+    model_inference_result: list[PageInferenceResults] = []
+
+    Inference_result = InferenceResult(model_inference_result, dataset)
+
+
+
+some_model.pdf
+^^^^^^^^^^^^^^^^^^^^
+
+.. figure:: ../_static/image/inference_result.png
+
+
+
+.. |Poly Coordinate Diagram| image:: ../_static/image/poly.png
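
Review note: the `poly` layout documented in this file, `[x0, y0, x1, y1, x2, y2, x3, y3]`, can be reduced to an axis-aligned bounding box with a few lines. This is a sketch for illustration only; `poly_to_bbox` is a hypothetical helper, not a MinerU API, and the sample values are rounded from the example JSON.

```python
def poly_to_bbox(poly):
    """Hypothetical helper: convert an 8-value poly into (x_min, y_min, x_max, y_max)."""
    if len(poly) != 8:
        raise ValueError("poly must contain 8 values: four (x, y) corner points")
    xs = poly[0::2]  # x0, x1, x2, x3
    ys = poly[1::2]  # y0, y1, y2, y3
    return (min(xs), min(ys), max(xs), max(ys))


# Corners in the documented order: top-left, top-right, bottom-right, bottom-left.
poly = [99.0, 100.0, 730.0, 100.0, 730.0, 245.0, 99.0, 245.0]
bbox = poly_to_bbox(poly)
```

Taking the minimum and maximum over all four corners keeps the result valid even if a detector emits a slightly skewed quadrilateral.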

+ 1 - 1
next_docs/en/user_guide/install.rst

@@ -8,5 +8,5 @@ Installation
    install/install
    install//boost_with_cuda
    install/download_model_weight_files
-
+   install/config
 

+ 0 - 18
next_docs/en/user_guide/install/boost_with_cuda.rst

@@ -9,25 +9,7 @@ appropriate guide based on your system:
 
 -  :ref:`ubuntu_22_04_lts_section`
 -  :ref:`windows_10_or_11_section`
--  Quick Deployment with Docker
 
-.. admonition:: Important
-   :class: tip
-
-   Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
-
-   Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker. 
-
-   .. code-block:: bash
-
-      bash  docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
-
-.. code:: sh
-
-   wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
-   docker build -t mineru:latest .
-   docker run --rm -it --gpus=all mineru:latest /bin/bash
-   magic-pdf --help
 
 .. _ubuntu_22_04_lts_section:
 

+ 168 - 0
next_docs/en/user_guide/install/config.rst

@@ -0,0 +1,168 @@
+
+
+Config
+=========
+
+The **magic-pdf.json** file is typically located in the **${HOME}** directory on Linux or in the **C:\Users\{username}** directory on Windows.
+
+.. admonition:: Tip 
+    :class: tip
+
+    You can override the default location of the config file by setting the following environment variable:
+    
+    export MINERU_TOOLS_CONFIG_JSON=new_magic_pdf.json
+
+
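The lookup order described above (environment override first, home directory otherwise) can be sketched as follows. This is only an illustration: ``resolve_config_path`` is a hypothetical helper, not part of the MinerU API, and how MinerU itself treats a relative override path is not stated here.

```python
import os
from pathlib import Path

ENV_VAR = "MINERU_TOOLS_CONFIG_JSON"  # variable name taken from the tip above
DEFAULT_NAME = "magic-pdf.json"       # default file name taken from this page


def resolve_config_path(environ=None, home=None):
    """Hypothetical helper: return the override path if set, else <home>/magic-pdf.json."""
    environ = os.environ if environ is None else environ
    home = Path.home() if home is None else Path(home)
    override = environ.get(ENV_VAR)
    if override:
        return Path(override)
    return home / DEFAULT_NAME
```

Passing ``environ`` and ``home`` explicitly keeps the function easy to test without touching the real environment.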
+
+magic-pdf.json
+----------------
+
+.. code:: json 
+
+    {
+        "bucket_info":{
+            "bucket-name-1":["ak", "sk", "endpoint"],
+            "bucket-name-2":["ak", "sk", "endpoint"]
+        },
+        "models-dir":"/tmp/models",
+        "layoutreader-model-dir":"/tmp/layoutreader",
+        "device-mode":"cpu",
+        "layout-config": {
+            "model": "layoutlmv3"
+        },
+        "formula-config": {
+            "mfd_model": "yolo_v8_mfd",
+            "mfr_model": "unimernet_small",
+            "enable": true
+        },
+        "table-config": {
+            "model": "rapid_table",
+            "enable": false,
+            "max_time": 400    
+        },
+        "config_version": "1.0.0"
+    }
+
+
+
+
+bucket_info
+^^^^^^^^^^^^^^
+Stores the access_key, secret_key, and endpoint for each AWS S3 compatible storage bucket, keyed by bucket name.
+
+Example: 
+
+.. code:: text
+
+        {
+            "image_bucket":[{access_key}, {secret_key}, {endpoint}],
+            "video_bucket":[{access_key}, {secret_key}, {endpoint}]
+        }
+
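A minimal sketch of reading this mapping, assuming each value is a three-element list in the order shown above. ``BucketCredentials`` and ``parse_bucket_info`` are hypothetical names used for illustration only, not MinerU functions.

```python
from typing import NamedTuple


class BucketCredentials(NamedTuple):  # hypothetical helper type
    access_key: str
    secret_key: str
    endpoint: str


def parse_bucket_info(bucket_info):
    """Turn {"bucket": [ak, sk, endpoint]} into {"bucket": BucketCredentials}."""
    parsed = {}
    for name, values in bucket_info.items():
        if len(values) != 3:
            raise ValueError(
                f"bucket {name!r}: expected [access_key, secret_key, endpoint]"
            )
        parsed[name] = BucketCredentials(*values)
    return parsed
```

Named fields make it harder to swap the access key and the secret key by accident when the values are passed on to a reader or writer.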
+
+models-dir
+^^^^^^^^^^^^
+
+Directory that stores the models downloaded from **huggingface** or **modelscope**. You do not need to modify this field if you download the models using the scripts shipped with **MinerU**.
+
+
+layoutreader-model-dir
+^^^^^^^^^^^^^^^^^^^^^^^
+
+Directory that stores the layoutreader model downloaded from **huggingface** or **modelscope**. You do not need to modify this field if you download the model using the scripts shipped with **MinerU**.
+
+
+device-mode
+^^^^^^^^^^^^^^
+
+This field has two options, **cpu** or **cuda**.
+
+**cpu**: inference via cpu
+
+**cuda**: using cuda to accelerate inference
+
+
+layout-config 
+^^^^^^^^^^^^^^^
+
+.. code:: json
+
+    {
+        "model": "layoutlmv3"  
+    }
+
+The layout model cannot be disabled, and only one kind of layout model is currently available.
+
+
+formula-config
+^^^^^^^^^^^^^^^^
+
+.. code:: json
+
+    {
+        "mfd_model": "yolo_v8_mfd",   
+        "mfr_model": "unimernet_small",
+        "enable": true 
+    }
+
+
+mfd_model
+""""""""""
+
+Specify the formula detection model, options are ['yolo_v8_mfd']
+
+
+mfr_model
+""""""""""
+Specify the formula recognition model, options are ['unimernet_small']
+
+Check `UniMERNet <https://github.com/opendatalab/UniMERNet>`_ for more details
+
+
+enable
+""""""""
+
+On-off flag; options are [true, false]. **true** enables formula inference, **false** disables it.
+
+
+table-config
+^^^^^^^^^^^^^^^^
+
+.. code:: json
+
+   {
+        "model": "rapid_table",
+        "enable": false,
+        "max_time": 400    
+    }
+
+model
+""""""""
+
+Specify the table inference model, options are ['rapid_table', 'tablemaster', 'struct_eqtable']
+
+
+max_time
+"""""""""
+
+Since table recognition is a time-consuming process, we set a timeout period. If the process exceeds this time, the table recognition will be terminated.
+
+
+
+enable
+"""""""
+
+On-off flag; options are [true, false]. **true** enables table inference, **false** disables it.
+
+
+config_version
+^^^^^^^^^^^^^^^^
+
+The version of config schema.
+
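The options listed in the sections above can be checked before running a conversion. The sketch below is an assumption-laden illustration: ``validate_config`` is a hypothetical helper, and the allowed values are copied from this page only, so the template file may accept more.

```python
# Allowed values exactly as listed on this page; the template may list more.
DEVICE_MODES = {"cpu", "cuda"}
LAYOUT_MODELS = {"layoutlmv3"}
MFD_MODELS = {"yolo_v8_mfd"}
MFR_MODELS = {"unimernet_small"}
TABLE_MODELS = {"rapid_table", "tablemaster", "struct_eqtable"}


def validate_config(config):
    """Hypothetical check: return a list of human-readable problems (empty if none)."""
    problems = []

    def check(section, key, allowed):
        holder = config if section is None else config.get(section, {})
        value = holder.get(key)
        if value not in allowed:
            label = key if section is None else f"{section}.{key}"
            problems.append(f"{label}: {value!r} is not one of {sorted(allowed)}")

    check(None, "device-mode", DEVICE_MODES)
    check("layout-config", "model", LAYOUT_MODELS)
    check("formula-config", "mfd_model", MFD_MODELS)
    check("formula-config", "mfr_model", MFR_MODELS)
    check("table-config", "model", TABLE_MODELS)

    # Both "enable" flags must be JSON booleans.
    for section in ("formula-config", "table-config"):
        if not isinstance(config.get(section, {}).get("enable"), bool):
            problems.append(f"{section}.enable: expected true or false")
    return problems
```

Returning a list instead of raising on the first error lets a caller report every misconfigured field at once.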
+
+.. admonition:: Tip
+    :class: tip
+    
+    Check `Config Schema <https://github.com/opendatalab/MinerU/blob/master/magic-pdf.template.json>`_ for the latest details
+

+ 32 - 3
next_docs/en/user_guide/install/install.rst

@@ -4,6 +4,7 @@ Install
 If you encounter any installation issues, please first consult the :doc:`../../additional_notes/faq`.
 If the parsing results are not as expected, refer to the :doc:`../../additional_notes/known_issues`.
 
+You can also try the `online demo <https://www.modelscope.cn/studios/OpenDataLab/MinerU>`_ without installing anything.
 
 .. admonition:: Warning
     :class: tip
@@ -88,7 +89,7 @@ If the parsing results are not as expected, refer to the :doc:`../../additional_
 
 
 Create an environment
-~~~~~~~~~~~~~~~~~~~~~
+---------------------------
 
 .. code-block:: shell
 
@@ -98,7 +99,7 @@ Create an environment
 
 
 Download model weight files
-~~~~~~~~~~~~~~~~~~~~~~~~~~
+------------------------------
 
 .. code-block:: shell
 
@@ -107,4 +108,32 @@ Download model weight files
     python download_models_hf.py    
 
 
-The MinerU is installed, Check out :doc:`../quick_start` or reading :doc:`boost_with_cuda` for accelerate inference
+
+Install LibreOffice (Optional)
+----------------------------------
+
+This section is required to handle **doc**, **docx**, **ppt**, and **pptx** files. You can **skip** it if you do not need to process these file types.
+
+
+Linux/macOS Platform
+""""""""""""""""""""""
+
+.. code::
+
+    apt-get/yum/brew install libreoffice
+
+
+Windows Platform 
+""""""""""""""""""""
+
+.. code::
+
+    install libreoffice 
+    append "install_dir\LibreOffice\program" to ENVIRONMENT PATH
+
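To confirm that the PATH change above took effect, a quick check like the following may help. It is a sketch under assumptions: the executable names ``soffice`` and ``libreoffice`` are the commonly used ones and are not confirmed by this page, and ``find_libreoffice`` is a hypothetical helper, not a MinerU function.

```python
import shutil


def find_libreoffice(which=shutil.which):
    """Return the first LibreOffice executable found on PATH, or None."""
    # Candidate names are assumptions; adjust them for your installation.
    for name in ("soffice", "libreoffice"):
        path = which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    found = find_libreoffice()
    print(found or "LibreOffice not found on PATH")
```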
+
+.. tip::
+
+    MinerU is now installed. Check out :doc:`../usage/command_line` to convert your first PDF, **or** read the following sections for more details about installation.
+
+

+ 335 - 0
next_docs/en/user_guide/pipe_result.rst

@@ -0,0 +1,335 @@
+
+
+Pipe Result
+==============
+
+.. admonition:: Tip
+    :class: tip
+
+    Please first navigate to :doc:`tutorial/pipeline` to get an initial understanding of how the pipeline works; this will help in understanding the content of this section.
+
+
+The **PipeResult** class is a container for pipeline processing results and implements a series of methods that operate on them, such as ``draw_layout`` and ``draw_span``.
+Check out :doc:`../api/pipe_operators` for more details about **PipeResult**.
+
+
+
+Structure Definitions
+-------------------------------
+
+**some_pdf_middle.json**
+
++----------------+--------------------------------------------------------------+
+| Field Name     | Description                                                  |
+|                |                                                              |
++================+==============================================================+
+| pdf_info       | list, each element is a dict representing the parsing result |
+|                | of each PDF page, see the table below for details            |
++----------------+--------------------------------------------------------------+
+| \_             | ocr \| txt, used to indicate the mode used in this           |
+| parse_type     | intermediate parsing state                                   |
+|                |                                                              |
++----------------+--------------------------------------------------------------+
+| \_version_name | string, indicates the version of magic-pdf used in this      |
+|                | parsing                                                      |
+|                |                                                              |
++----------------+--------------------------------------------------------------+
+
+**pdf_info**
+
+Field structure description
+
++-------------------------+------------------------------------------------------------+
+| Field                   | Description                                                |
+| Name                    |                                                            |
++=========================+============================================================+
+| preproc_blocks          | Intermediate result after PDF preprocessing, not yet       |
+|                         | segmented                                                  |
++-------------------------+------------------------------------------------------------+
+| layout_bboxes           | Layout segmentation results, containing layout direction   |
+|                         | (vertical, horizontal), and bbox, sorted by reading order  |
++-------------------------+------------------------------------------------------------+
+| page_idx                | Page number, starting from 0                               |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| page_size               | Page width and height                                      |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| \_layout_tree           | Layout tree structure                                      |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| images                  | list, each element is a dict representing an img_block     |
++-------------------------+------------------------------------------------------------+
+| tables                  | list, each element is a dict representing a table_block    |
++-------------------------+------------------------------------------------------------+
+| interline_equations     | list, each element is a dict representing an               |
+|                         | interline_equation_block                                   |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| discarded_blocks        | list, block information returned by the model that needs   |
+|                         | to be dropped                                              |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| para_blocks             | Result after segmenting preproc_blocks                     |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+
+In the above table, ``para_blocks`` is an array of dicts, each dict
+representing a block structure. A block can support up to one level of
+nesting.
+
+**block**
+
+The outer block is referred to as a first-level block, and the fields in
+the first-level block include:
+
++------------------------+-------------------------------------------------------------+
+| Field                  | Description                                                 |
+| Name                   |                                                             |
++========================+=============================================================+
+| type                   | Block type (table|image)                                    |
++------------------------+-------------------------------------------------------------+
+| bbox                   | Block bounding box coordinates                              |
++------------------------+-------------------------------------------------------------+
+| blocks                 | list, each element is a dict representing a second-level    |
+|                        | block                                                       |
++------------------------+-------------------------------------------------------------+
+
+There are only two types of first-level blocks: “table” and “image”. All
+other blocks are second-level blocks.
+
+The fields in a second-level block include:
+
++----------------------+----------------------------------------------------------------+
+| Field                | Description                                                    |
+| Name                 |                                                                |
++======================+================================================================+
+| type                 | Block type                                                     |
+|                      |                                                                |
++----------------------+----------------------------------------------------------------+
+| bbox                 | Block bounding box coordinates                                 |
+|                      |                                                                |
++----------------------+----------------------------------------------------------------+
+| lines                | list, each element is a dict representing a line, used to      |
+|                      | describe the composition of a line of information              |
++----------------------+----------------------------------------------------------------+
+
+Detailed explanation of second-level block types
+
+================== ======================
+type               Description
+================== ======================
+image_body         Main body of the image
+image_caption      Image description text
+table_body         Main body of the table
+table_caption      Table description text
+table_footnote     Table footnote
+text               Text block
+title              Title block
+interline_equation Block formula
+================== ======================
+
+**line**
+
+The field format of a line is as follows:
+
++---------------------+----------------------------------------------------------------+
+| Field               | Description                                                    |
+| Name                |                                                                |
++=====================+================================================================+
+| bbox                | Bounding box coordinates of the line                           |
+|                     |                                                                |
++---------------------+----------------------------------------------------------------+
+| spans               | list, each element is a dict representing a span, used to      |
+|                     | describe the composition of the smallest unit                  |
++---------------------+----------------------------------------------------------------+
+
+**span**
+
++---------------------+-----------------------------------------------------------+
+| Field               | Description                                               |
+| Name                |                                                           |
++=====================+===========================================================+
+| bbox                | Bounding box coordinates of the span                      |
++---------------------+-----------------------------------------------------------+
+| type                | Type of the span                                          |
++---------------------+-----------------------------------------------------------+
+| content             | Text spans use content, chart spans use img_path to store |
+| \|                  | the actual text or screenshot path information            |
+| img_path            |                                                           |
++---------------------+-----------------------------------------------------------+
+
+The types of spans are as follows:
+
+================== ==============
+type               Description
+================== ==============
+image              Image
+table              Table
+text               Text
+inline_equation    Inline formula
+interline_equation Block formula
+================== ==============
+
+**Summary**
+
+A span is the smallest storage unit for all elements.
+
+The elements stored within para_blocks are block information.
+
+The block structure is as follows:
+
+First-level block (if any) -> Second-level block -> Line -> Span
+
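The nesting summarized above can be walked with a short generator. This is a sketch based only on the field names documented in this section (``blocks``, ``lines``, ``spans``, ``type``, ``content``); ``iter_spans`` and ``collect_text`` are hypothetical helpers, not part of the MinerU API.

```python
def iter_spans(para_blocks):
    """Yield every span, handling the optional first-level (table/image) wrapper."""
    for block in para_blocks:
        # First-level blocks keep their second-level blocks under "blocks";
        # any other block is already a second-level block.
        second_level = block["blocks"] if "blocks" in block else [block]
        for sub_block in second_level:
            for line in sub_block.get("lines", []):
                for span in line.get("spans", []):
                    yield span


def collect_text(middle_json):
    """Return the content of all text spans, page by page, in stored order."""
    parts = []
    for page in middle_json["pdf_info"]:
        for span in iter_spans(page["para_blocks"]):
            if span.get("type") == "text":
                parts.append(span["content"])
    return parts
```

Loading ``some_pdf_middle.json`` with ``json.load`` and passing the result to ``collect_text`` would give the plain text spans of the document under these assumptions.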
+.. _example-1:
+
+example
+^^^^^^^
+
+.. code:: json
+
+   {
+       "pdf_info": [
+           {
+               "preproc_blocks": [
+                   {
+                       "type": "text",
+                       "bbox": [
+                           52,
+                           61.956024169921875,
+                           294,
+                           82.99800872802734
+                       ],
+                       "lines": [
+                           {
+                               "bbox": [
+                                   52,
+                                   61.956024169921875,
+                                   294,
+                                   72.0000228881836
+                               ],
+                               "spans": [
+                                   {
+                                       "bbox": [
+                                           54.0,
+                                           61.956024169921875,
+                                           296.2261657714844,
+                                           72.0000228881836
+                                       ],
+                                       "content": "dependent on the service headway and the reliability of the departure ",
+                                       "type": "text",
+                                       "score": 1.0
+                                   }
+                               ]
+                           }
+                       ]
+                   }
+               ],
+               "layout_bboxes": [
+                   {
+                       "layout_bbox": [
+                           52,
+                           61,
+                           294,
+                           731
+                       ],
+                       "layout_label": "V",
+                       "sub_layout": []
+                   }
+               ],
+               "page_idx": 0,
+               "page_size": [
+                   612.0,
+                   792.0
+               ],
+               "_layout_tree": [],
+               "images": [],
+               "tables": [],
+               "interline_equations": [],
+               "discarded_blocks": [],
+               "para_blocks": [
+                   {
+                       "type": "text",
+                       "bbox": [
+                           52,
+                           61.956024169921875,
+                           294,
+                           82.99800872802734
+                       ],
+                       "lines": [
+                           {
+                               "bbox": [
+                                   52,
+                                   61.956024169921875,
+                                   294,
+                                   72.0000228881836
+                               ],
+                               "spans": [
+                                   {
+                                       "bbox": [
+                                           54.0,
+                                           61.956024169921875,
+                                           296.2261657714844,
+                                           72.0000228881836
+                                       ],
+                                       "content": "dependent on the service headway and the reliability of the departure ",
+                                       "type": "text",
+                                       "score": 1.0
+                                   }
+                               ]
+                           }
+                       ]
+                   }
+               ]
+           }
+       ],
+       "_parse_type": "txt",
+       "_version_name": "0.6.1"
+   }
+
+
+Pipeline Result
+------------------
+
+.. code:: python
+
+    from magic_pdf.pdf_parse_union_core_v2 import pdf_parse_union
+    from magic_pdf.operators.pipes import PipeResult
+    from magic_pdf.data.dataset import Dataset
+
+    res = pdf_parse_union(*args, **kwargs)
+    res['_parse_type'] = PARSE_TYPE_OCR
+    res['_version_name'] = __version__
+    if 'lang' in kwargs and kwargs['lang'] is not None:
+        res['lang'] = kwargs['lang']
+
+    dataset : Dataset = some_dataset   # not real dataset
+    pipeResult = PipeResult(res, dataset)
+
+
+
+some_pdf_layout.pdf
+~~~~~~~~~~~~~~~~~~~
+
+Each page layout consists of one or more boxes. The number at the top
+left of each box indicates its sequence number. Additionally, in
+``layout.pdf``, different content blocks are highlighted with different
+background colors.
+
+.. figure:: ../_static/image/layout_example.png
+   :alt: layout example
+
+   layout example
+
+some_pdf_spans.pdf
+~~~~~~~~~~~~~~~~~~
+
+All spans on the page are drawn with different colored line frames
+according to the span type. This file can be used for quality control,
+allowing for quick identification of issues such as missing text or
+unrecognized inline formulas.
+
+.. figure:: ../_static/image/spans_example.png
+   :alt: spans example
+
+   spans example

+ 4 - 5
next_docs/en/user_guide/quick_start.rst

@@ -2,12 +2,11 @@
 Quick Start 
 ==============
 
-Eager to get started? This page gives a good introduction to MinerU. Follow Installation to set up a project and install MinerU first.
-
+Want to see how MinerU is used in different scenarios? This page provides examples for several use cases so you can find the one that matches your needs.
 
 .. toctree::
     :maxdepth: 1
 
-    quick_start/command_line
-    quick_start/to_markdown
-
+    quick_start/convert_pdf 
+    quick_start/convert_image
+    quick_start/convert_ms_office

+ 47 - 0
next_docs/en/user_guide/quick_start/convert_image.rst

@@ -0,0 +1,47 @@
+
+
+Convert Image
+===============
+
+
+Command Line
+^^^^^^^^^^^^^
+
+.. code:: shell
+
+    # make sure the file has the correct suffix
+    magic-pdf -p a.png -o output -m auto
+
+
+API
+^^^^^^
+
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import FileBasedDataWriter
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+    from magic_pdf.data.read_api import read_local_images
+
+    # prepare env
+    local_image_dir, local_md_dir = "output/images", "output"
+    image_dir = str(os.path.basename(local_image_dir))
+
+    os.makedirs(local_image_dir, exist_ok=True)
+
+    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
+        local_md_dir
+    )
+
+    # proc
+    ## Create Dataset Instance
+    input_file = "some_image.jpg"       # replace with real image file
+
+    input_file_name = input_file.split(".")[0]
+    ds = read_local_images(input_file)[0]
+
+    # ocr mode
+    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
+        md_writer, f"{input_file_name}.md", image_dir
+    )
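The example above derives the output name with ``input_file.split(".")[0]``, which works for a bare file name but cuts at the first dot anywhere in a longer path. A small sketch of a more defensive alternative, using only the standard library (``output_stem`` is a hypothetical helper name):

```python
from pathlib import Path


def output_stem(input_file):
    """Return the file name without its directory and final suffix."""
    return Path(input_file).stem


# A dotted directory name shows the difference between the two approaches.
path = "scans.v2/some_image.jpg"
print(path.split(".")[0])   # scans
print(output_stem(path))    # some_image
```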

+ 60 - 0
next_docs/en/user_guide/quick_start/convert_ms_office.rst

@@ -0,0 +1,60 @@
+
+
+Convert Doc
+=============
+
+.. admonition:: Warning
+    :class: tip
+
+    When processing MS-Office files, we first use third-party software to convert the MS-Office files to PDF.
+
+    For certain MS-Office files, the quality of the converted PDF files may not be very high, which can affect the quality of the final output.
+
+
+
+Command Line
+^^^^^^^^^^^^^
+
+.. code:: shell
+
+    # replace a.doc with a real MS Office file; DOC, DOCX, PPT, and PPTX are currently supported
+    magic-pdf -p a.doc -o output -m auto
+
+
+API
+^^^^^^^^
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+    from magic_pdf.data.read_api import read_local_office
+    from magic_pdf.config.enums import SupportedPdfParseMethod
+
+
+    # prepare env
+    local_image_dir, local_md_dir = "output/images", "output"
+    image_dir = str(os.path.basename(local_image_dir))
+
+    os.makedirs(local_image_dir, exist_ok=True)
+
+    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
+        local_md_dir
+    )
+
+    # proc
+    ## Create Dataset Instance
+    input_file = "some_doc.doc"     # replace with real ms-office file, we support MS-DOC, MS-DOCX, MS-PPT, MS-PPTX now
+
+    input_file_name = input_file.split(".")[0]
+    ds = read_local_office(input_file)[0]
+
+
+    ## inference
+    if ds.classify() == SupportedPdfParseMethod.OCR:
+        ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
+            md_writer, f"{input_file_name}.md", image_dir
+        )
+    else:
+        ds.apply(doc_analyze, ocr=False).pipe_txt_mode(image_writer).dump_md(
+            md_writer, f"{input_file_name}.md", image_dir
+        )

+ 56 - 0
next_docs/en/user_guide/quick_start/convert_pdf.rst

@@ -0,0 +1,56 @@
+
+
+Convert PDF
+============
+
+Command Line
+^^^^^^^^^^^^^
+
+.. code:: shell
+
+    # make sure the file has the correct suffix
+    magic-pdf -p a.pdf -o output -m auto
+
+
+API
+^^^^^^
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
+    from magic_pdf.data.dataset import PymuDocDataset
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+    from magic_pdf.config.enums import SupportedPdfParseMethod
+
+    # args
+    pdf_file_name = "abc.pdf"  # replace with the real pdf path
+    name_without_suff = pdf_file_name.split(".")[0]
+
+    # prepare env
+    local_image_dir, local_md_dir = "output/images", "output"
+    image_dir = str(os.path.basename(local_image_dir))
+
+    os.makedirs(local_image_dir, exist_ok=True)
+
+    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
+        local_md_dir
+    )
+
+    # read bytes
+    reader1 = FileBasedDataReader("")
+    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content
+
+    # proc
+    ## Create Dataset Instance
+    ds = PymuDocDataset(pdf_bytes)
+
+    ## inference
+    if ds.classify() == SupportedPdfParseMethod.OCR:
+        ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
+            md_writer, f"{name_without_suff}.md", image_dir
+        )
+    else:
+        ds.apply(doc_analyze, ocr=False).pipe_txt_mode(image_writer).dump_md(
+            md_writer, f"{name_without_suff}.md", image_dir
+        )

+ 0 - 1
next_docs/en/user_guide/tutorial.rst

@@ -7,6 +7,5 @@ From the beginning to the end, Show how to using mineru via a minimal project
 .. toctree::
     :maxdepth: 1
 
-    tutorial/output_file_description
     tutorial/pipeline
 

+ 0 - 3
next_docs/en/user_guide/tutorial/pipeline.rst

@@ -28,7 +28,6 @@ Minimal Example
     image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
         local_md_dir
     )
-    image_dir = str(os.path.basename(local_image_dir))
 
     # read bytes
     reader1 = FileBasedDataReader("")
@@ -85,8 +84,6 @@ These stages are linked together through methods like ``apply``, ``doc_analyze``
 .. admonition:: Tip
     :class: tip
 
-    For more examples on how to use ``Dataset``, ``InferenceResult``, and ``PipeResult``, please refer to :doc:`../quick_start/to_markdown`
-
     For more detailed information about ``Dataset``, ``InferenceResult``, and ``PipeResult``, please refer to :doc:`../../api/dataset`, :doc:`../../api/model_operators`, :doc:`../../api/pipe_operators`
 
 

+ 12 - 0
next_docs/en/user_guide/usage.rst

@@ -0,0 +1,12 @@
+
+
+Usage
+========
+
+.. toctree::
+   :maxdepth: 1
+
+   usage/command_line
+   usage/api
+   usage/docker
+

+ 279 - 0
next_docs/en/user_guide/usage/api.rst

@@ -0,0 +1,279 @@
+
+API Usage
+===========
+
+
+PDF
+----
+
+Local File Example
+^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
+    from magic_pdf.data.dataset import PymuDocDataset
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+    from magic_pdf.config.enums import SupportedPdfParseMethod
+
+    # args
+    pdf_file_name = "abc.pdf"  # replace with the real pdf path
+    name_without_suff = pdf_file_name.split(".")[0]
+
+    # prepare env
+    local_image_dir, local_md_dir = "output/images", "output"
+    image_dir = str(os.path.basename(local_image_dir))
+
+    os.makedirs(local_image_dir, exist_ok=True)
+
+    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
+        local_md_dir
+    )
+
+    # read bytes
+    reader1 = FileBasedDataReader("")
+    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content
+
+    # proc
+    ## Create Dataset Instance
+    ds = PymuDocDataset(pdf_bytes)
+
+    ## inference
+    if ds.classify() == SupportedPdfParseMethod.OCR:
+        infer_result = ds.apply(doc_analyze, ocr=True)
+
+        ## pipeline
+        pipe_result = infer_result.pipe_ocr_mode(image_writer)
+
+    else:
+        infer_result = ds.apply(doc_analyze, ocr=False)
+
+        ## pipeline
+        pipe_result = infer_result.pipe_txt_mode(image_writer)
+
+    ### draw model result on each page
+    infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))
+
+    ### get model inference result
+    model_inference_result = infer_result.get_infer_res()
+
+    ### draw layout result on each page
+    pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))
+
+    ### draw spans result on each page
+    pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))
+
+    ### get markdown content
+    md_content = pipe_result.get_markdown(image_dir)
+
+    ### dump markdown
+    pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)
+
+    ### get content list content
+    content_list_content = pipe_result.get_content_list(image_dir)
+
+    ### dump content list
+    pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
+
+    ### get middle json
+    middle_json_content = pipe_result.get_middle_json()
+
+    ### dump middle json
+    pipe_result.dump_middle_json(md_writer, f'{name_without_suff}_middle.json')
+
+
+
+S3 File Example
+^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
+    from magic_pdf.data.dataset import PymuDocDataset
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+    from magic_pdf.config.enums import SupportedPdfParseMethod
+
+    bucket_name = "{Your S3 Bucket Name}"  # replace with real bucket name
+    ak = "{Your S3 access key}"  # replace with real s3 access key
+    sk = "{Your S3 secret key}"  # replace with real s3 secret key
+    endpoint_url = "{Your S3 endpoint_url}"  # replace with real s3 endpoint_url
+
+    reader = S3DataReader('unittest/tmp/', bucket_name, ak, sk, endpoint_url)  # replace `unittest/tmp` with the real s3 prefix
+    writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
+    image_writer = S3DataWriter('unittest/tmp/images', bucket_name, ak, sk, endpoint_url)
+    md_writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
+
+    local_image_dir, local_md_dir = "output/images", "output"
+    image_dir = str(os.path.basename(local_image_dir))
+
+    # args
+    pdf_file_name = (
+        f"s3://{bucket_name}/unittest/tmp/bug5-11.pdf"  # replace with the real s3 path
+    )
+
+    # prepare env
+    local_dir = "output"
+    name_without_suff = os.path.basename(pdf_file_name).split(".")[0]
+
+    # read bytes
+    pdf_bytes = reader.read(pdf_file_name)  # read the pdf content
+
+    # proc
+    ## Create Dataset Instance
+    ds = PymuDocDataset(pdf_bytes)
+
+    ## inference
+    if ds.classify() == SupportedPdfParseMethod.OCR:
+        infer_result = ds.apply(doc_analyze, ocr=True)
+
+        ## pipeline
+        pipe_result = infer_result.pipe_ocr_mode(image_writer)
+
+    else:
+        infer_result = ds.apply(doc_analyze, ocr=False)
+
+        ## pipeline
+        pipe_result = infer_result.pipe_txt_mode(image_writer)
+
+    ### draw model result on each page
+    infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))
+
+    ### get model inference result
+    model_inference_result = infer_result.get_infer_res()
+
+    ### draw layout result on each page
+    pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))
+
+    ### draw spans result on each page
+    pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))
+
+    ### dump markdown
+    pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)
+
+    ### dump content list
+    pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
+
+    ### get markdown content
+    md_content = pipe_result.get_markdown(image_dir)
+
+    ### get content list content
+    content_list_content = pipe_result.get_content_list(image_dir)
+
+    ### get middle json
+    middle_json_content = pipe_result.get_middle_json()
+
+    ### dump middle json
+    pipe_result.dump_middle_json(md_writer, f'{name_without_suff}_middle.json')
+
+MS-Office
+----------
+
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+    from magic_pdf.data.read_api import read_local_office
+
+    # prepare env
+    local_image_dir, local_md_dir = "output/images", "output"
+    image_dir = str(os.path.basename(local_image_dir))
+
+    os.makedirs(local_image_dir, exist_ok=True)
+
+    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
+        local_md_dir
+    )
+
+    # proc
+    ## Create Dataset Instance
+    input_file = "some_ppt.ppt"     # replace with real ms-office file
+
+    input_file_name = input_file.split(".")[0]
+    ds = read_local_office(input_file)[0]
+
+    ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
+        md_writer, f"{input_file_name}.md", image_dir
+    )
+
+This code snippet can be used to process **ppt**, **pptx**, **doc**, and **docx** files.
+
+
+Image
+---------
+
+Single Image File
+^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import FileBasedDataWriter
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+    from magic_pdf.data.read_api import read_local_images
+
+    # prepare env
+    local_image_dir, local_md_dir = "output/images", "output"
+    image_dir = str(os.path.basename(local_image_dir))
+
+    os.makedirs(local_image_dir, exist_ok=True)
+
+    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
+        local_md_dir
+    )
+
+    # proc
+    ## Create Dataset Instance
+    input_file = "some_image.jpg"       # replace with real image file
+
+    input_file_name = input_file.split(".")[0]
+    ds = read_local_images(input_file)[0]
+
+    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
+        md_writer, f"{input_file_name}.md", image_dir
+    )
+
+
+Directory That Contains Images
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+.. code:: python
+
+    import os
+
+    from magic_pdf.data.data_reader_writer import FileBasedDataWriter
+    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
+    from magic_pdf.data.read_api import read_local_images
+
+    # prepare env
+    local_image_dir, local_md_dir = "output/images", "output"
+    image_dir = str(os.path.basename(local_image_dir))
+
+    os.makedirs(local_image_dir, exist_ok=True)
+
+    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
+        local_md_dir
+    )
+
+    # proc
+    ## Create Dataset Instance
+    input_directory = "some_image_dir/"       # replace with real directory that contains images
+
+
+    dss = read_local_images(input_directory, suffixes=['.png', '.jpg'])
+
+    count = 0
+    for ds in dss:
+        ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
+            md_writer, f"{count}.md", image_dir
+        )
+        count += 1
+
+
+See :doc:`../data/data_reader_writer` for more reader and writer examples, and :doc:`../../api/pipe_operators` or :doc:`../../api/model_operators` for API details.
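
Note on the examples above: the MS-Office and single-image snippets derive the output name with `input_file.split(".")[0]`, which cuts at the first dot anywhere in the string. That is fine for `some_ppt.ppt`, but it may give surprising names for paths that contain other dots. A minimal sketch of an alternative, assuming POSIX-style paths; `output_stem` is a hypothetical helper and not part of `magic_pdf`:

```python
from pathlib import PurePosixPath


def output_stem(input_file: str) -> str:
    """Hypothetical helper: file name without directory and without the final suffix."""
    return PurePosixPath(input_file).stem


path = "slides/v1.2/deck.final.pptx"

# Approach used in the snippets: everything before the first dot in the whole string.
naive = path.split(".")[0]

# Alternative: drop the directory part, then drop only the last suffix.
robust = output_stem(path)

print(naive)
print(robust)
```

For a plain name such as `some_ppt.ppt`, both approaches yield `some_ppt`, so the documented snippets behave the same on their own sample inputs.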

+ 18 - 3
next_docs/en/user_guide/quick_start/command_line.rst → next_docs/en/user_guide/usage/command_line.rst

@@ -10,7 +10,8 @@ Command Line
 
    Options:
      -v, --version                display the version and exit
-     -p, --path PATH              local pdf filepath or directory  [required]
+     -p, --path PATH              local filepath or directory. support PDF, PPT,
+                                  PPTX, DOC, DOCX, PNG, JPG files  [required]
      -o, --output-dir PATH        output local directory  [required]
      -m, --method [ocr|txt|auto]  the method for parsing pdf. ocr: using ocr
                                   technique to extract information from pdf. txt:
@@ -40,6 +41,20 @@ Command Line
    ## command line example
    magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
 
+
+.. admonition:: Important
+    :class: tip
+
+    The input file name must end with one of the following suffixes:
+
+    - ``.pdf``
+    - ``.png``
+    - ``.jpg``
+    - ``.ppt``
+    - ``.pptx``
+    - ``.doc``
+    - ``.docx``
+
+
 ``{some_pdf}`` can be a single PDF file or a directory containing
 multiple PDFs. The results will be saved in the ``{some_output_dir}``
 directory. The output file list is as follows:
@@ -57,6 +72,6 @@ directory. The output file list is as follows:
 
 .. admonition:: Tip
    :class: tip
+   
 
-   For more information about the output files, please refer to the :doc:`../tutorial/output_file_description`
-
+   For more information about the output files, please refer to :doc:`../inference_result` and :doc:`../pipe_result`.
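
Note on the suffix requirement added to command_line.rst: inputs can be checked against the listed suffixes before `magic-pdf` is invoked. A minimal sketch, assuming an exact, case-sensitive match on the final suffix (the docs do not say how the CLI treats upper-case suffixes); `SUPPORTED_SUFFIXES`, `has_supported_suffix`, and `supported_files` are hypothetical names, not part of `magic_pdf`:

```python
from pathlib import Path
from typing import List

# Suffixes copied from the admonition in command_line.rst.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".ppt", ".pptx", ".doc", ".docx"}


def has_supported_suffix(path: str) -> bool:
    """Return True if the final suffix of `path` is one of the listed suffixes.

    Assumption: the match is case-sensitive, so "scan.PDF" is rejected here.
    """
    return Path(path).suffix in SUPPORTED_SUFFIXES


def supported_files(directory: str) -> List[str]:
    """List the regular files directly inside `directory` that pass the suffix check."""
    return sorted(
        str(p)
        for p in Path(directory).iterdir()
        if p.is_file() and has_supported_suffix(p.name)
    )
```

A caller could pass each entry of `supported_files("some_dir")` to `magic-pdf -p`, or skip the run when the list is empty.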

+ 24 - 0
next_docs/en/user_guide/usage/docker.rst

@@ -0,0 +1,24 @@
+
+
+Docker 
+=======
+
+.. admonition:: Important
+   :class: tip
+
+   The Docker image requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
+
+   Before running the container, you can use the following command to check whether your device supports CUDA acceleration in Docker.
+
+   .. code-block:: bash
+
+      docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
+
+
+.. code:: sh
+
+   wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
+   docker build -t mineru:latest .
+   docker run --rm -it --gpus=all mineru:latest /bin/bash
+   magic-pdf --help
+

+ 1 - 0
next_docs/requirements.txt

@@ -8,6 +8,7 @@ myst-parser
 Pillow==8.4.0
 pydantic>=2.7.2,<2.8.0
 PyMuPDF>=1.24.9
+pdfminer.six==20231228
 sphinx
 sphinx-argparse>=0.5.2
 sphinx-book-theme>=1.1.3

Some files were not shown because too many files changed in this diff