
docs: remove outdated documentation files

- Deleted .readthedocs.yaml files from multiple directories
- Removed outdated API and user guide documentation files
- Deleted command line usage examples
- Removed CUDA acceleration guide
myhloli, 5 months ago
parent
commit
cf5c8f47f4
100 files changed, 0 additions and 5319 deletions
  1. 0 16
      .readthedocs.yaml
  2. 0 51
      docs/README_Ascend_NPU_Acceleration_zh_CN.md
  3. 0 111
      docs/README_Ubuntu_CUDA_Acceleration_en_US.md
  4. 0 115
      docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md
  5. 0 83
      docs/README_Windows_CUDA_Acceleration_en_US.md
  6. 0 86
      docs/README_Windows_CUDA_Acceleration_zh_CN.md
  7. 0 23
      docs/how_to_download_models_en.md
  8. 0 37
      docs/how_to_download_models_zh_cn.md
  9. 0 16
      next_docs/en/.readthedocs.yaml
  10. 0 20
      next_docs/en/Makefile
  11. BIN
      next_docs/en/_static/image/MinerU-logo-hq.png
  12. BIN
      next_docs/en/_static/image/MinerU-logo.png
  13. 0 13
      next_docs/en/_static/image/ReadTheDocs.svg
  14. BIN
      next_docs/en/_static/image/datalab_logo.png
  15. BIN
      next_docs/en/_static/image/flowchart_en.png
  16. BIN
      next_docs/en/_static/image/flowchart_zh_cn.png
  17. BIN
      next_docs/en/_static/image/inference_result.png
  18. BIN
      next_docs/en/_static/image/layout_example.png
  19. BIN
      next_docs/en/_static/image/logo.png
  20. 0 3
      next_docs/en/_static/image/pipeline.drawio.svg
  21. BIN
      next_docs/en/_static/image/poly.png
  22. BIN
      next_docs/en/_static/image/project_panorama_en.png
  23. BIN
      next_docs/en/_static/image/project_panorama_zh_cn.png
  24. BIN
      next_docs/en/_static/image/spans_example.png
  25. BIN
      next_docs/en/_static/image/web_demo_1.png
  26. 0 88
      next_docs/en/additional_notes/faq.rst
  27. 0 14
      next_docs/en/additional_notes/glossary.rst
  28. 0 20
      next_docs/en/additional_notes/known_issues.rst
  29. 0 11
      next_docs/en/api.rst
  30. 0 44
      next_docs/en/api/data_reader_writer.rst
  31. 0 28
      next_docs/en/api/dataset.rst
  32. 0 33
      next_docs/en/api/io.rst
  33. 0 8
      next_docs/en/api/model_operators.rst
  34. 0 9
      next_docs/en/api/pipe_operators.rst
  35. 0 6
      next_docs/en/api/read_api.rst
  36. 0 10
      next_docs/en/api/schemas.rst
  37. 0 151
      next_docs/en/conf.py
  38. 0 111
      next_docs/en/index.rst
  39. 0 35
      next_docs/en/make.bat
  40. 0 12
      next_docs/en/user_guide.rst
  41. 0 19
      next_docs/en/user_guide/data.rst
  42. 0 236
      next_docs/en/user_guide/data/data_reader_writer.rst
  43. 0 40
      next_docs/en/user_guide/data/dataset.rst
  44. 0 25
      next_docs/en/user_guide/data/io.rst
  45. 0 106
      next_docs/en/user_guide/data/read_api.rst
  46. 0 144
      next_docs/en/user_guide/inference_result.rst
  47. 0 12
      next_docs/en/user_guide/install.rst
  48. 0 255
      next_docs/en/user_guide/install/boost_with_cuda.rst
  49. 0 168
      next_docs/en/user_guide/install/config.rst
  50. 0 37
      next_docs/en/user_guide/install/download_model_weight_files.rst
  51. 0 142
      next_docs/en/user_guide/install/install.rst
  52. 0 335
      next_docs/en/user_guide/pipe_result.rst
  53. 0 12
      next_docs/en/user_guide/quick_start.rst
  54. 0 47
      next_docs/en/user_guide/quick_start/convert_image.rst
  55. 0 60
      next_docs/en/user_guide/quick_start/convert_ms_office.rst
  56. 0 56
      next_docs/en/user_guide/quick_start/convert_pdf.rst
  57. 0 11
      next_docs/en/user_guide/tutorial.rst
  58. 0 412
      next_docs/en/user_guide/tutorial/output_file_description.rst
  59. 0 182
      next_docs/en/user_guide/tutorial/pipeline.rst
  60. 0 12
      next_docs/en/user_guide/usage.rst
  61. 0 279
      next_docs/en/user_guide/usage/api.rst
  62. 0 77
      next_docs/en/user_guide/usage/command_line.rst
  63. 0 24
      next_docs/en/user_guide/usage/docker.rst
  64. 0 17
      next_docs/requirements.txt
  65. 0 16
      next_docs/zh_cn/.readthedocs.yaml
  66. 0 20
      next_docs/zh_cn/Makefile
  67. BIN
      next_docs/zh_cn/_static/image/MinerU-logo-hq.png
  68. BIN
      next_docs/zh_cn/_static/image/MinerU-logo.png
  69. 0 13
      next_docs/zh_cn/_static/image/ReadTheDocs.svg
  70. BIN
      next_docs/zh_cn/_static/image/datalab_logo.png
  71. BIN
      next_docs/zh_cn/_static/image/flowchart_en.png
  72. BIN
      next_docs/zh_cn/_static/image/flowchart_zh_cn.png
  73. BIN
      next_docs/zh_cn/_static/image/inference_result.png
  74. BIN
      next_docs/zh_cn/_static/image/layout_example.png
  75. BIN
      next_docs/zh_cn/_static/image/logo.png
  76. 0 3
      next_docs/zh_cn/_static/image/pipeline.drawio.svg
  77. BIN
      next_docs/zh_cn/_static/image/poly.png
  78. BIN
      next_docs/zh_cn/_static/image/project_panorama_en.png
  79. BIN
      next_docs/zh_cn/_static/image/project_panorama_zh_cn.png
  80. BIN
      next_docs/zh_cn/_static/image/spans_example.png
  81. BIN
      next_docs/zh_cn/_static/image/web_demo_1.png
  82. 0 72
      next_docs/zh_cn/additional_notes/faq.rst
  83. 0 11
      next_docs/zh_cn/additional_notes/glossary.rst
  84. 0 13
      next_docs/zh_cn/additional_notes/known_issues.rst
  85. 0 151
      next_docs/zh_cn/conf.py
  86. 0 81
      next_docs/zh_cn/index.rst
  87. 0 35
      next_docs/zh_cn/make.bat
  88. 0 10
      next_docs/zh_cn/user_guide.rst
  89. 0 20
      next_docs/zh_cn/user_guide/data.rst
  90. 0 218
      next_docs/zh_cn/user_guide/data/data_reader_writer.rst
  91. 0 31
      next_docs/zh_cn/user_guide/data/dataset.rst
  92. 0 21
      next_docs/zh_cn/user_guide/data/io.rst
  93. 0 82
      next_docs/zh_cn/user_guide/data/read_api.rst
  94. 0 13
      next_docs/zh_cn/user_guide/install.rst
  95. 0 272
      next_docs/zh_cn/user_guide/install/boost_with_cuda.rst
  96. 0 64
      next_docs/zh_cn/user_guide/install/download_model_weight_files.rst
  97. 0 103
      next_docs/zh_cn/user_guide/install/install.rst
  98. 0 13
      next_docs/zh_cn/user_guide/quick_start.rst
  99. 0 61
      next_docs/zh_cn/user_guide/quick_start/command_line.rst
  100. 0 134
      next_docs/zh_cn/user_guide/quick_start/to_markdown.rst

+ 0 - 16
.readthedocs.yaml

@@ -1,16 +0,0 @@
-version: 2
-
-build:
-  os: ubuntu-22.04
-  tools:
-    python: "3.10"
-
-formats:
-  - epub
-
-python:
-  install:
-    - requirements: next_docs/zh_cn/requirements.txt
-
-sphinx:
-  configuration: next_docs/zh_cn/conf.py

+ 0 - 51
docs/README_Ascend_NPU_Acceleration_zh_CN.md

@@ -1,51 +0,0 @@
-# Ascend NPU Acceleration
-
-## Introduction
-
-This document describes how to use MinerU on Ascend NPUs. It has been verified on a `Huawei Atlas 800T A2` server.
-```
-CPU: Kunpeng 920 aarch64 2.6GHz
-NPU: Ascend 910B 64GB
-OS: openEuler 22.03 (LTS-SP3) / Ubuntu 22.04.5 LTS
-CANN: 8.0.RC2
-Driver version: 24.1.rc2.1
-```
-Because setting up an environment adapted to Ascend NPUs is fairly involved, running MinerU in a Docker container is recommended.
-
-Before running MinerU via Docker, make sure the host machine has drivers and firmware that support CANN 8.0.RC2 installed.
-
-
-## Build the Image
-Make sure your network connection is stable, then run the following commands to build the image.
-```bash
-wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/ascend_npu/Dockerfile -O Dockerfile
-docker build -t mineru_npu:latest .
-```
-If no errors are reported during the build, the image was built successfully.
-
-
-## Run the Container
-
-```bash
-docker run -it -u root --name mineru-npu --privileged=true \
-    --ipc=host \
-    --network=host \
-    --device=/dev/davinci0 \
-    --device=/dev/davinci1 \
-    --device=/dev/davinci2 \
-    --device=/dev/davinci3 \
-    --device=/dev/davinci4 \
-    --device=/dev/davinci5 \
-    --device=/dev/davinci6 \
-    --device=/dev/davinci7 \
-    --device=/dev/davinci_manager \
-    --device=/dev/devmm_svm \
-    --device=/dev/hisi_hdc \
-    -v /var/log/npu/:/usr/slog \
-    -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-    -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-    mineru_npu:latest \
-    /bin/bash -c "echo 'source /opt/mineru_venv/bin/activate' >> ~/.bashrc && exec bash"
-
-magic-pdf --help
-```

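The deleted Ascend guide above passes each NPU device to Docker by hand. As a sketch, a small hypothetical helper (not part of MinerU) can assemble those `--device` flags for a given device count, mirroring the removed `docker run` command:

```python
def npu_docker_args(num_devices: int) -> list:
    """Build the --device flags used in the deleted Ascend NPU guide.

    Hypothetical helper for illustration only; the device paths mirror
    the docker run command shown in the removed README.
    """
    args = ["--device=/dev/davinci%d" % i for i in range(num_devices)]
    # Management devices shared by all NPUs, as in the original command.
    args += [
        "--device=/dev/davinci_manager",
        "--device=/dev/devmm_svm",
        "--device=/dev/hisi_hdc",
    ]
    return args
```

For the 8-NPU server in the guide, `npu_docker_args(8)` reproduces the eleven `--device` flags of the original command.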
+ 0 - 111
docs/README_Ubuntu_CUDA_Acceleration_en_US.md

@@ -1,111 +0,0 @@
-# Ubuntu 22.04 LTS
-
-### 1. Check if NVIDIA Drivers Are Installed
-
-```sh
-nvidia-smi
-```
-
-If you see information similar to the following, it means that the NVIDIA drivers are already installed, and you can skip Step 2.
-
-> [!NOTE]
-> `CUDA Version` should be >= 12.4. If the displayed version is lower than 12.4, please upgrade the driver.
-
-```plaintext
-+---------------------------------------------------------------------------------------+
-| NVIDIA-SMI 570.133.07             Driver Version: 572.83         CUDA Version: 12.8   |
-|-----------------------------------------+----------------------+----------------------+
-| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
-| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
-|                                         |                      |               MIG M. |
-|=========================================+======================+======================|
-|   0  NVIDIA GeForce RTX 3060 Ti   WDDM  | 00000000:01:00.0  On |                  N/A |
-|  0%   51C    P8              12W / 200W |   1489MiB /  8192MiB |      5%      Default |
-|                                         |                      |                  N/A |
-+-----------------------------------------+----------------------+----------------------+
-```
-
-### 2. Install the Driver
-
-If no driver is installed, use the following command:
-
-```sh
-sudo apt-get update
-sudo apt-get install nvidia-driver-570-server
-```
-
-Install the proprietary driver and restart your computer after installation.
-
-```sh
-reboot
-```
-
-### 3. Install Anaconda
-
-If Anaconda is already installed, skip this step.
-
-```sh
-wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
-bash Anaconda3-2024.06-1-Linux-x86_64.sh
-```
-
-In the final step, enter `yes`, close the terminal, and reopen it.
-
-### 4. Create an Environment Using Conda
-
-```bash
-conda create -n mineru 'python=3.12' -y
-conda activate mineru
-```
-
-### 5. Install Applications
-
-```sh
-pip install -U magic-pdf[full]
-```
-> [!TIP]
-> After installation, you can check the version of `magic-pdf` using the following command:
->
-> ```sh
-> magic-pdf --version
-> ```
-
-
-### 6. Download Models
-
-
-Refer to detailed instructions on [how to download model files](how_to_download_models_en.md).
-
-
-### 7. Understand the Location of the Configuration File
-
-After completing the [6. Download Models](#6-download-models) step, the script will automatically generate a `magic-pdf.json` file in the user directory and configure the default model path.
-You can find the `magic-pdf.json` file in your user directory.
-
-> [!TIP]
-> The user directory for Linux is "/home/username".
-
-
-### 8. First Run
-
-Download a sample file from the repository and test it.
-
-```sh
-wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf
-magic-pdf -p small_ocr.pdf -o ./output
-```
-
-### 9. Test CUDA Acceleration
-
-If your graphics card has at least **6GB** of VRAM, follow these steps to test CUDA acceleration:
-
-1. Modify the value of `"device-mode"` in the `magic-pdf.json` configuration file located in your home directory.
-   ```json
-   {
-     "device-mode": "cuda"
-   }
-   ```
-2. Test CUDA acceleration with the following command:
-   ```sh
-   magic-pdf -p small_ocr.pdf -o ./output
-   ```

+ 0 - 115
docs/README_Ubuntu_CUDA_Acceleration_zh_CN.md

@@ -1,115 +0,0 @@
-# Ubuntu 22.04 LTS
-
-## 1. Check whether the NVIDIA driver is installed
-
-```bash
-nvidia-smi
-```
-
-If you see output similar to the following, the NVIDIA driver is already installed and you can skip step 2
-
-> [!NOTE]
-> The `CUDA Version` shown should be >= 12.4; if it is lower than 12.4, please upgrade the driver
-
-```plaintext
-+---------------------------------------------------------------------------------------+
-| NVIDIA-SMI 570.133.07             Driver Version: 572.83         CUDA Version: 12.8   |
-|-----------------------------------------+----------------------+----------------------+
-| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
-| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
-|                                         |                      |               MIG M. |
-|=========================================+======================+======================|
-|   0  NVIDIA GeForce RTX 3060 Ti   WDDM  | 00000000:01:00.0  On |                  N/A |
-|  0%   51C    P8              12W / 200W |   1489MiB /  8192MiB |      5%      Default |
-|                                         |                      |                  N/A |
-+-----------------------------------------+----------------------+----------------------+
-```
-
-## 2. Install the driver
-
-If no driver is installed, run the following commands
-
-```bash
-sudo apt-get update
-sudo apt-get install nvidia-driver-570-server
-```
-
-Install the proprietary driver, then reboot the machine once installation completes
-
-```bash
-reboot
-```
-
-## 3. Install Anaconda
-
-If conda is already installed, you can skip this step
-
-```bash
-wget -U NoSuchBrowser/1.0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
-bash Anaconda3-2024.06-1-Linux-x86_64.sh
-```
-
-Enter yes at the final prompt, then close and reopen the terminal
-
-## 4. Create an environment with conda
-
-```bash
-conda create -n mineru 'python=3.12' -y
-conda activate mineru
-```
-
-## 5. Install the application
-
-```bash
-pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
-```
-
-> [!TIP]
-> After the download completes, you can check the `magic-pdf` version with the following command:
->
-> ```bash
-> magic-pdf --version
-> ```
-
-
-## 6. Download the models
-
-
-See [how to download model files](how_to_download_models_zh_cn.md) for details
-
-## 7. Know where the configuration file is stored
-
-After completing step [6. Download the models](#6-下载模型), the script automatically generates a magic-pdf.json file in the user directory and configures the default model path.
-You can find the magic-pdf.json file in your user directory.
-
-> [!TIP]
-> On Linux, the user directory is "/home/username"
-
-## 8. First run
-
-Download a sample file from the repository and test it
-
-```bash
-wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/demo/pdfs/small_ocr.pdf
-magic-pdf -p small_ocr.pdf -o ./output
-```
-
-## 9. Test CUDA acceleration
-
-If your graphics card has at least **6GB** of VRAM, follow the steps below to test CUDA-accelerated parsing
-
-**1. Modify the value of "device-mode" in the magic-pdf.json configuration file in your user directory**
-
-```json
-{
-  "device-mode":"cuda"
-}
-```
-
-**2. Run the following command to test CUDA acceleration**
-
-```bash
-magic-pdf -p small_ocr.pdf -o ./output
-```
-> [!TIP]
-> Whether CUDA acceleration is active can be roughly judged from the per-stage cost times printed in the log; normally, CUDA-accelerated runs are faster than CPU.

+ 0 - 83
docs/README_Windows_CUDA_Acceleration_en_US.md

@@ -1,83 +0,0 @@
-# Windows 10/11
-
-### 1. Install CUDA and cuDNN
-
-You need to install a CUDA version that is compatible with torch's requirements. For details, please refer to the [official PyTorch website](https://pytorch.org/get-started/locally/).
-
-- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
-- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
-- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
-- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
-
-### 2. Install Anaconda
-
-If Anaconda is already installed, you can skip this step.
-
-Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86_64.exe
-
-### 3. Create an Environment Using Conda
-
-```bash
-conda create -n mineru 'python=3.12' -y
-conda activate mineru
-```
-
-### 4. Install Applications
-
-```
-pip install -U magic-pdf[full]
-```
-
-> [!IMPORTANT]
-> After installation, you can check the version of `magic-pdf` using the following command:
->
-> ```bash
-> magic-pdf --version
-> ```
-
-
-### 5. Download Models
-
-Refer to detailed instructions on [how to download model files](how_to_download_models_en.md).
-
-### 6. Understand the Location of the Configuration File
-
-After completing the [5. Download Models](#5-download-models) step, the script will automatically generate a `magic-pdf.json` file in the user directory and configure the default model path.
-You can find the `magic-pdf.json` file in your user directory.
-
-> [!TIP]
-> The user directory for Windows is "C:/Users/username".
-
-### 7. First Run
-
-Download a sample file from the repository and test it.
-
-```powershell
-  wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf -O small_ocr.pdf
-  magic-pdf -p small_ocr.pdf -o ./output
-```
-
-### 8. Test CUDA Acceleration
-
-If your graphics card has at least 6GB of VRAM, follow these steps to test CUDA-accelerated parsing performance.
-
-1. **Reinstall torch and torchvision** with CUDA support. (Please select the appropriate index-url for your CUDA version; for more details, refer to the [PyTorch official website](https://pytorch.org/get-started/locally/).)
-
-   ```
-   pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
-   ```
-
-2. **Modify the value of `"device-mode"`** in the `magic-pdf.json` configuration file located in your user directory.
-
-   ```json
-   {
-     "device-mode": "cuda"
-   }
-   ```
-
-
-3. **Run the following command to test CUDA acceleration**:
-
-   ```
-   magic-pdf -p small_ocr.pdf -o ./output
-   ```

+ 0 - 86
docs/README_Windows_CUDA_Acceleration_zh_CN.md

@@ -1,86 +0,0 @@
-# Windows 10/11
-
-## 1. Install the CUDA environment
-
-Install a CUDA version that meets torch's requirements; for details, see the [PyTorch website](https://pytorch.org/get-started/locally/)
-
-- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
-- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
-- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
-- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
-
-## 2. Install Anaconda
-
-If conda is already installed, you can skip this step
-
-Download link:
-https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Windows-x86_64.exe
-
-## 3. Create an environment with conda
-
-```bash
-conda create -n mineru 'python=3.12' -y
-conda activate mineru
-```
-
-## 4. Install the application
-
-```bash
-pip install -U magic-pdf[full] -i https://mirrors.aliyun.com/pypi/simple
-```
-
-> [!IMPORTANT]
-> After the download completes, you can check the magic-pdf version with the following command
->
-> ```bash
-> magic-pdf --version
-> ```
-
-
-## 5. Download the models
-
-See [how to download model files](how_to_download_models_zh_cn.md) for details
-
-## 6. Know where the configuration file is stored
-
-After completing step [5. Download the models](#5-下载模型), the script automatically generates a magic-pdf.json file in the user directory and configures the default model path.
-You can find the magic-pdf.json file in your user directory.
-
-> [!TIP]
-> On Windows, the user directory is "C:/Users/username"
-
-## 7. First run
-
-Download a sample file from the repository and test it
-
-```powershell
- wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf -O small_ocr.pdf
- magic-pdf -p small_ocr.pdf -o ./output
-```
-
-## 8. Test CUDA acceleration
-
-If your graphics card has at least **6GB** of VRAM, follow the steps below to test CUDA-accelerated parsing
-
-**1. Reinstall torch and torchvision with CUDA support** (select the appropriate index-url for your CUDA version; for details, see the [PyTorch website](https://pytorch.org/get-started/locally/))
-
-```bash
-pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
-```
-
-**2. Modify the value of "device-mode" in the magic-pdf.json configuration file in your user directory**
-
-```json
-{
-  "device-mode":"cuda"
-}
-```
-
-**3. Run the following command to test CUDA acceleration**
-
-```bash
-magic-pdf -p small_ocr.pdf -o ./output
-```
-
-> [!TIP]
-> Whether CUDA acceleration is active can be roughly judged from the per-stage times printed in the log; normally, CUDA-accelerated runs are faster than CPU.

+ 0 - 23
docs/how_to_download_models_en.md

@@ -1,23 +0,0 @@
-Model downloads are divided into initial downloads and updates to the model directory. Please refer to the corresponding documentation for instructions on how to proceed.
-
-
-# Initial download of model files
-
-### Download the Model from Hugging Face
-
-Use a Python Script to Download Model Files from Hugging Face
-```bash
-pip install huggingface_hub
-wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
-python download_models_hf.py
-```
-The Python script will automatically download the model files and configure the model directory in the configuration file.
-
-The configuration file can be found in the user directory, with the filename `magic-pdf.json`.
-
-
-# How to update models previously downloaded
-
-## 1. Models downloaded via Hugging Face or Model Scope
-
-If you previously downloaded models via Hugging Face or Model Scope, you can rerun the Python script used for the initial download. This will automatically update the model directory to the latest version.

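The deleted page above says the download script writes the model directory into `magic-pdf.json`, and the project's FAQ notes that `models-dir` must be an absolute path. A sketch of building that config entry with the same constraint (a hypothetical helper; the real `download_models_hf.py` may set additional keys):

```python
import os


def make_config_entry(models_dir: str) -> dict:
    """Build the "models-dir" entry the download script configures in
    magic-pdf.json. Illustrative sketch only.
    """
    if not os.path.isabs(models_dir):
        # The FAQ states this path must be absolute, not relative.
        raise ValueError("models-dir must be an absolute path, got %r" % models_dir)
    return {"models-dir": models_dir}
```

Running `pwd` inside the models directory, as the FAQ suggests, yields a suitable absolute path to pass in.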
+ 0 - 37
docs/how_to_download_models_zh_cn.md

@@ -1,37 +0,0 @@
-Model downloads are divided into the initial download and updating the model directory; please follow the corresponding section below
-
-# Initial download of model files
-
-Model files can be downloaded from Hugging Face or ModelScope. Due to network conditions, users in mainland China may fail to reach HF; please use ModelScope instead.
-
-<details>
-  <summary>Method 1: download the models from Hugging Face</summary>
-  <p>Use the Python script to download the model files from Hugging Face</p>
-  <pre><code>pip install huggingface_hub
-wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models_hf.py -O download_models_hf.py
-python download_models_hf.py</code></pre>
-  <p>The Python script automatically downloads the model files and configures the model directory in the configuration file</p>
-</details>
-
-## Method 2: download the models from ModelScope
-
-### Use the Python script to download the model files from ModelScope
-
-```bash
-pip install modelscope
-wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py
-python download_models.py
-```
-The Python script automatically downloads the model files and configures the model directory in the configuration file
-
-The configuration file can be found in the user directory, with the filename `magic-pdf.json`
-
-> [!TIP]
-> The Windows user directory is "C:\\Users\\username", the Linux user directory is "/home/username", and the macOS user directory is "/Users/username"
-
-
-# How to update models downloaded previously
-
-## 1. Models downloaded via Hugging Face or ModelScope
-
-If you previously downloaded models via Hugging Face or ModelScope, you can rerun the download script you used before; it will automatically update the model directory to the latest version.

+ 0 - 16
next_docs/en/.readthedocs.yaml

@@ -1,16 +0,0 @@
-version: 2
-
-build:
-  os: ubuntu-22.04
-  tools:
-    python: "3.10"
-
-formats:
-  - epub
-
-python:
-  install:
-    - requirements: next_docs/requirements.txt
-
-sphinx:
-  configuration: next_docs/en/conf.py

+ 0 - 20
next_docs/en/Makefile

@@ -1,20 +0,0 @@
-# Minimal makefile for Sphinx documentation
-#
-
-# You can set these variables from the command line, and also
-# from the environment for the first two.
-SPHINXOPTS    ?=
-SPHINXBUILD   ?= sphinx-build
-SOURCEDIR     = .
-BUILDDIR      = _build
-
-# Put it first so that "make" without argument is like "make help".
-help:
-	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
-
-.PHONY: help Makefile
-
-# Catch-all target: route all unknown targets to Sphinx using the new
-# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
-%: Makefile
-	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

BIN
next_docs/en/_static/image/MinerU-logo-hq.png


BIN
next_docs/en/_static/image/MinerU-logo.png


The file diff has been suppressed because it is too large
+ 0 - 13
next_docs/en/_static/image/ReadTheDocs.svg


BIN
next_docs/en/_static/image/datalab_logo.png


BIN
next_docs/en/_static/image/flowchart_en.png


BIN
next_docs/en/_static/image/flowchart_zh_cn.png


BIN
next_docs/en/_static/image/inference_result.png


BIN
next_docs/en/_static/image/layout_example.png


BIN
next_docs/en/_static/image/logo.png


The file diff has been suppressed because it is too large
+ 0 - 3
next_docs/en/_static/image/pipeline.drawio.svg


BIN
next_docs/en/_static/image/poly.png


BIN
next_docs/en/_static/image/project_panorama_en.png


BIN
next_docs/en/_static/image/project_panorama_zh_cn.png


BIN
next_docs/en/_static/image/spans_example.png


BIN
next_docs/en/_static/image/web_demo_1.png


+ 0 - 88
next_docs/en/additional_notes/faq.rst

@@ -1,88 +0,0 @@
-FAQ
-==========================
-
-1. When using the command ``pip install magic-pdf[full]`` on newer versions of macOS, the error ``zsh: no matches found: magic-pdf[full]`` occurs.
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-On macOS, the default shell has switched from Bash to Z shell, which has
-special handling logic for certain types of string matching. This can
-lead to the “no matches found” error. You can try disabling the globbing
-feature in the command line and then run the installation command again.
-
-.. code:: bash
-
-   setopt no_nomatch
-   pip install magic-pdf[full]
-
-2. Encountering the error ``pickle.UnpicklingError: invalid load key, 'v'.`` during use
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-This might be due to an incomplete download of the model file. You can
-try re-downloading the model file and then try again. Reference:
-https://github.com/opendatalab/MinerU/issues/143
-
-3. Where should the model files be downloaded and how should the ``/models-dir`` configuration be set?
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The path for the model files is configured in “magic-pdf.json”, like
-so:
-
-.. code:: json
-
-   {
-     "models-dir": "/tmp/models"
-   }
-
-This path must be absolute, not relative. You can obtain the absolute
-path by running the “pwd” command inside the models directory.
-Reference:
-https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874
-
-4. Encountered the error ``ImportError: libGL.so.1: cannot open shared object file: No such file or directory`` in Ubuntu 22.04 on WSL2
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-The ``libgl`` library is missing in Ubuntu 22.04 on WSL2. You can
-install the ``libgl`` library with the following command to resolve the
-issue:
-
-.. code:: bash
-
-   sudo apt-get install libgl1-mesa-glx
-
-Reference: https://github.com/opendatalab/MinerU/issues/388
-
-5. Encountered error ``ModuleNotFoundError: No module named 'fairscale'``
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-You need to uninstall the module and reinstall it:
-
-.. code:: bash
-
-   pip uninstall fairscale
-   pip install fairscale
-
-Reference: https://github.com/opendatalab/MinerU/issues/411
-
-6. On some newer devices like the H100, the text parsed during OCR using CUDA acceleration is garbled.
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-CUDA 11 has poor compatibility with newer graphics cards, so the
-CUDA version used by Paddle needs to be upgraded.
-
-.. code:: bash
-
-   pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
-
-Reference: https://github.com/opendatalab/MinerU/issues/558
-
-
-7. On some Linux servers, the program immediately reports an error ``Illegal instruction (core dumped)``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-This might be because the server's CPU does not support the AVX/AVX2
-instruction set, or the CPU itself supports it but has been disabled by
-the system administrator. You can try contacting the system
-administrator to remove the restriction or change to a different server.
-
-References: https://github.com/opendatalab/MinerU/issues/591 ,
-https://github.com/opendatalab/MinerU/issues/736

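FAQ item 7 in the deleted file above attributes the ``Illegal instruction`` crash to a missing AVX/AVX2 instruction set. A sketch of checking for those flags before contacting the administrator (Linux-only; the function takes the text of ``/proc/cpuinfo`` so it can be tested on sample input):

```python
def cpu_supports(cpuinfo_text: str, flag: str) -> bool:
    """Check whether a CPU flag (e.g. "avx", "avx2") appears in the
    flags line of /proc/cpuinfo output. Sketch for the FAQ above; on a
    real machine, pass open("/proc/cpuinfo").read().
    """
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Flags are space-separated tokens after the colon.
            return flag in line.split(":", 1)[1].split()
    return False
```

If both ``cpu_supports(text, "avx")`` and ``cpu_supports(text, "avx2")`` return False, the crash described in the FAQ is the likely cause.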
+ 0 - 14
next_docs/en/additional_notes/glossary.rst

@@ -1,14 +0,0 @@
-
-
-Glossary 
-===========
-
-1. jsonl
-    Newline-delimited (\n); each line must be a valid, self-contained JSON object.
-    Currently, all functions shipped with **MinerU** assume that each JSON object contains a field named either **path** or **file_location**.
-
-
-2. magic-pdf.json 
-    TODO
-
-

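The glossary entry above describes the jsonl convention MinerU expects. A minimal sketch of validating input against it, with the field names taken from that entry:

```python
import json


def parse_jsonl(text: str) -> list:
    """Parse newline-delimited JSON, requiring each object to carry a
    'path' or 'file_location' field as the glossary above describes.
    Illustrative sketch, not MinerU's own loader.
    """
    records = []
    for n, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        obj = json.loads(line)
        if "path" not in obj and "file_location" not in obj:
            raise ValueError("line %d: missing 'path' or 'file_location'" % n)
        records.append(obj)
    return records
```

Either field name is accepted, matching the "one field named with either" wording of the entry.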
+ 0 - 20
next_docs/en/additional_notes/known_issues.rst

@@ -1,20 +0,0 @@
-Known Issues
-============
-
--  Reading order is determined by the model based on the spatial
-   distribution of readable content, and may be out of order in some
-   areas under extremely complex layouts.
--  Vertical text is not supported.
--  Tables of contents and lists are recognized through rules, and some
-   uncommon list formats may not be recognized.
--  Only one level of headings is supported; hierarchical headings are
-   not currently supported.
--  Code blocks are not yet supported in the layout model.
--  Comic books, art albums, primary school textbooks, and exercises
-   cannot be parsed well.
--  Table recognition may result in row/column recognition errors in
-   complex tables.
--  OCR recognition may produce inaccurate characters in PDFs of
-   lesser-known languages (e.g., diacritical marks in Latin script,
-   easily confused characters in Arabic script).
--  Some formulas may not render correctly in Markdown.

+ 0 - 11
next_docs/en/api.rst

@@ -1,11 +0,0 @@
-
-.. toctree::
-   :maxdepth: 2
-
-   api/dataset
-   api/data_reader_writer
-   api/read_api
-   api/schemas
-   api/io
-   api/pipe_operators
-   api/model_operators

+ 0 - 44
next_docs/en/api/data_reader_writer.rst

@@ -1,44 +0,0 @@
-
-Data Reader Writer
-===================
-
-.. autoclass:: magic_pdf.data.data_reader_writer.DataReader
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.data_reader_writer.DataWriter
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.data_reader_writer.S3DataReader
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.data_reader_writer.S3DataWriter
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.data_reader_writer.FileBasedDataReader
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.data_reader_writer.FileBasedDataWriter
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.data_reader_writer.MultiBucketS3DataReader
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.data_reader_writer.MultiBucketS3DataWriter
-   :members:
-   :inherited-members:
-   :show-inheritance:
-

+ 0 - 28
next_docs/en/api/dataset.rst

@@ -1,28 +0,0 @@
-Dataset
-========
-
-.. autoclass:: magic_pdf.data.dataset.PageableData
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-
-.. autoclass:: magic_pdf.data.dataset.Dataset
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.dataset.ImageDataset
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.dataset.PymuDocDataset
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.dataset.Doc
-   :members:
-   :inherited-members:
-   :show-inheritance:

+ 0 - 33
next_docs/en/api/io.rst

@@ -1,33 +0,0 @@
-IO
-==
-
-.. autoclass:: magic_pdf.data.io.base.IOReader
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.io.base.IOWriter
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.io.s3.S3Reader
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.io.s3.S3Writer
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.io.http.HttpReader
-   :members:
-   :inherited-members:
-   :show-inheritance:
-
-.. autoclass:: magic_pdf.data.io.http.HttpWriter
-   :members:
-   :inherited-members:
-   :show-inheritance:
-

+ 0 - 8
next_docs/en/api/model_operators.rst

@@ -1,8 +0,0 @@
-
-Model Api
-==========
-
-.. autoclass:: magic_pdf.operators.InferenceResultBase
-   :members:
-   :inherited-members:
-   :show-inheritance:

+ 0 - 9
next_docs/en/api/pipe_operators.rst

@@ -1,9 +0,0 @@
-
-
-Pipeline Api
-=============
-
-.. autoclass:: magic_pdf.operators.pipes.PipeResult
-   :members:
-   :inherited-members:
-   :show-inheritance:

+ 0 - 6
next_docs/en/api/read_api.rst

@@ -1,6 +0,0 @@
-read_api
-=========
-
-.. automodule:: magic_pdf.data.read_api
-   :members:
-   :inherited-members:

+ 0 - 10
next_docs/en/api/schemas.rst

@@ -1,10 +0,0 @@
-
-schemas 
-===========
-
-.. autopydantic_model:: magic_pdf.data.schemas.S3Config
-   :members:
-
-.. autopydantic_model:: magic_pdf.data.schemas.PageInfo
-   :members:
-

+ 0 - 151
next_docs/en/conf.py

@@ -1,151 +0,0 @@
-# Configuration file for the Sphinx documentation builder.
-#
-# This file only contains a selection of the most common options. For a full
-# list see the documentation:
-# https://www.sphinx-doc.org/en/master/usage/configuration.html
-
-# -- Path setup --------------------------------------------------------------
-
-# If extensions (or modules to document with autodoc) are in another directory,
-# add these directories to sys.path here. If the directory is relative to the
-# documentation root, use os.path.abspath to make it absolute, like shown here.
-
-import os
-import subprocess
-import sys
-
-from sphinx.ext import autodoc
-from docutils import nodes
-from docutils.parsers.rst import Directive
-
-def install(package):
-    subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
-
-
-requirements_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'requirements.txt'))
-if os.path.exists(requirements_path):
-    with open(requirements_path) as f:
-        packages = f.readlines()
-    for package in packages:
-        install(package.strip())
-
-sys.path.insert(0, os.path.abspath('../..'))
-
-# -- Project information -----------------------------------------------------
-
-project = 'MinerU'
-copyright = '2024, MinerU Contributors'
-author = 'OpenDataLab'
-
-# The full version, including alpha/beta/rc tags
-version_file = '../../magic_pdf/libs/version.py'
-with open(version_file) as f:
-    exec(compile(f.read(), version_file, 'exec'))
-__version__ = locals()['__version__']
-# The short X.Y version
-version = __version__
-# The full version, including alpha/beta/rc tags
-release = __version__
-
-# -- General configuration ---------------------------------------------------
-
-# Add any Sphinx extension module names here, as strings. They can be
-# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
-# ones.
-extensions = [
-    'sphinx.ext.napoleon',
-    'sphinx.ext.viewcode',
-    'sphinx.ext.intersphinx',
-    'sphinx_copybutton',
-    'sphinx.ext.autodoc',
-    'sphinx.ext.autosummary',
-    'sphinx.ext.inheritance_diagram',
-    'myst_parser',
-    'sphinxarg.ext',
-    'sphinxcontrib.autodoc_pydantic',
-]
-
-# class hierarchy diagram
-inheritance_graph_attrs = dict(rankdir="LR", size='"8.0, 12.0"', fontsize=14, ratio='compress')
-inheritance_node_attrs = dict(shape='ellipse', fontsize=14, height=0.75)
-inheritance_edge_attrs = dict(arrow='vee')
-
-autodoc_pydantic_model_show_json = True
-autodoc_pydantic_model_show_config_summary = False
-
-# Add any paths that contain templates here, relative to this directory.
-templates_path = ['_templates']
-
-# List of patterns, relative to source directory, that match files and
-# directories to ignore when looking for source files.
-# This pattern also affects html_static_path and html_extra_path.
-exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
-
-# Exclude the prompt "$" when copying code
-copybutton_prompt_text = r'\$ '
-copybutton_prompt_is_regexp = True
-
-language = 'en'
-
-# -- Options for HTML output -------------------------------------------------
-
-# The theme to use for HTML and HTML Help pages.  See the documentation for
-# a list of builtin themes.
-#
-html_theme = 'sphinx_book_theme'
-html_logo = '_static/image/logo.png'
-html_theme_options = {
-    'path_to_docs': 'next_docs/en',
-    'repository_url': 'https://github.com/opendatalab/MinerU',
-    'use_repository_button': True,
-}
-# Add any paths that contain custom static files (such as style sheets) here,
-# relative to this directory. They are copied after the builtin static files,
-# so a file named "default.css" will overwrite the builtin "default.css".
-# html_static_path = ['_static']
-
-# Mock out external dependencies here.
-autodoc_mock_imports = [
-    'cpuinfo',
-    'torch',
-    'transformers',
-    'psutil',
-    'prometheus_client',
-    'sentencepiece',
-    'vllm.cuda_utils',
-    'vllm._C',
-    # 'numpy',
-    'tqdm',
-]
-
-
-class MockedClassDocumenter(autodoc.ClassDocumenter):
-    """Remove note about base class when a class is derived from object."""
-
-    def add_line(self, line: str, source: str, *lineno: int) -> None:
-        if line == '   Bases: :py:class:`object`':
-            return
-        super().add_line(line, source, *lineno)
-
-
-autodoc.ClassDocumenter = MockedClassDocumenter
-
-navigation_with_keys = False
-
-
-# add custom directive 
-
-
-class VideoDirective(Directive):
-    required_arguments = 1
-    optional_arguments = 0
-    final_argument_whitespace = True
-    option_spec = {}
-
-    def run(self):
-        url = self.arguments[0]
-        video_node = nodes.raw('', f'<iframe width="560" height="315" src="{url}" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>', format='html')
-        return [video_node]
-
-def setup(app):
-    app.add_directive('video', VideoDirective)

+ 0 - 111
next_docs/en/index.rst

@@ -1,111 +0,0 @@
-.. xtuner documentation master file, created by
-   sphinx-quickstart on Tue Jan  9 16:33:06 2024.
-   You can adapt this file completely to your liking, but it should at least
-   contain the root `toctree` directive.
-
-Welcome to the MinerU Documentation
-==============================================
-
-.. figure:: ./_static/image/logo.png
-  :align: center
-  :alt: mineru
-  :class: no-scaled-link
-
-.. raw:: html
-
-   <p style="text-align:center">
-   <strong>A one-stop, open-source, high-quality data extraction tool
-   </strong>
-   </p>
-
-   <p style="text-align:center">
-   <script async defer src="https://buttons.github.io/buttons.js"></script>
-   <a class="github-button" href="https://github.com/opendatalab/MinerU" data-show-count="true" data-size="large" aria-label="Star">Star</a>
-   <a class="github-button" href="https://github.com/opendatalab/MinerU/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
-   <a class="github-button" href="https://github.com/opendatalab/MinerU/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
-   </p>
-
-
-Project Introduction
---------------------
-
-MinerU is a tool that converts PDFs into machine-readable formats (e.g.,
-markdown, JSON), allowing for easy extraction into any format. MinerU
-was born during the pre-training process of
-`InternLM <https://github.com/InternLM/InternLM>`__. We focus on solving
-symbol conversion issues in scientific literature and hope to contribute
-to technological development in the era of large models. Compared to
-well-known commercial products, MinerU is still young. If you encounter
-any issues or if the results are not as expected, please `submit an
-issue <https://github.com/opendatalab/MinerU/issues>`__ and **attach
-the relevant PDF**.
-
-.. video:: https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
-
-
-Key Features
-------------
-
--  Remove headers, footers, footnotes, page numbers, etc., to ensure
-   semantic coherence.
--  Output text in human-readable order, suitable for single-column,
-   multi-column, and complex layouts.
--  Preserve the structure of the original document, including headings,
-   paragraphs, lists, etc.
--  Extract images, image descriptions, tables, table titles, and
-   footnotes.
--  Automatically recognize and convert formulas in the document to LaTeX
-   format.
--  Automatically recognize and convert tables in the document to LaTeX
-   or HTML format.
--  Automatically detect scanned PDFs and garbled PDFs and enable OCR
-   functionality.
--  OCR supports detection and recognition of 84 languages.
--  Supports multiple output formats, such as multimodal and NLP
-   Markdown, JSON sorted by reading order, and rich intermediate
-   formats.
--  Supports various visualization results, including layout
-   visualization and span visualization, for efficient confirmation of
-   output quality.
--  Supports both CPU and GPU environments.
--  Compatible with Windows, Linux, and Mac platforms.
-
-
-.. tip::
-
-   Get started with MinerU by trying the `online demo <https://www.modelscope.cn/studios/OpenDataLab/MinerU>`_ or :doc:`installing it locally <user_guide/install/install>`.
-
-
-User Guide
--------------
-.. toctree::
-   :maxdepth: 2
-   :caption: User Guide
-
-   user_guide
-
-
-API Reference
--------------
-
-If you are looking for information on a specific function, class or
-method, this part of the documentation is for you.
-
-.. toctree::
-   :maxdepth: 2
-   :caption: API
-
-   api
-
-
-Additional Notes
-------------------
-.. toctree::
-   :maxdepth: 1
-   :caption: Additional Notes
-
-   additional_notes/known_issues
-   additional_notes/faq
-   additional_notes/glossary
-
-

+ 0 - 35
next_docs/en/make.bat

@@ -1,35 +0,0 @@
-@ECHO OFF
-
-pushd %~dp0
-
-REM Command file for Sphinx documentation
-
-if "%SPHINXBUILD%" == "" (
-	set SPHINXBUILD=sphinx-build
-)
-set SOURCEDIR=.
-set BUILDDIR=_build
-
-%SPHINXBUILD% >NUL 2>NUL
-if errorlevel 9009 (
-	echo.
-	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
-	echo.installed, then set the SPHINXBUILD environment variable to point
-	echo.to the full path of the 'sphinx-build' executable. Alternatively you
-	echo.may add the Sphinx directory to PATH.
-	echo.
-	echo.If you don't have Sphinx installed, grab it from
-	echo.https://www.sphinx-doc.org/
-	exit /b 1
-)
-
-if "%1" == "" goto help
-
-%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
-goto end
-
-:help
-%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
-
-:end
-popd

+ 0 - 12
next_docs/en/user_guide.rst

@@ -1,12 +0,0 @@
-
-
-.. toctree::
-    :maxdepth: 2
-
-    user_guide/install
-    user_guide/usage
-    user_guide/quick_start
-    user_guide/tutorial
-    user_guide/data
-    user_guide/inference_result
-    user_guide/pipe_result

+ 0 - 19
next_docs/en/user_guide/data.rst

@@ -1,19 +0,0 @@
-
-
-Data
-=========
-
-.. toctree::
-   :maxdepth: 2
-
-   data/dataset
-
-   data/read_api
-
-   data/data_reader_writer 
-
-   data/io
-
-
-
-

+ 0 - 236
next_docs/en/user_guide/data/data_reader_writer.rst

@@ -1,236 +0,0 @@
-
-Data Reader Writer 
-====================
-
-Reads or writes bytes to and from different media. If MinerU does not provide a suitable class
-for your scenario, you can implement your own; the only requirement is to inherit from
-``DataReader`` or ``DataWriter``.
-
-.. code:: python
-
-    class SomeReader(DataReader):
-        def read(self, path: str) -> bytes:
-            pass
-
-        def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
-            pass
-
-
-    class SomeWriter(DataWriter):
-        def write(self, path: str, data: bytes) -> None:
-            pass
-
-        def write_string(self, path: str, data: str) -> None:
-            pass
-
-
-Readers may wonder how this section differs from :doc:`io`, since the two look very similar at first glance.
-:doc:`io` provides the fundamental I/O primitives, while this section works at the application level:
-users can build their own application-specific classes, which may share the same underlying I/O functions. That is why :doc:`io` exists as a separate layer.
-
-
-Important Classes
------------------
-
-.. code:: python
-
-    class FileBasedDataReader(DataReader):
-        def __init__(self, parent_dir: str = ''):
-            pass
-
-
-    class FileBasedDataWriter(DataWriter):
-        def __init__(self, parent_dir: str = '') -> None:
-            pass
-
-``FileBasedDataReader`` is initialized with a single parameter, ``parent_dir``. Every method that ``FileBasedDataReader`` provides then behaves as follows.
-
-Features:
-    #. An absolute path is read as-is; ``parent_dir`` is ignored.
-    #. A relative path is first joined with ``parent_dir``, and the content is read from the merged path.
-
-
-.. note::
-
-    ``FileBasedDataWriter`` shares the same behavior as ``FileBasedDataReader``.
-
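The path-resolution rule above can be sketched as a small, self-contained helper. This is an illustrative re-implementation of the documented behavior, not the actual MinerU code:

```python
import os

def resolve(parent_dir: str, path: str) -> str:
    """Mimic FileBasedDataReader's documented path handling:
    absolute paths ignore parent_dir, relative paths are joined to it."""
    if os.path.isabs(path):
        return path
    return os.path.join(parent_dir, path)

# absolute path: parent_dir is ignored
print(resolve("/tmp", "/tmp/logs/message.txt"))  # /tmp/logs/message.txt
# relative path: joined with parent_dir
print(resolve("/tmp", "abc"))  # /tmp/abc
```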
-
-.. code:: python 
-
-    class MultiS3Mixin:
-        def __init__(self, default_prefix: str, s3_configs: list[S3Config]):
-            pass
-
-    class MultiBucketS3DataReader(DataReader, MultiS3Mixin):
-        pass
-
-Every read-related method that ``MultiBucketS3DataReader`` provides behaves as follows.
-
-Features:
-    #. An object given with a full s3-format path, for example ``s3://test_bucket/test_object``, is read as-is; ``default_prefix`` is ignored.
-    #. An object given with a relative path is joined with ``default_prefix`` before reading. ``bucket_name`` is the first element obtained by splitting ``default_prefix`` on the delimiter ``/``, and it selects the matching bucket configuration.
-
-.. note::
-    ``MultiBucketS3DataWriter`` shares the same behavior with ``MultiBucketS3DataReader``
-
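The prefix rules above can be sketched as follows; this is illustrative only (the real reader additionally resolves the bucket name to its matching ``S3Config``):

```python
def split_default_prefix(default_prefix: str) -> tuple[str, str]:
    """bucket_name is the first element after splitting on '/';
    the remainder is the prefix joined before relative paths."""
    bucket_name, _, rest = default_prefix.partition("/")
    return bucket_name, rest

def full_s3_path(default_prefix: str, path: str) -> str:
    """Full s3:// paths pass through unchanged; relative paths
    are prefixed with the default prefix."""
    if path.startswith("s3://"):
        return path
    return f"s3://{default_prefix}/{path}"

print(split_default_prefix("bucket/test/unittest"))        # ('bucket', 'test/unittest')
print(full_s3_path("bucket/test/unittest", "abc"))         # s3://bucket/test/unittest/abc
print(full_s3_path("bucket/test/unittest", "s3://b2/efg")) # s3://b2/efg
```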
-
-.. code:: python
-
-    class S3DataReader(MultiBucketS3DataReader):
-        pass
-
-``S3DataReader`` is built on top of ``MultiBucketS3DataReader`` but supports only a single bucket; the same applies to ``S3DataWriter``.
-
-
-Read Examples
--------------
-
-.. code:: python
-
-    import os 
-    from magic_pdf.data.data_reader_writer import *
-    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
-    from magic_pdf.data.schemas import S3Config
-
-    # file based related
-    file_based_reader1 = FileBasedDataReader('')
-
-    ## will read file abc
-    file_based_reader1.read('abc')
-
-    file_based_reader2 = FileBasedDataReader('/tmp')
-
-    ## will read /tmp/abc
-    file_based_reader2.read('abc')
-
-    ## will read /tmp/logs/message.txt
-    file_based_reader2.read('/tmp/logs/message.txt')
-
-    # multi bucket s3 related
-    bucket = "bucket"               # replace with real bucket
-    ak = "ak"                       # replace with real access key
-    sk = "sk"                       # replace with real secret key
-    endpoint_url = "endpoint_url"   # replace with real endpoint_url
-
-    bucket_2 = "bucket_2"               # replace with real bucket
-    ak_2 = "ak_2"                       # replace with real access key
-    sk_2 = "sk_2"                       # replace with real secret key 
-    endpoint_url_2 = "endpoint_url_2"   # replace with real endpoint_url
-
-    test_prefix = 'test/unittest'
-    multi_bucket_s3_reader1 = MultiBucketS3DataReader(f"{bucket}/{test_prefix}", [S3Config(
-            bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
-        ),
-        S3Config(
-            bucket_name=bucket_2,
-            access_key=ak_2,
-            secret_key=sk_2,
-            endpoint_url=endpoint_url_2,
-        )])
-
-    ## will read s3://{bucket}/{test_prefix}/abc
-    multi_bucket_s3_reader1.read('abc')
-
-    ## will read s3://{bucket}/{test_prefix}/efg
-    multi_bucket_s3_reader1.read(f's3://{bucket}/{test_prefix}/efg')
-
-    ## will read s3://{bucket_2}/{test_prefix}/abc
-    multi_bucket_s3_reader1.read(f's3://{bucket_2}/{test_prefix}/abc')
-
-    # s3 related
-    s3_reader1 = S3DataReader(
-        test_prefix,
-        bucket,
-        ak,
-        sk,
-        endpoint_url
-    )
-
-    ## will read s3://{bucket}/{test_prefix}/abc
-    s3_reader1.read('abc')
-
-    ## will read s3://{bucket}/efg
-    s3_reader1.read(f's3://{bucket}/efg')
-
-
-Write Examples
----------------
-
-.. code:: python
-
-    import os
-    from magic_pdf.data.data_reader_writer import *
-    from magic_pdf.data.data_reader_writer import MultiBucketS3DataWriter
-    from magic_pdf.data.schemas import S3Config
-
-    # file based related
-    file_based_writer1 = FileBasedDataWriter("")
-
-    ## will write 123 to abc
-    file_based_writer1.write("abc", "123".encode())
-
-    ## will write 123 to abc
-    file_based_writer1.write_string("abc", "123")
-
-    file_based_writer2 = FileBasedDataWriter("/tmp")
-
-    ## will write 123 to /tmp/abc
-    file_based_writer2.write_string("abc", "123")
-
-    ## will write 123 to /tmp/logs/message.txt
-    file_based_writer2.write_string("/tmp/logs/message.txt", "123")
-
-    # multi bucket s3 related
-    bucket = "bucket"               # replace with real bucket
-    ak = "ak"                       # replace with real access key
-    sk = "sk"                       # replace with real secret key
-    endpoint_url = "endpoint_url"   # replace with real endpoint_url
-
-    bucket_2 = "bucket_2"               # replace with real bucket
-    ak_2 = "ak_2"                       # replace with real access key
-    sk_2 = "sk_2"                       # replace with real secret key 
-    endpoint_url_2 = "endpoint_url_2"   # replace with real endpoint_url
-
-    test_prefix = "test/unittest"
-    multi_bucket_s3_writer1 = MultiBucketS3DataWriter(
-        f"{bucket}/{test_prefix}",
-        [
-            S3Config(
-                bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
-            ),
-            S3Config(
-                bucket_name=bucket_2,
-                access_key=ak_2,
-                secret_key=sk_2,
-                endpoint_url=endpoint_url_2,
-            ),
-        ],
-    )
-
-    ## will write 123 to s3://{bucket}/{test_prefix}/abc
-    multi_bucket_s3_writer1.write_string("abc", "123")
-
-    ## will write 123 to s3://{bucket}/{test_prefix}/abc
-    multi_bucket_s3_writer1.write("abc", "123".encode())
-
-    ## will write 123 to s3://{bucket}/{test_prefix}/efg
-    multi_bucket_s3_writer1.write(f"s3://{bucket}/{test_prefix}/efg", "123".encode())
-
-    ## will write 123 to s3://{bucket_2}/{test_prefix}/abc
-    multi_bucket_s3_writer1.write(f's3://{bucket_2}/{test_prefix}/abc', '123'.encode())
-
-    # s3 related
-    s3_writer1 = S3DataWriter(test_prefix, bucket, ak, sk, endpoint_url)
-
-    ## will write 123 to s3://{bucket}/{test_prefix}/abc
-    s3_writer1.write("abc", "123".encode())
-
-    ## will write 123 to s3://{bucket}/{test_prefix}/abc
-    s3_writer1.write_string("abc", "123")
-
-    ## will write 123 to s3://{bucket}/efg
-    s3_writer1.write(f"s3://{bucket}/efg", "123".encode())
-
-
-
-Check :doc:`../../api/data_reader_writer` for more details

+ 0 - 40
next_docs/en/user_guide/data/dataset.rst

@@ -1,40 +0,0 @@
-
-
-Dataset 
-===========
-
-
-Import Classes 
------------------
-
-Dataset 
-^^^^^^^^
-
-Each PDF or image forms one ``Dataset``. PDFs fall into two categories, which map to the parse methods :ref:`digital_method_section` and :ref:`ocr_method_section`.
-Images yield an ``ImageDataset`` (a subclass of ``Dataset``), while PDF files yield a ``PymuDocDataset``.
-The difference is that ``ImageDataset`` supports only the ``OCR`` parse method,
-while ``PymuDocDataset`` supports both ``OCR`` and ``TXT``.
-
-.. note::
-
-    Some PDFs are generated from images and therefore cannot be parsed with the ``TXT`` method. Currently the user is responsible for not applying ``TXT`` to such PDFs.
-
-
-
-Pdf Parse Methods
-------------------
-
-.. _ocr_method_section:
-
-OCR
-^^^^
-
-Extract characters via Optical Character Recognition (OCR) technology.
-
-.. _digital_method_section:
-
-TXT
-^^^^
-
-Extract characters via a third-party library; currently we use ``pymupdf``.
-
-
-
-Check :doc:`../../api/dataset` for more details
-

+ 0 - 25
next_docs/en/user_guide/data/io.rst

@@ -1,25 +0,0 @@
-
-IO
-===
-
-Reads or writes bytes to and from different media. Currently we provide ``S3Reader`` and ``S3Writer`` for AWS S3-compatible media,
-and ``HttpReader`` and ``HttpWriter`` for remote HTTP files. If MinerU does not provide a suitable class
-for your scenario, you can implement your own; the only requirement is to inherit from
-``IOReader`` or ``IOWriter``.
-
-.. code:: python
-
-    class SomeReader(IOReader):
-        def read(self, path: str) -> bytes:
-            pass
-
-        def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
-            pass
-
-
-    class SomeWriter(IOWriter):
-        def write(self, path: str, data: bytes) -> None:
-            pass
-
-Check :doc:`../../api/io` for more details
-

+ 0 - 106
next_docs/en/user_guide/data/read_api.rst

@@ -1,106 +0,0 @@
-
-read_api 
-==========
-
-Read content from a file or directory to create a ``Dataset``. We currently provide several functions that cover common scenarios.
-If you have a new scenario that is common to most users, please post it on the official GitHub issues with a detailed description.
-It is also easy to implement your own read-related functions.
-
-
-Important Functions
--------------------
-
-
-read_jsonl
-^^^^^^^^^^^^^^^^
-
-Read content from a JSONL file, which may be located on the local machine or on remote S3. To learn more about JSONL, see :doc:`../../additional_notes/glossary`.
-
-.. code:: python
-
-    from magic_pdf.data.read_api import *
-    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
-    from magic_pdf.data.schemas import S3Config
-
-    # read jsonl from local machine
-    datasets = read_jsonl("tt.jsonl", None)   # replace with real jsonl file
-
-    # read jsonl from remote s3
-
-    bucket = "bucket_1"                     # replace with real s3 bucket
-    ak = "access_key_1"                     # replace with real s3 access key
-    sk = "secret_key_1"                     # replace with real s3 secret key
-    endpoint_url = "endpoint_url_1"         # replace with real s3 endpoint url
-
-    bucket_2 = "bucket_2"                   # replace with real s3 bucket
-    ak_2 = "access_key_2"                   # replace with real s3 access key
-    sk_2 = "secret_key_2"                   # replace with real s3 secret key
-    endpoint_url_2 = "endpoint_url_2"       # replace with real s3 endpoint url
-
-    s3configs = [
-        S3Config(
-            bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
-        ),
-        S3Config(
-            bucket_name=bucket_2,
-            access_key=ak_2,
-            secret_key=sk_2,
-            endpoint_url=endpoint_url_2,
-        ),
-    ]
-
-    s3_reader = MultiBucketS3DataReader(bucket, s3configs)
-
-    datasets = read_jsonl(f"s3://{bucket}/tt.jsonl", s3_reader)  # replace with real s3 jsonl file
-
-read_local_pdfs
-^^^^^^^^^^^^^^^^^
-
-Read PDFs from a path or directory.
-
-
-.. code:: python
-
-    from magic_pdf.data.read_api import *
-
-    # read pdf path
-    datasets = read_local_pdfs("tt.pdf")
-
-    # read pdfs under directory
-    datasets = read_local_pdfs("pdfs/")
-
-
-read_local_images
-^^^^^^^^^^^^^^^^^^^
-
-Read images from a path or directory.
-
-.. code:: python 
-
-    from magic_pdf.data.read_api import *
-
-    # read from image path 
-    datasets = read_local_images("tt.png")  # replace with real file path
-
-    # read files under the directory whose names end with one of the given suffixes
-    datasets = read_local_images("images/", suffixes=[".png", ".jpg"])  # replace with real directory 
-
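The directory scan with suffix filtering can be sketched with standard-library tools; ``list_by_suffix`` is a hypothetical helper for illustration, not part of magic-pdf:

```python
import os
import tempfile

def list_by_suffix(directory: str, suffixes: list[str]) -> list[str]:
    """Return sorted paths of files under `directory` whose names
    end with one of `suffixes`."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if any(name.endswith(s) for s in suffixes)
    )

# demo on a throwaway directory
with tempfile.TemporaryDirectory() as d:
    for name in ("a.png", "b.jpg", "c.txt"):
        open(os.path.join(d, name), "w").close()
    found = list_by_suffix(d, [".png", ".jpg"])
    print([os.path.basename(p) for p in found])  # ['a.png', 'b.jpg']
```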
-
-read_local_office
-^^^^^^^^^^^^^^^^^^^^
-Read MS-Office files from a path or directory.
-
-.. code:: python 
-
-    from magic_pdf.data.read_api import *
-
-    # read from an office file path
-    datasets = read_local_office("tt.doc")  # replace with real file path
-
-    # read all office files under the directory
-    datasets = read_local_office("docs/")  # replace with real directory 
-
-
-
-
-Check :doc:`../../api/read_api` for more details

+ 0 - 144
next_docs/en/user_guide/inference_result.rst

@@ -1,144 +0,0 @@
-
-Inference Result
-==================
-
-.. admonition:: Tip
-    :class: tip
-
-    Please first navigate to :doc:`tutorial/pipeline` to get an initial understanding of how the pipeline works; this will help in understanding the content of this section.
-
-The **InferenceResult** class is a container for model inference results and implements a series of related methods, such as ``draw_model`` and ``dump_model``.
-Check out :doc:`../api/model_operators` for more details about **InferenceResult**.
-
-
-Model Inference Result
------------------------
-
-Structure Definition
-^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. code:: python
-
-    from pydantic import BaseModel, Field
-    from enum import IntEnum
-
-    class CategoryType(IntEnum):
-            title = 0               # Title
-            plain_text = 1          # Text
-            abandon = 2             # Includes headers, footers, page numbers, and page annotations
-            figure = 3              # Image
-            figure_caption = 4      # Image description
-            table = 5               # Table
-            table_caption = 6       # Table description
-            table_footnote = 7      # Table footnote
-            isolate_formula = 8     # Block formula
-            formula_caption = 9     # Formula label
-
-            embedding = 13          # Inline formula
-            isolated = 14           # Block formula
-            text = 15               # OCR recognition result
-
-
-    class PageInfo(BaseModel):
-        page_no: int = Field(description="Page number, the first page is 0", ge=0)
-        height: int = Field(description="Page height", gt=0)
-        width: int = Field(description="Page width", ge=0)
-
-    class ObjectInferenceResult(BaseModel):
-        category_id: CategoryType = Field(description="Category", ge=0)
-        poly: list[float] = Field(description="Quadrilateral coordinates, representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively")
-        score: float = Field(description="Confidence of the inference result")
-        latex: str | None = Field(description="LaTeX parsing result", default=None)
-        html: str | None = Field(description="HTML parsing result", default=None)
-
-    class PageInferenceResults(BaseModel):
-            layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results")
-            page_info: PageInfo = Field(description="Page metadata")
-
-
-Example
-^^^^^^^^^^^
-
-.. code:: json
-
-    [
-        {
-            "layout_dets": [
-                {
-                    "category_id": 2,
-                    "poly": [
-                        99.1906967163086,
-                        100.3119125366211,
-                        730.3707885742188,
-                        100.3119125366211,
-                        730.3707885742188,
-                        245.81326293945312,
-                        99.1906967163086,
-                        245.81326293945312
-                    ],
-                    "score": 0.9999997615814209
-                }
-            ],
-            "page_info": {
-                "page_no": 0,
-                "height": 2339,
-                "width": 1654
-            }
-        },
-        {
-            "layout_dets": [
-                {
-                    "category_id": 5,
-                    "poly": [
-                        99.13092803955078,
-                        2210.680419921875,
-                        497.3183898925781,
-                        2210.680419921875,
-                        497.3183898925781,
-                        2264.78076171875,
-                        99.13092803955078,
-                        2264.78076171875
-                    ],
-                    "score": 0.9999997019767761
-                }
-            ],
-            "page_info": {
-                "page_no": 1,
-                "height": 2339,
-                "width": 1654
-            }
-        }
-    ]
-
-The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3],
-representing the coordinates of the top-left, top-right, bottom-right,
-and bottom-left points respectively. |Poly Coordinate Diagram|
-
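Converting such a ``poly`` into an axis-aligned bounding box is a common first step when post-processing these results; ``poly_to_bbox`` below is an illustrative helper, not part of magic-pdf:

```python
def poly_to_bbox(poly: list[float]) -> list[float]:
    """Convert [x0, y0, x1, y1, x2, y2, x3, y3] (top-left, top-right,
    bottom-right, bottom-left) to an axis-aligned [xmin, ymin, xmax, ymax]."""
    xs, ys = poly[0::2], poly[1::2]
    return [min(xs), min(ys), max(xs), max(ys)]

poly = [99.19, 100.31, 730.37, 100.31, 730.37, 245.81, 99.19, 245.81]
print(poly_to_bbox(poly))  # [99.19, 100.31, 730.37, 245.81]
```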
-
-
-Inference Result
--------------------------
-
-
-.. code:: python
-
-    from magic_pdf.operators.models import InferenceResult
-    from magic_pdf.data.dataset import Dataset
-
-    dataset : Dataset = some_data_set    # not real dataset
-
-    # The inference results of all pages, ordered by page number, are stored in a list as the inference results of MinerU
-    model_inference_result: list[PageInferenceResults] = []
-
-    inference_result = InferenceResult(model_inference_result, dataset)
-
-
-
-some_model.pdf
-^^^^^^^^^^^^^^^^^^^^
-
-.. figure:: ../_static/image/inference_result.png
-
-
-
-.. |Poly Coordinate Diagram| image:: ../_static/image/poly.png

+ 0 - 12
next_docs/en/user_guide/install.rst

@@ -1,12 +0,0 @@
-
-Installation
-==============
-
-.. toctree::
-   :maxdepth: 1
-
-   install/install
-   install/boost_with_cuda
-   install/download_model_weight_files
-   install/config
-

+ 0 - 255
next_docs/en/user_guide/install/boost_with_cuda.rst

@@ -1,255 +0,0 @@
-
-Boost With Cuda 
-================
-
-
-If your device supports CUDA and meets the GPU requirements of the
-mainline environment, you can use GPU acceleration. Please select the
-appropriate guide based on your system:
-
--  :ref:`ubuntu_22_04_lts_section`
--  :ref:`windows_10_or_11_section`
-
-
-.. _ubuntu_22_04_lts_section:
-
-Ubuntu 22.04 LTS
------------------
-
-1. Check if NVIDIA Drivers Are Installed
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. code:: sh
-
-   nvidia-smi
-
-If you see information similar to the following, it means that the
-NVIDIA drivers are already installed, and you can skip Step 2.
-
-.. note::
-
-   ``CUDA Version`` should be >= 12.4. If the displayed version number is less than 12.4, please upgrade the driver.
-
-.. code:: text
-
-   +---------------------------------------------------------------------------------------+
-   | NVIDIA-SMI 570.133.07             Driver Version: 572.83         CUDA Version: 12.8   |
-   |-----------------------------------------+----------------------+----------------------+
-   | GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
-   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
-   |                                         |                      |               MIG M. |
-   |=========================================+======================+======================|
-   |   0  NVIDIA GeForce RTX 3060 Ti   WDDM  | 00000000:01:00.0  On |                  N/A |
-   |  0%   51C    P8              12W / 200W |   1489MiB /  8192MiB |      5%      Default |
-   |                                         |                      |                  N/A |
-   +-----------------------------------------+----------------------+----------------------+
-
-2. Install the Driver
-~~~~~~~~~~~~~~~~~~~~~
-
-If no driver is installed, use the following command:
-
-.. code:: sh
-
-   sudo apt-get update
-   sudo apt-get install nvidia-driver-570-server
-
-Install the proprietary driver and restart your computer after
-installation.
-
-.. code:: sh
-
-   reboot
-
-3. Install Anaconda
-~~~~~~~~~~~~~~~~~~~
-
-If Anaconda is already installed, skip this step.
-
-.. code:: sh
-
-   wget https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
-   bash Anaconda3-2024.06-1-Linux-x86_64.sh
-
-In the final step, enter ``yes``, close the terminal, and reopen it.
-
-4. Create an Environment Using Conda
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Specify a Python version from 3.10 to 3.13.
-
-.. code:: sh
-
-    conda create -n mineru 'python=3.12' -y
-    conda activate mineru
-
-5. Install Applications
-~~~~~~~~~~~~~~~~~~~~~~~
-
-.. code:: sh
-
-   pip install -U "magic-pdf[full]"
-
-.. admonition:: Tip
-    :class: tip
-
-    After installation, you can check the version of ``magic-pdf`` using the following command:
-
-    .. code:: sh
-
-       magic-pdf --version
-
-
-6. Download Models
-~~~~~~~~~~~~~~~~~~
-
-Refer to detailed instructions on :doc:`download_model_weight_files`
-
-7. Understand the Location of the Configuration File
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-After completing the `6. Download Models <#6-download-models>`__ step,
-the script will automatically generate a ``magic-pdf.json`` file in the
-user directory and configure the default model path. You can find the
-``magic-pdf.json`` file in your user directory.
-
-.. admonition:: TIP
-    :class: tip
-
-    The user directory for Linux is “/home/username”.
-
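The lookup described above can be sketched as a few lines of standard-library Python; ``load_config`` is a hypothetical helper for illustration, assuming only that ``magic-pdf.json`` is plain JSON in the home directory:

```python
import json
import os
from typing import Optional

def load_config(path: Optional[str] = None) -> dict:
    """Read the magic-pdf.json configuration, defaulting to the
    user's home directory (e.g. /home/username/magic-pdf.json)."""
    if path is None:
        path = os.path.join(os.path.expanduser("~"), "magic-pdf.json")
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# e.g. check which device mode is configured:
# print(load_config().get("device-mode"))
```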
-8. First Run
-~~~~~~~~~~~~
-
-Download a sample file from the repository and test it.
-
-.. code:: sh
-
-   wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf
-   magic-pdf -p small_ocr.pdf -o ./output
-
-9. Test CUDA Acceleration
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If your graphics card has at least **8GB** of VRAM, follow these steps
-to test CUDA acceleration:
-
-1. Modify the value of ``"device-mode"`` in the ``magic-pdf.json``
-   configuration file located in your home directory.
-
-   .. code:: json
-
-      {
-        "device-mode": "cuda"
-      }
-
-2. Test CUDA acceleration with the following command:
-
-   .. code:: sh
-
-      magic-pdf -p small_ocr.pdf -o ./output
-
-
-.. _windows_10_or_11_section:
-
-Windows 10/11
---------------
-
-1. Install CUDA
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-You need to install a CUDA version that is compatible with torch's requirements. For details, please refer to the `official PyTorch website <https://pytorch.org/get-started/locally/>`_.
-
-- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
-- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
-- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
-- CUDA 12.8 https://developer.nvidia.com/cuda-12-8-0-download-archive
-
-
-2. Install Anaconda
-~~~~~~~~~~~~~~~~~~~
-
-If Anaconda is already installed, you can skip this step.
-
-Download link: https://repo.anaconda.com/archive/Anaconda3-2024.06-1-Windows-x86_64.exe
-
-3. Create an Environment Using Conda
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-::
-
-    conda create -n mineru 'python=3.12' -y
-    conda activate mineru
-
-4. Install Applications
-~~~~~~~~~~~~~~~~~~~~~~~
-
-::
-
-   pip install -U magic-pdf[full]
-
-.. admonition:: Tip
-    :class: tip
-
-    After installation, you can check the version of ``magic-pdf``:
-
-    .. code:: bash
-
-      magic-pdf --version
-
-
-5. Download Models
-~~~~~~~~~~~~~~~~~~
-
-Refer to detailed instructions on :doc:`download_model_weight_files`
-
-6. Understand the Location of the Configuration File
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-After completing the `5. Download Models <#5-download-models>`__ step,
-the script will automatically generate a ``magic-pdf.json`` file in the
-user directory and configure the default model path. You can find the
-``magic-pdf.json`` file in your user directory.
-
-.. admonition:: Tip
-    :class: tip
-
-    The user directory for Windows is “C:/Users/username”.
-
-7. First Run
-~~~~~~~~~~~~
-
-Download a sample file from the repository and test it.
-
-.. code:: powershell
-
-     wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf -O small_ocr.pdf
-     magic-pdf -p small_ocr.pdf -o ./output
-
-8. Test CUDA Acceleration
-~~~~~~~~~~~~~~~~~~~~~~~~~
-
-If your graphics card has at least 8GB of VRAM, follow these steps to
-test CUDA-accelerated parsing performance.
-
-1. **Overwrite the installation of torch and torchvision** with CUDA-enabled builds. (Select the appropriate index-url for your CUDA version; for more details, refer to the `PyTorch official website <https://pytorch.org/get-started/locally/>`_.)
-
-.. code:: sh
-
-   pip install --force-reinstall torch torchvision --index-url https://download.pytorch.org/whl/cu124
-
-
-2. **Modify the value of ``"device-mode"``** in the ``magic-pdf.json``
-   configuration file located in your user directory.
-
-   .. code:: json
-
-      {
-        "device-mode": "cuda"
-      }
-
-3. **Run the following command to test CUDA acceleration**:
-
-   ::
-
-      magic-pdf -p small_ocr.pdf -o ./output

+ 0 - 168
next_docs/en/user_guide/install/config.rst

@@ -1,168 +0,0 @@
-
-
-Config
-=========
-
-The **magic-pdf.json** file is typically located in the **${HOME}** directory on Linux or in the **C:\Users\{username}** directory on Windows.
-
-.. admonition:: Tip 
-    :class: tip
-
-    You can override the default location of the config file via the following command:
-
-    .. code:: sh
-
-        export MINERU_TOOLS_CONFIG_JSON=new_magic_pdf.json
-
-
-
-magic-pdf.json
-----------------
-
-.. code:: json 
-
-    {
-        "bucket_info":{
-            "bucket-name-1":["ak", "sk", "endpoint"],
-            "bucket-name-2":["ak", "sk", "endpoint"]
-        },
-        "models-dir":"/tmp/models",
-        "layoutreader-model-dir":"/tmp/layoutreader",
-        "device-mode":"cpu",
-        "layout-config": {
-            "model": "doclayout_yolo"
-        },
-        "formula-config": {
-            "mfd_model": "yolo_v8_mfd",
-            "mfr_model": "unimernet_small",
-            "enable": true
-        },
-        "table-config": {
-            "model": "rapid_table",
-            "enable": true,
-            "max_time": 400    
-        },
-        "config_version": "1.0.0"
-    }
-
-
-
-
-bucket_info
-^^^^^^^^^^^^^^
-Stores the access_key, secret_key, and endpoint entries of an AWS S3-compatible storage configuration.
-
-Example: 
-
-.. code:: text
-
-        {
-            "image_bucket":[{access_key}, {secret_key}, {endpoint}],
-            "video_bucket":[{access_key}, {secret_key}, {endpoint}]
-        }
-
-
-models-dir
-^^^^^^^^^^^^
-
-Stores the models downloaded from **Hugging Face** or **ModelScope**. You do not need to modify this field if you download the models using the scripts shipped with **MinerU**.
-
-
-layoutreader-model-dir
-^^^^^^^^^^^^^^^^^^^^^^^
-
-Stores the models downloaded from **Hugging Face** or **ModelScope**. You do not need to modify this field if you download the models using the scripts shipped with **MinerU**.
-
-
-device-mode
-^^^^^^^^^^^^^^
-
-This field has two options: **cpu** or **cuda**.
-
-**cpu**: run inference on the CPU
-
-**cuda**: use CUDA to accelerate inference
-
-
-layout-config 
-^^^^^^^^^^^^^^^
-
-.. code:: json
-
-    {
-        "model": "doclayout_yolo"
-    }
-
-The layout model cannot be disabled at present.
-
-
-formula-config
-^^^^^^^^^^^^^^^^
-
-.. code:: json
-
-    {
-        "mfd_model": "yolo_v8_mfd",   
-        "mfr_model": "unimernet_small",
-        "enable": true 
-    }
-
-
-mfd_model
-""""""""""
-
-Specify the formula detection model, options are ['yolo_v8_mfd']
-
-
-mfr_model
-""""""""""
-Specify the formula recognition model, options are ['unimernet_small']
-
-Check `UniMERNet <https://github.com/opendatalab/UniMERNet>`_ for more details
-
-
-enable
-""""""""
-
-On-off flag; options are [true, false]. **true** enables formula inference, **false** disables it.
-
-
-table-config
-^^^^^^^^^^^^^^^^
-
-.. code:: json
-
-   {
-        "model": "rapid_table",
-        "enable": true,
-        "max_time": 400    
-    }
-
-model
-""""""""
-
-Specify the table inference model, options are ['rapid_table']
-
-
-max_time
-"""""""""
-
-Since table recognition is a time-consuming process, we set a timeout period. If the process exceeds this time, the table recognition will be terminated.
-
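As a rough conceptual sketch (not MinerU's actual implementation), a per-call timeout like ``max_time`` can be enforced in Python with the standard library's ``concurrent.futures``; ``run_with_timeout`` and ``fake_table_recognition`` below are hypothetical names for illustration only.

```python
import concurrent.futures

def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn in a worker thread; raise TimeoutError if it takes longer
    than timeout_s seconds to produce a result."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn, *args, **kwargs)
        return future.result(timeout=timeout_s)

def fake_table_recognition():
    # Stand-in for a slow table-recognition call.
    return "<table>...</table>"

try:
    html = run_with_timeout(fake_table_recognition, 400)
    print("recognized")
except concurrent.futures.TimeoutError:
    print("table recognition timed out")
```

Note that cancelling a thread mid-flight is not possible in Python, so a real implementation would also need the worker to check for cancellation or run in a separate process.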
-
-
-enable
-"""""""
-
-On-off flag; options are [true, false]. **true** enables table inference, **false** disables it.
-
-
-config_version
-^^^^^^^^^^^^^^^^
-
-The version of the config schema.
-
-
-.. admonition:: Tip
-    :class: tip
-    
-    Check `Config Schema <https://github.com/opendatalab/MinerU/blob/master/magic-pdf.template.json>`_ for the latest details
-
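As a rough illustration, the fields documented above can be loaded and sanity-checked with nothing but the standard library; ``load_config`` and ``check_device_mode`` are hypothetical helper names, not part of MinerU.

```python
import json
import os

def load_config(default_name="magic-pdf.json"):
    """Hypothetical helper: resolve the config path, honoring the
    MINERU_TOOLS_CONFIG_JSON override, then parse the JSON file."""
    path = os.environ.get(
        "MINERU_TOOLS_CONFIG_JSON",
        os.path.join(os.path.expanduser("~"), default_name),
    )
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def check_device_mode(cfg):
    """Hypothetical helper: validate the device-mode field,
    which accepts only "cpu" or "cuda" per the docs above."""
    mode = cfg.get("device-mode", "cpu")
    if mode not in ("cpu", "cuda"):
        raise ValueError(f"unsupported device-mode: {mode}")
    return mode

# Example with an in-memory config shaped like the template above.
sample = {
    "device-mode": "cuda",
    "table-config": {"model": "rapid_table", "enable": True, "max_time": 400},
}
print(check_device_mode(sample))  # cuda
```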

+ 0 - 37
next_docs/en/user_guide/install/download_model_weight_files.rst

@@ -1,37 +0,0 @@
-
-Download Model Weight Files
-==============================
-
-Model downloads are divided into initial downloads and updates to the
-model directory. Please refer to the corresponding documentation for
-instructions on how to proceed.
-
-Initial download of model files
-------------------------------
-
-1. Download the Model from Hugging Face
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Use a Python script to download model files from Hugging Face:
-
-.. code:: bash
-
-   pip install huggingface_hub
-   wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
-   python download_models_hf.py
-
-The Python script will automatically download the model files and
-configure the model directory in the configuration file.
-
-The configuration file can be found in the user directory, with the
-filename ``magic-pdf.json``.
-
-How to update models previously downloaded
------------------------------------------
-
-1. Models downloaded via Hugging Face or Model Scope
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-If you previously downloaded models via Hugging Face or Model Scope, you
-can rerun the Python script used for the initial download. This will
-automatically update the model directory to the latest version.

+ 0 - 142
next_docs/en/user_guide/install/install.rst

@@ -1,142 +0,0 @@
-
-Install 
-===============================================================
-If you encounter any installation issues, please first consult the :doc:`../../additional_notes/faq`.
-If the parsing results are not as expected, refer to the :doc:`../../additional_notes/known_issues`.
-
-You can also try the `online demo <https://www.modelscope.cn/studios/OpenDataLab/MinerU>`_ without installation.
-
-.. admonition:: Warning
-    :class: tip
-
-    **Pre-installation Notice—Hardware and Software Environment Support**
-
-    To ensure the stability and reliability of the project, we only optimize
-    and test for specific hardware and software environments during
-    development. This ensures that users deploying and running the project
-    on recommended system configurations will get the best performance with
-    the fewest compatibility issues.
-
-    By focusing resources on the mainline environment, our team can more
-    efficiently resolve potential bugs and develop new features.
-
-    In non-mainline environments, due to the diversity of hardware and
-    software configurations, as well as third-party dependency compatibility
-    issues, we cannot guarantee 100% project availability. Therefore, for
-    users who wish to use this project in non-recommended environments, we
-    suggest carefully reading the documentation and FAQ first. Most issues
-    already have corresponding solutions in the FAQ. We also encourage
-    community feedback to help us gradually expand support.
-
-.. raw:: html
-
-    <style>
-        table, th, td {
-        border: 1px solid black;
-        border-collapse: collapse;
-        }
-    </style>
-    <table>
-    <tr>
-        <td colspan="3" rowspan="2">Operating System</td>
-    </tr>
-    <tr>
-        <td>Linux after 2019</td>
-        <td>Windows 10 / 11</td>
-        <td>macOS 11+</td>
-    </tr>
-    <tr>
-        <td colspan="3">CPU</td>
-        <td>x86_64 / arm64</td>
-        <td>x86_64 (ARM Windows not supported)</td>
-        <td>x86_64 / arm64</td>
-    </tr>
-    <tr>
-        <td colspan="3">Memory Requirements</td>
-        <td colspan="3">16GB or more, recommended 32GB+</td>
-    </tr>
-    <tr>
-        <td colspan="3">Storage Requirements</td>
-        <td colspan="3">20GB or more, with a preference for SSD</td>
-    </tr>
-    <tr>
-        <td colspan="3">Python Version</td>
-        <td colspan="3">3.10~3.13</td>
-    </tr>
-    <tr>
-        <td colspan="3">Nvidia Driver Version</td>
-        <td>latest (Proprietary Driver)</td>
-        <td>latest</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td colspan="3">CUDA Environment</td>
-        <td colspan="2"><a href="https://pytorch.org/get-started/locally/">Refer to the PyTorch official website</a></td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td colspan="3">CANN Environment(NPU support)</td>
-        <td>8.0+(Ascend 910b)</td>
-        <td>None</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td rowspan="2">GPU/MPS Hardware Support List</td>
-        <td colspan="2">GPU VRAM 6GB or more</td>
-        <td colspan="2">All GPUs with Tensor Cores produced from Volta(2017) onwards.<br>
-        More than 6GB VRAM </td>
-        <td rowspan="2">Apple silicon</td>
-    </tr>
-    </table>
-
-
-
-Create an environment
----------------------------
-
-.. code-block:: shell
-
-    conda create -n mineru 'python=3.12' -y
-    conda activate mineru
-    pip install -U "magic-pdf[full]"
-
-
-Download model weight files
-------------------------------
-
-.. code-block:: shell
-
-    pip install huggingface_hub
-    wget https://github.com/opendatalab/MinerU/raw/master/scripts/download_models_hf.py -O download_models_hf.py
-    python download_models_hf.py    
-
-
-
-Install LibreOffice[Optional]
-----------------------------------
-
-This section is required for handling the **doc**, **docx**, **ppt**, and **pptx** file types. You can **skip** it if you do not need to process those file types.
-
-
-Linux/macOS Platform
-""""""""""""""""""""""
-
-.. code::
-
-    apt-get/yum/brew install libreoffice
-
-
-Windows Platform 
-""""""""""""""""""""
-
-.. code::
-
-    1. Install LibreOffice
-    2. Add "install_dir\LibreOffice\program" to the PATH environment variable
-
-
-.. tip::
-
-    MinerU is now installed. Check out :doc:`../usage/command_line` to convert your first PDF, **or** read the following sections for more installation details.
-
-

+ 0 - 335
next_docs/en/user_guide/pipe_result.rst

@@ -1,335 +0,0 @@
-
-
-Pipe Result
-==============
-
-.. admonition:: Tip
-    :class: tip
-
-    Please first navigate to :doc:`tutorial/pipeline` to get an initial understanding of how the pipeline works; this will help in understanding the content of this section.
-
-
-The **PipeResult** class is a container for pipeline processing results and implements a series of methods on them, such as ``draw_layout`` and ``draw_span``.
-Checkout :doc:`../api/pipe_operators` for more details about **PipeResult**
-
-
-
-Structure Definitions
--------------------------------
-
-**some_pdf_middle.json**
-
-+----------------+--------------------------------------------------------------+
-| Field Name     | Description                                                  |
-|                |                                                              |
-+================+==============================================================+
-| pdf_info       | list, each element is a dict representing the parsing result |
-|                | of each PDF page, see the table below for details            |
-+----------------+--------------------------------------------------------------+
-| \_             | ocr \| txt, used to indicate the mode used in this           |
-| parse_type     | intermediate parsing state                                   |
-|                |                                                              |
-+----------------+--------------------------------------------------------------+
-| \_version_name | string, indicates the version of magic-pdf used in this      |
-|                | parsing                                                      |
-|                |                                                              |
-+----------------+--------------------------------------------------------------+
-
-**pdf_info**
-
-Field structure description
-
-+-------------------------+------------------------------------------------------------+
-| Field                   | Description                                                |
-| Name                    |                                                            |
-+=========================+============================================================+
-| preproc_blocks          | Intermediate result after PDF preprocessing, not yet       |
-|                         | segmented                                                  |
-+-------------------------+------------------------------------------------------------+
-| layout_bboxes           | Layout segmentation results, containing layout direction   |
-|                         | (vertical, horizontal), and bbox, sorted by reading order  |
-+-------------------------+------------------------------------------------------------+
-| page_idx                | Page number, starting from 0                               |
-|                         |                                                            |
-+-------------------------+------------------------------------------------------------+
-| page_size               | Page width and height                                      |
-|                         |                                                            |
-+-------------------------+------------------------------------------------------------+
-| \_layout_tree           | Layout tree structure                                      |
-|                         |                                                            |
-+-------------------------+------------------------------------------------------------+
-| images                  | list, each element is a dict representing an img_block     |
-+-------------------------+------------------------------------------------------------+
-| tables                  | list, each element is a dict representing a table_block    |
-+-------------------------+------------------------------------------------------------+
-| interline_equations     | list, each element is a dict representing an               |
-|                         | interline_equation_block                                   |
-|                         |                                                            |
-+-------------------------+------------------------------------------------------------+
-| discarded_blocks        | List, block information returned by the model that needs   |
-|                         | to be dropped                                              |
-|                         |                                                            |
-+-------------------------+------------------------------------------------------------+
-| para_blocks             | Result after segmenting preproc_blocks                     |
-|                         |                                                            |
-+-------------------------+------------------------------------------------------------+
-
-In the above table, ``para_blocks`` is an array of dicts, each dict
-representing a block structure. A block can support up to one level of
-nesting.
-
-**block**
-
-The outer block is referred to as a first-level block, and the fields in
-the first-level block include:
-
-+------------------------+-------------------------------------------------------------+
-| Field                  | Description                                                 |
-| Name                   |                                                             |
-+========================+=============================================================+
-| type                   | Block type (table|image)                                    |
-+------------------------+-------------------------------------------------------------+
-| bbox                   | Block bounding box coordinates                              |
-+------------------------+-------------------------------------------------------------+
-| blocks                 | list, each element is a dict representing a second-level    |
-|                        | block                                                       |
-+------------------------+-------------------------------------------------------------+
-
-There are only two types of first-level blocks: “table” and “image”. All
-other blocks are second-level blocks.
-
-The fields in a second-level block include:
-
-+----------------------+----------------------------------------------------------------+
-| Field                | Description                                                    |
-| Name                 |                                                                |
-+======================+================================================================+
-|                      | Block type                                                     |
-| type                 |                                                                |
-+----------------------+----------------------------------------------------------------+
-|                      | Block bounding box coordinates                                 |
-| bbox                 |                                                                |
-+----------------------+----------------------------------------------------------------+
-|                      | list, each element is a dict representing a line, used to      |
-| lines                | describe the composition of a line of information              |
-+----------------------+----------------------------------------------------------------+
-
-Detailed explanation of second-level block types
-
-================== ======================
-type               Description
-================== ======================
-image_body         Main body of the image
-image_caption      Image description text
-table_body         Main body of the table
-table_caption      Table description text
-table_footnote     Table footnote
-text               Text block
-title              Title block
-interline_equation Block formula
-================== ======================
-
-**line**
-
-The field format of a line is as follows:
-
-+---------------------+----------------------------------------------------------------+
-| Field               | Description                                                    |
-| Name                |                                                                |
-+=====================+================================================================+
-|                     | Bounding box coordinates of the line                           |
-| bbox                |                                                                |
-+---------------------+----------------------------------------------------------------+
-| spans               | list, each element is a dict representing a span, used to      |
-|                     | describe the composition of the smallest unit                  |
-+---------------------+----------------------------------------------------------------+
-
-**span**
-
-+---------------------+-----------------------------------------------------------+
-| Field               | Description                                               |
-| Name                |                                                           |
-+=====================+===========================================================+
-| bbox                | Bounding box coordinates of the span                      |
-+---------------------+-----------------------------------------------------------+
-| type                | Type of the span                                          |
-+---------------------+-----------------------------------------------------------+
-| content             | Text spans use content, chart spans use img_path to store |
-| \|                  | the actual text or screenshot path information            |
-| img_path            |                                                           |
-+---------------------+-----------------------------------------------------------+
-
-The types of spans are as follows:
-
-================== ==============
-type               Description
-================== ==============
-image              Image
-table              Table
-text               Text
-inline_equation    Inline formula
-interline_equation Block formula
-================== ==============
-
-**Summary**
-
-A span is the smallest storage unit for all elements.
-
-The elements stored within para_blocks are block information.
-
-The block structure is as follows:
-
-First-level block (if any) -> Second-level block -> Line -> Span
-
-.. _example-1:
-
-example
-^^^^^^^
-
-.. code:: json
-
-   {
-       "pdf_info": [
-           {
-               "preproc_blocks": [
-                   {
-                       "type": "text",
-                       "bbox": [
-                           52,
-                           61.956024169921875,
-                           294,
-                           82.99800872802734
-                       ],
-                       "lines": [
-                           {
-                               "bbox": [
-                                   52,
-                                   61.956024169921875,
-                                   294,
-                                   72.0000228881836
-                               ],
-                               "spans": [
-                                   {
-                                       "bbox": [
-                                           54.0,
-                                           61.956024169921875,
-                                           296.2261657714844,
-                                           72.0000228881836
-                                       ],
-                                       "content": "dependent on the service headway and the reliability of the departure ",
-                                       "type": "text",
-                                       "score": 1.0
-                                   }
-                               ]
-                           }
-                       ]
-                   }
-               ],
-               "layout_bboxes": [
-                   {
-                       "layout_bbox": [
-                           52,
-                           61,
-                           294,
-                           731
-                       ],
-                       "layout_label": "V",
-                       "sub_layout": []
-                   }
-               ],
-               "page_idx": 0,
-               "page_size": [
-                   612.0,
-                   792.0
-               ],
-               "_layout_tree": [],
-               "images": [],
-               "tables": [],
-               "interline_equations": [],
-               "discarded_blocks": [],
-               "para_blocks": [
-                   {
-                       "type": "text",
-                       "bbox": [
-                           52,
-                           61.956024169921875,
-                           294,
-                           82.99800872802734
-                       ],
-                       "lines": [
-                           {
-                               "bbox": [
-                                   52,
-                                   61.956024169921875,
-                                   294,
-                                   72.0000228881836
-                               ],
-                               "spans": [
-                                   {
-                                       "bbox": [
-                                           54.0,
-                                           61.956024169921875,
-                                           296.2261657714844,
-                                           72.0000228881836
-                                       ],
-                                       "content": "dependent on the service headway and the reliability of the departure ",
-                                       "type": "text",
-                                       "score": 1.0
-                                   }
-                               ]
-                           }
-                       ]
-                   }
-               ]
-           }
-       ],
-       "_parse_type": "txt",
-       "_version_name": "0.6.1"
-   }
-
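The nesting described above (first-level block, second-level block, line, span) can be traversed with a few lines of Python. This is a minimal sketch, assuming a middle.json-style dict like the example; ``collect_text_spans`` is a hypothetical helper, not a MinerU API.

```python
def collect_text_spans(middle):
    """Walk pdf_info -> para_blocks -> (nested blocks) -> lines -> spans
    and collect the content of every span that carries text."""
    texts = []
    for page in middle["pdf_info"]:
        for block in page.get("para_blocks", []):
            # A first-level table/image block nests second-level blocks;
            # other blocks are already second-level.
            sub_blocks = block.get("blocks", [block])
            for sub in sub_blocks:
                for line in sub.get("lines", []):
                    for span in line.get("spans", []):
                        if "content" in span:
                            texts.append(span["content"])
    return texts

# Trimmed-down version of the example structure above.
middle = {
    "pdf_info": [
        {
            "para_blocks": [
                {
                    "type": "text",
                    "bbox": [52, 61.9, 294, 83.0],
                    "lines": [
                        {
                            "bbox": [52, 61.9, 294, 72.0],
                            "spans": [
                                {
                                    "bbox": [54.0, 61.9, 296.2, 72.0],
                                    "content": "dependent on the service headway",
                                    "type": "text",
                                    "score": 1.0,
                                }
                            ],
                        }
                    ],
                }
            ]
        }
    ]
}
print(collect_text_spans(middle))
```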
-
-Pipeline Result
-------------------
-
-.. code:: python
-
-    from magic_pdf.pdf_parse_union_core_v2 import pdf_parse_union
-    from magic_pdf.operators.pipes import PipeResult
-    from magic_pdf.data.dataset import Dataset
-
-    res = pdf_parse_union(*args, **kwargs)
-    res['_parse_type'] = PARSE_TYPE_OCR
-    res['_version_name'] = __version__
-    if 'lang' in kwargs and kwargs['lang'] is not None:
-        res['lang'] = kwargs['lang']
-
-    dataset : Dataset = some_dataset   # not real dataset
-    pipeResult = PipeResult(res, dataset)
-
-
-
-some_pdf_layout.pdf
-~~~~~~~~~~~~~~~~~~~
-
-Each page layout consists of one or more boxes. The number at the top
-left of each box indicates its sequence number. Additionally, in
-``layout.pdf``, different content blocks are highlighted with different
-background colors.
-
-.. figure:: ../_static/image/layout_example.png
-   :alt: layout example
-
-   layout example
-
-some_pdf_spans.pdf
-~~~~~~~~~~~~~~~~~~
-
-All spans on the page are drawn with different colored line frames
-according to the span type. This file can be used for quality control,
-allowing for quick identification of issues such as missing text or
-unrecognized inline formulas.
-
-.. figure:: ../_static/image/spans_example.png
-   :alt: spans example
-
-   spans example

+ 0 - 12
next_docs/en/user_guide/quick_start.rst

@@ -1,12 +0,0 @@
-
-Quick Start 
-==============
-
-Want to learn how to use MinerU in different scenarios? This page gives examples of multiple usage cases to match your needs.
-
-.. toctree::
-    :maxdepth: 1
-
-    quick_start/convert_pdf 
-    quick_start/convert_image
-    quick_start/convert_ms_office

+ 0 - 47
next_docs/en/user_guide/quick_start/convert_image.rst

@@ -1,47 +0,0 @@
-
-
-Convert Image
-===============
-
-
-Command Line
-^^^^^^^^^^^^^
-
-.. code:: sh
-
-    # make sure the file has the correct suffix
-    magic-pdf -p a.png -o output -m auto
-
-
-API
-^^^^^^
-
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.data.read_api import read_local_images
-
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    os.makedirs(local_image_dir, exist_ok=True)
-
-    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
-        local_md_dir
-    )
-
-    # proc
-    ## Create Dataset Instance
-    input_file = "some_image.jpg"       # replace with real image file
-
-    input_file_name = input_file.split(".")[0]
-    ds = read_local_images(input_file)[0]
-
-    # ocr mode
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
-        md_writer, f"{input_file_name}.md", image_dir
-    )

+ 0 - 60
next_docs/en/user_guide/quick_start/convert_ms_office.rst

@@ -1,60 +0,0 @@
-
-
-Convert Doc
-=============
-
-.. admonition:: Warning
-    :class: tip
-
-    When processing MS-Office files, we first use third-party software to convert the MS-Office files to PDF.
-
-    For certain MS-Office files, the quality of the converted PDF files may not be very high, which can affect the quality of the final output.
-
-
-
-Command Line
-^^^^^^^^^^^^^
-
-.. code:: sh
-
-    # replace with a real MS-Office file; MS-DOC, MS-DOCX, MS-PPT, and MS-PPTX are supported
-    magic-pdf -p a.doc -o output -m auto
-
-
-API
-^^^^^^^^
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.data.read_api import read_local_office
-    from magic_pdf.config.enums import SupportedPdfParseMethod
-
-
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    os.makedirs(local_image_dir, exist_ok=True)
-
-    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
-        local_md_dir
-    )
-
-    # proc
-    ## Create Dataset Instance
-    input_file = "some_doc.doc"     # replace with real ms-office file, we support MS-DOC, MS-DOCX, MS-PPT, MS-PPTX now
-
-    input_file_name = input_file.split(".")[0]
-    ds = read_local_office(input_file)[0]
-
-
-    ## inference
-    if ds.classify() == SupportedPdfParseMethod.OCR:
-        ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
-            md_writer, f"{input_file_name}.md", image_dir
-        )
-    else:
-        ds.apply(doc_analyze, ocr=False).pipe_txt_mode(image_writer).dump_md(
-            md_writer, f"{input_file_name}.md", image_dir
-        )

+ 0 - 56
next_docs/en/user_guide/quick_start/convert_pdf.rst

@@ -1,56 +0,0 @@
-
-
-Convert PDF
-============
-
-Command Line
-^^^^^^^^^^^^^
-
-.. code:: bash
-
-    # make sure the file has the correct suffix
-    magic-pdf -p a.pdf -o output -m auto
-
-
-API
-^^^^^^
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.data.dataset import PymuDocDataset
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.config.enums import SupportedPdfParseMethod
-
-    # args
-    pdf_file_name = "abc.pdf"  # replace with the real pdf path
-    name_without_suff = os.path.splitext(pdf_file_name)[0]
-
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    os.makedirs(local_image_dir, exist_ok=True)
-
-    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
-        local_md_dir
-    )
-
-    # read bytes
-    reader1 = FileBasedDataReader("")
-    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content
-
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-
-    ## inference
-    if ds.classify() == SupportedPdfParseMethod.OCR:
-        ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
-            md_writer, f"{name_without_suff}.md", image_dir
-        )
-    else:
-        ds.apply(doc_analyze, ocr=False).pipe_txt_mode(image_writer).dump_md(
-            md_writer, f"{name_without_suff}.md", image_dir
-        )

+ 0 - 11
next_docs/en/user_guide/tutorial.rst

@@ -1,11 +0,0 @@
-
-Tutorial
-===========
-
-From beginning to end, this tutorial shows how to use mineru via a minimal project.
-
-.. toctree::
-    :maxdepth: 1
-
-    tutorial/pipeline
-

+ 0 - 412
next_docs/en/user_guide/tutorial/output_file_description.rst

@@ -1,412 +0,0 @@
-
-Output File Description
-=========================
-
-After executing the ``magic-pdf`` command, in addition to outputting
-files related to markdown, several other files unrelated to markdown
-will also be generated. These files will be introduced one by one.
-
-some_pdf_layout.pdf
-~~~~~~~~~~~~~~~~~~~
-
-Each page layout consists of one or more boxes. The number at the top
-left of each box indicates its sequence number. Additionally, in
-``layout.pdf``, different content blocks are highlighted with different
-background colors.
-
-.. figure:: ../../_static/image/layout_example.png
-   :alt: layout example
-
-   layout example
-
-some_pdf_spans.pdf
-~~~~~~~~~~~~~~~~~~
-
-All spans on the page are drawn with different colored line frames
-according to the span type. This file can be used for quality control,
-allowing for quick identification of issues such as missing text or
-unrecognized inline formulas.
-
-.. figure:: ../../_static/image/spans_example.png
-   :alt: spans example
-
-   spans example
-
-some_pdf_model.json
-~~~~~~~~~~~~~~~~~~~
-
-Structure Definition
-^^^^^^^^^^^^^^^^^^^^
-
-.. code:: python
-
-   from pydantic import BaseModel, Field
-   from enum import IntEnum
-
-   class CategoryType(IntEnum):
-        title = 0               # Title
-        plain_text = 1          # Text
-        abandon = 2             # Includes headers, footers, page numbers, and page annotations
-        figure = 3              # Image
-        figure_caption = 4      # Image description
-        table = 5               # Table
-        table_caption = 6       # Table description
-        table_footnote = 7      # Table footnote
-        isolate_formula = 8     # Block formula
-        formula_caption = 9     # Formula label
-
-        embedding = 13          # Inline formula
-        isolated = 14           # Block formula
-        text = 15               # OCR recognition result
-
-
-   class PageInfo(BaseModel):
-       page_no: int = Field(description="Page number, the first page is 0", ge=0)
-       height: int = Field(description="Page height", gt=0)
-       width: int = Field(description="Page width", ge=0)
-
-   class ObjectInferenceResult(BaseModel):
-       category_id: CategoryType = Field(description="Category", ge=0)
-       poly: list[float] = Field(description="Quadrilateral coordinates, representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively")
-       score: float = Field(description="Confidence of the inference result")
-       latex: str | None = Field(description="LaTeX parsing result", default=None)
-       html: str | None = Field(description="HTML parsing result", default=None)
-
-   class PageInferenceResults(BaseModel):
-        layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results", ge=0)
-        page_info: PageInfo = Field(description="Page metadata")
-
-
-   # The inference results of all pages, ordered by page number, are stored in a list as the inference results of MinerU
-   inference_result: list[PageInferenceResults] = []
-
-The format of the poly coordinates is [x0, y0, x1, y1, x2, y2, x3, y3],
-representing the coordinates of the top-left, top-right, bottom-right,
-and bottom-left points respectively. |Poly Coordinate Diagram|
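
For illustration, a poly of this form can be collapsed into an axis-aligned bounding box with a few lines of Python (``poly_to_bbox`` is a made-up helper for this page, not a MinerU API):

```python
# Hypothetical helper (not part of MinerU): collapse an 8-value poly
# [x0, y0, x1, y1, x2, y2, x3, y3] into [xmin, ymin, xmax, ymax].
def poly_to_bbox(poly):
    xs, ys = poly[0::2], poly[1::2]  # even indices are x, odd indices are y
    return [min(xs), min(ys), max(xs), max(ys)]

# Rounded coordinates from the first detection in the example below:
print(poly_to_bbox([99.19, 100.31, 730.37, 100.31,
                    730.37, 245.81, 99.19, 245.81]))
# -> [99.19, 100.31, 730.37, 245.81]
```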
-
-example
-^^^^^^^
-
-.. code:: json
-
-   [
-       {
-           "layout_dets": [
-               {
-                   "category_id": 2,
-                   "poly": [
-                       99.1906967163086,
-                       100.3119125366211,
-                       730.3707885742188,
-                       100.3119125366211,
-                       730.3707885742188,
-                       245.81326293945312,
-                       99.1906967163086,
-                       245.81326293945312
-                   ],
-                   "score": 0.9999997615814209
-               }
-           ],
-           "page_info": {
-               "page_no": 0,
-               "height": 2339,
-               "width": 1654
-           }
-       },
-       {
-           "layout_dets": [
-               {
-                   "category_id": 5,
-                   "poly": [
-                       99.13092803955078,
-                       2210.680419921875,
-                       497.3183898925781,
-                       2210.680419921875,
-                       497.3183898925781,
-                       2264.78076171875,
-                       99.13092803955078,
-                       2264.78076171875
-                   ],
-                   "score": 0.9999997019767761
-               }
-           ],
-           "page_info": {
-               "page_no": 1,
-               "height": 2339,
-               "width": 1654
-           }
-       }
-   ]
-
-some_pdf_middle.json
-~~~~~~~~~~~~~~~~~~~~
-
-+----------------+--------------------------------------------------------------+
-| Field Name     | Description                                                  |
-|                |                                                              |
-+================+==============================================================+
-| pdf_info       | list, each element is a dict representing the parsing result |
-|                | of each PDF page, see the table below for details            |
-+----------------+--------------------------------------------------------------+
-| \_parse_type   | ocr \| txt, used to indicate the mode used in this           |
-|                | intermediate parsing state                                   |
-|                |                                                              |
-+----------------+--------------------------------------------------------------+
-| \_version_name | string, indicates the version of magic-pdf used in this      |
-|                | parsing                                                      |
-|                |                                                              |
-+----------------+--------------------------------------------------------------+
-
-**pdf_info**
-
-Field structure description
-
-+-------------------------+------------------------------------------------------------+
-| Field                   | Description                                                |
-| Name                    |                                                            |
-+=========================+============================================================+
-| preproc_blocks          | Intermediate result after PDF preprocessing, not yet       |
-|                         | segmented                                                  |
-+-------------------------+------------------------------------------------------------+
-| layout_bboxes           | Layout segmentation results, containing layout direction   |
-|                         | (vertical, horizontal), and bbox, sorted by reading order  |
-+-------------------------+------------------------------------------------------------+
-| page_idx                | Page number, starting from 0                               |
-|                         |                                                            |
-+-------------------------+------------------------------------------------------------+
-| page_size               | Page width and height                                      |
-|                         |                                                            |
-+-------------------------+------------------------------------------------------------+
-| \_layout_tree           | Layout tree structure                                      |
-|                         |                                                            |
-+-------------------------+------------------------------------------------------------+
-| images                  | list, each element is a dict representing an img_block     |
-+-------------------------+------------------------------------------------------------+
-| tables                  | list, each element is a dict representing a table_block    |
-+-------------------------+------------------------------------------------------------+
-| interline_equation      | list, each element is a dict representing an               |
-|                         | interline_equation_block                                   |
-|                         |                                                            |
-+-------------------------+------------------------------------------------------------+
-| discarded_blocks        | List, block information returned by the model that needs   |
-|                         | to be dropped                                              |
-|                         |                                                            |
-+-------------------------+------------------------------------------------------------+
-| para_blocks             | Result after segmenting preproc_blocks                     |
-|                         |                                                            |
-+-------------------------+------------------------------------------------------------+
-
-In the above table, ``para_blocks`` is an array of dicts, each dict
-representing a block structure. A block can support up to one level of
-nesting.
-
-**block**
-
-The outer block is referred to as a first-level block, and the fields in
-the first-level block include:
-
-+------------------------+-------------------------------------------------------------+
-| Field                  | Description                                                 |
-| Name                   |                                                             |
-+========================+=============================================================+
-| type                   | Block type (table|image)                                    |
-+------------------------+-------------------------------------------------------------+
-| bbox                   | Block bounding box coordinates                              |
-+------------------------+-------------------------------------------------------------+
-| blocks                 | list, each element is a dict representing a second-level    |
-|                        | block                                                       |
-+------------------------+-------------------------------------------------------------+
-
-There are only two types of first-level blocks: “table” and “image”. All
-other blocks are second-level blocks.
-
-The fields in a second-level block include:
-
-+----------------------+----------------------------------------------------------------+
-| Field                | Description                                                    |
-| Name                 |                                                                |
-+======================+================================================================+
-|                      | Block type                                                     |
-| type                 |                                                                |
-+----------------------+----------------------------------------------------------------+
-|                      | Block bounding box coordinates                                 |
-| bbox                 |                                                                |
-+----------------------+----------------------------------------------------------------+
-|                      | list, each element is a dict representing a line, used to      |
-| lines                | describe the composition of a line of information              |
-+----------------------+----------------------------------------------------------------+
-
-Detailed explanation of second-level block types
-
-================== ======================
-type               Description
-================== ======================
-image_body         Main body of the image
-image_caption      Image description text
-table_body         Main body of the table
-table_caption      Table description text
-table_footnote     Table footnote
-text               Text block
-title              Title block
-interline_equation Block formula
-================== ======================
-
-**line**
-
-The field format of a line is as follows:
-
-+---------------------+----------------------------------------------------------------+
-| Field               | Description                                                    |
-| Name                |                                                                |
-+=====================+================================================================+
-|                     | Bounding box coordinates of the line                           |
-| bbox                |                                                                |
-+---------------------+----------------------------------------------------------------+
-| spans               | list, each element is a dict representing a span, used to      |
-|                     | describe the composition of the smallest unit                  |
-+---------------------+----------------------------------------------------------------+
-
-**span**
-
-+---------------------+-----------------------------------------------------------+
-| Field               | Description                                               |
-| Name                |                                                           |
-+=====================+===========================================================+
-| bbox                | Bounding box coordinates of the span                      |
-+---------------------+-----------------------------------------------------------+
-| type                | Type of the span                                          |
-+---------------------+-----------------------------------------------------------+
-| content \| img_path | Text spans use content, chart spans use img_path to store |
-|                     | the actual text or screenshot path information            |
-+---------------------+-----------------------------------------------------------+
-
-The types of spans are as follows:
-
-================== ==============
-type               Description
-================== ==============
-image              Image
-table              Table
-text               Text
-inline_equation    Inline formula
-interline_equation Block formula
-================== ==============
-
-**Summary**
-
-A span is the smallest storage unit for all elements.
-
-The elements stored within para_blocks are block information.
-
-The block structure is as follows:
-
-First-level block (if any) -> Second-level block -> Line -> Span
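
That traversal can be sketched as follows (a made-up helper operating on a middle.json-shaped dict, not a MinerU API):

```python
# Hypothetical helper illustrating the nesting described above:
# first-level block (if any) -> second-level block -> line -> span.
def collect_text(pdf_info):
    texts = []
    for page in pdf_info:
        for block in page["para_blocks"]:
            # first-level table/image blocks nest their parts under "blocks";
            # every other block is already a second-level block
            for sub in block.get("blocks", [block]):
                for line in sub.get("lines", []):
                    for span in line.get("spans", []):
                        if span["type"] == "text":
                            texts.append(span["content"])
    return texts
```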
-
-.. _example-1:
-
-example
-^^^^^^^
-
-.. code:: json
-
-   {
-       "pdf_info": [
-           {
-               "preproc_blocks": [
-                   {
-                       "type": "text",
-                       "bbox": [
-                           52,
-                           61.956024169921875,
-                           294,
-                           82.99800872802734
-                       ],
-                       "lines": [
-                           {
-                               "bbox": [
-                                   52,
-                                   61.956024169921875,
-                                   294,
-                                   72.0000228881836
-                               ],
-                               "spans": [
-                                   {
-                                       "bbox": [
-                                           54.0,
-                                           61.956024169921875,
-                                           296.2261657714844,
-                                           72.0000228881836
-                                       ],
-                                       "content": "dependent on the service headway and the reliability of the departure ",
-                                       "type": "text",
-                                       "score": 1.0
-                                   }
-                               ]
-                           }
-                       ]
-                   }
-               ],
-               "layout_bboxes": [
-                   {
-                       "layout_bbox": [
-                           52,
-                           61,
-                           294,
-                           731
-                       ],
-                       "layout_label": "V",
-                       "sub_layout": []
-                   }
-               ],
-               "page_idx": 0,
-               "page_size": [
-                   612.0,
-                   792.0
-               ],
-               "_layout_tree": [],
-               "images": [],
-               "tables": [],
-               "interline_equations": [],
-               "discarded_blocks": [],
-               "para_blocks": [
-                   {
-                       "type": "text",
-                       "bbox": [
-                           52,
-                           61.956024169921875,
-                           294,
-                           82.99800872802734
-                       ],
-                       "lines": [
-                           {
-                               "bbox": [
-                                   52,
-                                   61.956024169921875,
-                                   294,
-                                   72.0000228881836
-                               ],
-                               "spans": [
-                                   {
-                                       "bbox": [
-                                           54.0,
-                                           61.956024169921875,
-                                           296.2261657714844,
-                                           72.0000228881836
-                                       ],
-                                       "content": "dependent on the service headway and the reliability of the departure ",
-                                       "type": "text",
-                                       "score": 1.0
-                                   }
-                               ]
-                           }
-                       ]
-                   }
-               ]
-           }
-       ],
-       "_parse_type": "txt",
-       "_version_name": "0.6.1"
-   }
-
-.. |Poly Coordinate Diagram| image:: ../../_static/image/poly.png

+ 0 - 182
next_docs/en/user_guide/tutorial/pipeline.rst

@@ -1,182 +0,0 @@
-
-
-Pipeline
-==========
-
-
-Minimal Example 
-^^^^^^^^^^^^^^^^^
-
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.data.dataset import PymuDocDataset
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-
-    # args
-    pdf_file_name = "abc.pdf"  # replace with the real pdf path
-    name_without_suff = os.path.splitext(pdf_file_name)[0]
-
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    os.makedirs(local_image_dir, exist_ok=True)
-
-    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
-        local_md_dir
-    )
-
-    # read bytes
-    reader1 = FileBasedDataReader("")
-    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content
-
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
-
-Running the above code produces the following output directory layout
-
-
-.. code:: bash 
-
-    output/
-    ├── abc.md
-    └── images
-
-
-Excluding the setup of the environment, such as creating directories and importing dependencies, the actual code snippet for converting pdf to markdown is as follows
-
-
-.. code:: python 
-
-    # read bytes
-    reader1 = FileBasedDataReader("")
-    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content
-
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
-
-``ds.apply(doc_analyze, ocr=True)`` generates an ``InferenceResult`` object. The ``InferenceResult`` object, when executing the ``pipe_ocr_mode`` method, produces a ``PipeResult`` object.
-The ``PipeResult`` object, upon executing ``dump_md``, generates a ``markdown`` file at the specified location.
-
-
-The pipeline execution process is illustrated in the following diagram
-
-
-.. image:: ../../_static/image/pipeline.drawio.svg 
-
-.. raw:: html
-
-    <br> </br>
-
-Currently, the process is divided into three stages: data, inference, and processing, which correspond to the ``Dataset``, ``InferenceResult``, and ``PipeResult`` entities in the diagram.
-These stages are linked together through methods such as ``apply``, ``doc_analyze``, and ``pipe_ocr_mode``.
-
-
-.. admonition:: Tip
-    :class: tip
-
-    For more detailed information about ``Dataset``, ``InferenceResult``, and ``PipeResult``, please refer to :doc:`../../api/dataset`, :doc:`../../api/model_operators`, :doc:`../../api/pipe_operators`
-
-
-Pipeline Composition
-^^^^^^^^^^^^^^^^^^^^^
-
-.. code:: python 
-
-    class Dataset(ABC):
-        @abstractmethod
-        def apply(self, proc: Callable, *args, **kwargs):
-            """Apply a callable to this dataset.
-
-            Args:
-                proc (Callable): invoke proc as follows:
-                    proc(self, *args, **kwargs)
-
-            Returns:
-                Any: return the result generated by proc
-            """
-            pass
-
-    class InferenceResult(InferenceResultBase):
-
-        def apply(self, proc: Callable, *args, **kwargs):
-            """Apply a callable to this inference result.
-
-            Args:
-                proc (Callable): invoke proc as follows:
-                    proc(inference_result, *args, **kwargs)
-
-            Returns:
-                Any: return the result generated by proc
-            """
-            return proc(copy.deepcopy(self._infer_res), *args, **kwargs)
-
-        def pipe_ocr_mode(
-            self,
-            imageWriter: DataWriter,
-            start_page_id=0,
-            end_page_id=None,
-            debug_mode=False,
-            lang=None,
-            ) -> PipeResult:
-            pass
-
-    class PipeResult:
-        def apply(self, proc: Callable, *args, **kwargs):
-            """Apply a callable to this pipeline result.
-
-            Args:
-                proc (Callable): invoke proc as follows:
-                    proc(pipeline_result, *args, **kwargs)
-
-            Returns:
-                Any: return the result generated by proc
-            """
-            return proc(copy.deepcopy(self._pipe_res), *args, **kwargs)
-
-
-The ``Dataset``, ``InferenceResult``, and ``PipeResult`` classes all have an ``apply`` method, which can be used to chain different stages of the computation. 
-As shown below, ``MinerU`` provides a set of methods to compose these classes.
-
-
-.. code:: python 
-
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(md_writer, f"{name_without_suff}.md", image_dir)
-
-
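The chaining contract these classes share can be mimicked with a self-contained toy (the ``Stage`` class and its values are invented for illustration; they are not MinerU types):

```python
# Toy illustration of the apply-chaining pattern used by
# Dataset / InferenceResult / PipeResult; "Stage" is invented.
class Stage:
    def __init__(self, value):
        self.value = value

    def apply(self, proc, *args, **kwargs):
        # each apply hands the current value to proc and wraps the
        # result, so calls can be chained stage to stage
        return Stage(proc(self.value, *args, **kwargs))

result = (Stage("pdf bytes")
          .apply(lambda v: v + " -> inference")
          .apply(lambda v: v + " -> pipe"))
print(result.value)  # pdf bytes -> inference -> pipe
```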
-Users can implement their own functions for chaining as needed. For example, a user could use the ``apply`` method to create a function that counts the number of pages in a ``pdf`` file.
-
-
-.. code:: python
-
-    from magic_pdf.data.data_reader_writer import  FileBasedDataReader
-    from magic_pdf.data.dataset import PymuDocDataset
-
-    # args
-    pdf_file_name = "abc.pdf"  # replace with the real pdf path
-
-    # read bytes
-    reader1 = FileBasedDataReader("")
-    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content
-
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-
-    def count_page(ds) -> int:
-        return len(ds)
-
-    print("page number: ", ds.apply(count_page)) # will output the page count of `abc.pdf`

+ 0 - 12
next_docs/en/user_guide/usage.rst

@@ -1,12 +0,0 @@
-
-
-Usage
-========
-
-.. toctree::
-   :maxdepth: 1
-
-   usage/command_line
-   usage/api
-   usage/docker
-

+ 0 - 279
next_docs/en/user_guide/usage/api.rst

@@ -1,279 +0,0 @@
-
-API Usage
-===========
-
-
-PDF
-----
-
-Local File Example
-^^^^^^^^^^^^^^^^^^
-
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.data.dataset import PymuDocDataset
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.config.enums import SupportedPdfParseMethod
-
-    # args
-    pdf_file_name = "abc.pdf"  # replace with the real pdf path
-    name_without_suff = os.path.splitext(pdf_file_name)[0]
-
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    os.makedirs(local_image_dir, exist_ok=True)
-
-    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
-        local_md_dir
-    )
-
-    # read bytes
-    reader1 = FileBasedDataReader("")
-    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content
-
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-
-    ## inference
-    if ds.classify() == SupportedPdfParseMethod.OCR:
-        infer_result = ds.apply(doc_analyze, ocr=True)
-
-        ## pipeline
-        pipe_result = infer_result.pipe_ocr_mode(image_writer)
-
-    else:
-        infer_result = ds.apply(doc_analyze, ocr=False)
-
-        ## pipeline
-        pipe_result = infer_result.pipe_txt_mode(image_writer)
-
-    ### draw model result on each page
-    infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))
-
-    ### get model inference result
-    model_inference_result = infer_result.get_infer_res()
-
-    ### draw layout result on each page
-    pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))
-
-    ### draw spans result on each page
-    pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))
-
-    ### get markdown content
-    md_content = pipe_result.get_markdown(image_dir)
-
-    ### dump markdown
-    pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)
-
-    ### get content list content
-    content_list_content = pipe_result.get_content_list(image_dir)
-
-    ### dump content list
-    pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
-
-    ### get middle json
-    middle_json_content = pipe_result.get_middle_json()
-
-    ### dump middle json
-    pipe_result.dump_middle_json(md_writer, f'{name_without_suff}_middle.json')
-
-
-
-S3 File Example
-^^^^^^^^^^^^^^^^
-
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
-    from magic_pdf.data.dataset import PymuDocDataset
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.config.enums import SupportedPdfParseMethod
-
-    bucket_name = "{Your S3 Bucket Name}"  # replace with real bucket name
-    ak = "{Your S3 access key}"  # replace with real s3 access key
-    sk = "{Your S3 secret key}"  # replace with real s3 secret key
-    endpoint_url = "{Your S3 endpoint_url}"  # replace with real s3 endpoint_url
-
-    reader = S3DataReader('unittest/tmp/', bucket_name, ak, sk, endpoint_url)  # replace `unittest/tmp` with the real s3 prefix
-    writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
-    image_writer = S3DataWriter('unittest/tmp/images', bucket_name, ak, sk, endpoint_url)
-    md_writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
-
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    # args
-    pdf_file_name = (
-        f"s3://{bucket_name}/unittest/tmp/bug5-11.pdf"  # replace with the real s3 path
-    )
-
-    # prepare env
-    local_dir = "output"
-    name_without_suff = os.path.splitext(os.path.basename(pdf_file_name))[0]
-
-    # read bytes
-    pdf_bytes = reader.read(pdf_file_name)  # read the pdf content
-
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-
-    ## inference
-    if ds.classify() == SupportedPdfParseMethod.OCR:
-        infer_result = ds.apply(doc_analyze, ocr=True)
-
-        ## pipeline
-        pipe_result = infer_result.pipe_ocr_mode(image_writer)
-
-    else:
-        infer_result = ds.apply(doc_analyze, ocr=False)
-
-        ## pipeline
-        pipe_result = infer_result.pipe_txt_mode(image_writer)
-
-    ### draw model result on each page
-    infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))
-
-    ### get model inference result
-    model_inference_result = infer_result.get_infer_res()
-
-    ### draw layout result on each page
-    pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))
-
-    ### draw spans result on each page
-    pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))
-
-    ### dump markdown
-    pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)
-
-    ### dump content list
-    pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
-
-    ### get markdown content
-    md_content = pipe_result.get_markdown(image_dir)
-
-    ### get content list content
-    content_list_content = pipe_result.get_content_list(image_dir)
-
-    ### get middle json
-    middle_json_content = pipe_result.get_middle_json()
-
-    ### dump middle json
-    pipe_result.dump_middle_json(md_writer, f'{name_without_suff}_middle.json')
-
-MS-Office
-----------
-
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.data.read_api import read_local_office
-
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    os.makedirs(local_image_dir, exist_ok=True)
-
-    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
-        local_md_dir
-    )
-
-    # proc
-    ## Create Dataset Instance
-    input_file = "some_ppt.ppt"     # replace with real ms-office file
-
-    input_file_name = input_file.split(".")[0]
-    ds = read_local_office(input_file)[0]
-
-    ds.apply(doc_analyze, ocr=True).pipe_txt_mode(image_writer).dump_md(
-        md_writer, f"{input_file_name}.md", image_dir
-    )
-
-This code snippet can be used to process **ppt**, **pptx**, **doc**, and **docx** files.
-
-
-Image
----------
-
-Single Image File
-^^^^^^^^^^^^^^^^^^^
-
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.data.read_api import read_local_images
-
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    os.makedirs(local_image_dir, exist_ok=True)
-
-    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
-        local_md_dir
-    )
-
-    # proc
-    ## Create Dataset Instance
-    input_file = "some_image.jpg"       # replace with real image file
-
-    input_file_name = input_file.split(".")[0]
-    ds = read_local_images(input_file)[0]
-
-    ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
-        md_writer, f"{input_file_name}.md", image_dir
-    )
-
-
-Directory That Contains Images
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.data.read_api import read_local_images
-
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    os.makedirs(local_image_dir, exist_ok=True)
-
-    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
-        local_md_dir
-    )
-
-    # proc
-    ## Create Dataset Instance
-    input_directory = "some_image_dir/"       # replace with real directory that contains images
-
-
-    dss = read_local_images(input_directory, suffixes=['.png', '.jpg'])
-
-    for count, ds in enumerate(dss):
-        ds.apply(doc_analyze, ocr=True).pipe_ocr_mode(image_writer).dump_md(
-            md_writer, f"{count}.md", image_dir
-        )
-
-
-Check :doc:`../data/data_reader_writer` for more reader and writer examples, and see :doc:`../../api/pipe_operators` or :doc:`../../api/model_operators` for API details.

+ 0 - 77
next_docs/en/user_guide/usage/command_line.rst

@@ -1,77 +0,0 @@
-
-
-Command Line
-===================
-
-.. code:: bash
-
-   magic-pdf --help
-   Usage: magic-pdf [OPTIONS]
-
-   Options:
-     -v, --version                display the version and exit
-     -p, --path PATH              local filepath or directory. support PDF, PPT,
-                                  PPTX, DOC, DOCX, PNG, JPG files  [required]
-     -o, --output-dir PATH        output local directory  [required]
-     -m, --method [ocr|txt|auto]  the method for parsing pdf. ocr: using ocr
-                                  technique to extract information from pdf. txt:
-                                  suitable for the text-based pdf only and
-                                  outperform ocr. auto: automatically choose the
-                                  best method for parsing pdf from ocr and txt.
-                                  without method specified, auto will be used by
-                                  default.
-     -l, --lang TEXT              Input the languages in the pdf (if known) to
-                                  improve OCR accuracy.  Optional. You should
-                                  input "Abbreviation" with language form url: ht
-                                  tps://paddlepaddle.github.io/PaddleOCR/en/ppocr
-                                  /blog/multi_languages.html#5-support-languages-
-                                  and-abbreviations
-     -d, --debug BOOLEAN          Enables detailed debugging information during
-                                  the execution of the CLI commands.
-     -s, --start INTEGER          The starting page for PDF parsing, beginning
-                                  from 0.
-     -e, --end INTEGER            The ending page for PDF parsing, beginning from
-                                  0.
-     --help                       Show this message and exit.
-
-
-   ## show version
-   magic-pdf -v
-
-   ## command line example
-   magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
-
-
-.. admonition:: Important
-    :class: tip
-
-    The filename must end with one of the following suffixes:
-       .pdf 
-       .png
-       .jpg
-       .ppt
-       .pptx
-       .doc
-       .docx
-
-
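The suffix check above can be sketched in a few lines of Python (an illustration only; `is_supported` is a hypothetical helper, not part of the magic-pdf CLI):

```python
import os

# Suffixes accepted by the CLI, per the list above
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".ppt", ".pptx", ".doc", ".docx"}

def is_supported(filename: str) -> bool:
    # Compare the lowercased extension against the supported set
    return os.path.splitext(filename)[1].lower() in SUPPORTED_SUFFIXES
```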
-``{some_pdf}`` can be a single PDF file or a directory containing
-multiple PDFs. The results will be saved in the ``{some_output_dir}``
-directory. The output file list is as follows:
-
-.. code:: text
-
-   ├── some_pdf.md                          # markdown file
-   ├── images                               # directory for storing images
-   ├── some_pdf_layout.pdf                  # layout diagram
-   ├── some_pdf_middle.json                 # MinerU intermediate processing result
-   ├── some_pdf_model.json                  # model inference result
-   ├── some_pdf_origin.pdf                  # original PDF file
-   ├── some_pdf_spans.pdf                   # smallest granularity bbox position information diagram
-   └── some_pdf_content_list.json           # Rich text JSON arranged in reading order
-
-.. admonition:: Tip
-   :class: tip
-   
-
-   For more information about the output files, please refer to :doc:`../inference_result` or :doc:`../pipe_result`.

+ 0 - 24
next_docs/en/user_guide/usage/docker.rst

@@ -1,24 +0,0 @@
-
-
-Docker 
-=======
-
-.. admonition:: Important
-   :class: tip
-
-   Docker requires a GPU with at least 16GB of VRAM, and all acceleration features are enabled by default.
-
-   Before running this Docker, you can use the following command to check if your device supports CUDA acceleration on Docker. 
-
-   .. code-block:: bash
-
-      docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
-
-
-.. code:: sh
-
-   wget https://github.com/opendatalab/MinerU/raw/master/Dockerfile
-   docker build -t mineru:latest .
-   docker run --rm -it --gpus=all mineru:latest /bin/bash
-   magic-pdf --help
-

+ 0 - 17
next_docs/requirements.txt

@@ -1,17 +0,0 @@
-numpy==1.26.4
-click==8.1.7
-fast-langdetect==0.2.2
-Brotli==1.1.0
-boto3>=1.28.43
-loguru>=0.6.0
-myst-parser
-Pillow==8.4.0
-pydantic>=2.7.2,<2.8.0
-PyMuPDF>=1.24.9
-pdfminer.six==20231228
-sphinx
-sphinx-argparse>=0.5.2
-sphinx-book-theme>=1.1.3
-sphinx-copybutton>=0.5.2
-sphinx_rtd_theme>=3.0.1
-autodoc_pydantic>=2.2.0

+ 0 - 16
next_docs/zh_cn/.readthedocs.yaml

@@ -1,16 +0,0 @@
-version: 2
-
-build:
-  os: ubuntu-22.04
-  tools:
-    python: "3.10"
-
-formats:
-  - epub
-
-python:
-  install:
-    - requirements: next_docs/requirements.txt
-
-sphinx:
-  configuration: next_docs/zh_cn/conf.py

+ 0 - 20
next_docs/zh_cn/Makefile

@@ -1,20 +0,0 @@
-# Minimal makefile for Sphinx documentation
-#
-
-# You can set these variables from the command line, and also
-# from the environment for the first two.
-SPHINXOPTS    ?=
-SPHINXBUILD   ?= sphinx-build
-SOURCEDIR     = .
-BUILDDIR      = _build
-
-# Put it first so that "make" without argument is like "make help".
-help:
-	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
-
-.PHONY: help Makefile
-
-# Catch-all target: route all unknown targets to Sphinx using the new
-# "make mode" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).
-%: Makefile
-	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

BIN
next_docs/zh_cn/_static/image/MinerU-logo-hq.png


BIN
next_docs/zh_cn/_static/image/MinerU-logo.png


File diff suppressed because it is too large
+ 0 - 13
next_docs/zh_cn/_static/image/ReadTheDocs.svg


BIN
next_docs/zh_cn/_static/image/datalab_logo.png


BIN
next_docs/zh_cn/_static/image/flowchart_en.png


BIN
next_docs/zh_cn/_static/image/flowchart_zh_cn.png


BIN
next_docs/zh_cn/_static/image/inference_result.png


BIN
next_docs/zh_cn/_static/image/layout_example.png


BIN
next_docs/zh_cn/_static/image/logo.png


File diff suppressed because it is too large
+ 0 - 3
next_docs/zh_cn/_static/image/pipeline.drawio.svg


BIN
next_docs/zh_cn/_static/image/poly.png


BIN
next_docs/zh_cn/_static/image/project_panorama_en.png


BIN
next_docs/zh_cn/_static/image/project_panorama_zh_cn.png


BIN
next_docs/zh_cn/_static/image/spans_example.png


BIN
next_docs/zh_cn/_static/image/web_demo_1.png


+ 0 - 72
next_docs/zh_cn/additional_notes/faq.rst

@@ -1,72 +0,0 @@
-Frequently Asked Questions
-==========================
-
-1. On newer macOS versions, ``pip install magic-pdf[full]`` fails with ``zsh: no matches found: magic-pdf[full]``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-On macOS the default shell has switched from Bash to Z shell, and Z shell applies special matching logic to certain strings, which can cause the "no matches found" error. You can disable globbing in the shell and then retry the install command:
-
-.. code:: bash
-
-   setopt no_nomatch
-   pip install magic-pdf[full]
-
-2. Encountering ``_pickle.UnpicklingError: invalid load key, 'v'.`` during use
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-This is likely caused by an incomplete model download; try downloading the model files again. Reference: https://github.com/opendatalab/MinerU/issues/143
-
-3. Where should the model files be downloaded, and how should ``models-dir`` be configured
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-The model file path is configured in "magic-pdf.json" via
-
-.. code:: json
-
-   {
-     "models-dir": "/tmp/models"
-   }
-
-This must be an absolute path, not a relative path; you can obtain the absolute path by running ``pwd`` inside the models directory.
-Reference: https://github.com/opendatalab/MinerU/issues/155#issuecomment-2230216874
-
-4. Encountering ``ImportError: libGL.so.1: cannot open shared object file: No such file or directory`` on Ubuntu 22.04 under WSL2
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Ubuntu 22.04 on WSL2 is missing the ``libgl`` library; install it with the following command:
-
-.. code:: bash
-
-   sudo apt-get install libgl1-mesa-glx
-
-Reference: https://github.com/opendatalab/MinerU/issues/388
-
-5. Encountering ``ModuleNotFoundError: No module named 'fairscale'``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Uninstall the module and reinstall it:
-
-.. code:: bash
-
-   pip uninstall fairscale
-   pip install fairscale
-
-Reference: https://github.com/opendatalab/MinerU/issues/411
-
-6. On some newer devices such as the H100, text parsed with CUDA-accelerated OCR is garbled
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-CUDA 11 has poor compatibility with newer GPUs; upgrade the CUDA version used by Paddle:
-
-.. code:: bash
-
-   pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu123/
-
-Reference: https://github.com/opendatalab/MinerU/issues/558
-
-7. On some Linux servers, the program fails immediately with ``非法指令 (核心已转储)`` or ``Illegal instruction (core dumped)``
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-This may be because the server's CPU does not support the AVX/AVX2 instruction set, or the CPU supports it but the feature has been disabled by the administrator. Try asking your administrator to lift the restriction, or switch to a different server.
-
-References: https://github.com/opendatalab/MinerU/issues/591 , https://github.com/opendatalab/MinerU/issues/736

+ 0 - 11
next_docs/zh_cn/additional_notes/glossary.rst

@@ -1,11 +0,0 @@
-
-
-Glossary
-===========
-
-1. jsonl
-    JSON Lines: a plain-text format with one JSON object per line. MinerU uses it to reference batches of documents (see ``read_jsonl``).
-
-2. magic-pdf.json
-    MinerU's runtime configuration file, which holds settings such as ``models-dir``, the absolute path to the downloaded model files.
-

+ 0 - 13
next_docs/zh_cn/additional_notes/known_issues.rst

@@ -1,13 +0,0 @@
-Known Issues
-============
-
--  Reading order is determined by the model's spatial sorting of readable content; under extremely complex layouts, some regions may be out of order
--  Vertical text is not supported
--  Tables of contents and lists are recognized by rules; a few uncommon list formats may not be recognized
--  Only one heading level is supported; heading hierarchy is not currently distinguished
--  Code blocks are not yet supported by the layout model
--  Comic books, art albums, primary-school textbooks, and exercise workbooks cannot yet be parsed well
--  Table recognition may misidentify rows or columns in complex tables
--  OCR may produce inaccurate characters in PDFs in less common languages (e.g. diacritics in Latin script, easily confused Arabic characters)
--  Some formulas may fail to render in Markdown
-

+ 0 - 151
next_docs/zh_cn/conf.py

@@ -1,151 +0,0 @@
-# Configuration file for the Sphinx documentation builder.
-#
-# This file only contains a selection of the most common options. For a full
-# list see the documentation:
-# https://www.sphinx-doc.org/en/master/usage/configuration.html
-
-# -- Path setup --------------------------------------------------------------
-
-# If extensions (or modules to document with autodoc) are in another directory,
-# add these directories to sys.path here. If the directory is relative to the
-# documentation root, use os.path.abspath to make it absolute, like shown here.
-
-import os
-import subprocess
-import sys
-
-from sphinx.ext import autodoc
-from docutils import nodes
-from docutils.parsers.rst import Directive
-
-def install(package):
-    subprocess.check_call([sys.executable, '-m', 'pip', 'install', package])
-
-
-requirements_path = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', 'requirements.txt'))
-if os.path.exists(requirements_path):
-    with open(requirements_path) as f:
-        packages = f.readlines()
-    for package in packages:
-        install(package.strip())
-
-sys.path.insert(0, os.path.abspath('../..'))
-
-# -- Project information -----------------------------------------------------
-
-project = 'MinerU'
-copyright = '2024, MinerU Contributors'
-author = 'OpenDataLab'
-
-# The full version, including alpha/beta/rc tags
-version_file = '../../magic_pdf/libs/version.py'
-with open(version_file) as f:
-    exec(compile(f.read(), version_file, 'exec'))
-__version__ = locals()['__version__']
-# The short X.Y version
-version = __version__
-# The full version, including alpha/beta/rc tags
-release = __version__
-
-# -- General configuration ---------------------------------------------------
-
-# Add any Sphinx extension module names here, as strings. They can be
-# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
-# ones.
-extensions = [
-    'sphinx.ext.napoleon',
-    'sphinx.ext.viewcode',
-    'sphinx.ext.intersphinx',
-    'sphinx_copybutton',
-    'sphinx.ext.autodoc',
-    'sphinx.ext.autosummary',
-    'sphinx.ext.inheritance_diagram',
-    'myst_parser',
-    'sphinxarg.ext',
-    'sphinxcontrib.autodoc_pydantic',
-]
-
-# class hierarchy diagram
-inheritance_graph_attrs = dict(rankdir="LR", size='"8.0, 12.0"', fontsize=14, ratio='compress')
-inheritance_node_attrs = dict(shape='ellipse', fontsize=14, height=0.75)
-inheritance_edge_attrs = dict(arrow='vee')
-
-autodoc_pydantic_model_show_json = True
-autodoc_pydantic_model_show_config_summary = False
-
-# Add any paths that contain templates here, relative to this directory.
-templates_path = ['_templates']
-
-# List of patterns, relative to source directory, that match files and
-# directories to ignore when looking for source files.
-# This pattern also affects html_static_path and html_extra_path.
-exclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']
-
-# Exclude the prompt "$" when copying code
-copybutton_prompt_text = r'\$ '
-copybutton_prompt_is_regexp = True
-
-language = 'zh_CN'
-
-# -- Options for HTML output -------------------------------------------------
-
-# The theme to use for HTML and HTML Help pages.  See the documentation for
-# a list of builtin themes.
-#
-html_theme = 'sphinx_book_theme'
-html_logo = '_static/image/logo.png'
-html_theme_options = {
-    'path_to_docs': 'next_docs/zh_cn',
-    'repository_url': 'https://github.com/opendatalab/MinerU',
-    'use_repository_button': True,
-}
-# Add any paths that contain custom static files (such as style sheets) here,
-# relative to this directory. They are copied after the builtin static files,
-# so a file named "default.css" will overwrite the builtin "default.css".
-# html_static_path = ['_static']
-
-# Mock out external dependencies here.
-autodoc_mock_imports = [
-    'cpuinfo',
-    'torch',
-    'transformers',
-    'psutil',
-    'prometheus_client',
-    'sentencepiece',
-    'vllm.cuda_utils',
-    'vllm._C',
-    'numpy',
-    'tqdm',
-]
-
-
-class MockedClassDocumenter(autodoc.ClassDocumenter):
-    """Remove note about base class when a class is derived from object."""
-
-    def add_line(self, line: str, source: str, *lineno: int) -> None:
-        if line == '   Bases: :py:class:`object`':
-            return
-        super().add_line(line, source, *lineno)
-
-
-autodoc.ClassDocumenter = MockedClassDocumenter
-
-navigation_with_keys = False
-
-
-# add custom directive 
-
-
-class VideoDirective(Directive):
-    required_arguments = 1
-    optional_arguments = 0
-    final_argument_whitespace = True
-    option_spec = {}
-
-    def run(self):
-        url = self.arguments[0]
-        video_node = nodes.raw('', f'<iframe width="560" height="315" src="{url}" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>', format='html')
-        return [video_node]
-
-def setup(app):
-    app.add_directive('video', VideoDirective)

+ 0 - 81
next_docs/zh_cn/index.rst

@@ -1,81 +0,0 @@
-.. MinerU documentation master file, created by
-   sphinx-quickstart on Tue Jan  9 16:33:06 2024.
-   You can adapt this file completely to your liking, but it should at least
-   contain the root `toctree` directive.
-
-Welcome to the MinerU Documentation
-==============================================
-
-.. figure:: ./_static/image/logo.png
-  :align: center
-  :alt: mineru
-  :class: no-scaled-link
-
-.. raw:: html
-
-   <p style="text-align:center">
-   <strong> A one-stop, high-quality, open-source document extraction tool
-   </strong>
-   </p>
-
-   <p style="text-align:center">
-   <script async defer src="https://buttons.github.io/buttons.js"></script>
-   <a class="github-button" href="https://github.com/opendatalab/MinerU" data-show-count="true" data-size="large" aria-label="Star">Star</a>
-   <a class="github-button" href="https://github.com/opendatalab/MinerU/subscription" data-icon="octicon-eye" data-size="large" aria-label="Watch">Watch</a>
-   <a class="github-button" href="https://github.com/opendatalab/MinerU/fork" data-icon="octicon-repo-forked" data-size="large" aria-label="Fork">Fork</a>
-   </p>
-
-
-Project Introduction
---------------------
-
-MinerU is a tool that converts PDFs into machine-readable formats (such as Markdown and JSON), making it easy to extract content into any format.
-MinerU was born during the pre-training of `InternLM <https://github.com/InternLM/InternLM>`__. We focus on solving symbol-conversion problems in scientific and technical literature, hoping to contribute to technological progress in the era of large models.
-Compared with well-known commercial products, MinerU is still young. If you run into problems or the results are not as expected, please file an `issue <https://github.com/opendatalab/MinerU/issues>`__ and **attach the relevant PDF**.
-
-.. video:: https://github.com/user-attachments/assets/4bea02c9-6d54-4cd6-97ed-dff14340982c
-
-Key Features
-------------
-
--  Removes headers, footers, footnotes, page numbers, and similar elements to keep the text semantically coherent
--  Outputs text in human reading order, handling single-column, multi-column, and complex layouts
--  Preserves the structure of the original document, including headings, paragraphs, and lists
--  Extracts images, image captions, tables, table captions, and footnotes
--  Automatically recognizes formulas and converts them to LaTeX
--  Automatically recognizes tables and converts them to LaTeX or HTML
--  Automatically detects scanned and garbled PDFs and enables OCR
--  OCR supports detection and recognition of 84 languages
--  Supports multiple output formats: Markdown for multimodal and NLP use, JSON sorted in reading order, and a rich intermediate format
--  Supports multiple visualizations, including layout and span visualization, for efficient output verification and quality checking
--  Runs in both CPU and GPU environments
--  Compatible with Windows, Linux, and macOS
-
-
-User Guide
--------------
-.. toctree::
-   :maxdepth: 2
-   :caption: User Guide
-
-   user_guide
-
-
-API Reference
--------------
-This section describes functions, classes, and class methods in detail.
-
-The API reference is currently available in English only; please switch to the English documentation.
-
-
-Appendix
-------------------
-.. toctree::
-   :maxdepth: 1
-   :caption: Appendix
-
-   additional_notes/known_issues
-   additional_notes/faq
-   additional_notes/glossary
-
-

+ 0 - 35
next_docs/zh_cn/make.bat

@@ -1,35 +0,0 @@
-@ECHO OFF
-
-pushd %~dp0
-
-REM Command file for Sphinx documentation
-
-if "%SPHINXBUILD%" == "" (
-	set SPHINXBUILD=sphinx-build
-)
-set SOURCEDIR=.
-set BUILDDIR=_build
-
-%SPHINXBUILD% >NUL 2>NUL
-if errorlevel 9009 (
-	echo.
-	echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
-	echo.installed, then set the SPHINXBUILD environment variable to point
-	echo.to the full path of the 'sphinx-build' executable. Alternatively you
-	echo.may add the Sphinx directory to PATH.
-	echo.
-	echo.If you don't have Sphinx installed, grab it from
-	echo.https://www.sphinx-doc.org/
-	exit /b 1
-)
-
-if "%1" == "" goto help
-
-%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
-goto end
-
-:help
-%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%
-
-:end
-popd

+ 0 - 10
next_docs/zh_cn/user_guide.rst

@@ -1,10 +0,0 @@
-
-
-.. toctree::
-    :maxdepth: 2
-
-    user_guide/install
-    user_guide/quick_start
-    user_guide/tutorial
-    user_guide/data
-    

+ 0 - 20
next_docs/zh_cn/user_guide/data.rst

@@ -1,20 +0,0 @@
-
-
-Data
-=========
-
-.. toctree::
-   :maxdepth: 2
-   :caption: Data
-
-   data/dataset
-
-   data/read_api
-
-   data/data_reader_writer 
-
-   data/io
-
-
-
-

+ 0 - 218
next_docs/zh_cn/user_guide/data/data_reader_writer.rst

@@ -1,218 +0,0 @@
-
-Data Readers and Writers
-=========================
-
-These classes read bytes from, or write bytes to, different media. If MinerU does not provide a suitable class, you can implement a new one for your own scenario. Implementing a new class is easy; the only requirement is to inherit from DataReader or DataWriter.
-
-.. code:: python
-
-    class SomeReader(DataReader):
-        def read(self, path: str) -> bytes:
-            pass
-
-        def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
-            pass
-
-
-    class SomeWriter(DataWriter):
-        def write(self, path: str, data: bytes) -> None:
-            pass
-
-        def write_string(self, path: str, data: str) -> None:
-            pass
-
-Readers may wonder how this section differs from io. At first glance the two look very similar: io provides the basic functionality, while this section is more application-oriented. Users can build their own classes for specific application needs, and those classes may share the same underlying IO functionality. That is why io exists as a separate layer.
-
-Important Classes
------------------
-.. code:: python
-
-    class FileBasedDataReader(DataReader):
-        def __init__(self, parent_dir: str = ''):
-            pass
-
-
-    class FileBasedDataWriter(DataWriter):
-        def __init__(self, parent_dir: str = '') -> None:
-            pass
-
-FileBasedDataReader is initialized with a single argument, parent_dir. Every method provided by FileBasedDataReader then behaves as follows:
-
-#. Reading from an absolute file path ignores parent_dir.
-#. Reading from a relative path first joins the path with parent_dir, then reads from the combined path.
-
-.. note::
-
-    `FileBasedDataWriter` behaves the same way as `FileBasedDataReader`.
-
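The two resolution rules above can be sketched in a few lines (a minimal illustration; `resolve` is a hypothetical helper, not MinerU API):

```python
import os

def resolve(parent_dir: str, path: str) -> str:
    # Rule 1: absolute paths ignore parent_dir
    if os.path.isabs(path):
        return path
    # Rule 2: relative paths are first joined with parent_dir
    return os.path.join(parent_dir, path)
```

For example, `resolve('/tmp', 'abc')` yields `/tmp/abc`, while an absolute input such as `/tmp/logs/message.txt` is returned unchanged, matching the read examples later in this file.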
-.. code:: python
-
-    class MultiS3Mixin:
-        def __init__(self, default_prefix: str, s3_configs: list[S3Config]):
-            pass
-
-    class MultiBucketS3DataReader(DataReader, MultiS3Mixin):
-        pass
-
-All read-related methods provided by MultiBucketS3DataReader behave as follows:
-
-#. Reading from a full S3 path, such as s3://test_bucket/test_object, ignores default_prefix.
-#. Reading from a relative path first joins the path with default_prefix (with bucket_name removed), then reads the content. bucket_name is the first element of default_prefix when split on the separator /.
-
-.. note::
-    MultiBucketS3DataWriter behaves similarly to MultiBucketS3DataReader.
-
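The relative-path rule can be sketched like this (illustrative only; `split_prefix` and `resolve_s3` are hypothetical helpers, not MinerU API):

```python
def split_prefix(default_prefix: str) -> tuple:
    # bucket_name is the first element of default_prefix split on '/';
    # the remainder is the key prefix joined to relative paths
    bucket_name, _, key_prefix = default_prefix.partition("/")
    return bucket_name, key_prefix

def resolve_s3(default_prefix: str, path: str) -> str:
    # Full s3:// paths are used as-is; relative paths get the default prefix
    if path.startswith("s3://"):
        return path
    bucket, key_prefix = split_prefix(default_prefix)
    return f"s3://{bucket}/{key_prefix}/{path}"
```

Under `default_prefix = "bucket/test/unittest"`, a relative read of `abc` resolves to `s3://bucket/test/unittest/abc`, as in the read examples below.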
-.. code:: python
-
-    class S3DataReader(MultiBucketS3DataReader):
-        pass
-
-S3DataReader is built on top of MultiBucketS3DataReader but supports only a single bucket; the same applies to S3DataWriter.
-
-Read Examples
--------------
-.. code:: python
-
-    import os 
-    from magic_pdf.data.data_reader_writer import *
-    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
-    from magic_pdf.data.schemas import S3Config
-
-    # initialize the readers
-    file_based_reader1 = FileBasedDataReader('')
-
-    ## read local file abc
-    file_based_reader1.read('abc')
-
-    file_based_reader2 = FileBasedDataReader('/tmp')
-
-    ## read local file /tmp/abc
-    file_based_reader2.read('abc')
-
-    ## read local file /tmp/logs/message.txt
-    file_based_reader2.read('/tmp/logs/message.txt')
-
-    # initialize the multi-bucket s3 reader
-    bucket = "bucket"               # replace with a valid bucket
-    ak = "ak"                       # replace with a valid access key
-    sk = "sk"                       # replace with a valid secret key
-    endpoint_url = "endpoint_url"   # replace with a valid endpoint_url
-
-    bucket_2 = "bucket_2"               # replace with a valid bucket
-    ak_2 = "ak_2"                       # replace with a valid access key
-    sk_2 = "sk_2"                       # replace with a valid secret key
-    endpoint_url_2 = "endpoint_url_2"   # replace with a valid endpoint_url
-
-    test_prefix = 'test/unittest'
-    multi_bucket_s3_reader1 = MultiBucketS3DataReader(f"{bucket}/{test_prefix}", [S3Config(
-            bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
-        ),
-        S3Config(
-            bucket_name=bucket_2,
-            access_key=ak_2,
-            secret_key=sk_2,
-            endpoint_url=endpoint_url_2,
-        )])
-
-    ## read file s3://{bucket}/{test_prefix}/abc
-    multi_bucket_s3_reader1.read('abc')
-
-    ## read file s3://{bucket}/{test_prefix}/efg
-    multi_bucket_s3_reader1.read(f's3://{bucket}/{test_prefix}/efg')
-
-    ## read file s3://{bucket_2}/{test_prefix}/abc
-    multi_bucket_s3_reader1.read(f's3://{bucket_2}/{test_prefix}/abc')
-
-    # initialize the single-bucket s3 reader
-    s3_reader1 = S3DataReader(
-        test_prefix,
-        bucket,
-        ak,
-        sk,
-        endpoint_url
-    )
-
-    ## read file s3://{bucket}/{test_prefix}/abc
-    s3_reader1.read('abc')
-
-    ## read file s3://{bucket}/efg
-    s3_reader1.read(f's3://{bucket}/efg')
-
-
-Write Examples
---------------
-.. code:: python
-
-    import os
-    from magic_pdf.data.data_reader_writer import *
-    from magic_pdf.data.data_reader_writer import MultiBucketS3DataWriter
-    from magic_pdf.data.schemas import S3Config
-
-    # initialize the writers
-    file_based_writer1 = FileBasedDataWriter("")
-
-    ## write "123" to abc
-    file_based_writer1.write("abc", "123".encode())
-
-    ## write "123" to abc
-    file_based_writer1.write_string("abc", "123")
-
-    file_based_writer2 = FileBasedDataWriter("/tmp")
-
-    ## write "123" to /tmp/abc
-    file_based_writer2.write_string("abc", "123")
-
-    ## write "123" to /tmp/logs/message.txt
-    file_based_writer2.write_string("/tmp/logs/message.txt", "123")
-
-    # initialize the multi-bucket s3 writer
-    bucket = "bucket"               # replace with a valid bucket
-    ak = "ak"                       # replace with a valid access key
-    sk = "sk"                       # replace with a valid secret key
-    endpoint_url = "endpoint_url"   # replace with a valid endpoint_url
-
-    bucket_2 = "bucket_2"               # replace with a valid bucket
-    ak_2 = "ak_2"                       # replace with a valid access key
-    sk_2 = "sk_2"                       # replace with a valid secret key
-    endpoint_url_2 = "endpoint_url_2"   # replace with a valid endpoint_url
-
-    test_prefix = "test/unittest"
-    multi_bucket_s3_writer1 = MultiBucketS3DataWriter(
-        f"{bucket}/{test_prefix}",
-        [
-            S3Config(
-                bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
-            ),
-            S3Config(
-                bucket_name=bucket_2,
-                access_key=ak_2,
-                secret_key=sk_2,
-                endpoint_url=endpoint_url_2,
-            ),
-        ],
-    )
-
-    ## write "123" to s3://{bucket}/{test_prefix}/abc
-    multi_bucket_s3_writer1.write_string("abc", "123")
-
-    ## write "123" to s3://{bucket}/{test_prefix}/abc
-    multi_bucket_s3_writer1.write("abc", "123".encode())
-
-    ## write "123" to s3://{bucket}/{test_prefix}/efg
-    multi_bucket_s3_writer1.write(f"s3://{bucket}/{test_prefix}/efg", "123".encode())
-
-    ## write "123" to s3://{bucket_2}/{test_prefix}/abc
-    multi_bucket_s3_writer1.write(f's3://{bucket_2}/{test_prefix}/abc', '123'.encode())
-
-    # initialize the single-bucket s3 writer
-    s3_writer1 = S3DataWriter(test_prefix, bucket, ak, sk, endpoint_url)
-
-    ## write "123" to s3://{bucket}/{test_prefix}/abc
-    s3_writer1.write("abc", "123".encode())
-
-    ## write "123" to s3://{bucket}/{test_prefix}/abc
-    s3_writer1.write_string("abc", "123")
-
-    ## write "123" to s3://{bucket}/efg
-    s3_writer1.write(f"s3://{bucket}/efg", "123".encode())
-

+ 0 - 31
next_docs/zh_cn/user_guide/data/dataset.rst

@@ -1,31 +0,0 @@
-
-Dataset
-========
-
-Data Classes
--------------
-
-Dataset
-^^^^^^^^
-
-Each PDF or image forms a Dataset. PDFs fall into two categories; see the :ref:`TXT <digital_method_section>` and :ref:`OCR <ocr_method_section>` method sections. From an image you get an ImageDataset, a subclass of Dataset; from a PDF file you get a PymuDocDataset. The difference between them is that ImageDataset supports only the OCR parsing method, while PymuDocDataset supports both OCR and TXT.
-
-.. note::
-
-    In practice, some PDFs may be generated from images, which means they do not support the `TXT` method. For now, it is the user's responsibility not to call the `TXT` method on PDFs generated from images.
-
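The relationship described above can be sketched as a tiny class hierarchy (an illustration only; the real MinerU classes carry far more state and behavior):

```python
class Dataset:
    """Base class: a parsed document plus the parse methods it supports."""
    supported_methods: set = set()

class PymuDocDataset(Dataset):
    # Built from a PDF: both TXT and OCR parsing are available
    supported_methods = {"txt", "ocr"}

class ImageDataset(Dataset):
    # Built from an image: only OCR parsing is available
    supported_methods = {"ocr"}
```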
-PDF Parsing Methods
--------------------
-
-.. _ocr_method_section:
-
-OCR
-^^^^
-Extract characters using optical character recognition.
-
-.. _digital_method_section:
-
-TXT
-^^^^^^^^
-Extract characters using a third-party library; currently we use pymupdf.
-

+ 0 - 21
next_docs/zh_cn/user_guide/data/io.rst

@@ -1,21 +0,0 @@
-
-
-IO
-====
-
-These classes read bytes from, or write bytes to, different media. We currently provide S3Reader and S3Writer for AWS S3-compatible storage, and HttpReader and HttpWriter for remote HTTP files. If MinerU does not provide a suitable class, you can implement a new one for your own scenario. Implementing a new class is easy; the only requirement is to inherit from IOReader or IOWriter.
-
-.. code:: python
-
-    class SomeReader(IOReader):
-        def read(self, path: str) -> bytes:
-            pass
-
-        def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
-            pass
-
-
-    class SomeWriter(IOWriter):
-        def write(self, path: str, data: bytes) -> None:
-            pass
-        
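As a sketch of what an HTTP-backed reader might look like (an illustration only, not MinerU's actual `HttpReader`; the naive `read_at` downloads the whole file and slices it client-side):

```python
import urllib.request

class SimpleHttpReader:
    """Illustrative HTTP reader; the real HttpReader differs."""

    def read(self, path: str) -> bytes:
        # Fetch the entire remote file as bytes
        with urllib.request.urlopen(path) as resp:
            return resp.read()

    def read_at(self, path: str, offset: int = 0, limit: int = -1) -> bytes:
        # Naive range read for illustration: fetch everything, then slice
        data = self.read(path)
        return data[offset:] if limit == -1 else data[offset:offset + limit]
```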

+ 0 - 82
next_docs/zh_cn/user_guide/data/read_api.rst

@@ -1,82 +0,0 @@
-
-
-read_api
-=========
-
-Read content from files or directories to create a Dataset. We currently provide several functions that cover common scenarios. If you have a new scenario that most users would encounter, post a detailed description on the official GitHub issues page; implementing your own read-related function is also straightforward.
-
-Important Functions
--------------------
-
-read_jsonl
-^^^^^^^^^^^^^^^^
-
-Read content from a JSONL file on the local machine or on remote S3. To learn more about JSONL, see :doc:`../../additional_notes/glossary`.
-
-.. code:: python
-
-    from magic_pdf.data.read_api import *
-    from magic_pdf.data.data_reader_writer import MultiBucketS3DataReader
-    from magic_pdf.data.schemas import S3Config
-
-    # read a local jsonl file
-    datasets = read_jsonl("tt.jsonl", None)   # replace with a valid file
-
-    # read a jsonl file on s3
-
-    bucket = "bucket_1"                     # replace with a valid s3 bucket
-    ak = "access_key_1"                     # replace with a valid s3 access key
-    sk = "secret_key_1"                     # replace with a valid s3 secret key
-    endpoint_url = "endpoint_url_1"         # replace with a valid s3 endpoint url
-
-    bucket_2 = "bucket_2"                   # replace with a valid s3 bucket
-    ak_2 = "access_key_2"                   # replace with a valid s3 access key
-    sk_2 = "secret_key_2"                   # replace with a valid s3 secret key
-    endpoint_url_2 = "endpoint_url_2"       # replace with a valid s3 endpoint url
-
-    s3configs = [
-        S3Config(
-            bucket_name=bucket, access_key=ak, secret_key=sk, endpoint_url=endpoint_url
-        ),
-        S3Config(
-            bucket_name=bucket_2,
-            access_key=ak_2,
-            secret_key=sk_2,
-            endpoint_url=endpoint_url_2,
-        ),
-    ]
-
-    s3_reader = MultiBucketS3DataReader(bucket, s3configs)
-
-    datasets = read_jsonl("s3://bucket_1/tt.jsonl", s3_reader)  # replace with a valid s3 jsonl file
-
-
-read_local_pdfs
-^^^^^^^^^^^^^^^^
-
-Read PDF files from a path or a directory.
-
-.. code:: python
-
-    from magic_pdf.data.read_api import *
-
-    # read a PDF from a path
-    datasets = read_local_pdfs("tt.pdf")  # replace with a valid file
-
-    # read PDF files under a directory
-    datasets = read_local_pdfs("pdfs/")   # replace with a valid directory
-
-read_local_images
-^^^^^^^^^^^^^^^^^^^
-
-Read images from a path or a directory.
-
-.. code:: python
-
-    from magic_pdf.data.read_api import *
-
-    # read from an image path
-    datasets = read_local_images("tt.png")  # replace with a valid file
-
-    # read files from a directory whose names end with one of the given suffixes
-    datasets = read_local_images("images/", suffixes=[".png", ".jpg"])  # replace with a valid directory

+ 0 - 13
next_docs/zh_cn/user_guide/install.rst

@@ -1,13 +0,0 @@
-
-Installation
-==============
-
-.. toctree::
-   :maxdepth: 1
-   :caption: Installation
-
-   install/install
-   install/boost_with_cuda
-   install/download_model_weight_files
-
-

+ 0 - 272
next_docs/zh_cn/user_guide/install/boost_with_cuda.rst

@@ -1,272 +0,0 @@
-Boost with CUDA
-================
-
-If your device supports CUDA and meets the GPU requirements of the mainline environment, you can use GPU acceleration. Choose the guide that matches your system:
-
--  :ref:`ubuntu_22_04_lts_section`
--  :ref:`windows_10_or_11_section`
--  Quick deployment with Docker
- 
-.. admonition:: Important
-    :class: tip
-
-    Docker 需要至少 6GB 显存的 GPU,并且所有加速功能默认启用。
-   
-    在运行此 Docker 容器之前,您可以使用以下命令检查您的设备是否支持 Docker 上的 CUDA 加速。
-
-    .. code-block:: sh
-
-      docker run --rm --gpus=all nvidia/cuda:12.1.0-base-ubuntu22.04 nvidia-smi
-
-.. code:: sh
-
-      wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/docker/china/Dockerfile -O Dockerfile
-      docker build -t mineru:latest .
-      docker run -it --name mineru --gpus=all mineru:latest /bin/bash -c "echo 'source /opt/mineru_venv/bin/activate' >> ~/.bashrc && exec bash"
-      magic-pdf --help
-
-
-.. _ubuntu_22_04_lts_section:
-
-Ubuntu 22.04 LTS
-----------------
-
-1. Check whether the NVIDIA driver is installed
------------------------------------------------
-
-.. code:: bash
-
-   nvidia-smi
-
-If you see output similar to the following, the NVIDIA driver is already installed and you can skip step 2.
-
-.. admonition:: Important
-    :class: tip
-
-    The version shown for ``CUDA Version`` should be >= 12.4; if the displayed version is lower than 12.4, please upgrade the driver.
-
-.. code:: text
-
-   +---------------------------------------------------------------------------------------+
-   | NVIDIA-SMI 570.133.07             Driver Version: 572.83         CUDA Version: 12.8   |
-   |-----------------------------------------+----------------------+----------------------+
-   | GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |
-   | Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
-   |                                         |                      |               MIG M. |
-   |=========================================+======================+======================|
-   |   0  NVIDIA GeForce RTX 3060 Ti   WDDM  | 00000000:01:00.0  On |                  N/A |
-   |  0%   51C    P8              12W / 200W |   1489MiB /  8192MiB |      5%      Default |
-   |                                         |                      |                  N/A |
-   +-----------------------------------------+----------------------+----------------------+
-
-2. Install the driver
----------------------
-
-If no driver is present, install the proprietary driver with the following commands:
-
-.. code:: bash
-
-   sudo apt-get update
-   sudo apt-get install nvidia-driver-570-server
-
-After the installation completes, reboot the machine:
-
-.. code:: bash
-
-   reboot
-
-3. Install Anaconda
--------------------
-
-If conda is already installed, you can skip this step.
-
-.. code:: bash
-
-   wget -U NoSuchBrowser/1.0 https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Linux-x86_64.sh
-   bash Anaconda3-2024.06-1-Linux-x86_64.sh
-
-Enter "yes" at the final prompt, then close and reopen the terminal.
-
-4. Create an environment with conda
------------------------------------
-
-.. code:: bash
-
-   conda create -n mineru 'python<3.13' -y
-   conda activate mineru
-
-5. Install the application
---------------------------
-
-.. code:: bash
-
-   pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
-
-.. admonition:: Important
-    :class: tip
-
-    After the download completes, be sure to confirm that the magic-pdf version is correct with the following command:
-
-    .. code:: bash
-
-       magic-pdf --version
-
-    If the version number is lower than 1.3.0, please report it to us in an issue.
-
-6. Download the models
-----------------------
-
-See :doc:`download_model_weight_files` for details.
-
-7. Location of the configuration file
--------------------------------------
-
-After completing step 6 (Download the models), the script automatically generates a magic-pdf.json file in the user directory and configures the default model paths. You can find the magic-pdf.json file in your user directory.
-
-.. admonition:: Tip
-    :class: tip
-
-    On Linux the user directory is "/home/<username>".
-
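For readers who script around MinerU, this configuration can also be inspected programmatically; a minimal sketch, where ``get_device_mode`` and the inline sample are hypothetical illustrations, not part of magic-pdf:

```python
import json
import os

def get_device_mode(config: dict) -> str:
    # Hypothetical helper: return the configured device mode, defaulting to "cpu".
    return config.get("device-mode", "cpu")

# magic-pdf.json lives in the user directory, e.g. /home/<username>/magic-pdf.json
config_path = os.path.join(os.path.expanduser("~"), "magic-pdf.json")

# Demonstrate with an inline sample rather than reading a file that may not exist:
sample = json.loads('{"device-mode": "cuda"}')
print(get_device_mode(sample))  # cuda
print(get_device_mode({}))      # cpu
```

In a real script you would `json.load` the file at `config_path` instead of using the inline sample.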
-8. First run
-------------
-
-Download a sample file from the repository and test it:
-
-.. code:: bash
-
-   wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/demo/pdfs/small_ocr.pdf
-   magic-pdf -p small_ocr.pdf -o ./output
-
-9. Test CUDA acceleration
--------------------------
-
-If your GPU has at least **8GB** of VRAM, you can follow the steps below to test CUDA-accelerated parsing.
-
-**1. Modify the value of "device-mode" in the magic-pdf.json configuration file in your user directory**
-
-.. code:: json
-
-   {
-     "device-mode": "cuda"
-   }
-
-**2. Run the following command to test CUDA acceleration**
-
-.. code:: bash
-
-   magic-pdf -p small_ocr.pdf -o ./output
-
-
-.. admonition:: Tip
-    :class: tip
-
-    You can roughly judge whether CUDA acceleration is working from the per-stage timings printed in the log; CUDA should normally be faster than CPU.
-
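If you prefer to measure stage timings yourself instead of reading them from the log, a generic timing helper can be wrapped around any call; this is a sketch that assumes nothing about magic-pdf internals:

```python
import time
from contextlib import contextmanager

@contextmanager
def stage_timer(name: str, sink: dict):
    # Record the wall-clock duration of the enclosed block under `name`.
    start = time.perf_counter()
    try:
        yield
    finally:
        sink[name] = time.perf_counter() - start

timings = {}
with stage_timer("demo-stage", timings):
    sum(range(100000))  # stand-in for a parsing stage
print(sorted(timings))  # ['demo-stage']
```

Running the same block once in "cpu" mode and once in "cuda" mode and comparing the recorded durations gives a rough acceleration estimate.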
-
-
-.. _windows_10_or_11_section:
-
-Windows 10/11
--------------
-
-1. Install CUDA and cuDNN
--------------------------
-
-Install a CUDA version supported by torch; torch currently supports 11.8/12.4/12.6:
-
-- CUDA 11.8 https://developer.nvidia.com/cuda-11-8-0-download-archive
-- CUDA 12.4 https://developer.nvidia.com/cuda-12-4-0-download-archive
-- CUDA 12.6 https://developer.nvidia.com/cuda-12-6-0-download-archive
-
-
-2. Install Anaconda
--------------------
-
-If conda is already installed, you can skip this step.
-
-Download link: https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-2024.06-1-Windows-x86_64.exe
-
-3. Create an environment with conda
------------------------------------
-
-.. code:: bash
-
-    conda create -n mineru 'python<3.13' -y
-    conda activate mineru
-
-4. Install the application
---------------------------
-
-.. code:: bash
-
-   pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
-
-.. admonition:: Important
-    :class: tip
-
-    After the download completes, be sure to confirm that the magic-pdf version is correct with the following command:
-
-    .. code:: bash
-
-      magic-pdf --version
-
-    If the version number is lower than 1.3.0, please report it to us in an issue.
-
-5. Download the models
-----------------------
-
-See :doc:`download_model_weight_files` for details.
-
-6. Location of the configuration file
--------------------------------------
-
-After completing step 5 (Download the models), the script automatically generates a magic-pdf.json file in the user directory and configures the default model paths. You can find the magic-pdf.json file in your user directory.
-
-.. admonition:: Tip
-    :class: tip
-
-    On Windows the user directory is "C:/Users/<username>".
-
-7. First run
-------------
-
-Download a sample file from the repository and test it:
-
-.. code:: powershell
-
-    wget https://github.com/opendatalab/MinerU/raw/master/demo/pdfs/small_ocr.pdf -O small_ocr.pdf
-    magic-pdf -p small_ocr.pdf -o ./output
-
-8. Test CUDA acceleration
--------------------------
-
-If your GPU has at least **8GB** of VRAM, you can follow the steps below to test CUDA-accelerated parsing.
-
-**1. Force-reinstall the CUDA-enabled torch and torchvision** (choose the index-url that matches your CUDA version; see the `torch website <https://pytorch.org/get-started/locally/>`__ for details)
-
-.. code:: bash
-
-   pip install --force-reinstall torch==2.6.0 torchvision==0.21.1 "numpy<2.0.0" --index-url https://download.pytorch.org/whl/cu124
-
-**2. Modify the value of "device-mode" in the magic-pdf.json configuration file in your user directory**
-
-.. code:: json
-
-   {
-     "device-mode": "cuda"
-   }
-
-**3. Run the following command to test CUDA acceleration**
-
-.. code:: bash
-
-   magic-pdf -p small_ocr.pdf -o ./output
-
-.. admonition:: Tip
-    :class: tip
-
-    You can roughly judge whether CUDA acceleration is working from the per-stage timings printed in the log; CUDA should normally be faster than CPU.
-

+ 0 - 64
next_docs/zh_cn/user_guide/install/download_model_weight_files.rst

@@ -1,64 +0,0 @@
-Download Model Weight Files
-============================
-
-Model downloads fall into two cases: an initial download and an update of an existing model directory. Refer to the corresponding section for instructions.
-
-First-time model download
--------------------------
-
-Model files can be downloaded from Hugging Face or ModelScope. Due to network conditions, users in mainland China may fail to reach HF; in that case, please use ModelScope.
-
-
-Method 1: download models from Hugging Face
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Use a python script to download the model files from Hugging Face:
-
-.. code:: bash
-
-   pip install huggingface_hub
-   wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models_hf.py -O download_models_hf.py
-   python download_models_hf.py
-
-The python script automatically downloads the model files and configures the model directory in the configuration file.
-
-Method 2: download models from ModelScope
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-Use a python script to download the model files from ModelScope:
-
-.. code:: bash
-
-   pip install modelscope
-   wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models.py -O download_models.py
-   python download_models.py
-
-The python script automatically downloads the model files and configures the model directory in the configuration file.
-
-The configuration file is located in the user directory and is named ``magic-pdf.json``.
-
-.. admonition:: Tip
-    :class: tip
-
-    On Windows the user directory is "C:\Users\<username>"; on Linux it is "/home/<username>"; on macOS it is "/Users/<username>".
-
-Updating previously downloaded models
--------------------------------------
-
-1. Models previously downloaded via git lfs
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. admonition:: Important
-    :class: tip
-
-    Some users have reported incomplete downloads and corrupted model files when downloading via git lfs, so this method is no longer recommended.
-
-    From version 0.9.x onward, because PDF-Extract-Kit 1.0 moved to a new repository and added a layout ordering model, the models cannot be updated with ``git pull``; use the one-step python update script instead.
-
-For magic-pdf <= 0.8.1, if you previously downloaded the model files via git lfs, you can go to the original download directory and update the models with ``git pull``.
-
-2. Models previously downloaded from Hugging Face or ModelScope
-^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-If you previously downloaded models from Hugging Face or ModelScope, you can re-run the same model download python script; it will automatically update the model directory to the latest version.

+ 0 - 103
next_docs/zh_cn/user_guide/install/install.rst

@@ -1,103 +0,0 @@
-
-Installation
-=============
-
-If you run into any installation issues, please consult :doc:`../../additional_notes/faq` first. If parsing results are not as expected, see :doc:`../../additional_notes/known_issues`.
-
-.. admonition:: Warning
-    :class: tip
-
-    **Read before installing: supported software and hardware environments**
-
-    To ensure the stability and reliability of the project, we only optimize and test against specific software and hardware environments during development. This way, users who deploy and run the project on the recommended system configurations get the best performance with the fewest compatibility issues.
-
-    By focusing our resources on the mainline environment, the team can fix potential bugs more efficiently and develop new features in a timely manner.
-
-    In non-mainline environments, due to the diversity of hardware and software configurations and compatibility issues with third-party dependencies, we cannot guarantee 100% availability of the project. Therefore, if you wish to use this project in a non-recommended environment, we suggest reading the documentation and :doc:`../../additional_notes/faq` carefully first; most issues already have solutions in the :doc:`../../additional_notes/faq`. Beyond that, we encourage community feedback so that we can gradually expand the supported range.
-
-.. raw:: html
-
-    <style>
-        table, th, td {
-        border: 1px solid black;
-        border-collapse: collapse;
-        }
-    </style>
-    <table>
-    <tr>
-        <td colspan="3" rowspan="2">Operating system</td>
-    </tr>
-    <tr>
-        <td>Linux after 2019</td>
-        <td>Windows 10 / 11</td>
-        <td>macOS 11+</td>
-    </tr>
-    <tr>
-        <td colspan="3">CPU</td>
-        <td>x86_64 / arm64</td>
-        <td>x86_64 (ARM Windows not currently supported)</td>
-        <td>x86_64 / arm64</td>
-    </tr>
-    <tr>
-        <td colspan="3">Memory</td>
-        <td colspan="3">16GB or more, 32GB+ recommended</td>
-    </tr>
-    <tr>
-        <td colspan="3">Disk space</td>
-        <td colspan="3">20GB or more, SSD recommended for best performance</td>
-    </tr>
-    <tr>
-        <td colspan="3">Python version</td>
-        <td colspan="3">>=3.9, <=3.12</td>
-    </tr>
-    <tr>
-        <td colspan="3">NVIDIA driver version</td>
-        <td>latest (proprietary driver)</td>
-        <td>latest</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td colspan="3">CUDA environment</td>
-        <td>11.8/12.4/12.6</td>
-        <td>11.8/12.4/12.6</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td colspan="3">CANN environment (NPU support)</td>
-        <td>8.0+ (Ascend 910b)</td>
-        <td>None</td>
-        <td>None</td>
-    </tr>
-    <tr>
-        <td rowspan="2">Supported GPU/MPS hardware</td>
-        <td colspan="2">6GB+ VRAM</td>
-        <td colspan="2">
-        All GPUs with Tensor Cores produced since Volta (2017) <br>
-        with 6GB+ VRAM</td>
-        <td rowspan="2">Apple silicon</td>
-    </tr>
-    </table>
-
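The Python version constraint in the table above (>=3.9, <=3.12) can be checked up front before installing; a minimal sketch, where ``is_supported_python`` is a hypothetical helper, not a magic-pdf API:

```python
import sys

def is_supported_python(version_info=sys.version_info) -> bool:
    # Check the interpreter against the supported range: >= 3.9 and < 3.13.
    return (3, 9) <= (version_info[0], version_info[1]) < (3, 13)

print(is_supported_python((3, 12, 0)))  # True
print(is_supported_python((3, 13, 0)))  # False
```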
-
-Create an environment
-~~~~~~~~~~~~~~~~~~~~~
-
-.. code-block:: shell
-
-    conda create -n mineru 'python<3.13' -y
-    conda activate mineru
-    pip install -U "magic-pdf[full]" -i https://mirrors.aliyun.com/pypi/simple
-
-
-Download model weight files
-~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-.. code-block:: shell
-
-    pip install huggingface_hub
-    wget https://gcore.jsdelivr.net/gh/opendatalab/MinerU@master/scripts/download_models_hf.py -O download_models_hf.py
-    python download_models_hf.py
-
-
-MinerU is now installed. See :doc:`../quick_start`, or read :doc:`boost_with_cuda` to speed up inference.
-

+ 0 - 13
next_docs/zh_cn/user_guide/quick_start.rst

@@ -1,13 +0,0 @@
-
-Quick Start
-==============
-
-Start here to learn the basics of using MinerU. If you have not installed it yet, follow the installation documentation first.
-
-.. toctree::
-    :maxdepth: 1
-    :caption: Quick Start
-
-    quick_start/command_line
-    quick_start/to_markdown
-

+ 0 - 61
next_docs/zh_cn/user_guide/quick_start/command_line.rst

@@ -1,61 +0,0 @@
-
-
-Command Line
-=============
-
-.. code:: bash
-
-   magic-pdf --help
-   Usage: magic-pdf [OPTIONS]
-
-   Options:
-     -v, --version                display the version and exit
-     -p, --path PATH              local pdf filepath or directory  [required]
-     -o, --output-dir PATH        output local directory  [required]
-     -m, --method [ocr|txt|auto]  the method for parsing pdf. ocr: using ocr
-                                  technique to extract information from pdf. txt:
-                                  suitable for the text-based pdf only and
-                                  outperform ocr. auto: automatically choose the
-                                  best method for parsing pdf from ocr and txt.
-                                  without method specified, auto will be used by
-                                  default.
-     -l, --lang TEXT              Input the languages in the pdf (if known) to
-                                  improve OCR accuracy.  Optional. You should
-                                  input "Abbreviation" with language form url: ht
-                                  tps://paddlepaddle.github.io/PaddleOCR/en/ppocr
-                                  /blog/multi_languages.html#5-support-languages-
-                                  and-abbreviations
-     -d, --debug BOOLEAN          Enables detailed debugging information during
-                                  the execution of the CLI commands.
-     -s, --start INTEGER          The starting page for PDF parsing, beginning
-                                  from 0.
-     -e, --end INTEGER            The ending page for PDF parsing, beginning from
-                                  0.
-     --help                       Show this message and exit.
-
-
-   ## show version
-   magic-pdf -v
-
-   ## command line example
-   magic-pdf -p {some_pdf} -o {some_output_dir} -m auto
-
-``{some_pdf}`` can be a single PDF file or a directory containing multiple PDF files. The parsed result files are stored under the directory ``{some_output_dir}``. The generated result files are listed below:
-
-.. code:: text
-
-   ├── some_pdf.md                          # markdown file
-   ├── images                               # directory storing images
-   ├── some_pdf_layout.pdf                  # layout drawing (with layout reading order)
-   ├── some_pdf_middle.json                 # minerU intermediate processing result
-   ├── some_pdf_model.json                  # model inference result
-   ├── some_pdf_origin.pdf                  # original pdf file
-   ├── some_pdf_spans.pdf                   # drawing of the finest-grained bbox positions
-   └── some_pdf_content_list.json           # rich-text json in reading order
-
-
-.. admonition:: Tip
-   :class: tip
-
-   For more information about the result files, see :doc:`../tutorial/output_file_description`
-
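The result files listed above follow a fixed naming pattern derived from the input PDF's stem; as a small sketch (the ``expected_outputs`` helper is hypothetical, reconstructed only from the listing above):

```python
def expected_outputs(stem: str) -> list:
    # Build the result-file names magic-pdf produces for a given PDF stem.
    return [
        f"{stem}.md",
        "images",
        f"{stem}_layout.pdf",
        f"{stem}_middle.json",
        f"{stem}_model.json",
        f"{stem}_origin.pdf",
        f"{stem}_spans.pdf",
        f"{stem}_content_list.json",
    ]

print(expected_outputs("some_pdf")[0])  # some_pdf.md
```

A helper like this can be used after a batch run to check that every input PDF produced its full set of outputs.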

+ 0 - 134
next_docs/zh_cn/user_guide/quick_start/to_markdown.rst

@@ -1,134 +0,0 @@
-
-Convert To Markdown
-========================
-
-Local file example
-^^^^^^^^^^^^^^^^^^
-
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import FileBasedDataWriter, FileBasedDataReader
-    from magic_pdf.data.dataset import PymuDocDataset
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-    from magic_pdf.config.enums import SupportedPdfParseMethod
-
-    # args
-    pdf_file_name = "abc.pdf"  # replace with the real pdf path
-    name_without_suff = pdf_file_name.split(".")[0]
-
-    # prepare env
-    local_image_dir, local_md_dir = "output/images", "output"
-    image_dir = str(os.path.basename(local_image_dir))
-
-    os.makedirs(local_image_dir, exist_ok=True)
-
-    image_writer, md_writer = FileBasedDataWriter(local_image_dir), FileBasedDataWriter(
-        local_md_dir
-    )
-
-    # read bytes
-    reader1 = FileBasedDataReader("")
-    pdf_bytes = reader1.read(pdf_file_name)  # read the pdf content
-
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-
-    ## inference
-    if ds.classify() == SupportedPdfParseMethod.OCR:
-        infer_result = ds.apply(doc_analyze, ocr=True)
-
-        ## pipeline
-        pipe_result = infer_result.pipe_ocr_mode(image_writer)
-
-    else:
-        infer_result = ds.apply(doc_analyze, ocr=False)
-
-        ## pipeline
-        pipe_result = infer_result.pipe_txt_mode(image_writer)
-
-    ### draw model result on each page
-    infer_result.draw_model(os.path.join(local_md_dir, f"{name_without_suff}_model.pdf"))
-
-    ### draw layout result on each page
-    pipe_result.draw_layout(os.path.join(local_md_dir, f"{name_without_suff}_layout.pdf"))
-
-    ### draw spans result on each page
-    pipe_result.draw_span(os.path.join(local_md_dir, f"{name_without_suff}_spans.pdf"))
-
-    ### dump markdown
-    pipe_result.dump_md(md_writer, f"{name_without_suff}.md", image_dir)
-
-    ### dump content list
-    pipe_result.dump_content_list(md_writer, f"{name_without_suff}_content_list.json", image_dir)
-
-
-Object storage file example
-^^^^^^^^^^^^^^^^^^^^^^^^^^^
-
-.. code:: python
-
-    import os
-
-    from magic_pdf.data.data_reader_writer import S3DataReader, S3DataWriter
-    from magic_pdf.data.dataset import PymuDocDataset
-    from magic_pdf.model.doc_analyze_by_custom_model import doc_analyze
-
-    bucket_name = "{Your S3 Bucket Name}"  # replace with real bucket name
-    ak = "{Your S3 access key}"  # replace with real s3 access key
-    sk = "{Your S3 secret key}"  # replace with real s3 secret key
-    endpoint_url = "{Your S3 endpoint_url}"  # replace with real s3 endpoint_url
-
-
-    reader = S3DataReader('unittest/tmp/', bucket_name, ak, sk, endpoint_url)  # replace `unittest/tmp` with the real s3 prefix
-    writer = S3DataWriter('unittest/tmp', bucket_name, ak, sk, endpoint_url)
-    image_writer = S3DataWriter('unittest/tmp/images', bucket_name, ak, sk, endpoint_url)
-
-    # args
-    pdf_file_name = (
-        "s3://llm-pdf-text-1/unittest/tmp/bug5-11.pdf"  # replace with the real s3 path
-    )
-
-    # prepare env
-    local_dir = "output"
-    name_without_suff = os.path.basename(pdf_file_name).split(".")[0]
-
-    # read bytes
-    pdf_bytes = reader.read(pdf_file_name)  # read the pdf content
-
-    # proc
-    ## Create Dataset Instance
-    ds = PymuDocDataset(pdf_bytes)
-
-    ## inference
-    if ds.classify() == SupportedPdfParseMethod.OCR:
-        infer_result = ds.apply(doc_analyze, ocr=True)
-
-        ## pipeline
-        pipe_result = infer_result.pipe_ocr_mode(image_writer)
-
-    else:
-        infer_result = ds.apply(doc_analyze, ocr=False)
-
-        ## pipeline
-        pipe_result = infer_result.pipe_txt_mode(image_writer)
-
-    ### draw model result on each page
-    infer_result.draw_model(os.path.join(local_dir, f'{name_without_suff}_model.pdf'))  # dump to local
-
-    ### draw layout result on each page
-    pipe_result.draw_layout(os.path.join(local_dir, f'{name_without_suff}_layout.pdf'))  # dump to local
-
-    ### draw spans result on each page
-    pipe_result.draw_span(os.path.join(local_dir, f'{name_without_suff}_spans.pdf'))  # dump to local
-
-    ### dump markdown
-    pipe_result.dump_md(writer, f'{name_without_suff}.md', "unittest/tmp/images")  # dump to remote s3
-
-    ### dump content list
-    pipe_result.dump_content_list(writer, f"{name_without_suff}_content_list.json", "unittest/tmp/images")  # dump to remote s3
-
-See :doc:`../data/data_reader_writer` for more **read/write** examples.

Some files were not shown because too many files changed in this diff