Merge branch 'opendatalab:dev' into dev

linfeng, 1 year ago
Parent
Commit
2730b96b55

File diff suppressed because it is too large
+ 1 - 0
README.md


File diff suppressed because it is too large
+ 1 - 0
README_zh-CN.md


+ 10 - 17
docs/how_to_download_models_en.md

@@ -1,20 +1,12 @@
-### 1. Install Git LFS
-Before you begin, make sure Git Large File Storage (Git LFS) is installed on your system. Install it using the following command:
-
-```bash
-git lfs install
-```
-
-### 2. Download the Model from Hugging Face
-To download the `PDF-Extract-Kit` model from Hugging Face, use the following command:
-
+### 1. Download the Model from Hugging Face
+Use a Python script to download the model files from Hugging Face:
 ```bash
-git lfs clone https://huggingface.co/opendatalab/PDF-Extract-Kit
+pip install huggingface_hub
+wget https://github.com/opendatalab/MinerU/raw/master/docs/download_models_hf.py
+python download_models_hf.py
 ```
-
-Ensure that Git LFS is enabled during the clone to properly download all large files.
-
-### 3. Additional steps
+After the Python script finishes executing, it will output the directory where the models are downloaded.
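If pulling `download_models_hf.py` is not convenient, the same download step can be reproduced in a few lines. The sketch below is an illustration only, not the contents of the official script; it assumes nothing beyond the `huggingface_hub` package installed above and the `opendatalab/PDF-Extract-Kit` repository already referenced in this document:

```python
# Minimal sketch of the download step above (not the official download_models_hf.py).
from huggingface_hub import snapshot_download

# snapshot_download fetches all files of the repository and returns the local
# directory that contains them; this directory is what the later steps refer to.
model_dir = snapshot_download(repo_id="opendatalab/PDF-Extract-Kit")
print(f"Models downloaded to: {model_dir}")
```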
+### 2. Additional steps
 
 #### 1. Check whether the model directory is downloaded completely.
 
@@ -65,6 +57,7 @@ The structure of the model folder is as follows, including configuration files a
 
 Please check whether the size of the model file in the directory is consistent with the description on the web page. If possible, it is best to check whether the model is downloaded completely through sha256.
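The sha256 check mentioned above can be done with the standard library alone. A minimal sketch, assuming the downloaded `models` directory sits in the current working directory and that the expected hashes are read off the model page (they are not listed in this commit):

```python
# Minimal sketch: print the sha256 of every file under ./models so the values can
# be compared manually against the checksums published on the model page.
import hashlib
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

for file in sorted(Path("models").rglob("*")):
    if file.is_file():
        print(sha256_of(file), file)
```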
 
-#### 3. Move the model to the solid-state drive
+#### 3. Modify the model path in `magic-pdf.json`
+
+Additionally, in `~/magic-pdf.json`, update the model directory path to the absolute path of the `models` directory output by the previous Python script. Otherwise, you will encounter an error indicating that the model cannot be loaded.
 
-Move the 'models' directory to a directory with large disk space, preferably on a solid-state drive (SSD). In addition, modify the model directory in `~/magic-pdf.json` to point to the final model storage location, otherwise the model cannot be loaded.
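The `~/magic-pdf.json` update described in step 3 can also be scripted. A minimal sketch, assuming the file already exists and uses the `models-dir` key that appears in `web_api/magic-pdf.template.json` later in this diff; the model path below is a placeholder to replace with the directory printed by the download script:

```python
# Minimal sketch: point `models-dir` in ~/magic-pdf.json at the downloaded models.
# Assumes the `models-dir` key shown in web_api/magic-pdf.template.json in this diff.
import json
import os

def set_models_dir(models_dir: str) -> None:
    config_path = os.path.expanduser("~/magic-pdf.json")
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)
    # The docs require the absolute path of the downloaded `models` directory.
    config["models-dir"] = os.path.abspath(models_dir)
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=4)

set_models_dir("/path/to/PDF-Extract-Kit/models")  # placeholder path
```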

+ 15 - 40
docs/how_to_download_models_zh_cn.md

@@ -1,50 +1,26 @@
 # 如何下载模型文件
 
-模型文件可以从Hugging Face 或 Model Scope 下载,由于网络原因,国内用户访问HF 可能会失败,请使用 ModelScope。
-
-
-方法一:[从 Hugging Face 下载模型](#方法一从-hugging-face-下载模型)
-
-方法二:[从 ModelScope 下载模型](#方法二从-modelscope-下载模型)
-
-## 方法一:从 Hugging Face 下载模型
-
-使用Git LFS 从Hugging Face下载模型文件
-
-```bash
-git lfs install # 安装 Git 大文件存储插件 (Git LFS) 
-git lfs clone https://huggingface.co/opendatalab/PDF-Extract-Kit # 从 Hugging Face 下载 PDF-Extract-Kit 模型
-```
+模型文件可以从 Hugging Face 或 Model Scope 下载,由于网络原因,国内用户访问HF可能会失败,请使用 ModelScope。
 
+<details>
+  <summary>方法一:从 Hugging Face 下载模型</summary>
+  <p>使用python脚本 从Hugging Face下载模型文件</p>
+  <pre><code>pip install huggingface_hub
+wget https://gitee.com/myhloli/MinerU/raw/master/docs/download_models_hf.py
+python download_models_hf.py</code></pre>
+  <p>python脚本执行完毕后,会输出模型下载目录</p>
+</details>
 
 ## 方法二:从 ModelScope 下载模型
-ModelScope 支持SDK或模型下载,任选一个即可。
-
-[Git lsf下载](#1利用git-lsf下载)
 
-[SDK下载](#2利用sdk下载)
-
-### 1)利用Git lsf下载
-
-```bash
-git lfs install
-git lfs clone https://www.modelscope.cn/opendatalab/PDF-Extract-Kit.git
-```
-
-### 2)利用SDK下载
+### 使用python脚本 从ModelScope下载模型文件
 
 ```bash
-# 首先安装modelscope
 pip install modelscope
+wget https://gitee.com/myhloli/MinerU/raw/master/docs/download_models.py
+python download_models.py
 ```
-
-```python
-# 使用modelscope sdk下载模型
-from modelscope import snapshot_download
-model_dir = snapshot_download('opendatalab/PDF-Extract-Kit')
-print(f"模型文件下载路径为:{model_dir}/models")
-```
-
+python脚本执行完毕后,会输出模型下载目录
 ## 【❗️必须要做❗️】的额外步骤(模型下载完成后请务必完成以下操作)
 
 ### 1.检查模型目录是否下载完整
@@ -95,6 +71,5 @@ print(f"模型文件下载路径为:{model_dir}/models")
 ### 2.检查模型文件是否下载完整
 请检查目录下的模型文件大小与网页上描述是否一致,如果可以的话,最好通过sha256校验模型是否下载完整
 
-### 3.移动模型到固态硬盘
-将 'models' 目录移动到具有较大磁盘空间的目录中,最好是在固态硬盘(SSD)上。
-此外在 `~/magic-pdf.json`里修改模型的目录指向最终的模型存放位置,否则会报模型无法加载的错误。
+### 3.修改magic-pdf.json中的模型路径
+此外在 `~/magic-pdf.json`里修改模型的目录指向之前python脚本输出的models目录的绝对路径,否则会报模型无法加载的错误。

+ 3 - 3
magic_pdf/para/para_split_v2.py

@@ -202,7 +202,9 @@ def __valign_lines(blocks, layout_bboxes):
     min_distance = 3
     min_sample = 2
     new_layout_bboxes = []
-
+    # add bbox_fs for para split calculation
+    for block in blocks:
+        block["bbox_fs"] = copy.deepcopy(block["bbox"])
     for layout_box in layout_bboxes:
         blocks_in_layoutbox = [b for b in blocks if
                                b["type"] == BlockType.Text and is_in_layout(b['bbox'], layout_box['layout_bbox'])]
@@ -251,8 +253,6 @@ def __valign_lines(blocks, layout_bboxes):
                                     min([line['bbox'][1] for line in block['lines']]),
                                     max([line['bbox'][2] for line in block['lines']]),
                                     max([line['bbox'][3] for line in block['lines']])]
-            else:
-                block['bbox_fs'] = copy.deepcopy(block['bbox'])
         """新计算layout的bbox,因为block的bbox变了。"""
         layout_x0 = min([block['bbox_fs'][0] for block in blocks_in_layoutbox])
         layout_y0 = min([block['bbox_fs'][1] for block in blocks_in_layoutbox])

+ 3 - 4
projects/README.md

@@ -1,7 +1,6 @@
-# 欢迎来到 MinerU 项目列表
+# Welcome to the MinerU Project List
 
-## 项目列表
+## Project List
 
-- [llama_index_rag](./llama_index_rag/README.md): 基于 llama_index 构建轻量级 RAG 系统
+- [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
 
-- [web_api](./web_api/README.md): PDF解析的restful api服务

+ 5 - 0
projects/README_zh-CN.md

@@ -0,0 +1,5 @@
+# 欢迎来到 MinerU 项目列表
+
+## 项目列表
+
+- [llama_index_rag](./llama_index_rag/README_zh-CN.md): 基于 llama_index 构建轻量级 RAG 系统

+ 36 - 34
projects/llama_index_rag/README.md

@@ -1,4 +1,4 @@
-## 安装
+## Installation
 
 MinerU
 
@@ -11,7 +11,7 @@ conda activate MinerU
 pip install .[full] --extra-index-url https://wheels.myhloli.com
 ```
 
-第三方软件
+Third-party software
 
 ```bash
 # install
@@ -26,7 +26,7 @@ pip install accelerate==0.33.0
 pip uninstall transformer-engine
 ```
 
-## 环境配置
+## Environment Configuration
 
 ```
 export DASHSCOPE_API_KEY={some_key}
@@ -34,12 +34,11 @@ export ES_USER={some_es_user}
 export ES_PASSWORD={some_es_password}
 export ES_URL=http://{es_url}:9200
 ```
+For instructions on obtaining a DASHSCOPE_API_KEY, refer to the [documentation](https://help.aliyun.com/zh/dashscope/opening-service)
 
-DASHSCOPE_API_KEY 开通参考[文档](https://help.aliyun.com/zh/dashscope/opening-service)
+## Usage
 
-## 使用
-
-### 导入数据
+### Data Ingestion
 
 ```bash
 python data_ingestion.py -p some.pdf  # load data from pdf
@@ -49,16 +48,16 @@ python data_ingestion.py -p some.pdf  # load data from pdf
 python data_ingestion.py -p /opt/data/some_pdf_directory/ # load data from multiples pdf which under the directory of {some_pdf_directory}
 ```
 
-### 查询
+### Query
 
 ```bash
 python query.py --question '{the_question_you_want_to_ask}'
 ```
 
-## 示例
+## Example
 
 ````bash
-# 启动 es 服务
+# Start the es service
 docker compose up -d
 
 or
@@ -66,44 +65,45 @@ or
 docker-compose up -d
 
 
-# 配置环境变量
+# Set environment variables
 export ES_USER=elastic
 export ES_PASSWORD=llama_index
 export ES_URL=http://127.0.0.1:9200
 export DASHSCOPE_API_KEY={some_key}
 
 
-# 导入数据
+# Ingest data
 python data_ingestion.py example/data/declaration_of_the_rights_of_man_1789.pdf
 
 
-# 查询问题
+# Ask a question
 python query.py -q 'how about the rights of men'
 
 ## outputs
-请基于```内的内容回答问题。"
+Please answer the question based on the content within ```:
             ```
             I. Men are born, and always continue, free and equal in respect of their rights. Civil distinctions, therefore, can be founded only on public utility.
             ```
-            我的问题是:how about the rights of men。
+            My question is:how about the rights of men。
 
 question: how about the rights of men
 answer: The statement implies that men are born free and equal in terms of their rights. Civil distinctions should only be based on public utility. However, it does not specify what those rights are. It is up to society and individual countries to determine and protect the specific rights of their citizens.
 
 ````
 
-## 开发
+## Development
+
+`MinerU` provides a `RAG` integration interface, allowing users to specify a single input `pdf` file or a directory. `MinerU` will automatically parse the input files and return an iterable interface for retrieving the data.
 
-`MinerU` 提供了 `RAG` 集成接口,用户可以通过指定输入单个 `pdf` 文件或者某个目录。`MinerU` 会自动解析输入文件并返回可以迭代的接口用于获取数据
 
-### API 接口
+### API Interface
 
 ```python
 from magic_pdf.integrations.rag.type import Node
 
 class RagPageReader:
     def get_rel_map(self) -> list[ElementRelation]:
-        # 获取节点间的关系
+        # Retrieve the relationships between nodes
         pass
     ...
 
@@ -115,41 +115,43 @@ class DataReader:
         pass
 
     def get_documents_count(self) -> int:
-        """获取 pdf 文档数量"""
+        """Get the number of pdf documents"""
         pass
 
     def get_document_result(self, idx: int) -> RagDocumentReader | None:
-        """获取某个 pdf 的解析内容"""
+        """Retrieve the parsed content of a specific pdf"""
         pass
 
 
     def get_document_filename(self, idx: int) -> Path:
-        """获取某个 pdf 的具体路径"""
+        """Retrieve the path of a specific pdf"""
         pass
 
 
 ```
 
-类型定义
+Type Definitions
 
 ```python
 
+
 class Node(BaseModel):
-    category_type: CategoryType = Field(description='类别') # 类别
-    text: str | None = Field(description='文本内容',
-                             default=None)
-    image_path: str | None = Field(description='图或者表格(表可能用图片形式存储)的存储路径',
-                                   default=None)
-    anno_id: int = Field(description='unique id', default=-1)
-    latex: str | None = Field(description='公式或表格 latex 解析结果', default=None)
-    html: str | None = Field(description='表格的 html 解析结果', default=None)
+    category_type: CategoryType = Field(description='Category') # Category
+    text: str | None = Field(description='Text content', default=None)
+    image_path: str | None = Field(description='Path to image or table (table may be stored as an image)', default=None)
+    anno_id: int = Field(description='Unique ID', default=-1)
+    latex: str | None = Field(description='LaTeX output for equations or tables', default=None)
+    html: str | None = Field(description='HTML output for tables', default=None)
+
+
 
 ```
 
-表格存储形式可能会是 图片、latex、html 三种形式之一。
-anno_id 是该 Node 的在全局唯一ID。后续可以用于匹配该 Node 和其他 Node 的关系。节点的关系可以通过方法 `get_rel_map` 获取。用户可以用 `anno_id` 匹配节点之间的关系,并用于构建具备节点的关系的 rag index。
+Tables can be stored in one of three formats: image, LaTeX, or HTML. 
+`anno_id` is a globally unique ID for each Node. It can be used later to match this Node with other Nodes. The relationships between nodes can be retrieved using the `get_rel_map` method. Users can use `anno_id` to link nodes and construct a RAG index that includes node relationships.
+
 
-### 节点类型关系矩阵
+### Node Relationship Matrix
 
 |                | image_body | table_body |
 | -------------- | ---------- | ---------- |
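The `DataReader` interface documented above can be exercised roughly as follows. Only the constructor signature and method names come from this commit; the import path and the `method="auto"` value are assumptions (the `auto`/`ocr`/`txt` values mirror `web_api/app.py` elsewhere in this diff) and may need adjusting:

```python
# Usage sketch for the DataReader API shown above. The import path below and the
# method value are assumptions; only the signatures come from this commit.
from magic_pdf.integrations.rag import DataReader  # import path is an assumption

reader = DataReader(
    path_or_directory="/opt/data/some_pdf_directory/",
    method="auto",                      # assumed value, mirroring web_api/app.py
    output_dir="/tmp/rag_output",
)

for idx in range(reader.get_documents_count()):
    doc = reader.get_document_result(idx)      # RagDocumentReader | None
    if doc is None:
        continue                               # skip documents that failed to parse
    print("parsed:", reader.get_document_filename(idx))
    # The README states the returned readers are iterable; exactly how pages and
    # Node objects are yielded is not shown in this diff, so it is omitted here.
```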

+ 158 - 0
projects/llama_index_rag/README_zh-CN.md

@@ -0,0 +1,158 @@
+## 安装
+
+MinerU
+
+```bash
+git clone https://github.com/opendatalab/MinerU.git
+cd MinerU
+
+conda create -n MinerU python=3.10
+conda activate MinerU
+pip install .[full] --extra-index-url https://wheels.myhloli.com
+```
+
+第三方软件
+
+```bash
+# install
+pip install llama-index-vector-stores-elasticsearch==0.2.0
+pip install llama-index-embeddings-dashscope==0.2.0
+pip install llama-index-core==0.10.68
+pip install einops==0.7.0
+pip install transformers-stream-generator==0.0.5
+pip install accelerate==0.33.0
+
+# uninstall
+pip uninstall transformer-engine
+```
+
+## 环境配置
+
+```
+export DASHSCOPE_API_KEY={some_key}
+export ES_USER={some_es_user}
+export ES_PASSWORD={some_es_password}
+export ES_URL=http://{es_url}:9200
+```
+
+DASHSCOPE_API_KEY 开通参考[文档](https://help.aliyun.com/zh/dashscope/opening-service)
+
+## 使用
+
+### 导入数据
+
+```bash
+python data_ingestion.py -p some.pdf  # load data from pdf
+
+    or
+
+python data_ingestion.py -p /opt/data/some_pdf_directory/ # load data from multiples pdf which under the directory of {some_pdf_directory}
+```
+
+### 查询
+
+```bash
+python query.py --question '{the_question_you_want_to_ask}'
+```
+
+## 示例
+
+````bash
+# 启动 es 服务
+docker compose up -d
+
+or
+
+docker-compose up -d
+
+
+# 配置环境变量
+export ES_USER=elastic
+export ES_PASSWORD=llama_index
+export ES_URL=http://127.0.0.1:9200
+export DASHSCOPE_API_KEY={some_key}
+
+
+# 导入数据
+python data_ingestion.py example/data/declaration_of_the_rights_of_man_1789.pdf
+
+
+# 查询问题
+python query.py -q 'how about the rights of men'
+
+## outputs
+请基于```内的内容回答问题。"
+            ```
+            I. Men are born, and always continue, free and equal in respect of their rights. Civil distinctions, therefore, can be founded only on public utility.
+            ```
+            我的问题是:how about the rights of men。
+
+question: how about the rights of men
+answer: The statement implies that men are born free and equal in terms of their rights. Civil distinctions should only be based on public utility. However, it does not specify what those rights are. It is up to society and individual countries to determine and protect the specific rights of their citizens.
+
+````
+
+## 开发
+
+`MinerU` 提供了 `RAG` 集成接口,用户可以通过指定输入单个 `pdf` 文件或者某个目录。`MinerU` 会自动解析输入文件并返回可以迭代的接口用于获取数据
+
+### API 接口
+
+```python
+from magic_pdf.integrations.rag.type import Node
+
+class RagPageReader:
+    def get_rel_map(self) -> list[ElementRelation]:
+        # 获取节点间的关系
+        pass
+    ...
+
+class RagDocumentReader:
+    ...
+
+class DataReader:
+    def __init__(self, path_or_directory: str, method: str, output_dir: str):
+        pass
+
+    def get_documents_count(self) -> int:
+        """获取 pdf 文档数量"""
+        pass
+
+    def get_document_result(self, idx: int) -> RagDocumentReader | None:
+        """获取某个 pdf 的解析内容"""
+        pass
+
+
+    def get_document_filename(self, idx: int) -> Path:
+        """获取某个 pdf 的具体路径"""
+        pass
+
+
+```
+
+类型定义
+
+```python
+
+class Node(BaseModel):
+    category_type: CategoryType = Field(description='类别') # 类别
+    text: str | None = Field(description='文本内容',
+                             default=None)
+    image_path: str | None = Field(description='图或者表格(表可能用图片形式存储)的存储路径',
+                                   default=None)
+    anno_id: int = Field(description='unique id', default=-1)
+    latex: str | None = Field(description='公式或表格 latex 解析结果', default=None)
+    html: str | None = Field(description='表格的 html 解析结果', default=None)
+
+```
+
+表格存储形式可能会是 图片、latex、html 三种形式之一。
+anno_id 是该 Node 的在全局唯一ID。后续可以用于匹配该 Node 和其他 Node 的关系。节点的关系可以通过方法 `get_rel_map` 获取。用户可以用 `anno_id` 匹配节点之间的关系,并用于构建具备节点的关系的 rag index。
+
+### 节点类型关系矩阵
+
+|                | image_body | table_body |
+| -------------- | ---------- | ---------- |
+| image_caption  | sibling    |            |
+| table_caption  |            | sibling    |
+| table_footnote |            | sibling    |

+ 85 - 0
web_api/Dockerfile

@@ -0,0 +1,85 @@
+# Use the official Ubuntu base image
+FROM ubuntu:latest
+
+# ENV http_proxy http://127.0.0.1:7890
+# ENV https_proxy http://127.0.0.1:7890
+
+# Set environment variables to non-interactive to avoid prompts during installation
+ENV DEBIAN_FRONTEND=noninteractive
+ENV LANG C.UTF-8
+
+# ADD sources.list /etc/apt
+# RUN apt-get clean
+
+
+
+# Update the package list and install necessary packages
+RUN apt-get -q update \
+    && apt-get -q install -y --no-install-recommends \
+        apt-utils \
+        bats \
+        build-essential
+RUN apt-get update && apt-get install -y vim net-tools procps lsof curl wget iputils-ping telnet lrzsz git
+
+RUN apt-get update && \
+    apt-get install -y \
+        software-properties-common && \
+    add-apt-repository ppa:deadsnakes/ppa && \
+    apt-get update && \
+    apt-get install -y \
+        python3.10 \
+        python3.10-venv \
+        python3.10-distutils \
+        python3-pip \
+        wget \
+        git \
+        libgl1 \
+        libglib2.0-0 \
+        && rm -rf /var/lib/apt/lists/*
+        
+# RUN unset http_proxy && unset https_proxy
+
+# Set Python 3.10 as the default python3
+RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.10 1
+
+# Create a virtual environment for MinerU
+RUN python3 -m venv /opt/mineru_venv
+RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
+# Activate the virtual environment and install necessary Python packages
+RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
+    pip install --upgrade pip && \
+    pip install magic-pdf[full] --extra-index-url https://myhloli.github.io/wheels/ --no-cache-dir"
+
+
+RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
+    pip install fastapi uvicorn python-multipart --no-cache-dir"
+
+RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
+    pip uninstall  paddlepaddle -y"
+
+RUN /bin/bash -c "source /opt/mineru_venv/bin/activate && \
+    python -m pip install paddlepaddle-gpu==3.0.0b1 -i https://www.paddlepaddle.org.cn/packages/stable/cu118/ --no-cache-dir"
+
+# Copy the configuration file template and set up the model directory
+COPY magic-pdf.template.json /root/magic-pdf.json
+ADD models /opt/models
+ADD .paddleocr /root/.paddleocr 
+ADD app.py /root/app.py
+
+WORKDIR /root
+
+# Set the models directory in the configuration file (adjust the path as needed)
+RUN sed -i 's|/tmp/models|/opt/models|g' /root/magic-pdf.json
+
+# Create the models directory
+# RUN mkdir -p /opt/models
+
+# Set the entry point to activate the virtual environment and run the command line tool
+# ENTRYPOINT ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && exec \"$@\" && python3 app.py", "--"]
+
+
+# Expose the port that FastAPI will run on
+EXPOSE 8000
+
+# Command to run FastAPI using Uvicorn, pointing to app.py and binding to 0.0.0.0:8000
+CMD ["/bin/bash", "-c", "source /opt/mineru_venv/bin/activate && uvicorn app:app --host 0.0.0.0 --port 8000"]

+ 44 - 0
web_api/README.md

@@ -0,0 +1,44 @@
+基于MinerU的PDF解析API
+
+    - MinerU的GPU镜像构建
+    - 基于FastAPI的PDF解析接口
+
+支持一键启动,已经打包到镜像中,自带模型权重,支持GPU推理加速,GPU速度相比CPU每页解析要快几十倍不等
+
+
+##  启动命令:
+
+
+```docker run -itd --name=mineru_server --gpus=all -p 8888:8000 quincyqiang/mineru:0.1-models```
+
+![](https://i-blog.csdnimg.cn/direct/bcff4f524ea5400db14421ba7cec4989.png)
+
+具体截图请见博客:https://blog.csdn.net/yanqianglifei/article/details/141979684
+
+
+##   启动日志:
+
+![](https://i-blog.csdnimg.cn/direct/4eb5657567e4415eba912179dca5c8aa.png)
+
+##  输入参数:
+
+访问地址:
+
+    http://localhost:8888/docs
+
+    http://127.0.0.1:8888/docs
+
+![](https://i-blog.csdnimg.cn/direct/8b3a2bc5908042268e8cc69756e331a2.png)
+
+##  解析效果:
+
+![](https://i-blog.csdnimg.cn/direct/a54dcae834ae48d498fb595aca4212c3.png)
+
+
+
+##   镜像地址:
+
+> 阿里云地址:docker pull registry.cn-beijing.aliyuncs.com/quincyqiang/mineru:0.1-models
+
+> dockerhub地址:docker pull quincyqiang/mineru:0.1-models
+
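Once the container above is running, the service can be driven from Python as well as from the Swagger UI. A minimal client sketch, assuming the `/pdf_parse` endpoint defined in `web_api/app.py` later in this diff, the 8888 host port from the `docker run` command, and the third-party `requests` package (not part of this commit):

```python
# Minimal client sketch for the PDF parsing service started above.
import requests

with open("small_ocr.pdf", "rb") as f:  # sample PDF shipped in web_api/
    resp = requests.post(
        "http://localhost:8888/pdf_parse",
        files={"pdf_file": ("small_ocr.pdf", f, "application/pdf")},
        params={"parse_method": "auto"},  # auto / ocr / txt, per web_api/app.py
        timeout=600,                      # parsing large PDFs can take a while
    )
resp.raise_for_status()
result = resp.json()
print(result["md_content"])  # Markdown produced by the pipeline
```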

+ 141 - 0
web_api/app.py

@@ -0,0 +1,141 @@
+import copy
+import json
+import os
+from tempfile import NamedTemporaryFile
+
+import magic_pdf.model as model_config
+import uvicorn
+from fastapi import FastAPI, File, UploadFile, Form
+from fastapi.responses import JSONResponse
+from loguru import logger
+from magic_pdf.pipe.OCRPipe import OCRPipe
+from magic_pdf.pipe.TXTPipe import TXTPipe
+from magic_pdf.pipe.UNIPipe import UNIPipe
+from magic_pdf.rw.DiskReaderWriter import DiskReaderWriter
+
+model_config.__use_inside_model__ = True
+
+app = FastAPI()
+
+def json_md_dump(
+        pipe,
+        md_writer,
+        pdf_name,
+        content_list,
+        md_content,
+):
+    # Write model results to model.json
+    orig_model_list = copy.deepcopy(pipe.model_list)
+    md_writer.write(
+        content=json.dumps(orig_model_list, ensure_ascii=False, indent=4),
+        path=f"{pdf_name}_model.json"
+    )
+
+    # Write intermediate results to middle.json
+    md_writer.write(
+        content=json.dumps(pipe.pdf_mid_data, ensure_ascii=False, indent=4),
+        path=f"{pdf_name}_middle.json"
+    )
+
+    # Write text content results to content_list.json
+    md_writer.write(
+        content=json.dumps(content_list, ensure_ascii=False, indent=4),
+        path=f"{pdf_name}_content_list.json"
+    )
+
+    # Write results to .md file
+    md_writer.write(
+        content=md_content,
+        path=f"{pdf_name}.md"
+    )
+
+@app.post("/pdf_parse", tags=["projects"], summary="Parse PDF file")
+async def pdf_parse_main(
+        pdf_file: UploadFile = File(...),
+        parse_method: str = 'auto',
+        model_json_path: str = None,
+        is_json_md_dump: bool = True,
+        output_dir: str = "output"
+):
+    """
+    Execute the process of converting PDF to JSON and MD, outputting MD and JSON files to the specified directory
+    :param pdf_file: The PDF file to be parsed
+    :param parse_method: Parsing method, can be auto, ocr, or txt. Default is auto. If results are not satisfactory, try ocr
+    :param model_json_path: Path to existing model data file. If empty, use built-in model. PDF and model_json must correspond
+    :param is_json_md_dump: Whether to write parsed data to .json and .md files. Default is True. Different stages of data will be written to different .json files (3 in total), md content will be saved to .md file
+    :param output_dir: Output directory for results. A folder named after the PDF file will be created to store all results
+    """
+    try:
+        # Create a temporary file to store the uploaded PDF
+        with NamedTemporaryFile(delete=False, suffix=".pdf") as temp_pdf:
+            temp_pdf.write(await pdf_file.read())
+            temp_pdf_path = temp_pdf.name
+
+        pdf_name = os.path.basename(pdf_file.filename).split(".")[0]
+
+        if output_dir:
+            output_path = os.path.join(output_dir, pdf_name)
+        else:
+            output_path = os.path.join(os.path.dirname(temp_pdf_path), pdf_name)
+
+        output_image_path = os.path.join(output_path, 'images')
+
+        # Get parent path of images for relative path in .md and content_list.json
+        image_path_parent = os.path.basename(output_image_path)
+
+        pdf_bytes = open(temp_pdf_path, "rb").read()  # Read binary data of PDF file
+
+        if model_json_path:
+            # Read original JSON data of PDF file parsed by model, list type
+            model_json = json.loads(open(model_json_path, "r", encoding="utf-8").read())
+        else:
+            model_json = []
+
+        # Execute parsing steps
+        image_writer, md_writer = DiskReaderWriter(output_image_path), DiskReaderWriter(output_path)
+
+        # Choose parsing method
+        if parse_method == "auto":
+            jso_useful_key = {"_pdf_type": "", "model_list": model_json}
+            pipe = UNIPipe(pdf_bytes, jso_useful_key, image_writer)
+        elif parse_method == "txt":
+            pipe = TXTPipe(pdf_bytes, model_json, image_writer)
+        elif parse_method == "ocr":
+            pipe = OCRPipe(pdf_bytes, model_json, image_writer)
+        else:
+            logger.error("Unknown parse method, only auto, ocr, txt allowed")
+            return JSONResponse(content={"error": "Invalid parse method"}, status_code=400)
+
+        # Execute classification
+        pipe.pipe_classify()
+
+        # If no model data is provided, use built-in model for parsing
+        if not model_json:
+            if model_config.__use_inside_model__:
+                pipe.pipe_analyze()  # Parse
+            else:
+                logger.error("Need model list input")
+                return JSONResponse(content={"error": "Model list input required"}, status_code=400)
+
+        # Execute parsing
+        pipe.pipe_parse()
+
+        # Save results in text and md format
+        content_list = pipe.pipe_mk_uni_format(image_path_parent, drop_mode="none")
+        md_content = pipe.pipe_mk_markdown(image_path_parent, drop_mode="none")
+
+        if is_json_md_dump:
+            json_md_dump(pipe, md_writer, pdf_name, content_list, md_content)
+        data = {"layout": copy.deepcopy(pipe.model_list), "info": pipe.pdf_mid_data, "content_list": content_list,'md_content':md_content}
+        return JSONResponse(data, status_code=200)
+
+    except Exception as e:
+        logger.exception(e)
+        return JSONResponse(content={"error": str(e)}, status_code=500)
+    finally:
+        # Clean up the temporary file
+        if 'temp_pdf_path' in locals():
+            os.unlink(temp_pdf_path)
+
+# if __name__ == '__main__':
+#     uvicorn.run(app, host="0.0.0.0", port=8888)

+ 13 - 0
web_api/magic-pdf.json

@@ -0,0 +1,13 @@
+{
+    "bucket_info":{
+        "bucket-name-1":["ak", "sk", "endpoint"],
+        "bucket-name-2":["ak", "sk", "endpoint"]
+    },
+    "models-dir":"/opt/models",
+    "device-mode":"cuda",
+    "table-config": {
+        "model": "TableMaster",
+        "is_table_recog_enable": false,
+        "max_time": 400
+    }
+}

+ 13 - 0
web_api/magic-pdf.template.json

@@ -0,0 +1,13 @@
+{
+    "bucket_info":{
+        "bucket-name-1":["ak", "sk", "endpoint"],
+        "bucket-name-2":["ak", "sk", "endpoint"]
+    },
+    "models-dir":"/tmp/models",
+    "device-mode":"cuda",
+    "table-config": {
+        "model": "TableMaster",
+        "is_table_recog_enable": false,
+        "max_time": 400
+    }
+}

+ 0 - 0
web_api/requirements.txt


Binary
web_api/small_ocr.pdf


+ 10 - 0
web_api/sources.list

@@ -0,0 +1,10 @@
+deb http://mirrors.aliyun.com/ubuntu/ focal main restricted universe multiverse
+deb-src http://mirrors.aliyun.com/ubuntu/ focal main restricted universe multiverse
+deb http://mirrors.aliyun.com/ubuntu/ focal-security main restricted universe multiverse
+deb-src http://mirrors.aliyun.com/ubuntu/ focal-security main restricted universe multiverse
+deb http://mirrors.aliyun.com/ubuntu/ focal-updates main restricted universe multiverse
+deb-src http://mirrors.aliyun.com/ubuntu/ focal-updates main restricted universe multiverse
+deb http://mirrors.aliyun.com/ubuntu/ focal-proposed main restricted universe multiverse
+deb-src http://mirrors.aliyun.com/ubuntu/ focal-proposed main restricted universe multiverse
+deb http://mirrors.aliyun.com/ubuntu/ focal-backports main restricted universe multiverse
+deb-src http://mirrors.aliyun.com/ubuntu/ focal-backports main restricted universe multiverse

+ 7 - 0
web_api/start_mineru.sh

@@ -0,0 +1,7 @@
+docker run -itd --name=mineru_server --gpus=all -p 8888:8000 quincyqiang/mineru:0.1-models /bin/bash
+
+docker run -itd --name=mineru_server --gpus=all -p 8888:8000 quincyqiang/mineru:0.3-models
+
+docker login --username=1185918903@qq.com registry.cn-beijing.aliyuncs.com
+docker tag quincyqiang/mineru:0.3-models registry.cn-beijing.aliyuncs.com/quincyqiang/gomate:0.3-models
+docker push registry.cn-beijing.aliyuncs.com/quincyqiang/gomate:0.3-models

Some files were not shown because too many files have changed in this diff