fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 (#574)

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

add HF、modelscope、colab url

* Update README.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* Rename README.md to README_zh-CN.md

* Create readme.md

* Rename readme.md to README.md

* Rename README.md to README_zh-CN.md

* Update README_zh-CN.md

* Create README.md

* Update README.md

* Update README.md

* Update README.md

* Update README_zh-CN.md

* Update README.md

* Update README_zh-CN.md

* Update README_zh-CN.md

* Update README.md

* Update README_zh-CN.md

* fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384

---------

Co-authored-by: Xiaomeng Zhao <moe@myhloli.com>
Co-authored-by: sfk <18810651050@163.com>
Kaiwen Liu, 1 year ago
parent
commit
58a003177c
7 files changed, 207 additions and 41 deletions
  1. README.md (+1 -0)
  2. README_zh-CN.md (+1 -0)
  3. magic_pdf/para/para_split_v2.py (+3 -3)
  4. projects/README.md (+3 -4)
  5. projects/README_zh-CN.md (+5 -0)
  6. projects/llama_index_rag/README.md (+36 -34)
  7. projects/llama_index_rag/README_zh-CN.md (+158 -0)

File diff suppressed because it is too large
+ 1 - 0
README.md


File diff suppressed because it is too large
+ 1 - 0
README_zh-CN.md


+ 3 - 3
magic_pdf/para/para_split_v2.py

@@ -202,7 +202,9 @@ def __valign_lines(blocks, layout_bboxes):
     min_distance = 3
     min_sample = 2
     new_layout_bboxes = []
-
+    # add bbox_fs for para split calculation
+    for block in blocks:
+        block["bbox_fs"] = copy.deepcopy(block["bbox"])
     for layout_box in layout_bboxes:
         blocks_in_layoutbox = [b for b in blocks if
                                b["type"] == BlockType.Text and is_in_layout(b['bbox'], layout_box['layout_bbox'])]
@@ -251,8 +253,6 @@ def __valign_lines(blocks, layout_bboxes):
                                     min([line['bbox'][1] for line in block['lines']]),
                                     max([line['bbox'][2] for line in block['lines']]),
                                     max([line['bbox'][3] for line in block['lines']])]
-            else:
-                block['bbox_fs'] = copy.deepcopy(block['bbox'])
         """新计算layout的bbox,因为block的bbox变了。"""
         layout_x0 = min([block['bbox_fs'][0] for block in blocks_in_layoutbox])
         layout_y0 = min([block['bbox_fs'][1] for block in blocks_in_layoutbox])
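
Reading the two hunks above, the change appears to give every block a default `bbox_fs` (a deep copy of its `bbox`) before the per-layout loop, instead of assigning it only in an `else` branch that some blocks never reach. The sketch below illustrates that idea in isolation. The condition under which the real function recomputes `bbox_fs` lies outside the hunks shown, so `init_bbox_fs`, `tighten_bbox_fs`, and the sample data are hypothetical stand-ins, not the project's code.

```python
import copy


def init_bbox_fs(blocks):
    """Give every block a default bbox_fs: an independent copy of its bbox."""
    for block in blocks:
        block["bbox_fs"] = copy.deepcopy(block["bbox"])
    return blocks


def tighten_bbox_fs(block):
    """Replace bbox_fs with the union of the block's line boxes, if it has lines.

    Hypothetical helper: the real trigger condition is not visible in the diff.
    """
    lines = block.get("lines") or []
    if lines:
        block["bbox_fs"] = [
            min(line["bbox"][0] for line in lines),
            min(line["bbox"][1] for line in lines),
            max(line["bbox"][2] for line in lines),
            max(line["bbox"][3] for line in lines),
        ]
    return block


# Made-up sample data: one text block with two lines, one image block.
blocks = [
    {"type": "text", "bbox": [0, 0, 100, 50],
     "lines": [{"bbox": [10, 5, 90, 20]}, {"bbox": [12, 22, 95, 40]}]},
    {"type": "image", "bbox": [0, 60, 100, 120]},
]

init_bbox_fs(blocks)         # every block now has bbox_fs
tighten_bbox_fs(blocks[0])   # only the text block is recomputed

print(blocks[0]["bbox_fs"])  # [10, 5, 95, 40]
print(blocks[1]["bbox_fs"])  # [0, 60, 100, 120]
```

With the default assigned up front, later code that reads `block['bbox_fs']` no longer depends on which branch a block passed through, and the deep copy keeps edits to `bbox_fs` from leaking back into `bbox`.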

+ 3 - 4
projects/README.md

@@ -1,7 +1,6 @@
-# 欢迎来到 MinerU 项目列表
+# Welcome to the MinerU Project List
 
-## 项目列表
+## Project List
 
-- [llama_index_rag](./llama_index_rag/README.md): 基于 llama_index 构建轻量级 RAG 系统
+- [llama_index_rag](./llama_index_rag/README.md): Build a lightweight RAG system based on llama_index
 
-- [web_api](./web_api/README.md): PDF解析的restful api服务

+ 5 - 0
projects/README_zh-CN.md

@@ -0,0 +1,5 @@
+# 欢迎来到 MinerU 项目列表
+
+## 项目列表
+
+- [llama_index_rag](./llama_index_rag/README_zh-CN.md): 基于 llama_index 构建轻量级 RAG 系统

+ 36 - 34
projects/llama_index_rag/README.md

@@ -1,4 +1,4 @@
-## 安装
+## Installation
 
 MinerU
 
@@ -11,7 +11,7 @@ conda activate MinerU
 pip install .[full] --extra-index-url https://wheels.myhloli.com
 ```
 
-第三方软件
+Third-party software
 
 ```bash
 # install
@@ -26,7 +26,7 @@ pip install accelerate==0.33.0
 pip uninstall transformer-engine
 ```
 
-## 环境配置
+## Environment Configuration
 
 ```
 export DASHSCOPE_API_KEY={some_key}
@@ -34,12 +34,11 @@ export ES_USER={some_es_user}
 export ES_PASSWORD={some_es_password}
 export ES_URL=http://{es_url}:9200
 ```
+For instructions on obtaining a DASHSCOPE_API_KEY, refer to [documentation](https://help.aliyun.com/zh/dashscope/opening-service)
 
-DASHSCOPE_API_KEY 开通参考[文档](https://help.aliyun.com/zh/dashscope/opening-service)
+## Usage
 
-## 使用
-
-### 导入数据
+### Data Ingestion
 
 ```bash
 python data_ingestion.py -p some.pdf  # load data from pdf
@@ -49,16 +48,16 @@ python data_ingestion.py -p some.pdf  # load data from pdf
 python data_ingestion.py -p /opt/data/some_pdf_directory/ # load data from multiples pdf which under the directory of {some_pdf_directory}
 ```
 
-### 查询
+### Query
 
 ```bash
 python query.py --question '{the_question_you_want_to_ask}'
 ```
 
-## 示例
+## Example
 
 ````bash
-# 启动 es 服务
+# Start the es service
 docker compose up -d
 
 or
@@ -66,44 +65,45 @@ or
 docker-compose up -d
 
 
-# 配置环境变量
+# Set environment variables
 export ES_USER=elastic
 export ES_PASSWORD=llama_index
 export ES_URL=http://127.0.0.1:9200
 export DASHSCOPE_API_KEY={some_key}
 
 
-# 导入数据
+# Ingest data
 python data_ingestion.py example/data/declaration_of_the_rights_of_man_1789.pdf
 
 
-# 查询问题
+# Ask a question
 python query.py -q 'how about the rights of men'
 
 ## outputs
-请基于```内的内容回答问题。"
+Please answer the question based on the content within ```:
             ```
             I. Men are born, and always continue, free and equal in respect of their rights. Civil distinctions, therefore, can be founded only on public utility.
             ```
-            我的问题是:how about the rights of men。
+            My question is:how about the rights of men。
 
 question: how about the rights of men
 answer: The statement implies that men are born free and equal in terms of their rights. Civil distinctions should only be based on public utility. However, it does not specify what those rights are. It is up to society and individual countries to determine and protect the specific rights of their citizens.
 
 ````
 
-## 开发
+## Development
+
+`MinerU` provides a `RAG` integration interface, allowing users to specify a single input `pdf` file or a directory. `MinerU` will automatically parse the input files and return an iterable interface for retrieving the data.
 
-`MinerU` 提供了 `RAG` 集成接口,用户可以通过指定输入单个 `pdf` 文件或者某个目录。`MinerU` 会自动解析输入文件并返回可以迭代的接口用于获取数据
 
-### API 接口
+### API Interface
 
 ```python
 from magic_pdf.integrations.rag.type import Node
 
 class RagPageReader:
     def get_rel_map(self) -> list[ElementRelation]:
-        # 获取节点间的关系
+        # Retrieve the relationships between nodes
         pass
     ...
 
@@ -115,41 +115,43 @@ class DataReader:
         pass
 
     def get_documents_count(self) -> int:
-        """获取 pdf 文档数量"""
+        """Get the number of pdf documents"""
         pass
 
     def get_document_result(self, idx: int) -> RagDocumentReader | None:
-        """获取某个 pdf 的解析内容"""
+        """Retrieve the parsed content of a specific pdf"""
         pass
 
 
     def get_document_filename(self, idx: int) -> Path:
-        """获取某个 pdf 的具体路径"""
+        """Retrieve the path of a specific pdf"""
         pass
 
 
 ```
 
-类型定义
+Type Definitions
 
 ```python
 
+
 class Node(BaseModel):
-    category_type: CategoryType = Field(description='类别') # 类别
-    text: str | None = Field(description='文本内容',
-                             default=None)
-    image_path: str | None = Field(description='图或者表格(表可能用图片形式存储)的存储路径',
-                                   default=None)
-    anno_id: int = Field(description='unique id', default=-1)
-    latex: str | None = Field(description='公式或表格 latex 解析结果', default=None)
-    html: str | None = Field(description='表格的 html 解析结果', default=None)
+    category_type: CategoryType = Field(description='Category') # Category
+    text: str | None = Field(description='Text content', default=None)
+    image_path: str | None = Field(description='Path to image or table (table may be stored as an image)', default=None)
+    anno_id: int = Field(description='Unique ID', default=-1)
+    latex: str | None = Field(description='LaTeX output for equations or tables', default=None)
+    html: str | None = Field(description='HTML output for tables', default=None)
+
+
 
 ```
 
-表格存储形式可能会是 图片、latex、html 三种形式之一。
-anno_id 是该 Node 的在全局唯一ID。后续可以用于匹配该 Node 和其他 Node 的关系。节点的关系可以通过方法 `get_rel_map` 获取。用户可以用 `anno_id` 匹配节点之间的关系,并用于构建具备节点的关系的 rag index。
+Tables can be stored in one of three formats: image, LaTeX, or HTML. 
+`anno_id` is a globally unique ID for each Node. It can be used later to match this Node with other Nodes. The relationships between nodes can be retrieved using the `get_rel_map` method. Users can use `anno_id` to link nodes and construct a RAG index that includes node relationships.
+
 
-### 节点类型关系矩阵
+### Node Relationship Matrix
 
 |                | image_body | table_body |
 | -------------- | ---------- | ---------- |

+ 158 - 0
projects/llama_index_rag/README_zh-CN.md

@@ -0,0 +1,158 @@
+## 安装
+
+MinerU
+
+```bash
+git clone https://github.com/opendatalab/MinerU.git
+cd MinerU
+
+conda create -n MinerU python=3.10
+conda activate MinerU
+pip install .[full] --extra-index-url https://wheels.myhloli.com
+```
+
+第三方软件
+
+```bash
+# install
+pip install llama-index-vector-stores-elasticsearch==0.2.0
+pip install llama-index-embeddings-dashscope==0.2.0
+pip install llama-index-core==0.10.68
+pip install einops==0.7.0
+pip install transformers-stream-generator==0.0.5
+pip install accelerate==0.33.0
+
+# uninstall
+pip uninstall transformer-engine
+```
+
+## 环境配置
+
+```
+export DASHSCOPE_API_KEY={some_key}
+export ES_USER={some_es_user}
+export ES_PASSWORD={some_es_password}
+export ES_URL=http://{es_url}:9200
+```
+
+DASHSCOPE_API_KEY 开通参考[文档](https://help.aliyun.com/zh/dashscope/opening-service)
+
+## 使用
+
+### 导入数据
+
+```bash
+python data_ingestion.py -p some.pdf  # load data from pdf
+
+    or
+
+python data_ingestion.py -p /opt/data/some_pdf_directory/ # load data from multiples pdf which under the directory of {some_pdf_directory}
+```
+
+### 查询
+
+```bash
+python query.py --question '{the_question_you_want_to_ask}'
+```
+
+## 示例
+
+````bash
+# 启动 es 服务
+docker compose up -d
+
+or
+
+docker-compose up -d
+
+
+# 配置环境变量
+export ES_USER=elastic
+export ES_PASSWORD=llama_index
+export ES_URL=http://127.0.0.1:9200
+export DASHSCOPE_API_KEY={some_key}
+
+
+# 导入数据
+python data_ingestion.py example/data/declaration_of_the_rights_of_man_1789.pdf
+
+
+# 查询问题
+python query.py -q 'how about the rights of men'
+
+## outputs
+请基于```内的内容回答问题。"
+            ```
+            I. Men are born, and always continue, free and equal in respect of their rights. Civil distinctions, therefore, can be founded only on public utility.
+            ```
+            我的问题是:how about the rights of men。
+
+question: how about the rights of men
+answer: The statement implies that men are born free and equal in terms of their rights. Civil distinctions should only be based on public utility. However, it does not specify what those rights are. It is up to society and individual countries to determine and protect the specific rights of their citizens.
+
+````
+
+## 开发
+
+`MinerU` 提供了 `RAG` 集成接口,用户可以通过指定输入单个 `pdf` 文件或者某个目录。`MinerU` 会自动解析输入文件并返回可以迭代的接口用于获取数据
+
+### API 接口
+
+```python
+from magic_pdf.integrations.rag.type import Node
+
+class RagPageReader:
+    def get_rel_map(self) -> list[ElementRelation]:
+        # 获取节点间的关系
+        pass
+    ...
+
+class RagDocumentReader:
+    ...
+
+class DataReader:
+    def __init__(self, path_or_directory: str, method: str, output_dir: str):
+        pass
+
+    def get_documents_count(self) -> int:
+        """获取 pdf 文档数量"""
+        pass
+
+    def get_document_result(self, idx: int) -> RagDocumentReader | None:
+        """获取某个 pdf 的解析内容"""
+        pass
+
+
+    def get_document_filename(self, idx: int) -> Path:
+        """获取某个 pdf 的具体路径"""
+        pass
+
+
+```
+
+类型定义
+
+```python
+
+class Node(BaseModel):
+    category_type: CategoryType = Field(description='类别') # 类别
+    text: str | None = Field(description='文本内容',
+                             default=None)
+    image_path: str | None = Field(description='图或者表格(表可能用图片形式存储)的存储路径',
+                                   default=None)
+    anno_id: int = Field(description='unique id', default=-1)
+    latex: str | None = Field(description='公式或表格 latex 解析结果', default=None)
+    html: str | None = Field(description='表格的 html 解析结果', default=None)
+
+```
+
+表格存储形式可能会是 图片、latex、html 三种形式之一。
+anno_id 是该 Node 的在全局唯一ID。后续可以用于匹配该 Node 和其他 Node 的关系。节点的关系可以通过方法 `get_rel_map` 获取。用户可以用 `anno_id` 匹配节点之间的关系,并用于构建具备节点的关系的 rag index。
+
+### 节点类型关系矩阵
+
+|                | image_body | table_body |
+| -------------- | ---------- | ---------- |
+| image_caption  | sibling    |            |
+| table_caption  |            | sibling    |
+| table_footnote |            | sibling    |
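
The node relationship matrix in the README above can be read as a lookup from a (row type, column type) pair to a relation name. A minimal sketch of that lookup follows, assuming the category names are plain strings exactly as they appear in the table. `RELATION_MATRIX` and `relation_between` are hypothetical names for illustration and are not part of the `magic_pdf` API.

```python
# Hypothetical encoding of the README's relationship matrix.
# Keys are (row type, column type); empty cells are simply absent.
RELATION_MATRIX = {
    ("image_caption", "image_body"): "sibling",
    ("table_caption", "table_body"): "sibling",
    ("table_footnote", "table_body"): "sibling",
}


def relation_between(source_type, target_type):
    """Return the relation named in the matrix, or None for an empty cell."""
    return RELATION_MATRIX.get((source_type, target_type))


print(relation_between("table_footnote", "table_body"))  # sibling
print(relation_between("image_caption", "table_body"))   # None
```

In a real index, the pairs would presumably come from `get_rel_map` and be matched to nodes through `anno_id`, as the README describes; the shape of `ElementRelation` is not shown in this diff, so it is left out here.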

Some files were not shown because too many files have changed in this diff