
fix: update output file documentation for mineru command and enhance content descriptions in output_file_en_us.md and output_file_zh_cn.md

myhloli 4 months ago
parent
commit
54cf49dfc9

BIN
docs/images/layout_example.png


BIN
docs/images/web_demo_1.png


+ 47 - 9
docs/output_file_en_us.md

@@ -1,20 +1,19 @@
 ## Overview
 
-After executing the `magic-pdf` command, in addition to outputting files related to markdown, several other files unrelated to markdown will also be generated. These files will be introduced one by one.
+After executing the `mineru` command, in addition to outputting files related to markdown, several other files unrelated to markdown will also be generated. These files will be introduced one by one.
 
 ### some_pdf_layout.pdf
 
-Each page layout consists of one or more boxes. The number at the top left of each box indicates its sequence number. Additionally, in `layout.pdf`, different content blocks are highlighted with different background colors.
-
+Each page's layout consists of one or more bounding boxes. The number in the top-right corner of each box indicates its position in the reading order. Additionally, `layout.pdf` highlights different content blocks with distinct background colors.
 ![layout example](images/layout_example.png)
 
-### some_pdf_spans.pdf
+### some_pdf_spans.pdf (Applicable only to the pipeline backend)
 
 All spans on the page are drawn with different colored line frames according to the span type. This file can be used for quality control, allowing for quick identification of issues such as missing text or unrecognized inline formulas.
 
 ![spans example](images/spans_example.png)
 
-### some_pdf_model.json
+### some_pdf_model.json (Applicable only to the pipeline backend)
 
 #### Structure Definition
 
@@ -117,13 +116,39 @@ The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], repres
 ]
 ```
 
+### some_pdf_model_output.txt (Applicable only to the VLM backend)
+
+This file contains the output of the VLM model, with each page's output separated by `----`.  
+Each page's output consists of text blocks starting with `<|box_start|>` and ending with `<|md_end|>`.  
+The meaning of each field is as follows:  
+- `<|box_start|>x0 y0 x1 y1<|box_end|>`  
+  `x0 y0 x1 y1` are the coordinates of the block's bounding box: (x0, y0) is the top-left corner and (x1, y1) is the bottom-right corner. The values are expressed relative to a page normalized to 1000x1000.
+- `<|ref_start|>type<|ref_end|>`  
+  `type` indicates the block type. Possible values are:
+  ```json
+  {
+      "text": "Text",
+      "title": "Title",
+      "image": "Image",
+      "image_caption": "Image Caption",
+      "image_footnote": "Image Footnote",
+      "table": "Table",
+      "table_caption": "Table Caption",
+      "table_footnote": "Table Footnote",
+      "equation": "Interline Equation"
+  }
+  ```
+- `<|md_start|>Markdown content<|md_end|>`  
+  This field contains the Markdown content of the block. If `type` is `text`, the end of the text may contain the `<|txt_contd|>` tag, indicating that this block can be connected with the following `text` block(s).
+  If `type` is `table`, the content is in `otsl` format and needs to be converted into HTML for rendering in Markdown.
+
 ### some_pdf_middle.json
 
 | Field Name     | Description                                                                                                    |
-| :------------- | :------------------------------------------------------------------------------------------------------------- |
+|:---------------| :------------------------------------------------------------------------------------------------------------- |
 | pdf_info       | list, each element is a dict representing the parsing result of each PDF page, see the table below for details |
-| \_parse_type   | ocr \| txt, used to indicate the mode used in this intermediate parsing state                                  |
-| \_version_name | string, indicates the version of magic-pdf used in this parsing                                                |
+| \_backend      | pipeline \| vlm, indicates the backend used in this intermediate parsing state                                  |
+| \_version_name | string, indicates the version of mineru used in this parsing                                                |
 
 <br>
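
The `some_pdf_model_output.txt` layout described in the hunk above lends itself to a small parser. The following is a minimal Python sketch, not MinerU's own code: `parse_model_output`, `BLOCK_RE`, `CONTD_TAG`, and the `page_width`/`page_height` parameters are hypothetical names. It assumes that the `----` separator sits on a line of its own, that coordinates are non-negative integers, and that the three tagged fields of a block appear in the order shown, optionally separated by whitespace.

```python
import re

# One block: box coordinates, block type, then Markdown content.
# Assumption: the tags appear in this order, with optional whitespace between them.
BLOCK_RE = re.compile(
    r"<\|box_start\|>\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*<\|box_end\|>\s*"
    r"<\|ref_start\|>\s*(\w+)\s*<\|ref_end\|>\s*"
    r"<\|md_start\|>(.*?)<\|md_end\|>",
    re.DOTALL,
)
CONTD_TAG = "<|txt_contd|>"


def parse_model_output(raw, page_width, page_height):
    """Return one list of block dicts per page.

    Coordinates are rescaled from the normalized 1000x1000 page
    to the given page size.
    """
    pages = []
    # Assumption: the page separator `----` occupies a whole line.
    for page_text in re.split(r"^----[ \t]*$", raw, flags=re.MULTILINE):
        blocks = []
        for match in BLOCK_RE.finditer(page_text):
            x0, y0, x1, y1 = (int(v) for v in match.group(1, 2, 3, 4))
            content = match.group(6).rstrip()
            # A trailing <|txt_contd|> marks text that continues in a later block.
            continued = content.endswith(CONTD_TAG)
            if continued:
                content = content[: -len(CONTD_TAG)]
            blocks.append({
                "type": match.group(5),
                "bbox": (
                    x0 * page_width / 1000,
                    y0 * page_height / 1000,
                    x1 * page_width / 1000,
                    y1 * page_height / 1000,
                ),
                "markdown": content,
                "continued": continued,
            })
        pages.append(blocks)
    return pages
```

Blocks of type `table` still carry their `otsl` content after parsing; this sketch leaves the conversion to HTML to a separate step.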
 
@@ -324,7 +349,20 @@ First-level block (if any) -> Second-level block -> Line -> Span
             ]
         }
     ],
-    "_parse_type": "txt",
+    "_backend": "pipeline",
     "_version_name": "0.6.1"
 }
 ```
+
+
+### some_pdf_content_list.json
+
+This file is a JSON array in which each element is a dict. Together, the elements store all readable content blocks in the document, flattened in reading order.  
+`content_list` can be viewed as a simplified version of `middle.json`. The content block types are mostly consistent with those in `middle.json`, but layout information is not included.  
+
+Please note that both `title` and `text` blocks in `content_list` are uniformly represented using the `text` type. The `text_level` field is used to distinguish the hierarchy of text blocks:
+- A block without the `text_level` field or with `text_level=0` represents body text.
+- A block with `text_level=1` represents a level-1 heading.
+- A block with `text_level=2` represents a level-2 heading, and so on.
+
+Each dict contains the `page_idx` field, indicating the page number (starting from 0) where the content block resides.
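
To illustrate the `text_level` convention, the sketch below collects the headings from a `content_list` array. It is a hedged example rather than part of MinerU: `collect_headings` is a hypothetical helper and the sample data is invented. The `type` and `text` key names are assumptions that the description above does not state; only `text_level` and `page_idx` are taken from it.

```python
import json


def collect_headings(content_list):
    """Return (page_idx, level, text) for every heading block, in reading order."""
    headings = []
    for block in content_list:
        # Assumption: the block type is stored under a "type" key.
        if block.get("type") != "text":
            continue
        # A missing text_level, or text_level == 0, means body text.
        level = block.get("text_level") or 0
        if level >= 1:
            # Assumption: the block's content is stored under a "text" key.
            headings.append((block["page_idx"], level, block.get("text", "")))
    return headings


# Invented sample data, shaped like the description above.
sample = json.loads("""
[
  {"type": "text", "text": "Introduction", "text_level": 1, "page_idx": 0},
  {"type": "text", "text": "Body paragraph.", "page_idx": 0},
  {"type": "text", "text": "Background", "text_level": 2, "page_idx": 1},
  {"type": "text", "text": "More body text.", "text_level": 0, "page_idx": 1}
]
""")

for page_idx, level, text in collect_headings(sample):
    print(f"page {page_idx}: {'#' * level} {text}")
```

Because headings and body text share one type, a consumer only needs the `text_level` check to rebuild a document outline.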

+ 43 - 10
docs/output_file_zh_cn.md

@@ -1,20 +1,20 @@
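 ## Overview
 
-After the `magic-pdf` command runs, in addition to the files related to markdown, several files unrelated to markdown are also generated. These files are introduced one by one below.
+After the `mineru` command runs, in addition to the markdown file, several files unrelated to markdown may also be generated. These files are introduced one by one below.
 
 ### some_pdf_layout.pdf
 
-The layout of each page consists of one or more boxes. The number at the top-left corner of each box indicates its sequence number. In addition, layout.pdf uses different background colors inside the boxes to mark different content blocks.
+The layout of each page consists of one or more boxes. The number at the top-right corner of each box indicates its reading order. In addition, layout.pdf uses different background colors inside the boxes to mark different content blocks.
 
 ![layout page example](images/layout_example.png)
 
-### some_pdf_spans.pdf
+### some_pdf_spans.pdf (Applicable only to the pipeline backend)
 
-All spans on the page are drawn with line frames of different colors according to the span type. This file can be used for quality inspection, to quickly spot problems such as missing text or unrecognized inline formulas.
+All spans on the page are drawn with line frames of different colors according to the span type. This file can be used for quality inspection, to quickly spot problems such as missing text or unrecognized inline formulas.
 
 ![span page example](images/spans_example.png)
 
-### some_pdf_model.json
+### some_pdf_model.json (Applicable only to the pipeline backend)
 
 #### Structure Definition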
 
@@ -117,13 +117,39 @@ The format of the poly coordinates is \[x0, y0, x1, y1, x2, y2, x3, y3\], representing respectively the top-left,
 ]
 ```
 
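+### some_pdf_model_output.txt (Applicable only to the VLM backend)
+
+This file is the output of the VLM model, with each page's output separated by `----`.  
+Each page's output consists of text blocks that start with `<|box_start|>` and end with `<|md_end|>`.  
+The fields have the following meanings:  
+- `<|box_start|>x0 y0 x1 y1<|box_end|>`  
+    x0 y0 x1 y1 are the coordinates of a quadrilateral, giving its top-left and bottom-right points. The values are the quadrilateral's coordinates after the page is scaled to 1000x1000.
+- `<|ref_start|>type<|ref_end|>`  
+  type is the type of the block. Possible values are:
+  ```json
+  {
+      "text": "Text",
+      "title": "Title",
+      "image": "Image",
+      "image_caption": "Image Caption",
+      "image_footnote": "Image Footnote",
+      "table": "Table",
+      "table_caption": "Table Caption",
+      "table_footnote": "Table Footnote",
+      "equation": "Interline Equation"
+  }
+  ```
+- `<|md_start|>markdown content<|md_end|>`  
+    This field is the markdown content of the block. If type is text, the end of the text may carry the `<|txt_contd|>` tag, indicating that this text block can be joined with the following text block(s).
+    If type is table, the content is the table expressed in `otsl` format and must be converted to HTML before it can be rendered in markdown.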
+
 ### some_pdf_middle.json
 
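-| Field Name     | Description                                                                                              |
-| :------------- | :------------------------------------------------------------------------------------------------------- |
+| Field Name     | Description                                                                                              |
+|:---------------|:---------------------------------------------------------------------------------------------------------|
 | pdf_info       | list, each element is a dict holding the parsing result of one PDF page, see the table below for details |
-| \_parse_type   | ocr \| txt, identifies the mode used in this intermediate parsing state                                  |
-| \_version_name | string, the version number of magic-pdf used for this parse                                              |
+| \_backend      | pipeline \| vlm, identifies the mode used in this intermediate parsing state                             |
+| \_version_name | string, the version number of mineru used for this parse                                                 |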
 
 <br>
 
@@ -323,7 +349,14 @@ The elements stored in para_blocks are block information
             ]
         }
     ],
-    "_parse_type": "txt",
+    "_backend": "pipeline",
     "_version_name": "0.6.1"
 }
 ```
+
+### some_pdf_content_list.json
+
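+This file is a JSON array in which each element is a dict. Together, the elements store all readable content blocks in the document, flattened in reading order.  
+content_list can be viewed as a simplified middle.json. The content block types are mostly consistent with those in middle.json, but layout information is not included.  
+Note that title and text blocks in content_list are uniformly represented using the text type, and the `text_level` field distinguishes the hierarchy of text blocks: a block without the `text_level` field or with `text_level` equal to 0 is body text, a block with `text_level` equal to 1 is a level-1 heading, a block with `text_level` equal to 2 is a level-2 heading, and so on.  
+Each dict contains the `page_idx` field, which gives the page number of the content block, starting from 0.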