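MinerU Output Files

Overview

Besides the main markdown file, running the mineru command also generates several auxiliary files for debugging, quality inspection, and further processing. These files include:

Visual debugging files: help users see the document parsing process and its results at a glance
Structured data files: contain detailed parsing data and can be used for secondary development

The purpose and format of each file are described below.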

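Visual Debugging Files

Layout Analysis File (layout.pdf)

File naming format: {original file name}_layout.pdf

Description:

Visualizes the layout analysis result for every page
The number in the upper-right corner of each detection box indicates its reading order
Different background colors distinguish the different types of content blocks

Use cases:

Checking whether the layout analysis is correct
Confirming that the reading order is reasonable
Debugging layout-related problems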

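Text Span File (spans.pdf)

[!NOTE] Applies to the pipeline backend only

File naming format: {original file name}_spans.pdf

Description:

Outlines page content with differently colored boxes according to span type
Intended for quality checks and troubleshooting

Use cases:

Quickly tracking down missing text
Checking how inline formulas were recognized
Verifying the accuracy of text segmentation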

Structured Data Files

Model Inference Results (model.json)

[!NOTE] Applies to the pipeline backend only

File naming format: {original file name}_model.json

Data structure definition

from pydantic import BaseModel, Field
from enum import IntEnum

class CategoryType(IntEnum):
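    """Content category enumeration"""
    title = 0               # Title
    plain_text = 1          # Text
    abandon = 2             # Headers, footers, page numbers, and page annotations
    figure = 3              # Figure
    figure_caption = 4      # Figure caption
    table = 5               # Table
    table_caption = 6       # Table caption
    table_footnote = 7      # Table footnote
    isolate_formula = 8     # Interline (display) formula
    formula_caption = 9     # Label of an interline formula
    embedding = 13          # Inline formula
    isolated = 14           # Interline (display) formula
    text = 15               # OCR recognition result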

class PageInfo(BaseModel):
    """Page information"""
    page_no: int = Field(description="Page index; the first page is 0", ge=0)
    height: int = Field(description="Page height", gt=0)
    width: int = Field(description="Page width", ge=0)

class ObjectInferenceResult(BaseModel):
    """Object detection result"""
    category_id: CategoryType = Field(description="Category", ge=0)
    poly: list[float] = Field(description="Quadrilateral coordinates in the form [x0,y0,x1,y1,x2,y2,x3,y3]")
    score: float = Field(description="Confidence of the inference result")
    latex: str | None = Field(description="LaTeX parsing result", default=None)
    html: str | None = Field(description="HTML parsing result", default=None)

class PageInferenceResults(BaseModel):
    """Page inference results"""
    layout_dets: list[ObjectInferenceResult] = Field(description="Detection results for the page")
    page_info: PageInfo = Field(description="Page metadata")

# Complete inference result
inference_result: list[PageInferenceResults] = []

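Coordinate system

poly coordinate format: [x0, y0, x1, y1, x2, y2, x3, y3]

The four pairs are the coordinates of the top-left, top-right, bottom-right, and bottom-left corners, in that order
The origin is at the top-left corner of the page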
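Many downstream tools expect an axis-aligned rectangle rather than four corner points. The following is a minimal sketch of one way to collapse a poly into [x_min, y_min, x_max, y_max]; the helper name poly_to_bbox is hypothetical and not part of MinerU.

```python
def poly_to_bbox(poly: list[float]) -> tuple[float, float, float, float]:
    """Collapse an 8-value quadrilateral into (x_min, y_min, x_max, y_max)."""
    if len(poly) != 8:
        raise ValueError(f"expected 8 coordinates, got {len(poly)}")
    xs = poly[0::2]  # x0, x1, x2, x3
    ys = poly[1::2]  # y0, y1, y2, y3
    return min(xs), min(ys), max(xs), max(ys)


# Quadrilateral copied from the first page of the sample model.json data.
poly = [
    99.1906967163086, 100.3119125366211,
    730.3707885742188, 100.3119125366211,
    730.3707885742188, 245.81326293945312,
    99.1906967163086, 245.81326293945312,
]
bbox = poly_to_bbox(poly)
print(bbox)
```

Taking the minimum and maximum of each axis, instead of reading fixed positions, keeps the result valid even if a detection is slightly rotated.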

Example data

[
    {
        "layout_dets": [
            {
                "category_id": 2,
                "poly": [
                    99.1906967163086,
                    100.3119125366211,
                    730.3707885742188,
                    100.3119125366211,
                    730.3707885742188,
                    245.81326293945312,
                    99.1906967163086,
                    245.81326293945312
                ],
                "score": 0.9999997615814209
            }
        ],
        "page_info": {
            "page_no": 0,
            "height": 2339,
            "width": 1654
        }
    },
    {
        "layout_dets": [
            {
                "category_id": 5,
                "poly": [
                    99.13092803955078,
                    2210.680419921875,
                    497.3183898925781,
                    2210.680419921875,
                    497.3183898925781,
                    2264.78076171875,
                    99.13092803955078,
                    2264.78076171875
                ],
                "score": 0.9999997019767761
            }
        ],
        "page_info": {
            "page_no": 1,
            "height": 2339,
            "width": 1654
        }
    }
]
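As a rough illustration of how this structure might be consumed, the sketch below counts detections per page by category. The function name summarize and the min_score threshold are assumptions made for this example; the category names are copied from the CategoryType enum. In practice the list would usually come from json.load on the _model.json file.

```python
from collections import Counter

# Category names copied from the CategoryType enum in this section.
CATEGORY_NAMES = {
    0: "title", 1: "plain_text", 2: "abandon", 3: "figure",
    4: "figure_caption", 5: "table", 6: "table_caption", 7: "table_footnote",
    8: "isolate_formula", 9: "formula_caption",
    13: "embedding", 14: "isolated", 15: "text",
}


def summarize(pages, min_score=0.5):
    """Return {page_no: Counter(category name -> count)}.

    Detections scoring below min_score (a hypothetical threshold) are skipped.
    """
    summary = {}
    for page in pages:
        counts = Counter()
        for det in page["layout_dets"]:
            if det["score"] >= min_score:
                name = CATEGORY_NAMES.get(det["category_id"], "unknown")
                counts[name] += 1
        summary[page["page_info"]["page_no"]] = counts
    return summary


# Trimmed copy of the example data; the poly values are omitted because
# this sketch does not read them.
pages = [
    {
        "layout_dets": [{"category_id": 2, "score": 0.9999997615814209}],
        "page_info": {"page_no": 0, "height": 2339, "width": 1654},
    },
    {
        "layout_dets": [{"category_id": 5, "score": 0.9999997019767761}],
        "page_info": {"page_no": 1, "height": 2339, "width": 1654},
    },
]
result = summarize(pages)
print(result)
```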

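VLM Output Results (model.json)

[!NOTE] Applies to the VLM backend only

File naming format: {original file name}_model.json

File format

This file holds the raw output of the VLM model as a two-level nested list: the outer list represents pages, and each inner list holds the content blocks of that page
Each content block is a dict with the fields type, bbox, angle, and content

Supported content types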

{
    "text",
    "title", 
    "equation",
    "image",
    "image_caption",
    "image_footnote",
    "table",
    "table_caption",
    "table_footnote",
    "phonetic",
    "code",
    "code_caption",
    "ref_text",
    "algorithm",
    "list",
    "header",
    "footer",
    "page_number",
    "aside_text", 
    "page_footnote", 
}
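The nested page/block layout can be walked with two loops. The sketch below counts blocks by type and reports any type that is not in the list above. The sample data is invented for illustration: its bbox and angle values are placeholders, since the exact value formats are not specified here, and count_block_types is a hypothetical helper name.

```python
from collections import Counter

# Content types copied from the list above.
KNOWN_TYPES = {
    "text", "title", "equation", "image", "image_caption", "image_footnote",
    "table", "table_caption", "table_footnote", "phonetic", "code",
    "code_caption", "ref_text", "algorithm", "list", "header", "footer",
    "page_number", "aside_text", "page_footnote",
}


def count_block_types(pages):
    """Return (Counter of block types, set of types not in KNOWN_TYPES)."""
    counts = Counter()
    unknown = set()
    for page in pages:        # outer list: one entry per page
        for block in page:    # inner list: content blocks of that page
            block_type = block["type"]
            counts[block_type] += 1
            if block_type not in KNOWN_TYPES:
                unknown.add(block_type)
    return counts, unknown


# Invented two-page sample; bbox and angle are placeholder values only.
sample = [
    [
        {"type": "title", "bbox": [0, 0, 0, 0], "angle": 0, "content": "Example title"},
        {"type": "text", "bbox": [0, 0, 0, 0], "angle": 0, "content": "Example paragraph"},
    ],
    [
        {"type": "table", "bbox": [0, 0, 0, 0], "angle": 0, "content": "<table></table>"},
    ],
]
counts, unknown = count_block_types(sample)
print(counts, unknown)
```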

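Intermediate Processing Results (middle.json)

[!NOTE] Applies to the pipeline backend only

File naming format: {original file name}_middle.json

Top-level structure

Field	Type	Description
`pdf_info`	`list[dict]`	Array of parsing results, one entry per page
`_backend`	`string`	Parsing mode: `pipeline` or `vlm`
`_version_name`	`string`	MinerU version number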

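Page information structure (pdf_info)

Field	Description
`preproc_blocks`	Unsegmented intermediate result after PDF preprocessing
`layout_bboxes`	Layout segmentation result, including layout direction and bounding boxes, sorted in reading order
`page_idx`	Page index, starting from 0
`page_size`	Page width and height `[width, height]`
`_layout_tree`	Layout tree structure
`images`	List of image block information
`tables`	List of table block information
`interline_equations`	List of interline equation block information
`discarded_blocks`	Blocks that should be discarded
`para_blocks`	Content blocks after paragraph segmentation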

Block hierarchy

First-level block (table | image)
└── Second-level block
    └── Line (line)
        └── Span (span)

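First-level block fields

Field	Description
`type`	Block type: `table` or `image`
`bbox`	Rectangular bounding box of the block `[x0, y0, x1, y1]`
`blocks`	List of the second-level blocks it contains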

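Second-level block fields

Field	Description
`type`	Block type (see the table below)
`bbox`	Rectangular bounding box of the block
`lines`	List of the lines it contains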

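Second-level block types

Type	Description
`image_body`	Image body
`image_caption`	Image caption text
`image_footnote`	Image footnote
`table_body`	Table body
`table_caption`	Table caption text
`table_footnote`	Table footnote
`text`	Text block
`title`	Title block
`index`	Table-of-contents block
`list`	List block
`interline_equation`	Interline equation block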

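Line and span structure

Line (line) fields:

bbox: rectangular bounding box of the line
spans: list of the spans it contains

Span (span) fields:

bbox: rectangular bounding box of the span
type: span type (image, table, text, inline_equation, interline_equation)
content | img_path: text content or image path

Example data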

{
    "pdf_info": [
        {
            "preproc_blocks": [
                {
                    "type": "text",
                    "bbox": [
                        52,
                        61.956024169921875,
                        294,
                        82.99800872802734
                    ],
                    "lines": [
                        {
                            "bbox": [
                                52,
                                61.956024169921875,
                                294,
                                72.0000228881836
                            ],
                            "spans": [
                                {
                                    "bbox": [
                                        54.0,
                                        61.956024169921875,
                                        296.2261657714844,
                                        72.0000228881836
                                    ],
                                    "content": "dependent on the service headway and the reliability of the departure ",
                                    "type": "text",
                                    "score": 1.0
                                }
                            ]
                        }
                    ]
                }
            ],
            "layout_bboxes": [
                {
                    "layout_bbox": [
                        52,
                        61,
                        294,
                        731
                    ],
                    "layout_label": "V",
                    "sub_layout": []
                }
            ],
            "page_idx": 0,
            "page_size": [
                612.0,
                792.0
            ],
            "_layout_tree": [],
            "images": [],
            "tables": [],
            "interline_equations": [],
            "discarded_blocks": [],
            "para_blocks": [
                {
                    "type": "text",
                    "bbox": [
                        52,
                        61.956024169921875,
                        294,
                        82.99800872802734
                    ],
                    "lines": [
                        {
                            "bbox": [
                                52,
                                61.956024169921875,
                                294,
                                72.0000228881836
                            ],
                            "spans": [
                                {
                                    "bbox": [
                                        54.0,
                                        61.956024169921875,
                                        296.2261657714844,
                                        72.0000228881836
                                    ],
                                    "content": "dependent on the service headway and the reliability of the departure ",
                                    "type": "text",
                                    "score": 1.0
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ],
    "_backend": "pipeline",
    "_version_name": "0.6.1"
}
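To show how the block, line, and span hierarchy nests in practice, here is a minimal sketch that collects the text of every span under para_blocks, page by page. The helper names iter_spans and page_texts are hypothetical. It assumes that first-level blocks keep their children under blocks and that spans without a content key (for example, those carrying img_path) can be skipped.

```python
def iter_spans(block):
    """Yield every span in a block, descending into nested blocks."""
    # First-level blocks (table / image) wrap second-level blocks in "blocks".
    for child in block.get("blocks", []):
        yield from iter_spans(child)
    # Second-level blocks hold lines, and each line holds spans.
    for line in block.get("lines", []):
        yield from line.get("spans", [])


def page_texts(middle):
    """Return one list of span texts per page, read from para_blocks."""
    pages = []
    for page in middle["pdf_info"]:
        texts = []
        for block in page["para_blocks"]:
            for span in iter_spans(block):
                # Assumption: spans with img_path instead of content are skipped.
                if "content" in span:
                    texts.append(span["content"])
        pages.append(texts)
    return pages


# Reduced copy of the example data, keeping only the keys this sketch reads.
middle = {
    "pdf_info": [
        {
            "page_idx": 0,
            "para_blocks": [
                {
                    "type": "text",
                    "lines": [
                        {
                            "spans": [
                                {
                                    "type": "text",
                                    "content": "dependent on the service headway and the reliability of the departure ",
                                }
                            ]
                        }
                    ],
                }
            ],
        }
    ]
}
texts = page_texts(middle)
print(texts)
```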

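Content List (content_list.json)

[!NOTE] Applies to the pipeline backend only

File naming format: {original file name}_content_list.json

Description

This is a simplified version of middle.json. It stores all readable content blocks as a flat list in reading order and drops the complex layout information, which makes downstream processing easier.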

Content types

Type	Description
`image`	Image
`table`	Table
`text`	Text or heading
`equation`	Interline equation

Text level

The text_level field distinguishes text levels:

No text_level, or text_level: 0: body text
text_level: 1: level-1 heading
text_level: 2: level-2 heading
and so on

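Common fields

Every content block has a page_idx field giving the page it appears on (starting from 0).
Every content block has a bbox field giving the block's bounding box [x0, y0, x1, y1], mapped into the range 0-1000.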
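To draw or crop a block, the normalized bbox has to be scaled back to page coordinates. The sketch below assumes a linear mapping in which x values are scaled by the page width and y values by the page height; the text above only states the 0-1000 range, so treat this as an assumption to verify against your own output. The helper name to_page_coords is hypothetical, and the page size would typically come from page_size in middle.json.

```python
def to_page_coords(bbox, page_width, page_height):
    """Scale a 0-1000 normalized [x0, y0, x1, y1] box to page units.

    Assumes x is normalized by page width and y by page height.
    """
    x0, y0, x1, y1 = bbox
    return (
        x0 / 1000 * page_width,
        y0 / 1000 * page_height,
        x1 / 1000 * page_width,
        y1 / 1000 * page_height,
    )


# A box covering the right half of the top half of a 600 x 800 page.
print(to_page_coords([500, 0, 1000, 500], 600, 800))  # (300.0, 0.0, 600.0, 400.0)
```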

Example data

[
    {
        "type": "text",
        "text": "The response of flow duration curves to afforestation ",
        "text_level": 1, 
        "bbox": [
            62,
            480,
            946,
            904
        ],
        "page_idx": 0
    },
    {
        "type": "image",
        "img_path": "images/a8ecda1c69b27e4f79fce1589175a9d721cbdc1cf78b4cc06a015f3746f6b9d8.jpg",
        "img_caption": [
            "Fig. 1. Annual flow duration curves of daily flows from Pine Creek, Australia, 1989–2000. "
        ],
        "img_footnote": [],
        "bbox": [
            62,
            480,
            946,
            904
        ],
        "page_idx": 1
    },
    {
        "type": "equation",
        "img_path": "images/181ea56ef185060d04bf4e274685f3e072e922e7b839f093d482c29bf89b71e8.jpg",
        "text": "$$\nQ _ { \\% } = f ( P ) + g ( T )\n$$",
        "text_format": "latex",
        "bbox": [
            62,
            480,
            946,
            904
        ],
        "page_idx": 2
    },
    {
        "type": "table",
        "img_path": "images/e3cb413394a475e555807ffdad913435940ec637873d673ee1b039e3bc3496d0.jpg",
        "table_caption": [
            "Table 2 Significance of the rainfall and time terms "
        ],
        "table_footnote": [
            "indicates that the rainfall term was significant at the $5 \\%$ level, $T$ indicates that the time term was significant at the $5 \\%$ level, \\* represents significance at the $10 \\%$ level, and na denotes too few data points for meaningful analysis. "
        ],
        "table_body": "<html><body><table><tr><td rowspan=\"2\">Site</td><td colspan=\"10\">Percentile</td></tr><tr><td>10</td><td>20</td><td>30</td><td>40</td><td>50</td><td>60</td><td>70</td><td>80</td><td>90</td><td>100</td></tr><tr><td>Traralgon Ck</td><td>P</td><td>P,*</td><td>P</td><td>P</td><td>P,</td><td>P,</td><td>P,</td><td>P,</td><td>P</td><td>P</td></tr><tr><td>Redhill</td><td>P,T</td><td>P,T</td><td>，*</td><td>**</td><td>P.T</td><td>P,*</td><td>P*</td><td>P*</td><td>*</td><td>，*</td></tr><tr><td>Pine Ck</td><td></td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>T</td><td>T</td><td>na</td><td>na</td></tr><tr><td>Stewarts Ck 5</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P.T</td><td>P.T</td><td>P,T</td><td>na</td><td>na</td><td>na</td></tr><tr><td>Glendhu 2</td><td>P</td><td>P,T</td><td>P,*</td><td>P,T</td><td>P.T</td><td>P,ns</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td></tr><tr><td>Cathedral Peak 2</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>*,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td></tr><tr><td>Cathedral Peak 3</td><td>P.T</td><td>P.T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td></tr><tr><td>Lambrechtsbos A</td><td>P,T</td><td>P</td><td>P</td><td>P,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>T</td></tr><tr><td>Lambrechtsbos B</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>T</td></tr><tr><td>Biesievlei</td><td>P,T</td><td>P.T</td><td>P,T</td><td>P,T</td><td>*,T</td><td>*,T</td><td>T</td><td>T</td><td>P,T</td><td>P,T</td></tr></table></body></html>",
        "bbox": [
            62,
            480,
            946,
            904
        ],  
        "page_idx": 5
    }
]
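Because the list is flat and already in reading order, a single loop is enough to turn it into readable text. The following sketch maps text_level to markdown heading markers and replaces images and tables with a placeholder line. It is an illustration only, not MinerU's own markdown renderer; to_markdown is a hypothetical name, and the second and third sample items are invented.

```python
def to_markdown(items):
    """Render a content list as simple markdown-like text."""
    lines = []
    for item in items:
        kind = item["type"]
        if kind == "text":
            # A missing text_level or text_level 0 means body text.
            level = item.get("text_level", 0)
            prefix = "#" * level + " " if level > 0 else ""
            lines.append(prefix + item["text"].strip())
        elif kind == "equation":
            # Equation text already carries its own $$ delimiters.
            lines.append(item["text"])
        else:
            # Images and tables are reduced to a placeholder in this sketch.
            lines.append(f"[{kind} on page {item['page_idx']}]")
    return "\n\n".join(lines)


items = [
    {
        "type": "text",
        "text": "The response of flow duration curves to afforestation ",
        "text_level": 1,
        "page_idx": 0,
    },
    {"type": "text", "text": "Body paragraph.", "page_idx": 0},  # invented item
    {"type": "image", "img_path": "images/example.jpg", "page_idx": 1},  # invented path
]
markdown = to_markdown(items)
print(markdown)
```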

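Summary

The files above make up the complete output of MinerU. Choose the ones that fit your downstream processing:

Model output: use the raw output (model.json, model_output.txt)
Debugging and validation: use the visualization files (layout.pdf, spans.pdf)
Content extraction: use the simplified files (*.md, content_list.json)
Secondary development: use the structured file (middle.json)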
