MinerU Output Files Documentation

Overview

After running the mineru command, MinerU writes the main Markdown file plus several auxiliary files for debugging, quality inspection, and further processing. These files fall into two groups:

Visual debugging files: Help users intuitively understand the document parsing process and results
Structured data files: Contain detailed parsing data for secondary development

The following sections provide detailed descriptions of each file's purpose and format.

Visual Debugging Files

Layout Analysis File (layout.pdf)

File naming format: {original_filename}_layout.pdf

Functionality:

Visualizes layout analysis results for each page
Numbers in the top-right corner of each detection box indicate reading order
Different background colors distinguish different types of content blocks

Use cases:

Check if layout analysis is correct
Verify if reading order is reasonable
Debug layout-related issues

Text Spans File (spans.pdf)

Note: Only applicable to the pipeline backend.

File naming format: {original_filename}_spans.pdf

Functionality:

Uses different colored line boxes to annotate page content based on span type
Used for quality inspection and issue troubleshooting

Use cases:

Quickly troubleshoot text loss issues
Check inline formula recognition
Verify text segmentation accuracy

Structured Data Files

Model Inference Results (model.json)

Note: Only applicable to the pipeline backend.

File naming format: {original_filename}_model.json

Data Structure Definition

from pydantic import BaseModel, Field
from enum import IntEnum

class CategoryType(IntEnum):
    """Content category enumeration"""
    title = 0               # Title
    plain_text = 1          # Text
    abandon = 2             # Including headers, footers, page numbers, and page annotations
    figure = 3              # Image
    figure_caption = 4      # Image caption
    table = 5               # Table
    table_caption = 6       # Table caption
    table_footnote = 7      # Table footnote
    isolate_formula = 8     # Interline formula
    formula_caption = 9     # Interline formula number
    embedding = 13          # Inline formula
    isolated = 14           # Interline formula
    text = 15               # OCR recognition result

class PageInfo(BaseModel):
    """Page information"""
    page_no: int = Field(description="Page number, first page is 0", ge=0)
    height: int = Field(description="Page height", gt=0)
    width: int = Field(description="Page width", gt=0)

class ObjectInferenceResult(BaseModel):
    """Object recognition result"""
    category_id: CategoryType = Field(description="Category")  # enum membership already restricts the value
    poly: list[float] = Field(description="Quadrilateral coordinates, format: [x0,y0,x1,y1,x2,y2,x3,y3]")
    score: float = Field(description="Confidence score of inference result")
    latex: str | None = Field(description="LaTeX parsing result", default=None)
    html: str | None = Field(description="HTML parsing result", default=None)

class PageInferenceResults(BaseModel):
    """Page inference results"""
    layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results")
    page_info: PageInfo = Field(description="Page metadata")

# Complete inference results
inference_result: list[PageInferenceResults] = []

Coordinate System Description

poly coordinate format: [x0, y0, x1, y1, x2, y2, x3, y3]

The four (x, y) pairs are the top-left, top-right, bottom-right, and bottom-left corners, in that order
Coordinate origin is at the top-left corner of the page
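
Many downstream tools expect an axis-aligned [x0, y0, x1, y1] box rather than four corners. The following is a minimal sketch of that conversion; the helper name poly_to_bbox is hypothetical and not part of MinerU.

```python
def poly_to_bbox(poly: list[float]) -> tuple[float, float, float, float]:
    """Collapse [x0,y0,x1,y1,x2,y2,x3,y3] into (left, top, right, bottom)."""
    if len(poly) != 8:
        raise ValueError(f"expected 8 values, got {len(poly)}")
    xs = poly[0::2]  # x of each corner
    ys = poly[1::2]  # y of each corner
    return min(xs), min(ys), max(xs), max(ys)


# Illustrative values only (rounded, not taken from a real run).
poly = [99.0, 100.0, 730.0, 100.0, 730.0, 245.0, 99.0, 245.0]
bbox = poly_to_bbox(poly)
print(bbox)
```

Taking the min and max instead of reading corners 0 and 2 directly keeps the result valid even if a polygon is slightly rotated.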

Sample Data

[
    {
        "layout_dets": [
            {
                "category_id": 2,
                "poly": [
                    99.1906967163086,
                    100.3119125366211,
                    730.3707885742188,
                    100.3119125366211,
                    730.3707885742188,
                    245.81326293945312,
                    99.1906967163086,
                    245.81326293945312
                ],
                "score": 0.9999997615814209
            }
        ],
        "page_info": {
            "page_no": 0,
            "height": 2339,
            "width": 1654
        }
    },
    {
        "layout_dets": [
            {
                "category_id": 5,
                "poly": [
                    99.13092803955078,
                    2210.680419921875,
                    497.3183898925781,
                    2210.680419921875,
                    497.3183898925781,
                    2264.78076171875,
                    99.13092803955078,
                    2264.78076171875
                ],
                "score": 0.9999997019767761
            }
        ],
        "page_info": {
            "page_no": 1,
            "height": 2339,
            "width": 1654
        }
    }
]
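
As a rough illustration of how this file might be consumed, the sketch below counts detections per category on each page. The summarize helper and the inline raw string are hypothetical stand-ins; in practice the JSON would be read from a {original_filename}_model.json file.

```python
import json
from collections import Counter

# Subset of the category ids listed in CategoryType above.
CATEGORY_NAMES = {0: "title", 1: "plain_text", 2: "abandon", 3: "figure", 5: "table"}


def summarize(pages: list[dict]) -> dict[int, Counter]:
    """Map page_no -> Counter of category names detected on that page."""
    summary = {}
    for page in pages:
        page_no = page["page_info"]["page_no"]
        names = (
            CATEGORY_NAMES.get(det["category_id"], "other")
            for det in page["layout_dets"]
        )
        summary[page_no] = Counter(names)
    return summary


# Inline stand-in for file contents; the poly values are illustrative.
raw = """[
  {"layout_dets": [{"category_id": 2, "poly": [0, 0, 10, 0, 10, 5, 0, 5], "score": 0.99}],
   "page_info": {"page_no": 0, "height": 2339, "width": 1654}},
  {"layout_dets": [{"category_id": 5, "poly": [0, 0, 10, 0, 10, 5, 0, 5], "score": 0.98}],
   "page_info": {"page_no": 1, "height": 2339, "width": 1654}}
]"""
result = summarize(json.loads(raw))
print(result)
```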

VLM Output Results (model_output.txt)

Note: Only applicable to the VLM backend.

File naming format: {original_filename}_model_output.txt

File Format Description

Uses ---- to separate output results for each page
Each page contains multiple text blocks starting with <|box_start|> and ending with <|md_end|>

Field Meanings

Tag	Format	Description
Bounding box	`<|box_start|>x0 y0 x1 y1<|box_end|>`	Rectangle coordinates (top-left and bottom-right points), with values expressed after scaling the page to 1000×1000
Type tag	`<|ref_start|>type<|ref_end|>`	Content block type identifier
Content	`<|md_start|>markdown content<|md_end|>`	Markdown content of the block
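
The tag layout above can be read back with a regular expression. This is a sketch under stated assumptions, not MinerU's own parser: it assumes the ---- separator sits on a line of its own, that coordinates are integers, and that only whitespace may appear between tags. The name parse_model_output is hypothetical.

```python
import re

# One block: bounding box, type tag, then Markdown content.
BLOCK_RE = re.compile(
    r"<\|box_start\|>\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*<\|box_end\|>\s*"
    r"<\|ref_start\|>(.*?)<\|ref_end\|>\s*"
    r"<\|md_start\|>(.*?)<\|md_end\|>",
    re.DOTALL,
)
# Assumption: the page separator occupies a whole line.
PAGE_SEP_RE = re.compile(r"^----$", re.MULTILINE)


def parse_model_output(text: str) -> list[list[dict]]:
    """Return one list of blocks per page, in file order."""
    pages = []
    for page_text in PAGE_SEP_RE.split(text):
        blocks = []
        for m in BLOCK_RE.finditer(page_text):
            x0, y0, x1, y1 = (int(v) for v in m.group(1, 2, 3, 4))
            blocks.append(
                {"bbox": [x0, y0, x1, y1], "type": m.group(5), "content": m.group(6)}
            )
        pages.append(blocks)
    return pages


# Hand-written sample in the documented format (not real model output).
sample = (
    "<|box_start|>100 50 900 120<|box_end|><|ref_start|>title<|ref_end|>"
    "<|md_start|># Introduction<|md_end|>\n"
    "----\n"
    "<|box_start|>100 150 900 400<|box_end|><|ref_start|>text<|ref_end|>"
    "<|md_start|>Body text.<|md_end|>\n"
)
parsed = parse_model_output(sample)
print(parsed)
```

Matching the separator only as a full line reduces the chance of splitting inside Markdown content that happens to contain a run of hyphens.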

Supported Content Types

{
    "text": "Text",
    "title": "Title", 
    "image": "Image",
    "image_caption": "Image caption",
    "image_footnote": "Image footnote",
    "table": "Table",
    "table_caption": "Table caption", 
    "table_footnote": "Table footnote",
    "equation": "Interline formula"
}

Special Tags

<|txt_contd|>: Appears at the end of text, indicating that this text block can be connected with subsequent text blocks
Table content uses the OTSL format and must be converted to HTML before it can be rendered in Markdown
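
One possible way to act on the <|txt_contd|> marker is to strip it and join the block with the one that follows. The sketch below assumes plain concatenation with no added separator, which the format description does not specify; merge_continued is a hypothetical helper.

```python
CONTD = "<|txt_contd|>"


def merge_continued(contents: list[str]) -> list[str]:
    """Join each block ending in <|txt_contd|> with the block that follows it."""
    merged: list[str] = []
    carry = ""  # text waiting for its continuation
    for content in contents:
        content = carry + content
        if content.endswith(CONTD):
            carry = content[: -len(CONTD)]
        else:
            merged.append(content)
            carry = ""
    if carry:
        merged.append(carry)  # trailing continuation with nothing to attach to
    return merged


blocks = ["The quick <|txt_contd|>", "brown fox.", "Next paragraph."]
joined = merge_continued(blocks)
print(joined)
```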

Intermediate Processing Results (middle.json)

File naming format: {original_filename}_middle.json

Top-level Structure

Field Name	Type	Description
`pdf_info`	`list[dict]`	Array of parsing results for each page
`_backend`	`string`	Parsing mode: `pipeline` or `vlm`
`_version_name`	`string`	MinerU version number

Page Information Structure (pdf_info)

Field Name	Description
`preproc_blocks`	Unsegmented intermediate results after PDF preprocessing
`layout_bboxes`	Layout segmentation results, including layout direction and bounding boxes, sorted by reading order
`page_idx`	Page number, starting from 0
`page_size`	Page width and height `[width, height]`
`_layout_tree`	Layout tree structure
`images`	Image block information list
`tables`	Table block information list
`interline_equations`	Interline formula block information list
`discarded_blocks`	Block information to be discarded
`para_blocks`	Content block results after segmentation

Block Structure Hierarchy

Level 1 blocks (table | image)
└── Level 2 blocks
    └── Lines
        └── Spans

Level 1 Block Fields

Field Name	Description
`type`	Block type: `table` or `image`
`bbox`	Rectangular box coordinates of the block `[x0, y0, x1, y1]`
`blocks`	List of contained level 2 blocks

Level 2 Block Fields

Field Name	Description
`type`	Block type (see table below)
`bbox`	Rectangular box coordinates of the block
`lines`	List of contained line information

Level 2 Block Types

Type	Description
`image_body`	Image body
`image_caption`	Image caption text
`image_footnote`	Image footnote
`table_body`	Table body
`table_caption`	Table caption text
`table_footnote`	Table footnote
`text`	Text block
`title`	Title block
`index`	Index block
`list`	List block
`interline_equation`	Interline formula block

Line and Span Structure

Line fields:

bbox: Rectangular box coordinates of the line
spans: List of contained spans

Span fields:

bbox: Rectangular box coordinates of the span
type: Span type (image, table, text, inline_equation, interline_equation)
content | img_path: Text content or image path
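
The hierarchy above can be walked with a short recursive generator. This is a minimal sketch assuming the field names documented in this section; iter_spans and text_contents are hypothetical names, and the image caption in the sample is invented for illustration.

```python
def iter_spans(block: dict):
    """Yield every span under a block, descending into level 2 blocks if present."""
    for child in block.get("blocks", []):  # level 1 (table | image) -> level 2
        yield from iter_spans(child)
    for line in block.get("lines", []):  # level 2 -> lines -> spans
        yield from line.get("spans", [])


def text_contents(page: dict) -> list[str]:
    """Collect the content of all text spans in a page's para_blocks."""
    return [
        span["content"]
        for block in page.get("para_blocks", [])
        for span in iter_spans(block)
        if span.get("type") == "text" and "content" in span
    ]


# Trimmed-down page in the shape of a pdf_info entry.
page = {
    "page_idx": 0,
    "para_blocks": [
        {"type": "text", "lines": [
            {"spans": [{"type": "text", "content": "dependent on the service headway "}]}
        ]},
        {"type": "image", "blocks": [
            {"type": "image_caption", "lines": [
                {"spans": [{"type": "text", "content": "Fig. 1"}]}
            ]}
        ]},
    ],
}
texts = text_contents(page)
print(texts)
```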

Sample Data

{
    "pdf_info": [
        {
            "preproc_blocks": [
                {
                    "type": "text",
                    "bbox": [
                        52,
                        61.956024169921875,
                        294,
                        82.99800872802734
                    ],
                    "lines": [
                        {
                            "bbox": [
                                52,
                                61.956024169921875,
                                294,
                                72.0000228881836
                            ],
                            "spans": [
                                {
                                    "bbox": [
                                        54.0,
                                        61.956024169921875,
                                        296.2261657714844,
                                        72.0000228881836
                                    ],
                                    "content": "dependent on the service headway and the reliability of the departure ",
                                    "type": "text",
                                    "score": 1.0
                                }
                            ]
                        }
                    ]
                }
            ],
            "layout_bboxes": [
                {
                    "layout_bbox": [
                        52,
                        61,
                        294,
                        731
                    ],
                    "layout_label": "V",
                    "sub_layout": []
                }
            ],
            "page_idx": 0,
            "page_size": [
                612.0,
                792.0
            ],
            "_layout_tree": [],
            "images": [],
            "tables": [],
            "interline_equations": [],
            "discarded_blocks": [],
            "para_blocks": [
                {
                    "type": "text",
                    "bbox": [
                        52,
                        61.956024169921875,
                        294,
                        82.99800872802734
                    ],
                    "lines": [
                        {
                            "bbox": [
                                52,
                                61.956024169921875,
                                294,
                                72.0000228881836
                            ],
                            "spans": [
                                {
                                    "bbox": [
                                        54.0,
                                        61.956024169921875,
                                        296.2261657714844,
                                        72.0000228881836
                                    ],
                                    "content": "dependent on the service headway and the reliability of the departure ",
                                    "type": "text",
                                    "score": 1.0
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ],
    "_backend": "pipeline",
    "_version_name": "0.6.1"
}

Content List (content_list.json)

File naming format: {original_filename}_content_list.json

Functionality

This is a simplified version of middle.json that stores all readable content blocks in reading order as a flat structure, removing complex layout information for easier subsequent processing.

Content Types

Type	Description
`image`	Image
`table`	Table
`text`	Text/Title
`equation`	Interline formula

Text Level Identification

Text levels are distinguished through the text_level field:

No text_level or text_level: 0: Body text
text_level: 1: Level 1 heading
text_level: 2: Level 2 heading
And so on...
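
As an illustration of the text_level convention, a text item can be turned into a Markdown line by prefixing one # per heading level. The function name to_markdown_line is hypothetical, and this sketch handles only items of type text.

```python
def to_markdown_line(item: dict) -> str:
    """Render a content_list text item as a heading or a body line."""
    level = item.get("text_level", 0)  # a missing field means body text
    text = item.get("text", "").strip()
    return f"{'#' * level} {text}" if level > 0 else text


heading = to_markdown_line(
    {"type": "text", "text": "The response of flow duration curves to afforestation ", "text_level": 1}
)
body = to_markdown_line({"type": "text", "text": "Plain paragraph."})
print(heading)
print(body)
```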

Common Fields

All content blocks include a page_idx field indicating the page number (starting from 0).
All content blocks include a bbox field representing the bounding box coordinates of the content block [x0, y0, x1, y1], mapped to a range of 0-1000.
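
To draw or crop a block, the 0-1000 bbox has to be mapped back to page units. The sketch below assumes the mapping is a simple linear scale applied independently to each axis, and that the page size comes from the page_size field ([width, height]) in middle.json; denormalize_bbox is a hypothetical helper.

```python
def denormalize_bbox(bbox: list[float], page_size: list[float]) -> list[float]:
    """Scale a 0-1000 [x0, y0, x1, y1] box to page units."""
    x0, y0, x1, y1 = bbox
    width, height = page_size
    return [
        x0 * width / 1000,
        y0 * height / 1000,
        x1 * width / 1000,
        y1 * height / 1000,
    ]


# Illustrative numbers chosen to divide evenly.
box = denormalize_bbox([100, 250, 500, 750], [600.0, 800.0])
print(box)
```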

Sample Data

[
    {
        "type": "text",
        "text": "The response of flow duration curves to afforestation ",
        "text_level": 1,
        "bbox": [
            62,
            480,
            946,
            904
        ],
        "page_idx": 0
    },
    {
        "type": "image",
        "img_path": "images/a8ecda1c69b27e4f79fce1589175a9d721cbdc1cf78b4cc06a015f3746f6b9d8.jpg",
        "img_caption": [
            "Fig. 1. Annual flow duration curves of daily flows from Pine Creek, Australia, 1989–2000. "
        ],
        "img_footnote": [],
        "bbox": [
            62,
            480,
            946,
            904
        ],
        "page_idx": 1
    },
    {
        "type": "equation",
        "img_path": "images/181ea56ef185060d04bf4e274685f3e072e922e7b839f093d482c29bf89b71e8.jpg",
        "text": "$$\nQ _ { \\% } = f ( P ) + g ( T )\n$$",
        "text_format": "latex",
        "bbox": [
            62,
            480,
            946,
            904
        ],
        "page_idx": 2
    },
    {
        "type": "table",
        "img_path": "images/e3cb413394a475e555807ffdad913435940ec637873d673ee1b039e3bc3496d0.jpg",
        "table_caption": [
            "Table 2 Significance of the rainfall and time terms "
        ],
        "table_footnote": [
            "indicates that the rainfall term was significant at the $5 \\%$ level, $T$ indicates that the time term was significant at the $5 \\%$ level, \\* represents significance at the $10 \\%$ level, and na denotes too few data points for meaningful analysis. "
        ],
        "table_body": "<html><body><table><tr><td rowspan=\"2\">Site</td><td colspan=\"10\">Percentile</td></tr><tr><td>10</td><td>20</td><td>30</td><td>40</td><td>50</td><td>60</td><td>70</td><td>80</td><td>90</td><td>100</td></tr><tr><td>Traralgon Ck</td><td>P</td><td>P,*</td><td>P</td><td>P</td><td>P,</td><td>P,</td><td>P,</td><td>P,</td><td>P</td><td>P</td></tr><tr><td>Redhill</td><td>P,T</td><td>P,T</td><td>，*</td><td>**</td><td>P.T</td><td>P,*</td><td>P*</td><td>P*</td><td>*</td><td>，*</td></tr><tr><td>Pine Ck</td><td></td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>T</td><td>T</td><td>na</td><td>na</td></tr><tr><td>Stewarts Ck 5</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P.T</td><td>P.T</td><td>P,T</td><td>na</td><td>na</td><td>na</td></tr><tr><td>Glendhu 2</td><td>P</td><td>P,T</td><td>P,*</td><td>P,T</td><td>P.T</td><td>P,ns</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td></tr><tr><td>Cathedral Peak 2</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>*,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td></tr><tr><td>Cathedral Peak 3</td><td>P.T</td><td>P.T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td></tr><tr><td>Lambrechtsbos A</td><td>P,T</td><td>P</td><td>P</td><td>P,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>*,T</td><td>T</td></tr><tr><td>Lambrechtsbos B</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>P,T</td><td>T</td><td>T</td></tr><tr><td>Biesievlei</td><td>P,T</td><td>P.T</td><td>P,T</td><td>P,T</td><td>*,T</td><td>*,T</td><td>T</td><td>T</td><td>P,T</td><td>P,T</td></tr></table></body></html>",
        "bbox": [
            62,
            480,
            946,
            904
        ],  
        "page_idx": 5
    }
]

Summary

The above files constitute MinerU's complete output results. Users can choose appropriate files for subsequent processing based on their needs:

Model outputs: Use raw outputs (model.json, model_output.txt)
Debugging and verification: Use visualization files (layout.pdf, spans.pdf)
Content extraction: Use simplified files (*.md, content_list.json)
Secondary development: Use structured files (middle.json)
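
The naming formats listed throughout this document can be gathered into one lookup. The sketch below assumes all output files sit in a single directory, which this document does not state; expected_outputs and SUFFIXES are hypothetical names.

```python
from pathlib import Path

# Suffixes taken from the "File naming format" lines in this document.
SUFFIXES = {
    "layout": "_layout.pdf",
    "spans": "_spans.pdf",  # pipeline backend only
    "model": "_model.json",  # pipeline backend only
    "model_output": "_model_output.txt",  # VLM backend only
    "middle": "_middle.json",
    "content_list": "_content_list.json",
}


def expected_outputs(output_dir: str, original_filename: str) -> dict[str, Path]:
    """Build the expected path of each auxiliary file for one input document."""
    base = Path(output_dir)
    return {
        key: base / f"{original_filename}{suffix}"
        for key, suffix in SUFFIXES.items()
    }


paths = expected_outputs("out", "paper")
for key, path in paths.items():
    print(key, path.name)
```

Checking path.exists() on each entry is a quick way to see which backend produced a given output directory.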
