Output File Description
=========================

Besides the Markdown output, the ``magic-pdf`` command generates several
auxiliary files. Each of them is described below.

some_pdf_layout.pdf
~~~~~~~~~~~~~~~~~~~

Each page layout consists of one or more boxes. The number at the
top-left corner of each box is its sequence number. In ``layout.pdf``,
each type of content block is also highlighted with its own background
color.

.. figure:: ../../_static/image/layout_example.png
   :alt: layout example

   layout example

some_pdf_spans.pdf
~~~~~~~~~~~~~~~~~~

Every span on the page is outlined with a frame whose color depends on
the span type. This file is useful for quality control: it makes
problems such as missing text or unrecognized inline formulas easy to
spot.

.. figure:: ../../_static/image/spans_example.png
   :alt: spans example

   spans example

some_pdf_model.json
~~~~~~~~~~~~~~~~~~~

Structure Definition
^^^^^^^^^^^^^^^^^^^^

.. code:: python

   from enum import IntEnum

   from pydantic import BaseModel, Field


   class CategoryType(IntEnum):
       title = 0             # Title
       plain_text = 1        # Text
       abandon = 2           # Includes headers, footers, page numbers, and page annotations
       figure = 3            # Image
       figure_caption = 4    # Image description
       table = 5             # Table
       table_caption = 6     # Table description
       table_footnote = 7    # Table footnote
       isolate_formula = 8   # Block formula
       formula_caption = 9   # Formula label
       embedding = 13        # Inline formula
       isolated = 14         # Block formula
       text = 15             # OCR recognition result


   class PageInfo(BaseModel):
       page_no: int = Field(description="Page number, the first page is 0", ge=0)
       height: int = Field(description="Page height", gt=0)
       width: int = Field(description="Page width", ge=0)


   class ObjectInferenceResult(BaseModel):
       # The enum already restricts the allowed values, so no numeric bound is needed.
       category_id: CategoryType = Field(description="Category")
       poly: list[float] = Field(description="Quadrilateral coordinates, representing the coordinates of the top-left, top-right, bottom-right, and bottom-left points respectively")
       score: float = Field(description="Confidence of the inference result")
       latex: str | None = Field(description="LaTeX parsing result", default=None)
       html: str | None = Field(description="HTML parsing result", default=None)


   class PageInferenceResults(BaseModel):
       # A numeric bound such as ge=0 cannot be applied to a list field.
       layout_dets: list[ObjectInferenceResult] = Field(description="Page recognition results")
       page_info: PageInfo = Field(description="Page metadata")


   # The inference results of all pages, ordered by page number, are stored
   # in a list; this list is the inference result of MinerU.
   inference_result: list[PageInferenceResults] = []

The ``poly`` coordinates have the format
``[x0, y0, x1, y1, x2, y2, x3, y3]``, giving the top-left, top-right,
bottom-right, and bottom-left points in that order.
|Poly Coordinate Diagram|

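
As a small illustration of this coordinate order, the sketch below
collapses the four corner points into an axis-aligned
``(x_min, y_min, x_max, y_max)`` box. The helper name ``poly_to_bbox``
is hypothetical and not part of magic-pdf.

.. code:: python

   def poly_to_bbox(poly: list[float]) -> tuple[float, float, float, float]:
       """Collapse an 8-value poly into (x_min, y_min, x_max, y_max)."""
       if len(poly) != 8:
           raise ValueError(f"expected 8 coordinates, got {len(poly)}")
       xs = poly[0::2]  # x0, x1, x2, x3
       ys = poly[1::2]  # y0, y1, y2, y3
       return min(xs), min(ys), max(xs), max(ys)


   # Corner points listed as top-left, top-right, bottom-right, bottom-left.
   print(poly_to_bbox([0, 0, 10, 0, 10, 5, 0, 5]))

Taking the minimum and maximum rather than reading fixed positions keeps
the helper valid even if a quadrilateral is slightly skewed.
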
example
^^^^^^^

.. code:: json

   [
       {
           "layout_dets": [
               {
                   "category_id": 2,
                   "poly": [
                       99.1906967163086, 100.3119125366211,
                       730.3707885742188, 100.3119125366211,
                       730.3707885742188, 245.81326293945312,
                       99.1906967163086, 245.81326293945312
                   ],
                   "score": 0.9999997615814209
               }
           ],
           "page_info": {
               "page_no": 0,
               "height": 2339,
               "width": 1654
           }
       },
       {
           "layout_dets": [
               {
                   "category_id": 5,
                   "poly": [
                       99.13092803955078, 2210.680419921875,
                       497.3183898925781, 2210.680419921875,
                       497.3183898925781, 2264.78076171875,
                       99.13092803955078, 2264.78076171875
                   ],
                   "score": 0.9999997019767761
               }
           ],
           "page_info": {
               "page_no": 1,
               "height": 2339,
               "width": 1654
           }
       }
   ]

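
Because the file is plain JSON, it can be inspected with the standard
library alone. The following is a minimal sketch, assuming the structure
described above, that counts the detections on each page by category
name. ``CATEGORY_NAMES``, ``count_categories`` and the ``min_score``
parameter are illustrative names, and the inline sample is an
abbreviated stand-in for a real ``model.json`` file.

.. code:: python

   import json
   from collections import Counter

   # category_id -> name, copied from the CategoryType enum above.
   CATEGORY_NAMES = {
       0: "title", 1: "plain_text", 2: "abandon", 3: "figure",
       4: "figure_caption", 5: "table", 6: "table_caption",
       7: "table_footnote", 8: "isolate_formula", 9: "formula_caption",
       13: "embedding", 14: "isolated", 15: "text",
   }


   def count_categories(pages, min_score=0.0):
       """Return {page_no: Counter(name -> count)} for detections with score >= min_score."""
       counts = {}
       for page in pages:
           names = (
               CATEGORY_NAMES.get(det["category_id"], "unknown")
               for det in page["layout_dets"]
               if det["score"] >= min_score
           )
           counts[page["page_info"]["page_no"]] = Counter(names)
       return counts


   # Abbreviated inline sample; a real run would call json.load() on the
   # generated model.json file instead.
   sample = json.loads("""
   [
     {"layout_dets": [{"category_id": 2, "poly": [0, 0, 1, 0, 1, 1, 0, 1], "score": 0.99}],
      "page_info": {"page_no": 0, "height": 2339, "width": 1654}},
     {"layout_dets": [{"category_id": 5, "poly": [0, 0, 1, 0, 1, 1, 0, 1], "score": 0.98}],
      "page_info": {"page_no": 1, "height": 2339, "width": 1654}}
   ]
   """)

   for page_no, counter in count_categories(sample).items():
       print(page_no, dict(counter))
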
some_pdf_middle.json
~~~~~~~~~~~~~~~~~~~~

+--------------------+--------------------------------------------------------------+
| Field Name         | Description                                                  |
+====================+==============================================================+
| pdf_info           | list; each element is a dict with the parsing result of one  |
|                    | PDF page (see the table below for details)                   |
+--------------------+--------------------------------------------------------------+
| ``_parse_type``    | ``ocr`` or ``txt``; the mode used in this intermediate       |
|                    | parsing state                                                |
+--------------------+--------------------------------------------------------------+
| ``_version_name``  | string; the version of magic-pdf used for this parsing       |
+--------------------+--------------------------------------------------------------+

**pdf_info**

Field structure description

+----------------------+--------------------------------------------------------------+
| Field Name           | Description                                                  |
+======================+==============================================================+
| preproc_blocks       | Intermediate result of PDF preprocessing, not yet segmented  |
+----------------------+--------------------------------------------------------------+
| layout_bboxes        | Layout segmentation results: layout direction (vertical or   |
|                      | horizontal) and bbox, sorted by reading order                |
+----------------------+--------------------------------------------------------------+
| page_idx             | Page number, starting from 0                                 |
+----------------------+--------------------------------------------------------------+
| page_size            | Page width and height                                        |
+----------------------+--------------------------------------------------------------+
| ``_layout_tree``     | Layout tree structure                                        |
+----------------------+--------------------------------------------------------------+
| images               | list; each element is a dict representing an img_block       |
+----------------------+--------------------------------------------------------------+
| tables               | list; each element is a dict representing a table_block      |
+----------------------+--------------------------------------------------------------+
| interline_equations  | list; each element is a dict representing an                 |
|                      | interline_equation_block                                     |
+----------------------+--------------------------------------------------------------+
| discarded_blocks     | list; block information returned by the model that must be   |
|                      | dropped                                                      |
+----------------------+--------------------------------------------------------------+
| para_blocks          | Result of segmenting ``preproc_blocks``                      |
+----------------------+--------------------------------------------------------------+

In the table above, ``para_blocks`` is an array of dicts, each
representing a block. Blocks can be nested at most one level deep.

**block**

The outer block is called a first-level block. Its fields are:

+------------+--------------------------------------------------------------+
| Field Name | Description                                                  |
+============+==============================================================+
| type       | Block type (``table`` or ``image``)                          |
+------------+--------------------------------------------------------------+
| bbox       | Block bounding box coordinates                               |
+------------+--------------------------------------------------------------+
| blocks     | list; each element is a dict representing a second-level     |
|            | block                                                        |
+------------+--------------------------------------------------------------+

There are only two types of first-level blocks: ``table`` and
``image``. All other blocks are second-level blocks.

The fields of a second-level block are:

+------------+--------------------------------------------------------------+
| Field Name | Description                                                  |
+============+==============================================================+
| type       | Block type                                                   |
+------------+--------------------------------------------------------------+
| bbox       | Block bounding box coordinates                               |
+------------+--------------------------------------------------------------+
| lines      | list; each element is a dict describing one line of content  |
+------------+--------------------------------------------------------------+

Detailed explanation of second-level block types

================== ======================
type               Description
================== ======================
image_body         Main body of the image
image_caption      Image description text
table_body         Main body of the table
table_caption      Table description text
table_footnote     Table footnote
text               Text block
title              Title block
interline_equation Block formula
================== ======================

**line**

The fields of a line are:

+------------+--------------------------------------------------------------+
| Field Name | Description                                                  |
+============+==============================================================+
| bbox       | Bounding box coordinates of the line                         |
+------------+--------------------------------------------------------------+
| spans      | list; each element is a dict describing one span, the        |
|            | smallest unit of content                                     |
+------------+--------------------------------------------------------------+

**span**

+----------------+--------------------------------------------------------------+
| Field Name     | Description                                                  |
+================+==============================================================+
| bbox           | Bounding box coordinates of the span                         |
+----------------+--------------------------------------------------------------+
| type           | Type of the span                                             |
+----------------+--------------------------------------------------------------+
| ``content`` or | Text spans store their text in ``content``; chart spans      |
| ``img_path``   | store the screenshot path in ``img_path``                    |
+----------------+--------------------------------------------------------------+

The types of spans are as follows:

================== ==============
type               Description
================== ==============
image              Image
table              Table
text               Text
inline_equation    Inline formula
interline_equation Block formula
================== ==============

**Summary**

A span is the smallest storage unit for all elements.

The elements stored in ``para_blocks`` are blocks.

The block structure is as follows:

First-level block (if any) -> Second-level block -> Line -> Span

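
The hierarchy above can be walked with a short generator. This is only a
sketch under the field layout described in this section: ``iter_spans``
and ``page_text`` are hypothetical helper names, the sample page is made
up for illustration (including its ``img_path`` value), and joining the
text spans without a separator is a deliberate simplification.

.. code:: python

   from typing import Iterator


   def iter_spans(para_blocks: list[dict]) -> Iterator[dict]:
       """Yield every span of the given blocks in stored order."""
       for block in para_blocks:
           # First-level blocks (table, image) keep their second-level
           # blocks under "blocks"; any other block is itself second-level.
           second_level = block["blocks"] if "blocks" in block else [block]
           for sub_block in second_level:
               for line in sub_block.get("lines", []):
                   yield from line.get("spans", [])


   def page_text(page: dict) -> str:
       """Naively concatenate the content of all text spans on one page."""
       return "".join(
           span.get("content", "")
           for span in iter_spans(page["para_blocks"])
           if span["type"] == "text"
       )


   # Hypothetical page with one text block and one nested image block.
   sample_page = {
       "para_blocks": [
           {"type": "text", "bbox": [0, 0, 10, 10], "lines": [
               {"bbox": [0, 0, 10, 10], "spans": [
                   {"bbox": [0, 0, 10, 10], "type": "text", "content": "Intro line. "},
               ]},
           ]},
           {"type": "image", "bbox": [0, 20, 10, 40], "blocks": [
               {"type": "image_body", "bbox": [0, 20, 10, 30], "lines": [
                   {"bbox": [0, 20, 10, 30], "spans": [
                       {"bbox": [0, 20, 10, 30], "type": "image", "img_path": "images/fig1.jpg"},
                   ]},
               ]},
               {"type": "image_caption", "bbox": [0, 30, 10, 40], "lines": [
                   {"bbox": [0, 30, 10, 40], "spans": [
                       {"bbox": [0, 30, 10, 40], "type": "text", "content": "Figure 1."},
                   ]},
               ]},
           ]},
       ]
   }

   print([span["type"] for span in iter_spans(sample_page["para_blocks"])])
   print(page_text(sample_page))
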
.. _example-1:

example
^^^^^^^

.. code:: json

   {
       "pdf_info": [
           {
               "preproc_blocks": [
                   {
                       "type": "text",
                       "bbox": [52, 61.956024169921875, 294, 82.99800872802734],
                       "lines": [
                           {
                               "bbox": [52, 61.956024169921875, 294, 72.0000228881836],
                               "spans": [
                                   {
                                       "bbox": [54.0, 61.956024169921875, 296.2261657714844, 72.0000228881836],
                                       "content": "dependent on the service headway and the reliability of the departure ",
                                       "type": "text",
                                       "score": 1.0
                                   }
                               ]
                           }
                       ]
                   }
               ],
               "layout_bboxes": [
                   {
                       "layout_bbox": [52, 61, 294, 731],
                       "layout_label": "V",
                       "sub_layout": []
                   }
               ],
               "page_idx": 0,
               "page_size": [612.0, 792.0],
               "_layout_tree": [],
               "images": [],
               "tables": [],
               "interline_equations": [],
               "discarded_blocks": [],
               "para_blocks": [
                   {
                       "type": "text",
                       "bbox": [52, 61.956024169921875, 294, 82.99800872802734],
                       "lines": [
                           {
                               "bbox": [52, 61.956024169921875, 294, 72.0000228881836],
                               "spans": [
                                   {
                                       "bbox": [54.0, 61.956024169921875, 296.2261657714844, 72.0000228881836],
                                       "content": "dependent on the service headway and the reliability of the departure ",
                                       "type": "text",
                                       "score": 1.0
                                   }
                               ]
                           }
                       ]
                   }
               ]
           }
       ],
       "_parse_type": "txt",
       "_version_name": "0.6.1"
   }

.. |Poly Coordinate Diagram| image:: ../../_static/image/poly.png