
fix: table format

xu rui 11 months ago
parent
commit
8152931756

+ 95 - 99
next_docs/en/user_guide/tutorial/output_file_description.rst

@@ -141,60 +141,60 @@ example
 some_pdf_middle.json
 ~~~~~~~~~~~~~~~~~~~~
 
-+-------+--------------------------------------------------------------+
-| Field | Description                                                  |
-| Name  |                                                              |
-+=======+==============================================================+
-| pdf   | list, each element is a dict representing the parsing result |
-| _info | of each PDF page, see the table below for details            |
-+-------+--------------------------------------------------------------+
-| \_    | ocr \| txt, used to indicate the mode used in this           |
-| parse | intermediate parsing state                                   |
-| _type |                                                              |
-+-------+--------------------------------------------------------------+
-| \_ve  | string, indicates the version of magic-pdf used in this      |
-| rsion | parsing                                                      |
-| _name |                                                              |
-+-------+--------------------------------------------------------------+
++----------------+--------------------------------------------------------------+
+| Field Name     | Description                                                  |
+|                |                                                              |
++================+==============================================================+
+| pdf_info       | list, each element is a dict representing the parsing result |
+|                | of each PDF page, see the table below for details            |
++----------------+--------------------------------------------------------------+
+| \_             | ocr \| txt, used to indicate the mode used in this           |
+| parse_type     | intermediate parsing state                                   |
+|                |                                                              |
++----------------+--------------------------------------------------------------+
+| \_version_name | string, indicates the version of magic-pdf used in this      |
+|                | parsing                                                      |
+|                |                                                              |
++----------------+--------------------------------------------------------------+
 
 **pdf_info**
 
 Field structure description
 
-+---------+------------------------------------------------------------+
-| Field   | Description                                                |
-| Name    |                                                            |
-+=========+============================================================+
-| preproc | Intermediate result after PDF preprocessing, not yet       |
-| _blocks | segmented                                                  |
-+---------+------------------------------------------------------------+
-| layout  | Layout segmentation results, containing layout direction   |
-| _bboxes | (vertical, horizontal), and bbox, sorted by reading order  |
-+---------+------------------------------------------------------------+
-| p       | Page number, starting from 0                               |
-| age_idx |                                                            |
-+---------+------------------------------------------------------------+
-| pa      | Page width and height                                      |
-| ge_size |                                                            |
-+---------+------------------------------------------------------------+
-| \_layo  | Layout tree structure                                      |
-| ut_tree |                                                            |
-+---------+------------------------------------------------------------+
-| images  | list, each element is a dict representing an img_block     |
-+---------+------------------------------------------------------------+
-| tables  | list, each element is a dict representing a table_block    |
-+---------+------------------------------------------------------------+
-| inter   | list, each element is a dict representing an               |
-| line_eq | interline_equation_block                                   |
-| uations |                                                            |
-+---------+------------------------------------------------------------+
-| di      | List, block information returned by the model that needs   |
-| scarded | to be dropped                                              |
-| _blocks |                                                            |
-+---------+------------------------------------------------------------+
-| para    | Result after segmenting preproc_blocks                     |
-| _blocks |                                                            |
-+---------+------------------------------------------------------------+
++-------------------------+------------------------------------------------------------+
+| Field                   | Description                                                |
+| Name                    |                                                            |
++=========================+============================================================+
+| preproc_blocks          | Intermediate result after PDF preprocessing, not yet       |
+|                         | segmented                                                  |
++-------------------------+------------------------------------------------------------+
+| layout_bboxes           | Layout segmentation results, containing layout direction   |
+|                         | (vertical, horizontal), and bbox, sorted by reading order  |
++-------------------------+------------------------------------------------------------+
+| page_idx                | Page number, starting from 0                               |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| page_size               | Page width and height                                      |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| \_layout_tree           | Layout tree structure                                      |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| images                  | list, each element is a dict representing an img_block     |
++-------------------------+------------------------------------------------------------+
+| tables                  | list, each element is a dict representing a table_block    |
++-------------------------+------------------------------------------------------------+
+| interline_equation      | list, each element is a dict representing an               |
+|                         | interline_equation_block                                   |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| discarded_blocks        | List, block information returned by the model that needs   |
+|                         | to be dropped                                              |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
+| para_blocks             | Result after segmenting preproc_blocks                     |
+|                         |                                                            |
++-------------------------+------------------------------------------------------------+
 
 In the above table, ``para_blocks`` is an array of dicts, each dict
 representing a block structure. A block can support up to one level of
@@ -205,38 +205,36 @@ nesting.
 The outer block is referred to as a first-level block, and the fields in
 the first-level block include:
 
-+---------+-------------------------------------------------------------+
-| Field   | Description                                                 |
-| Name    |                                                             |
-+=========+=============================================================+
-| type    | Block type (table|image)                                    |
-+---------+-------------------------------------------------------------+
-| bbox    | Block bounding box coordinates                              |
-+---------+-------------------------------------------------------------+
-| blocks  | list, each element is a dict representing a second-level    |
-|         | block                                                       |
-+---------+-------------------------------------------------------------+
++------------------------+-------------------------------------------------------------+
+| Field                  | Description                                                 |
+| Name                   |                                                             |
++========================+=============================================================+
+| type                   | Block type (table|image)                                    |
++------------------------+-------------------------------------------------------------+
+| bbox                   | Block bounding box coordinates                              |
++------------------------+-------------------------------------------------------------+
+| blocks                 | list, each element is a dict representing a second-level    |
+|                        | block                                                       |
++------------------------+-------------------------------------------------------------+
 
 There are only two types of first-level blocks: “table” and “image”. All
 other blocks are second-level blocks.
 
 The fields in a second-level block include:
 
-+-----+----------------------------------------------------------------+
-| Fi  | Description                                                    |
-| eld |                                                                |
-| N   |                                                                |
-| ame |                                                                |
-+=====+================================================================+
-| t   | Block type                                                     |
-| ype |                                                                |
-+-----+----------------------------------------------------------------+
-| b   | Block bounding box coordinates                                 |
-| box |                                                                |
-+-----+----------------------------------------------------------------+
-| li  | list, each element is a dict representing a line, used to      |
-| nes | describe the composition of a line of information              |
-+-----+----------------------------------------------------------------+
++----------------------+----------------------------------------------------------------+
+| Field                | Description                                                    |
+| Name                 |                                                                |
++======================+================================================================+
+|                      | Block type                                                     |
+| type                 |                                                                |
++----------------------+----------------------------------------------------------------+
+|                      | Block bounding box coordinates                                 |
+| bbox                 |                                                                |
++----------------------+----------------------------------------------------------------+
+|                      | list, each element is a dict representing a line, used to      |
+| lines                | describe the composition of a line of information              |
++----------------------+----------------------------------------------------------------+
 
 Detailed explanation of second-level block types
 
@@ -257,33 +255,31 @@ interline_equation Block formula
 
 The field format of a line is as follows:
 
-+-----+----------------------------------------------------------------+
-| Fi  | Description                                                    |
-| eld |                                                                |
-| N   |                                                                |
-| ame |                                                                |
-+=====+================================================================+
-| b   | Bounding box coordinates of the line                           |
-| box |                                                                |
-+-----+----------------------------------------------------------------+
-| sp  | list, each element is a dict representing a span, used to      |
-| ans | describe the composition of the smallest unit                  |
-+-----+----------------------------------------------------------------+
++---------------------+----------------------------------------------------------------+
+| Field               | Description                                                    |
+| Name                |                                                                |
++=====================+================================================================+
+|                     | Bounding box coordinates of the line                           |
+| bbox                |                                                                |
++---------------------+----------------------------------------------------------------+
+| spans               | list, each element is a dict representing a span, used to      |
+|                     | describe the composition of the smallest unit                  |
++---------------------+----------------------------------------------------------------+
 
 **span**
 
-+----------+-----------------------------------------------------------+
-| Field    | Description                                               |
-| Name     |                                                           |
-+==========+===========================================================+
-| bbox     | Bounding box coordinates of the span                      |
-+----------+-----------------------------------------------------------+
-| type     | Type of the span                                          |
-+----------+-----------------------------------------------------------+
-| content  | Text spans use content, chart spans use img_path to store |
-| \|       | the actual text or screenshot path information            |
-| img_path |                                                           |
-+----------+-----------------------------------------------------------+
++---------------------+-----------------------------------------------------------+
+| Field               | Description                                               |
+| Name                |                                                           |
++=====================+===========================================================+
+| bbox                | Bounding box coordinates of the span                      |
++---------------------+-----------------------------------------------------------+
+| type                | Type of the span                                          |
++---------------------+-----------------------------------------------------------+
+| content             | Text spans use content, chart spans use img_path to store |
+| \|                  | the actual text or screenshot path information            |
+| img_path            |                                                           |
++---------------------+-----------------------------------------------------------+
 
 The types of spans are as follows:
 

+ 22 - 23
next_docs/zh_cn/user_guide/tutorial/output_file_description.rst

@@ -143,11 +143,11 @@ some_pdf_middle.json
 | pdf_info  | list,每个                                               |
 |           | 元素都是一个dict,这个dict是每一页pdf的解析结果,详见下表 |
 +-----------+----------------------------------------------------------+
-| \_p       | ocr \| txt,用来标识本次解析的中间态使用的模式           |
-| arse_type |                                                          |
+|              | ocr \| txt,用来标识本次解析的中间态使用的模式           |
+| \_parse_type |                                                          |
 +-----------+----------------------------------------------------------+
-| \_ver     | string, 表示本次解析使用的 magic-pdf 的版本号            |
-| sion_name |                                                          |
+|                | string, 表示本次解析使用的 magic-pdf 的版本号            |
+| \_version_name |                                                          |
 +-----------+----------------------------------------------------------+
 
 **pdf_info** 字段结构说明
@@ -155,11 +155,11 @@ some_pdf_middle.json
 +--------------+-------------------------------------------------------+
 | 字段名       | 解释                                                  |
 +==============+=======================================================+
-| pr           | pdf预处理后,未分段的中间结果                         |
-| eproc_blocks |                                                       |
+|                 | pdf预处理后,未分段的中间结果                         |
+| preeproc_blocks |                                                       |
 +--------------+-------------------------------------------------------+
-| l            | 布局分割的结果,                                      |
-| ayout_bboxes | 含有布局的方向(垂直、水平),和bbox,按阅读顺序排序  |
+|               | 布局分割的结果,                                      |
+| layout_bboxes | 含有布局的方向(垂直、水平),和bbox,按阅读顺序排序  |
 +--------------+-------------------------------------------------------+
 | page_idx     | 页码,从0开始                                         |
 +--------------+-------------------------------------------------------+
@@ -172,11 +172,11 @@ some_pdf_middle.json
 +--------------+-------------------------------------------------------+
 | tables       | list,每个元素是一个dict,每个dict表示一个table_block |
 +--------------+-------------------------------------------------------+
-| interli      | list,每个元素                                        |
-| ne_equations | 是一个dict,每个dict表示一个interline_equation_block  |
+|                     | list,每个元素                                        |
+| interline_equations | 是一个dict,每个dict表示一个interline_equation_block  |
 +--------------+-------------------------------------------------------+
-| disc         | List, 模型返回的需要drop的block信息                   |
-| arded_blocks |                                                       |
+|                  | List, 模型返回的需要drop的block信息                   |
+| discarded_blocks |                                                       |
 +--------------+-------------------------------------------------------+
 | para_blocks  | 将preproc_blocks进行分段之后的结果                    |
 +--------------+-------------------------------------------------------+
@@ -205,14 +205,14 @@ blocks list,里面的每个元素都是一个dict格式的二级block
 | 段  |                                                                |
 | 名  |                                                                |
 +=====+================================================================+
-| t   | block类型                                                      |
-| ype |                                                                |
+|      | block类型                                                      |
+| type |                                                                |
 +-----+----------------------------------------------------------------+
-| b   | block矩形框坐标                                                |
-| box |                                                                |
+|      | block矩形框坐标                                                |
+| bbox |                                                                |
 +-----+----------------------------------------------------------------+
-| li  | list,每个元素都是一个dict表示的line,用来描述一行信息的构成   |
-| nes |                                                                |
+|       | list,每个元素都是一个dict表示的line,用来描述一行信息的构成   |
+| lines |                                                                |
 +-----+----------------------------------------------------------------+
 
 二级block的类型详解
@@ -242,12 +242,11 @@ line 的 字段格式如下
 | 段 |                                                                 |
 | 名 |                                                                 |
 +====+=================================================================+
-| bb | line的矩形框坐标                                                |
-| ox |                                                                 |
+| bbox  | line的矩形框坐标                                                |
+|       |                                                                 |
 +----+-----------------------------------------------------------------+
-| s  | list,                                                          |
-| pa | 每个元素都是一个dict表示的span,用来描述一个最小组成单元的构成  |
-| ns |                                                                 |
+| spans  | list,                                                       |
+|        | 每个元素都是一个dict表示的span,用来描述一个最小组成单元的构成  |
 +----+-----------------------------------------------------------------+
 
 **span**