
General Layout Parsing Pipeline Tutorial

1. Introduction to the General Layout Parsing Pipeline

Layout parsing is a technology that extracts structured information from document images, primarily used to convert complex document layouts into machine-readable data formats. This technology has extensive applications in document management, information extraction, and data digitization. By combining Optical Character Recognition (OCR), image processing, and machine learning algorithms, layout parsing can identify and extract text blocks, titles, paragraphs, images, tables, and other layout elements from documents. The process typically involves three main steps: layout analysis, element analysis, and data formatting, ultimately generating structured document data to improve data processing efficiency and accuracy.

The General Layout Parsing Pipeline includes modules for table structure recognition, layout region analysis, text detection, text recognition, formula recognition, seal text detection, text image rectification, and document image orientation classification.

If you prioritize model accuracy, choose a model with higher accuracy. If you prioritize inference speed, choose a model with faster inference. If you prioritize model storage size, choose a model with a smaller storage size.
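As a rough illustration of this trade-off, the sketch below picks one of the three PP-DocLayout models by a single priority. The figures are copied from the Layout Detection table later in this tutorial (mAP(0.5), the first inference-time column in normal mode, and storage size); the `pick_model` helper and the `LAYOUT_MODELS` dictionary are hypothetical and not part of PaddleX.

```python
# Figures copied from the Layout Detection table in this tutorial:
# mAP(0.5) in %, first inference-time column (normal mode) in ms, storage size in M.
LAYOUT_MODELS = {
    "PP-DocLayout-L": {"map": 90.4, "latency_ms": 34.6244, "size_m": 123.76},
    "PP-DocLayout-M": {"map": 75.2, "latency_ms": 13.3259, "size_m": 22.578},
    "PP-DocLayout-S": {"map": 70.9, "latency_ms": 8.3008, "size_m": 4.834},
}


def pick_model(priority: str) -> str:
    """Return the model name that best matches one priority (hypothetical helper)."""
    if priority == "accuracy":
        return max(LAYOUT_MODELS, key=lambda name: LAYOUT_MODELS[name]["map"])
    if priority == "speed":
        return min(LAYOUT_MODELS, key=lambda name: LAYOUT_MODELS[name]["latency_ms"])
    if priority == "size":
        return min(LAYOUT_MODELS, key=lambda name: LAYOUT_MODELS[name]["size_m"])
    raise ValueError(f"unknown priority: {priority!r}")


for priority in ("accuracy", "speed", "size"):
    print(priority, "->", pick_model(priority))
```

In practice you will usually weigh more than one criterion at once, so treat this only as a starting point for reading the tables.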

👉Model List Details

Table Structure Recognition Module Models:

Model	Model Download Link	Accuracy (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Size (M)	Description
SLANet	Inference Model/Trained Model	59.52	103.08 / 103.08	197.99 / 197.99	6.9 M	SLANet is a table structure recognition model developed by Baidu PaddleX Team. The model significantly improves the accuracy and inference speed of table structure recognition by adopting a CPU-friendly lightweight backbone network PP-LCNet, a high-low-level feature fusion module CSP-PAN, and a feature decoding module SLA Head that aligns structural and positional information.
SLANet_plus	Inference Model/Trained Model	63.69	140.29 / 140.29	195.39 / 195.39	6.9 M	SLANet_plus is an enhanced version of SLANet, the table structure recognition model developed by Baidu PaddleX Team. Compared to SLANet, SLANet_plus significantly improves the recognition ability for wireless and complex tables and reduces the model's sensitivity to the accuracy of table positioning, enabling more accurate recognition even with offset table positioning.

Layout Detection Module Models:

Model	Model Download Link	mAP(0.5) (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Storage Size (M)	Introduction
PP-DocLayout-L	Inference Model/Training Model	90.4	34.6244 / 10.3945	510.57 / -	123.76 M	A high-precision layout area localization model trained on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports using RT-DETR-L.
PP-DocLayout-M	Inference Model/Training Model	75.2	13.3259 / 4.8685	44.0680 / 44.0680	22.578	A layout area localization model with balanced precision and efficiency, trained on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports using PicoDet-L.
PP-DocLayout-S	Inference Model/Training Model	70.9	8.3008 / 2.3794	10.0623 / 9.9296	4.834	A high-efficiency layout area localization model trained on a self-built dataset containing Chinese and English papers, magazines, contracts, books, exams, and research reports using PicoDet-S.

Note: The evaluation dataset for the above precision metrics is a self-built layout area detection dataset by PaddleOCR, containing 500 common document-type images of Chinese and English papers, magazines, contracts, books, exams, and research reports. GPU inference time is based on an NVIDIA Tesla T4 machine with FP32 precision. CPU inference speed is based on an Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision.

❗ The above list includes the 3 core models that the layout detection module primarily supports. The module supports 11 models in total, including several predefined models with different category sets. The complete model list is as follows:

👉 Details of Model List

* Table Layout Detection Model

Model	Model Download Link	mAP(0.5) (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Storage Size (M)	Introduction
PicoDet_layout_1x_table	Inference Model/Training Model	97.5	8.02 / 3.09	23.70 / 20.41	7.4 M	A high-efficiency layout area localization model trained on a self-built dataset using PicoDet-1x, capable of detecting table regions.

Note: The evaluation dataset for the above precision metrics is a self-built layout table area detection dataset by PaddleOCR, containing 7835 Chinese and English document images with tables. GPU inference time is based on an NVIDIA Tesla T4 machine with FP32 precision. CPU inference speed is based on an Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision.

* 3-Class Layout Detection Model, including Table, Image, and Stamp

Model	Model Download Link	mAP(0.5) (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Storage Size (M)	Introduction
PicoDet-S_layout_3cls	Inference Model/Training Model	88.2	8.99 / 2.22	16.11 / 8.73	4.8	A high-efficiency layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using PicoDet-S.
PicoDet-L_layout_3cls	Inference Model/Training Model	89.0	13.05 / 4.50	41.30 / 41.30	22.6	A balanced efficiency and precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using PicoDet-L.
RT-DETR-H_layout_3cls	Inference Model/Training Model	95.8	114.93 / 27.71	947.56 / 947.56	470.1	A high-precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using RT-DETR-H.

Note: The evaluation dataset for the above precision metrics is a self-built layout area detection dataset by PaddleOCR, containing 1154 common document images of Chinese and English papers, magazines, and research reports. GPU inference time is based on an NVIDIA Tesla T4 machine with FP32 precision. CPU inference speed is based on an Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision.

* 5-Class English Document Area Detection Model, including Text, Title, Table, Image, and List

Model	Model Download Link	mAP(0.5) (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Storage Size (M)	Introduction
PicoDet_layout_1x	Inference Model/Training Model	97.8	9.03 / 3.10	25.82 / 20.70	7.4	A high-efficiency English document layout area localization model trained on the PubLayNet dataset using PicoDet-1x.

Note: The evaluation dataset for the above precision metrics is the [PubLayNet](https://developer.ibm.com/exchanges/data/all/publaynet/) dataset, containing 11245 English document images. GPU inference time is based on an NVIDIA Tesla T4 machine with FP32 precision. CPU inference speed is based on an Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision.

* 17-Class Area Detection Model, including 17 common layout categories: Paragraph Title, Image, Text, Number, Abstract, Content, Figure Caption, Formula, Table, Table Caption, References, Document Title, Footnote, Header, Algorithm, Footer, and Stamp

Model	Model Download Link	mAP(0.5) (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Storage Size (M)	Introduction
PicoDet-S_layout_17cls	Inference Model/Training Model	87.4	9.11 / 2.12	15.42 / 9.12	4.8	A high-efficiency layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using PicoDet-S.
PicoDet-L_layout_17cls	Inference Model/Training Model	89.0	13.50 / 4.69	43.32 / 43.32	22.6	A balanced efficiency and precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using PicoDet-L.
RT-DETR-H_layout_17cls	Inference Model/Training Model	98.3	115.29 / 104.09	995.27 / 995.27	470.2	A high-precision layout area localization model trained on a self-built dataset of Chinese and English papers, magazines, and research reports using RT-DETR-H.

Text Detection Module Models:

Model	Model Download Link	Detection Hmean (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Size (M)	Description
PP-OCRv4_server_det	Inference Model/Trained Model	82.69	83.34 / 80.91	442.58 / 442.58	109	PP-OCRv4's server-side text detection model, featuring higher accuracy, suitable for deployment on high-performance servers
PP-OCRv4_mobile_det	Inference Model/Trained Model	77.79	8.79 / 3.13	51.00 / 28.58	4.7	PP-OCRv4's mobile text detection model, optimized for efficiency, suitable for deployment on edge devices

Text Recognition Module Models:

Model	Model Download Link	Recognition Avg Accuracy (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Size (M)	Description
PP-OCRv4_mobile_rec	Inference Model/Trained Model	78.20	4.82 / 4.82	16.74 / 4.64	10.6 M	PP-OCRv4 is the next version of Baidu PaddlePaddle's self-developed text recognition model PP-OCRv3. By introducing data augmentation schemes and GTC-NRTR guidance branches, it further improves text recognition accuracy without compromising inference speed. The model offers both server (server) and mobile (mobile) versions to meet industrial needs in different scenarios.
PP-OCRv4_server_rec	Inference Model/Trained Model	79.20	6.58 / 6.58	33.17 / 33.17	71.2 M

Model	Model Download Link	Recognition Avg Accuracy (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Size (M)	Description
ch_SVTRv2_rec	Inference Model/Trained Model	68.81	8.08 / 8.08	50.17 / 42.50	73.9 M	SVTRv2 is a server-side text recognition model developed by the OpenOCR team at the Vision and Learning Lab (FVL) of Fudan University. It won the first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, with a 6% improvement in end-to-end recognition accuracy compared to PP-OCRv4 on the A-list.

Model	Model Download Link	Recognition Avg Accuracy (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Size (M)	Description
ch_RepSVTR_rec	Inference Model/Trained Model	65.07	5.93 / 5.93	20.73 / 7.32	22.1 M	The RepSVTR text recognition model is a mobile-oriented text recognition model based on SVTRv2. It won the first prize in the OCR End-to-End Recognition Task of the PaddleOCR Algorithm Model Challenge, with a 2.5% improvement in end-to-end recognition accuracy compared to PP-OCRv4 on the B-list, while maintaining similar inference speed.

Formula Recognition Module Models:

Model Name	Model Download Link	BLEU Score	Normed Edit Distance	ExpRate (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Size
LaTeX_OCR_rec	Inference Model/Trained Model	0.8821	0.0823	40.01	2047.13 / 2047.13	10582.73 / 10582.73	89.7 M

Seal Text Detection Module Models:

Model	Model Download Link	Detection Hmean (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms) [Normal Mode / High-Performance Mode]	Model Size (M)	Description
PP-OCRv4_server_seal_det	Inference Model/Trained Model	98.21	74.75 / 67.72	382.55 / 382.55	109	PP-OCRv4's server-side seal text detection model, featuring higher accuracy, suitable for deployment on better-equipped servers
PP-OCRv4_mobile_seal_det	Inference Model/Trained Model	96.47	7.82 / 3.09	48.28 / 23.97	4.6	PP-OCRv4's mobile seal text detection model, offering higher efficiency, suitable for deployment on edge devices

**Test Environment Description**:

- **Performance Test Environment**
  - **Test Dataset**:
    - Text Image Rectification Model: [DocUNet](https://www3.cs.stonybrook.edu/~cvl/docunet.html).
    - Layout Region Detection Model: A self-built layout analysis dataset using PaddleOCR, containing 10,000 images of common document types such as Chinese and English papers, magazines, and research reports.
    - Table Structure Recognition Model: A self-built English table recognition dataset using PaddleX.
    - Text Detection Model: A self-built Chinese dataset using PaddleOCR, covering multiple scenarios such as street scenes, web images, documents, and handwriting, with 500 images for detection.
    - Chinese Recognition Model: A self-built Chinese dataset using PaddleOCR, covering multiple scenarios such as street scenes, web images, documents, and handwriting, with 11,000 images for text recognition.
    - ch_SVTRv2_rec: Evaluation set A for "OCR End-to-End Recognition Task" in the [PaddleOCR Algorithm Model Challenge](https://aistudio.baidu.com/competition/detail/1131/0/introduction).
    - ch_RepSVTR_rec: Evaluation set B for "OCR End-to-End Recognition Task" in the [PaddleOCR Algorithm Model Challenge](https://aistudio.baidu.com/competition/detail/1131/0/introduction).
    - English Recognition Model: A self-built English dataset using PaddleX.
    - Multilingual Recognition Model: A self-built multilingual dataset using PaddleX.
    - Text Line Orientation Classification Model: A self-built dataset using PaddleX, covering various scenarios such as ID cards and documents, containing 1000 images.
    - Seal Text Detection Model: A self-built dataset using PaddleX, containing 500 images of circular seal textures.
  - **Hardware Configuration**:
    - GPU: NVIDIA Tesla T4
    - CPU: Intel Xeon Gold 6271C @ 2.60GHz
    - Other Environments: Ubuntu 20.04 / cuDNN 8.6 / TensorRT 8.5.2.2
- **Inference Mode Description**

| Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination |
|-------------|----------------------------------------|-------------------|---------------------------------------------------|
| Regular Mode | FP32 Precision / No TRT Acceleration | FP32 Precision / 8 Threads | PaddleInference |
| High-Performance Mode | Optimal combination of pre-selected precision types and acceleration strategies | FP32 Precision / 8 Threads | Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.) |
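To compare the two inference modes, you can turn a "normal / high-performance" table cell into a speedup ratio. The sketch below is a small reading aid, not a PaddleX utility: the `speedup` helper is hypothetical, and the sample cells are copied from the tables above. A cell containing `-` has no measurement, so no ratio is returned for it.

```python
from typing import Optional


def speedup(cell: str) -> Optional[float]:
    """Parse a 'normal / high-performance' cell and return normal divided by high-performance.

    Returns None when either side is '-', meaning no measurement is listed.
    """
    normal_text, high_text = (part.strip() for part in cell.split("/"))
    if normal_text == "-" or high_text == "-":
        return None
    return float(normal_text) / float(high_text)


# Cells copied from the PP-DocLayout-L and SLANet rows above.
for label, cell in [
    ("PP-DocLayout-L, first time column", "34.6244 / 10.3945"),
    ("PP-DocLayout-L, second time column", "510.57 / -"),
    ("SLANet, first time column", "103.08 / 103.08"),
]:
    ratio = speedup(cell)
    print(label, "->", "n/a" if ratio is None else f"{ratio:.2f}x")
```

A ratio of 1.0 means the high-performance mode brought no measured gain for that model on that device.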

2. Quick Start

PaddleX provides pre-trained model pipelines that you can try out quickly. You can experience the General Layout Parsing pipeline online, or run it locally from the command line or Python.

Before using the General Layout Parsing pipeline locally, please ensure you have completed the installation of the PaddleX wheel package according to the PaddleX Local Installation Guide.

2.1 Experience via Command Line

One command is all you need to quickly experience the effect of the Layout Parsing pipeline. Use the test file and replace --input with your local path to make predictions.

paddlex --pipeline layout_parsing --input demo_paper.png --device gpu:0

Parameter Explanation:

--pipeline: The name of the pipeline, here it is the Layout Parsing pipeline.
--input: The local path or URL of the input image to be processed.
--device: The GPU index to use (e.g., gpu:0 indicates using the first GPU, gpu:1,2 indicates using the second and third GPUs). You can also choose to use CPU (--device cpu).
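The `--device` format described above can be illustrated with a small parser. This is only a sketch of how the string breaks down into a device type and a list of indices; `parse_device` is a hypothetical helper, it covers only the `cpu` and `gpu` values mentioned here, and it does not reflect how PaddleX parses the option internally.

```python
from typing import List, Tuple


def parse_device(device: str) -> Tuple[str, List[int]]:
    """Split a device string such as 'gpu:1,2' into ('gpu', [1, 2])."""
    kind, _, ids = device.partition(":")
    kind = kind.strip().lower()
    if kind not in ("cpu", "gpu"):
        raise ValueError(f"unsupported device type in this sketch: {kind!r}")
    # Indices are zero-based, so 'gpu:1,2' names the second and third GPUs.
    indices = [int(item) for item in ids.split(",") if item.strip()]
    return kind, indices


for text in ("gpu:0", "gpu:1,2", "cpu"):
    print(text, "->", parse_device(text))
```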

When executing the above command, the default Layout Parsing pipeline configuration file is loaded. If you need to customize the configuration file, you can execute the following command to obtain it:

paddlex --get_pipeline_config layout_parsing

After execution, the layout parsing pipeline configuration file will be saved in the current directory. If you wish to customize the save location, you can execute the following command (assuming the custom save location is ./my_path):

paddlex --get_pipeline_config layout_parsing --save_path ./my_path

After obtaining the pipeline configuration file, you can replace --pipeline with the saved path of the configuration file to make it take effect. For example, if the configuration file is saved as ./layout_parsing.yaml, simply execute:

paddlex --pipeline ./layout_parsing.yaml --input layout_parsing.jpg

Here, parameters such as --model and --device do not need to be specified, as they will use the parameters in the configuration file. If these parameters are still specified, the specified parameters will take precedence.
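The precedence rule just described (explicit command-line values win over values in the configuration file) can be pictured as a simple merge. The sketch below only illustrates that rule; `resolve_options` is a hypothetical helper and the `device` key is an assumed example, not a statement about the actual contents of the configuration file or PaddleX's implementation.

```python
from typing import Any, Dict


def resolve_options(config_values: Dict[str, Any], cli_values: Dict[str, Any]) -> Dict[str, Any]:
    """Start from the config file values and let explicitly given CLI values override them."""
    merged = dict(config_values)
    for name, value in cli_values.items():
        if value is not None:  # None stands for "not specified on the command line"
            merged[name] = value
    return merged


# Stands in for values read from ./layout_parsing.yaml (assumed key name).
config_values = {"device": "gpu:0"}

print(resolve_options(config_values, {"device": None}))   # nothing specified on the CLI
print(resolve_options(config_values, {"device": "cpu"}))  # CLI value takes precedence
```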

After running, the result will be:

{'input_path': PosixPath('/root/.paddlex/temp/tmp5jmloefs.png'), 'parsing_result': [{'input_path': PosixPath('/root/.paddlex/temp/tmpshsq8_w0.png'), 'layout_bbox': [51.46833, 74.22329, 542.4082, 232.77504], 'image': {'img': array([[[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [213, 221, 238],
        [217, 223, 240],
        [233, 234, 241]],

       [[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]]], dtype=uint8), 'image_text': ''}, 'layout': 'single'}, {'input_path': PosixPath('/root/.paddlex/temp/tmpcd2q9uyu.png'), 'layout_bbox': [47.68295, 243.08054, 546.28253, 295.71045], 'figure_title': 'Overview of RT-DETR, We feed th', 'layout': 'single'}, {'input_path': PosixPath('/root/.paddlex/temp/tmpr_iqa8b3.png'), 'layout_bbox': [58.416977, 304.1531, 275.9134, 400.07513], 'image': {'img': array([[[255, 255, 255],
        [255, 255, 255],
        [255, 255, 255],
        ...,
        [255, 255, 255],
        [255, 255, 255],
        [255, 255, 255]]], dtype=uint8), 'image_text': ''}, 'layout': 'left'}, {'input_path': PosixPath('/root/.paddlex/temp/tmphpblxl3p.png'), 'layout_bbox': [100.62961, 405.97458, 234.79774, 414.77414], 'figure_title': 'Figure 5. The fusion block in CCFF.', 'layout': 'left'}, {'input_path': PosixPath('/root/.paddlex/temp/tmplgnczrsf.png'), 'layout_bbox': [47.81724, 421.9041, 288.01566, 550.538], 'text': 'D, Ds, not only significantly reduces latency (35% faster),\nRut\nnproves accuracy (0.4% AP higher), CCFF is opti\nased on the cross-scale fusion module, which\nnsisting of convolutional lavers intc\npath.\nThe role of the fusion block is t\n into a new feature, and its\nFigure 5. The f\nblock contains tw\n1 x1\nchannels, /V RepBlock\n. anc\n: two-path outputs are fused by element-wise add. We\ntormulate the calculation ot the hvbrid encoder as:', 'layout': 'left'}, {'input_path': PosixPath('/root/.paddlex/temp/tmpsq0ey9md.png'), 'layout_bbox': [94.60716, 558.703, 288.04193, 600.19434], 'formula': '\\begin{array}{l}{{\\Theta=K=\\mathrm{p.s.sp{\\pm}}\\mathrm{i.s.s.}(\\mathrm{l.s.}(\\mathrm{l.s.}(\\mathrm{l.s.}}),{\\qquad\\mathrm{{a.s.}}\\mathrm{s.}}}\\\\ {{\\tau_{\\mathrm{{s.s.s.s.s.}}(\\mathrm{l.s.},\\mathrm{l.s.},\\mathrm{s.s.}}\\mathrm{s.}\\mathrm{s.}}\\end{array}),}}\\\\ {{\\bar{\\mathrm{e-c.c.s.s.}(\\mathrm{s.},\\mathrm{s.s.},\\ s_{s}}\\mathrm{s.s.},\\tau),}}\\end{array}', 'layout': 'left'}, {'input_path': PosixPath('/root/.paddlex/temp/tmpv30qy0v4.png'), 'layout_bbox': [47.975555, 607.12024, 288.5776, 629.1252], 'text': 'tened feature to the same shape as Ss.\nwhere Re shape represents restoring the shape of the flat-', 'layout': 'left'}, {'input_path': PosixPath('/root/.paddlex/temp/tmp0jejzwwv.png'), 'layout_bbox': [48.383354, 637.581, 245.96404, 648.20496], 'paragraph_title': '4.3. 
Uncertainty-minimal Query Selection', 'layout': 'left'}, {'input_path': PosixPath('/root/.paddlex/temp/tmpushex416.png'), 'layout_bbox': [47.80134, 656.002, 288.50192, 713.24994], 'text': 'To reduce the difficulty of optimizing object queries in\nDETR, several subsequent works [42, 44, 45] propose query\nselection schemes, which have in common that they use the\nconfidence score to select the top K’ features from the en-\ncoder to initialize object queries (or just position queries).', 'layout': 'left'}, {'input_path': PosixPath('/root/.paddlex/temp/tmpki7e_6wc.png'), 'layout_bbox': [306.6371, 302.1026, 546.3772, 419.76724], 'text': 'The confidence score represents the likelihood that the fea\nture includes foreground objects. Nevertheless, the \nare required to simultaneously model the category\nojects, both of which determine the quality of the\npertor\ncore of the fes\nBased on the analysis, the current query\n considerable level of uncertainty in the\nresulting in sub-optimal initialization for\nand hindering the performance of the detector.', 'layout': 'right'}, {'input_path': PosixPath('/root/.paddlex/temp/tmppbxrfehp.png'), 'layout_bbox': [306.0642, 422.7347, 546.9216, 539.45734], 'text': 'To address this problem, we propose the uncertainty mini\nmal query selection scheme, which explicitly const\noptim\n the epistemic uncertainty to model the\nfeatures, thereby providing \nhigh-quality\nr the decoder. 
Specifically,\n the discrepancy between i\nalization P\nand classificat\n.(2\ntunction for the gradie', 'layout': 'right'}, {'input_path': PosixPath('/root/.paddlex/temp/tmp1mgiyd21.png'), 'layout_bbox': [331.52808, 549.32635, 546.5229, 586.15546], 'formula': '\\begin{array}{c c c}{{}}&{{}}&{{\\begin{array}{c}{{i\\langle X\\rangle=({\\bar{Y}}({\\bar{X}})+{\\bar{Z}}({\\bar{X}})\\mid X\\in{\\bar{\\pi}}^{\\prime}}}&{{}}\\\\ {{}}&{{}}&{{}}\\end{array}}}&{{\\emptyset}}\\\\ {{}}&{{}}&{{C(\\bar{X},{\\bar{X}})=C..\\scriptstyle(\\bar{0},{\\bar{Y}})+{\\mathcal{L}}_{{\\mathrm{s}}}({\\bar{X}}),\\ 6)}}&{{}}\\end{array}', 'layout': 'right'}, {'input_path': PosixPath('/root/.paddlex/temp/tmp8t73dpym.png'), 'layout_bbox': [306.44016, 592.8762, 546.84314, 630.60126], 'text': 'where  and y denote the prediction and ground truth,\n= (c, b), c and b represent the category and bounding\nbox respectively, X represent the encoder feature.', 'layout': 'right'}, {'input_path': PosixPath('/root/.paddlex/temp/tmpftnxeyjm.png'), 'layout_bbox': [306.15652, 632.3142, 546.2463, 713.19073], 'text': 'Effectiveness analysis. To analyze the effectiveness of the\nuncertainty-minimal query selection, we visualize the clas-\nsificatior\nscores and IoU scores of the selected fe\nCOCO\na 12017, Figure 6. We draw the scatterplo\nt with\ndots\nrepresent the selected features from the model trained\nwith uncertainty-minimal query selection and vanilla query', 'layout': 'right'}]}
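The printed result is a dictionary whose `parsing_result` list holds one entry per detected region. As a minimal sketch of how such a structure might be consumed, the code below collects the textual regions and sorts them into an approximate reading order. The sample is abbreviated and rounded from the output above; the helper name, the assumption that `layout_bbox` is `[x1, y1, x2, y2]`, and the "single, then left, then right" ordering are all illustrative assumptions rather than documented behavior.

```python
# Abbreviated sample shaped like the printed result above; coordinates are rounded.
sample = {
    "parsing_result": [
        {"layout_bbox": [306.6, 302.1, 546.4, 419.8],
         "text": "The confidence score represents the likelihood ...", "layout": "right"},
        {"layout_bbox": [47.7, 243.1, 546.3, 295.7],
         "figure_title": "Overview of RT-DETR, We feed th", "layout": "single"},
        {"layout_bbox": [58.4, 304.2, 275.9, 400.1],
         "image": {"image_text": ""}, "layout": "left"},
        {"layout_bbox": [48.4, 637.6, 246.0, 648.2],
         "paragraph_title": "4.3. Uncertainty-minimal Query Selection", "layout": "left"},
        {"layout_bbox": [100.6, 406.0, 234.8, 414.8],
         "figure_title": "Figure 5. The fusion block in CCFF.", "layout": "left"},
    ]
}

# Keys that carry recognized content in the sample output.
CONTENT_KEYS = ("paragraph_title", "figure_title", "text", "formula")
# Assumed ordering: full-width blocks, then the left column, then the right column.
COLUMN_ORDER = {"single": 0, "left": 1, "right": 2}


def blocks_in_reading_order(result):
    """Return (kind, content) pairs sorted by column, then by the top edge of the box."""
    rows = []
    for block in result["parsing_result"]:
        kind = next((key for key in CONTENT_KEYS if key in block), None)
        if kind is None:
            continue  # e.g. image blocks, which hold pixel data instead of text
        top = block["layout_bbox"][1]  # assumes [x1, y1, x2, y2]
        rows.append((COLUMN_ORDER.get(block["layout"], 3), top, kind, block[kind]))
    rows.sort(key=lambda row: (row[0], row[1]))
    return [(kind, content) for _, _, kind, content in rows]


for kind, content in blocks_in_reading_order(sample):
    print(f"{kind}: {content}")
```

Real pages with mixed full-width and two-column regions may need a more careful ordering than this simple column-first sort.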

2.2 Python Script Integration

A few lines of code are all you need to quickly perform inference on your pipeline. Taking the general layout parsing pipeline as an example:

from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="layout_parsing")

output = pipeline.predict("demo_paper.png")
for res in output:
    res.print()  # Print the structured output of the prediction
    res.save_to_img("./output/")  # Save the result as an image file
    res.save_to_xlsx("./output/")  # Save the result as an Excel file
    res.save_to_html("./output/")  # Save the result as an HTML file

The results obtained are the same as those from the command line method.

In the above Python script, the following steps are executed:

(1) Call create_pipeline to instantiate a pipeline object. The parameters are described below:

Parameter	Description	Type	Default
`pipeline`	The name of the pipeline or the path to the pipeline configuration file. If it's a pipeline name, it must be supported by PaddleX.	`str`	None
`device`	The device for pipeline model inference. Supports: "gpu", "cpu".	`str`	"gpu"
`use_hpip`	Whether to enable high-performance inference, only available if the pipeline supports it.	`bool`	`False`

(2) Call the predict method of the pipeline object to perform inference. The predict method takes a parameter x, the data to be predicted, and accepts several input types:

Parameter Type	Description
Python Var	Supports directly passing Python variables, such as numpy.ndarray representing image data.
`str`	Supports passing the path of the file to be predicted, such as the local path of an image file: `/root/data/img.jpg`.
`str`	Supports passing the URL of the file to be predicted, such as the network URL of an image file: Example.
`str`	Supports passing a local directory, which should contain files to be predicted, such as the local path: `/root/data/`.
`dict`	Supports passing a dictionary type, where the key needs to correspond to the specific task, e.g., "img" for image classification tasks, and the value of the dictionary supports the above data types, e.g., `{"img": "/root/data1"}`.
`list`	Supports passing a list, where the list elements need to be of the above data types, e.g., `[numpy.ndarray, numpy.ndarray]`, `["/root/data/img1.jpg", "/root/data/img2.jpg"]`, `["/root/data1", "/root/data2"]`, `[{"img": "/root/data1"}, {"img": "/root/data2/img.jpg"}]`.

(3) Obtain the prediction results by calling the predict method: The predict method is a generator, so prediction results need to be obtained through iteration. The predict method predicts data in batches, so the prediction results are in the form of a list.

(4) Process the prediction results: The prediction result for each sample is of dict type and supports printing or saving as files, with the supported file types depending on the specific pipeline, such as:

Method	Description	Method Parameters
`save_to_img`	Saves the result as an image file.	`save_path`: `str` type, the path to save the file. When it's a directory, the saved file name is consistent with the input file name.
`save_to_html`	Saves the result as an HTML file.	`save_path`: `str` type, the path to save the file. When it's a directory, the saved file name is consistent with the input file name.
`save_to_xlsx`	Saves the result as an Excel file.	`save_path`: `str` type, the path to save the file. When it's a directory, the saved file name is consistent with the input file name.

For this pipeline, save_to_img saves the visualization results of OCR, layout analysis, and table structure recognition; save_to_html converts tables directly into HTML files; and save_to_xlsx exports tables as Excel files.
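Step (3) notes that predict is a generator, which has two practical consequences: no work happens until you iterate, and the results can be consumed only once. The stand-in below is not PaddleX code; `fake_predict` is a hypothetical generator that merely mimics yielding one result per input so the behavior can be seen without installing anything.

```python
def fake_predict(inputs, batch_size=2):
    """Stand-in for pipeline.predict: process inputs in batches, yield one result per input."""
    for start in range(0, len(inputs), batch_size):
        batch = inputs[start:start + batch_size]
        for item in batch:
            yield {"input_path": item}


output = fake_predict(["a.png", "b.png", "c.png"])

# Nothing has been processed yet; iteration drives the work.
results = list(output)
print(len(results), "results collected")

# A generator is exhausted after one pass, so keep the list if you need the results again.
print(len(list(output)), "results on a second pass")
```

If you need the results more than once, collect them into a list as shown rather than iterating over the generator a second time.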

Upon obtaining the configuration file, you can customize various settings for the layout parsing pipeline by simply modifying the pipeline parameter within the create_pipeline method to point to your configuration file path.

For instance, if your configuration file is saved at ./my_path/layout_parsing.yaml, you can execute the following code:

from paddlex import create_pipeline
pipeline = create_pipeline(pipeline="./my_path/layout_parsing.yaml")
output = pipeline.predict("layout_parsing.jpg")
for res in output:
    res.print()  # Prints the structured output of the layout parsing prediction
    res.save_to_img("./output/")  # Saves the img format results from each submodule of the pipeline
    res.save_to_xlsx("./output/")  # Saves the xlsx format results from the table recognition module
    res.save_to_html("./output/")  # Saves the html results from the table recognition module

3. Development Integration/Deployment

If the pipeline meets your requirements in terms of inference speed and accuracy, you can proceed with development integration or deployment.

To directly apply the pipeline in your Python project, refer to the example code in 2.2 Python Script Integration.

Additionally, PaddleX offers three other deployment methods, detailed as follows:

🚀 High-Performance Inference: In production environments, many applications require stringent performance metrics, especially response speed, to ensure efficient operation and smooth user experience. PaddleX provides a high-performance inference plugin that deeply optimizes model inference and pre/post-processing for significant end-to-end speedups. For detailed instructions on high-performance inference, refer to the PaddleX High-Performance Inference Guide.

☁️ Serving: Serving is a common deployment strategy in real-world production environments. By encapsulating inference functions into services, clients can access these services via network requests to obtain inference results. PaddleX supports various solutions for serving pipelines. For detailed pipeline serving procedures, please refer to the PaddleX Pipeline Serving Guide.

Below are the API reference and multi-language service invocation examples for the basic serving solution:

API Reference

For the main operations provided by the service:

The HTTP request method is POST.
Both the request body and response body are JSON data (JSON objects).
When the request is processed successfully, the response status code is 200, and the attributes of the response body are as follows:

Name	Type	Meaning
`logId`	`string`	The UUID of the request.
`errorCode`	`integer`	Error code. Fixed as `0`.
`errorMsg`	`string`	Error message. Fixed as `"Success"`.
`result`	`object`	The result of the operation.

When the request is not processed successfully, the attributes of the response body are as follows:

Name	Type	Meaning
`logId`	`string`	The UUID of the request.
`errorCode`	`integer`	Error code. Same as the response status code.
`errorMsg`	`string`	Error message.
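A client can use the two response shapes above to separate success from failure before touching `result`. The sketch below works on the JSON text of a response body; the `ServiceError` class and `unwrap` function are hypothetical names, and the sample bodies are made up to match the attribute tables.

```python
import json


class ServiceError(RuntimeError):
    """Raised when the response body reports a non-zero errorCode."""


def unwrap(body_text: str) -> dict:
    """Return the 'result' object of a successful response, or raise ServiceError."""
    body = json.loads(body_text)
    if body.get("errorCode") != 0:
        raise ServiceError(
            f"request {body.get('logId')} failed: "
            f"{body.get('errorCode')} {body.get('errorMsg')}"
        )
    return body["result"]


# Made-up bodies that follow the attribute tables above.
ok_body = '{"logId": "example-id", "errorCode": 0, "errorMsg": "Success", "result": {"layoutParsingResults": []}}'
bad_body = '{"logId": "example-id", "errorCode": 422, "errorMsg": "Invalid input"}'

print(unwrap(ok_body))
try:
    unwrap(bad_body)
except ServiceError as error:
    print("error:", error)
```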

The main operations provided by the service are as follows:

infer

Perform layout parsing.

POST /layout-parsing

The attributes of the request body are as follows:

Name	Type	Meaning	Required
`file`	`string`	The URL of an image or PDF file accessible by the server, or the Base64-encoded content of the above file types. For PDF files with more than 10 pages, only the content of the first 10 pages will be used.	Yes
`fileType`	`integer` \| `null`	File type. `0` represents a PDF file, and `1` represents an image file. If this attribute is missing from the request body, the file type will be inferred based on the URL.	No
`useDocOrientationClassify`	`boolean` \| `null`	See the description of the `use_doc_orientation_classify` parameter in the `predict` method of the pipeline.	No
`useDocUnwarping`	`boolean` \| `null`	See the description of the `use_doc_unwarping` parameter in the `predict` method of the pipeline.	No
`useTextlineOrientation`	`boolean` \| `null`	See the description of the `use_textline_orientation` parameter in the `predict` method of the pipeline.	No
`useGeneralOcr`	`boolean` \| `null`	See the description of the `use_general_ocr` parameter in the `predict` method of the pipeline.	No
`useSealRecognition`	`boolean` \| `null`	See the description of the `use_seal_recognition` parameter in the `predict` method of the pipeline.	No
`useTableRecognition`	`boolean` \| `null`	See the description of the `use_table_recognition` parameter in the `predict` method of the pipeline.	No
`useFormulaRecognition`	`boolean` \| `null`	See the description of the `use_formula_recognition` parameter in the `predict` method of the pipeline.	No
`textDetLimitSideLen`	`integer` \| `null`	See the description of the `text_det_limit_side_len` parameter in the `predict` method of the pipeline.	No
`textDetLimitType`	`string` \| `null`	See the description of the `text_det_limit_type` parameter in the `predict` method of the pipeline.	No
`textDetThresh`	`number` \| `null`	See the description of the `text_det_thresh` parameter in the `predict` method of the pipeline.	No
`textDetBoxThresh`	`number` \| `null`	See the description of the `text_det_box_thresh` parameter in the `predict` method of the pipeline.	No
`textDetUnclipRatio`	`number` \| `null`	See the description of the `text_det_unclip_ratio` parameter in the `predict` method of the pipeline.	No
`textRecScoreThresh`	`number` \| `null`	See the description of the `text_rec_score_thresh` parameter in the `predict` method of the pipeline.	No
`sealDetLimitSideLen`	`integer` \| `null`	See the description of the `seal_det_limit_side_len` parameter in the `predict` method of the pipeline.	No
`sealDetLimitType`	`string` \| `null`	See the description of the `seal_det_limit_type` parameter in the `predict` method of the pipeline.	No
`sealDetThresh`	`number` \| `null`	See the description of the `seal_det_thresh` parameter in the `predict` method of the pipeline.	No
`sealDetBoxThresh`	`number` \| `null`	See the description of the `seal_det_box_thresh` parameter in the `predict` method of the pipeline.	No
`sealDetUnclipRatio`	`number` \| `null`	See the description of the `seal_det_unclip_ratio` parameter in the `predict` method of the pipeline.	No
`sealRecScoreThresh`	`number` \| `null`	See the description of the `seal_rec_score_thresh` parameter in the `predict` method of the pipeline.	No
`layoutThreshold`	`number` \| `null`	See the description of the `layout_threshold` parameter in the `predict` method of the pipeline.	No
`layoutNms`	`boolean` \| `null`	See the description of the `layout_nms` parameter in the `predict` method of the pipeline.	No
`layoutUnclipRatio`	`number` \| `array` \| `null`	See the description of the `layout_unclip_ratio` parameter in the `predict` method of the pipeline.	No
`layoutMergeBboxesMode`	`string` \| `null`	See the description of the `layout_merge_bboxes_mode` parameter in the `predict` method of the pipeline.	No
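
Each optional request attribute above is the camelCase form of a snake_case `predict` parameter. The following is a minimal sketch of a helper that builds a request body from snake_case keyword arguments. The helper names `snake_to_camel` and `build_payload` and the example URL are hypothetical, and dropping `None` values assumes that the server treats a missing attribute the same way as `null`.

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case name such as 'text_det_thresh' to 'textDetThresh'."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def build_payload(file_data: str, file_type=None, **predict_params) -> dict:
    """Build a request body; parameters left as None are omitted."""
    payload = {"file": file_data}
    if file_type is not None:
        payload["fileType"] = file_type  # 0 = PDF file, 1 = image file
    for key, value in predict_params.items():
        if value is not None:
            payload[snake_to_camel(key)] = value
    return payload


payload = build_payload(
    "https://example.com/demo.pdf",  # placeholder URL, not a real file
    file_type=0,
    use_table_recognition=True,
    text_det_thresh=0.3,
    layout_nms=None,  # omitted from the request body
)
print(payload)
```

The resulting dictionary can be passed as the `json` argument of `requests.post`.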

When the request is processed successfully, the `result` field of the response body has the following attributes:

Name	Type	Meaning
`layoutParsingResults`	`array`	The layout parsing results. The length of the array is 1 (for image input) or the smaller of the document page count and 10 (for PDF input). For PDF input, each element in the array represents the processing result of each page in the PDF file.
`dataInfo`	`object`	Information about the input data.

Each element in `layoutParsingResults` is an object with the following attributes:

Name	Type	Meaning
`prunedResult`	`object`	A simplified version of the `res` field in the JSON representation generated by the `predict` method of the pipeline object, with the `input_path` field removed.
`outputImages`	`object` \| `null`	Key-value pairs that map image names to the input image and the prediction result images. The images are in JPEG format and encoded in Base64.
`inputImage`	`string` \| `null`	The input image. The image is in JPEG format and encoded in Base64.
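
Because `outputImages` and `inputImage` may both be `null`, client code should guard against missing values before decoding. The following is a minimal sketch using a synthetic result object. The function name `decode_page_images`, the image key `layout_det_res`, and the sample data are hypothetical and only illustrate the shape described in the tables above.

```python
import base64


def decode_page_images(page_result: dict) -> dict:
    """Return {name: raw JPEG bytes} for one element of layoutParsingResults."""
    images = {}
    # outputImages may be null, so fall back to an empty mapping.
    for name, data in (page_result.get("outputImages") or {}).items():
        images[name] = base64.b64decode(data)
    # inputImage may also be null.
    if page_result.get("inputImage"):
        images["inputImage"] = base64.b64decode(page_result["inputImage"])
    return images


# Synthetic stand-in for the "result" field of a successful response.
fake_jpeg = base64.b64encode(b"\xff\xd8fake").decode("ascii")
sample_result = {
    "layoutParsingResults": [
        {"prunedResult": {}, "outputImages": {"layout_det_res": fake_jpeg}, "inputImage": None},
        {"prunedResult": {}, "outputImages": None, "inputImage": fake_jpeg},
    ],
    "dataInfo": {},
}

decoded = [decode_page_images(page) for page in sample_result["layoutParsingResults"]]
for page_index, images in enumerate(decoded):
    print(page_index, sorted(images))
```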

Multi-language Service Call Example

Python

import base64
import requests

API_URL = "http://localhost:8080/layout-parsing"  # Service URL
file_path = "./demo.jpg"

# Read the local file and encode it in Base64
with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

payload = {
    "file": file_data,  # Base64-encoded file content or file URL
    "fileType": 1,  # 1 = image file, 0 = PDF file
}

# Call the API
response = requests.post(API_URL, json=payload)

# Process the response data
assert response.status_code == 200
result = response.json()["result"]
print("\nDetected layout elements:")
for i, res in enumerate(result["layoutParsingResults"]):
    print(res["prunedResult"])
    # outputImages may be null, so fall back to an empty mapping
    for img_name, img in (res["outputImages"] or {}).items():
        img_path = f"{img_name}_{i}.jpg"
        with open(img_path, "wb") as f:
            f.write(base64.b64decode(img))
        print(f"Output image saved at {img_path}")

📱 Edge Deployment: Edge deployment refers to placing computational and data processing capabilities directly on user devices, enabling them to process data without relying on remote servers. PaddleX supports deploying models on edge devices such as Android. For detailed edge deployment procedures, please refer to the PaddleX Edge Deployment Guide.

You can choose an appropriate method to deploy your model pipeline based on your needs, and proceed with subsequent AI application integration.

4. Custom Development

If the default model weights provided by the general layout parsing pipeline do not meet your requirements in terms of accuracy or speed for your specific scenario, you can try to further fine-tune the existing models using your own domain-specific or application-specific data to improve the recognition performance of the general layout parsing pipeline in your scenario.

4.1 Model Fine-tuning

Since the general layout parsing pipeline consists of several modules, unsatisfactory performance may stem from any of them.

You can analyze images with poor recognition results and follow the guidelines below for analysis and model fine-tuning:

Incorrect table structure detection (e.g., wrong row/column recognition, incorrect cell positions) may indicate deficiencies in the table structure recognition module. Refer to the Customization section in the Table Structure Recognition Module Development Tutorial and fine-tune the table structure recognition model using your private dataset.
Misplaced layout elements (e.g., incorrect positioning of tables or seals) may suggest issues with the layout detection module. Consult the Customization section in the Layout Detection Module Development Tutorial and fine-tune the layout detection model using your private dataset.
Frequently undetected text (i.e., missed text detections) indicates potential weaknesses in the text detection model. Follow the Customization section in the Text Detection Module Development Tutorial to fine-tune the text detection model using your private dataset.
Frequent text recognition errors (i.e., the recognized text content does not match the actual text) suggest that the text recognition model needs further improvement. Refer to the Customization section in the Text Recognition Module Development Tutorial to fine-tune the text recognition model.
Frequent recognition errors in detected seal text indicate that the seal text detection model needs improvement. Consult the Customization section in the Seal Text Detection Module Development Tutorial to fine-tune the seal text detection model.
Frequent recognition errors in detected formulas (i.e., the recognized formula content does not match the actual formula) suggest that the formula recognition model needs further improvement. Follow the Customization section in the Formula Recognition Module Development Tutorial to fine-tune the formula recognition model.
Frequent misclassification of the orientation of documents or certificates that contain text regions indicates that the document image orientation classification model needs improvement. Refer to the Customization section in the Document Image Orientation Classification Module Development Tutorial to fine-tune the model.

4.2 Model Application

After fine-tuning your model with a private dataset, you will obtain local model weights files.

To use the fine-tuned model weights, modify the pipeline configuration file by replacing the corresponding model entries with the local paths of the fine-tuned model weights:

......
 Pipeline:
  layout_model: PicoDet_layout_1x  # Can be modified to the local path of the fine-tuned model
  table_model: SLANet_plus  # Can be modified to the local path of the fine-tuned model
  text_det_model: PP-OCRv4_server_det  # Can be modified to the local path of the fine-tuned model
  text_rec_model: PP-OCRv4_server_rec  # Can be modified to the local path of the fine-tuned model
  formula_rec_model: LaTeX_OCR_rec  # Can be modified to the local path of the fine-tuned model
  seal_text_det_model: PP-OCRv4_server_seal_det   # Can be modified to the local path of the fine-tuned model
  doc_image_unwarp_model: UVDoc  # Can be modified to the local path of the fine-tuned model
  doc_image_ori_cls_model: PP-LCNet_x1_0_doc_ori  # Can be modified to the local path of the fine-tuned model
  layout_batch_size: 1
  text_rec_batch_size: 1
  table_batch_size: 1
  device: "gpu:0"
......

Then load the modified pipeline configuration file using the command-line or Python script method described in the Local Experience section.
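
The replacement can also be scripted. The following is a minimal sketch that operates on the configuration after it has been loaded into a dictionary, for example with PyYAML's `yaml.safe_load` (an assumption; any YAML loader would do). The function name `apply_finetuned_paths` and the local paths are hypothetical; the model keys are the ones shown in the configuration excerpt above.

```python
import copy

# Model entries taken from the configuration excerpt above.
MODEL_KEYS = {
    "layout_model",
    "table_model",
    "text_det_model",
    "text_rec_model",
    "formula_rec_model",
    "seal_text_det_model",
    "doc_image_unwarp_model",
    "doc_image_ori_cls_model",
}


def apply_finetuned_paths(config: dict, overrides: dict) -> dict:
    """Return a copy of config with selected model entries replaced by local paths."""
    unknown = set(overrides) - MODEL_KEYS
    if unknown:
        raise KeyError(f"Not a model entry: {sorted(unknown)}")
    updated = copy.deepcopy(config)  # leave the original configuration untouched
    updated["Pipeline"].update(overrides)
    return updated


# Abbreviated stand-in for the loaded configuration file.
config = {
    "Pipeline": {
        "layout_model": "PicoDet_layout_1x",
        "table_model": "SLANet_plus",
        "device": "gpu:0",
    }
}

# Hypothetical local path of fine-tuned weights.
updated = apply_finetuned_paths(config, {"table_model": "./finetuned/table_model"})
print(updated["Pipeline"]["table_model"])
```

Rejecting unknown keys catches typos such as `table_modle` before the modified file is written back and loaded by the pipeline.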

5. Multi-Hardware Support

PaddleX supports various mainstream hardware devices such as NVIDIA GPUs, Kunlun XPU, Ascend NPU, and Cambricon MLU. Simply modify the --device parameter to seamlessly switch between different hardware.

For example, if you use an NVIDIA GPU for inference in the layout parsing pipeline, the command is:

paddlex --pipeline layout_parsing --input layout_parsing.jpg --device gpu:0

To switch the hardware to an Ascend NPU, simply change the value of --device to npu in the command:

paddlex --pipeline layout_parsing --input layout_parsing.jpg --device npu:0
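
When the command is launched from a script, the device can be made a parameter. The following is a minimal sketch that only assembles the argument list from the flags shown above; the function name `build_cli_command` is hypothetical, and running the command assumes that the `paddlex` executable is installed and on the `PATH`.

```python
import shlex


def build_cli_command(input_path: str, device: str = "gpu:0") -> list:
    """Assemble the paddlex command line for the layout parsing pipeline."""
    return [
        "paddlex",
        "--pipeline", "layout_parsing",
        "--input", input_path,
        "--device", device,
    ]


cmd = build_cli_command("layout_parsing.jpg", device="npu:0")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` to execute the command.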

If you want to use the general layout parsing pipeline on more types of hardware, please refer to the PaddleX Multi-Device Usage Guide.
