
Document Image Preprocessing Pipeline Tutorial

1. Introduction to the Document Image Preprocessing Pipeline

The document image preprocessing pipeline integrates two major functions: document orientation classification and geometric distortion correction. The orientation classification module automatically identifies which of four orientations a document is in (0°, 90°, 180°, 270°), so that subsequent tasks process it in the correct direction. The geometric distortion correction model corrects distortions introduced when a document is photographed or scanned, restoring its original shape and proportions.

The pipeline is suitable for digital document management, preprocessing ahead of OCR, and any scenario where document image quality needs to be improved. By automating orientation and distortion correction, it improves the accuracy and efficiency of document processing and provides a more reliable foundation for image analysis. The pipeline also offers flexible service deployment options and can be invoked from various programming languages on multiple hardware platforms. In addition, it supports further development: you can train and fine-tune models on your own dataset, and the trained models can be integrated seamlessly into the pipeline.

The document image preprocessing pipeline consists of an optional document image orientation classification module and an optional text image unwarping module, which use the following models.

Document Image Orientation Classification Module (Optional):

Model	Model download link	Top-1 Acc (%)	GPU Inference Time (ms) [Normal Mode / High-Performance Mode]	CPU Inference Time (ms)	Model Storage Size (M)	Introduction
PP-LCNet_x1_0_doc_ori	Inference Model/Train Model	99.06	3.84845	9.23735	7	A document image classification model based on PP-LCNet_x1_0, containing four categories: 0 degrees, 90 degrees, 180 degrees, and 270 degrees.

Text Image Unwarping Module (Optional):

Model	Model download link	CER	Model Storage Size (M)	Introduction
UVDoc	Inference Model/Train Model	0.179	30.3	A high-precision text image unwarping model.

Test Environment Description:

  <li><b>Performance Test Environment</b>
      <ul>
                <li><strong>Test Dataset:</strong>
                    <ul>
                      <li>Document Image Orientation Classification Module: A self-built dataset using PaddleX, covering multiple scenarios such as ID cards and documents, containing 1000 images.</li>
                      <li>Text Image Unwarping Module: <a href="https://www3.cs.stonybrook.edu/~cvl/docunet.html">DocUNet</a>.</li>
                    </ul>
                </li>
          <li><strong>Hardware Configuration:</strong>
              <ul>
                  <li>GPU: NVIDIA Tesla T4</li>
                  <li>CPU: Intel Xeon Gold 6271C @ 2.60GHz</li>
                  <li>Other Environments: Ubuntu 20.04 / cuDNN 8.6 / TensorRT 8.5.2.2</li>
              </ul>
          </li>
      </ul>
  </li>
  <li><b>Inference Mode Description</b></li>

Mode	GPU Configuration	CPU Configuration	Acceleration Technology Combination
Normal Mode	FP32 Precision / No TRT Acceleration	FP32 Precision / 8 Threads	PaddleInference
High-Performance Mode	Optimal combination of pre-selected precision types and acceleration strategies	FP32 Precision / 8 Threads	Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.)

2. Quick Start

You can try the document image preprocessing pipeline locally, either from the command line or from Python.

Before using the document image preprocessing pipeline locally, please ensure you have completed the installation of the PaddleX wheel package according to the PaddleX Local Installation Guide. If you wish to selectively install dependencies, please refer to the relevant instructions in the installation guide. The dependency group corresponding to this pipeline is ocr.

2.1 Local Experience

2.1.1 Command Line Experience

You can try the document image preprocessing pipeline with a single command. Use the test file, and set --input to its local path to run prediction.

paddlex --pipeline doc_preprocessor \
        --input doc_test_rotated.jpg \
        --use_doc_orientation_classify True \
        --use_doc_unwarping True \
        --save_path ./output \
        --device gpu:0

You can refer to the parameter descriptions in 2.1.2 Python Script Integration for related parameter details.

After running, the results will be printed to the terminal as follows:

{'res': {'input_path': 'doc_test_rotated.jpg', 'model_settings': {'use_doc_orientation_classify': True, 'use_doc_unwarping': True}, 'angle': 180}}

You can refer to the results explanation in 2.1.2 Python Script Integration for a description of the output parameters.
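
If you capture this terminal output in a script, note that the printed line is a Python-style dict literal (single quotes, `True`), so a JSON parser would reject it. The following is a minimal sketch, assuming the captured line has exactly the shape shown above and contains no extra log text; `parse_cli_result` is a hypothetical helper and not part of PaddleX.

```python
import ast


def parse_cli_result(line: str) -> dict:
    """Parse one printed result line and return the dict under 'res'."""
    # literal_eval only evaluates Python literals, never arbitrary code
    parsed = ast.literal_eval(line.strip())
    return parsed["res"]


line = (
    "{'res': {'input_path': 'doc_test_rotated.jpg', "
    "'model_settings': {'use_doc_orientation_classify': True, "
    "'use_doc_unwarping': True}, 'angle': 180}}"
)
res = parse_cli_result(line)
print(res["angle"])  # 180
```

For anything beyond a quick check, saving the results with --save_path and reading the JSON file is likely to be more robust than parsing terminal output.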

The visualized results are saved under save_path. The visualized results are as follows:

2.1.2 Python Script Integration

The command line above is intended for quickly trying the pipeline and viewing its output. In a project, you will usually integrate the pipeline through code, and a few lines are enough to run inference:

from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="doc_preprocessor")
output = pipeline.predict(
    input="doc_test_rotated.jpg",
    use_doc_orientation_classify=True,
    use_doc_unwarping=True,
)
for res in output:
    res.print()
    res.save_to_img(save_path="./output/")
    res.save_to_json(save_path="./output/")

In the above Python script, the following steps were executed:

(1) Instantiate the doc_preprocessor pipeline object using create_pipeline(). The specific parameter descriptions are as follows:

Parameter	Description	Type	Default
`pipeline`	The pipeline name or the path to the pipeline configuration file. If it is a pipeline name, it must be a pipeline supported by PaddleX.	`str`	`None`
`device`	Inference device for the pipeline. Supports specifying the GPU card number, such as "gpu:0", other hardware card numbers, such as "npu:0", and CPU as "cpu".	`str`	`gpu:0`
`use_hpip`	Whether to enable the high-performance inference plugin. If set to `None`, the setting from the configuration file will be used.	`bool`	`None`
`hpi_config`	High-performance inference configuration	`dict` \| `None`	`None`

(2) Call the predict() method of the doc_preprocessor pipeline object for inference prediction. This method will return a generator. Below are the parameters of the predict() method and their descriptions:

Parameter	Description	Type	Options	Default
`input`	Data to be predicted (required). Several input types are supported.	`Python Var\|str\|list`	Python Var: image data such as a `numpy.ndarray`; str: the local path of an image or PDF file, such as `/root/data/img.jpg`; a URL of an image or PDF file: example; or a local directory containing the images to be predicted, such as `/root/data/` (directories of PDFs are not currently supported; a PDF must be specified by its file path); List: elements must be of the types above, such as `[numpy.ndarray, numpy.ndarray]`, `["/root/data/img1.jpg", "/root/data/img2.jpg"]`, `["/root/data1", "/root/data2"]`	`None`
`device`	Inference device for the pipeline	`str\|None`	CPU: Like `cpu`, indicating inference using CPU; GPU: Like `gpu:0`, indicating inference using the first GPU; NPU: Like `npu:0`, indicating inference using the first NPU; XPU: Like `xpu:0`, indicating inference using the first XPU; MLU: Like `mlu:0`, indicating inference using the first MLU; DCU: Like `dcu:0`, indicating inference using the first DCU; None: If set to `None`, the default value initialized by the pipeline will be used. During initialization, it will preferentially use the local GPU device 0, if none, then the CPU device;	`None`
`use_doc_orientation_classify`	Whether to use the document orientation classification module	`bool\|None`	bool: `True` or `False`; None: If set to `None`, the default value initialized by the pipeline will be used, initialized to `True`;	`None`
`use_doc_unwarping`	Whether to use the document unwarping correction module	`bool\|None`	bool: `True` or `False`; None: If set to `None`, the default value initialized by the pipeline will be used, initialized to `True`;	`None`
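
Because predict() returns a generator, no result is produced until you iterate over it. The sketch below collects the predicted angle for a list of inputs. It assumes that each result's `json` attribute has the same `{'res': {...}}` shape as the printed output; `collect_angles` is a hypothetical helper, and `pipeline` is expected to be the object returned by `create_pipeline(pipeline="doc_preprocessor")`.

```python
from typing import Any, Dict, Iterable, Optional


def collect_angles(
    pipeline: Any,
    inputs: Iterable[str],
    use_doc_orientation_classify: Optional[bool] = None,
    use_doc_unwarping: Optional[bool] = None,
) -> Dict[str, int]:
    """Run the pipeline over several inputs and map input_path -> angle."""
    angles: Dict[str, int] = {}
    # A list of paths is a valid input; None keeps the pipeline defaults.
    results = pipeline.predict(
        input=list(inputs),
        use_doc_orientation_classify=use_doc_orientation_classify,
        use_doc_unwarping=use_doc_unwarping,
    )
    for res in results:  # the generator yields one result per image
        info = res.json["res"]  # assumed shape: {'res': {...}}
        angles[info["input_path"]] = info["angle"]
    return angles
```

A call such as `collect_angles(pipeline, ["/root/data/img1.jpg", "/root/data/img2.jpg"])` would then return one entry per image, with -1 for every image if orientation classification is disabled.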

(3) Process the prediction results, where the prediction result for each sample is of dict type. Additionally, these results support operations such as printing, saving as an image, and saving as a json file.

Method	Description	Parameter	Type	Description	Default
`print()`	Prints the results to the terminal	`format_json`	`bool`	Whether to format the output using `JSON` indentation	`True`
		`indent`	`int`	Specifies the indentation level to beautify the output `JSON` data for better readability, effective only when `format_json` is `True`	4
		`ensure_ascii`	`bool`	Controls whether to escape non-`ASCII` characters as `Unicode`. When set to `True`, all non-`ASCII` characters will be escaped; `False` retains the original characters, effective only when `format_json` is `True`	`False`
`save_to_json()`	Saves the results as a JSON format file	`save_path`	`str`	The file path to save, naming consistent with the input file type when it is a directory	None
		`indent`	`int`	Specifies the indentation level to beautify the output `JSON` data for better readability, effective only when `format_json` is `True`	4
		`ensure_ascii`	`bool`	Controls whether to escape non-`ASCII` characters as `Unicode`. When set to `True`, all non-`ASCII` characters will be escaped; `False` retains the original characters, effective only when `format_json` is `True`	`False`
`save_to_img()`	Saves the results as an image format file	`save_path`	`str`	The file path to save, supporting both directory or file path	None

Calling the print() method outputs the results to the terminal. The printed fields are as follows:

- input_path: (str) The input path of the image to be predicted.
- model_settings: (Dict[str, bool]) Model parameters required for configuring the pipeline.
  - use_doc_orientation_classify: (bool) Controls whether to enable the document orientation classification module.
  - use_doc_unwarping: (bool) Controls whether to enable the document unwarping module.
- angle: (int) The prediction result of the document orientation classification. When the module is enabled, the value is one of 0, 90, 180, or 270; when it is not enabled, the value is -1.

Calling the save_to_json() method saves the above content to the specified save_path. If a directory is specified, the file is saved to save_path/{your_img_basename}.json; if a file is specified, the content is saved directly to that file. Since JSON files do not support NumPy arrays, any numpy.array values are converted to lists.

Calling the save_to_img() method saves the visualized results to the specified save_path. If a directory is specified, the file is saved to save_path/{your_img_basename}_doc_preprocessor_res_img.{your_img_extension}; if a file is specified, the image is saved directly to that file. (Since the pipeline typically produces multiple result images, specifying a single file path is not recommended: each image overwrites the previous one, leaving only the last.)

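
The naming rules for a directory save_path can be sketched with `pathlib`, which may help when a later step needs to locate the saved files. This is an illustration only: `expected_output_paths` is a hypothetical helper, and it assumes that `{your_img_basename}` means the input file name without its extension.

```python
from pathlib import Path
from typing import Dict


def expected_output_paths(input_path: str, save_dir: str) -> Dict[str, Path]:
    """Return the file paths described above for a directory save_path."""
    src = Path(input_path)
    out = Path(save_dir)
    return {
        # save_to_json(): save_path/{your_img_basename}.json
        "json": out / f"{src.stem}.json",
        # save_to_img(): save_path/{your_img_basename}_doc_preprocessor_res_img.{ext}
        "img": out / f"{src.stem}_doc_preprocessor_res_img{src.suffix}",
    }


paths = expected_output_paths("doc_test_rotated.jpg", "./output")
print(paths["json"].name)  # doc_test_rotated.json
print(paths["img"].name)   # doc_test_rotated_doc_preprocessor_res_img.jpg
```
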
The prediction results and the visualized images are also available through attributes, as detailed below:

Attribute	Description
`json`	Retrieves the prediction results in `json` format
`img`	Retrieves visualized images in `dict` format

The json attribute returns the prediction results as a dict, consistent with the content saved by calling the save_to_json() method.
The img attribute returns the visualized images as a dict. The key is preprocessed_img, and the corresponding value is an Image.Image object that visualizes the results of the doc_preprocessor pipeline.

Additionally, you can obtain the doc_preprocessor pipeline configuration file and load it for prediction. Run the following command to save the configuration file to my_path:

paddlex --get_pipeline_config doc_preprocessor --save_path ./my_path

Once you have the configuration file, you can customize the doc_preprocessor pipeline by changing the pipeline parameter of the create_pipeline method to the path of the configuration file.

For example, if your configuration file is saved at ./my_path/doc_preprocessor.yaml, you only need to run:

from paddlex import create_pipeline
pipeline = create_pipeline(pipeline="./my_path/doc_preprocessor.yaml")
output = pipeline.predict(
    input="doc_test_rotated.jpg",
    use_doc_orientation_classify=True,
    use_doc_unwarping=True,
)
for res in output:
    res.print()
    res.save_to_img("./output/")
    res.save_to_json("./output/")

Note: The parameters in the configuration file are for pipeline initialization. If you wish to modify the initialization parameters for the doc_preprocessor pipeline, you can directly edit the parameters in the configuration file and load the file for prediction. Additionally, CLI prediction also supports passing in a configuration file; simply specify the path to the configuration file using --pipeline.

3. Development Integration/Deployment

If the document image preprocessing pipeline meets your requirements for inference speed and accuracy, you can proceed directly with development integration/deployment.

If you need to apply the document image preprocessing pipeline directly to your Python project, you can refer to the sample code in 2.1.2 Python Script Integration.

Additionally, PaddleX offers other deployment methods, detailed as follows:

🚀 High-Performance Inference: In real production environments, many applications have stringent performance standards for deployment strategies, especially regarding response speed, to ensure efficient system operation and a smooth user experience. To address this, PaddleX provides a high-performance inference plugin designed to deeply optimize model inference and pre/post-processing, resulting in significant end-to-end process acceleration. For detailed high-performance inference procedures, please refer to the PaddleX High-Performance Inference Guide.

API Reference

For the main operations provided by the service:

- The HTTP request method is POST.
- Both the request body and response body are JSON data (JSON objects).
- When the request is processed successfully, the response status code is 200, and the attributes of the response body are as follows:

Name	Type	Meaning
`logId`	`string`	The UUID of the request.
`errorCode`	`integer`	Error code. Fixed as `0`.
`errorMsg`	`string`	Error message. Fixed as `"Success"`.
`result`	`object`	The result of the operation.

When the request is not processed successfully, the attributes of the response body are as follows:

Name	Type	Meaning
`logId`	`string`	The UUID of the request.
`errorCode`	`integer`	Error code. Same as the response status code.
`errorMsg`	`string`	Error message.

The main operations provided by the service are as follows:

infer

Obtain the document image preprocessing results.

POST /document-preprocessing

The attributes of the request body are as follows:

Name	Type	Meaning	Required
`file`	`string`	The URL of an image or PDF file accessible by the server, or the Base64-encoded content of the file. By default, for PDF files exceeding 10 pages, only the first 10 pages will be processed. To remove the page limit, please add the following configuration to the pipeline configuration file: `Serving: extra: max_num_input_imgs: null`	Yes
`fileType`	`integer` \| `null`	The type of the file. `0` for PDF files, `1` for image files. If this attribute is missing, the file type will be inferred from the URL.	No
`useDocOrientationClassify`	`boolean` \| `null`	Please refer to the description of the `use_doc_orientation_classify` parameter of the pipeline object's `predict` method.	No
`useDocUnwarping`	`boolean` \| `null`	Please refer to the description of the `use_doc_unwarping` parameter of the pipeline object's `predict` method.	No
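
The request attributes above can be assembled as in the following sketch, which covers both forms of `file`: a URL passed through unchanged, or a local file sent as Base64 content. `build_infer_payload` is a hypothetical helper, and treating any string that starts with `http://` or `https://` as a URL is a simplifying assumption.

```python
import base64
from pathlib import Path
from typing import Any, Dict, Optional


def build_infer_payload(
    file: str,
    file_type: Optional[int] = None,
    use_doc_orientation_classify: Optional[bool] = None,
    use_doc_unwarping: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build the JSON request body for the infer operation."""
    if file.startswith(("http://", "https://")):
        file_field = file  # a URL that the server fetches itself
    else:
        # Local file: send its Base64-encoded content instead
        file_field = base64.b64encode(Path(file).read_bytes()).decode("ascii")
    payload: Dict[str, Any] = {"file": file_field}
    optional = {
        "fileType": file_type,  # 0 = PDF, 1 = image
        "useDocOrientationClassify": use_doc_orientation_classify,
        "useDocUnwarping": use_doc_unwarping,
    }
    # Omit unset attributes so that the service applies its own defaults
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload
```

Since the table describes file type inference only for URLs, it seems safest to pass `file_type` explicitly whenever the file is sent as Base64 content.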

When the request is processed successfully, the result in the response body has the following attributes:

Name	Type	Meaning
`docPreprocessingResults`	`array`	Document image preprocessing results. The array length is 1 (for image input) or the actual number of document pages processed (for PDF input). For PDF input, each element in the array represents the result of each page actually processed in the PDF file.
`dataInfo`	`object`	Information about the input data.

Each element in docPreprocessingResults is an object with the following attributes:

Name	Type	Meaning
`outputImage`	`string`	The preprocessed image. The image is in PNG format and is Base64-encoded.
`prunedResult`	`object`	A simplified version of the `res` field in the JSON representation of the result generated by the pipeline object's `predict` method, excluding the `input_path` and the `page_index` fields.
`docPreprocessingImage`	`string` \| `null`	The visualization result image. The image is in JPEG format and is Base64-encoded.
`inputImage`	`string` \| `null`	The input image. The image is in JPEG format and is Base64-encoded.
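
On the client side, the response attributes described in this section can be handled roughly as follows. This is a sketch under the assumption that the parsed JSON body follows the tables above exactly; `ServiceError` and `decode_infer_response` are hypothetical names, not part of any PaddleX client library.

```python
import base64
from typing import Any, Dict, List


class ServiceError(RuntimeError):
    """Raised when the response body reports a non-zero errorCode."""


def decode_infer_response(body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Check a parsed response body and decode each page's output image."""
    if body.get("errorCode") != 0:
        raise ServiceError(
            f"{body.get('errorCode')}: {body.get('errorMsg')} "
            f"(logId={body.get('logId')})"
        )
    pages = []
    for item in body["result"]["docPreprocessingResults"]:
        pages.append(
            {
                "pruned": item["prunedResult"],
                # outputImage is a Base64-encoded PNG
                "output_png": base64.b64decode(item["outputImage"]),
                # The visualization image may be null in the response
                "has_visualization": item.get("docPreprocessingImage") is not None,
            }
        )
    return pages
```

Including `logId` in the error message makes it easier to match a failed request with the server-side logs.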

Multi-language Service Call Example

Python

import base64
import requests

API_URL = "http://localhost:8080/document-preprocessing"  # service endpoint
file_path = "./demo.jpg"

# Base64-encode the local image so it can be sent in the JSON body
with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

payload = {"file": file_data, "fileType": 1}  # 1 = image file

# Call the infer operation
response = requests.post(API_URL, json=payload)

# A successful request returns status code 200
assert response.status_code == 200
result = response.json()["result"]
for i, res in enumerate(result["docPreprocessingResults"]):
    print(res["prunedResult"])
    # outputImage is a Base64-encoded PNG
    output_img_path = f"out_{i}.png"
    with open(output_img_path, "wb") as f:
        f.write(base64.b64decode(res["outputImage"]))
    print(f"Output image saved at {output_img_path}")

☁️ Service Deployment: Service deployment is a common form of deployment in real production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. PaddleX supports multiple pipeline service deployment solutions. For detailed pipeline service deployment procedures, please refer to the PaddleX Service Deployment Guide.

4. Custom Development

If the default model weights provided by the document image preprocessing pipeline do not meet your accuracy or speed requirements in your specific scenario, you can try to further fine-tune the existing model using data from your specific domain or application scenario to enhance the recognition performance of the document image preprocessing pipeline in your context.

4.1 Model Fine-Tuning

Since the document image preprocessing pipeline consists of several modules, if the pipeline's performance does not meet expectations, it may be due to any one of these modules. You can analyze the images with poor recognition results to identify which module has issues, and then refer to the corresponding fine-tuning tutorial link in the table below to fine-tune the model.

Situation	Fine-tuning model	Fine-tuning reference link
The overall image rotation correction is inaccurate.	Image orientation classification module	Link
The image distortion correction is inaccurate.	Image Unwarping	Fine-tuning is not supported at the moment.

4.2 Model Application

After completing fine-tuning training with a private dataset, you can obtain a local model weights file.

If you need to use the fine-tuned model weights, simply modify the pipeline configuration file by entering the local path of the fine-tuned model weights into the model_dir field in the pipeline configuration file.

......
  DocOrientationClassify:
    module_name: doc_text_orientation
    model_name: PP-LCNet_x1_0_doc_ori
    model_dir: ./output/best_model/inference  # Replace it with the path of the fine-tuned document image orientation classification model weights.
......

Then, refer to the command line method or Python script method in 2. Quick Start to load the modified pipeline configuration file.

5. Multi-Hardware Support

PaddleX supports a variety of mainstream hardware devices such as NVIDIA GPU, Kunlunxin XPU, Ascend NPU, and Cambricon MLU. You can achieve seamless switching between different hardware by simply modifying the --device parameter.

For example, if you are using an Ascend NPU for inference with the document image preprocessing pipeline, the command you would use is:

paddlex --pipeline doc_preprocessor \
        --input doc_test_rotated.jpg \
        --use_doc_orientation_classify True \
        --use_doc_unwarping True \
        --save_path ./output \
        --device npu:0

If you want to use the document image preprocessing pipeline on more types of hardware, please refer to the PaddleX Multi-Hardware Usage Guide.
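
When the device string comes from user input or a configuration file, a small pre-check can catch typos before the pipeline is started. The sketch below is a hypothetical helper, not PaddleX's own validation; it only recognizes the device kinds named in this tutorial (cpu, gpu, npu, xpu, mlu, dcu) and the `kind:index` form used above.

```python
from typing import Optional, Tuple

# Device kinds named in this tutorial's predict() parameter table
KNOWN_DEVICE_KINDS = {"cpu", "gpu", "npu", "xpu", "mlu", "dcu"}


def parse_device(spec: str) -> Tuple[str, Optional[int]]:
    """Split a device string such as 'npu:0' into (kind, card index)."""
    kind, sep, index = spec.partition(":")
    if kind not in KNOWN_DEVICE_KINDS:
        raise ValueError(f"unknown device kind: {kind!r}")
    if not sep:
        return kind, None  # e.g. 'cpu' carries no card number
    if not index.isdigit():
        raise ValueError(f"invalid card number in {spec!r}")
    return kind, int(index)


print(parse_device("npu:0"))  # ('npu', 0)
print(parse_device("cpu"))    # ('cpu', None)
```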
