The document image preprocessing pipeline integrates two major functions: document orientation classification and geometric distortion correction. The orientation classification module automatically identifies which of four orientations a document is in (0°, 90°, 180°, 270°), so that subsequent tasks process it in the correct direction. The geometric distortion correction model corrects distortions introduced while photographing or scanning a document, restoring its original shape and proportions. The pipeline is suitable for digital document management, preprocessing for OCR, and any scenario where document image quality needs to be improved. By automating orientation and distortion correction, it improves the accuracy and efficiency of document processing and gives users a more reliable foundation for image analysis. The pipeline also offers flexible service deployment options, supporting invocation from various programming languages on multiple hardware platforms. It can also be developed further: you can train and fine-tune models on your own dataset, and the trained models can be integrated back into the pipeline seamlessly.
The general document image preprocessing pipeline consists of an optional document image orientation classification module and an optional text image unwarping module, which include the following models.
Document Image Orientation Classification Module (Optional):
| Model | Model download link | Top-1 Acc(%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU inference time (ms) | Model storage size(M) | Introduction |
|---|---|---|---|---|---|---|
| PP-LCNet_x1_0_doc_ori | Inference Model/Train Model | 99.06 | 3.84845 | 9.23735 | 7 | A document image classification model based on PP-LCNet_x1_0, containing four categories: 0 degrees, 90 degrees, 180 degrees, and 270 degrees. |
Text Image Unwarping Module (Optional):
| Model | Model download link | CER | Model storage size(M) | Introduction |
|---|---|---|---|---|
| UVDoc | Inference Model/Train Model | 0.179 | 30.3 | High-Precision Text Image Correction Model |
Test Environment Description:
<li><b>Performance Test Environment</b>
<ul>
<li><strong>Test Dataset:</strong>
<ul>
<li>Document Image Orientation Classification Module: A self-built dataset using PaddleX, covering multiple scenarios such as ID cards and documents, containing 1000 images.</li>
<li>Text Image Rectification Module: <a href="https://www3.cs.stonybrook.edu/~cvl/docunet.html">DocUNet</a>.</li>
</ul>
</li>
<li><strong>Hardware Configuration:</strong>
<ul>
<li>GPU: NVIDIA Tesla T4</li>
<li>CPU: Intel Xeon Gold 6271C @ 2.60GHz</li>
<li>Other Environments: Ubuntu 20.04 / cuDNN 8.6 / TensorRT 8.5.2.2</li>
</ul>
</li>
</ul>
</li>
<li><b>Inference Mode Description</b></li>
| Mode | GPU Configuration | CPU Configuration | Acceleration Technology Combination |
|---|---|---|---|
| Normal Mode | FP32 Precision / No TRT Acceleration | FP32 Precision / 8 Threads | PaddleInference |
| High-Performance Mode | Optimal combination of pre-selected precision types and acceleration strategies | FP32 Precision / 8 Threads | Pre-selected optimal backend (Paddle/OpenVINO/TRT, etc.) |
PaddleX supports experiencing the effects of the document image preprocessing pipeline locally via command line or Python.
Before using the document image preprocessing pipeline locally, please ensure you have completed the installation of the PaddleX wheel package according to the PaddleX Local Installation Guide. If you wish to selectively install dependencies, please refer to the relevant instructions in the installation guide. The dependency group corresponding to this pipeline is ocr.
You can quickly experience the effects of the document image preprocessing pipeline with a single command. Use the test file and replace --input with the local path to perform predictions.
paddlex --pipeline doc_preprocessor \
--input doc_test_rotated.jpg \
--use_doc_orientation_classify True \
--use_doc_unwarping True \
--save_path ./output \
--device gpu:0
You can refer to the parameter descriptions in 2.1.2 Python Script Integration for related parameter details.
After running, the results will be printed to the terminal as follows:
{'res': {'input_path': 'doc_test_rotated.jpg', 'model_settings': {'use_doc_orientation_classify': True, 'use_doc_unwarping': True}, 'angle': 180}}
You can refer to the results explanation in 2.1.2 Python Script Integration for a description of the output parameters.
The visualized results are saved under save_path.
The above command line is for quickly experiencing and viewing the effect. Generally, in a project, it is often necessary to integrate through code. You can complete quick inference in a pipeline with just a few lines of code. The inference code is as follows:
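from paddlex import create_pipeline
# Instantiate the pipeline by name
pipeline = create_pipeline(pipeline="doc_preprocessor")
# predict() returns a generator of results
output = pipeline.predict(
    input="doc_test_rotated.jpg",
    use_doc_orientation_classify=True,
    use_doc_unwarping=True,
)
for res in output:
    res.print()  # print the result to the terminal
    res.save_to_img(save_path="./output/")  # save the visualized image
    res.save_to_json(save_path="./output/")  # save the result as JSON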
In the above Python script, the following steps were executed:
(1) Instantiate the doc_preprocessor pipeline object using create_pipeline(). The specific parameter descriptions are as follows:
| Parameter | Description | Type | Default |
|---|---|---|---|
| pipeline | The pipeline name or the path to the pipeline configuration file. If it is a pipeline name, it must be a pipeline supported by PaddleX. | str | None |
| device | Inference device for the pipeline. Supports specifying the GPU card number, such as "gpu:0", other hardware card numbers, such as "npu:0", and CPU as "cpu". | str | gpu:0 |
| use_hpip | Whether to enable the high-performance inference plugin. If set to `None`, the setting from the configuration file will be used. | bool \| None | None |
| hpi_config | High-performance inference configuration | dict \| None | None |
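As a minimal sketch of how these parameters might be combined, the helper below collects the arguments for create_pipeline() and leaves out any that are None. `build_pipeline_kwargs` and `make_pipeline` are hypothetical names introduced for illustration; only `create_pipeline` and the four parameters in the table come from this document.

```python
from typing import Any, Dict, Optional


def build_pipeline_kwargs(
    pipeline: str = "doc_preprocessor",
    device: Optional[str] = None,
    use_hpip: Optional[bool] = None,
    hpi_config: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Collect create_pipeline() arguments, omitting any left as None."""
    candidates = {
        "pipeline": pipeline,
        "device": device,
        "use_hpip": use_hpip,
        "hpi_config": hpi_config,
    }
    return {key: value for key, value in candidates.items() if value is not None}


def make_pipeline(**overrides: Any):
    """Create the pipeline (hypothetical wrapper).

    paddlex is imported lazily so that build_pipeline_kwargs() can be used
    and inspected on machines where PaddleX is not installed.
    """
    from paddlex import create_pipeline

    return create_pipeline(**build_pipeline_kwargs(**overrides))


if __name__ == "__main__":
    # Show the arguments that would be passed for CPU inference.
    print(build_pipeline_kwargs(device="cpu"))
```

Omitting unset arguments, rather than passing None explicitly, is only a design choice of this sketch; it lets the defaults listed in the table apply.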
(2) Call the predict() method of the doc_preprocessor pipeline object for inference prediction. This method will return a generator. Below are the parameters of the predict() method and their descriptions:
| Parameter | Description | Type | Options | Default |
|---|---|---|---|---|
| input | Data to be predicted, supporting various input types, required | Python Var \| str \| list | | None |
| device | Inference device for the pipeline | str \| None | | None |
| use_doc_orientation_classify | Whether to use the document orientation classification module | bool \| None | | None |
| use_doc_unwarping | Whether to use the document unwarping correction module | bool \| None | | None |
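Because predict() returns a generator, results are only produced as the generator is consumed. The sketch below shows one way to run several inputs in a single call and collect the results into a list. `run_batch` is a hypothetical helper, and passing a list of file paths as `input` is an assumption based on the `list` type in the table above.

```python
from typing import Any, Iterable, List, Optional


def run_batch(
    pipeline: Any,
    inputs: Iterable[str],
    use_doc_orientation_classify: Optional[bool] = None,
    use_doc_unwarping: Optional[bool] = None,
) -> List[Any]:
    """Run predict() on several inputs and return all results as a list.

    `pipeline` is any object exposing the predict() method described above.
    Assumption: a list of image paths is an accepted `input` value.
    """
    generator = pipeline.predict(
        input=list(inputs),
        use_doc_orientation_classify=use_doc_orientation_classify,
        use_doc_unwarping=use_doc_unwarping,
    )
    # Consuming the generator is what actually drives inference.
    return list(generator)
```

With a real pipeline object this might be called as `run_batch(pipeline, ["a.jpg", "b.jpg"], use_doc_unwarping=False)`, where the file names are placeholders.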
(3) Process the prediction results, where the prediction result for each sample is of dict type. Additionally, these results support operations such as printing, saving as an image, and saving as a json file.
| Method | Description | Parameter | Type | Parameter Description | Default |
|---|---|---|---|---|---|
| print() | Prints the results to the terminal | format_json | bool | Whether to format the output using JSON indentation | True |
| | | indent | int | Specifies the indentation level to beautify the output JSON data for better readability, effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Controls whether to escape non-ASCII characters as Unicode. When set to True, all non-ASCII characters will be escaped; False retains the original characters, effective only when format_json is True | False |
| save_to_json() | Saves the results as a JSON format file | save_path | str | The file path to save to. When it is a directory, the saved file is named consistently with the input file | None |
| | | indent | int | Specifies the indentation level to beautify the output JSON data for better readability, effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Controls whether to escape non-ASCII characters as Unicode. When set to True, all non-ASCII characters will be escaped; False retains the original characters, effective only when format_json is True | False |
| save_to_img() | Saves the results as an image format file | save_path | str | The file path to save to, supporting either a directory or a file path | None |
Calling the print() method will output the results to the terminal. The content printed to the terminal is explained as follows:
- input_path: (str) The input path of the image to be predicted.
- model_settings: (Dict[str, bool]) Model parameters required for configuring the pipeline.
    - use_doc_orientation_classify: (bool) Controls whether to enable the document orientation classification module.
    - use_doc_unwarping: (bool) Controls whether to enable the document unwarping module.
- angle: (int) The prediction result of the document orientation classification. When enabled, the values are [0, 90, 180, 270]; when not enabled, it is -1.
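As a rough illustration of these fields, the sketch below turns one result into a one-line summary. It assumes the result is already available as a plain dict shaped like the terminal output shown earlier; `summarize_result` is a hypothetical helper, not part of PaddleX.

```python
from typing import Any, Dict

# Angles the orientation classifier can report when it is enabled.
VALID_ANGLES = (0, 90, 180, 270)


def summarize_result(result: Dict[str, Any]) -> str:
    """Describe one result dict of the form {'res': {...}}."""
    res = result["res"]
    angle = res["angle"]
    if angle == -1:
        # -1 means the orientation classification module was not enabled.
        return f"{res['input_path']}: orientation classification disabled"
    if angle not in VALID_ANGLES:
        raise ValueError(f"unexpected angle: {angle}")
    unwarping = "on" if res["model_settings"]["use_doc_unwarping"] else "off"
    return f"{res['input_path']}: detected angle {angle} degrees, unwarping {unwarping}"


if __name__ == "__main__":
    sample = {
        "res": {
            "input_path": "doc_test_rotated.jpg",
            "model_settings": {
                "use_doc_orientation_classify": True,
                "use_doc_unwarping": True,
            },
            "angle": 180,
        }
    }
    print(summarize_result(sample))
```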
Calling the save_to_json() method will save the above content to the specified save_path. If a directory is specified, the path will be save_path/{your_img_basename}.json; if a file is specified, it will be saved directly to that file. Since JSON files do not support saving NumPy arrays, any numpy.array types will be converted to lists.
Calling the save_to_img() method will save the visualized results to the specified save_path. If a directory is specified, the path will be save_path/{your_img_basename}_doc_preprocessor_res_img.{your_img_extension}; if a file is specified, it will be saved directly to that file. (Since the pipeline typically includes multiple result images, it is not recommended to specify a specific file path directly, as multiple images may be overwritten, leaving only the last image.)
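The naming rules for directory-style save paths can be sketched with pathlib. This is an illustration only: `expected_output_paths` is a hypothetical helper, and it assumes that {your_img_basename} means the input file name without its extension.

```python
from pathlib import Path
from typing import Dict


def expected_output_paths(input_path: str, save_path: str) -> Dict[str, Path]:
    """Return where save_to_json() and save_to_img() would write.

    Applies the documented patterns for a directory save_path:
      save_path/{basename}.json
      save_path/{basename}_doc_preprocessor_res_img.{extension}
    Assumption: {basename} is the input file name without its extension.
    """
    source = Path(input_path)
    out_dir = Path(save_path)
    return {
        "json": out_dir / f"{source.stem}.json",
        # source.suffix already includes the leading dot, e.g. ".jpg"
        "img": out_dir / f"{source.stem}_doc_preprocessor_res_img{source.suffix}",
    }


if __name__ == "__main__":
    for kind, path in expected_output_paths("doc_test_rotated.jpg", "./output").items():
        print(kind, path)
```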
The prediction results and the visualized images can also be obtained through attributes, as detailed below:
| Attribute | Description |
|---|---|
| json | Retrieves the prediction results in json format |
| img | Retrieves visualized images in dict format |
- The json attribute retrieves the prediction results as a dictionary, consistent with the content saved by calling the save_to_json() method.
- The img attribute returns the visualized results as a dictionary. The key is preprocessed_img, and the corresponding value is an Image.Image object: a visualized image that displays the results of the doc_preprocessor pipeline.
Additionally, you can obtain the doc_preprocessor pipeline configuration file and load it for prediction. Run the following command to save the configuration file to my_path:
paddlex --get_pipeline_config doc_preprocessor --save_path ./my_path
Once you have the configuration file, you can customize the configuration of the doc_preprocessor pipeline. To load it, set the pipeline parameter of the create_pipeline method to the path of the configuration file. For example, if your configuration file is saved at ./my_path/doc_preprocessor.yaml, simply run:
from paddlex import create_pipeline
# Load the pipeline from the saved configuration file instead of by name
pipeline = create_pipeline(pipeline="./my_path/doc_preprocessor.yaml")
output = pipeline.predict(
    input="doc_test_rotated.jpg",
    use_doc_orientation_classify=True,
    use_doc_unwarping=True,
)
for res in output:
    res.print()
    res.save_to_img("./output/")
    res.save_to_json("./output/")
Note: The parameters in the configuration file are for pipeline initialization. If you wish to modify the initialization parameters for the doc_preprocessor pipeline, you can directly edit the parameters in the configuration file and load the file for prediction. Additionally, CLI prediction also supports passing in a configuration file; simply specify the path to the configuration file using --pipeline.
If the document image preprocessing pipeline meets your requirements for inference speed and accuracy, you can proceed directly with development integration/deployment.
If you need to apply the document image preprocessing pipeline directly to your Python project, you can refer to the sample code in 2.2 Python Script Method.
Additionally, PaddleX offers three other deployment methods, detailed as follows:
🚀 High-Performance Inference: In real production environments, many applications have stringent performance standards for deployment strategies, especially regarding response speed, to ensure efficient system operation and a smooth user experience. To address this, PaddleX provides a high-performance inference plugin designed to deeply optimize model inference and pre/post-processing, resulting in significant end-to-end process acceleration. For detailed high-performance inference procedures, please refer to the PaddleX High-Performance Inference Guide.
API Reference

For the main operations provided by the service, when a request is processed successfully, the response status code is 200, and the attributes of the response body are as follows:

| Name | Type | Meaning |
|---|---|---|
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Fixed as 0. |
| errorMsg | string | Error message. Fixed as "Success". |
| result | object | The result of the operation. |

When a request is not processed successfully, the attributes of the response body are as follows:

| Name | Type | Meaning |
|---|---|---|
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Same as the response status code. |
| errorMsg | string | Error message. |

The main operations provided by the service are as follows:

infer

Obtain the document image preprocessing results.

POST /document-preprocessing

The attributes of the request body are as follows:

| Name | Type | Meaning | Required |
|---|---|---|---|
| file | string | The URL of an image or PDF file accessible by the server, or the Base64-encoded content of the file. By default, for PDF files exceeding 10 pages, only the first 10 pages will be processed. | Yes |
| fileType | integer \| null | The type of the file. 0 for PDF files, 1 for image files. If this attribute is missing, the file type will be inferred from the URL. | No |
| useDocOrientationClassify | boolean \| null | Please refer to the description of the use_doc_orientation_classify parameter of the pipeline object's predict method. | No |
| useDocUnwarping | boolean \| null | Please refer to the description of the use_doc_unwarping parameter of the pipeline object's predict method. | No |

To remove the page limit for PDF files, add the following configuration to the pipeline configuration file:

    Serving:
      extra:
        max_num_input_imgs: null
The result in the response body has the following attributes:

| Name | Type | Meaning |
|---|---|---|
| docPreprocessingResults | array | Document image preprocessing results. The array length is 1 (for image input) or the actual number of document pages processed (for PDF input). For PDF input, each element in the array represents the result of each page actually processed in the PDF file. |
| dataInfo | object | Information about the input data. |
Each element in docPreprocessingResults is an object with the following attributes:

| Name | Type | Meaning |
|---|---|---|
| outputImage | string | The preprocessed image. The image is in PNG format and is Base64-encoded. |
| prunedResult | object | A simplified version of the res field in the JSON representation of the result generated by the pipeline object's predict method, excluding the input_path and the page_index fields. |
| docPreprocessingImage | string \| null | The visualization result image. The image is in JPEG format and is Base64-encoded. |
| inputImage | string \| null | The input image. The image is in JPEG format and is Base64-encoded. |
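The request attributes described above can be assembled with a small helper. The sketch below is illustrative only: `build_payload` and the two constants are hypothetical names, while the JSON keys are the ones listed in the request body description.

```python
import base64
from typing import Any, Dict, Optional

# fileType values from the request body description.
FILE_TYPE_PDF = 0
FILE_TYPE_IMAGE = 1


def build_payload(
    file_bytes: bytes,
    file_type: Optional[int] = None,
    use_doc_orientation_classify: Optional[bool] = None,
    use_doc_unwarping: Optional[bool] = None,
) -> Dict[str, Any]:
    """Build the JSON body for POST /document-preprocessing.

    Optional attributes left as None are omitted from the payload.
    """
    if file_type not in (None, FILE_TYPE_PDF, FILE_TYPE_IMAGE):
        raise ValueError("file_type must be 0 (PDF), 1 (image), or None")
    payload: Dict[str, Any] = {
        "file": base64.b64encode(file_bytes).decode("ascii"),
    }
    optional = {
        "fileType": file_type,
        "useDocOrientationClassify": use_doc_orientation_classify,
        "useDocUnwarping": use_doc_unwarping,
    }
    payload.update({key: value for key, value in optional.items() if value is not None})
    return payload
```

Since a missing fileType is inferred from the URL, it is probably safest to set it explicitly whenever Base64 content is sent instead of a URL.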
Multi-language Service Call Example
Python
import base64
import requests

API_URL = "http://localhost:8080/document-preprocessing"
file_path = "./demo.jpg"

# Read the local image and Base64-encode it for the request body
with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

payload = {"file": file_data, "fileType": 1}  # fileType 1 = image file

response = requests.post(API_URL, json=payload)
assert response.status_code == 200
result = response.json()["result"]
# Save the preprocessed image returned for each processed page
for i, res in enumerate(result["docPreprocessingResults"]):
    print(res["prunedResult"])
    output_img_path = f"out_{i}.png"
    with open(output_img_path, "wb") as f:
        f.write(base64.b64decode(res["outputImage"]))
    print(f"Output image saved at {output_img_path}")
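The example above checks only the HTTP status with an assert. A slightly more defensive client might inspect errorCode and errorMsg in the response body instead. The sketch below shows one possible approach; `ServiceError` and `extract_results` are hypothetical names, and the field names are those listed in the API reference.

```python
from typing import Any, Dict, List


class ServiceError(RuntimeError):
    """Raised when the service reports a non-zero errorCode."""


def extract_results(body: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return docPreprocessingResults from a parsed response body.

    A successful response has errorCode 0; any other value is treated
    as a failure and reported together with logId and errorMsg.
    """
    error_code = body.get("errorCode")
    if error_code != 0:
        raise ServiceError(
            f"request {body.get('logId')} failed with code {error_code}: "
            f"{body.get('errorMsg')}"
        )
    return body["result"]["docPreprocessingResults"]
```

With the requests example, this could be used as `extract_results(response.json())` in place of the bare assert.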
☁️ Service Deployment: Service deployment is a common form of deployment in real production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. PaddleX supports multiple pipeline service deployment solutions. For detailed pipeline service deployment procedures, please refer to the PaddleX Service Deployment Guide.
If the default model weights provided by the document image preprocessing pipeline do not meet your accuracy or speed requirements in your specific scenario, you can try to further fine-tune the existing model using data from your specific domain or application scenario to enhance the recognition performance of the document image preprocessing pipeline in your context.
Since the document image preprocessing pipeline consists of several modules, if the pipeline's performance does not meet expectations, it may be due to any one of these modules. You can analyze the images with poor recognition results to identify which module has issues, and then refer to the corresponding fine-tuning tutorial link in the table below to fine-tune the model.
| Situation | Fine-tuning model | Fine-tuning reference link |
|---|---|---|
| The overall image rotation correction is inaccurate. | Image orientation classification module | Link |
| The image distortion correction is inaccurate. | Image Unwarping | Fine-tuning is not supported at the moment. |
After completing fine-tuning training with a private dataset, you can obtain a local model weights file.
If you need to use the fine-tuned model weights, simply modify the pipeline configuration file by entering the local path of the fine-tuned model weights into the model_dir field in the pipeline configuration file.
......
DocOrientationClassify:
  module_name: doc_text_orientation
  model_name: PP-LCNet_x1_0_doc_ori
  model_dir: ./output/best_model/inference # Replace it with the path of the fine-tuned document image orientation classification model weights.
......
Then, refer to the command line method or Python script method in 2. Quick Start to load the modified pipeline configuration file.
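If you prefer to script this change rather than edit the file by hand, something along the following lines might work. It is a sketch under assumptions: `set_model_dir` and `update_config_file` are hypothetical helpers, PyYAML is assumed to be installed, and the module key is searched for recursively because the surrounding structure of the configuration file is elided above.

```python
from typing import Any


def set_model_dir(config: Any, module_key: str, model_dir: str) -> bool:
    """Find `module_key` anywhere in a nested dict and set its model_dir.

    Returns True if the key was found and updated, False otherwise.
    """
    if not isinstance(config, dict):
        return False
    for key, value in config.items():
        if key == module_key and isinstance(value, dict):
            value["model_dir"] = model_dir
            return True
        if set_model_dir(value, module_key, model_dir):
            return True
    return False


def update_config_file(path: str, module_key: str, model_dir: str) -> None:
    """Rewrite the YAML file in place (comments in the file are not preserved)."""
    import yaml  # PyYAML, imported lazily; assumed to be available

    with open(path, encoding="utf-8") as f:
        config = yaml.safe_load(f)
    if not set_model_dir(config, module_key, model_dir):
        raise KeyError(f"{module_key} not found in {path}")
    with open(path, "w", encoding="utf-8") as f:
        yaml.safe_dump(config, f, sort_keys=False, allow_unicode=True)
```

For the configuration saved earlier, a call could look like `update_config_file("./my_path/doc_preprocessor.yaml", "DocOrientationClassify", "./output/best_model/inference")`.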
PaddleX supports a variety of mainstream hardware devices such as NVIDIA GPU, Kunlunxin XPU, Ascend NPU, and Cambricon MLU. You can achieve seamless switching between different hardware by simply modifying the --device parameter.
For example, to run inference with the document image preprocessing pipeline on an Ascend NPU, use the following command:
paddlex --pipeline doc_preprocessor \
--input doc_test_rotated.jpg \
--use_doc_orientation_classify True \
--use_doc_unwarping True \
--save_path ./output \
--device npu:0
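When the command is launched from a script, switching hardware comes down to changing a single argument. The sketch below builds the same command line as an argument list; `build_cli_args` is a hypothetical helper, and the flags are exactly those used in the command above.

```python
import shlex
from typing import List


def build_cli_args(input_path: str, device: str, save_path: str = "./output") -> List[str]:
    """Build the paddlex command shown above for a given device string."""
    return [
        "paddlex",
        "--pipeline", "doc_preprocessor",
        "--input", input_path,
        "--use_doc_orientation_classify", "True",
        "--use_doc_unwarping", "True",
        "--save_path", save_path,
        "--device", device,  # e.g. "gpu:0", "npu:0", or "cpu"
    ]


if __name__ == "__main__":
    # Print the command; it could instead be run with
    # subprocess.run(args, check=True) where PaddleX is installed.
    args = build_cli_args("doc_test_rotated.jpg", "npu:0")
    print(shlex.join(args))
```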
If you want to use the document image preprocessing pipeline on more types of hardware, please refer to the PaddleX Multi-Hardware Usage Guide.