The General Document Translation Pipeline (PP-DocTranslation) is an intelligent document translation solution provided by PaddlePaddle. It integrates advanced general layout analysis technology with the capabilities of large language models (LLMs) to offer efficient, intelligent document translation services. The solution accurately identifies and extracts the various elements within a document, including text blocks, headings, paragraphs, images, tables, and other complex layout structures, and on this basis performs high-quality multilingual translation. PP-DocTranslation supports mutual translation among multiple mainstream languages and is particularly adept at handling documents with complex layouts and strong contextual dependencies, aiming to deliver accurate, natural, fluent, and professional translation results. The pipeline also provides flexible Serving deployment options, supporting multiple programming languages across various hardware. Moreover, it supports secondary development: you can train and fine-tune models on your own datasets, and the trained models can be integrated seamlessly.

The general document translation pipeline uses the PP-StructureV3 sub-pipeline and therefore has all the functions of PP-StructureV3. For more information on the functions and usage of PP-StructureV3, see the PP-StructureV3 Documentation page.
If you prioritize accuracy, choose a model with higher accuracy; if you prioritize inference speed, choose a model with faster inference; if you prioritize storage size, choose a model with a smaller footprint.
The inference time includes only the model inference time and excludes pre- and post-processing time.
Document image orientation classification module:
| Model | Model download link | Top-1 Acc (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-LCNet_x1_0_doc_ori | Inference model/Training model | 99.06 | 2.62 / 0.59 | 3.24 / 1.19 | 7 | A document image classification model based on PP-LCNet_x1_0, with four categories: 0 degrees, 90 degrees, 180 degrees, and 270 degrees |
Text image rectification module:
| Model | Model download link | CER | Model storage size (M) | Introduction |
|---|---|---|---|---|
| UVDoc | Inference model/Training model | 0.179 | 30.3 M | A high-precision text image rectification model |
Layout region detection module model:
| Model | Model download link | mAP(0.5) (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-DocLayout_plus-L | Inference model/Training model | 83.2 | 53.03 / 17.23 | 634.62 / 378.32 | 126.01 M | A higher-precision layout region localization model trained on a self-built dataset based on RT-DETR-L, covering scenarios such as Chinese and English papers, multi-column magazines, newspapers, PPTs, contracts, books, examination papers, research reports, ancient books, Japanese documents, and documents with vertical text. |
| PP-DocLayout-L | Inference model/Training model | 90.4 | 33.59 / 33.59 | 503.01 / 251.08 | 123.76 M | A high-precision layout region localization model trained on a self-built dataset based on RT-DETR-L, covering scenarios such as Chinese and English papers, magazines, contracts, books, examination papers, and research reports. |
| PP-DocLayout-M | Inference model/Training model | 75.2 | 13.03 / 4.72 | 43.39 / 24.44 | 22.578 | A layout region localization model with balanced precision and efficiency trained on a self-built dataset based on PicoDet-L, covering scenarios such as Chinese and English papers, magazines, contracts, books, examination papers, and research reports. |
| PP-DocLayout-S | Inference model/Training model | 70.9 | 11.54 / 3.86 | 18.53 / 6.29 | 4.834 | A highly efficient layout region localization model trained on a self-built dataset based on PicoDet-S, covering scenarios such as Chinese and English papers, magazines, contracts, books, examination papers, and research reports. |
Table structure recognition module:
| Model | Model download link | Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| SLANeXt_wired | Inference model/Training model | 69.65 | 85.92 / 85.92 | - / 501.66 | 351M | The SLANeXt series is a new generation of table structure recognition models independently developed by Baidu PaddlePaddle's vision team. Compared to SLANet and SLANet_plus, SLANeXt focuses on recognizing table structures and has trained dedicated weights for wired and wireless tables separately. Its recognition capabilities for various types of tables have been significantly improved, especially for wired tables. |
| SLANeXt_wireless | Inference model/Training model |
Table classification module model:
| Model | Model download link | Top-1 Acc (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) |
|---|---|---|---|---|---|
| PP-LCNet_x1_0_table_cls | Inference model/Training model | 94.2 | 2.62 / 0.60 | 3.17 / 1.14 | 6.6M |
Table cell detection module model:
| Model | Model download link | mAP (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| RT-DETR-L_wired_table_cell_det | Inference model/Training model | 82.7 | 33.47 / 27.02 | 402.55 / 256.56 | 124M | RT-DETR is the first real-time end-to-end object detection model. The Baidu PaddlePaddle Vision team completed pre-training on a self-built table cell detection dataset using RT-DETR-L as the base model, achieving table cell detection with good performance for both wired and wireless tables. |
| RT-DETR-L_wireless_table_cell_det | Inference model/Training model |
Text detection module:
| Model | Model download link | Detection Hmean (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-OCRv5_server_det | Inference model/Training model | 83.8 | 89.55 / 70.19 | 383.15 / 383.15 | 84.3 | The server-side text detection model of PP-OCRv5, with higher accuracy, suitable for deployment on servers with better performance |
| PP-OCRv5_mobile_det | Inference model/Training model | 79.0 | 10.67 / 6.36 | 57.77 / 28.15 | 4.7 | The mobile-side text detection model of PP-OCRv5, with higher efficiency, suitable for deployment on end-side devices |
| PP-OCRv4_server_det | Inference model/Training model | 69.2 | 127.82 / 98.87 | 585.95 / 489.77 | 109 | PP-OCRv4 server-side text detection model with higher accuracy, suitable for deployment on servers with better performance |
| PP-OCRv4_mobile_det | Inference model/Training model | 63.8 | 9.87 / 4.17 | 56.60 / 20.79 | 4.7 | PP-OCRv4 mobile-side text detection model with higher efficiency, suitable for deployment on edge devices |
| PP-OCRv3_mobile_det | Inference model/Training model | Accuracy close to PP-OCRv4_mobile_det | 9.90 / 3.60 | 41.93 / 20.76 | 2.1 | PP-OCRv3 mobile-side text detection model with higher efficiency, suitable for deployment on edge devices |
| PP-OCRv3_server_det | Inference model/Training model | Accuracy close to PP-OCRv4_server_det | 119.50 / 75.00 | 379.35 / 318.35 | 102.1 | PP-OCRv3 server-side text detection model with higher accuracy, suitable for deployment on servers with better performance |
Text recognition module models:
Chinese recognition models:
| Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-OCRv5_server_rec | Inference Model/Training Model | 86.38 | 8.46 / 2.36 | 31.21 / 31.21 | 81 M | PP-OCRv5_rec is a new-generation text recognition model. This model is dedicated to efficiently and accurately supporting the recognition of four major languages, namely Simplified Chinese, Traditional Chinese, English, and Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters, all within a single model. While maintaining recognition accuracy, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios. |
| PP-OCRv5_mobile_rec | Inference Model/Training Model | 81.29 | 5.43 / 1.46 | 21.20 / 5.32 | 16 M | |
| PP-OCRv4_server_rec_doc | Inference Model/Training Model | 86.58 | 8.69 / 2.78 | 37.93 / 37.93 | 74.7 M | PP-OCRv4_server_rec_doc is trained on a mixed dataset of more Chinese document data and PP-OCR training data, based on PP-OCRv4_server_rec. It has enhanced the ability to recognize some Traditional Chinese characters, Japanese characters, and special characters, supporting the recognition of over 15,000 characters. In addition to improving the document-related text recognition capabilities, it has also enhanced the recognition capabilities for general text. |
| PP-OCRv4_mobile_rec | Inference Model/Training Model | 78.74 | 5.26 / 1.12 | 17.48 / 3.61 | 10.6 M | Lightweight recognition model of PP-OCRv4 with high inference efficiency, which can be deployed on various hardware devices including edge devices |
| PP-OCRv4_server_rec | Inference model/Training model | 80.61 | 8.75 / 2.49 | 36.93 / 36.93 | 71.2 M | Server-side model of PP-OCRv4 with high inference accuracy, which can be deployed on various servers |
| PP-OCRv3_mobile_rec | Inference model/Training model | 72.96 | 3.89 / 1.16 | 8.72 / 3.56 | 9.2 M | Lightweight recognition model of PP-OCRv3 with high inference efficiency, which can be deployed on various hardware devices including edge devices |
| Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| ch_SVTRv2_rec | Inference model/Training model | 68.81 | 10.38 / 8.31 | 66.52 / 30.83 | 73.9 M | SVTRv2 is a server-side text recognition model developed by the OpenOCR team of the Vision and Learning Lab (FVL) at Fudan University. It won the first prize in the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task. The end-to-end recognition accuracy on Leaderboard A improved by 6% compared to PP-OCRv4. |
| Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| ch_RepSVTR_rec | Inference model/Training model | 65.07 | 6.29 / 1.57 | 20.64 / 5.40 | 22.1 M | The RepSVTR text recognition model is a mobile-end text recognition model based on SVTRv2. It won the first prize in the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task. Compared with PP-OCRv4, its end-to-end recognition accuracy on Leaderboard B has increased by 2.5%, while the inference speed remains the same. |
English recognition models:
| Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| en_PP-OCRv4_mobile_rec | Inference model/Training model | 70.39 | 4.81 / 1.23 | 17.20 / 4.18 | 6.8 M | An ultra-lightweight English recognition model trained based on the PP-OCRv4 recognition model, supporting English and numeric recognition |
| en_PP-OCRv3_mobile_rec | Inference model/Training model | 70.69 | 3.56 / 0.78 | 8.44 / 5.78 | 7.8 M | An ultra-lightweight English recognition model trained based on the PP-OCRv3 recognition model, supporting English and number recognition |
Multilingual recognition models:
| Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| korean_PP-OCRv3_mobile_rec | Inference model/Training model | 60.21 | 3.73 / 0.98 | 8.76 / 2.91 | 8.6 M | An ultra-lightweight Korean recognition model trained based on the PP-OCRv3 recognition model, supporting Korean and number recognition |
| japan_PP-OCRv3_mobile_rec | Inference model/Training model | 45.69 | 3.86 / 1.01 | 8.62 / 2.92 | 8.8 M | An ultra-lightweight Japanese recognition model trained based on the PP-OCRv3 recognition model, supporting Japanese and number recognition |
| chinese_cht_PP-OCRv3_mobile_rec | Inference model/Training model | 82.06 | 3.90 / 1.16 | 9.24 / 3.18 | 9.7 M | An ultra-lightweight traditional Chinese recognition model trained based on the PP-OCRv3 recognition model, supporting traditional Chinese and number recognition |
| te_PP-OCRv3_mobile_rec | Inference model/Training model | 95.88 | 3.59 / 0.81 | 8.28 / 6.21 | 7.8 M | An ultra-lightweight Telugu recognition model trained based on the PP-OCRv3 recognition model, supporting Telugu and digit recognition |
| ka_PP-OCRv3_mobile_rec | Inference model/Training model | 96.96 | 3.49 / 0.89 | 8.63 / 2.77 | 8.0 M | An ultra-lightweight Kannada recognition model trained based on the PP-OCRv3 recognition model, supporting Kannada and digit recognition |
| ta_PP-OCRv3_mobile_rec | Inference model/Training model | 76.83 | 3.49 / 0.86 | 8.35 / 3.41 | 8.0 M | An ultra-lightweight Tamil recognition model trained based on the PP-OCRv3 recognition model, supporting Tamil and digit recognition |
| latin_PP-OCRv3_mobile_rec | Inference model/Training model | 76.93 | 3.53 / 0.78 | 8.50 / 6.83 | 7.8 M | An ultra-lightweight Latin recognition model trained based on the PP-OCRv3 recognition model, supporting Latin and digit recognition |
| arabic_PP-OCRv3_mobile_rec | Inference model/Training model | 73.55 | 3.60 / 0.83 | 8.44 / 4.69 | 7.8 M | An ultra-lightweight Arabic letter recognition model trained based on the PP-OCRv3 recognition model, supporting Arabic letter and digit recognition |
| cyrillic_PP-OCRv3_mobile_rec | Inference model/Training model | 94.28 | 3.56 / 0.79 | 8.22 / 2.76 | 7.9 M | An ultra-lightweight Slavic letter recognition model trained based on the PP-OCRv3 recognition model, supporting Slavic letter and digit recognition |
| devanagari_PP-OCRv3_mobile_rec | Inference model/Training model | 96.44 | 3.60 / 0.78 | 6.95 / 2.87 | 7.9 M | An ultra-lightweight Devanagari letter recognition model trained based on the PP-OCRv3 recognition model, supporting Devanagari letter and digit recognition |
Text line orientation classification module (optional):
| Model | Model download link | Top-1 Acc (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-LCNet_x0_25_textline_ori | Inference model/Training model | 95.54 | 2.16 / 0.41 | 2.37 / 0.73 | 0.32 | A text line classification model based on PP-LCNet_x0_25, with two categories: 0 degrees and 180 degrees |
Formula recognition module:
| Model | Model download link | Avg-BLEU (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| UniMERNet | Inference model/Training model | 86.13 | 2266.96 / - | - / - | 1.4 G | UniMERNet is a formula recognition model developed by Shanghai AI Lab. It uses Donut Swin as the encoder and MBartDecoder as the decoder. Trained on a dataset of one million samples covering simple, complex, scanned, and handwritten formulas, it significantly improves recognition accuracy for formulas in real-world scenarios. |
| PP-FormulaNet-S | Inference model/Training model | 87.12 | 1311.84 / 1311.84 | - / 8288.07 | 167.9 M | PP-FormulaNet is an advanced formula recognition model developed by the Baidu PaddlePaddle Vision Team, supporting a vocabulary of 50,000 common LaTeX source-code tokens. The PP-FormulaNet-S version uses PP-HGNetV2-B4 as its backbone network and, through techniques such as parallel masking and model distillation, significantly improves inference speed while maintaining high recognition accuracy; it is suitable for simple printed formulas and simple printed formulas spanning multiple lines. The PP-FormulaNet-L version is based on the Vary_VIT_B backbone and has undergone in-depth training on a large-scale formula dataset; it shows significant improvement over PP-FormulaNet-S in recognizing complex formulas and is suitable for simple printed, complex printed, and handwritten formulas. |
| PP-FormulaNet-L | Inference model/Training model | 92.13 | 1976.52 / - | - / - | 535.2 M | |
| LaTeX_OCR_rec | Inference model/Training model | 71.63 | 1088.89 / 1088.89 | - / - | 89.7 M | LaTeX-OCR is a formula recognition algorithm based on an autoregressive large model. It uses Hybrid ViT as the backbone network and a transformer as the decoder, significantly improving the accuracy of formula recognition. |
Seal text detection module:
| Model | Model download link | Detection Hmean (%) | GPU inference time (ms) [Normal mode / High-performance mode] | CPU inference time (ms) [Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-OCRv4_server_seal_det | Inference model/Training model | 98.21 | 124.64 / 91.57 | 545.68 / 439.86 | 109 | The server-side seal text detection model of PP-OCRv4, with higher accuracy, suitable for deployment on servers with better performance |
| PP-OCRv4_mobile_seal_det | Inference model/Training model | 96.47 | 9.70 / 3.56 | 50.38 / 19.64 | 4.6 | The mobile-side seal text detection model of PP-OCRv4, with higher efficiency, suitable for deployment on edge devices |
| Mode | GPU configuration | CPU configuration | Combination of acceleration technologies |
|---|---|---|---|
| Regular mode | FP32 precision / No TRT acceleration | FP32 precision / 8 threads | PaddleInference |
| High-performance mode | Selects the optimal prior combination of precision type and acceleration strategy | FP32 precision / 8 threads | Selects the optimal prior backend (Paddle/OpenVINO/TRT, etc.) |
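In practice, the high-performance mode above roughly corresponds to enabling the high-performance inference plugin when instantiating the pipeline; a minimal sketch using the use_hpip parameter documented later on this page (the plugin must be installed separately, see the PaddleX High-Performance Inference Guide):

```python
from paddlex import create_pipeline

# Enable the high-performance inference plugin for this pipeline instance
pipeline = create_pipeline(pipeline="PP-DocTranslation", use_hpip=True)
```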
Before using the PP-DocTranslation pipeline locally, ensure that you have completed the installation of PaddleX (refer to the Installation Guide). This pipeline depends on the dependency group named translation.
Before use, you need to prepare the API key for a large language model; the Baidu Cloud Qianfan Platform and locally deployed large model services that comply with the OpenAI interface standard are supported.
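The Qianfan configuration is shown in the example below. For a locally deployed LLM service that follows the OpenAI interface standard, the configuration takes the same shape; a minimal sketch in which base_url, model_name, and api_key are placeholders for your own service:

```python
# Hypothetical configuration for a local OpenAI-compatible LLM service;
# replace base_url, model_name, and api_key with your own values.
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "my-local-model",           # placeholder model name
    "base_url": "http://127.0.0.1:8000/v1",   # placeholder local endpoint
    "api_type": "openai",
    "api_key": "EMPTY",                       # some local services accept any key
}
```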
```python
from paddlex import create_pipeline

# Create a translation pipeline
pipeline = create_pipeline(pipeline="PP-DocTranslation")

# Document path
input_path = "document_sample.pdf"

# Output directory
output_path = "./output"

# Large model configuration
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

if input_path.lower().endswith(".md"):
    # Read Markdown documents; directories and URL links with the .md suffix are also supported
    ori_md_info_list = pipeline.load_from_markdown(input_path)
else:
    # Use PP-StructureV3 to perform layout analysis on PDF/image documents and obtain Markdown information
    visual_predict_res = pipeline.visual_predict(
        input_path,
        use_doc_orientation_classify=False,
        use_doc_unwarping=False,
        use_common_ocr=True,
        use_seal_recognition=True,
        use_table_recognition=True,
    )
    ori_md_info_list = []
    for res in visual_predict_res:
        layout_parsing_result = res["layout_parsing_result"]
        ori_md_info_list.append(layout_parsing_result.markdown)
        layout_parsing_result.save_to_img(output_path)
        layout_parsing_result.save_to_markdown(output_path)

# Concatenate the Markdown information of multi-page documents into a single
# Markdown file and save the merged original Markdown
if input_path.lower().endswith(".pdf"):
    ori_md_info = pipeline.concatenate_markdown_pages(ori_md_info_list)
    ori_md_info.save_to_markdown(output_path)

# Perform document translation (target language: English)
tgt_md_info_list = pipeline.translate(
    ori_md_info_list=ori_md_info_list,
    target_language="en",
    chunk_size=5000,
    chat_bot_config=chat_bot_config,
)

# Save the translation results
for tgt_md_info in tgt_md_info_list:
    tgt_md_info.save_to_markdown(output_path)
```
After executing the above code, you will obtain the parsed results of the original document to be translated, the Markdown file of the original document to be translated, and the Markdown file of the translated document, all saved in the output folder.
create_pipeline instantiates a pipeline object, with the following parameters:

| Parameter | Description | Parameter Type | Default Value |
|---|---|---|---|
| pipeline | Pipeline name or configuration file path (set to "PP-DocTranslation") | str | None |
| device | Inference device (such as "gpu:0", "npu:0", "cpu", etc.) | str | gpu |
| use_hpip | Whether to enable the high-performance inference plugin | bool\|None | None |
| hpi_config | High-performance inference configuration | dict\|None | None |
| initial_predictor | Whether to initialize the inference module | bool | True |
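For example, instantiating the pipeline on a specific device (a minimal sketch using the parameters above):

```python
from paddlex import create_pipeline

# Pin the pipeline to the first GPU; "cpu" or "npu:0" work the same way
pipeline = create_pipeline(pipeline="PP-DocTranslation", device="gpu:0")
```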
Use the visual_predict() method of the PP-DocTranslation pipeline object to obtain visual prediction results. This method returns a generator. The parameters of the visual_predict() method are described below:

| Parameter | Description | Parameter Type | Default Value |
|---|---|---|---|
| input | Data to be predicted; supports multiple input types. Required. | Python Var\|str\|list | None |
| use_doc_orientation_classify | Whether to use the document orientation classification module | bool\|None | None |
| use_doc_unwarping | Whether to use the document unwarping module | bool\|None | None |
| use_textline_orientation | Whether to use the text line orientation classification module | bool\|None | None |
| use_general_ocr | Whether to use the OCR sub-pipeline | bool\|None | None |
| use_seal_recognition | Whether to use the seal recognition sub-pipeline | bool\|None | None |
| use_table_recognition | Whether to use the table recognition sub-pipeline | bool\|None | None |
| use_formula_recognition | Whether to use the formula recognition sub-pipeline | bool\|None | False |
| use_chart_recognition | Whether to use the chart recognition sub-pipeline | bool\|None | None |
| use_region_detection | Whether to use the document region detection module | bool\|None | None |
| layout_threshold | Score threshold for the layout model | float\|dict\|None | None |
| layout_nms | Whether the layout region detection model uses NMS post-processing | bool\|None | None |
| layout_unclip_ratio | Expansion coefficient of the detection box for the layout region detection model | float\|list\|dict\|None | None |
| layout_merge_bboxes_mode | Filtering method for overlapping boxes in layout region detection | str\|dict\|None | None |
| text_det_limit_side_len | Image side length limit for text detection | int\|None | None |
| text_det_limit_type | Image side length limit type for text detection | str\|None | None |
| text_det_thresh | Detection pixel threshold; only pixels with scores greater than this threshold in the output probability map are considered text pixels | float\|None | None |
| text_det_box_thresh | Detection box threshold; a detection result is considered a text area when the average score of all pixels within its border exceeds this threshold | float\|None | None |
| text_det_unclip_ratio | Text detection expansion coefficient, used to expand the text area; the larger the value, the larger the expanded area | float\|None | None |
| text_rec_score_thresh | Text recognition threshold; text results with scores greater than this threshold are retained | float\|None | None |
| seal_det_limit_side_len | Image side length limit for seal detection | int\|None | None |
| seal_det_limit_type | Image side length limit type for seal detection | str\|None | None |
| seal_det_thresh | Detection pixel threshold; only pixels with scores greater than this threshold in the output probability map are considered seal pixels | float\|None | None |
| seal_det_box_thresh | Detection box threshold; a detection result is considered a seal area when the average score of all pixels within its bounding box exceeds this threshold | float\|None | None |
| seal_det_unclip_ratio | Seal detection expansion coefficient, used to expand the seal area; the larger the value, the larger the expanded area | float\|None | None |
| seal_rec_score_thresh | Seal recognition threshold; text results with scores greater than this threshold are retained | float\|None | None |
| use_wired_table_cells_trans_to_html | Whether to enable direct conversion of wired table cell detection results to HTML; if enabled, HTML is constructed directly from the geometric relationships of wired table cell detection results | bool | False |
| use_wireless_table_cells_trans_to_html | Whether to enable direct conversion of wireless table cell detection results to HTML; if enabled, HTML is constructed directly from the geometric relationships of wireless table cell detection results | bool | False |
| use_table_orientation_classify | Whether to enable table orientation classification; when enabled, tables rotated by 90/180/270 degrees in the image can be corrected and recognized correctly | bool\|None | True |
| use_ocr_results_with_table_cells | Whether to enable cell-segmented OCR; when enabled, OCR detection results are segmented and re-recognized based on cell prediction results to avoid missing text | bool\|None | True |
| use_e2e_wired_table_rec_model | Whether to enable the end-to-end wired table recognition mode; when enabled, the cell detection model is not used and only the table structure recognition model is used | bool\|None | False |
| use_e2e_wireless_table_rec_model | Whether to enable the end-to-end wireless table recognition mode; when enabled, the cell detection model is not used and only the table structure recognition model is used | bool\|None | True |
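For example, the optional modules and detection thresholds above can be tuned per call; a minimal sketch in which the threshold values are illustrative rather than recommendations:

```python
# Illustrative call: skip document preprocessing and tighten text detection.
visual_predict_res = pipeline.visual_predict(
    "document_sample.pdf",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    text_det_box_thresh=0.6,    # keep boxes whose mean pixel score exceeds 0.6
    text_rec_score_thresh=0.5,  # drop low-confidence recognition results
    use_table_recognition=True,
)
```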
The prediction results support the following methods for printing and saving, for example as a json file:

| Method | Method Description | Parameter | Parameter Type | Parameter Description | Default Value |
|---|---|---|---|---|---|
| print() | Print the result to the terminal | format_json | bool | Whether to format the output content with JSON indentation | True |
| | | indent | int | Specify the indentation level to beautify the output JSON data and make it more readable; valid only when format_json is True | 4 |
| | | ensure_ascii | bool | Control whether non-ASCII characters are escaped to Unicode; when True, all non-ASCII characters are escaped, while False retains the original characters; valid only when format_json is True | False |
| save_to_json() | Save the result as a JSON-format file | save_path | str | The file path for saving; when it is a directory, the saved file name is consistent with the input file name | None |
| | | indent | int | Specify the indentation level to beautify the output JSON data and make it more readable; valid only when format_json is True | 4 |
| | | ensure_ascii | bool | Control whether non-ASCII characters are escaped to Unicode; when True, all non-ASCII characters are escaped, while False retains the original characters; valid only when format_json is True | False |
| save_to_img() | Save the visualization images of each intermediate module in PNG format | save_path | str | The file path for saving; supports a directory or file path | None |
| save_to_markdown() | Save each page of an image or PDF file as a separate Markdown-format file | save_path | str | The file path for saving; supports a directory or file path | None |
| save_to_html() | Save the tables in the file in HTML format | save_path | str | The file path for saving; supports a directory or file path | None |
| save_to_xlsx() | Save the tables in the file in XLSX format | save_path | str | The file path for saving; supports a directory or file path | None |
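For example, combining several of these methods on each visual prediction result:

```python
for res in visual_predict_res:
    layout_parsing_result = res["layout_parsing_result"]
    # Pretty-print to the terminal, keeping non-ASCII characters readable
    layout_parsing_result.print(format_json=True, indent=4, ensure_ascii=False)
    layout_parsing_result.save_to_json("./output")      # structured result
    layout_parsing_result.save_to_markdown("./output")  # per-page Markdown
```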
Use the translate() method to perform document translation. This method returns the original Markdown text and the translated text as markdown objects; you can save the parts you need locally with the save_to_markdown() method. The parameters of the translate() method are described below:

| Parameter | Description | Parameter Type | Options | Default Value |
|---|---|---|---|---|
| ori_md_info_list | A list of data in the original Markdown format containing the content to be translated | List[Dict] | Must be a list of dictionaries, each representing a document block | No default (required) |
| target_language | Target translation language code | str | ISO 639-1 language code (such as "en"/"ja"/"fr") | "zh" |
| chunk_size | Character threshold for chunking the text to be translated | int | An integer greater than 0 | 5000 |
| task_description | Custom task description prompt | str\|None | | None |
| output_format | Specified output format requirements | str\|None | | None |
| rules_str | Custom translation rule description | str\|None | | None |
| few_shot_demo_text_content | Example text content for few-shot learning | str\|None | | None |
| few_shot_demo_key_value_list | Structured few-shot example data | str\|None | | None |
| glossary | Glossary of technical terms | dict\|None | | None |
| llm_request_interval | Time interval in seconds between requests sent to the large language model; useful to prevent overly frequent calls | float | A floating-point number greater than or equal to 0 | 0.0 |
| chat_bot_config | Large language model configuration | dict\|None | | None |
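As an illustration, the optional prompt-related parameters can be combined freely; a minimal sketch in which the glossary entry and request interval are placeholder values:

```python
tgt_md_info_list = pipeline.translate(
    ori_md_info_list=ori_md_info_list,
    target_language="en",
    chunk_size=5000,
    glossary={"产线": "pipeline"},  # hypothetical term mapping
    llm_request_interval=1.0,       # wait 1 s between LLM requests
    chat_bot_config=chat_bot_config,
)
```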
In addition, you can obtain the configuration file of the PP-DocTranslation pipeline and load it for prediction. You can execute the following command to save the configuration file in my_path:
paddlex --get_pipeline_config PP-DocTranslation --save_path ./my_path
Once you have the configuration file, you can customize the PP-DocTranslation pipeline by simply setting the pipeline parameter of the create_pipeline method to the path of the pipeline configuration file. Here is an example:
```python
from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="./my_path/PP-DocTranslation.yaml")

# Document path
img_path = "document_sample.pdf"

# Output directory
output_path = "./output"

# Large model configuration
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key",  # Replace with the actual API key
}

# Perform layout analysis
visual_predict_res = pipeline.visual_predict(
    img_path,
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_common_ocr=True,
    use_seal_recognition=True,
    use_table_recognition=True,
)

# Extract the original document structure information
ori_md_info_list = []
for res in visual_predict_res:
    layout_parsing_result = res["layout_parsing_result"]
    ori_md_info_list.append(layout_parsing_result.markdown)
    layout_parsing_result.print()
    layout_parsing_result.save_to_img(output_path)
    layout_parsing_result.save_to_json(output_path)

# Document translation
tgt_md_info_list = pipeline.translate(
    ori_md_info_list=ori_md_info_list,
    target_language="en",
    chunk_size=5000,
    chat_bot_config=chat_bot_config,
)

# Save the translation results
for tgt_md_info in tgt_md_info_list:
    tgt_md_info.save_to_markdown(output_path)
```
Note: The parameters in the configuration file are pipeline initialization parameters. If you want to change the initialization parameters of the PP-DocTranslation pipeline, you can directly modify the parameters in the configuration file and load it for prediction. Meanwhile, CLI prediction also supports passing in a configuration file; specify its path with --pipeline.
If the pipeline can meet your requirements for inference speed and accuracy, you can directly proceed with development integration/deployment.
If you need to directly apply the pipeline to your Python project, you can refer to the sample code in 2.2 Integration via Python Script.
In addition, PaddleX also provides three other deployment methods, which are described in detail below:
🚀High-performance inference: In real production environments, many applications have stringent performance metrics (especially response speed) for deployment strategies to ensure efficient system operation and smooth user experience. To this end, PaddleX provides a high-performance inference plugin designed to deeply optimize model inference and pre/post-processing, achieving significant acceleration of the end-to-end process. For detailed information on the high-performance inference process, please refer to the PaddleX High-Performance Inference Guide.
☁️Serving: Serving is a common form of deployment in actual production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. PaddleX supports multiple pipeline Serving deployment solutions. For detailed pipeline Serving deployment procedures, please refer to the PaddleX Serving Deployment Guide.
The following is the API reference for basic Serving and examples of multilingual service invocation:
The main operations provided by the service are as follows:

- analyzeImages: Analyze images using computer vision models to obtain OCR results, table recognition results, etc.
- translate: Translate documents using large models.

API reference:
For a successful request, the response status code is 200, and the attributes of the response body are as follows:

| Name | Type | Meaning |
|---|---|---|
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Fixed as 0. |
| errorMsg | string | Error description. Fixed as "Success". |
| result | object | Operation result. |

For an unsuccessful request, the attributes of the response body are as follows:

| Name | Type | Meaning |
|---|---|---|
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Same as the response status code. |
| errorMsg | string | Error description. |
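A minimal sketch of checking this response envelope from a Python client, assuming a request body payload built as in the analyzeImages section below:

```python
import requests

resp = requests.post("http://127.0.0.1:8080/doctrans-visual", json=payload)
body = resp.json()
if resp.status_code == 200 and body["errorCode"] == 0:
    result = body["result"]   # operation result, as documented above
else:
    print(body["errorMsg"])   # error description
```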
analyzeImages

POST /doctrans-visual

Attributes of the request body:

| Name | Type | Meaning | Required |
|---|---|---|---|
| file | string | The URL of an image file or PDF file accessible to the server, or the Base64-encoded content of such a file. By default, for PDF files with more than 10 pages, only the first 10 pages are processed. To remove the page limit, add the following configuration to the pipeline configuration file: `Serving: {extra: {max_num_input_imgs: null}}` | Yes |
| fileType | integer\|null | File type. 0 indicates a PDF file, 1 indicates an image file. If this attribute is absent from the request body, the file type is inferred from the URL. | No |
| useDocOrientationClassify | boolean\|null | See the description of the use_doc_orientation_classify parameter of the pipeline object's visual_predict method. | No |
| useDocUnwarping | boolean\|null | See the description of the use_doc_unwarping parameter of the pipeline object's visual_predict method. | No |
| useTextlineOrientation | boolean\|null | See the description of the use_textline_orientation parameter of the pipeline object's visual_predict method. | No |
| useSealRecognition | boolean\|null | See the description of the use_seal_recognition parameter of the pipeline object's visual_predict method. | No |
| useTableRecognition | boolean\|null | See the description of the use_table_recognition parameter of the pipeline object's visual_predict method. | No |
| useFormulaRecognition | boolean\|null | See the description of the use_formula_recognition parameter of the pipeline object's visual_predict method. | No |
| useChartRecognition | boolean\|null | See the description of the use_chart_recognition parameter of the pipeline object's visual_predict method. | No |
| useRegionDetection | boolean\|null | See the description of the use_region_detection parameter of the pipeline object's visual_predict method. | No |
| layoutThreshold | number\|object\|null | See the description of the layout_threshold parameter of the pipeline object's visual_predict method. | No |
| layoutNms | boolean\|null | See the description of the layout_nms parameter of the pipeline object's visual_predict method. | No |
| layoutUnclipRatio | number\|array\|object\|null | See the description of the layout_unclip_ratio parameter of the pipeline object's visual_predict method. | No |
| layoutMergeBboxesMode | string\|object\|null | See the description of the layout_merge_bboxes_mode parameter of the pipeline object's visual_predict method. | No |
| textDetLimitSideLen | integer\|null | See the description of the text_det_limit_side_len parameter of the pipeline object's visual_predict method. | No |
| textDetLimitType | string\|null | See the description of the text_det_limit_type parameter of the pipeline object's visual_predict method. | No |
| textDetThresh | number\|null | See the description of the text_det_thresh parameter of the pipeline object's visual_predict method. | No |
| textDetBoxThresh | number\|null | See the description of the text_det_box_thresh parameter of the pipeline object's visual_predict method. | No |
| textDetUnclipRatio | number\|null | See the description of the text_det_unclip_ratio parameter of the pipeline object's visual_predict method. | No |
| textRecScoreThresh | number\|null | See the description of the text_rec_score_thresh parameter of the pipeline object's visual_predict method. | No |
| sealDetLimitSideLen | integer\|null | See the description of the seal_det_limit_side_len parameter of the pipeline object's visual_predict method. | No |
| sealDetLimitType | string\|null | See the description of the seal_det_limit_type parameter of the pipeline object's visual_predict method. | No |
| sealDetThresh | number\|null | See the description of the seal_det_thresh parameter of the pipeline object's visual_predict method. | No |
| sealDetBoxThresh | number\|null | See the description of the seal_det_box_thresh parameter of the pipeline object's visual_predict method. | No |
| sealDetUnclipRatio | number\|null | See the description of the seal_det_unclip_ratio parameter of the pipeline object's visual_predict method. | No |
| sealRecScoreThresh | number\|null | See the description of the seal_rec_score_thresh parameter of the pipeline object's visual_predict method. | No |
| useWiredTableCellsTransToHtml | boolean | See the description of the use_wired_table_cells_trans_to_html parameter of the pipeline object's visual_predict method. | No |
| useWirelessTableCellsTransToHtml | boolean | See the description of the use_wireless_table_cells_trans_to_html parameter of the pipeline object's visual_predict method. | No |
| useTableOrientationClassify | boolean | See the description of the use_table_orientation_classify parameter of the pipeline object's visual_predict method. | No |
| useOcrResultsWithTableCells | boolean | See the description of the use_ocr_results_with_table_cells parameter of the pipeline object's visual_predict method. | No |
| useE2eWiredTableRecModel | boolean | See the description of the use_e2e_wired_table_rec_model parameter of the pipeline object's visual_predict method. | No |
| useE2eWirelessTableRecModel | boolean | See the description of the use_e2e_wireless_table_rec_model parameter of the pipeline object's visual_predict method. | No |
| visualize | boolean\|null | Whether to return visualization result images and intermediate images produced during processing. true: return images; false: do not return images; null: follow the Serving.visualize setting in the pipeline configuration file. For example, adding `Serving: {visualize: False}` to the pipeline configuration file means images are not returned by default, and a visualize value in the request body overrides this. If neither the request body nor the configuration file sets it (or null is passed in the request body and nothing is set in the configuration file), images are returned by default. | No |
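A minimal sketch of building an analyzeImages request body from a local PDF using the fields documented above (the service address is a placeholder):

```python
import base64

import requests

with open("document_sample.pdf", "rb") as f:
    payload = {
        "file": base64.b64encode(f.read()).decode("ascii"),
        "fileType": 0,            # 0 = PDF, 1 = image
        "useDocUnwarping": False,
        "visualize": False,       # skip visualization images in the response
    }
resp = requests.post("http://127.0.0.1:8080/doctrans-visual", json=payload)
```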
result in the response body has the following attributes:

| Name | Type | Meaning |
|---|---|---|
| layoutParsingResults | array | Layout parsing results. The array length is 1 (for image input) or the actual number of document pages processed (for PDF input); for PDF input, each element represents the result of one processed page in the PDF file, in order. |
| dataInfo | object | Input data information. |

Each element in layoutParsingResults is an object with the following attributes:

| Name | Type | Meaning |
|---|---|---|
| prunedResult | object | A simplified version of the res field in the JSON representation of the layout_parsing_result generated by the visual_predict method of the pipeline object, with the input_path and page_index fields removed. |
| markdown | object | Markdown result. |
| outputImages | object\|null | See the description of the img attribute of the pipeline prediction results. The images are in JPEG format and Base64-encoded. |
| inputImage | string\|null | Input image. The image is in JPEG format and Base64-encoded. |

markdown is an object with the following attributes:

| Name | Type | Meaning |
|---|---|---|
| text | string | Markdown text. |
| images | object | Key-value pairs of Markdown image relative paths and Base64-encoded images. |
| isStart | boolean | Whether the first element of the current page is the start of a paragraph. |
| isEnd | boolean | Whether the last element of the current page is the end of a paragraph. |
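The isStart and isEnd flags let a client decide whether consecutive pages should be joined without a paragraph break; a hedged sketch of one possible stitching strategy (this helper is illustrative, not part of the service API):

```python
def stitch_pages(pages: list) -> str:
    """Join per-page Markdown, merging paragraphs split across page boundaries."""
    parts = []
    for i, page in enumerate(pages):
        if i > 0 and pages[i - 1].get("isEnd") is False and page.get("isStart") is False:
            parts.append(page["text"])           # paragraph continues across pages
        else:
            parts.append("\n\n" + page["text"])  # start a new block
    return "".join(parts).strip()
```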
translate

POST /doctrans-translate

Attributes of the request body:

| Name | Type | Meaning | Required |
|---|---|---|---|
| markdownList | array | A list of Markdown data to be translated. It can be obtained from the result of the analyzeImages operation. The images attribute is not used. | Yes |
| targetLanguage | string | See the description of the target_language parameter of the pipeline object's translate method. | No |
| chunkSize | integer | See the description of the chunk_size parameter of the pipeline object's translate method. | No |
| taskDescription | string\|null | See the description of the task_description parameter of the pipeline object's translate method. | No |
| outputFormat | string\|null | See the description of the output_format parameter of the pipeline object's translate method. | No |
| rulesStr | string\|null | See the description of the rules_str parameter of the pipeline object's translate method. | No |
| fewShotDemoTextContent | string\|null | See the description of the few_shot_demo_text_content parameter of the pipeline object's translate method. | No |
| fewShotDemoKeyValueList | string\|null | See the description of the few_shot_demo_key_value_list parameter of the pipeline object's translate method. | No |
| llmRequestInterval | number\|null | See the description of the llm_request_interval parameter of the pipeline object's translate method. | No |
| chatBotConfig | object\|null | See the description of the chat_bot_config parameter of the pipeline object's translate method. | No |
result has the following attributes:

| Name | Type | Meaning |
|---|---|---|
| translationResults | array | Translation results. |

Each element in translationResults is an object with the following attributes:

| Name | Type | Meaning |
|---|---|---|
| language | string | Target language. |
| markdown | object | Markdown result. The object definition is consistent with the markdown returned by the analyzeImages operation. |

Note: Including sensitive parameters such as the API key for large model calls in the request body may pose security risks. If not necessary, set these parameters in the configuration file and do not pass them with the request.
Example of multilingual service invocation
```python
import base64
import pathlib
import pprint
import sys

import requests

API_BASE_URL = "http://127.0.0.1:8080"

file_path = "./demo.jpg"
target_language = "en"

with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

payload = {
    "file": file_data,
    "fileType": 1,
}
resp_visual = requests.post(url=f"{API_BASE_URL}/doctrans-visual", json=payload)
if resp_visual.status_code != 200:
    print(
        f"Request to doctrans-visual failed with status code {resp_visual.status_code}."
    )
    pprint.pp(resp_visual.json())
    sys.exit(1)
result_visual = resp_visual.json()["result"]

markdown_list = []
for i, res in enumerate(result_visual["layoutParsingResults"]):
    md_dir = pathlib.Path(f"markdown_{i}")
    md_dir.mkdir(exist_ok=True)
    (md_dir / "doc.md").write_text(res["markdown"]["text"])
    for img_path, img in res["markdown"]["images"].items():
        img_path = md_dir / img_path
        img_path.parent.mkdir(parents=True, exist_ok=True)
        img_path.write_bytes(base64.b64decode(img))
    print(f"The Markdown document to be translated is saved at {md_dir / 'doc.md'}")
    del res["markdown"]["images"]
    markdown_list.append(res["markdown"])
    for img_name, img in res["outputImages"].items():
        img_path = f"{img_name}_{i}.jpg"
        with open(img_path, "wb") as f:
            f.write(base64.b64decode(img))
        print(f"Output image saved at {img_path}")

payload = {
    "markdownList": markdown_list,
    "targetLanguage": target_language,
}
resp_translate = requests.post(url=f"{API_BASE_URL}/doctrans-translate", json=payload)
if resp_translate.status_code != 200:
    print(
        f"Request to doctrans-translate failed with status code {resp_translate.status_code}."
    )
    pprint.pprint(resp_translate.json())
    sys.exit(1)
result_translate = resp_translate.json()["result"]
for i, res in enumerate(result_translate["translationResults"]):
    md_dir = pathlib.Path(f"markdown_{i}")
    (md_dir / "doc_translated.md").write_text(res["markdown"]["text"])
    print(f"Translated markdown document saved at {md_dir / 'doc_translated.md'}")
```
📱On-device deployment: On-device deployment is a method that places computing and data processing functions on the user's device itself, allowing the device to process data directly without relying on a remote server. PaddleX supports deploying models on on-device devices such as Android. For detailed on-device deployment procedures, please refer to the PaddleX On-device Deployment Guide.
You can choose an appropriate way to deploy the model pipeline according to your needs, and then proceed with subsequent AI application integration.
If the default model weights provided by the PP-StructureV3 sub-pipeline in the general document translation pipeline do not meet your accuracy or speed requirements, you can try to further fine-tune the existing models using data from your own specific domain or application scenario to improve recognition performance in your scenario.
Since the PP-StructureV3 sub-pipeline contains several modules, subpar pipeline performance may stem from any one of them. You can analyze cases with poor extraction results, use the visualized images to identify which module is problematic, and refer to the corresponding fine-tuning tutorial link in the following table to fine-tune the model.
| Scenario | Fine-tuning module | Fine-tuning reference link |
|---|---|---|
| Inaccurate detection of layout areas, such as failure to detect seals and tables | Layout area detection module | Link |
| Inaccurate recognition of table structures | Table structure recognition module | Link |
| Inaccurate recognition of formulas | Formula recognition module | Link |
| Omission in detecting seal text | Seal text detection module | Link |
| Omission in detecting text | Text detection module | Link |
| Inaccurate text content | Text recognition module | Link |
| Inaccurate correction of vertical or rotated text lines | Text line orientation classification module | Link |
| Inaccurate correction of whole image rotation | Document image orientation classification module | Link |
| Inaccurate correction of image distortion | Text image correction module | Fine-tuning is temporarily not supported |
After completing fine-tuning training with your private dataset, you can obtain a local model weight file.
If you need to use the fine-tuned model weights, simply modify the pipeline configuration file by replacing the local path of the fine-tuned model weights into the corresponding location in the pipeline configuration file:
```yaml
......
SubModules:
  LayoutDetection:
    module_name: layout_detection
    model_name: PP-DocLayout_plus-L
    model_dir: null # Replace with the path to the weights of the fine-tuned layout region detection model
......
SubPipelines:
  GeneralOCR:
    pipeline_name: OCR
    text_type: general
    use_doc_preprocessor: False
    use_textline_orientation: False
    SubModules:
      TextDetection:
        module_name: text_detection
        model_name: PP-OCRv5_server_det
        model_dir: null # Replace with the path to the weights of the fine-tuned text detection model
        limit_side_len: 960
        limit_type: max
        max_side_limit: 4000
        thresh: 0.3
        box_thresh: 0.6
        unclip_ratio: 1.5
      TextRecognition:
        module_name: text_recognition
        model_name: PP-OCRv5_server_rec
        model_dir: null # Replace with the path to the weights of the fine-tuned text recognition model
        batch_size: 1
        score_thresh: 0
......
```
Then, refer to the command-line method or Python script method in the local experience to load the modified pipeline configuration file.
PaddleX supports multiple mainstream hardware devices such as NVIDIA GPU, Kunlunxin XPU, Ascend NPU, and Cambrian MLU, and only requires setting the device parameter to achieve seamless switching between different hardware.
For example, when using the PP-DocTranslation pipeline, to change the running device from an NVIDIA GPU to an Ascend NPU, you only need to modify device to npu in the script:
```python
from paddlex import create_pipeline

pipeline = create_pipeline(
    pipeline="PP-DocTranslation",
    device="npu:0",  # change gpu:0 to npu:0
)
```
If you want to use the general document translation pipeline on more types of hardware, please refer to the PaddleX Multi-Hardware Usage Guide.