---
comments: true
---

# PP-DocTranslation Pipeline Tutorial

## 1. Introduction to PP-DocTranslation Pipeline

The General Document Translation Pipeline (PP-DocTranslation) is a document intelligent translation solution provided by PaddlePaddle. It integrates advanced general layout analysis technology with the capabilities of large language models (LLMs) to offer efficient document intelligent translation services. This solution can accurately identify and extract various elements within documents, including text blocks, headings, paragraphs, images, tables, and other complex layout structures, and on this basis achieve high-quality multilingual translation. PP-DocTranslation supports mutual translation among multiple mainstream languages and is particularly adept at handling documents with complex layouts and strong contextual dependencies, striving to deliver accurate, natural, fluent, and professional translation results. This pipeline also provides flexible Serving deployment options, supporting invocation from multiple programming languages on various hardware. It also supports secondary development: you can train and fine-tune models on your own datasets, and the trained models can be seamlessly integrated.

The general document translation pipeline uses the PP-StructureV3 sub-pipeline and therefore offers all of PP-StructureV3's functions. For more information on the functions and usage of PP-StructureV3, see the PP-StructureV3 documentation page.

If you prioritize model accuracy, choose a higher-accuracy model; if you prioritize inference speed, choose a faster model; if you prioritize storage size, choose a model with a smaller storage footprint.

The inference time only includes the model inference time and does not include the time for pre- or post-processing.

👉 Details of the model list

Document image orientation classification module:

| Model | Model download link | Top-1 Acc (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-LCNet_x1_0_doc_ori | Inference model / Training model | 99.06 | 2.62 / 0.59 | 3.24 / 1.19 | 7 | A document image classification model based on PP-LCNet_x1_0, with four categories: 0 degrees, 90 degrees, 180 degrees, and 270 degrees |

Text image rectification module:

| Model | Model download link | CER | Model storage size (M) | Introduction |
|---|---|---|---|---|
| UVDoc | Inference model / Training model | 0.179 | 30.3 M | A high-precision text image rectification model |

Layout region detection module:

| Model | Model download link | mAP(0.5) (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-DocLayout_plus-L | Inference model / Training model | 83.2 | 53.03 / 17.23 | 634.62 / 378.32 | 126.01 M | A higher-precision layout region localization model trained on a self-built dataset based on RT-DETR-L, covering scenarios such as Chinese and English papers, multi-column magazines, newspapers, PPTs, contracts, books, examination papers, research reports, ancient books, Japanese documents, and documents with vertical text |
| PP-DocLayout-L | Inference model / Training model | 90.4 | 33.59 / 33.59 | 503.01 / 251.08 | 123.76 M | A high-precision layout region localization model trained on a self-built dataset based on RT-DETR-L, covering scenarios such as Chinese and English papers, magazines, contracts, books, examination papers, and research reports |
| PP-DocLayout-M | Inference model / Training model | 75.2 | 13.03 / 4.72 | 43.39 / 24.44 | 22.578 | A layout region localization model with balanced precision and efficiency trained on a self-built dataset based on PicoDet-L, covering scenarios such as Chinese and English papers, magazines, contracts, books, examination papers, and research reports |
| PP-DocLayout-S | Inference model / Training model | 70.9 | 11.54 / 3.86 | 18.53 / 6.29 | 4.834 | A highly efficient layout region localization model trained on a self-built dataset based on PicoDet-S, covering scenarios such as Chinese and English papers, magazines, contracts, books, examination papers, and research reports |

Table structure recognition module:

| Model | Model download link | Accuracy (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| SLANeXt_wired<br>SLANeXt_wireless | Inference model / Training model | 69.65 | 85.92 / 85.92 | - / 501.66 | 351 M | The SLANeXt series is a new generation of table structure recognition models independently developed by Baidu PaddlePaddle's vision team. Compared with SLANet and SLANet_plus, SLANeXt focuses on recognizing table structures and has trained dedicated weights for wired and wireless tables separately. Its recognition capability for all types of tables has improved significantly, especially for wired tables. |

Table classification module:

| Model | Model download link | Top-1 Acc (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) |
|---|---|---|---|---|---|
| PP-LCNet_x1_0_table_cls | Inference model / Training model | 94.2 | 2.62 / 0.60 | 3.17 / 1.14 | 6.6 M |

Table cell detection module:

| Model | Model download link | mAP (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| RT-DETR-L_wired_table_cell_det<br>RT-DETR-L_wireless_table_cell_det | Inference model / Training model | 82.7 | 33.47 / 27.02 | 402.55 / 256.56 | 124 M | RT-DETR is the first real-time end-to-end object detection model. The Baidu PaddlePaddle vision team pre-trained on a self-built table cell detection dataset using RT-DETR-L as the base model, achieving table cell detection with good performance for both wired and wireless tables. |

Text detection module:

| Model | Model download link | Detection Hmean (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-OCRv5_server_det | Inference model / Training model | 83.8 | 89.55 / 70.19 | 383.15 / 383.15 | 84.3 | The server-side text detection model of PP-OCRv5, with higher accuracy, suitable for deployment on servers with better performance |
| PP-OCRv5_mobile_det | Inference model / Training model | 79.0 | 10.67 / 6.36 | 57.77 / 28.15 | 4.7 | The mobile-side text detection model of PP-OCRv5, with higher efficiency, suitable for deployment on edge devices |
| PP-OCRv4_server_det | Inference model / Training model | 69.2 | 127.82 / 98.87 | 585.95 / 489.77 | 109 | The server-side text detection model of PP-OCRv4, with higher accuracy, suitable for deployment on servers with better performance |
| PP-OCRv4_mobile_det | Inference model / Training model | 63.8 | 9.87 / 4.17 | 56.60 / 20.79 | 4.7 | The mobile-side text detection model of PP-OCRv4, with higher efficiency, suitable for deployment on edge devices |
| PP-OCRv3_mobile_det | Inference model / Training model | Accuracy close to PP-OCRv4_mobile_det | 9.90 / 3.60 | 41.93 / 20.76 | 2.1 | The mobile-side text detection model of PP-OCRv3, with higher efficiency, suitable for deployment on edge devices |
| PP-OCRv3_server_det | Inference model / Training model | Accuracy close to PP-OCRv4_server_det | 119.50 / 75.00 | 379.35 / 318.35 | 102.1 | The server-side text detection model of PP-OCRv3, with higher accuracy, suitable for deployment on servers with better performance |

Text recognition module models:

* Chinese recognition model

| Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-OCRv5_server_rec | Inference model / Training model | 86.38 | 8.46 / 2.36 | 31.21 / 31.21 | 81 M | PP-OCRv5_rec is a new-generation text recognition model. This model is dedicated to efficiently and accurately supporting the recognition of four major languages, namely Simplified Chinese, Traditional Chinese, English, and Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters, all within a single model. While maintaining recognition accuracy, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios. |
| PP-OCRv5_mobile_rec | Inference model / Training model | 81.29 | 5.43 / 1.46 | 21.20 / 5.32 | 16 M | |
| PP-OCRv4_server_rec_doc | Inference model / Training model | 86.58 | 8.69 / 2.78 | 37.93 / 37.93 | 74.7 M | PP-OCRv4_server_rec_doc is trained on a mixed dataset of more Chinese document data and PP-OCR training data, based on PP-OCRv4_server_rec. It enhances the ability to recognize some Traditional Chinese characters, Japanese characters, and special characters, supporting the recognition of over 15,000 characters. In addition to improving document-related text recognition, it also enhances general text recognition. |
| PP-OCRv4_mobile_rec | Inference model / Training model | 78.74 | 5.26 / 1.12 | 17.48 / 3.61 | 10.6 M | Lightweight recognition model of PP-OCRv4 with high inference efficiency, deployable on various hardware devices including edge devices |
| PP-OCRv4_server_rec | Inference model / Training model | 80.61 | 8.75 / 2.49 | 36.93 / 36.93 | 71.2 M | Server-side model of PP-OCRv4 with high inference accuracy, deployable on various servers |
| PP-OCRv3_mobile_rec | Inference model / Training model | 72.96 | 3.89 / 1.16 | 8.72 / 3.56 | 9.2 M | Lightweight recognition model of PP-OCRv3 with high inference efficiency, deployable on various hardware devices including edge devices |

| Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| ch_SVTRv2_rec | Inference model / Training model | 68.81 | 10.38 / 8.31 | 66.52 / 30.83 | 73.9 M | SVTRv2 is a server-side text recognition model developed by the OpenOCR team of the Vision and Learning Lab (FVL) at Fudan University. It won the first prize in the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task. Its end-to-end recognition accuracy on Leaderboard A improved by 6% over PP-OCRv4. |

| Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| ch_RepSVTR_rec | Inference model / Training model | 65.07 | 6.29 / 1.57 | 20.64 / 5.40 | 22.1 M | The RepSVTR text recognition model is a mobile-side text recognition model based on SVTRv2. It won the first prize in the PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task. Compared with PP-OCRv4, its end-to-end recognition accuracy on Leaderboard B increased by 2.5% with the same inference speed. |

* English recognition model

| Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| en_PP-OCRv4_mobile_rec | Inference model / Training model | 70.39 | 4.81 / 1.23 | 17.20 / 4.18 | 6.8 M | An ultra-lightweight English recognition model trained based on the PP-OCRv4 recognition model, supporting English and digit recognition |
| en_PP-OCRv3_mobile_rec | Inference model / Training model | 70.69 | 3.56 / 0.78 | 8.44 / 5.78 | 7.8 M | An ultra-lightweight English recognition model trained based on the PP-OCRv3 recognition model, supporting English and digit recognition |

* Multilingual recognition model

| Model | Model download link | Recognition Avg Accuracy (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| korean_PP-OCRv3_mobile_rec | Inference model / Training model | 60.21 | 3.73 / 0.98 | 8.76 / 2.91 | 8.6 M | An ultra-lightweight Korean recognition model trained based on the PP-OCRv3 recognition model, supporting Korean and digit recognition |
| japan_PP-OCRv3_mobile_rec | Inference model / Training model | 45.69 | 3.86 / 1.01 | 8.62 / 2.92 | 8.8 M | An ultra-lightweight Japanese recognition model trained based on the PP-OCRv3 recognition model, supporting Japanese and digit recognition |
| chinese_cht_PP-OCRv3_mobile_rec | Inference model / Training model | 82.06 | 3.90 / 1.16 | 9.24 / 3.18 | 9.7 M | An ultra-lightweight Traditional Chinese recognition model trained based on the PP-OCRv3 recognition model, supporting Traditional Chinese and digit recognition |
| te_PP-OCRv3_mobile_rec | Inference model / Training model | 95.88 | 3.59 / 0.81 | 8.28 / 6.21 | 7.8 M | An ultra-lightweight Telugu recognition model trained based on the PP-OCRv3 recognition model, supporting Telugu and digit recognition |
| ka_PP-OCRv3_mobile_rec | Inference model / Training model | 96.96 | 3.49 / 0.89 | 8.63 / 2.77 | 8.0 M | An ultra-lightweight Kannada recognition model trained based on the PP-OCRv3 recognition model, supporting Kannada and digit recognition |
| ta_PP-OCRv3_mobile_rec | Inference model / Training model | 76.83 | 3.49 / 0.86 | 8.35 / 3.41 | 8.0 M | An ultra-lightweight Tamil recognition model trained based on the PP-OCRv3 recognition model, supporting Tamil and digit recognition |
| latin_PP-OCRv3_mobile_rec | Inference model / Training model | 76.93 | 3.53 / 0.78 | 8.50 / 6.83 | 7.8 M | An ultra-lightweight Latin recognition model trained based on the PP-OCRv3 recognition model, supporting Latin and digit recognition |
| arabic_PP-OCRv3_mobile_rec | Inference model / Training model | 73.55 | 3.60 / 0.83 | 8.44 / 4.69 | 7.8 M | An ultra-lightweight Arabic letter recognition model trained based on the PP-OCRv3 recognition model, supporting Arabic letter and digit recognition |
| cyrillic_PP-OCRv3_mobile_rec | Inference model / Training model | 94.28 | 3.56 / 0.79 | 8.22 / 2.76 | 7.9 M | An ultra-lightweight Cyrillic alphabet recognition model trained based on the PP-OCRv3 recognition model, supporting Cyrillic letter and digit recognition |
| devanagari_PP-OCRv3_mobile_rec | Inference model / Training model | 96.44 | 3.60 / 0.78 | 6.95 / 2.87 | 7.9 M | An ultra-lightweight Devanagari letter recognition model trained based on the PP-OCRv3 recognition model, supporting Devanagari letter and digit recognition |

Text line orientation classification module (optional):

| Model | Model download link | Top-1 Acc (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-LCNet_x0_25_textline_ori | Inference model / Training model | 95.54 | 2.16 / 0.41 | 2.37 / 0.73 | 0.32 | A text line classification model based on PP-LCNet_x0_25, with two categories: 0 degrees and 180 degrees |

Formula recognition module:

| Model | Model download link | Avg-BLEU (%) | GPU inference latency (ms)<br>[Normal mode / High-performance mode] | CPU inference latency (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| UniMERNet | Inference model / Training model | 86.13 | 2266.96 / - | - / - | 1.4 G | UniMERNet is a formula recognition model developed by Shanghai AI Lab. It employs Donut Swin as the encoder and MBartDecoder as the decoder. Trained on a dataset of one million entries, including simple formulas, complex formulas, scanned formulas, and handwritten formulas, it significantly improves recognition accuracy for formulas in real-world scenarios. |
| PP-FormulaNet-S | Inference model / Training model | 87.12 | 1311.84 / 1311.84 | - / 8288.07 | 167.9 M | PP-FormulaNet is an advanced formula recognition model developed by the Baidu PaddlePaddle vision team, supporting a vocabulary of 50,000 common LaTeX source-code tokens. The PP-FormulaNet-S version adopts PP-HGNetV2-B4 as its backbone network. Through techniques such as parallel masking and model distillation, it significantly improves inference speed while maintaining high recognition accuracy, and is suitable for simple printed formulas and cross-line simple printed formulas. |
| PP-FormulaNet-L | Inference model / Training model | 92.13 | 1976.52 / - | - / - | 535.2 M | The PP-FormulaNet-L version is based on Vary_VIT_B as the backbone network and has undergone in-depth training on a large-scale formula dataset. It shows significant improvement over PP-FormulaNet-S in recognizing complex formulas and is suitable for simple printed formulas, complex printed formulas, and handwritten formulas. |
| LaTeX_OCR_rec | Inference model / Training model | 71.63 | 1088.89 / 1088.89 | - / - | 89.7 M | LaTeX-OCR is a formula recognition algorithm based on an autoregressive large model. By adopting Hybrid ViT as the backbone network and a transformer as the decoder, it significantly improves the accuracy of formula recognition. |

Seal text detection module:

| Model | Model download link | Detection Hmean (%) | GPU inference time (ms)<br>[Normal mode / High-performance mode] | CPU inference time (ms)<br>[Normal mode / High-performance mode] | Model storage size (M) | Introduction |
|---|---|---|---|---|---|---|
| PP-OCRv4_server_seal_det | Inference model / Training model | 98.21 | 124.64 / 91.57 | 545.68 / 439.86 | 109 | The server-side seal text detection model of PP-OCRv4, with higher accuracy, suitable for deployment on servers with better performance |
| PP-OCRv4_mobile_seal_det | Inference model / Training model | 96.47 | 9.70 / 3.56 | 50.38 / 19.64 | 4.6 | The mobile-side seal text detection model of PP-OCRv4, with higher efficiency, suitable for deployment on edge devices |
Test environment description:
  • Performance test environment
    • Test dataset:
      • Document image orientation classification model: A self-built dataset by PaddleX, covering multiple scenarios such as certificates and documents, including 1000 images.
      • Text image rectification model: DocUNet.
      • Layout region detection model: A self-built layout region analysis dataset by PaddleOCR, including 10,000 common document-type images such as Chinese and English papers, magazines, and research reports.
      • PP-DocLayout_plus-L: A self-built layout region detection dataset by PaddleOCR, including 1300 document-type images such as Chinese and English papers, magazines, newspapers, research reports, PPTs, exam papers, and textbooks.
      • Table structure recognition model: An internally self-built English table recognition dataset by PaddleX.
      • Text detection model: A self-built Chinese dataset by PaddleOCR, covering multiple scenarios such as street views, web images, documents, and handwriting, including 500 images for detection.
      • Chinese recognition model: A self-built Chinese dataset by PaddleOCR, covering multiple scenarios such as street views, web images, documents, and handwriting, including 11,000 images for text recognition.
      • ch_SVTRv2_rec: PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task, Phase A evaluation set.
      • ch_RepSVTR_rec: PaddleOCR Algorithm Model Challenge - Task 1: OCR End-to-End Recognition Task, Phase B evaluation set.
      • English recognition model: A self-built English dataset by PaddleX.
      • Multilingual recognition model: A self-built multilingual dataset by PaddleX.
      • Text line orientation classification model: A self-built dataset by PaddleX, covering multiple scenarios such as certificates and documents, including 1000 images.
      • Seal text detection model: A self-built dataset by PaddleX, including 500 round seal images.
    • Hardware configuration:
      • GPU: NVIDIA Tesla T4
      • CPU: Intel Xeon Gold 6271C @ 2.60GHz
      • Other environments: Ubuntu 20.04 / CUDA 11.8 / cuDNN 8.9 / TensorRT 8.6.1.6
  • Description of inference modes

| Mode | GPU configuration | CPU configuration | Combination of acceleration technologies |
|---|---|---|---|
| Normal mode | FP32 precision / no TRT acceleration | FP32 precision / 8 threads | PaddleInference |
| High-performance mode | Optimal combination of precision type and acceleration strategy selected in advance | FP32 precision / 8 threads | Optimal backend selected in advance (Paddle/OpenVINO/TRT, etc.) |

## 2. Quick Start

### 2.1 Local Experience

Before using the PP-DocTranslation pipeline locally, ensure that you have completed the installation of PaddleX (refer to the Installation Guide). This pipeline depends on the dependency group named `translation`.

Before use, you need to prepare an API key for a large language model. Both the Baidu Cloud Qianfan platform and locally deployed large model services that comply with the OpenAI interface standard are supported.

```python
from paddlex import create_pipeline

# Create a translation pipeline
pipeline = create_pipeline(pipeline="PP-DocTranslation")

# Document path
input_path = "document_sample.pdf"

# Output directory
output_path = "./output"

# Large model configuration
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key",  # your api_key
}

if input_path.lower().endswith(".md"):
    # Read Markdown documents; directories and URL links ending in .md are also supported
    ori_md_info_list = pipeline.load_from_markdown(input_path)
else:
    # Use PP-StructureV3 to perform layout analysis on PDF/image documents to obtain Markdown information
    visual_predict_res = pipeline.visual_predict(
        input_path,
        use_doc_orientation_classify=False,
        use_doc_unwarping=False,
        use_common_ocr=True,
        use_seal_recognition=True,
        use_table_recognition=True,
    )

    ori_md_info_list = []
    for res in visual_predict_res:
        layout_parsing_result = res["layout_parsing_result"]
        ori_md_info_list.append(layout_parsing_result.markdown)
        layout_parsing_result.save_to_img(output_path)
        layout_parsing_result.save_to_markdown(output_path)

    # Concatenate the Markdown information of multi-page documents into a single file and save the merged original Markdown
    if input_path.lower().endswith(".pdf"):
        ori_md_info = pipeline.concatenate_markdown_pages(ori_md_info_list)
        ori_md_info.save_to_markdown(output_path)

# Perform document translation (target language: English)
tgt_md_info_list = pipeline.translate(
    ori_md_info_list=ori_md_info_list,
    target_language="en",
    chunk_size=5000,
    chat_bot_config=chat_bot_config,
)
# Save the translation results
for tgt_md_info in tgt_md_info_list:
    tgt_md_info.save_to_markdown(output_path)
```

After executing the above code, you will obtain the parsed results of the original document to be translated, the Markdown file of the original document to be translated, and the Markdown file of the translated document, all saved in the output folder.
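
If you use a locally deployed large model service that follows the OpenAI interface standard instead of the Qianfan platform, only `chat_bot_config` needs to change. A minimal sketch, in which the model name, URL, and key are placeholders for your own deployment:

```python
# chat_bot_config for a hypothetical local OpenAI-compatible LLM service
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "my-local-model",          # placeholder: name served by your local service
    "base_url": "http://localhost:8000/v1",  # placeholder: your local service URL
    "api_type": "openai",
    "api_key": "EMPTY",                      # many local services accept any non-empty key
}
```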

PP-DocTranslation Prediction Process, API Description, and Output Description

(1) Instantiation: call `create_pipeline` to instantiate a pipeline object. The parameters are described as follows:

| Parameter | Description | Type | Default |
|---|---|---|---|
| pipeline | Pipeline name or configuration file path (set to "PP-DocTranslation") | str | None |
| device | Inference device (such as "gpu:0", "npu:0", "cpu", etc.) | str | gpu |
| use_hpip | Whether to enable the high-performance inference plugin | bool\|None | None |
| hpi_config | High-performance inference configuration | dict\|None | None |
| initial_predictor | Whether to initialize the inference module | bool | True |
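
For example, a minimal instantiation that pins the pipeline to a specific device might look like the following sketch (the device string is illustrative):

```python
from paddlex import create_pipeline

# Instantiate the pipeline on GPU 0; use device="cpu" if no GPU is available
pipeline = create_pipeline(
    pipeline="PP-DocTranslation",
    device="gpu:0",
)
```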

(2) Call the `visual_predict()` method of the PP-DocTranslation pipeline object to obtain visual prediction results. This method returns a generator. The parameters of `visual_predict()` are described below:

| Parameter | Description | Type | Options | Default |
|---|---|---|---|---|
| input | Data to be predicted; multiple input types are supported (required) | Python Var\|str\|list | • Python Var: e.g. numpy.ndarray image data<br>• str: the local path of an image or PDF file (e.g. /root/data/img.jpg), a URL of an image or PDF file, or a local directory containing the images to predict (e.g. /root/data/; PDF files inside directories are not supported, so specify a PDF file path directly)<br>• list: elements of the above types, e.g. [numpy.ndarray, numpy.ndarray], ["/root/data/img1.jpg", "/root/data/img2.jpg"], ["/root/data1", "/root/data2"] | None |
| use_doc_orientation_classify | Whether to use the document orientation classification module | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (True) | None |
| use_doc_unwarping | Whether to use the document unwarping module | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (True) | None |
| use_textline_orientation | Whether to use the text line orientation classification module | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (True) | None |
| use_general_ocr | Whether to use the OCR sub-pipeline | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (True) | None |
| use_seal_recognition | Whether to use the seal recognition sub-pipeline | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (True) | None |
| use_table_recognition | Whether to use the table recognition sub-pipeline | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (True) | None |
| use_formula_recognition | Whether to use the formula recognition sub-pipeline | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (True) | False |
| use_chart_recognition | Whether to use the chart recognition sub-pipeline | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (True) | None |
| use_region_detection | Whether to use the document region detection sub-pipeline | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (True) | None |
| layout_threshold | Score threshold for the layout model | float\|dict\|None | • float: any floating-point number between 0 and 1<br>• dict: {cls_id: threshold}, e.g. {0: 0.1} sets a threshold of 0.1 for class 0<br>• None: use the value set at pipeline initialization (0.5) | None |
| layout_nms | Whether the layout region detection model uses NMS post-processing | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (True) | None |
| layout_unclip_ratio | Expansion coefficient of the detection boxes of the layout region detection model | float\|Tuple[float,float]\|dict\|None | • float: any floating-point number greater than 0<br>• Tuple[float,float]: expansion coefficients in the horizontal and vertical directions<br>• dict: keys are int cls_id and values are tuples, e.g. {0: (1.1, 2.0)} keeps the center of class-0 boxes unchanged while expanding the width 1.1 times and the height 2.0 times<br>• None: use the value set at pipeline initialization (1.0) | None |
| layout_merge_bboxes_mode | Filtering method for overlapping boxes in layout region detection | str\|dict\|None | • str: large, small, or union, meaning keep the large box, the small box, or both when filtering overlapping boxes<br>• dict: keys are int cls_id and values are str, e.g. {0: "large", 2: "small"} uses large mode for class-0 boxes and small mode for class-2 boxes<br>• None: use the value set at pipeline initialization (large) | None |
| text_det_limit_side_len | Image side length limit for text detection | int\|None | • int: any integer greater than 0<br>• None: use the value set at pipeline initialization (960) | None |
| text_det_limit_type | Image side length limit type for text detection | str\|None | • str: min or max; min ensures the shortest side of the image is no less than det_limit_side_len, max ensures the longest side is no greater than limit_side_len<br>• None: use the value set at pipeline initialization (max) | None |
| text_det_thresh | Detection pixel threshold; only pixels with scores greater than this threshold in the output probability map are considered text pixels | float\|None | • float: any floating-point number greater than 0<br>• None: use the value set at pipeline initialization (0.3) | None |
| text_det_box_thresh | Detection box threshold; a result is considered a text region when the average score of all pixels within the detected box is greater than this threshold | float\|None | • float: any floating-point number greater than 0<br>• None: use the value set at pipeline initialization (0.6) | None |
| text_det_unclip_ratio | Text detection expansion coefficient, used to expand the text region; the larger the value, the larger the expanded area | float\|None | • float: any floating-point number greater than 0<br>• None: use the value set at pipeline initialization (2.0) | None |
| text_rec_score_thresh | Text recognition threshold; text results with scores greater than this threshold are retained | float\|None | • float: any floating-point number greater than 0<br>• None: use the value set at pipeline initialization (0.0, i.e. no threshold) | None |
| seal_det_limit_side_len | Image side length limit for seal detection | int\|None | • int: any integer greater than 0<br>• None: use the value set at pipeline initialization (960) | None |
| seal_det_limit_type | Image side length limit type for seal detection | str\|None | • str: min or max; min ensures the shortest side of the image is no less than det_limit_side_len, max ensures the longest side is no greater than limit_side_len<br>• None: use the value set at pipeline initialization (max) | None |
| seal_det_thresh | Detection pixel threshold; only pixels with scores greater than this threshold in the output probability map are considered seal pixels | float\|None | • float: any floating-point number greater than 0<br>• None: use the value set at pipeline initialization (0.3) | None |
| seal_det_box_thresh | Detection box threshold; a result is considered a seal region when the average score of all pixels within the detected box is greater than this threshold | float\|None | • float: any floating-point number greater than 0<br>• None: use the value set at pipeline initialization (0.6) | None |
| seal_det_unclip_ratio | Seal detection expansion coefficient, used to expand the text region; the larger the value, the larger the expanded area | float\|None | • float: any floating-point number greater than 0<br>• None: use the value set at pipeline initialization (2.0) | None |
| seal_rec_score_thresh | Seal recognition threshold; text results with scores greater than this threshold are retained | float\|None | • float: any floating-point number greater than 0<br>• None: use the value set at pipeline initialization (0.0, i.e. no threshold) | None |
| use_wired_table_cells_trans_to_html | Whether to enable direct conversion of wired table cell detection results to HTML; if enabled, HTML is constructed directly from the geometric relationships of the wired-cell detection results | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (False) | False |
| use_wireless_table_cells_trans_to_html | Whether to enable direct conversion of wireless table cell detection results to HTML; if enabled, HTML is constructed directly from the geometric relationships of the wireless-cell detection results | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (False) | False |
| use_table_orientation_classify | Whether to enable table orientation classification; when enabled, tables rotated by 90/180/270 degrees in the image can be corrected and recognized correctly | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (True) | True |
| use_ocr_results_with_table_cells | Whether to enable cell-split OCR; when enabled, OCR detection results are split and re-recognized based on cell predictions to avoid missing text | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (True) | True |
| use_e2e_wired_table_rec_model | Whether to enable the end-to-end wired table recognition mode; when enabled, the cell detection model is not used and only the table structure recognition model is used | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (False) | False |
| use_e2e_wireless_table_rec_model | Whether to enable the end-to-end wireless table recognition mode; when enabled, the cell detection model is not used and only the table structure recognition model is used | bool\|None | • bool: True or False<br>• None: use the value set at pipeline initialization (False) | True |
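
To illustrate how these parameters combine, here is a sketch of a `visual_predict()` call; the threshold and ratio values are examples, not recommendations:

```python
# Illustrative visual_predict() call exercising a few of the parameters above
visual_predict_res = pipeline.visual_predict(
    "document_sample.pdf",
    use_doc_orientation_classify=True,
    use_table_recognition=True,
    layout_threshold={0: 0.3},        # per-class score threshold for class 0
    layout_unclip_ratio=(1.0, 1.2),   # horizontal / vertical box expansion
    text_rec_score_thresh=0.5,        # keep recognized texts scoring above 0.5
)
for res in visual_predict_res:        # the method returns a generator
    layout_parsing_result = res["layout_parsing_result"]
```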

(3) Process the visual prediction results. The prediction result for each sample is a corresponding Result object, which supports printing, saving as an image, and saving as a JSON file:

| Method | Description | Parameter | Type | Parameter description | Default |
|---|---|---|---|---|---|
| print() | Print the result to the terminal | format_json | bool | Whether to format the output content with JSON indentation | True |
|  |  | indent | int | Indentation level to beautify the JSON output and make it more readable; valid only when format_json is True | 4 |
|  |  | ensure_ascii | bool | Whether to escape non-ASCII characters to Unicode; when True, all non-ASCII characters are escaped, while False keeps the original characters; valid only when format_json is True | False |
| save_to_json() | Save the result as a JSON-format file | save_path | str | The file path for saving; when it is a directory, the saved file name matches the input file name | None |
|  |  | indent | int | Indentation level to beautify the JSON output and make it more readable; valid only when format_json is True | 4 |
|  |  | ensure_ascii | bool | Whether to escape non-ASCII characters to Unicode; when True, all non-ASCII characters are escaped, while False keeps the original characters; valid only when format_json is True | False |
| save_to_img() | Save the visualization images of each intermediate module in PNG format | save_path | str | The file path for saving; supports a directory or file path | None |
| save_to_markdown() | Save each page of an image or PDF file as a separate Markdown-format file | save_path | str | The file path for saving; supports a directory or file path | None |
| save_to_html() | Save the tables in the file as an HTML-format file | save_path | str | The file path for saving; supports a directory or file path | None |
| save_to_xlsx() | Save the tables in the file as an XLSX-format file | save_path | str | The file path for saving; supports a directory or file path | None |

- Calling the `print()` method will print the results to the terminal. The content printed to the terminal is explained as follows:
    - `input_path`: `(str)` The input path of the image or PDF to be predicted
    - `page_index`: `(Union[int, None])` If the input is a PDF file, indicates which page of the PDF it is; otherwise it is `None`
    - `model_settings`: `(Dict[str, bool])` Model parameters configured for the pipeline
        - `use_doc_preprocessor`: `(bool)` Controls whether to enable the document preprocessing sub-pipeline
        - `use_general_ocr`: `(bool)` Controls whether to enable the OCR sub-pipeline
        - `use_seal_recognition`: `(bool)` Controls whether to enable the seal recognition sub-pipeline
        - `use_table_recognition`: `(bool)` Controls whether to enable the table recognition sub-pipeline
        - `use_formula_recognition`: `(bool)` Controls whether to enable the formula recognition sub-pipeline
    - `doc_preprocessor_res`: `(Dict[str, Union[List[float], str]])` A dictionary of document preprocessing results, which only exists when `use_doc_preprocessor=True`
        - `input_path`: `(str)` The image path accepted by the document preprocessing sub-pipeline; saved as `None` when the input is a `numpy.ndarray`
        - `page_index`: `None` when the input is a `numpy.ndarray`
        - `model_settings`: `(Dict[str, bool])` Model configuration parameters for the document preprocessing sub-pipeline
            - `use_doc_orientation_classify`: `(bool)` Controls whether to enable the document image orientation classification submodule
            - `use_doc_unwarping`: `(bool)` Controls whether to enable the text image distortion correction submodule
        - `angle`: `(int)` The prediction result of the document image orientation classification submodule; returns the actual angle value when enabled
    - `parsing_res_list`: `(List[Dict])` A list of parsing results, where each element is a dictionary; the list is ordered according to the reading sequence after parsing
        - `block_bbox`: `(np.ndarray)` The bounding box of the layout area
        - `block_label`: `(str)` The label of the layout area, such as `text`, `table`, etc.
        - `block_content`: `(str)` The content within the layout area
        - `seg_start_flag`: `(bool)` Whether this layout area is the start of a paragraph
        - `seg_end_flag`: `(bool)` Whether this layout area is the end of a paragraph
        - `sub_label`: `(str)` The sub-label of the layout area; for example, the sub-label of `text` might be `title_text`
        - `sub_index`: `(int)` The sub-index of the layout area, used for restoring Markdown
        - `index`: `(int)` The index of the layout area, used for displaying the layout sorting results
    - `overall_ocr_res`: `(Dict[str, Union[List[str], List[float], numpy.ndarray]])` A dictionary of global OCR results
        - `input_path`: `(Union[str, None])` The image path accepted by the OCR sub-pipeline; saved as `None` when the input is a `numpy.ndarray`
        - `page_index`: `None` when the input is a `numpy.ndarray`
        - `model_settings`: `(Dict)` Model configuration parameters for the OCR sub-pipeline
        - `dt_polys`: `(List[numpy.ndarray])` List of polygon bounding boxes for text detection; each box is a numpy array of 4 vertex coordinates with shape (4, 2) and dtype int16
        - `dt_scores`: `(List[float])` List of confidence scores for text detection boxes
        - `text_det_params`: `(Dict[str, Dict[str, int, float]])` Configuration parameters for the text detection module
            - `limit_side_len`: `(int)` Side length limit during image preprocessing
            - `limit_type`: `(str)` Method for handling side length limits
            - `thresh`: `(float)` Confidence threshold for text pixel classification
            - `box_thresh`: `(float)` Confidence threshold for text detection boxes
            - `unclip_ratio`: `(float)` Expansion coefficient for text detection boxes
            - `text_type`: `(str)` Type of text detection, currently fixed as "general"
        - `textline_orientation_angles`: `(List[int])` Prediction results for text line orientation classification; returns actual angle values when enabled (e.g., [0, 0, 1])
        - `text_rec_score_thresh`: `(float)` Filtering threshold for text recognition results
        - `rec_texts`: `(List[str])` List of text recognition results, containing only texts with confidence scores exceeding `text_rec_score_thresh`
        - `rec_scores`: `(List[float])` List of confidence scores for text recognition, filtered by `text_rec_score_thresh`
        - `rec_polys`: `(List[numpy.ndarray])` List of text detection boxes after confidence filtering, with the same format as `dt_polys`
    - `formula_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` List of formula recognition results, with each element being a dictionary
        - `rec_formula`: `(str)` Formula recognition result
        - `rec_polys`: `(numpy.ndarray)` Formula detection box with shape (4, 2) and dtype int16
        - `formula_region_id`: `(int)` Region number where the formula is located
    - `seal_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` List of seal recognition results, with each element being a dictionary
        - `input_path`: `(str)` Input path of the seal image
        - `page_index`: `None` when the input is a `numpy.ndarray`
        - `model_settings`: `(Dict)` Model configuration parameters for the seal recognition sub-pipeline
        - `dt_polys`: `(List[numpy.ndarray])` List of seal detection boxes, with the same format as `dt_polys` above
        - `text_det_params`: `(Dict[str, Dict[str, int, float]])` Configuration parameters for the seal detection module, with the same parameter meanings as above
        - `text_type`: `(str)` Type of seal detection, currently fixed as "seal"
        - `text_rec_score_thresh`: `(float)` Filtering threshold for seal recognition results
        - `rec_texts`: `(List[str])` List of seal recognition results, containing only texts with confidence scores exceeding `text_rec_score_thresh`
        - `rec_scores`: `(List[float])` List of confidence scores for seal recognition, filtered by `text_rec_score_thresh`
        - `rec_polys`: `(List[numpy.ndarray])` List of seal detection boxes after confidence filtering, with the same format as `dt_polys`
        - `rec_boxes`: `(numpy.ndarray)` Array of rectangular bounding boxes with shape (n, 4) and dtype int16; each row represents a rectangle
    - `table_res_list`: `(List[Dict[str, Union[numpy.ndarray, List[float], str]]])` List of table recognition results, with each element being a dictionary
        - `cell_box_list`: `(List[numpy.ndarray])` List of bounding boxes for table cells
        - `pred_html`: `(str)` HTML-formatted string of the table
        - `table_ocr_pred`: `(dict)` OCR recognition results for the table
            - `rec_polys`: `(List[numpy.ndarray])` List of detection boxes for cells
            - `rec_texts`: `(List[str])` Recognition results for cells
            - `rec_scores`: `(List[float])` Recognition confidence scores for cells
            - `rec_boxes`: `(numpy.ndarray)` Array of rectangular detection boxes with shape (n, 4) and dtype int16; each row represents a rectangle
- Calling the `save_to_json()` method will save the above content to the specified `save_path`. If a directory is specified, the saved path will be `save_path/{your_img_basename}_res.json`; if a file is specified, it will be saved directly to that file. Since JSON files do not support saving numpy arrays, `numpy.array` values are converted to lists.
- Calling the `save_to_img()` method will save the visualization results to the specified `save_path`. If a directory is specified, it will save the layout region detection visualization, global OCR visualization, layout reading-order visualization, and other images. If a file is specified, the results are saved directly to that file. (The pipeline usually produces many result images, so directly specifying a single file path is not recommended; otherwise images will be overwritten and only the last one retained.)
- Calling the `save_to_markdown()` method will save the converted Markdown file to the specified `save_path`; the saved file path will be `save_path/{your_img_basename}.md`. If the input is a PDF file, it is recommended to specify a directory; otherwise multiple Markdown files will be overwritten.
- Calling the `concatenate_markdown_pages()` method merges the multi-page Markdown content `markdown_list` output by the PP-DocTranslation pipeline into a single complete document and returns the merged Markdown content.
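
Putting these methods together, a typical post-processing loop might look like the following sketch (output paths are illustrative):

```python
# Print and save each page's result using the Result object methods described above
for res in visual_predict_res:
    layout_parsing_result = res["layout_parsing_result"]
    layout_parsing_result.print(format_json=True, indent=4, ensure_ascii=False)
    layout_parsing_result.save_to_json("./output")      # ./output/{basename}_res.json
    layout_parsing_result.save_to_img("./output")       # visualization images
    layout_parsing_result.save_to_markdown("./output")  # ./output/{basename}.md
```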

(4) Call the `translate()` method to perform document translation. This method returns the original Markdown text and the translated text as Markdown objects; you can save the required parts locally with the `save_to_markdown()` method. The relevant parameters of `translate()` are described below:

| Parameter | Description | Type | Options | Default |
|---|---|---|---|---|
| ori_md_info_list | A list of data in the original Markdown format, containing the content to be translated | List[Dict] | Must be a list of dictionaries, with each dictionary representing a document block | No default (required) |
| target_language | Target translation language | str | ISO 639-1 language code (such as "en"/"ja"/"fr") | "zh" |
| chunk_size | Character threshold for chunking the text to be translated | int | An integer greater than 0 | 5000 |
| task_description | Custom task description prompt | str\|None | • str: custom text for translation task instructions<br>• None: use the default task description | None |
| output_format | Output format requirements | str\|None | • str: format specification (e.g., "Maintain original Markdown structure")<br>• None: do not add additional format constraints | None |
| rules_str | Custom translation rule description | str\|None | • str: terminology/style rule text<br>• None: do not use additional rules | None |
| few_shot_demo_text_content | Example text content for few-shot learning | str\|None | • str: example text string<br>• None: do not provide text examples | None |
| few_shot_demo_key_value_list | Structured few-shot example data | str\|None | • str: example data in key-value pair format<br>• None: do not provide structured examples | None |
| glossary | Glossary of technical terms | dict\|None | • dict: glossary mapping dictionary<br>• None: use the default configuration | None |
| llm_request_interval | Time interval in seconds between requests to the large language model; useful to prevent overly frequent LLM calls | float | A floating-point number greater than or equal to 0 | 0.0 |
| chat_bot_config | Large language model configuration | dict\|None | • dict: dictionary of model parameter configurations<br>• None: use the default configuration | None |
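
As a sketch of the optional arguments, a `translate()` call that enforces terminology and throttles LLM requests might look like this (the glossary entry and rule text are placeholders):

```python
# Illustrative translate() call using several optional parameters from the table above
tgt_md_info_list = pipeline.translate(
    ori_md_info_list=ori_md_info_list,
    target_language="en",
    chunk_size=5000,
    glossary={"产线": "pipeline"},                     # placeholder terminology mapping
    rules_str="Keep code identifiers untranslated.",  # placeholder style rule
    llm_request_interval=1.0,                         # at most one LLM request per second
    chat_bot_config=chat_bot_config,
)
```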

In addition, you can obtain the configuration file of the PP-DocTranslation pipeline and load it for prediction. Execute the following command to save the configuration file in `my_path`:

```bash
paddlex --get_pipeline_config PP-DocTranslation --save_path ./my_path
```

After obtaining the configuration file, you can customize the PP-DocTranslation pipeline configuration by setting the `pipeline` parameter of `create_pipeline` to the path of the configuration file. Here is an example:

```python
from paddlex import create_pipeline

pipeline = create_pipeline(pipeline="./my_path/PP-DocTranslation.yaml")

# Document path
img_path = "document_sample.pdf"

# Output directory
output_path = "./output"

# Large model configuration
chat_bot_config = {
    "module_name": "chat_bot",
    "model_name": "ernie-3.5-8k",
    "base_url": "https://qianfan.baidubce.com/v2",
    "api_type": "openai",
    "api_key": "api_key",  # Replace with the actual API key
}

# Perform layout analysis
visual_predict_res = pipeline.visual_predict(
    img_path,
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_common_ocr=True,
    use_seal_recognition=True,
    use_table_recognition=True,
)

# Extract the original document structure information
ori_md_info_list = []
for res in visual_predict_res:
    layout_parsing_result = res["layout_parsing_result"]
    ori_md_info_list.append(layout_parsing_result.markdown)
    layout_parsing_result.print()
    layout_parsing_result.save_to_img(output_path)
    layout_parsing_result.save_to_json(output_path)

# Document translation
tgt_md_info_list = pipeline.translate(
    ori_md_info_list=ori_md_info_list,
    target_language="en",
    chunk_size=5000,
    chat_bot_config=chat_bot_config,
)

# Save the translation results
for tgt_md_info in tgt_md_info_list:
    tgt_md_info.save_to_markdown(output_path)
```

Note: The parameters in the configuration file are pipeline initialization parameters. To change the initialization parameters of the PP-DocTranslation pipeline, modify the parameters in the configuration file directly and load it for prediction. CLI prediction also supports passing in a configuration file; specify its path with `--pipeline`.

## 3. Development Integration/Deployment

If the pipeline can meet your requirements for inference speed and accuracy, you can directly proceed with development integration/deployment.

If you need to directly apply the pipeline to your Python project, you can refer to the sample code in 2.2 Integration via Python Script.

In addition, PaddleX also provides three other deployment methods, which are described in detail below:

🚀High-performance inference: In real production environments, many applications have stringent performance metrics (especially response speed) for deployment strategies to ensure efficient system operation and smooth user experience. To this end, PaddleX provides a high-performance inference plugin designed to deeply optimize model inference and pre/post-processing, achieving significant acceleration of the end-to-end process. For detailed information on the high-performance inference process, please refer to the PaddleX High-Performance Inference Guide.

☁️Serving: Serving is a common form of deployment in actual production environments. By encapsulating inference functions as services, clients can access these services through network requests to obtain inference results. PaddleX supports multiple pipeline Serving deployment solutions. For detailed pipeline Serving deployment procedures, please refer to the PaddleX Serving Deployment Guide.

The following is the API reference for basic Serving and examples of multilingual service invocation:

API reference

Main operations provided by the service:

- The HTTP request method is POST.
- Both the request body and response body are JSON data (JSON objects).
- When the request is processed successfully, the response status code is 200, and the response body has the following attributes:

| Name | Type | Meaning |
|---|---|---|
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Fixed at 0. |
| errorMsg | string | Error description. Fixed at "Success". |
| result | object | Operation result. |

- When the request is not processed successfully, the response body has the following attributes:

| Name | Type | Meaning |
|---|---|---|
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Same as the response status code. |
| errorMsg | string | Error description. |

The main operations provided by the service are as follows:

- **analyzeImages**

Analyze images using computer vision models to obtain OCR results, table recognition results, etc.

`POST /doctrans-visual`

  • The attributes of the request body are as follows:
Name Type Meaning Required
file string The URL of an image file or PDF file accessible by the server, or the Base64-encoded result of the content of the aforementioned file types. By default, for PDF files with more than 10 pages, only the first 10 pages will be processed.
To remove the page limit, add the following configuration to the pipeline configuration file:
Serving:
  extra:

max_num_input_imgs: null</code></pre>

Yes
fileType integernull File type.0Indicates a PDF file,1Indicates an image file. If this attribute is not present in the request body, the file type will be inferred from the URL. No
useDocOrientationClassify boolean|null Refer to the description of the use_doc_orientation_classify parameter of the visual_predict method of the pipeline object. No
useDocUnwarping boolean|null Refer to the description of the use_doc_unwarping parameter of the visual_predict method of the pipeline object. No
useTextlineOrientation boolean|null Refer to the description of the use_textline_orientation parameter of the visual_predict method of the pipeline object. No
useSealRecognition boolean|null Refer to the description of the use_seal_recognition parameter of the visual_predict method of the pipeline object. No
useTableRecognition boolean|null Refer to the description of the use_table_recognition parameter of the visual_predict method of the pipeline object. No
useFormulaRecognition boolean|null Refer to the description of the use_formula_recognition parameter of the visual_predict method of the pipeline object. No
useChartRecognition boolean|null Refer to the description of the use_chart_recognition parameter of the visual_predict method of the pipeline object. No
useRegionDetection boolean|null Refer to the description of the use_region_detection parameter of the visual_predict method of the pipeline object. No
layoutThreshold number|object|null Refer to the description of the layout_threshold parameter of the visual_predict method of the pipeline object. No
layoutNms boolean|null Refer to the description of the layout_nms parameter of the visual_predict method of the pipeline object. No
layoutUnclipRatio number|array|object|null Refer to the description of the layout_unclip_ratio parameter of the visual_predict method of the pipeline object. No
layoutMergeBboxesMode string|object|null Refer to the description of the layout_merge_bboxes_mode parameter of the visual_predict method of the pipeline object. No
textDetLimitSideLen integer|null Refer to the description of the text_det_limit_side_len parameter of the visual_predict method of the pipeline object. No
textDetLimitType string|null Refer to the description of the text_det_limit_type parameter of the visual_predict method of the pipeline object. No
textDetThresh number|null Refer to the description of the text_det_thresh parameter of the visual_predict method of the pipeline object. No
textDetBoxThresh number|null Refer to the description of the text_det_box_thresh parameter of the visual_predict method of the pipeline object. No
textDetUnclipRatio number|null Refer to the description of the text_det_unclip_ratio parameter of the visual_predict method of the pipeline object. No
textRecScoreThresh number|null Refer to the description of the text_rec_score_thresh parameter of the visual_predict method of the pipeline object. No
sealDetLimitSideLen integer|null Refer to the description of the seal_det_limit_side_len parameter of the visual_predict method of the pipeline object. No
sealDetLimitType string|null Refer to the description of the seal_det_limit_type parameter of the visual_predict method of the pipeline object. No
sealDetThresh number|null Refer to the description of the seal_det_thresh parameter of the visual_predict method of the pipeline object. No
sealDetBoxThresh number|null Refer to the description of the seal_det_box_thresh parameter of the visual_predict method of the pipeline object. No
sealDetUnclipRatio number|null Refer to the description of the seal_det_unclip_ratio parameter of the visual_predict method of the pipeline object. No
sealRecScoreThresh number|null Refer to the description of the seal_rec_score_thresh parameter of the visual_predict method of the pipeline object. No
useWiredTableCellsTransToHtml boolean Refer to the description of the use_wired_table_cells_trans_to_html parameter of the visual_predict method of the pipeline object. No
useWirelessTableCellsTransToHtml boolean Refer to the description of the use_wireless_table_cells_trans_to_html parameter of the visual_predict method of the pipeline object. No
useTableOrientationClassify boolean Refer to the description of the use_table_orientation_classify parameter of the visual_predict method of the pipeline object. No
useOcrResultsWithTableCells boolean Refer to the description of the use_ocr_results_with_table_cells parameter of the visual_predict method of the pipeline object. No
useE2eWiredTableRecModel boolean Refer to the description of the use_e2e_wired_table_rec_model parameter of the visual_predict method of the pipeline object. No
useE2eWirelessTableRecModel boolean Refer to the description of the use_e2e_wireless_table_rec_model parameter of the visual_predict method of the pipeline object. No
visualize boolean|null Whether to return visualization images and intermediate images produced during processing.
  • Pass in true: return images.
  • Pass in false: do not return images.
  • If this parameter is omitted from the request body or null is passed in: follow the Serving.visualize setting in the pipeline configuration file.

For example, adding the following field to the pipeline configuration file:

```yaml
Serving:
  visualize: False
```

means that images will not be returned by default; the visualize parameter in the request body overrides this setting. If neither the request body nor the configuration file sets it (or null is passed in the request body and nothing is set in the configuration file), images are returned by default.
No
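As an illustration, a request body could override the configuration file setting like this (a minimal sketch; file_data stands for the Base64-encoded file content, as in the full example later in this section):

```python
# Minimal sketch: request-body override of the Serving.visualize setting.
# `file_data` is the Base64-encoded file content (see the full example below).
payload = {
    "file": file_data,
    "fileType": 1,      # 1 is assumed to denote image input
    "visualize": True,  # return visualization images even if the config file disables them
}
```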
  • When the request is processed successfully, the result in the response body has the following attributes:
Name Type Meaning
layoutParsingResults array Layout parsing results. The array length is 1 (for image input) or the number of document pages actually processed (for PDF input). For PDF input, each element corresponds, in order, to one processed page of the PDF file.
dataInfo object Information about the input data.

Each element in layoutParsingResults is an object with the following attributes:

Name Type Meaning
prunedResult object A simplified version of the res field in the JSON representation of the layout_parsing_result generated by the visual_predict method of the pipeline object, with the input_path and page_index fields removed.
markdown object Markdown result.
outputImages object|null Refer to the description of the img attribute in the pipeline prediction results. The images are in JPEG format and encoded with Base64.
inputImage string|null Input image. The image is in JPEG format and encoded with Base64.

markdown is an object with the following attributes:

Name Type Meaning
text string Markdown text.
images object Key-value pairs mapping the relative paths of Markdown images to their Base64-encoded contents.
isStart boolean Whether the first element of the current page is the start of a paragraph.
isEnd boolean Whether the last element of the current page is the end of a paragraph.
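The isStart and isEnd flags make it possible to stitch per-page Markdown back together without breaking paragraphs that span page boundaries. The sketch below is illustrative only; the helper name and the merging policy are assumptions, not part of the API:

```python
def concat_markdown_pages(pages: list) -> str:
    """Join per-page `markdown` objects, merging paragraphs split across pages.

    Each element of `pages` is assumed to have `text`, `isStart`, and `isEnd`
    keys, as in the `markdown` objects returned in `layoutParsingResults`.
    """
    parts = []
    for i, page in enumerate(pages):
        if i > 0 and not pages[i - 1]["isEnd"] and not page["isStart"]:
            # The paragraph continues across the page break: join directly.
            parts.append(page["text"])
        else:
            # Separate complete paragraphs/pages with a blank line.
            parts.append(("\n\n" if parts else "") + page["text"])
    return "".join(parts)
```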
  • translate

Translate documents using large models.

POST /doctrans-translate

  • The attributes of the request body are as follows:
Name Type Meaning Required
markdownList array A list of Markdown objects to be translated, which can be obtained from the results of the analyzeImages operation. The images attribute is not used. Yes
targetLanguage string Refer to the description of the target_language parameter of the translate method of the pipeline object. No
chunkSize integer Refer to the description of the chunk_size parameter of the translate method of the pipeline object. No
taskDescription string|null Refer to the description of the task_description parameter of the translate method of the pipeline object. No
outputFormat string|null Refer to the description of the output_format parameter of the translate method of the pipeline object. No
rulesStr string|null Refer to the description of the rules_str parameter of the translate method of the pipeline object. No
fewShotDemoTextContent string|null Refer to the description of the few_shot_demo_text_content parameter of the translate method of the pipeline object. No
fewShotDemoKeyValueList string|null Refer to the description of the few_shot_demo_key_value_list parameter of the translate method of the pipeline object. No
llmRequestInterval number|null Refer to the description of the llm_request_interval parameter of the translate method of the pipeline object. No
chatBotConfig object|null Refer to the description of the chat_bot_config parameter of the translate method of the pipeline object. No
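A minimal request body for this operation might look like the sketch below; the chunk size is an illustrative value, and markdown_list is assumed to hold the markdown objects collected from the analyzeImages results, as in the full example at the end of this section:

```python
# Minimal sketch of a /doctrans-translate request body.
payload = {
    "markdownList": markdown_list,  # `markdown` objects from analyzeImages
    "targetLanguage": "en",
    "chunkSize": 3000,  # illustrative value, not a documented default
}
```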
  • When request processing is successful, the response body's result has the following attributes:
Name Type Meaning
translationResults array Translation results.

Each element in translationResults is an object with the following attributes:

Name Type Meaning
language string Target language.
markdown object Markdown result. The object definition is consistent with the markdown returned by the analyzeImages operation.
  • Note:

    Including sensitive parameters, such as the API key for large model calls, in the request body may pose security risks. Unless necessary, set these parameters in the configuration file and do not pass them in requests.
Example of multilingual service invocation:

```python
import base64
import pathlib
import pprint
import sys

import requests

API_BASE_URL = "http://127.0.0.1:8080"

file_path = "./demo.jpg"
target_language = "en"

with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

payload = {
    "file": file_data,
    "fileType": 1,
}

resp_visual = requests.post(url=f"{API_BASE_URL}/doctrans-visual", json=payload)
if resp_visual.status_code != 200:
    print(
        f"Request to doctrans-visual failed with status code {resp_visual.status_code}."
    )
    pprint.pp(resp_visual.json())
    sys.exit(1)

result_visual = resp_visual.json()["result"]

markdown_list = []
for i, res in enumerate(result_visual["layoutParsingResults"]):
    md_dir = pathlib.Path(f"markdown_{i}")
    md_dir.mkdir(exist_ok=True)
    (md_dir / "doc.md").write_text(res["markdown"]["text"])
    for img_path, img in res["markdown"]["images"].items():
        img_path = md_dir / img_path
        img_path.parent.mkdir(parents=True, exist_ok=True)
        img_path.write_bytes(base64.b64decode(img))
    print(f"The Markdown document to be translated is saved at {md_dir / 'doc.md'}")
    del res["markdown"]["images"]
    markdown_list.append(res["markdown"])
    for img_name, img in res["outputImages"].items():
        img_path = f"{img_name}_{i}.jpg"
        with open(img_path, "wb") as f:
            f.write(base64.b64decode(img))
        print(f"Output image saved at {img_path}")

payload = {
    "markdownList": markdown_list,
    "targetLanguage": target_language,
}

resp_translate = requests.post(url=f"{API_BASE_URL}/doctrans-translate", json=payload)
if resp_translate.status_code != 200:
    print(
        f"Request to doctrans-translate failed with status code {resp_translate.status_code}."
    )
    pprint.pprint(resp_translate.json())
    sys.exit(1)

result_translate = resp_translate.json()["result"]

for i, res in enumerate(result_translate["translationResults"]):
    md_dir = pathlib.Path(f"markdown_{i}")
    (md_dir / "doc_translated.md").write_text(res["markdown"]["text"])
    print(f"Translated markdown document saved at {md_dir / 'doc_translated.md'}")
```
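To translate a PDF file instead of an image, only the request payload changes. In the sketch below, fileType 0 is assumed to denote PDF input (1 denotes image); the rest of the flow is unchanged:

```python
import base64

# Variant of the payload above: send a PDF to the same endpoints.
# fileType 0 is assumed to denote PDF input (1 denotes image).
with open("./demo.pdf", "rb") as f:
    payload = {
        "file": base64.b64encode(f.read()).decode("ascii"),
        "fileType": 0,
    }
```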
    


    📱On-device deployment: On-device deployment places computing and data processing on the user's device itself, allowing the device to process data directly without relying on a remote server. PaddleX supports deploying models on devices such as Android. For the detailed procedure, please refer to the PaddleX On-device Deployment Guide.

    You can choose an appropriate way to deploy the model pipeline according to your needs, and then proceed with subsequent AI application integration.

    4. Secondary Development

    If the default model weights provided by the Layout Analysis v3 sub-pipeline in the general document translation pipeline do not meet your accuracy or speed requirements in your scenario, you can try to use data from your own specific domain or application scenario to further fine-tune the existing models, thereby improving the recognition performance of the general Layout Analysis v3 sub-pipeline in your scenario.

    4.1 Model Fine-tuning

    Since the general Layout Analysis v3 sub-pipeline contains several modules, subpar pipeline performance may stem from any one of them. You can analyze the cases with poor results, use the visualized images to identify which module is problematic, and then refer to the corresponding fine-tuning tutorial links in the table below to fine-tune the model.

    | Scenario | Fine-tuning module | Fine-tuning reference link |
    |---|---|---|
    | Inaccurate detection of layout regions, such as failure to detect seals and tables | Layout region detection module | Link |
    | Inaccurate recognition of table structures | Table structure recognition module | Link |
    | Inaccurate recognition of formulas | Formula recognition module | Link |
    | Omissions in detecting seal text | Seal text detection module | Link |
    | Omissions in detecting text | Text detection module | Link |
    | Inaccurate text content | Text recognition module | Link |
    | Inaccurate correction of vertical or rotated text lines | Text line orientation classification module | Link |
    | Inaccurate correction of whole-image rotation | Document image orientation classification module | Link |
    | Inaccurate correction of image distortion | Text image rectification module | Fine-tuning is currently not supported |

    4.2 Model Application

    After completing fine-tuning training with your private dataset, you can obtain a local model weight file.

    If you need to use the fine-tuned model weights, simply modify the pipeline configuration file, filling in the local paths of the fine-tuned model weights at the corresponding locations:

```yaml
......
SubModules:
  LayoutDetection:
    module_name: layout_detection
    model_name: PP-DocLayout_plus-L
    model_dir: null # Replace with the path to the weights of the fine-tuned layout region detection model
......
SubPipelines:
  GeneralOCR:
    pipeline_name: OCR
    text_type: general
    use_doc_preprocessor: False
    use_textline_orientation: False
    SubModules:
      TextDetection:
        module_name: text_detection
        model_name: PP-OCRv5_server_det
        model_dir: null # Replace with the path to the weights of the fine-tuned text detection model
        limit_side_len: 960
        limit_type: max
        max_side_limit: 4000
        thresh: 0.3
        box_thresh: 0.6
        unclip_ratio: 1.5

      TextRecognition:
        module_name: text_recognition
        model_name: PP-OCRv5_server_rec
        model_dir: null # Replace with the path to the weights of the fine-tuned text recognition model
        batch_size: 1
        score_thresh: 0
......
```
    

    Then, refer to the command-line method or Python script method in the local experience to load the modified pipeline configuration file.
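For instance, in a Python script the modified configuration file could be loaded as follows (the file path is illustrative):

```python
from paddlex import create_pipeline

# Load the modified pipeline configuration file (path is illustrative).
pipeline = create_pipeline(pipeline="./my_path/PP-DocTranslation.yaml")
```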

    5. Multi-hardware Support

    PaddleX supports multiple mainstream hardware devices such as NVIDIA GPUs, Kunlunxin XPUs, Ascend NPUs, and Cambricon MLUs, and switching between different hardware only requires setting the device parameter.

    For example, when using the general document translation pipeline, to change the running device from an NVIDIA GPU to an Ascend NPU, you only need to change device to npu in the script:

```python
from paddlex import create_pipeline

pipeline = create_pipeline(
    pipeline="PP-DocTranslation",
    device="npu:0",  # change gpu:0 to npu:0
)
```
    

    If you want to use the general document translation pipeline on more types of hardware, please refer to the PaddleX Multi-Hardware Usage Guide.