---
comments: true
---
# Seal Text Recognition Pipeline Tutorial
## 1. Introduction to Seal Text Recognition Pipeline
Seal text recognition is a technology that automatically extracts and recognizes seal content from documents or images. It is part of document processing and has applications in many scenarios, such as contract comparison, warehouse entry and exit review, and invoice reimbursement review.
The seal text recognition pipeline recognizes the text content of seals, extracting the text from seal images and outputting it in text form. The pipeline integrates the industry-renowned end-to-end OCR system PP-OCRv4 and supports the detection and recognition of curved seal text. It also integrates an optional layout region detection module, which can accurately locate the seal within the document, as well as optional document image orientation classification and distortion correction modules. Based on this pipeline, accurate text prediction at the millisecond level can be achieved on a CPU. The pipeline also provides flexible service deployment methods, supporting multiple programming languages on various hardware, and offers secondary development capabilities: you can train and fine-tune on your own dataset, and the trained model can be seamlessly integrated.
The seal text recognition pipeline includes a seal text detection module and a text recognition module, as well as optional layout detection module, document image orientation classification module, and text image correction module.
Choose a model according to your priorities: higher accuracy, faster inference speed, or smaller storage size.
Layout Region Detection Module (Optional):

* Layout detection model, including 23 common categories: document title, paragraph title, text, page number, abstract, table of contents, references, footnotes, header, footer, algorithm, formula, formula number, image, chart title, table, table title, seal, figure title, figure, header image, footer image, sidebar text

| Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Description |
|---|---|---|---|---|---|---|
| PicoDet_layout_1x | Inference Model/Trained Model | 86.8 | 9.03 / 3.10 | 25.82 / 20.70 | 7.4 | An efficient layout region localization model trained on the PubLayNet dataset based on PicoDet-1x; it can locate five types of regions: text, titles, tables, images, and lists. |
| PicoDet_layout_1x_table | Inference Model/Trained Model | 95.7 | 8.02 / 3.09 | 23.70 / 20.41 | 7.4 | An efficient layout region localization model trained on the PubLayNet dataset based on PicoDet-1x; it can locate one category: tables. |
| PicoDet-S_layout_3cls | Inference Model/Trained Model | 87.1 | 8.99 / 2.22 | 16.11 / 8.73 | 4.8 | A high-efficiency layout region localization model trained on a self-built dataset based on PicoDet-S for scenarios such as Chinese and English papers, magazines, and research reports; it includes three categories: tables, images, and seals. |
| PicoDet-S_layout_17cls | Inference Model/Trained Model | 70.3 | 9.11 / 2.12 | 15.42 / 9.12 | 4.8 | A high-efficiency layout region localization model trained on a self-built dataset based on PicoDet-S for scenarios such as Chinese and English papers, magazines, and research reports; it includes 17 common layout categories: paragraph titles, images, text, numbers, abstracts, content, chart titles, formulas, tables, table titles, references, document titles, footnotes, headers, algorithms, footers, and seals. |
| PicoDet-L_layout_3cls | Inference Model/Trained Model | 89.3 | 13.05 / 4.50 | 41.30 / 41.30 | 22.6 | An efficient layout region localization model trained on a self-built dataset based on PicoDet-L for scenarios such as Chinese and English papers, magazines, and research reports; it includes three categories: tables, images, and seals. |
| PicoDet-L_layout_17cls | Inference Model/Trained Model | 79.9 | 13.50 / 4.69 | 43.32 / 43.32 | 22.6 | An efficient layout region localization model trained on a self-built dataset based on PicoDet-L for scenarios such as Chinese and English papers, magazines, and research reports; it includes 17 common layout categories: paragraph titles, images, text, numbers, abstracts, content, chart titles, formulas, tables, table titles, references, document titles, footnotes, headers, algorithms, footers, and seals. |
| RT-DETR-H_layout_3cls | Inference Model/Trained Model | 95.9 | 114.93 / 27.71 | 947.56 / 947.56 | 470.1 | A high-precision layout region localization model trained on a self-built dataset based on RT-DETR-H for scenarios such as Chinese and English papers, magazines, and research reports; it includes three categories: tables, images, and seals. |
| RT-DETR-H_layout_17cls | Inference Model/Trained Model | 92.6 | 115.29 / 104.09 | 995.27 / 995.27 | 470.2 | A high-precision layout region localization model trained on a self-built dataset based on RT-DETR-H for scenarios such as Chinese and English papers, magazines, and research reports; it includes 17 common layout categories: paragraph titles, images, text, numbers, abstracts, content, chart titles, formulas, tables, table titles, references, document titles, footnotes, headers, algorithms, footers, and seals. |
* Layout detection model, including 3 categories: tables, images, and seals

| Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
|---|---|---|---|---|---|---|
| PicoDet-S_layout_3cls | Inference Model/Training Model | 88.2 | 8.99 / 2.22 | 16.11 / 8.73 | 4.8 | A high-efficiency layout area localization model trained on a self-built dataset for Chinese and English papers, magazines, and research reports based on the lightweight PicoDet-S model |
| PicoDet-L_layout_3cls | Inference Model/Training Model | 89.0 | 13.05 / 4.50 | 41.30 / 41.30 | 22.6 | A layout area localization model with balanced efficiency and accuracy, trained on a self-built dataset for Chinese and English papers, magazines, and research reports based on PicoDet-L |
| RT-DETR-H_layout_3cls | Inference Model/Training Model | 95.8 | 114.93 / 27.71 | 947.56 / 947.56 | 470.1 | A high-precision layout area localization model trained on a self-built dataset for Chinese and English papers, magazines, and research reports based on RT-DETR-H |
* Layout detection model, including 17 common layout categories

| Model | Model Download Link | mAP(0.5) (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Introduction |
|---|---|---|---|---|---|---|
| PicoDet-S_layout_17cls | Inference Model/Training Model | 87.4 | 9.11 / 2.12 | 15.42 / 9.12 | 4.8 | A high-efficiency layout area localization model trained on a self-built dataset for Chinese and English papers, magazines, and research reports based on the lightweight PicoDet-S model |
| PicoDet-L_layout_17cls | Inference Model/Training Model | 89.0 | 13.50 / 4.69 | 43.32 / 43.32 | 22.6 | A layout area localization model with balanced efficiency and accuracy, trained on a self-built dataset for Chinese and English papers, magazines, and research reports based on PicoDet-L |
| RT-DETR-H_layout_17cls | Inference Model/Training Model | 98.3 | 115.29 / 104.09 | 995.27 / 995.27 | 470.2 | A high-precision layout area localization model trained on a self-built dataset for Chinese and English papers, magazines, and research reports based on RT-DETR-H |
Document Image Orientation Classification Module (Optional):
| Model | Model Download Link | Top-1 Acc (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-LCNet_x1_0_doc_ori | Inference Model/Training Model | 99.06 | 2.31 / 0.43 | 3.37 / 1.27 | 7 | A document image classification model based on PP-LCNet_x1_0, containing four categories: 0 degrees, 90 degrees, 180 degrees, and 270 degrees |
Note: The above accuracy metrics are evaluated on a self-built dataset covering multiple scenarios such as certificates and documents, containing 1,000 images. GPU inference time is based on an NVIDIA Tesla T4 machine with FP32 precision; CPU inference speed is based on an Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision.
Text Image Correction Module (Optional):
| Model | Model Download Link | CER | Model Storage Size (M) | Description |
|---|---|---|---|---|
| UVDoc | Inference Model/Training Model | 0.179 | 30.3 | High-precision text image correction model |
Text Detection Module:
| Model | Model Download Link | Detection Hmean (%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv4_server_seal_det | Inference Model/Trained Model | 98.21 | 74.75 / 67.72 | 382.55 / 382.55 | 109 | PP-OCRv4 server-side seal text detection model with higher accuracy, suitable for deployment on more capable servers |
| PP-OCRv4_mobile_seal_det | Inference Model/Trained Model | 96.47 | 7.82 / 3.09 | 48.28 / 23.97 | 4.6 | PP-OCRv4 mobile-side seal text detection model with higher efficiency, suitable for edge deployment |
Text Recognition Module:
| Model | Model Download Link | Recognition Avg Accuracy(%) | GPU Inference Time (ms) [Normal Mode / High-Performance Mode] | CPU Inference Time (ms) [Normal Mode / High-Performance Mode] | Model Storage Size (M) | Description |
|---|---|---|---|---|---|---|
| PP-OCRv4_mobile_rec | Inference Model/Trained Model | 78.20 | 4.82 / 4.82 | 16.74 / 4.64 | 10.6 | The PP-OCRv4 recognition model is an upgrade of PP-OCRv3. At comparable speed, accuracy in Chinese and English scenarios is further improved, and the average recognition accuracy of the 80 multilingual models is increased by more than 8%. |
| PP-OCRv4_server_rec | Inference Model/Trained Model | 79.20 | 6.58 / 6.58 | 33.17 / 33.17 | 71.2 | A high-precision server-side text recognition model featuring high accuracy, fast speed, and multilingual support, suitable for text recognition tasks in various scenarios. |
| PP-OCRv3_mobile_rec | Inference Model/Training Model | - | 5.87 / 5.87 | 9.07 / 4.28 | 10.6 | An ultra-lightweight OCR model suitable for mobile applications. It enhances recognition accuracy and efficiency through techniques such as data augmentation and mixed-precision training. With a model size of 10.6 M, it is suitable for deployment on resource-constrained devices and can be used in scenarios such as mobile photo translation and business card recognition. |
Note: The evaluation set for the above accuracy metrics is the Chinese dataset built by PaddleOCR, covering multiple scenarios such as street view, web images, documents, and handwriting; the text recognition portion contains 11,000 images. GPU inference time for all models is based on an NVIDIA Tesla T4 machine with FP32 precision; CPU inference speed is based on an Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision.
| Model | Model Download Link | Recognition Avg Accuracy(%) | GPU Inference Time (ms) | CPU Inference Time (ms) | Model Storage Size (M) | Introduction |
|---|---|---|---|---|---|---|
| ch_SVTRv2_rec | Inference Model/Training Model | 68.81 | 8.36801 | 165.706 | 73.9 M | SVTRv2 is a server text recognition model developed by the OpenOCR team of Fudan University's Visual and Learning Laboratory (FVL). It won the first prize in the PaddleOCR Algorithm Model Challenge - Task One: OCR End-to-End Recognition Task. The end-to-end recognition accuracy on the A list is 6% higher than that of PP-OCRv4. |
Note: The evaluation set for the above accuracy indicators is the PaddleOCR Algorithm Model Challenge - Task One: OCR End-to-End Recognition Task A list. The GPU inference time for all models is based on NVIDIA Tesla T4 machines with FP32 precision type. The CPU inference speed is based on Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision type.
| Model | Model Download Link | Recognition Avg Accuracy(%) | GPU Inference Time (ms) | CPU Inference Time (ms) | Model Storage Size (M) | Introduction |
|---|---|---|---|---|---|---|
| ch_RepSVTR_rec | Inference Model/Training Model | 65.07 | 10.5047 | 51.5647 | 22.1 M | The RepSVTR text recognition model is a mobile text recognition model based on SVTRv2. It won the first prize in the PaddleOCR Algorithm Model Challenge - Task One: OCR End-to-End Recognition Task. The end-to-end recognition accuracy on the B list is 2.5% higher than that of PP-OCRv4, with the same inference speed. |
Note: The evaluation set for the above accuracy indicators is the PaddleOCR Algorithm Model Challenge - Task One: OCR End-to-End Recognition Task B list. The GPU inference time for all models is based on NVIDIA Tesla T4 machines with FP32 precision type. The CPU inference speed is based on Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision type.
* English Recognition Model

| Model | Model Download Link | Recognition Avg Accuracy(%) | GPU Inference Time (ms) | CPU Inference Time (ms) | Model Storage Size (M) | Introduction |
|---|---|---|---|---|---|---|
| en_PP-OCRv4_mobile_rec | Inference Model/Training Model | | | | | [Latest] Further upgraded based on PP-OCRv3, with improved accuracy at comparable speed. |
| en_PP-OCRv3_mobile_rec | Inference Model/Training Model | | | | | Ultra-lightweight model, supporting English and numeric recognition. |
* Multilingual Recognition Models

| Model | Model Download Link | Recognition Avg Accuracy(%) | GPU Inference Time (ms) | CPU Inference Time (ms) | Model Storage Size (M) | Introduction |
|---|---|---|---|---|---|---|
| korean_PP-OCRv3_mobile_rec | Inference Model/Training Model | | | | | Korean Recognition |
| japan_PP-OCRv3_mobile_rec | Inference Model/Training Model | | | | | Japanese Recognition |
| chinese_cht_PP-OCRv3_mobile_rec | Inference Model/Training Model | | | | | Traditional Chinese Recognition |
| te_PP-OCRv3_mobile_rec | Inference Model/Training Model | | | | | Telugu Recognition |
| ka_PP-OCRv3_mobile_rec | Inference Model/Training Model | | | | | Kannada Recognition |
| ta_PP-OCRv3_mobile_rec | Inference Model/Training Model | | | | | Tamil Recognition |
| latin_PP-OCRv3_mobile_rec | Inference Model/Training Model | | | | | Latin Recognition |
| arabic_PP-OCRv3_mobile_rec | Inference Model/Training Model | | | | | Arabic Script Recognition |
| cyrillic_PP-OCRv3_mobile_rec | Inference Model/Training Model | | | | | Cyrillic Script Recognition |
| devanagari_PP-OCRv3_mobile_rec | Inference Model/Training Model | | | | | Devanagari Script Recognition |
If you are satisfied with the pipeline's performance, you can directly integrate and deploy it. You can download the deployment package from the cloud, or refer to the methods in [Section 2.2 Local Experience](#22-local-experience) for local deployment. If you are not satisfied with the results, you can fine-tune the models in the pipeline using your private data. If you have local hardware resources for training, you can start training directly on your local machine; if not, the Star River zero-code platform provides a one-click training service: you don't need to write any code, just upload your data and start the training task with one click.
### 2.2 Local Experience
> ❗ Before using the seal text recognition pipeline locally, please ensure that you have completed the installation of the PaddleX wheel package according to the [PaddleX Installation Guide](../../../installation/installation.en.md).
#### 2.2.1 Command Line Experience
You can quickly experience the seal text recognition pipeline with a single command. Use the [test file](https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/seal_text_det.png), and replace `--input` with the local path for prediction.
```bash
paddlex --pipeline seal_recognition \
--input seal_text_det.png \
--use_doc_orientation_classify False \
--use_doc_unwarping False \
--device gpu:0 \
--save_path ./output
```
The relevant parameter descriptions can be found in [2.2.2 Python Script Integration](#222-python-script-integration).
After running, the result will be printed to the terminal.
#### 2.2.2 Python Script Integration
* The command line above is for a quick experience and viewing of results. In a project, you generally need to integrate via code. You can complete quick pipeline inference with just a few lines of code, as follows:
```python
from paddlex import create_pipeline
pipeline = create_pipeline(pipeline="seal_recognition")
output = pipeline.predict(
"seal_text_det.png",
use_doc_orientation_classify=False,
use_doc_unwarping=False,
)
for res in output:
res.print()
res.save_to_img("./output/")
res.save_to_json("./output/")
```
In the above Python script, the following steps are executed:

(1) Instantiate the seal text recognition pipeline object via `create_pipeline()`. The specific parameters are described as follows:
| Parameter | Description | Type | Default Value |
|---|---|---|---|
| `pipeline` | The name of the pipeline or the path to the pipeline configuration file. If it is a pipeline name, it must be a pipeline supported by PaddleX. | `str` | `None` |
| `config` | Specific configuration information for the pipeline (if set simultaneously with `pipeline`, it takes priority over `pipeline`, and the pipeline name in it must be consistent with `pipeline`). | `dict[str, Any]` | `None` |
| `device` | The device used for pipeline inference. It supports specifying a specific GPU card number, such as "gpu:0", card numbers of other hardware, such as "npu:0", or CPU, such as "cpu". | `str` | `gpu:0` |
| `use_hpip` | Whether to enable high-performance inference. Available only if the pipeline supports high-performance inference. | `bool` | `False` |
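For example, a minimal sketch (assuming a CPU-only machine; the arguments simply restate the documented options) of creating the pipeline with these parameters:

```python
from paddlex import create_pipeline

# Create the seal text recognition pipeline on CPU, with
# high-performance inference left at its documented default (off).
pipeline = create_pipeline(
    pipeline="seal_recognition",
    device="cpu",
    use_hpip=False,
)
```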
(2) Call the `predict()` method of the seal text recognition pipeline object to run inference; it returns a generator of prediction results. The parameters of the `predict()` method are as follows:

| Parameter | Description | Type | Options | Default Value |
|---|---|---|---|---|
| `input` | Data to be predicted; supports multiple input types (required) | `Python Var\|str\|list` | | `None` |
| `device` | Inference device for the pipeline | `str\|None` | | `None` |
| `use_doc_orientation_classify` | Whether to use the document orientation classification module | `bool\|None` | | `None` |
| `use_doc_unwarping` | Whether to use the document unwarping module | `bool\|None` | | `None` |
| `use_layout_detection` | Whether to use the layout detection module | `bool\|None` | | `None` |
| `layout_threshold` | Confidence threshold for layout detection; only results scoring above this threshold are output | `float\|dict\|None` | | `None` |
| `layout_nms` | Whether to use Non-Maximum Suppression (NMS) in layout detection post-processing | `bool\|None` | | `None` |
| `layout_unclip_ratio` | Expansion ratio for detection box edges; if not specified, the default from the official PaddleX model configuration is used | `float\|list\|None` | | `None` |
| `layout_merge_bboxes_mode` | Merging mode for detection boxes in layout detection output; if not specified, the default from the official PaddleX model configuration is used | `string\|None` | | `None` |
| `seal_det_limit_side_len` | Image side length limit for seal text detection | `int\|None` | | `None` |
| `seal_rec_score_thresh` | Text recognition threshold; text results with scores above this threshold are retained | `float\|None` | | `None` |
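As an illustration, a hedged sketch of a `predict()` call combining several of these parameters (the threshold values are arbitrary examples, not recommended settings):

```python
# Run prediction with a few of the documented optional parameters.
output = pipeline.predict(
    "seal_text_det.png",
    use_doc_orientation_classify=False,
    use_doc_unwarping=False,
    use_layout_detection=True,    # locate the seal region before detection
    layout_threshold=0.5,         # keep layout boxes scoring above 0.5
    seal_rec_score_thresh=0.8,    # keep recognized texts scoring above 0.8
)
```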
(3) Process the prediction results. The result for each sample supports printing and saving, with the following methods:

| Method | Description | Parameter | Parameter Type | Parameter Description | Default Value |
|---|---|---|---|---|---|
| `print()` | Print results to the terminal | `format_json` | `bool` | Whether to format the output content with JSON indentation | `True` |
| | | `indent` | `int` | Indentation level to beautify the output JSON data for better readability; effective only when `format_json` is `True` | `4` |
| | | `ensure_ascii` | `bool` | Whether to escape non-ASCII characters to Unicode. When `True`, all non-ASCII characters are escaped; `False` retains the original characters; effective only when `format_json` is `True` | `False` |
| `save_to_json()` | Save results as a JSON file | `save_path` | `str` | The file path to save the results. When it is a directory, the saved file name is consistent with the input file type | `None` |
| | | `indent` | `int` | Indentation level to beautify the output JSON data for better readability; effective only when `format_json` is `True` | `4` |
| | | `ensure_ascii` | `bool` | Whether to escape non-ASCII characters to Unicode. When `True`, all non-ASCII characters are escaped; `False` retains the original characters; effective only when `format_json` is `True` | `False` |
| `save_to_img()` | Save results as an image file | `save_path` | `str` | The file path to save the results; supports a directory or file path | `None` |
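For instance, a short sketch of these methods in use (reusing the `output` generator from the script above; paths are illustrative):

```python
for res in output:
    # Pretty-print to the terminal, keeping non-ASCII characters readable.
    res.print(format_json=True, indent=4, ensure_ascii=False)
    # Directory paths are allowed; file names then follow the input file.
    res.save_to_json("./output/")
    res.save_to_img("./output/")
```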
In addition, the result object provides the following attributes:

| Attribute | Description |
|---|---|
| `json` | Get the prediction results in `json` format. |
| `img` | Get the visualization results in `dict` format. |
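A brief sketch of accessing these attributes (again assuming the `output` generator from the script above):

```python
for res in output:
    json_data = res.json   # prediction results in json form
    vis_images = res.img   # visualization results as a dict
    print(json_data)
```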
For the main operations provided by the service: when a request is processed successfully, the response status code is 200 and the response body has the following properties:

| Name | Type | Description |
|---|---|---|
| `logId` | `string` | The UUID of the request. |
| `errorCode` | `integer` | Error code. Fixed at 0. |
| `errorMsg` | `string` | Error message. Fixed at "Success". |
| `result` | `object` | Operation result. |
When a request is not processed successfully, the response body has the following properties:

| Name | Type | Description |
|---|---|---|
| `logId` | `string` | The UUID of the request. |
| `errorCode` | `integer` | Error code. Same as the response status code. |
| `errorMsg` | `string` | Error message. |
The main operations provided by the service are as follows:

* `infer`: Get seal text recognition results.

`POST /seal-recognition`

The request body has the following properties:

| Name | Type | Description | Required |
|---|---|---|---|
| `file` | `string` | The URL of an image or PDF file accessible to the server, or the Base64-encoded content of such a file. For PDF files exceeding 10 pages, only the first 10 pages are used. | Yes |
| `fileType` | `integer` | File type. `0` indicates a PDF file; `1` indicates an image file. If this property is absent from the request body, the file type is inferred from the URL. | No |
When the request is processed successfully, the `result` property of the response body has the following properties:

| Name | Type | Description |
|---|---|---|
| `sealRecResults` | `array` | Seal text recognition results. The array length is 1 (for image input) or the smaller of the document page count and 10 (for PDF input). For PDF input, each element represents the processing result of one page of the PDF file, in order. |
| `dataInfo` | `object` | Input data information. |
Each element in `sealRecResults` is an object with the following properties:

| Name | Type | Description |
|---|---|---|
| `texts` | `array` | Text positions, contents, and scores. |
| `inputImage` | `string` | Input image. The image is in JPEG format and encoded in Base64. |
| `layoutImage` | `string` | Layout region detection result image. The image is in JPEG format and encoded in Base64. |
| `ocrImage` | `string` | OCR result image. The image is in JPEG format and encoded in Base64. |
Each element in `texts` is an object with the following properties:

| Name | Type | Description |
|---|---|---|
| `poly` | `array` | Text position. The elements of the array are the vertex coordinates of the polygon enclosing the text. |
| `text` | `string` | Text content. |
| `score` | `number` | Text recognition score. |
Below is a Python example of calling the service:

```python
import base64
import requests

API_URL = "http://localhost:8080/seal-recognition"
file_path = "./demo.jpg"

# Read the local file and Base64-encode it for the request body.
with open(file_path, "rb") as file:
    file_bytes = file.read()
    file_data = base64.b64encode(file_bytes).decode("ascii")

# fileType=1 indicates an image file; use 0 for a PDF file.
payload = {"file": file_data, "fileType": 1}

response = requests.post(API_URL, json=payload)
assert response.status_code == 200
result = response.json()["result"]

# Print the recognized texts and save the visualization images per page.
for i, res in enumerate(result["sealRecResults"]):
    print("Detected texts:")
    print(res["texts"])
    layout_img_path = f"layout_{i}.jpg"
    with open(layout_img_path, "wb") as f:
        f.write(base64.b64decode(res["layoutImage"]))
    ocr_img_path = f"ocr_{i}.jpg"
    with open(ocr_img_path, "wb") as f:
        f.write(base64.b64decode(res["ocrImage"]))
    print(f"Output images saved at {layout_img_path} and {ocr_img_path}")
```
If the default models do not meet your requirements in your scenario, you can fine-tune the corresponding module according to the table below:

| Scenario | Fine-Tuning Module | Fine-Tuning Reference Link |
|---|---|---|
| Inaccurate or missing seal position detection | Layout Detection Module | Link |
| Missing text detection | Text Detection Module | Link |
| Inaccurate text content | Text Recognition Module | Link |
| Inaccurate full-image rotation correction | Document Image Orientation Classification Module | Link |
| Inaccurate image distortion correction | Text Image Correction Module | Not supported for fine-tuning |