# Document Understanding Pipeline User Guide
## 1. Introduction to Document Understanding Pipeline
The Document Understanding Pipeline is an advanced document processing technology based on Vision-Language Models (VLM), designed to overcome the limitations of traditional document processing. Traditional methods rely on fixed templates or predefined rules to parse documents, but this pipeline uses the multimodal capabilities of VLM to accurately answer user questions by integrating visual and linguistic information with just the input of document images and user queries. This technology does not require pre-training for specific document formats, allowing it to flexibly handle diverse document content and significantly enhance the generalization and practicality of document processing. It has broad application prospects in scenarios such as intelligent Q&A and information extraction. Currently, this pipeline does not support secondary development of VLM models, but future support is planned.
The Document Understanding Pipeline includes document-based vision-language model modules. You can choose the model to use based on the benchmark test data below.
If you prioritize model accuracy, choose a model with higher accuracy; if you care more about inference speed, choose a faster model; if you are concerned about storage size, choose a model with a smaller storage footprint.
Document-based Vision-Language Model Modules (Optional):
| Model | Download Link | Storage Size (GB) | Model Score | Description |
|---|---|---|---|---|
| PP-DocBee-2B | Inference Model | 4.2 | 765 | PP-DocBee is a multimodal large model developed by the PaddlePaddle team, focused on document understanding, with excellent performance on Chinese document understanding tasks. The model is fine-tuned and optimized on nearly 5 million multimodal samples for document understanding, covering general VQA, OCR, tables, text-rich images, math and complex reasoning, synthetic data, and pure text data, with tuned training-data ratios. On several authoritative English document understanding evaluation leaderboards in academia, PP-DocBee has generally achieved SOTA at the same parameter level. In internal business Chinese scenarios, PP-DocBee also exceeds current popular open-source and closed-source models. |
| PP-DocBee-7B | Inference Model | 15.8 | - | |
| PP-DocBee2-3B | Inference Model | 7.6 | 852 | PP-DocBee2 is a multimodal large model independently developed by the PaddlePaddle team, specifically tailored for document understanding. Building upon PP-DocBee, the team has further optimized the foundational model and introduced a new data optimization scheme to enhance data quality. With just a relatively small dataset of 470,000 samples generated using the team's proprietary data synthesis strategy, PP-DocBee2 demonstrates superior performance in Chinese document understanding tasks. In terms of internal business metrics for Chinese-language scenarios, PP-DocBee2 has achieved an approximately 11.4% improvement over PP-DocBee, outperforming both current popular open-source and closed-source models of a similar scale. |
| Parameter | Description | Type | Default |
|---|---|---|---|
| pipeline | Pipeline name or configuration file path. If it is a pipeline name, it must be a pipeline supported by PaddleX. | str | None |
| config | Specific configuration information for the pipeline (if set together with pipeline, it takes higher priority, and the pipeline name in it must be consistent with pipeline). | dict[str, Any] | None |
| device | Inference device for the pipeline. Supports specifying a specific GPU card number, such as "gpu:0", other hardware card numbers, such as "npu:0", or "cpu" for CPU. | str | None |
| use_hpip | Whether to enable the high-performance inference plugin. If set to None, the setting from the configuration file or config is used. Not supported for now. | bool \| None | None |
| hpi_config | High-performance inference configuration. Not supported for now. | dict \| None | None |
| Parameter | Description | Type | Options | Default |
|---|---|---|---|---|
| input | Data to be predicted; currently only dictionary-type input is supported | Python Dict | - | None |
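Since the exact keys of the input dictionary depend on the pipeline configuration, the sketch below uses an assumed schema of an image path plus a user question; the key names `image` and `query` are illustrative, not a confirmed contract, so check the pipeline's own documentation for the real schema:

```python
# Build a dictionary-type input for the pipeline's predict call.
# NOTE: the key names "image" and "query" are assumed for illustration only.
input_data = {
    "image": "medal_table.png",  # path (or URL) to the document image
    "query": "Which country won the most gold medals?",  # the user question
}

assert isinstance(input_data, dict)
print(sorted(input_data.keys()))  # -> ['image', 'query']
```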
| Method | Description | Parameter | Type | Parameter Description | Default |
|---|---|---|---|---|---|
| print() | Print results to the terminal | format_json | bool | Whether to format the output content with JSON indentation | True |
| | | indent | int | Indentation level used to beautify the JSON output and make it more readable; effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Whether to escape non-ASCII characters to Unicode. When set to True, all non-ASCII characters are escaped; False retains the original characters. Effective only when format_json is True | False |
| save_to_json() | Save results as a JSON file | save_path | str | Path to save the file. If it is a directory, the saved file name is consistent with the input file name | None |
| | | indent | int | Indentation level used to beautify the JSON output and make it more readable; effective only when format_json is True | 4 |
| | | ensure_ascii | bool | Whether to escape non-ASCII characters to Unicode. When set to True, all non-ASCII characters are escaped; False retains the original characters. Effective only when format_json is True | False |
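The `indent` and `ensure_ascii` parameters map directly onto Python's standard `json` serialization options, so their effect can be previewed with the standard library alone:

```python
import json

# A small result-like dict containing non-ASCII (Chinese) text
result = {"query": "金牌数", "answer": 48}

# indent=4 pretty-prints; ensure_ascii=False keeps non-ASCII characters as-is
pretty = json.dumps(result, indent=4, ensure_ascii=False)
# ensure_ascii=True escapes non-ASCII characters to \uXXXX sequences
escaped = json.dumps(result, ensure_ascii=True)

print(pretty)   # multi-line output with 4-space indentation, 金牌数 intact
print(escaped)  # single line, 金牌数 escaped to \uXXXX
```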
| Attribute | Description |
|---|---|
| json | Get the prediction result in JSON format |
| img | Get the visualized image as a dict |
For the main operations provided by the service:

When the request is processed successfully, the response status code is 200, and the response body contains the following attributes:

| Name | Type | Meaning |
|---|---|---|
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Fixed at 0. |
| errorMsg | string | Error message. Fixed at "Success". |
| result | object | Operation result. |
When the request is not processed successfully, the response body contains the following attributes:

| Name | Type | Meaning |
|---|---|---|
| logId | string | The UUID of the request. |
| errorCode | integer | Error code. Same as the response status code. |
| errorMsg | string | Error message. |
The main operations provided by the service are as follows:

**infer**

Generate a response based on the input message.

`POST /document-understanding`

Note: the above endpoint is an alias for `/chat/completion` and is compatible with OpenAI's interface.
| Name | Type | Meaning | Required | Default |
|---|---|---|---|---|
| model | string | Name of the model to use | Required | - |
| messages | array | List of dialogue messages | Required | - |
| max_tokens | integer | Maximum number of tokens to generate | Optional | 1024 |
| temperature | float | Sampling temperature | Optional | 0.1 |
| top_p | float | Top probability for nucleus sampling | Optional | 0.95 |
| stream | boolean | Whether to output in a streaming manner | Optional | false |
| max_image_tokens | integer | Maximum number of tokens of the input image | Optional | None |
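Assembled as JSON, a minimal request body for this operation might look like the following sketch; the field values are illustrative, and the optional fields simply repeat their documented defaults:

```python
import json

# Minimal request body for POST /document-understanding (values illustrative)
request_body = {
    "model": "pp-docbee",
    "messages": [
        {"role": "user", "content": "Summarize this document."}
    ],
    "max_tokens": 1024,  # optional, default 1024
    "temperature": 0.1,  # optional, default 0.1
    "stream": False,     # optional, default false
}

payload = json.dumps(request_body)
print(payload)
```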
Each element in messages is an object with the following attributes:

| Name | Type | Meaning | Required |
|---|---|---|---|
| role | string | Role of the message (user/assistant/system) | Required |
| content | string or array | Message content (text or mixed content) | Required |
When content is an array, each element is an object with the following attributes:

| Name | Type | Meaning | Required | Default |
|---|---|---|---|---|
| type | string | Content type (text/image_url) | Required | - |
| text | string | Text content (when type is text) | Conditionally required | - |
| image_url | string or object | Image URL or object (when type is image_url) | Conditionally required | - |
When image_url is an object, it contains the following attributes:

| Name | Type | Meaning | Required | Default |
|---|---|---|---|---|
| url | string | Image URL | Required | - |
| detail | string | Image detail processing method (low/high/auto) | Optional | auto |
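Putting the last three tables together, a user message that mixes text with a base64-encoded image can be built as below. The PNG bytes are a stand-in for a real image file, and `detail` is shown in its object form:

```python
import base64

# Stand-in for the bytes of a real image file
fake_png_bytes = b"\x89PNG\r\n\x1a\n..."
b64 = base64.b64encode(fake_png_bytes).decode("utf-8")

# A "content" array mixing a text part and an image_url part;
# image_url is given in object form so "detail" can be set.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Recognize the content of this table."},
        {
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}", "detail": "auto"},
        },
    ],
}

print(message["content"][1]["image_url"]["detail"])  # -> auto
```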
When the request is processed successfully, the result in the response body contains the following attributes:

| Name | Type | Meaning |
|---|---|---|
| id | string | Request ID |
| object | string | Object type (chat.completion) |
| created | integer | Creation timestamp |
| choices | array | Generated result options |
| usage | object | Token usage details |
Each element in choices is a Choice object with the following attributes:

| Name | Type | Meaning | Optional Values |
|---|---|---|---|
| finish_reason | string | Reason the model stopped generating tokens | stop (stopped naturally), length (reached the maximum token count), tool_calls (called a tool), content_filter (content filtered), function_call (called a function; deprecated) |
| index | integer | Index of the option in the list | - |
| logprobs | object \| null | Log probability information of the option | - |
| message | ChatCompletionMessage | Chat message generated by the model | - |
The message object contains the following attributes:

| Name | Type | Meaning | Note |
|---|---|---|---|
| content | string \| null | Message content | May be null |
| refusal | string \| null | Refusal message generated by the model | Provided when content is refused |
| role | string | Role of the message author | Fixed at "assistant" |
| audio | object \| null | Audio output data | Provided when audio output is requested |
| function_call | object \| null | Name and parameters of the function to be called | Deprecated; use tool_calls instead |
| tool_calls | array \| null | Tool calls generated by the model | e.g., function calls |
The usage object contains the following attributes:

| Name | Type | Meaning |
|---|---|---|
| prompt_tokens | integer | Number of prompt tokens |
| completion_tokens | integer | Number of generated tokens |
| total_tokens | integer | Total number of tokens |
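The three counters are related by total_tokens = prompt_tokens + completion_tokens, which gives a cheap consistency check when logging usage; the numbers below are made up for illustration:

```python
# Illustrative usage object; the numbers are made up.
usage = {"prompt_tokens": 812, "completion_tokens": 356, "total_tokens": 1168}

# total_tokens should equal the sum of the other two counters
assert usage["total_tokens"] == usage["prompt_tokens"] + usage["completion_tokens"]
print(usage["total_tokens"])  # -> 1168
```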
An example of result is shown below:

```json
{
    "id": "ed960013-eb19-43fa-b826-3c1b59657e35",
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {
                "content": "| 名次 | 国家/地区 | 金牌 | 银牌 | 铜牌 | 奖牌总数 |\n| --- | --- | --- | --- | --- | --- |\n| 1 | 中国(CHN) | 48 | 22 | 30 | 100 |\n| 2 | 美国(USA) | 36 | 39 | 37 | 112 |\n| 3 | 俄罗斯(RUS) | 24 | 13 | 23 | 60 |\n| 4 | 英国(GBR) | 19 | 13 | 19 | 51 |\n| 5 | 德国(GER) | 16 | 11 | 14 | 41 |\n| 6 | 澳大利亚(AUS) | 14 | 15 | 17 | 46 |\n| 7 | 韩国(KOR) | 13 | 11 | 8 | 32 |\n| 8 | 日本(JPN) | 9 | 8 | 8 | 25 |\n| 9 | 意大利(ITA) | 8 | 9 | 10 | 27 |\n| 10 | 法国(FRA) | 7 | 16 | 20 | 43 |\n| 11 | 荷兰(NED) | 7 | 5 | 4 | 16 |\n| 12 | 乌克兰(UKR) | 7 | 4 | 11 | 22 |\n| 13 | 肯尼亚(KEN) | 6 | 4 | 6 | 16 |\n| 14 | 西班牙(ESP) | 5 | 11 | 3 | 19 |\n| 15 | 牙买加(JAM) | 5 | 4 | 2 | 11 |\n",
                "role": "assistant"
            }
        }
    ],
    "created": 1745218041,
    "model": "pp-docbee",
    "object": "chat.completion"
}
```
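Since result is plain JSON, the assistant's reply can be pulled out with the standard library alone; the sketch below parses a trimmed-down version of the example response:

```python
import json

# Trimmed-down version of the example result above
raw = """
{
  "id": "ed960013-eb19-43fa-b826-3c1b59657e35",
  "choices": [
    {"finish_reason": "stop", "index": 0,
     "message": {"content": "| 名次 | 国家/地区 | ... |", "role": "assistant"}}
  ],
  "model": "pp-docbee",
  "object": "chat.completion"
}
"""

result = json.loads(raw)
answer = result["choices"][0]["message"]["content"]
print(result["choices"][0]["finish_reason"])  # -> stop
print(answer)
```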
```python
import base64
from openai import OpenAI

API_BASE_URL = "http://127.0.0.1:8080"

# Initialize the OpenAI client
client = OpenAI(
    api_key='xxxxxxxxx',
    base_url=f'{API_BASE_URL}'
)

# Function to convert an image to base64
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

# Input image path
image_path = "medal_table.png"

# Convert the original image to base64
base64_image = encode_image(image_path)

# Submit the request to the PP-DocBee model
response = client.chat.completions.create(
    model="pp-docbee",  # Select model
    messages=[
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Recognize the content of this table and output the content in HTML format."
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}
                },
            ]
        },
    ],
)
content = response.choices[0].message.content
print('Reply:', content)
```