In actual production environments, many applications impose stringent requirements on the performance of deployment strategies (especially response speed) to ensure efficient system operation and a smooth user experience. To this end, PaddleX provides a high-performance inference plugin that, through automatic configuration and multi-backend inference, significantly improves model inference speed without requiring users to deal with complex configurations or low-level details.
Before using the high-performance inference plugin, ensure that you have installed PaddleX according to the PaddleX Local Installation Tutorial and successfully run quick inference using the PaddleX pipeline command line or Python script instructions.
High-performance inference supports models in both PaddlePaddle and ONNX formats. For ONNX models, it is recommended to obtain them by converting Paddle models with the Paddle2ONNX plugin. If models in multiple formats exist in the model directory, PaddleX automatically selects the appropriate one as needed.
The processor architectures, operating systems, device types, and Python versions currently supported by high-performance inference are shown in the table below:
| Operating System | Processor Architecture | Device Type | Python Version |
|---|---|---|---|
| Linux | x86-64 | CPU | 3.8–3.12 |
| Linux | x86-64 | GPU (CUDA 11.8 + cuDNN 8.6) | 3.8–3.12 |
| Linux | x86-64 | NPU | 3.10 |
| Linux | aarch64 | NPU | 3.10 |
Refer to Get PaddleX based on Docker to start the PaddleX container with Docker. After starting the container, execute the installation command for your device type to install the high-performance inference plugin:
| Device Type | Installation Command | Description |
|---|---|---|
| CPU | `paddlex --install hpi-cpu` | Installs the CPU version of high-performance inference. |
| GPU | `paddlex --install hpi-gpu` | Installs the GPU version of high-performance inference, which includes all features of the CPU version; there is no need to install the CPU version separately. |
| NPU | `paddlex --install hpi-npu` | Installs the NPU version of high-performance inference. For usage instructions, please refer to the Ascend NPU High-Performance Inference Tutorial. |
For local (non-Docker) installation, first install CUDA 11.8 and cuDNN 8.6 locally, then execute the above installation commands.
Notes:
- GPU currently supports only CUDA 11.8 + cuDNN 8.6; support for CUDA 12.6 is in progress.
- Only one version of the high-performance inference plugin can be installed in the same environment.
- For NPU device usage instructions, refer to the Ascend NPU High-Performance Inference Tutorial.
- On Windows, the high-performance inference plugin can only be installed and used via Docker.
Below are examples of enabling high-performance inference for the general image classification pipeline and the image classification module using the PaddleX CLI and Python API.
For the PaddleX CLI, specify `--use_hpip` to enable high-performance inference.
General Image Classification Pipeline:
```bash
paddlex \
    --pipeline image_classification \
    --input https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg \
    --device gpu:0 \
    --use_hpip
```
Image Classification Module:
```bash
python main.py \
    -c paddlex/configs/modules/image_classification/ResNet18.yaml \
    -o Global.mode=predict \
    -o Predict.model_dir=None \
    -o Predict.input=https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg \
    -o Global.device=gpu:0 \
    -o Predict.use_hpip=True
```
For the PaddleX Python API, the method to enable high-performance inference is similar. Taking the General Image Classification Pipeline and Image Classification Module as examples:
General Image Classification Pipeline:
```python
from paddlex import create_pipeline

pipeline = create_pipeline(
    pipeline="image_classification",
    device="gpu",
    use_hpip=True
)

output = pipeline.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg")
```
Image Classification Module:
```python
from paddlex import create_model

model = create_model(
    model_name="ResNet18",
    device="gpu",
    use_hpip=True
)

output = model.predict("https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg")
```
The inference results obtained with the high-performance inference plugin enabled are consistent with those obtained without it. For some models, building the inference engine may take a relatively long time the first time the plugin is enabled. PaddleX caches relevant information in the model directory after the engine is first built and reuses the cached content in subsequent runs to speed up initialization.
By default, enabling high-performance inference affects the entire pipeline or module. To control the scope at a finer granularity, for example enabling the plugin only for a specific sub-pipeline or sub-module within a pipeline, you can set `use_hpip` at different levels of the pipeline configuration file. Please refer to 2.5 Enabling/Disabling High-Performance Inference in Sub-pipelines/Sub-modules.
This section introduces advanced usage of high-performance inference, intended for users who have some understanding of model deployment or who wish to configure and optimize it manually. By referring to the configuration descriptions and examples below, you can customize high-performance inference to your needs.
High-performance inference offers two modes:
- Safe auto-configuration mode: this mode includes a protection mechanism and, by default, automatically selects the better-performing configuration for the current environment. Users can override the default configuration, but the provided configuration is checked, and PaddleX rejects configurations known from prior knowledge to be unavailable. This is the default mode.
- Unrestricted manual configuration mode: this mode offers complete configuration freedom, allowing free selection of the inference backend and modification of its configuration, but successful inference is not guaranteed. It is suitable for experienced users who have specific requirements for the inference backend and its configuration, and is recommended only after you are familiar with high-performance inference.
Common high-performance inference configurations include the following fields:

| Parameter | Description | Type | Default Value |
|---|---|---|---|
| `auto_config` | Whether to enable the safe auto-configuration mode. `True` enables it; `False` enables the unrestricted manual configuration mode. | `bool` | `True` |
| `backend` | The inference backend to use. Cannot be `None` in unrestricted manual configuration mode. | `str` or `None` | `None` |
| `backend_config` | The inference backend configuration. When not `None`, it overrides the backend's default configuration items. | `dict` or `None` | `None` |
| `auto_paddle2onnx` | Whether to enable the Paddle2ONNX plugin to automatically convert Paddle models to ONNX models. | `bool` | `True` |
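For illustration, the minimal sketch below shows how these fields might be passed when creating a pipeline. It assumes that `create_pipeline` accepts an `hpi_config` dict mirroring the fields above; the parameter name and exact behavior may differ across PaddleX versions, so treat this as a sketch rather than a definitive API reference.

```python
from paddlex import create_pipeline

# Hypothetical example: keep the safe auto-configuration mode enabled, but
# request the onnxruntime backend. In this mode PaddleX still checks the
# requested configuration and rejects it if it is known to be unavailable.
pipeline = create_pipeline(
    pipeline="image_classification",
    device="gpu",
    use_hpip=True,
    hpi_config={
        "auto_config": True,
        "backend": "onnxruntime",
    },
)

output = pipeline.predict(
    "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg"
)
```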
The available options for `backend` are shown in the following table:

| Option | Description | Supported Devices |
|---|---|---|
| `paddle` | The Paddle Inference engine, which supports the Paddle Inference TensorRT subgraph engine to improve GPU inference performance. | CPU, GPU |
| `openvino` | OpenVINO, a deep learning inference tool provided by Intel, optimized for model inference performance on various Intel hardware. | CPU |
| `onnxruntime` | ONNX Runtime, a cross-platform, high-performance inference engine. | CPU, GPU |
| `tensorrt` | TensorRT, a high-performance deep learning inference library provided by NVIDIA, optimized for NVIDIA GPUs to improve inference speed. | GPU |
| `om` | OM, an inference engine for the offline model format customized for the Huawei Ascend NPU, deeply optimized for the hardware to reduce operator computation and scheduling time, effectively improving inference performance. | NPU |
The available values for `backend_config` vary depending on the backend, as shown in the following table:

| Backend | Available Values |
|---|---|
| `paddle` | Refer to PaddleX Single Model Python Usage Instructions: 4. Inference Backend Configuration. |
| `openvino` | `cpu_num_threads`: The number of logical processors used for CPU inference. Default is `8`. |
| `onnxruntime` | `cpu_num_threads`: The number of parallel computation threads within operators during CPU inference. Default is `8`. |
| `tensorrt` | `precision`: The precision to use, `fp16` or `fp32`. Default is `fp32`.<br>`dynamic_shapes`: Dynamic shapes, i.e., TensorRT's ability to defer specifying some or all tensor dimensions until runtime. A dynamic shape consists of a minimum shape, an optimal shape, and a maximum shape, in the format `{input tensor name}: [{minimum shape}, [{optimal shape}], [{maximum shape}]]`. For more information, refer to the TensorRT official documentation. |
| `om` | None |
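As an illustration of the `tensorrt` options above, the sketch below configures FP16 precision and dynamic shapes for one input tensor. The input tensor name `x` and the shapes are placeholders chosen for illustration, and the `hpi_config` parameter is assumed to be accepted by `create_pipeline` as in the earlier sketch; adjust both to your model and PaddleX version.

```python
from paddlex import create_pipeline

# Hypothetical example: unrestricted manual configuration with the TensorRT
# backend. The tensor name "x" and the [min, opt, max] shapes are placeholders;
# set them to match the actual model inputs.
pipeline = create_pipeline(
    pipeline="image_classification",
    device="gpu:0",
    use_hpip=True,
    hpi_config={
        "auto_config": False,  # unrestricted manual configuration mode
        "backend": "tensorrt",
        "backend_config": {
            "precision": "fp16",
            "dynamic_shapes": {
                "x": [[1, 3, 224, 224], [1, 3, 224, 224], [8, 3, 224, 224]],
            },
        },
    },
)

output = pipeline.predict(
    "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/general_image_classification_001.jpg"
)
```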
The following options can be set via environment variables when building the ultra-infer library from source:

| Option | Description |
|---|---|
| http_proxy | Use a specific HTTP proxy when downloading third-party libraries, default is empty |
| PYTHON_VERSION | Python version, default is 3.10.0 |
| WITH_GPU | Whether to compile support for Nvidia-GPU, default is ON |
| ENABLE_ORT_BACKEND | Whether to compile and integrate the ONNX Runtime backend, default is ON |
| ENABLE_TRT_BACKEND | Whether to compile and integrate the TensorRT backend (GPU only), default is ON |
| ENABLE_OPENVINO_BACKEND | Whether to compile and integrate the OpenVINO backend (CPU only), default is ON |
Compilation Example:
```bash
# Compilation
# export PYTHON_VERSION=...
# export WITH_GPU=...
# export ENABLE_ORT_BACKEND=...
# export ...
cd PaddleX/libs/ultra-infer/scripts/linux
bash set_up_docker_and_build_py.sh

# Installation
python -m pip install ../../python/dist/ultra_infer*.whl
```
1. Why is the inference speed similar to regular inference after using the high-performance inference feature?
The high-performance inference plugin accelerates inference by intelligently selecting the backend.
- For modules, due to model complexity or unsupported operators, some models may not be able to use an accelerated backend (such as OpenVINO, TensorRT, etc.). In such cases, relevant information is printed in the logs, and the fastest known available backend is selected, which may fall back to regular inference.
- For pipelines, the performance bottleneck may not be in the model inference stage.

You can use the PaddleX benchmark tool to conduct actual speed tests for a more accurate performance assessment.
2. Does the high-performance inference feature support all model pipelines and modules?
The high-performance inference feature supports all model pipelines and modules, but not all models will see accelerated inference. For the specific reasons, refer to Question 1.
3. Why does the installation of the high-performance inference plugin fail, with the log displaying: "Currently, the CUDA version must be 11.x for GPU devices."?
The environments supported by the high-performance inference feature are listed in the table in Section 1.1. If installation fails, the current environment may not be supported. Additionally, support for CUDA 12.6 is in progress.
4. Why does the program get stuck or display WARNING and ERROR messages when using the high-performance inference feature? How should this be handled?
During engine construction, subgraph optimization and operator processing may cause the program to take longer and to emit WARNING and ERROR messages. As long as the program does not exit on its own, it is recommended to wait patiently; it will usually continue running to completion.