@@ -24,7 +24,7 @@ In real production environments, many applications impose strict performance met

 Before using the high-performance inference plugin, please ensure that you have completed the PaddleX installation according to the [PaddleX Local Installation Tutorial](../installation/installation.en.md) and have run the quick inference using the PaddleX pipeline command line or the PaddleX pipeline Python script as described in the usage instructions.

-High-performance inference supports handling **PaddlePaddle static graph models (`.pdmodel`, `.json`)** and **ONNX format models (`.onnx`)**. For ONNX format models, it is recommended to convert them using the [Paddle2ONNX Plugin](./paddle2onnx.md). If multiple model formats are present in the model directory, PaddleX will automatically choose the appropriate one as needed.
+High-performance inference supports handling **PaddlePaddle static graph models (`.pdmodel`, `.json`)** and **ONNX format models (`.onnx`)**. For ONNX format models, it is recommended to convert them using the [Paddle2ONNX Plugin](./paddle2onnx.en.md). If multiple model formats are present in the model directory, PaddleX will automatically choose the appropriate one as needed.

 ### 1.1 Installing the High-Performance Inference Plugin

@@ -76,12 +76,12 @@ Refer to [Get PaddleX based on Docker](../installation/installation.en.md#21-obt
 <tr>
 <td>CPU</td>
 <td><code>paddlex --install hpi-cpu</code></td>
- <td>Installs the CPU version of the high-performance inference functionality.</td>
+ <td>Installs the CPU version of the high-performance inference feature.</td>
 </tr>
 <tr>
 <td>GPU</td>
 <td><code>paddlex --install hpi-gpu</code></td>
- <td>Installs the GPU version of the high-performance inference functionality.<br />Includes all functionalities of the CPU version.</td>
+ <td>Installs the GPU version of the high-performance inference feature.<br />Includes all functionalities of the CPU version.</td>
 </tr>
 </tbody>
 </table>
@@ -102,7 +102,7 @@ paddlex --install hpi-cpu

 ##### To install the GPU version of the high-performance inference plugin:

-Before installation, please ensure that CUDA and cuDNN are installed in your environment. The official PaddleX currently only provides a precompiled package for CUDA 11.8 + cuDNN 8.9, so please ensure that the installed versions of CUDA and cuDNN are compatible with the compiled versions. Below are the installation documentation links for CUDA 11.8 and cuDNN 8.9:
+Before installation, please ensure that CUDA and cuDNN are installed in your environment. The official PaddleX currently only provides precompiled packages for CUDA 11.8 + cuDNN 8.9, so please ensure that the installed versions of CUDA and cuDNN are compatible with the compiled versions. Below are the installation documentation links for CUDA 11.8 and cuDNN 8.9:

 - [Install CUDA 11.8](https://developer.nvidia.com/cuda-11-8-0-download-archive)
 - [Install cuDNN 8.9](https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-890/install-guide/index.html)
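
A quick way to confirm the CUDA and cuDNN versions referenced above is sketched below; the cuDNN header location is an assumption and varies between installation methods:

```bash
# Report the CUDA toolkit version (expected: release 11.8).
nvcc --version

# Report the cuDNN version (expected: 8.9.x). The header path below is an
# assumption; it may instead live under /usr/include/x86_64-linux-gnu/ or
# inside the CUDA installation directory, depending on how cuDNN was installed.
grep -A 2 "#define CUDNN_MAJOR" /usr/include/cudnn_version.h
```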

@@ -128,11 +128,11 @@ paddlex --install hpi-gpu

 ##### To install the NPU version of the high-performance inference plugin:

-Please refer to the [Ascend NPU High-Performance Inference Tutorial](../practical_tutorials/high_performance_npu_tutorial.md).
+Please refer to the [Ascend NPU High-Performance Inference Tutorial](../practical_tutorials/high_performance_npu_tutorial.en.md).

 **Note:**

-1. **Currently, the official PaddleX only provides a precompiled package for CUDA 11.8 + cuDNN 8.9**; support for CUDA 12.6 is in progress.
+1. **Currently, the official PaddleX only provides precompiled packages for CUDA 11.8 + cuDNN 8.9**; support for CUDA 12 is in progress.
 2. Only one version of the high-performance inference plugin should exist in the same environment.
 3. For Windows systems, it is currently recommended to install and use the high-performance inference plugin within a Docker container.

@@ -216,12 +216,12 @@ In unrestricted manual configuration mode, full freedom is provided to configure

 ### 2.2 High-Performance Inference Configuration

-Common configuration fields for high-performance inference include:
+Common configuration items for high-performance inference include:

 <table>
 <thead>
 <tr>
-<th>Parameter</th>
+<th>Name</th>
 <th>Description</th>
 <th>Type</th>
 <th>Default Value</th>
@@ -290,31 +290,31 @@ The optional values for `backend` are as follows:
 </tr>
 </table>

-The available options for `backend_config` vary for different backends, as shown in the following table:
+The available configuration items for `backend_config` vary for different backends, as shown in the following table:

 <table>
 <tr>
 <th>Backend</th>
- <th>Options</th>
+ <th>Configuration Items</th>
 </tr>
 <tr>
 <td><code>paddle</code></td>
- <td>Refer to <a href="../module_usage/instructions/model_python_API.en.md">PaddleX Single-Model Python Script Usage: 4. Inference Backend Settings</a>.</td>
+ <td>Refer to <a href="../module_usage/instructions/model_python_API.en.md#4-inference-configuration">PaddleX Single Model Python Usage Instructions</a>. The attributes of the <code>PaddlePredictorOption</code> object can be configured via key-value pairs.</td>
 </tr>
 <tr>
 <td><code>openvino</code></td>
- <td><code>cpu_num_threads</code>: The number of logical processors used for CPU inference. The default is <code>8</code>.</td>
+ <td><code>cpu_num_threads</code> (<code>int</code>): The number of logical processors used for CPU inference. The default is <code>8</code>.</td>
 </tr>
 <tr>
 <td><code>onnxruntime</code></td>
- <td><code>cpu_num_threads</code>: The number of parallel computation threads within the operator during CPU inference. The default is <code>8</code>.</td>
+ <td><code>cpu_num_threads</code> (<code>int</code>): The number of parallel computation threads within the operator during CPU inference. The default is <code>8</code>.</td>
 </tr>
 <tr>
 <td><code>tensorrt</code></td>
 <td>
- <code>precision</code>: The precision used, either <code>fp16</code> or <code>fp32</code>. The default is <code>fp32</code>.
+ <code>precision</code> (<code>str</code>): The precision used, either <code>"fp16"</code> or <code>"fp32"</code>. The default is <code>"fp32"</code>.
 <br />
- <code>dynamic_shapes</code>: Dynamic shapes. Dynamic shapes include the minimum shape, optimal shape, and maximum shape, which represent TensorRT’s ability to delay specifying some or all tensor dimensions until runtime. The format is: <code>{input tensor name}: [ [minimum shape], [optimal shape], [maximum shape] ]</code>. For more details, please refer to the <a href="https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#work_dynamic_shapes">TensorRT official documentation</a>.
+ <code>dynamic_shapes</code> (<code>dict</code>): Dynamic shapes. Dynamic shapes include the minimum shape, optimal shape, and maximum shape, which represent TensorRT’s ability to delay specifying some or all tensor dimensions until runtime. The format is: <code>{input tensor name}: [{minimum shape}, {optimal shape}, {maximum shape}]</code>. For more details, please refer to the <a href="https://docs.nvidia.com/deeplearning/tensorrt/latest/inference-library/work-dynamic-shapes.html">TensorRT official documentation</a>.
 </td>
 </tr>
 <tr>
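
To show how `backend` and `backend_config` fit together, here is a minimal, hypothetical Python sketch. It assumes the plugin is installed and that the configuration can be passed to `create_pipeline` through `use_hpip` and an `hpi_config` dictionary; the pipeline name, the input tensor name `x`, and the shapes are placeholders:

```python
from paddlex import create_pipeline

# Hypothetical high-performance inference configuration built from the items
# in the tables above: select the TensorRT backend and set its backend_config.
hpi_config = {
    "backend": "tensorrt",
    "backend_config": {
        "precision": "fp16",
        # {input tensor name}: [[minimum shape], [optimal shape], [maximum shape]]
        "dynamic_shapes": {
            "x": [[1, 3, 224, 224], [1, 3, 640, 640], [8, 3, 1280, 1280]],
        },
    },
}

pipeline = create_pipeline(
    pipeline="image_classification",  # placeholder pipeline name
    device="gpu:0",
    use_hpip=True,          # enable the high-performance inference plugin
    hpi_config=hpi_config,  # assumed parameter name for the configuration above
)

for res in pipeline.predict("demo.jpg"):
    res.print()
```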

@@ -615,22 +615,16 @@ python -m pip install ../../python/dist/ultra_infer*.whl

 **1. Why does the inference speed not appear to improve noticeably before and after enabling the high-performance inference plugin?**

-The high-performance inference plugin accelerates inference by intelligently selecting the backend.
+The high-performance inference plugin achieves inference acceleration by intelligently selecting and configuring the backend. However, because some models have complex structures or contain unsupported operators, not all models can be accelerated; in such cases, PaddleX reports this in the log. You can use the [PaddleX benchmark feature](../module_usage/instructions/benchmark.en.md) to measure the inference time of each component and obtain a more accurate performance evaluation. Moreover, for pipelines, the performance bottleneck may not lie in model inference but in the surrounding logic, which can also limit the acceleration gains.

-For modules, due to model complexity or unsupported operators, some models may not be able to use acceleration backends (such as OpenVINO, TensorRT, etc.). In such cases, corresponding messages will be logged, and the fastest available backend known will be chosen, which may fall back to standard inference.
+**2. Do all pipelines and modules support high-performance inference?**

-For model pipelines, the performance bottleneck may not lie in the inference stage.
-
-You can use the [PaddleX benchmark](../module_usage/instructions/benchmark.md) tool to conduct actual speed tests for a more accurate performance evaluation.
-
-**2. Does the high-performance inference functionality support all model pipelines and modules?**
-
-The high-performance inference functionality supports all model pipelines and modules, but some models may not see an acceleration effect due to reasons mentioned in FAQ 1.
+All pipelines and modules that use static graph models support enabling the high-performance inference plugin; however, in certain scenarios, some models might not achieve accelerated inference. See Question 1 for the detailed reasons.

 **3. Why does the installation of the high-performance inference plugin fail with a log message stating: “Currently, the CUDA version must be 11.x for GPU devices.”?**

-The high-performance inference functionality currently supports only a limited set of environments. Please refer to the installation instructions. If installation fails, it may be that the current environment is not supported by the high-performance inference functionality. Note that CUDA 12.6 is already under support.
+For the GPU version of the high-performance inference plugin, the official PaddleX currently only provides precompiled packages for CUDA 11.8 + cuDNN 8.9. Support for CUDA 12 is in progress.

-**4. Why does the program freeze during runtime or display some WARNING and ERROR messages after using the high-performance inference functionality? What should be done in such cases?**
+**4. Why does the program freeze during runtime or display some "WARNING" and "ERROR" messages after using the high-performance inference feature? What should be done in such cases?**

-When initializing the model, operations such as subgraph optimization may take longer and may generate some WARNING and ERROR messages. However, as long as the program does not exit automatically, it is recommended to wait patiently, as the program usually continues to run to completion.
+When initializing the model, operations such as subgraph optimization may take longer and may generate some "WARNING" and "ERROR" messages. However, as long as the program does not exit automatically, it is recommended to wait patiently, as the program usually continues to run to completion.