update hpi doc (#4389)

* update hpi doc

* update en doc

* update

* update

* update

* update

* update
zhang-prog 3 months ago
parent
commit
c443829f80

+ 17 - 13
docs/pipeline_deploy/high_performance_inference.en.md

@@ -38,15 +38,19 @@ Currently, the supported processor architectures, operating systems, device type
     <th>Python Version</th>
   </tr>
   <tr>
-    <td rowspan="5">Linux</td>
-    <td rowspan="4">x86-64</td>
+    <td rowspan="6">Linux</td>
+    <td rowspan="5">x86-64</td>
   </tr>
   <tr>
     <td>CPU</td>
     <td>3.8–3.12</td>
   </tr>
   <tr>
-    <td>GPU&nbsp;(CUDA&nbsp;11.8&nbsp;+&nbsp;cuDNN&nbsp;8.9)</td>
+    <td>GPU&nbsp;(CUDA&nbsp;11.8&nbsp;+&nbsp;cuDNN&nbsp;8.9)</td>
+    <td>3.8–3.12</td>
+  </tr>
+  <tr>
+    <td>GPU&nbsp;(CUDA&nbsp;12.6&nbsp;+&nbsp;cuDNN&nbsp;9.5)</td>
     <td>3.8–3.12</td>
   </tr>
   <tr>
@@ -104,10 +108,12 @@ paddlex --install hpi-cpu
 
 **To install the GPU version of the high-performance inference plugin:**
 
-Before installation, please ensure that CUDA and cuDNN are installed in your environment. The official PaddleX currently only provides precompiled packages for CUDA 11.8 + cuDNN 8.9, so please ensure that the installed versions of CUDA and cuDNN are compatible with the compiled versions. Below are the installation documentation links for CUDA 11.8 and cuDNN 8.9:
+Before installation, please ensure that CUDA and cuDNN are installed in your environment. The official PaddleX currently provides precompiled packages for CUDA 11.8 + cuDNN 8.9 and CUDA 12.6 + cuDNN 9.5, so please ensure that the installed versions of CUDA and cuDNN are compatible with the compiled versions. Below are the installation documentation links for CUDA and cuDNN:
 
 - [Install CUDA 11.8](https://developer.nvidia.com/cuda-11-8-0-download-archive)
 - [Install cuDNN 8.9](https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-890/install-guide/index.html)
+- [Install CUDA 12.6](https://developer.nvidia.com/cuda-12-6-0-download-archive)
+- [Install cuDNN 9.5](https://docs.nvidia.com/deeplearning/cudnn/backend/v9.5.0/installation/linux.html)
 
 If you are using the official PaddlePaddle image, the CUDA and cuDNN versions in the image already meet the requirements, so there is no need for a separate installation.
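
For a quick sanity check of system-wide installations, a sketch such as the following can be used (it assumes a default toolkit layout; the cuDNN header location varies with the install method):

```bash
# Print the CUDA toolkit version reported by the compiler
nvcc --version

# Print the cuDNN version macros (the header may instead live under
# /usr/local/cuda/include, depending on how cuDNN was installed)
grep -A 2 "#define CUDNN_MAJOR" /usr/include/cudnn_version.h
```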
 
@@ -120,7 +126,7 @@ pip list | grep nvidia-cuda
 pip list | grep nvidia-cudnn
 ```
 
-If you wish to use the Paddle Inference TensorRT subgraph engine, you will need to install TensorRT additionally. Please refer to the related instructions in the [PaddlePaddle Local Installation Tutorial](../installation/paddlepaddle_install.en.md). Note that because the underlying inference library of the high-performance inference plugin also integrates TensorRT, it is recommended to install the same version of TensorRT to avoid version conflicts. Currently, the TensorRT version integrated into the high-performance inference plugin's underlying inference library is 8.6.1.6. If you are using the official PaddlePaddle image, you do not need to worry about version conflicts.
+If you wish to use the Paddle Inference TensorRT subgraph engine, you will need to install TensorRT additionally. Please refer to the related instructions in the [PaddlePaddle Local Installation Tutorial](../installation/paddlepaddle_install.en.md). Note that because the underlying inference library of the high-performance inference plugin also integrates TensorRT, it is recommended to install the same version of TensorRT to avoid version conflicts. Currently, the underlying inference library of the CUDA 11.8 build of the high-performance inference plugin integrates TensorRT 8.6.1.6. If you are using the official PaddlePaddle image, you do not need to worry about version conflicts.
 
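To check which TensorRT version is currently installed, a minimal sketch (assuming the TensorRT Python bindings are present in the environment):

```bash
# Print the installed TensorRT version via its Python bindings
python -c "import tensorrt; print(tensorrt.__version__)"
```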
 After confirming that the correct versions of CUDA, cuDNN, and TensorRT (optional) are installed, run:
 
@@ -134,7 +140,7 @@ Please refer to the [Ascend NPU High-Performance Inference Tutorial](../practica
 
 **Note:**
 
-1. **Currently, the official PaddleX only provides precompiled packages for CUDA 11.8 + cuDNN 8.9**; support for CUDA 12 is in progress.
+1. **Currently, the CUDA 12.6 + cuDNN 9.5 precompiled package provided by PaddleX supports only the OpenVINO and ONNX Runtime backends; the TensorRT backend is not yet supported.**
 2. Only one version of the high-performance inference plugin should exist in the same environment.
 3. For Windows systems, it is currently recommended to install and use the high-performance inference plugin within a Docker container or in [WSL](https://learn.microsoft.com/en-us/windows/wsl/install) environments.
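
Given note 1, users of the CUDA 12.6 build may want to pin an explicitly supported backend rather than rely on automatic selection. The sketch below is a hypothetical illustration only: the pipeline name, input file, and the exact `--hpi_config` keys are assumptions and should be checked against the configuration reference in this document:

```bash
# Hypothetical example: force the ONNX Runtime backend (supported by the
# CUDA 12.6 build) instead of letting the plugin auto-select a backend
paddlex --pipeline image_classification \
    --input demo.jpg \
    --device gpu:0 \
    --use_hpip \
    --hpi_config '{"backend": "onnxruntime"}'
```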
 
@@ -144,6 +150,8 @@ Below are examples of enabling the high-performance inference plugin in both the
 
 For the PaddleX CLI, specify `--use_hpip` to enable the high-performance inference plugin.
 
+**Before enabling the high-performance inference plugin, it is recommended to install the [Paddle2ONNX plugin](./paddle2onnx.en.md) (see the example below). Otherwise, PaddleX will be unable to convert PaddlePaddle models to ONNX models, preventing the use of ONNX Runtime, TensorRT, and other inference backends.** If you are using an ONNX model directly, there is no need to install the Paddle2ONNX plugin.
+
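As a minimal sketch, the Paddle2ONNX plugin is installed through the PaddleX CLI in the same way as the high-performance inference plugins above (see the linked Paddle2ONNX plugin documentation for details):

```bash
# Install the Paddle2ONNX plugin so that PaddleX can convert PaddlePaddle
# models to ONNX when an ONNX-based backend is selected
paddlex --install paddle2onnx
```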
 **General Image Classification Pipeline:**
 
 ```bash
@@ -555,21 +563,17 @@ python -m pip install ../../python/dist/ultra_infer*.whl
 
**1. Why does the inference speed not seem to improve noticeably after enabling the high-performance inference plugin?**
 
-The high-performance inference plugin achieves inference acceleration by intelligently selecting and configuring the backend. However, due to the complex structure of some models or the presence of unsupported operators, not all models may be able to be accelerated. In these cases, PaddleX will provide corresponding prompts in the log. You can use the [PaddleX benchmark feature](../module_usage/instructions/benchmark.en.md) to measure the inference duration of each module component, thereby facilitating a more accurate performance evaluation. Moreover, for pipelines, the performance bottleneck of inference may not lie in the model inference, but rather in the surrounding logic, which could also result in limited acceleration gains.
+The high-performance inference plugin achieves inference acceleration by intelligently selecting and configuring the backend. First, due to the complex structure of some models or the presence of unsupported operators, not all models can be accelerated. Second, if the [Paddle2ONNX plugin](./paddle2onnx.en.md) is not installed, PaddleX will be unable to convert PaddlePaddle models to ONNX models, preventing the use of ONNX Runtime, TensorRT, and other inference backends for acceleration. In these cases, PaddleX will print corresponding messages in the log. You can use the [PaddleX benchmark feature](../module_usage/instructions/benchmark.en.md) to measure the inference duration of each module component, thereby enabling a more accurate performance evaluation. Moreover, for pipelines, the performance bottleneck may not lie in model inference but in the surrounding orchestration logic, which can also limit the acceleration gains.
 
 **2. Do all pipelines and modules support high-performance inference?**
 
 All pipelines and modules that use static graph models support enabling the high-performance inference plugin; however, in certain scenarios, some models might not be able to achieve accelerated inference. For detailed reasons, please refer to Question 1.
 
-**3. Why does the installation of the high-performance inference plugin fail with a log message stating: “You are not using PaddlePaddle compiled with CUDA 11. Currently, CUDA versions other than 11.x are not supported by the high-performance inference plugin.”?**
-
-For the GPU version of the high-performance inference plugin, the official PaddleX currently only provides precompiled packages for CUDA 11.8 + cuDNN 8.9. The support for CUDA 12 is in progress.
-
-**4. Why does the program freeze during runtime or display some "WARNING" and "ERROR" messages after using the high-performance inference feature? What should be done in such cases?**
+**3. Why does the program freeze during runtime or display some "WARNING" and "ERROR" messages after using the high-performance inference feature? What should be done in such cases?**
 
 When initializing the model, operations such as subgraph optimization may take longer and may generate some "WARNING" and "ERROR" messages. However, as long as the program does not exit automatically, it is recommended to wait patiently, as the program usually continues to run to completion.
 
-**5. When using GPU for inference, enabling the high-performance inference plugin increases memory usage and causes OOM. How can this be resolved?**
+**4. When using GPU for inference, enabling the high-performance inference plugin increases memory usage and causes OOM. How can this be resolved?**
 
 Some acceleration methods trade off memory usage to support a broader range of inference scenarios. If memory becomes a bottleneck, consider the following optimization strategies:
 

+ 16 - 14
docs/pipeline_deploy/high_performance_inference.md

@@ -38,8 +38,8 @@ comments: true
    <th>Python Version</th>
   </tr>
   <tr>
-    <td rowspan="5">Linux</td>
-    <td rowspan="4">x86-64</td>
+    <td rowspan="6">Linux</td>
+    <td rowspan="5">x86-64</td>
   </tr>
   <tr>
     <td>CPU</td>
@@ -50,6 +50,10 @@ comments: true
     <td>3.8–3.12</td>
   </tr>
   <tr>
+    <td>GPU&nbsp;(CUDA&nbsp;12.6&nbsp;+&nbsp;cuDNN&nbsp;9.5)</td>
+    <td>3.8–3.12</td>
+  </tr>
+  <tr>
     <td>NPU</td>
     <td>3.10</td>
   </tr>
@@ -92,8 +96,6 @@ The official PaddleX Docker images come preinstalled with the Paddle2ONNX plugin so that PaddleX can

#### 1.1.2 Installing the High-Performance Inference Plugin Locally

-**Before installing the high-performance inference plugin, it is recommended to first install the Paddle2ONNX plugin so that PaddleX can convert model formats when needed.**

**To install the CPU version of the high-performance inference plugin:**

Run:
@@ -104,10 +106,12 @@ paddlex --install hpi-cpu
 
**To install the GPU version of the high-performance inference plugin:**

-Before installation, make sure that CUDA and cuDNN are installed in your environment. The official PaddleX currently provides precompiled packages for CUDA 11.8 + cuDNN 8.9, so please ensure that the installed CUDA and cuDNN versions are compatible with the compiled versions. Below are the installation documentation links for CUDA 11.8 and cuDNN 8.9:
+Before installation, make sure that CUDA and cuDNN are installed in your environment. The official PaddleX currently provides precompiled packages for CUDA 11.8 + cuDNN 8.9 and CUDA 12.6 + cuDNN 9.5, so please ensure that the installed CUDA and cuDNN versions are compatible with the compiled versions. Below are the installation documentation links for CUDA and cuDNN:

- [Install CUDA 11.8](https://developer.nvidia.com/cuda-11-8-0-download-archive)
- [Install cuDNN 8.9](https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-890/install-guide/index.html)
+- [Install CUDA 12.6](https://developer.nvidia.com/cuda-12-6-0-download-archive)
+- [Install cuDNN 9.5](https://docs.nvidia.com/deeplearning/cudnn/backend/v9.5.0/installation/linux.html)

If you are using the official PaddlePaddle framework image, the CUDA and cuDNN versions in the image already meet the requirements and no separate installation is needed.
 
@@ -120,7 +124,7 @@ pip list | grep nvidia-cuda
 pip list | grep nvidia-cudnn
 ```
 
-If you wish to use the Paddle Inference TensorRT subgraph engine, you will need to install TensorRT additionally. Please refer to the related instructions in the [PaddlePaddle Local Installation Tutorial](../installation/paddlepaddle_install.md). Note that because the underlying inference library of the high-performance inference plugin also integrates TensorRT, it is recommended to install the same version of TensorRT to avoid version conflicts. Currently, the TensorRT version integrated into the high-performance inference plugin's underlying inference library is 8.6.1.6. If you are using the official PaddlePaddle framework image, you do not need to worry about version conflicts.
+If you wish to use the Paddle Inference TensorRT subgraph engine, you will need to install TensorRT additionally. Please refer to the related instructions in the [PaddlePaddle Local Installation Tutorial](../installation/paddlepaddle_install.md). Note that because the underlying inference library of the high-performance inference plugin also integrates TensorRT, it is recommended to install the same version of TensorRT to avoid version conflicts. Currently, the underlying inference library of the CUDA 11.8 build of the high-performance inference plugin integrates TensorRT 8.6.1.6. If you are using the official PaddlePaddle framework image, you do not need to worry about version conflicts.

After confirming that the correct versions of CUDA, cuDNN, and TensorRT (optional) are installed, run:
 
@@ -134,7 +138,7 @@ paddlex --install hpi-gpu
 
**Note:**

-1. **Currently, the official PaddleX only provides precompiled packages for CUDA 11.8 + cuDNN 8.9.** Support for CUDA 12 is in progress.
+1. **Currently, the CUDA 12.6 + cuDNN 9.5 precompiled package provided by PaddleX supports only the OpenVINO and ONNX Runtime backends; the TensorRT backend is not yet supported.**

2. Only one version of the high-performance inference plugin should exist in the same environment.
 
@@ -146,6 +150,8 @@ paddlex --install hpi-gpu
 
For the PaddleX CLI, specify `--use_hpip` to enable the high-performance inference plugin.

+**Before enabling the high-performance inference plugin, it is recommended to install the [Paddle2ONNX plugin](./paddle2onnx.md). Otherwise, PaddleX will be unable to convert PaddlePaddle models to ONNX models, preventing the use of ONNX Runtime, TensorRT, and other inference backends.** If you are using an ONNX model directly, there is no need to install the Paddle2ONNX plugin.
+

General Image Classification Pipeline:
 
 ```bash
@@ -555,21 +561,17 @@ python -m pip install ../../python/dist/ultra_infer*.whl
 
**1. Why does the inference speed not seem to improve noticeably after enabling the high-performance inference plugin?**

-The high-performance inference plugin achieves inference acceleration by intelligently selecting and configuring the backend. Because some models have complex structures or contain unsupported operators, not all models can be accelerated. In such cases, PaddleX will print corresponding messages in the log. You can use the [PaddleX benchmark feature](../module_usage/instructions/benchmark.md) to measure the inference duration of each module component, thereby enabling a more accurate performance evaluation. Moreover, for pipelines, the performance bottleneck may not lie in model inference but in the surrounding orchestration logic, which can also limit the acceleration gains.
+The high-performance inference plugin achieves inference acceleration by intelligently selecting and configuring the backend. First, because some models have complex structures or contain unsupported operators, not all models can be accelerated. Second, if the [Paddle2ONNX plugin](./paddle2onnx.md) is not installed, PaddleX will be unable to convert PaddlePaddle models to ONNX models, preventing the use of ONNX Runtime, TensorRT, and other inference backends for acceleration. In such cases, PaddleX will print corresponding messages in the log. You can use the [PaddleX benchmark feature](../module_usage/instructions/benchmark.md) to measure the inference duration of each module component, thereby enabling a more accurate performance evaluation. Moreover, for pipelines, the performance bottleneck may not lie in model inference but in the surrounding orchestration logic, which can also limit the acceleration gains.
 
**2. Do all pipelines and modules support high-performance inference?**

All pipelines and modules that use static graph models support enabling the high-performance inference plugin, but some models may not achieve accelerated inference in certain cases; see Question 1 for the specific reasons.
 
-**3. Why does the installation of the high-performance inference plugin fail with a log message stating: “You are not using PaddlePaddle compiled with CUDA 11. Currently, CUDA versions other than 11.x are not supported by the high-performance inference plugin.”?**
-
-For the GPU version of the high-performance inference plugin, the official PaddleX currently only provides precompiled packages for CUDA 11.8 + cuDNN 8.9. Support for CUDA 12 is in progress.
-
-**4. Why does the program freeze during runtime or display some “WARNING” and “ERROR” messages after using the high-performance inference feature? What should be done in such cases?**
+**3. Why does the program freeze during runtime or display some “WARNING” and “ERROR” messages after using the high-performance inference feature? What should be done in such cases?**
 
When initializing the model, operations such as subgraph optimization may take a long time and may generate some “WARNING” and “ERROR” messages. However, as long as the program does not exit on its own, it is recommended to wait patiently; the program will usually continue to run to completion.
 
-**5. When using GPU inference, enabling the high-performance inference plugin increases GPU memory usage and causes OOM. How can this be resolved?**
+**4. When using GPU inference, enabling the high-performance inference plugin increases GPU memory usage and causes OOM. How can this be resolved?**
 
Some acceleration methods trade GPU memory for the ability to support a broader range of inference scenarios. If GPU memory becomes a bottleneck, consider the following optimization approaches: