
[Feat] upgrade benchmark (#3416)

* update benchmark

* update

* update

* h2d d2h

* update

* update doc

* fix input sequence

* update

* update

* update

* update

* update ts common processors

* update

* update

* update

* update
zhang-prog, 8 months ago
commit 5158085f23
33 changed files with 515 additions and 320 deletions
  1. docs/module_usage/instructions/benchmark.md (+167 -46)
  2. paddlex/engine.py (+3 -0)
  3. paddlex/inference/common/batch_sampler/base_batch_sampler.py (+1 -14)
  4. paddlex/inference/common/batch_sampler/image_batch_sampler.py (+6 -4)
  5. paddlex/inference/common/reader/image_reader.py (+2 -0)
  6. paddlex/inference/models/3d_bev_detection/processors.py (+9 -0)
  7. paddlex/inference/models/anomaly_detection/processors.py (+3 -0)
  8. paddlex/inference/models/base/predictor/base_predictor.py (+0 -1)
  9. paddlex/inference/models/base/predictor/basic_predictor.py (+23 -16)
  10. paddlex/inference/models/common/static_infer.py (+32 -63)
  11. paddlex/inference/models/common/ts/processors.py (+7 -1)
  12. paddlex/inference/models/common/vision/processors.py (+7 -0)
  13. paddlex/inference/models/formula_recognition/processors.py (+11 -0)
  14. paddlex/inference/models/image_classification/processors.py (+3 -0)
  15. paddlex/inference/models/image_feature/processors.py (+3 -0)
  16. paddlex/inference/models/image_multilabel_classification/processors.py (+3 -0)
  17. paddlex/inference/models/image_unwarping/processors.py (+3 -0)
  18. paddlex/inference/models/instance_segmentation/processors.py (+2 -0)
  19. paddlex/inference/models/keypoint_detection/processors.py (+3 -0)
  20. paddlex/inference/models/object_detection/processors.py (+10 -0)
  21. paddlex/inference/models/open_vocabulary_detection/processors/groundingdino_processors.py (+5 -0)
  22. paddlex/inference/models/open_vocabulary_segmentation/processors/sam_processer.py (+3 -0)
  23. paddlex/inference/models/semantic_segmentation/processors.py (+3 -0)
  24. paddlex/inference/models/table_structure_recognition/processors.py (+3 -0)
  25. paddlex/inference/models/text_detection/processors.py (+4 -0)
  26. paddlex/inference/models/text_recognition/processors.py (+5 -0)
  27. paddlex/inference/models/ts_anomaly_detection/processors.py (+3 -0)
  28. paddlex/inference/models/ts_classification/processors.py (+4 -0)
  29. paddlex/inference/models/ts_forecasting/processors.py (+4 -0)
  30. paddlex/inference/models/video_classification/processors.py (+8 -0)
  31. paddlex/inference/models/video_detection/processors.py (+6 -0)
  32. paddlex/inference/utils/benchmark.py (+168 -170)
  33. paddlex/utils/flags.py (+1 -5)

+ 167 - 46
docs/module_usage/instructions/benchmark.md

@@ -1,74 +1,195 @@
 # Model Inference Benchmark
 
-PaddleX supports measuring model inference time. This is configured via environment variables, as follows:
+## Table of Contents
+
+- [1. Usage Instructions](#1.使用说明)
+- [2. Usage Examples](#2.使用示例)
+  - [2.1 Command Line](#2.1-命令行方式)
+  - [2.2 Python Script](#2.2-Python-脚本方式)
+- [3. Result Description](#3.结果说明)
+
+## 1. Usage Instructions
+
+For the end-to-end inference process, the benchmark reports the average execution time per iteration (`Avg Time Per Iter (ms)`) and the average execution time per sample (`Avg Time Per Instance (ms)`) of every operation (`Operation`) and stage (`Stage`), in milliseconds.
+
+The benchmark must be enabled through environment variables, as follows:
 
 * `PADDLE_PDX_INFER_BENCHMARK`: enables the benchmark when set to `True`; defaults to `False`;
-* `PADDLE_PDX_INFER_BENCHMARK_WARMUP`: number of warm-up iterations run on random data before the test starts; defaults to `0`;
-* `PADDLE_PDX_INFER_BENCHMARK_DATA_SIZE`: size of the random data; defaults to `224`;
-* `PADDLE_PDX_INFER_BENCHMARK_ITER`: number of benchmark iterations run on random data; random data is used only when the input is `None`; defaults to `10`;
+* `PADDLE_PDX_INFER_BENCHMARK_WARMUP`: number of warm-up iterations run before the test starts; defaults to `0`;
+* `PADDLE_PDX_INFER_BENCHMARK_ITER`: number of benchmark iterations; defaults to `0`;
 * `PADDLE_PDX_INFER_BENCHMARK_OUTPUT`: directory to save results to, e.g. `./benchmark`; defaults to `None`, meaning benchmark metrics are not saved;
 
-Usage example:
+**Note**:
+
+* At least one of `PADDLE_PDX_INFER_BENCHMARK_WARMUP` and `PADDLE_PDX_INFER_BENCHMARK_ITER` must be set to a value greater than zero, otherwise the benchmark cannot be enabled.
+
+## 2. Usage Examples
+
+You can use the benchmark in two ways: from the command line or from a Python script.
+
+### 2.1 Command Line
+
+**Note**:
+
+- For a description of the input parameters, see [PaddleX Common Model Configuration File Parameters](./config_parameters_common.md).
+- In benchmark mode, `Predict.input` can only be set to a local path to the input data. If `batch_size` is greater than 1, the input data is replicated `batch_size` times to fill the batch.
+
+Run the command:
 
 ```bash
 PADDLE_PDX_INFER_BENCHMARK=True \
 PADDLE_PDX_INFER_BENCHMARK_WARMUP=5 \
-PADDLE_PDX_INFER_BENCHMARK_DATA_SIZE=320 \
 PADDLE_PDX_INFER_BENCHMARK_ITER=10 \
 PADDLE_PDX_INFER_BENCHMARK_OUTPUT=./benchmark \
 python main.py \
-    -c ./paddlex/configs/object_detection/PicoDet-XS.yaml \
+    -c ./paddlex/configs/modules/object_detection/PicoDet-XS.yaml \
     -o Global.mode=predict \
     -o Predict.model_dir=None \
     -o Predict.batch_size=2 \
-    -o Predict.input=None
+    -o Predict.input=./test.png
+
+# To use the pptrt inference backend:
+#   -o Predict.kernel_option="{'run_mode': 'trt_fp32'}"
+```
+
+### 2.2 Python Script
+
+**Note**:
+
+- For a description of the input parameters, see [PaddleX Single-Model Python Script Usage](./model_python_API.md).
+- In benchmark mode, `input` can only be set to a local path to the input data. If `batch_size` is greater than 1, the input data is replicated `batch_size` times to fill the batch.
+
+Create a `test_infer.py` script:
+
+```python
+from paddlex import create_model
+
+model = create_model(model_name="PicoDet-XS", model_dir=None)
+output = list(model.predict(input="./test.png", batch_size=2))
+
+# To use the pptrt inference backend:
+# from paddlex import create_model
+# from paddlex.inference.utils.pp_option import PaddlePredictorOption
+
+# pp_option = PaddlePredictorOption()
+# pp_option.run_mode = "trt_fp32"
+# model = create_model(model_name="PicoDet-XS", model_dir=None, pp_option=pp_option)
+# output = list(model.predict(input="./test.png", batch_size=2))
 ```
 
-When the benchmark is enabled, benchmark metrics are printed automatically:
+Run the script:
 
+```bash
+PADDLE_PDX_INFER_BENCHMARK=True \
+PADDLE_PDX_INFER_BENCHMARK_WARMUP=5 \
+PADDLE_PDX_INFER_BENCHMARK_ITER=10 \
+PADDLE_PDX_INFER_BENCHMARK_OUTPUT=./benchmark \
+python test_infer.py
 ```
-+----------------+-----------------+-----------------+------------------------+
-|   Component    | Total Time (ms) | Number of Calls | Avg Time Per Call (ms) |
-+----------------+-----------------+-----------------+------------------------+
-|    ReadCmp     |   99.60412979   |        10       |       9.96041298       |
-|     Resize     |   17.01641083   |        20       |       0.85082054       |
-|   Normalize    |   44.61312294   |        20       |       2.23065615       |
-|   ToCHWImage   |    0.03385544   |        20       |       0.00169277       |
-|    Copy2GPU    |   13.46874237   |        10       |       1.34687424       |
-|     Infer      |   71.31743431   |        10       |       7.13174343       |
-|    Copy2CPU    |    0.39076805   |        10       |       0.03907681       |
-| DetPostProcess |    0.36168098   |        20       |       0.01808405       |
-+----------------+-----------------+-----------------+------------------------+
-+-------------+-----------------+---------------------+----------------------------+
-|    Stage    | Total Time (ms) | Number of Instances | Avg Time Per Instance (ms) |
-+-------------+-----------------+---------------------+----------------------------+
-|  PreProcess |   161.26751900  |          20         |         8.06337595         |
-|  Inference  |   85.17694473   |          20         |         4.25884724         |
-| PostProcess |    0.36168098   |          20         |         0.01808405         |
-|   End2End   |   256.90770149  |          20         |        12.84538507         |
-|    WarmUp   |  5412.37807274  |          10         |        541.23780727        |
-+-------------+-----------------+---------------------+----------------------------+
+
+## 3. Result Description
+
+When the benchmark is enabled, the benchmark results are printed automatically. The fields are described below:
+
+<table border="1">
+    <thead>
+        <tr>
+            <th>Field</th>
+            <th>Meaning</th>
+        </tr>
+    </thead>
+    <tbody>
+        <tr>
+            <td>Iters</td>
+            <td>Number of iterations, i.e. how many times the inference loop is run.</td>
+        </tr>
+        <tr>
+            <td>Batch Size</td>
+            <td>Batch size, i.e. the number of samples processed per iteration.</td>
+        </tr>
+        <tr>
+            <td>Instances</td>
+            <td>Total number of samples, computed as <code>Iters</code> times <code>Batch Size</code>.</td>
+        </tr>
+        <tr>
+            <td>Operation</td>
+            <td>Operation name, e.g. <code>Resize</code>, <code>Normalize</code>.</td>
+        </tr>
+        <tr>
+            <td>Stage</td>
+            <td>Stage name: preprocessing (PreProcess), inference (Inference), postprocessing (PostProcess), and end-to-end (End2End).</td>
+        </tr>
+        <tr>
+            <td>Avg Time Per Iter (ms)</td>
+            <td>Average execution time per iteration, in milliseconds.</td>
+        </tr>
+        <tr>
+            <td>Avg Time Per Instance (ms)</td>
+            <td>Average execution time per sample, in milliseconds.</td>
+        </tr>
+    </tbody>
+</table>
+
+Running the example from Section 2 produces the following benchmark results:
+
+```
+                                             WarmUp Data
++-------+------------+-----------+-------------+------------------------+----------------------------+
+| Iters | Batch Size | Instances |    Stage    | Avg Time Per Iter (ms) | Avg Time Per Instance (ms) |
++-------+------------+-----------+-------------+------------------------+----------------------------+
+|   5   |     2      |     10    |  PreProcess |      98.70615005       |        49.35307503         |
+|   5   |     2      |     10    |  Inference  |      68.70298386       |        34.35149193         |
+|   5   |     2      |     10    | PostProcess |       0.22978783       |         0.11489391         |
+|   5   |     2      |     10    |   End2End   |      167.63892174      |        83.81946087         |
++-------+------------+-----------+-------------+------------------------+----------------------------+
+                                               Detail Data
++-------+------------+-----------+----------------+------------------------+----------------------------+
+| Iters | Batch Size | Instances |   Operation    | Avg Time Per Iter (ms) | Avg Time Per Instance (ms) |
++-------+------------+-----------+----------------+------------------------+----------------------------+
+|   10  |     2      |     20    |   ReadImage    |      77.00567245       |        38.50283623         |
+|   10  |     2      |     20    |     Resize     |      11.97342873       |         5.98671436         |
+|   10  |     2      |     20    |   Normalize    |       6.09791279       |         3.04895639         |
+|   10  |     2      |     20    |   ToCHWImage   |       0.00574589       |         0.00287294         |
+|   10  |     2      |     20    |    ToBatch     |       0.72050095       |         0.36025047         |
+|   10  |     2      |     20    |    Copy2GPU    |       3.15101147       |         1.57550573         |
+|   10  |     2      |     20    |     Infer      |       9.58673954       |         4.79336977         |
+|   10  |     2      |     20    |    Copy2CPU    |       0.07462502       |         0.03731251         |
+|   10  |     2      |     20    | DetPostProcess |       0.22695065       |         0.11347532         |
++-------+------------+-----------+----------------+------------------------+----------------------------+
+                                             Summary Data
++-------+------------+-----------+-------------+------------------------+----------------------------+
+| Iters | Batch Size | Instances |    Stage    | Avg Time Per Iter (ms) | Avg Time Per Instance (ms) |
++-------+------------+-----------+-------------+------------------------+----------------------------+
+|   10  |     2      |     20    |  PreProcess |      95.80326080       |        47.90163040         |
+|   10  |     2      |     20    |  Inference  |      12.81237602       |         6.40618801         |
+|   10  |     2      |     20    | PostProcess |       0.22695065       |         0.11347532         |
+|   10  |     2      |     20    |   End2End   |      108.84258747      |        54.42129374         |
++-------+------------+-----------+-------------+------------------------+----------------------------+
 ```
 
-The benchmark results report, for every component (`Component`) of the model, the total time (`Total Time`, in milliseconds), the **number of calls** (`Number of Calls`), and the **average time per call** (`Avg Time Per Call`, in milliseconds), plus per-stage statistics for warm-up (`WarmUp`), preprocessing (`PreProcess`), model inference (`Inference`), postprocessing (`PostProcess`), and end-to-end (`End2End`): the total time per stage (`Total Time`, in milliseconds), the **number of instances** (`Number of Instances`), and the **average time per instance** (`Avg Time Per Instance`, in milliseconds). These metrics are also saved locally to `./benchmark/detail.csv` and `./benchmark/summary.csv`:
+In addition, since `PADDLE_PDX_INFER_BENCHMARK_OUTPUT=./benchmark` was set, the results above are also saved locally to `./benchmark/detail.csv` and `./benchmark/summary.csv`:
+
+The contents of `detail.csv`:
 
 ```csv
-Component,Total Time (ms),Number of Calls,Avg Time Per Call (ms)
-ReadCmp,99.60412979125977,10,9.960412979125977
-Resize,17.01641082763672,20,0.8508205413818359
-Normalize,44.61312294006348,20,2.230656147003174
-ToCHWImage,0.033855438232421875,20,0.0016927719116210938
-Copy2GPU,13.468742370605469,10,1.3468742370605469
-Infer,71.31743431091309,10,7.131743431091309
-Copy2CPU,0.39076805114746094,10,0.039076805114746094
-DetPostProcess,0.3616809844970703,20,0.018084049224853516
+Iters,Batch Size,Instances,Operation,Avg Time Per Iter (ms),Avg Time Per Instance (ms)
+10,2,20,ReadImage,77.00567245,38.50283623
+10,2,20,Resize,11.97342873,5.98671436
+10,2,20,Normalize,6.09791279,3.04895639
+10,2,20,ToCHWImage,0.00574589,0.00287294
+10,2,20,ToBatch,0.72050095,0.36025047
+10,2,20,Copy2GPU,3.15101147,1.57550573
+10,2,20,Infer,9.58673954,4.79336977
+10,2,20,Copy2CPU,0.07462502,0.03731251
+10,2,20,DetPostProcess,0.22695065,0.11347532
 ```
 
+The contents of `summary.csv`:
+
 ```csv
-Stage,Total Time (ms),Number of Instances,Avg Time Per Instance (ms)
-PreProcess,161.26751899719238,20,8.06337594985962
-Inference,85.17694473266602,20,4.258847236633301
-PostProcess,0.3616809844970703,20,0.018084049224853516
-End2End,256.90770149230957,20,12.845385074615479
-WarmUp,5412.3780727386475,10,541.2378072738647
+Iters,Batch Size,Instances,Stage,Avg Time Per Iter (ms),Avg Time Per Instance (ms)
+10,2,20,PreProcess,95.80326080,47.90163040
+10,2,20,Inference,12.81237602,6.40618801
+10,2,20,PostProcess,0.22695065,0.11347532
+10,2,20,End2End,108.84258747,54.42129374
 ```

+ 3 - 0
paddlex/engine.py

@@ -19,6 +19,7 @@ from .utils.result_saver import try_except_decorator
 from .utils.config import parse_args, get_config
 from .utils.errors import raise_unsupported_api_error
 from .model import _ModelBasedConfig
+from .utils.flags import INFER_BENCHMARK
 
 
 class Engine(object):
@@ -47,6 +48,8 @@ class Engine(object):
             return self._model.export()
         elif self._mode == "predict":
             for res in self._model.predict():
+                if INFER_BENCHMARK:
+                    continue
                 res.print()
                 if self._output:
                     res.save_all(save_path=self._output)

+ 1 - 14
paddlex/inference/common/batch_sampler/base_batch_sampler.py

@@ -15,12 +15,6 @@
 from typing import Union, Tuple, List, Dict, Any, Iterator
 from abc import ABC, abstractmethod
 
-from ....utils.flags import (
-    INFER_BENCHMARK,
-    INFER_BENCHMARK_ITER,
-    INFER_BENCHMARK_DATA_SIZE,
-)
-
 
 class BaseBatchSampler:
     """BaseBatchSampler"""
@@ -33,9 +27,6 @@ class BaseBatchSampler:
         """
         super().__init__()
         self._batch_size = batch_size
-        self._benchmark = INFER_BENCHMARK
-        self._benchmark_iter = INFER_BENCHMARK_ITER
-        self._benchmark_data_size = INFER_BENCHMARK_DATA_SIZE
 
     @property
     def batch_size(self) -> int:
@@ -69,11 +60,7 @@ class BaseBatchSampler:
         Yields:
             Iterator[List[Any]]: An iterator yielding the batch data.
         """
-        if input is None and self._benchmark:
-            for _ in range(self._benchmark_iter):
-                yield self._rand_batch(self._benchmark_data_size)
-        else:
-            yield from self.sample(input)
+        yield from self.sample(input)
 
     @abstractmethod
     def sample(self, *args: Tuple[Any], **kwargs: Dict[str, Any]) -> Iterator[list]:

+ 6 - 4
paddlex/inference/common/batch_sampler/image_batch_sampler.py

@@ -128,9 +128,11 @@ class ImageBatchSampler(BaseBatchSampler):
                 assert all(isinstance(item, int) for item in res)
                 return res
 
+        rand_batch = ImgInstance()
         size = parse_size(data_size)
-        rand_batch = [
-            np.random.randint(0, 256, (*size, 3), dtype=np.uint8)
-            for _ in range(self.batch_size)
-        ]
+        for _ in range(self.batch_size):
+            rand_batch.append(
+                np.random.randint(0, 256, (*size, 3), dtype=np.uint8), None, None
+            )
+
         return rand_batch

+ 2 - 0
paddlex/inference/common/reader/image_reader.py

@@ -16,6 +16,7 @@ import numpy as np
 import cv2
 
 from ...utils.io import ImageReader, PDFReader
+from ...utils.benchmark import benchmark
 
 
 class ReadImage:
@@ -40,6 +41,7 @@ class ReadImage:
         flags = self._FLAGS_DICT[self.format]
         self._img_reader = ImageReader(backend="opencv", flags=flags)
 
+    @benchmark.timeit
     def __call__(self, imgs):
         """apply"""
         return [self.read(img) for img in imgs]
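
For context, the `@benchmark.timeit` decorator applied above (and to the processor classes throughout the rest of this commit) times each call and records the elapsed time under the function's qualified name. Below is a minimal self-contained sketch of the pattern; it is not the actual PaddleX implementation, which appears in the `paddlex/inference/utils/benchmark.py` diff at the end of this commit.

```python
import functools
import time


class MiniBenchmark:
    """Toy stand-in for PaddleX's Benchmark: maps qualified names to timings."""

    def __init__(self, enabled=True):
        self._enabled = enabled
        self._elapses = {}  # e.g. {"ReadImage.__call__": [1.83, 1.79, ...]} in ms

    def timeit(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not self._enabled:
                return func(*args, **kwargs)
            tic = time.time()
            output = func(*args, **kwargs)
            self._elapses.setdefault(func.__qualname__, []).append(
                (time.time() - tic) * 1000
            )
            return output

        return wrapper


benchmark = MiniBenchmark()


class ReadImage:
    @benchmark.timeit
    def __call__(self, imgs):
        return list(imgs)  # stand-in for the real decoding work


ReadImage()(["a.png", "b.png"])
print(benchmark._elapses)  # {'ReadImage.__call__': [...]}
```

In the real implementation, the qualified name is later shortened to the class name for reporting (the `k.split(".")[0]` in `Benchmark.gather` below).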

+ 9 - 0
paddlex/inference/models/3d_bev_detection/processors.py

@@ -22,6 +22,7 @@ import lazy_paddle as paddle
 from ...utils.io import ImageReader
 from ....utils import logging
 from ...common.reader.det_3d_reader import Sample
+from ...utils.benchmark import benchmark
 
 
 cv2_interp_codes = {
@@ -70,6 +71,7 @@ class LoadPointsFromFile:
         points = np.fromfile(pts_filename, dtype=np.float32)
         return points
 
+    @benchmark.timeit
     def __call__(self, results):
         """Call function to load points data from file and process it.
 
@@ -219,6 +221,7 @@ class LoadPointsFromMultiSweeps(object):
         )
         return points[filt]
 
+    @benchmark.timeit
     def __call__(self, results):
         """Call function to load multi-sweep point clouds from files.
 
@@ -305,6 +308,7 @@ class LoadMultiViewImageFromFiles:
         self.constant_std = constant_std
         self.imread_flag = imread_flag
 
+    @benchmark.timeit
     def __call__(self, sample):
         """
         Call method to load multi-view image from files and update the sample dictionary.
@@ -636,6 +640,7 @@ class ResizeImage:
         """Resize semantic segmentation map with ``results['scale']``."""
         raise NotImplementedError
 
+    @benchmark.timeit
     def __call__(self, results):
         """Call function to resize images, bounding boxes, masks, and semantic segmentation maps according to the provided scale or scale factor.
 
@@ -709,6 +714,7 @@ class NormalizeImage:
         cv2.multiply(img, stdinv, img)  # inplace
         return img
 
+    @benchmark.timeit
     def __call__(self, results):
         """Call method to normalize images in the results dictionary.
 
@@ -853,6 +859,7 @@ class PadImage(object):
         """Pad semantic segmentation map according to ``results['pad_shape']``."""
         raise NotImplementedError
 
+    @benchmark.timeit
     def __call__(self, results):
         """Call function to pad images, masks, semantic segmentation maps."""
         self._pad_img(results)
@@ -890,6 +897,7 @@ class SampleFilterByKey:
         self.keys = keys
         self.meta_keys = meta_keys
 
+    @benchmark.timeit
     def __call__(self, sample):
         """Call function to filter sample by keys. The keys in `meta_keys` are used to filter metadata from the input sample.
 
@@ -944,6 +952,7 @@ class GetInferInput:
                 collated_batch[k] = [elem[k] for elem in batch]
         return collated_batch
 
+    @benchmark.timeit
     def __call__(self, sample):
         """Call function to infer input data from transformed sample
 

+ 3 - 0
paddlex/inference/models/anomaly_detection/processors.py

@@ -15,6 +15,8 @@
 import numpy as np
 from skimage import measure, morphology
 
+from ...utils.benchmark import benchmark
+
 
 class MapToMask:
     """Map_to_mask"""
@@ -25,6 +27,7 @@ class MapToMask:
         """
         super().__init__()
 
+    @benchmark.timeit
     def __call__(self, preds, *args):
         """apply"""
         return [self.apply(pred) for pred in preds]

+ 0 - 1
paddlex/inference/models/base/predictor/base_predictor.py

@@ -71,7 +71,6 @@ class BasePredictor(ABC):
 
         # alias predict() to the __call__()
         self.predict = self.__call__
-        self.benchmark = None
 
     @property
     def config_path(self) -> str:

+ 23 - 16
paddlex/inference/models/base/predictor/basic_predictor.py

@@ -19,6 +19,7 @@ from .....utils.subclass_register import AutoRegisterABCMetaClass
 from .....utils.flags import (
     INFER_BENCHMARK,
     INFER_BENCHMARK_WARMUP,
+    INFER_BENCHMARK_ITER,
 )
 from .....utils import logging
 from ....utils.pp_option import PaddlePredictorOption
@@ -69,7 +70,6 @@ class BasicPredictor(
         self.batch_sampler.batch_size = batch_size
 
         logging.debug(f"{self.__class__.__name__}: {self.model_dir}")
-        self.benchmark = benchmark
 
     def __call__(
         self,
@@ -93,23 +93,30 @@ class BasicPredictor(
             Iterator[Any]: An iterator yielding the prediction output.
         """
         self.set_predictor(batch_size, device, pp_option)
-        if self.benchmark:
-            self.benchmark.start()
+        if INFER_BENCHMARK:
+            # TODO(zhang-prog): Get metadata of input data
+            if not isinstance(input, str):
+                raise TypeError("Only support string as input")
+            input = [input] * batch_size
+
+            if not (INFER_BENCHMARK_WARMUP > 0 or INFER_BENCHMARK_ITER > 0):
+                raise RuntimeError(
+                    "At least one of `INFER_BENCHMARK_WARMUP` and `INFER_BENCHMARK_ITER` must be greater than zero"
+                )
+
             if INFER_BENCHMARK_WARMUP > 0:
-                output = self.apply(input, **kwargs)
-                warmup_num = 0
+                benchmark.start_warmup()
                 for _ in range(INFER_BENCHMARK_WARMUP):
-                    try:
-                        next(output)
-                        warmup_num += 1
-                    except StopIteration:
-                        logging.warning(
-                            f"There are only {warmup_num} batches in input data, but `INFER_BENCHMARK_WARMUP` has been set to {INFER_BENCHMARK_WARMUP}."
-                        )
-                        break
-                self.benchmark.warmup_stop(warmup_num)
-            output = list(self.apply(input, **kwargs))
-            self.benchmark.collect(len(output))
+                    output = list(self.apply(input, **kwargs))
+                benchmark.collect(batch_size)
+                benchmark.stop_warmup()
+
+            if INFER_BENCHMARK_ITER > 0:
+                for _ in range(INFER_BENCHMARK_ITER):
+                    output = list(self.apply(input, **kwargs))
+                benchmark.collect(batch_size)
+
+            yield output[0]
         else:
             yield from self.apply(input, **kwargs)
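
In plain terms, the new flow replicates a single input path into one batch, requires at least one of the warm-up and iteration counts to be positive, times the warm-up rounds separately, and then times the measured rounds. The sketch below mirrors that logic under simplifying assumptions: the `apply` callable is a stand-in for the real pipeline, and the flags are read directly from the environment rather than through `paddlex.utils.flags`.

```python
import os

WARMUP = int(os.environ.get("PADDLE_PDX_INFER_BENCHMARK_WARMUP", "0"))
ITERS = int(os.environ.get("PADDLE_PDX_INFER_BENCHMARK_ITER", "0"))


def benchmarked_call(apply, input_path, batch_size, benchmark):
    # Only a local path (string) is accepted; it is replicated to fill one batch.
    if not isinstance(input_path, str):
        raise TypeError("Only support string as input")
    batch = [input_path] * batch_size

    if not (WARMUP > 0 or ITERS > 0):
        raise RuntimeError("Set at least one of WARMUP and ITERS to a positive value")

    if WARMUP > 0:
        benchmark.start_warmup()
        for _ in range(WARMUP):
            output = list(apply(batch))
        benchmark.collect(batch_size)  # report the warm-up timings...
        benchmark.stop_warmup()        # ...then reset them before measuring

    if ITERS > 0:
        for _ in range(ITERS):
            output = list(apply(batch))
        benchmark.collect(batch_size)

    return output[0]
```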
 

+ 32 - 63
paddlex/inference/models/common/static_infer.py

@@ -14,13 +14,12 @@
 
 from typing import Union, Tuple, List, Dict, Any, Iterator
 import os
-import shutil
-import threading
 from pathlib import Path
 import lazy_paddle as paddle
 import numpy as np
 
 from ....utils.flags import DEBUG, FLAGS_json_format_model, USE_PIR_TRT
+from ...utils.benchmark import benchmark
 from ....utils import logging
 from ...utils.pp_option import PaddlePredictorOption
 from ...utils.trt_config import TRT_CFG
@@ -90,29 +89,17 @@ def convert_trt(model_name, mode, pp_model_path, trt_save_path, trt_dynamic_shap
 
 
 class Copy2GPU:
-
-    def __init__(self, input_handlers):
-        super().__init__()
-        self.input_handlers = input_handlers
-
-    def __call__(self, x):
-        for idx in range(len(x)):
-            self.input_handlers[idx].reshape(x[idx].shape)
-            self.input_handlers[idx].copy_from_cpu(x[idx])
+    @benchmark.timeit
+    def __call__(self, arrs):
+        paddle_tensors = [paddle.to_tensor(i) for i in arrs]
+        return paddle_tensors
 
 
 class Copy2CPU:
-
-    def __init__(self, output_handlers):
-        super().__init__()
-        self.output_handlers = output_handlers
-
-    def __call__(self):
-        output = []
-        for out_tensor in self.output_handlers:
-            batch = out_tensor.copy_to_cpu()
-            output.append(batch)
-        return output
+    @benchmark.timeit
+    def __call__(self, paddle_tensors):
+        arrs = [i.numpy() for i in paddle_tensors]
+        return arrs
 
 
 class Infer:
@@ -121,8 +108,9 @@ class Infer:
         super().__init__()
         self.predictor = predictor
 
-    def __call__(self):
-        self.predictor.run()
+    @benchmark.timeit
+    def __call__(self, x):
+        return self.predictor.run(x)
 
 
 class StaticInfer:
@@ -135,22 +123,10 @@ class StaticInfer:
         self.model_dir = model_dir
         self.model_prefix = model_prefix
         self.option = option
-        self.option.changed = True
-        self._lock = threading.Lock()
-
-    def _reset(self) -> None:
-        with self._lock:
-            self.option.changed = False
-            logging.debug(f"Env: {self.option}")
-            (
-                predictor,
-                input_handlers,
-                output_handlers,
-            ) = self._create()
-
-        self.copy2gpu = Copy2GPU(input_handlers)
-        self.copy2cpu = Copy2CPU(output_handlers)
-        self.infer = Infer(predictor)
+        self.predictor = self._create()
+        self.copy2gpu = Copy2GPU()
+        self.copy2cpu = Copy2CPU()
+        self.infer = Infer(self.predictor)
 
     def _create(
         self,
@@ -303,29 +279,22 @@ class StaticInfer:
         # Get input and output handlers
         input_names = predictor.get_input_names()
         input_names.sort()
-        input_handlers = []
-        output_handlers = []
-        for input_name in input_names:
-            input_handler = predictor.get_input_handle(input_name)
-            input_handlers.append(input_handler)
-        output_names = predictor.get_output_names()
-        for output_name in output_names:
-            output_handler = predictor.get_output_handle(output_name)
-            output_handlers.append(output_handler)
-        return predictor, input_handlers, output_handlers
+
+        return predictor
 
     def __call__(self, x) -> List[Any]:
-        if self.option.changed:
-            self._reset()
-        self.copy2gpu(x)
-        self.infer()
-        pred = self.copy2cpu()
+        # NOTE: Adjust input tensors to match the sorted sequence.
+        names = self.predictor.get_input_names()
+        if len(names) != len(x):
+            raise ValueError(
+                f"The number of inputs does not match the model: {len(names)} vs {len(x)}"
+            )
+        indices = sorted(range(len(names)), key=names.__getitem__)
+        x = [x[indices.index(i)] for i in range(len(x))]
+        # TODO:
+        # Ensure that input tensors follow the model's input sequence without sorting.
+
+        inputs = self.copy2gpu(x)
+        outputs = self.infer(inputs)
+        pred = self.copy2cpu(outputs)
         return pred
-
-    @property
-    def benchmark(self):
-        return {
-            "Copy2GPU": self.copy2gpu,
-            "Infer": self.infer,
-            "Copy2CPU": self.copy2cpu,
-        }
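
The input-reordering lines in the new `__call__` are easy to misread, so here is a standalone illustration of the permutation; the input names and the placeholder tensor strings are invented for the example:

```python
# The model declares its inputs in this order:
names = ["image", "im_shape", "scale_factor"]
# Assume the incoming list follows the *sorted* name order (note the retained
# input_names.sort() in _create), so slot 0 holds "im_shape", and so on.
x = ["im_shape_tensor", "image_tensor", "scale_factor_tensor"]

# indices[k] = position in `names` of the k-th name in alphabetical order.
indices = sorted(range(len(names)), key=names.__getitem__)  # [1, 0, 2]

# Undo the sort: model input i takes the tensor stored at the sorted-order
# rank of names[i], yielding the model's declaration order.
x = [x[indices.index(i)] for i in range(len(x))]
print(x)  # ['image_tensor', 'im_shape_tensor', 'scale_factor_tensor']
```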

+ 7 - 1
paddlex/inference/models/common/ts/processors.py

@@ -20,7 +20,7 @@ import numpy as np
 import pandas as pd
 
 from .funcs import load_from_dataframe, time_feature
-
+from ....utils.benchmark import benchmark
 
 __all__ = [
     "BuildTSDataset",
@@ -53,6 +53,7 @@ class TSCutOff:
         super().__init__()
         self.size = size
 
+    @benchmark.timeit
     def __call__(self, ts_list: List) -> List:
         """Applies the cut off operation to a list of time series.
 
@@ -111,6 +112,7 @@ class TSNormalize:
         self.scaler = joblib.load(scale_path)
         self.params_info = params_info
 
+    @benchmark.timeit
     def __call__(self, ts_list: List[pd.DataFrame]) -> List[pd.DataFrame]:
         """Applies normalization to a list of time series data frames.
 
@@ -158,6 +160,7 @@ class BuildTSDataset:
         super().__init__()
         self.params_info = params_info
 
+    @benchmark.timeit
     def __call__(self, ts_list: List) -> List:
         """Applies the dataset construction to a list of time series.
 
@@ -200,6 +203,7 @@ class TimeFeature:
         self.size = size
         self.holiday = holiday
 
+    @benchmark.timeit
     def __call__(self, ts_list: List) -> List:
         """Applies time feature extraction to a list of time series.
 
@@ -258,6 +262,7 @@ class TStoArray:
         super().__init__()
         self.input_data = input_data
 
+    @benchmark.timeit
     def __call__(self, ts_list: List[Dict[str, Any]]) -> List[List[np.ndarray]]:
         """Converts a list of time series data frames into arrays.
 
@@ -295,6 +300,7 @@ class TStoBatch:
     equal-length arrays or DataFrames.
     """
 
+    @benchmark.timeit
     def __call__(self, ts_list: List[np.ndarray]) -> List[np.ndarray]:
         """Convert a list of time series into batches.
 

+ 7 - 0
paddlex/inference/models/common/vision/processors.py

@@ -23,6 +23,7 @@ import cv2
 from PIL import Image
 
 from . import funcs as F
+from ....utils.benchmark import benchmark
 
 
 class _BaseResize:
@@ -112,6 +113,7 @@ class Resize(_BaseResize):
 
         self.keep_ratio = keep_ratio
 
+    @benchmark.timeit
     def __call__(self, imgs):
         """apply"""
         return [self.resize(img) for img in imgs]
@@ -155,6 +157,7 @@ class ResizeByLong(_BaseResize):
         super().__init__(size_divisor=size_divisor, interp=interp, backend=backend)
         self.target_long_edge = target_long_edge
 
+    @benchmark.timeit
     def __call__(self, imgs):
         """apply"""
         return [self.resize(img) for img in imgs]
@@ -196,6 +199,7 @@ class ResizeByShort(_BaseResize):
         super().__init__(size_divisor=size_divisor, interp=interp, backend=backend)
         self.target_short_edge = target_short_edge
 
+    @benchmark.timeit
     def __call__(self, imgs):
         """apply"""
         return [self.resize(img) for img in imgs]
@@ -243,6 +247,7 @@ class Normalize:
         self.std = np.asarray(std).astype("float32")
         self.preserve_dtype = preserve_dtype
 
+    @benchmark.timeit
     def __call__(self, imgs):
         """apply"""
         old_type = imgs[0].dtype
@@ -260,11 +265,13 @@ class Normalize:
 class ToCHWImage:
     """Reorder the dimensions of the image from HWC to CHW."""
 
+    @benchmark.timeit
     def __call__(self, imgs):
         """apply"""
         return [img.transpose((2, 0, 1)) for img in imgs]
 
 
 class ToBatch:
+    @benchmark.timeit
     def __call__(self, imgs):
         return [np.stack(imgs, axis=0).astype(dtype=np.float32, copy=False)]

+ 11 - 0
paddlex/inference/models/formula_recognition/processors.py

@@ -28,6 +28,7 @@ from tokenizers import AddedToken
 from typing import List, Tuple, Optional, Any, Dict, Union
 
 from ....utils import logging
+from ...utils.benchmark import benchmark
 
 
 class MinMaxResize:
@@ -142,6 +143,7 @@ class MinMaxResize:
             img = np.dstack((img, img, img))
             return img
 
+    @benchmark.timeit
     def __call__(self, imgs: List[np.ndarray]) -> List[np.ndarray]:
         """Applies the resize method to a list of images.
 
@@ -181,6 +183,7 @@ class LatexTestTransform:
         squeezed = np.squeeze(grayscale_image)
         return cv2.merge([squeezed] * self.num_output_channels)
 
+    @benchmark.timeit
     def __call__(self, imgs: List[np.ndarray]) -> List[np.ndarray]:
         """
         Apply the transform to a list of images.
@@ -220,6 +223,7 @@ class LatexImageFormat:
         img_expanded = img[:, :, np.newaxis].transpose(2, 0, 1)
         return img_expanded[np.newaxis, :]
 
+    @benchmark.timeit
     def __call__(self, imgs: List[np.ndarray]) -> List[np.ndarray]:
         """Applies the format method to a list of images.
 
@@ -275,6 +279,7 @@ class NormalizeImage(object):
         img = (img.astype("float32") * self.scale - self.mean) / self.std
         return img
 
+    @benchmark.timeit
     def __call__(self, imgs: List[Union[np.ndarray, Image.Image]]) -> List[np.ndarray]:
         """Apply normalization to a list of images."""
         return [self.normalize(img) for img in imgs]
@@ -287,6 +292,7 @@ class ToBatch(object):
         """Initializes the ToBatch object."""
         super(ToBatch, self).__init__()
 
+    @benchmark.timeit
     def __call__(self, imgs: List[np.ndarray]) -> List[np.ndarray]:
         """Concatenates a list of images into a single batch.
 
@@ -371,6 +377,7 @@ class LaTeXOCRDecode(object):
         ]
         return [self.post_process(dec_str) for dec_str in dec_str_list]
 
+    @benchmark.timeit
     def __call__(
         self,
         preds: np.ndarray,
@@ -543,6 +550,7 @@ class UniMERNetImgDecode(object):
         )
         return np.array(ImageOps.expand(img, padding))
 
+    @benchmark.timeit
     def __call__(self, imgs: List[np.ndarray]) -> List[Optional[np.ndarray]]:
         """Calls the img_decode method on a list of images.
 
@@ -871,6 +879,7 @@ class UniMERNetDecode(object):
         text = self.normalize(text)
         return text
 
+    @benchmark.timeit
     def __call__(
         self,
         preds: np.ndarray,
@@ -934,6 +943,7 @@ class UniMERNetTestTransform:
         img = cv2.merge([squeezed] * self.num_output_channels)
         return img
 
+    @benchmark.timeit
     def __call__(self, imgs: List[np.ndarray]) -> List[np.ndarray]:
         """
         Applies the transform to a list of images.
@@ -974,6 +984,7 @@ class UniMERNetImageFormat:
         img_expanded = img[:, :, np.newaxis].transpose(2, 0, 1)
         return img_expanded[np.newaxis, :]
 
+    @benchmark.timeit
     def __call__(self, imgs: List[np.ndarray]) -> List[np.ndarray]:
         """Applies the format method to a list of images.
 

+ 3 - 0
paddlex/inference/models/image_classification/processors.py

@@ -16,6 +16,7 @@ import numpy as np
 
 from ....utils import logging
 from ..common.vision import F
+from ...utils.benchmark import benchmark
 
 
 class Crop:
@@ -41,6 +42,7 @@ class Crop:
             raise ValueError("Unsupported interpolation method")
         self.mode = mode
 
+    @benchmark.timeit
     def __call__(self, imgs):
         """apply"""
         return [self.crop(img) for img in imgs]
@@ -78,6 +80,7 @@ class Topk:
         class_id_map = {id: str(lb) for id, lb in enumerate(class_ids)}
         return class_id_map
 
+    @benchmark.timeit
     def __call__(self, preds, topk=5):
         indexes = preds[0].argsort(axis=1)[:, -topk:][:, ::-1].astype("int32")
         scores = [

+ 3 - 0
paddlex/inference/models/image_feature/processors.py

@@ -14,6 +14,8 @@
 
 import numpy as np
 
+from ...utils.benchmark import benchmark
+
 
 class NormalizeFeatures:
     """Normalize Features Transform"""
@@ -24,6 +26,7 @@ class NormalizeFeatures:
         features = np.divide(preds[0], feas_norm)
         return features
 
+    @benchmark.timeit
     def __call__(self, preds):
         normalized_features = [self._normalize(feature) for feature in preds]
         return normalized_features

+ 3 - 0
paddlex/inference/models/image_multilabel_classification/processors.py

@@ -15,6 +15,8 @@
 import numpy as np
 from typing import Union
 
+from ...utils.benchmark import benchmark
+
 
 class MultiLabelThreshOutput:
     """MultiLabelThresh Transform"""
@@ -31,6 +33,7 @@ class MultiLabelThreshOutput:
         class_id_map = {id: str(lb) for id, lb in enumerate(class_ids)}
         return class_id_map
 
+    @benchmark.timeit
     def __call__(self, preds, threshold: Union[float, dict, list]):
         threshold_list = []
         num_classes = preds[0].shape[-1]

+ 3 - 0
paddlex/inference/models/image_unwarping/processors.py

@@ -15,6 +15,8 @@
 import numpy as np
 from typing import List, Union, Tuple
 
+from ...utils.benchmark import benchmark
+
 
 class DocTrPostProcess:
     """
@@ -44,6 +46,7 @@ class DocTrPostProcess:
             np.float32(scale) if isinstance(scale, (str, float)) else np.float32(255.0)
         )
 
+    @benchmark.timeit
     def __call__(
         self, imgs: List[Union[np.ndarray, Tuple[np.ndarray, ...]]]
     ) -> List[np.ndarray]:

+ 2 - 0
paddlex/inference/models/instance_segmentation/processors.py

@@ -18,6 +18,7 @@ from typing import List, Sequence, Tuple, Union, Optional
 import numpy as np
 from ....utils import logging
 from ..object_detection.processors import restructured_boxes
+from ...utils.benchmark import benchmark
 
 import cv2
 
@@ -78,6 +79,7 @@ class InstanceSegPostProcess(object):
 
         return result
 
+    @benchmark.timeit
     def __call__(
         self,
         batch_outputs: List[dict],

+ 3 - 0
paddlex/inference/models/keypoint_detection/processors.py

@@ -20,6 +20,7 @@ import numpy as np
 from numpy import ndarray
 
 from ..object_detection.processors import get_affine_transform
+from ...utils.benchmark import benchmark
 
 Number = Union[int, float]
 Kpts = List[dict]
@@ -136,6 +137,7 @@ class TopDownAffine:
 
         return img, center, scale
 
+    @benchmark.timeit
     def __call__(self, datas: List[dict]) -> List[dict]:
         for data in datas:
             ori_img = data["img"]
@@ -216,6 +218,7 @@ class KptPostProcess:
             for kpt, score in zip(keypoints, scores)
         ]
 
+    @benchmark.timeit
     def __call__(self, batch_outputs: List[dict], datas: List[dict]) -> List[Kpts]:
         """Apply the post-processing to a batch of outputs.
 

+ 10 - 0
paddlex/inference/models/object_detection/processors.py

@@ -21,6 +21,7 @@ from numpy import ndarray
 from ..common import Resize as CommonResize
 from ..common import Normalize as CommonNormalize
 from ...common.reader import ReadImage as CommonReadImage
+from ...utils.benchmark import benchmark
 
 Boxes = List[dict]
 Number = Union[int, float]
@@ -29,6 +30,7 @@ Number = Union[int, float]
 class ReadImage(CommonReadImage):
     """Reads images from a list of raw image data or file paths."""
 
+    @benchmark.timeit
     def __call__(self, raw_imgs: List[Union[ndarray, str, dict]]) -> List[dict]:
         """Processes the input list of raw image data or file paths and returns a list of dictionaries containing image information.
 
@@ -93,6 +95,7 @@ class ReadImage(CommonReadImage):
 
 
 class Resize(CommonResize):
+    @benchmark.timeit
     def __call__(self, datas: List[dict]) -> List[dict]:
         """
         Args:
@@ -138,6 +141,7 @@ class Normalize(CommonNormalize):
             img = img.astype(old_type, copy=False)
         return img
 
+    @benchmark.timeit
     def __call__(self, datas: List[dict]) -> List[dict]:
         """Normalizes images in a list of dictionaries. Iterates over each dictionary,
         applies normalization to the 'img' key, and returns the modified list.
@@ -150,6 +154,7 @@ class Normalize(CommonNormalize):
 class ToCHWImage:
     """Converts images in a list of dictionaries from HWC to CHW format."""
 
+    @benchmark.timeit
     def __call__(self, datas: List[dict]) -> List[dict]:
         """Converts the image data in the list of dictionaries from HWC to CHW format in-place.
 
@@ -207,6 +212,7 @@ class ToBatch:
                 dtype=dtype, copy=False
             )
 
+    @benchmark.timeit
     def __call__(self, datas: List[dict]) -> Sequence[ndarray]:
         return [self.apply(datas, key) for key in self.ordered_required_keys]
 
@@ -242,6 +248,7 @@ class DetPad:
         canvas[0:im_h, 0:im_w, :] = im.astype(np.float32)
         return canvas
 
+    @benchmark.timeit
     def __call__(self, datas: List[dict]) -> List[dict]:
         for data in datas:
             data["img"] = self.apply(data["img"])
@@ -276,6 +283,7 @@ class PadStride:
         padding_im[:, :im_h, :im_w] = im
         return padding_im
 
+    @benchmark.timeit
     def __call__(self, datas: List[dict]) -> List[dict]:
         for data in datas:
             data["img"] = self.apply(data["img"])
@@ -438,6 +446,7 @@ class WarpAffine:
 
         return inp
 
+    @benchmark.timeit
     def __call__(self, datas: List[dict]) -> List[dict]:
 
         for data in datas:
@@ -760,6 +769,7 @@ class DetPostProcess:
             )
         return boxes
 
+    @benchmark.timeit
     def __call__(
         self,
         batch_outputs: List[dict],

+ 5 - 0
paddlex/inference/models/open_vocabulary_detection/processors/groundingdino_processors.py

@@ -20,6 +20,7 @@ import PIL
 
 from ...common.tokenizer.bert_tokenizer import BertTokenizer
 from .....utils.lazy_loader import LazyLoader
+from ....utils.benchmark import benchmark
 
 # NOTE: LazyLoader is used to avoid conflicts between ultra-infer and Paddle
 paddle = LazyLoader("lazy_paddle", globals(), "paddle")
@@ -117,6 +118,7 @@ class GroundingDINOPostProcessor(object):
         self.box_threshold = box_threshold
         self.text_threshold = text_threshold
 
+    @benchmark.timeit
     def __call__(
         self,
         pred_boxes,
@@ -234,6 +236,7 @@ class GroundingDINOProcessor(object):
         assert os.path.isdir(tokenizer_dir), f"{tokenizer_dir} not exists."
         self.tokenizer = BertTokenizer.from_pretrained(tokenizer_dir)
 
+    @benchmark.timeit
     def __call__(
         self,
         images: List[PIL.Image.Image],
@@ -270,6 +273,7 @@ class GroundingDinoTextProcessor(object):
     ):
         self.max_words = max_words
 
+    @benchmark.timeit
     def __call__(
         self,
         input_ids,
@@ -387,6 +391,7 @@ class GroundingDinoImageProcessor(object):
         self.image_std = image_std
         self.do_nested = do_nested
 
+    @benchmark.timeit
     def __call__(self, images, **kwargs):
         """Preprocess an image or a batch of images."""
         return self.preprocess(images, **kwargs)

+ 3 - 0
paddlex/inference/models/open_vocabulary_segmentation/processors/sam_processer.py

@@ -20,6 +20,7 @@ import PIL
 from copy import deepcopy
 
 from .....utils.lazy_loader import LazyLoader
+from ....utils.benchmark import benchmark
 
 # NOTE: LazyLoader is used to avoid conflicts between ultra-infer and Paddle
 paddle = LazyLoader("lazy_paddle", globals(), "paddle")
@@ -159,6 +160,7 @@ class SamPromptProcessor(object):
         boxes = self.apply_coords(boxes.reshape([-1, 2, 2]), original_size)
         return boxes.reshape([-1, 4])
 
+    @benchmark.timeit
     def __call__(
         self,
         original_size,
@@ -213,6 +215,7 @@ class SamImageProcessor(object):
 
         return np.array(T.resize(image, target_size))
 
+    @benchmark.timeit
     def __call__(self, images, **kwargs):
         if not isinstance(images, (list, tuple)):
             images = [images]

+ 3 - 0
paddlex/inference/models/semantic_segmentation/processors.py

@@ -23,6 +23,7 @@ import numpy as np
 from ..common.vision.processors import _BaseResize
 
 from ..common.vision import funcs as F
+from ...utils.benchmark import benchmark
 
 
 class Resize(_BaseResize):
@@ -52,6 +53,7 @@ class Resize(_BaseResize):
 
         self.keep_ratio = keep_ratio
 
+    @benchmark.timeit
     def __call__(self, imgs, target_size=None):
         """apply"""
         target_size = self.target_size if target_size is None else target_size
@@ -88,6 +90,7 @@ class SegPostProcess:
     restoring the prediction segmentation map to the original image size for now.
     """
 
+    @benchmark.timeit
     def __call__(self, imgs, src_images):
         assert len(imgs) == len(src_images)
 

+ 3 - 0
paddlex/inference/models/table_structure_recognition/processors.py

@@ -17,6 +17,7 @@ import cv2
 import numpy as np
 from numpy import ndarray
 from ..common.vision import funcs as F
+from ...utils.benchmark import benchmark
 
 
 class Pad:
@@ -55,6 +56,7 @@ class Pad:
 
         return [img, [img.shape[1], img.shape[0]]]
 
+    @benchmark.timeit
     def __call__(self, imgs):
         """apply"""
         return [self.apply(img) for img in imgs]
@@ -119,6 +121,7 @@ class TableLabelDecode:
             assert False, "unsupported type %s in get_beg_end_flag_idx" % beg_or_end
         return idx
 
+    @benchmark.timeit
     def __call__(self, pred, img_size, ori_img_size):
         """apply"""
         bbox_preds, structure_probs = [], []

+ 4 - 0
paddlex/inference/models/text_detection/processors.py

@@ -26,6 +26,7 @@ from shapely.geometry import Polygon
 
 from ...utils.io import ImageReader
 from ....utils import logging
+from ...utils.benchmark import benchmark
 
 
 class DetResizeForTest:
@@ -50,6 +51,7 @@ class DetResizeForTest:
             self.limit_side_len = 736
             self.limit_type = "min"
 
+    @benchmark.timeit
     def __call__(
         self,
         imgs,
@@ -196,6 +198,7 @@ class NormalizeImage:
         self.mean = np.array(mean).reshape(shape).astype("float32")
         self.std = np.array(std).reshape(shape).astype("float32")
 
+    @benchmark.timeit
     def __call__(self, imgs):
         """apply"""
 
@@ -412,6 +415,7 @@ class DBPostProcess:
         cv2.fillPoly(mask, contour.reshape(1, -1, 2).astype(np.int32), 1)
         return cv2.mean(bitmap[ymin : ymax + 1, xmin : xmax + 1], mask)[0]
 
+    @benchmark.timeit
     def __call__(
         self,
         preds,

+ 5 - 0
paddlex/inference/models/text_recognition/processors.py

@@ -27,6 +27,7 @@ import tempfile
 from tokenizers import Tokenizer as TokenizerFast
 
 from ....utils import logging
+from ...utils.benchmark import benchmark
 
 
 class OCRReisizeNormImg:
@@ -57,6 +58,7 @@ class OCRReisizeNormImg:
         padding_im[:, :, 0:resized_w] = resized_image
         return padding_im
 
+    @benchmark.timeit
     def __call__(self, imgs):
         """apply"""
         return [self.resize(img) for img in imgs]
@@ -146,6 +148,7 @@ class BaseRecLabelDecode:
         """get_ignored_tokens"""
         return [0]  # for ctc blank
 
+    @benchmark.timeit
     def __call__(self, pred):
         """apply"""
         preds = np.array(pred)
@@ -168,6 +171,7 @@ class CTCLabelDecode(BaseRecLabelDecode):
     def __init__(self, character_list=None, use_space_char=True):
         super().__init__(character_list, use_space_char=use_space_char)
 
+    @benchmark.timeit
     def __call__(self, pred):
         """apply"""
         preds = np.array(pred[0])
@@ -213,6 +217,7 @@ class ToBatch:
             padded_imgs.append(padded_img)
         return padded_imgs
 
+    @benchmark.timeit
     def __call__(self, imgs: List[np.ndarray]) -> List[np.ndarray]:
         """Call method to pad images and stack them into a batch.
 

+ 3 - 0
paddlex/inference/models/ts_anomaly_detection/processors.py

@@ -16,6 +16,8 @@ from typing import List, Dict, Any
 import numpy as np
 import pandas as pd
 
+from ...utils.benchmark import benchmark
+
 
 class GetAnomaly:
     """A class to detect anomalies in time series data based on a model threshold."""
@@ -32,6 +34,7 @@ class GetAnomaly:
         self.model_threshold = model_threshold
         self.info_params = info_params
 
+    @benchmark.timeit
     def __call__(
         self, ori_ts_list: List[Dict[str, Any]], pred_list: List[np.ndarray]
     ) -> List[pd.DataFrame]:

+ 4 - 0
paddlex/inference/models/ts_classification/processors.py

@@ -16,6 +16,8 @@ import numpy as np
 import pandas as pd
 from typing import List, Any, Dict
 
+from ...utils.benchmark import benchmark
+
 
 class GetCls:
     """A class to process prediction outputs and return class IDs and scores."""
@@ -24,6 +26,7 @@ class GetCls:
         """Initializes the GetCls instance."""
         super().__init__()
 
+    @benchmark.timeit
     def __call__(self, pred_list: List[Any]) -> List[pd.DataFrame]:
         """
         Processes a list of predictions and returns a list of DataFrames with class IDs and scores.
@@ -70,6 +73,7 @@ class BuildPadMask:
         super().__init__()
         self.input_data = input_data
 
+    @benchmark.timeit
     def __call__(self, ts_list: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
         """
         Applies padding mask to a list of time series data.

+ 4 - 0
paddlex/inference/models/ts_forecasting/processors.py

@@ -17,6 +17,8 @@ import joblib
 import numpy as np
 import pandas as pd
 
+from ...utils.benchmark import benchmark
+
 
 class TSDeNormalize:
     """A class to de-normalize time series prediction data using a pre-fitted scaler."""
@@ -33,6 +35,7 @@ class TSDeNormalize:
         self.scaler = joblib.load(scale_path)
         self.params_info = params_info
 
+    @benchmark.timeit
     def __call__(self, preds_list: List[pd.DataFrame]) -> List[pd.DataFrame]:
         """
         Applies de-normalization to a list of prediction DataFrames.
@@ -73,6 +76,7 @@ class ArraytoTS:
         super().__init__()
         self.info_params = info_params
 
+    @benchmark.timeit
     def __call__(
         self, ori_ts_list: List[Dict[str, Any]], pred_list: List[np.ndarray]
     ) -> List[pd.DataFrame]:

+ 8 - 0
paddlex/inference/models/video_classification/processors.py

@@ -25,6 +25,8 @@ import json
 import tempfile
 import lazy_paddle
 
+from ...utils.benchmark import benchmark
+
 
 class Scale:
     """Scale images."""
@@ -121,6 +123,7 @@ class Scale:
         imgs = resized_imgs
         return imgs
 
+    @benchmark.timeit
     def __call__(self, videos: List[np.ndarray]) -> List[np.ndarray]:
         """
         Apply the scaling operation to a list of videos.
@@ -181,6 +184,7 @@ class CenterCrop:
                 crop_imgs.append(img[y1 : y1 + th, x1 : x1 + tw])
         return crop_imgs
 
+    @benchmark.timeit
     def __call__(self, videos: List[np.ndarray]) -> List[np.ndarray]:
         """
         Apply the center crop operation to a list of videos.
@@ -234,6 +238,7 @@ class Image2Array:
                 t_imgs = t_imgs.transpose([3, 0, 1, 2])  # cthw
         return t_imgs
 
+    @benchmark.timeit
     def __call__(self, videos: List[np.ndarray]) -> List[np.ndarray]:
         """
         Apply the image to array conversion to a list of videos.
@@ -311,6 +316,7 @@ class NormalizeVideo:
         imgs = np.expand_dims(imgs, axis=0).copy()
         return imgs
 
+    @benchmark.timeit
     def __call__(self, videos: List[np.ndarray]) -> List[np.ndarray]:
         """
         Apply normalization to a list of videos.
@@ -368,6 +374,7 @@ class VideoClasTopk:
         class_id_map = {id: str(lb) for id, lb in enumerate(class_ids)}
         return class_id_map
 
+    @benchmark.timeit
     def __call__(
         self, preds: np.ndarray, topk: int = 5
     ) -> Tuple[np.ndarray, List[np.ndarray], List[List[str]]]:
@@ -397,6 +404,7 @@ class VideoClasTopk:
 class ToBatch:
     """A class for batching videos."""
 
+    @benchmark.timeit
     def __call__(self, videos: List[np.ndarray]) -> List[np.ndarray]:
         """Call method to stack videos into a batch.
 

+ 6 - 0
paddlex/inference/models/video_detection/processors.py

@@ -21,6 +21,8 @@ import numpy as np
 import cv2
 import lazy_paddle as paddle
 
+from ...utils.benchmark import benchmark
+
 
 class ResizeVideo:
     """Resizes frames of a video to a specified target size.
@@ -75,6 +77,7 @@ class ResizeVideo:
                 )
         return video
 
+    @benchmark.timeit
     def __call__(self, videos: List) -> List:
         """Resizes frames of multiple videos.
 
@@ -129,6 +132,7 @@ class Image2Array:
             video[i] = video_one
         return video
 
+    @benchmark.timeit
     def __call__(self, videos: List[List[np.ndarray]]) -> List[np.ndarray]:
         """
         Process videos by converting each video to a transposed numpy array.
@@ -177,6 +181,7 @@ class NormalizeVideo:
 
         return video
 
+    @benchmark.timeit
     def __call__(self, videos: List[List[np.ndarray]]) -> List[List[np.ndarray]]:
         """
         Apply normalization to a list of videos.
@@ -446,5 +451,6 @@ class DetVideoPostProcess:
             pred_all.append(preds)
         return pred_all
 
+    @benchmark.timeit
     def __call__(self, preds: List, nms_thresh, score_thresh) -> List:
         return [self.postprocess(pred, nms_thresh, score_thresh) for pred in preds]

+ 168 - 170
paddlex/inference/utils/benchmark.py

@@ -21,190 +21,41 @@ import numpy as np
 from prettytable import PrettyTable
 
 from ...utils.flags import INFER_BENCHMARK, INFER_BENCHMARK_OUTPUT
-from ...utils.misc import Singleton
 from ...utils import logging
 
 
-class Benchmark(metaclass=Singleton):
-    def __init__(self):
-        self._components = {}
-        self._warmup_start = None
-        self._warmup_elapse = None
-        self._warmup_num = None
-        self._e2e_tic = None
-        self._e2e_elapse = None
-
-    def attach(self, component):
-        self._components[component.name] = component
-
-    def start(self):
-        self._warmup_start = time.time()
-        self._reset()
-
-    def warmup_stop(self, warmup_num):
-        self._warmup_elapse = (time.time() - self._warmup_start) * 1000
-        self._warmup_num = warmup_num
-        self._reset()
-
-    def _reset(self):
-        for name, cmp in self.iterate_cmp(self._components):
-            cmp.timer.reset()
-        self._e2e_tic = time.time()
-
-    def iterate_cmp(self, cmps):
-        if cmps is None:
-            return
-        for name, cmp in cmps.items():
-            if hasattr(cmp, "benchmark"):
-                yield from self.iterate_cmp(cmp.benchmark)
-            yield name, cmp
-
-    def gather(self, e2e_num):
-        # lazy import for avoiding circular import
-        from ..new_models.base import BasePaddlePredictor
-
-        detail = []
-        summary = {"preprocess": 0, "inference": 0, "postprocess": 0}
-        op_tag = "preprocess"
-        for name, cmp in self._components.items():
-            if isinstance(cmp, BasePaddlePredictor):
-                # TODO(gaotingquan): show by hierarchy. Now dont show xxxPredictor benchmark info to ensure mutual exclusivity between components.
-                for name, sub_cmp in cmp.benchmark.items():
-                    times = sub_cmp.timer.logs
-                    counts = len(times)
-                    avg = np.mean(times) * 1000
-                    total = np.sum(times) * 1000
-                    detail.append((name, total, counts, avg))
-                    summary["inference"] += total
-                op_tag = "postprocess"
-            else:
-                # TODO(gaotingquan): support sub_cmps for others
-                # if hasattr(cmp, "benchmark"):
-                times = cmp.timer.logs
-                counts = len(times)
-                avg = np.mean(times) * 1000
-                total = np.sum(times) * 1000
-                detail.append((name, total, counts, avg))
-                summary[op_tag] += total
-
-        summary = [
-            (
-                "PreProcess",
-                summary["preprocess"],
-                e2e_num,
-                summary["preprocess"] / e2e_num,
-            ),
-            (
-                "Inference",
-                summary["inference"],
-                e2e_num,
-                summary["inference"] / e2e_num,
-            ),
-            (
-                "PostProcess",
-                summary["postprocess"],
-                e2e_num,
-                summary["postprocess"] / e2e_num,
-            ),
-            ("End2End", self._e2e_elapse, e2e_num, self._e2e_elapse / e2e_num),
-        ]
-        if self._warmup_elapse:
-            warmup_elapse, warmup_num, warmup_avg = (
-                self._warmup_elapse,
-                self._warmup_num,
-                self._warmup_elapse / self._warmup_num,
-            )
-        else:
-            warmup_elapse, warmup_num, warmup_avg = 0, 0, 0
-        summary.append(
-            (
-                "WarmUp",
-                warmup_elapse,
-                warmup_num,
-                warmup_avg,
-            )
-        )
-        return detail, summary
-
-    def collect(self, e2e_num):
-        self._e2e_elapse = (time.time() - self._e2e_tic) * 1000
-        detail, summary = self.gather(e2e_num)
-
-        detail_head = [
-            "Component",
-            "Total Time (ms)",
-            "Number of Calls",
-            "Avg Time Per Call (ms)",
-        ]
-        table = PrettyTable(detail_head)
-        table.add_rows(
-            [
-                (name, f"{total:.8f}", cnts, f"{avg:.8f}")
-                for name, total, cnts, avg in detail
-            ]
-        )
-        logging.info(table)
+class Benchmark:
+    def __init__(self, enabled):
+        self._enabled = enabled
+        self._elapses = {}
+        self._warmup = False
 
-        summary_head = [
-            "Stage",
-            "Total Time (ms)",
-            "Number of Instances",
-            "Avg Time Per Instance (ms)",
-        ]
-        table = PrettyTable(summary_head)
-        table.add_rows(
-            [
-                (name, f"{total:.8f}", cnts, f"{avg:.8f}")
-                for name, total, cnts, avg in summary
-            ]
-        )
-        logging.info(table)
-
-        if INFER_BENCHMARK_OUTPUT:
-            save_dir = Path(INFER_BENCHMARK_OUTPUT)
-            save_dir.mkdir(parents=True, exist_ok=True)
-            csv_data = [detail_head, *detail]
-            with open(Path(save_dir) / "detail.csv", "w", newline="") as file:
-                writer = csv.writer(file)
-                writer.writerows(csv_data)
-
-            csv_data = [summary_head, *summary]
-            with open(Path(save_dir) / "summary.csv", "w", newline="") as file:
-                writer = csv.writer(file)
-                writer.writerows(csv_data)
-
-
-class Timer:
-    def __init__(self, component):
-        from ..new_models.base import BaseComponent
-
-        assert isinstance(component, BaseComponent)
-        benchmark.attach(component)
-        component.apply = self.watch_func(component.apply)
-        self._tic = None
-        self._elapses = []
-
-    def watch_func(self, func):
+    def timeit(self, func):
         @functools.wraps(func)
         def wrapper(*args, **kwargs):
+            if not self._enabled:
+                return func(*args, **kwargs)
+
+            name = func.__qualname__
+
             tic = time.time()
             output = func(*args, **kwargs)
             if isinstance(output, GeneratorType):
-                return self.watch_generator(output)
+                return self.watch_generator(output, name)
             else:
-                self._update(time.time() - tic)
+                self._update(time.time() - tic, name)
                 return output
 
         return wrapper
 
-    def watch_generator(self, generator):
+    def watch_generator(self, generator, name):
         @functools.wraps(generator)
         def wrapper():
-            while 1:
+            while True:
                 try:
                     tic = time.time()
                     item = next(generator)
-                    self._update(time.time() - tic)
+                    self._update(time.time() - tic, name)
                     yield item
                 except StopIteration:
                     break
@@ -212,15 +63,162 @@ class Timer:
         return wrapper()
 
     def reset(self):
-        self._tic = None
-        self._elapses = []
+        self._elapses = {}
 
-    def _update(self, elapse):
-        self._elapses.append(elapse)
+    def _update(self, elapse, name):
+        elapse = elapse * 1000
+        if name in self._elapses:
+            self._elapses[name].append(elapse)
+        else:
+            self._elapses[name] = [elapse]
 
     @property
     def logs(self):
         return self._elapses
 
+    def start_timing(self):
+        self._enabled = True
 
-benchmark = Benchmark() if INFER_BENCHMARK else None
+    def stop_timing(self):
+        self._enabled = False
+
+    def start_warmup(self):
+        self._warmup = True
+
+    def stop_warmup(self):
+        self._warmup = False
+        self.reset()
+
+    def gather(self, batch_size):
+        logs = {k.split(".")[0]: v for k, v in self.logs.items()}
+
+        iters = len(logs["Infer"])
+        instances = iters * batch_size
+        detail_list = []
+        summary = {"preprocess": 0, "inference": 0, "postprocess": 0}
+        op_tag = "preprocess"
+
+        for name, time_list in logs.items():
+            avg = np.mean(time_list)
+            detail_list.append(
+                (iters, batch_size, instances, name, avg, avg / batch_size)
+            )
+
+            if name in ["Copy2GPU", "Infer", "Copy2CPU"]:
+                summary["inference"] += avg
+                op_tag = "postprocess"
+            else:
+                summary[op_tag] += avg
+
+        summary["end2end"] = (
+            summary["preprocess"] + summary["inference"] + summary["postprocess"]
+        )
+        summary_list = [
+            (
+                iters,
+                batch_size,
+                instances,
+                "PreProcess",
+                summary["preprocess"],
+                summary["preprocess"] / batch_size,
+            ),
+            (
+                iters,
+                batch_size,
+                instances,
+                "Inference",
+                summary["inference"],
+                summary["inference"] / batch_size,
+            ),
+            (
+                iters,
+                batch_size,
+                instances,
+                "PostProcess",
+                summary["postprocess"],
+                summary["postprocess"] / batch_size,
+            ),
+            (
+                iters,
+                batch_size,
+                instances,
+                "End2End",
+                summary["end2end"],
+                summary["end2end"] / batch_size,
+            ),
+        ]
+
+        return detail_list, summary_list
+
+    def collect(self, batch_size):
+        detail_list, summary_list = self.gather(batch_size)
+
+        if self._warmup:
+            summary_head = [
+                "Iters",
+                "Batch Size",
+                "Instances",
+                "Stage",
+                "Avg Time Per Iter (ms)",
+                "Avg Time Per Instance (ms)",
+            ]
+            table = PrettyTable(summary_head)
+            summary_list = [
+                i[:4] + (f"{i[4]:.8f}", f"{i[5]:.8f}") for i in summary_list
+            ]
+            table.add_rows(summary_list)
+            header = "WarmUp Data".center(len(str(table).split("\n")[0]), " ")
+            logging.info(header)
+            logging.info(table)
+
+        else:
+            detail_head = [
+                "Iters",
+                "Batch Size",
+                "Instances",
+                "Operation",
+                "Avg Time Per Iter (ms)",
+                "Avg Time Per Instance (ms)",
+            ]
+            table = PrettyTable(detail_head)
+            detail_list = [i[:4] + (f"{i[4]:.8f}", f"{i[5]:.8f}") for i in detail_list]
+            table.add_rows(detail_list)
+            header = "Detail Data".center(len(str(table).split("\n")[0]), " ")
+            logging.info(header)
+            logging.info(table)
+
+            summary_head = [
+                "Iters",
+                "Batch Size",
+                "Instances",
+                "Stage",
+                "Avg Time Per Iter (ms)",
+                "Avg Time Per Instance (ms)",
+            ]
+            table = PrettyTable(summary_head)
+            summary_list = [
+                i[:4] + (f"{i[4]:.8f}", f"{i[5]:.8f}") for i in summary_list
+            ]
+            table.add_rows(summary_list)
+            header = "Summary Data".center(len(str(table).split("\n")[0]), " ")
+            logging.info(header)
+            logging.info(table)
+
+            if INFER_BENCHMARK_OUTPUT:
+                save_dir = Path(INFER_BENCHMARK_OUTPUT)
+                save_dir.mkdir(parents=True, exist_ok=True)
+                csv_data = [detail_head, *detail_list]
+                with open(Path(save_dir) / "detail.csv", "w", newline="") as file:
+                    writer = csv.writer(file)
+                    writer.writerows(csv_data)
+
+                csv_data = [summary_head, *summary_list]
+                with open(Path(save_dir) / "summary.csv", "w", newline="") as file:
+                    writer = csv.writer(file)
+                    writer.writerows(csv_data)
+
+
+if INFER_BENCHMARK:
+    benchmark = Benchmark(enabled=True)
+else:
+    benchmark = Benchmark(enabled=False)

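Taken together, the rewritten `Benchmark` replaces the old per-component `Timer` with a single registry keyed by `func.__qualname__`, and `gather()` attributes everything logged before `Copy2GPU`/`Infer`/`Copy2CPU` to the preprocess stage and everything after to postprocess. Below is a minimal sketch of that flow, assuming the decorator is applied per method the way the touched `processors.py` and `static_infer.py` files suggest; `ResizeImage` is a hypothetical stand-in, and `Infer` is named deliberately so `gather()` counts it as the inference stage:

```python
# Hedged sketch, not code from this commit: exercising the new
# decorator-based Benchmark. Only the names Copy2GPU / Infer / Copy2CPU
# are special-cased by gather() as the "inference" stage.
import time

from paddlex.inference.utils.benchmark import benchmark


class ResizeImage:  # hypothetical preprocess op
    @benchmark.timeit
    def __call__(self, batch):
        time.sleep(0.002)  # stand-in for real preprocessing work
        return batch


class Infer:  # the leading qualname segment decides stage attribution
    @benchmark.timeit
    def __call__(self, batch):
        time.sleep(0.005)  # stand-in for the forward pass
        return batch


resize, infer = ResizeImage(), Infer()
benchmark.start_timing()  # enables recording even without the env var
for _ in range(10):
    infer(resize([None] * 4))  # batch_size = 4

# logs are keyed by __qualname__ ("ResizeImage.__call__", ...);
# gather() keeps the part before the first "." and takes the
# iteration count from the "Infer" entry.
detail, summary = benchmark.gather(batch_size=4)
for row in summary:
    print(row)  # (iters, batch_size, instances, stage, ms/iter, ms/instance)
```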
+ 1 - 5
paddlex/utils/flags.py

@@ -24,7 +24,6 @@ __all__ = [
     "INFER_BENCHMARK_ITER",
     "INFER_BENCHMARK_WARMUP",
     "INFER_BENCHMARK_OUTPUT",
-    "INFER_BENCHMARK_DATA_SIZE",
     "FLAGS_json_format_model",
     "USE_PIR_TRT",
     "DISABLE_DEV_MODEL_WL",
@@ -59,7 +58,4 @@ INFER_BENCHMARK_WARMUP = get_flag_from_env_var(
 INFER_BENCHMARK_OUTPUT = get_flag_from_env_var(
     "PADDLE_PDX_INFER_BENCHMARK_OUTPUT", None
 )
-INFER_BENCHMARK_ITER = get_flag_from_env_var("PADDLE_PDX_INFER_BENCHMARK_ITER", 10, int)
-INFER_BENCHMARK_DATA_SIZE = get_flag_from_env_var(
-    "PADDLE_PDX_INFER_BENCHMARK_DATA_SIZE", 1024
-)
+INFER_BENCHMARK_ITER = get_flag_from_env_var("PADDLE_PDX_INFER_BENCHMARK_ITER", 0, int)
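With the random-data path gone (`INFER_BENCHMARK_DATA_SIZE` removed and `INFER_BENCHMARK_ITER` now defaulting to `0`), a benchmark run is driven by real inputs and configured purely through the environment. A hedged end-to-end sketch, assuming the usual `create_model` entry point; the model name and input file are placeholders, not values from this commit:

```python
# Hedged usage sketch: the flags in paddlex/utils/flags.py are read at
# module level, so the variables must be set before paddlex is imported.
import os

os.environ["PADDLE_PDX_INFER_BENCHMARK"] = "True"
os.environ["PADDLE_PDX_INFER_BENCHMARK_WARMUP"] = "5"  # 5 warm-up iterations
os.environ["PADDLE_PDX_INFER_BENCHMARK_OUTPUT"] = "./benchmark"  # writes detail.csv / summary.csv

from paddlex import create_model  # imported after the env vars on purpose

model = create_model("PP-LCNet_x1_0")  # placeholder model name
output = list(model.predict("test.png", batch_size=8))  # placeholder input
```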