| Model | Mask AP | GPU Inference Time (ms) | CPU Inference Time (ms) | Model Size (M) | Description |
|---|---|---|---|---|---|
| Cascade-MaskRCNN-ResNet50-FPN | 36.3 | - | - | 254.8 | Cascade-MaskRCNN is an improved Mask R-CNN instance segmentation model that cascades multiple detectors with different IoU thresholds, addressing the mismatch between the training and inference stages and thereby improving instance segmentation accuracy. |
| Cascade-MaskRCNN-ResNet50-vd-SSLDv2-FPN | 39.1 | - | - | 254.7 | Same as Cascade-MaskRCNN-ResNet50-FPN. |
| Mask-RT-DETR-H | 50.6 | 132.693 | 4896.17 | 449.9 | Mask-RT-DETR is an instance segmentation model based on RT-DETR. It adopts the high-performance PP-HGNetV2 as its backbone, builds a MaskHybridEncoder, and introduces IoU-aware Query Selection, achieving state-of-the-art (SOTA) instance segmentation accuracy for a given inference time. |
| Mask-RT-DETR-L | 45.7 | 46.5059 | 2575.92 | 113.6 | Same as Mask-RT-DETR-H. |
| Mask-RT-DETR-M | 42.7 | 36.8329 | - | 66.6 | Same as Mask-RT-DETR-H. |
| Mask-RT-DETR-S | 41.0 | 33.5007 | - | 51.8 | Same as Mask-RT-DETR-H. |
| Mask-RT-DETR-X | 47.5 | 75.755 | 3358.04 | 237.5 | Same as Mask-RT-DETR-H. |
| MaskRCNN-ResNet50-FPN | 35.6 | - | - | 157.5 | Mask R-CNN is a multi-task deep learning model from Facebook AI Research (FAIR) that performs object classification and localization in a single model and adds per-instance masks to complete segmentation tasks. |
| MaskRCNN-ResNet50-vd-FPN | 36.4 | - | - | 157.5 | Same as MaskRCNN-ResNet50-FPN. |
| MaskRCNN-ResNet50 | 32.8 | - | - | 128.7 | Same as MaskRCNN-ResNet50-FPN. |
| MaskRCNN-ResNet101-FPN | 36.6 | - | - | 225.4 | Same as MaskRCNN-ResNet50-FPN. |
| MaskRCNN-ResNet101-vd-FPN | 38.1 | - | - | 225.1 | Same as MaskRCNN-ResNet50-FPN. |
| MaskRCNN-ResNeXt101-vd-FPN | 39.5 | - | - | 370.0 | Same as MaskRCNN-ResNet50-FPN. |
| PP-YOLOE_seg-S | 32.5 | - | - | 31.5 | PP-YOLOE_seg is an instance segmentation model based on PP-YOLOE. It inherits PP-YOLOE's backbone and head and adds a dedicated instance segmentation head, significantly improving instance segmentation performance and inference speed. |
| SOLOv2 | 35.5 | - | - | 179.1 | SOLOv2 is a real-time instance segmentation algorithm that segments objects by location. It is an improved version of SOLO that achieves a good balance between accuracy and speed by introducing mask learning and mask NMS. |
**Note: The accuracy metrics above are Mask AP on the [COCO2017](https://cocodataset.org/#home) validation set. All GPU inference times were measured on an NVIDIA Tesla T4 machine with FP32 precision. CPU inference times were measured on an Intel(R) Xeon(R) Gold 5117 CPU @ 2.00GHz with 8 threads and FP32 precision.**
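
One way to read the table is as a trade-off between accuracy, latency, and model size. The sketch below is illustrative only: it copies a subset of the rows above into a small Python list and picks the entry with the highest Mask AP that fits a GPU latency budget and an optional size budget. The names `ModelEntry`, `MODELS`, and `pick_model` are hypothetical helpers written for this example and are not part of any library; rows whose GPU time is listed as `-` are stored as `None` and skipped when a latency budget is applied.

```python
from typing import NamedTuple, Optional


class ModelEntry(NamedTuple):
    name: str
    mask_ap: float            # Mask AP on the COCO2017 validation set
    gpu_ms: Optional[float]   # GPU inference time in ms; None where the table shows "-"
    size_m: float             # model size, in the unit used by the table (M)


# Values copied from the table above (a subset, for brevity).
MODELS = [
    ModelEntry("Mask-RT-DETR-H", 50.6, 132.693, 449.9),
    ModelEntry("Mask-RT-DETR-X", 47.5, 75.755, 237.5),
    ModelEntry("Mask-RT-DETR-L", 45.7, 46.5059, 113.6),
    ModelEntry("Mask-RT-DETR-M", 42.7, 36.8329, 66.6),
    ModelEntry("Mask-RT-DETR-S", 41.0, 33.5007, 51.8),
    ModelEntry("PP-YOLOE_seg-S", 32.5, None, 31.5),
    ModelEntry("SOLOv2", 35.5, None, 179.1),
]


def pick_model(max_gpu_ms: float,
               max_size_m: float = float("inf")) -> Optional[ModelEntry]:
    """Return the highest-Mask-AP entry within both budgets, or None if nothing fits."""
    candidates = [
        m for m in MODELS
        if m.gpu_ms is not None       # skip rows without a measured GPU time
        and m.gpu_ms <= max_gpu_ms
        and m.size_m <= max_size_m
    ]
    return max(candidates, key=lambda m: m.mask_ap, default=None)


if __name__ == "__main__":
    best = pick_model(max_gpu_ms=50.0)
    print(best.name if best else "no model fits the budget")
```

With a 50 ms GPU budget this selects Mask-RT-DETR-L. Adding a size limit of 60 narrows the choice to Mask-RT-DETR-S. The timings only apply to the hardware and precision described in the note, so budgets for other devices would need their own measurements.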