ssdlite320_mobilenet_v3_large only has ~50% CUDA usage even with a large batch size

### 🐛 Describe the bug

```
import torch
from torchvision.models.detection import ssdlite320_mobilenet_v3_large

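with torch.inference_mode():
    # Load the pretrained model, switch to eval mode, and move it to the GPU.
    model = ssdlite320_mobilenet_v3_large(pretrained=True)
    model = model.eval()
    model = model.to('cuda')
    # Batch of 256 random 320x320 RGB images with values scaled to [0, 1).
    inputs = (torch.randint(0, 255, (256, 3, 320, 320)) / 255).to('cuda')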
    for _ in range(64):
        outputs = model(inputs)
```

```
nvidia-smi
```

According to `nvidia-smi`, GPU utilization is around 36% during inference, while one CPU core is at 100%.
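For anyone trying to reproduce the number, here is a minimal sketch of sampling GPU utilization programmatically instead of reading it off the `nvidia-smi` screen. The helper names (`parse_utilization`, `sample_utilization`, `mean_per_gpu`) are hypothetical, and this is not necessarily how the figure above was measured. It assumes `nvidia-smi` is on the `PATH` and reports a numeric utilization for every GPU.

```python
import subprocess
import time


def parse_utilization(csv_text):
    """Parse the output of
    `nvidia-smi --query-gpu=utilization.gpu --format=csv,noheader,nounits`.

    Returns one integer percentage per GPU, in the order nvidia-smi lists them.
    Assumes every non-empty line is a plain integer.
    """
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def sample_utilization(samples=10, interval_s=0.5):
    """Poll nvidia-smi `samples` times; return a list of per-GPU readings."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=utilization.gpu",
        "--format=csv,noheader,nounits",
    ]
    readings = []
    for _ in range(samples):
        out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        readings.append(parse_utilization(out))
        time.sleep(interval_s)
    return readings


def mean_per_gpu(readings):
    """Average the sampled readings for each GPU index."""
    return [sum(column) / len(column) for column in zip(*readings)]
```

Calling `mean_per_gpu(sample_utilization())` from a second terminal while the inference loop is running gives an average per GPU over a few seconds, which is less noisy than a single reading.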

### Versions
```
Collecting environment information...
PyTorch version: 1.10.0+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.21.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Sep 28 2021, 16:10:42)  [GCC 9.3.0] (64-bit runtime)
Python platform: Linux-5.4.0-89-generic-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 10.1.243
GPU models and configuration: 
GPU 0: NVIDIA GeForce GTX 1080 Ti
GPU 1: NVIDIA GeForce GTX 1080 Ti

Nvidia driver version: 495.29.05
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.3.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.3.0
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.3
[pip3] torch==1.10.0+cu113
[pip3] torchvision==0.11.1+cu113
[conda] Could not collect
```

cc @datumbox
