Description
🚀 The feature
ViTDet achieves strong results on COCO and, given that ViT is already implemented in torchvision, adding ViTDet seems relatively straightforward.
Motivation, pitch
The best-performing object detection network in torchvision is currently Faster R-CNN with a ResNet-50 backbone (46.7 mAP). ViTDet reports an mAP of 51.6 with a ViT-B backbone, 55.6 with ViT-L, and an impressive 56.7 with ViT-H. Similarly impressive results have been reported for instance segmentation.
Alternatives
Detectron2 already implements ViTDet. torchvision could decide not to provide its own implementation and instead redirect users who want to use ViTDet to Detectron2.
Additional context
Implementing ViTDet opens the door to other implementations, such as EVA-02, which achieves even better results than ViTDet.
I previously implemented RetinaNet for torchvision (later merged in #2784). I might be interested in implementing ViTDet, but I would first like to see whether there is interest from the maintainers.