ViTDet object detection #7630

Open
@hgaiser

🚀 The feature

ViTDet achieves strong results on COCO and, given that ViT is already implemented in torchvision, adding it seems relatively straightforward.
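To make the "relatively straightforward" part a bit more concrete: the main piece a plain ViT lacks for detection is a multi-scale feature pyramid, which ViTDet builds from the backbone's single stride-16 feature map. Below is a minimal sketch of the resulting feature map shapes only, not an implementation. The function name, the `PyramidLevel` class, the `p2`..`p5` labels and the default scale factors are my own assumptions for illustration (the factors mirror the strides 4 to 32 described in the ViTDet paper); none of this is existing torchvision API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PyramidLevel:
    name: str    # hypothetical label, borrowing the common "p2".."p5" naming
    stride: int  # downsampling factor relative to the input image
    height: int
    width: int


def simple_pyramid_shapes(image_size, patch_size=16, scale_factors=(4.0, 2.0, 1.0, 0.5)):
    """Return the spatial shape of each pyramid level derived from one ViT feature map.

    A plain ViT produces a single map at stride `patch_size`. Each scale factor
    rescales that map: a factor above 1 upsamples it (e.g. with a transposed
    convolution), a factor below 1 downsamples it (e.g. with pooling), and 1.0
    keeps it unchanged. The default factors are an assumption, not a fixed API.
    """
    height, width = image_size
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")

    # Size of the single feature map produced by the ViT backbone.
    base_h, base_w = height // patch_size, width // patch_size

    levels = []
    for index, factor in enumerate(scale_factors):
        levels.append(
            PyramidLevel(
                name=f"p{index + 2}",
                stride=int(patch_size / factor),
                height=int(base_h * factor),
                width=int(base_w * factor),
            )
        )
    return levels


if __name__ == "__main__":
    for level in simple_pyramid_shapes((1024, 1024)):
        print(level)
```

For a 1024x1024 input and 16x16 patches this gives four levels at strides 4, 8, 16 and 32, which is the same set of strides the existing FPN-based detectors in torchvision consume, so the detection heads could in principle be reused.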

Motivation, pitch

The best-performing object detection network in torchvision is currently Faster R-CNN with a ResNet-50 backbone (46.7 mAP). ViTDet reports an mAP of 51.6 with a ViT-B backbone, 55.6 with ViT-L, and an impressive 56.7 with ViT-H. Similarly impressive results have been obtained for instance segmentation.

Alternatives

Detectron2 implements ViTDet. torchvision could decide not to provide its own implementation and instead redirect users who want ViTDet to Detectron2.

Additional context

Implementing ViTDet opens the door to other implementations, such as EVA-02, which achieves even better results than ViTDet.

I have previously implemented RetinaNet for torchvision (later merged in #2784). I might be interested in implementing ViTDet, but I would first like to see whether there is interest from the maintainers.
