Description
🚀 The feature
Note: To track the progress of the project check out this board.
This is the 2nd phase of TorchVision's modernization project (see phase 1). We aim to keep TorchVision relevant by ensuring it provides, off the shelf, all the necessary primitives, model architectures and recipe utilities to produce SOTA results for the supported Computer Vision tasks.
1. New Primitives
To enable our users to reproduce the latest state-of-the-art research, we will enhance TorchVision with the following data augmentations, layers, losses and other operators (a brief usage sketch follows each list):
Data Augmentations
- Augmix - Adding AugMix implementation #5411
- Large Scale Jitter - Adding Scale Jitter transform for detection #5435 Fix bbox scaling estimation for Large Scale Jitter #5446 Make ScaleJitter proportional #5559
- Fixed Size Crop - Adding FixedSizeCrop transform #5607
- Random Shortest Size - Adding RandomShortestSize transform #5610
- Simple CopyPaste - Add SimpleCopyPaste augmentation #5825
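As an illustration, here is a minimal sketch composing the new AugMix transform (#5411) into a standard classification pipeline; exact defaults are release-dependent, and the detection transforms above (ScaleJitter, FixedSizeCrop, RandomShortestSize, SimpleCopyPaste) plug into detection pipelines analogously:

```python
# Minimal sketch: AugMix (#5411) in a classification pipeline.
# Expects a PIL image as input; severity/mixture_width follow the
# AugMix paper's defaults.
import torch
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.AugMix(severity=3, mixture_width=3),
    transforms.PILToTensor(),
    transforms.ConvertImageDtype(torch.float),
])
```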
Layers
- DropBlock - New Feature: add DropBlock layer #5416
- Conv3dNormActivation - Add Conv2dNormActivation and Conv3dNormActivation Blocks #5445
- MLP - Adding multi-layer perceptron in ops #6053
- Permute - Move Permute layer to ops #6055
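A minimal sketch of how these building blocks combine, assuming they are exposed under torchvision.ops as in the PRs above (defaults may differ across releases):

```python
# Minimal sketch: chaining the new ops-level layers.
import torch
from torchvision.ops import MLP, Conv2dNormActivation, DropBlock2d, Permute

block = torch.nn.Sequential(
    Conv2dNormActivation(3, 64, kernel_size=3),  # conv + BatchNorm + ReLU
    DropBlock2d(p=0.1, block_size=7),            # structured dropout for feature maps
    Permute([0, 2, 3, 1]),                       # NCHW -> NHWC so the MLP acts on channels
    MLP(64, [128, 64]),                          # per-position multi-layer perceptron
)
out = block(torch.rand(2, 3, 32, 32))            # -> shape (2, 32, 32, 64)
```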
Losses
- Generalized-IoU loss - add FCOS #4961
- Distance-IoU & Complete-IoU loss - Added CIOU loss function #5776 Distance IoU #5786 Adding ciou and diou support in `_box_loss()` #5984
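These land as functional losses in torchvision.ops; a minimal sketch, with boxes in (x1, y1, x2, y2) format:

```python
# Minimal sketch: the IoU-based box regression losses.
import torch
from torchvision.ops import (
    complete_box_iou_loss,
    distance_box_iou_loss,
    generalized_box_iou_loss,
)

pred = torch.tensor([[0.0, 0.0, 10.0, 10.0]], requires_grad=True)
target = torch.tensor([[2.0, 2.0, 12.0, 12.0]])

giou = generalized_box_iou_loss(pred, target, reduction="mean")
diou = distance_box_iou_loss(pred, target, reduction="mean")
ciou = complete_box_iou_loss(pred, target, reduction="mean")
ciou.backward()  # all three are differentiable w.r.t. the predictions
```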
Operators added in PyTorch Core
- Better EMA support in `AveragedModel` - Remove state_dict from AveragedModel and use buffers instead pytorch#71763
- Add support of empty output in SyncBatchNorm - Fix SyncBatchNorm for empty inputs pytorch#74944
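For context, a minimal EMA sketch on top of torch.optim.swa_utils.AveragedModel; the decay value and loop are illustrative, and use_buffers (available in newer PyTorch releases) also averages buffers such as BatchNorm statistics:

```python
# Minimal sketch: EMA via PyTorch Core's AveragedModel.
import torch
from torch.optim.swa_utils import AveragedModel

model = torch.nn.Linear(10, 2)
decay = 0.999  # illustrative EMA decay

ema = AveragedModel(
    model,
    avg_fn=lambda avg, new, num_averaged: decay * avg + (1.0 - decay) * new,
    use_buffers=True,  # average buffers (e.g. BatchNorm stats) too
)

for _ in range(10):  # training loop, sketched
    # ... optimizer step on `model` ...
    ema.update_parameters(model)
```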
2. New Architectures & Model Iterations
To ensure that our users have access to the most popular SOTA models, we will add the following architectures along with pre-trained weights. Moreover, we will improve existing architectures with commonly adopted optimizations introduced in follow-up research (usage sketches follow the lists below):
Image Classification
- ConvNeXt - Adding ConvNeXt architecture in prototype #5197 Adding more ConvNeXt variants + Speed optimizations #5253 Graduate ConvNeXt to main TorchVision area #5330
- EfficientNetV2 - Adding EfficientNetV2 architecture #5450
- Swin Transformer - Adding Swin Transformer architecture #5491 add swin_s and swin_b variants and improved swin_t #6048
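With the multi-weight API these are instantiated as below (a minimal sketch; enum member names can vary across releases):

```python
# Minimal sketch: loading the new classification models with pretrained weights.
from torchvision.models import (
    ConvNeXt_Tiny_Weights,
    EfficientNet_V2_S_Weights,
    Swin_T_Weights,
    convnext_tiny,
    efficientnet_v2_s,
    swin_t,
)

models = [
    convnext_tiny(weights=ConvNeXt_Tiny_Weights.IMAGENET1K_V1),
    efficientnet_v2_s(weights=EfficientNet_V2_S_Weights.IMAGENET1K_V1),
    swin_t(weights=Swin_T_Weights.IMAGENET1K_V1),
]
```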
Object Detection & Segmentation
- FCOS - add FCOS #4961
- Post-paper optimizations for RetinaNet, FasterRCNN & MaskRCNN Post-paper Detection Optimizations #5444
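A minimal sketch for the new FCOS detector, assuming COCO weights are available per #4961:

```python
# Minimal sketch: inference with the new FCOS detector.
import torch
from torchvision.models.detection import FCOS_ResNet50_FPN_Weights, fcos_resnet50_fpn

model = fcos_resnet50_fpn(weights=FCOS_ResNet50_FPN_Weights.COCO_V1).eval()
with torch.inference_mode():
    predictions = model([torch.rand(3, 480, 640)])  # list of dicts: boxes, labels, scores
```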
Video Classification
3. Improved Training Recipes & Pre-trained models
To ensure that our users have access to strong baselines and SOTA weights, we will improve our training recipes to incorporate the newly released primitives and offer improved pre-trained models (a sketch of the weights API follows the pre-trained weights list):
Reference Scripts
- Update EMA to use PyTorch Core's new implementation - Simplify EMA to use Pytorch's update_parameters #5469
- Add support of new Detection primitives in Reference Scripts - Detection recipe enhancements #5715
Pre-trained weights
- Improve the accuracy of Classification models - Adding improved MobileNetV2 weights #5560 Add shufflenetv2 1.5 and 2.0 weights #5906 Adding resnext101 64x4d model #5935 Add weight for mnasnet0_75 and mnasnet1_3 #6019
- Close the gap with SOTA for Object Detection & Segmentation models - Add RetinaNet improved weights #5756 Add FasterRCNN improved weights #5763 Add MaskRCNN improved weights #5773
- Add weakly-supervised weights for ViT and RegNets - Add SWAG Vision Transformer Weight #5714 Add regnet model from SWAG #5722 Add regnet_y_128gf from SWAG #5732 Adding the huge vision transformer from SWAG #5721 Add SWAG model weight that only the linear head is finetuned to ImageNet1K #5793
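A minimal sketch of how the weights enums separate the original and improved recipes, plus the SWAG weights (member names are release-dependent):

```python
# Minimal sketch: choosing between original (V1), improved (V2) and SWAG weights.
from torchvision.models import (
    MobileNet_V2_Weights,
    ViT_H_14_Weights,
    mobilenet_v2,
    vit_h_14,
)

baseline = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V1)
improved = mobilenet_v2(weights=MobileNet_V2_Weights.IMAGENET1K_V2)  # recipe from #5560
swag = vit_h_14(weights=ViT_H_14_Weights.IMAGENET1K_SWAG_E2E_V1)     # weights from #5721
```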
Other Candidates
There are several other Operators (#5414), Losses (#2980), Augmentations (#3817) and Models (#2707) proposed by the community. Here are some potential candidates that we could implement depending on bandwidth. Contributions are welcome for any of the below:
- AutoAugment for Detection - Implement AutoAugment for Detection #6224
- Deformable DeTR
- Polynomial LR scheduler (upstream to Core)
- Shortcut Regularizer (FX-based)