In an object detection context, it is useful to support arbitrary input sizes by computing the attention mask dynamically from the actual input size (this is probably only possible with relative position encoding).
These kinds of edits were made in https://github.com/SwinTransformer/Swin-Transformer-Object-Detection and in https://github.com/megvii-research/SOLQ/. It would be nice if you upstreamed these changes. That would simplify trying out EsViT checkpoints as pretraining for object detection.
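
For reference, here is a minimal sketch of what I mean by dynamic mask computation. It is written in NumPy rather than PyTorch so that it runs standalone, and the function name `compute_attn_mask` and its arguments are my own placeholders, not the API of either repo. It assumes the feature map is padded up to a multiple of the window size and that `0 < shift_size < window_size`:

```python
import numpy as np


def compute_attn_mask(height, width, window_size, shift_size):
    """Build a shifted-window attention mask for an arbitrary feature map size.

    Hypothetical helper for illustration only. Returns an array of shape
    (num_windows, window_size**2, window_size**2) holding 0.0 where two
    positions may attend to each other and -100.0 where they may not.
    """
    if not 0 < shift_size < window_size:
        raise ValueError("expected 0 < shift_size < window_size")

    # Pad the spatial size up to the next multiple of the window size.
    padded_h = -(-height // window_size) * window_size
    padded_w = -(-width // window_size) * window_size

    # Label each position with the id of the region it belongs to
    # after the cyclic shift (3 bands per axis, so up to 9 regions).
    region_ids = np.zeros((padded_h, padded_w), dtype=np.int64)
    bands = (
        slice(0, -window_size),
        slice(-window_size, -shift_size),
        slice(-shift_size, None),
    )
    region = 0
    for h_band in bands:
        for w_band in bands:
            region_ids[h_band, w_band] = region
            region += 1

    # Partition the label map into non-overlapping windows:
    # (num_windows, window_size * window_size).
    windows = (
        region_ids.reshape(
            padded_h // window_size, window_size,
            padded_w // window_size, window_size,
        )
        .transpose(0, 2, 1, 3)
        .reshape(-1, window_size * window_size)
    )

    # Positions from different regions must not attend to each other.
    diff = windows[:, None, :] - windows[:, :, None]
    return np.where(diff != 0, -100.0, 0.0)


# The mask is recomputed per input size instead of being fixed at init time.
mask = compute_attn_mask(10, 13, window_size=7, shift_size=3)
```

The point is that nothing here depends on a resolution fixed at construction time, so the same weights can be applied to inputs of any size.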
Also, FYI, I created a similar issue in SimMIM: microsoft/SimMIM#13. Overall, having a stable version of swin_transformer.py somewhere (maybe even in the main SwinTransformer/Swin-Transformer repo?) that supports dynamic masking would help a lot :)
Thanks!