Hi!
To combine the Swin Transformer backbone with the Deformable DETR detector, SOLQ made some changes to swin_transformer.py that compute the padding mask dynamically and allow arbitrary-sized input images (I think this is supported only with relative positional encoding).
Similar edits were made by your colleagues in https://github.com/SwinTransformer/Swin-Transformer-Object-Detection/blob/master/mmdet/models/backbones/swin_transformer.py
If this interests you, maybe you could import those edits from SOLQ / Swin-Transformer-Object-Detection or implement similar ones. This would make it simpler to experiment with SimMIM checkpoints and backbone code in an object detection context, and to make sure that checkpoints load correctly.
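
To make the request more concrete, here is a rough, framework-free sketch of the shape handling I have in mind. It is not taken from SOLQ or Swin-Transformer-Object-Detection. The function names (`pad_amount`, `stage_shapes`, `downsample_mask`) are hypothetical, and the defaults `patch_size=4`, `window_size=7`, and four stages are assumptions based on a typical Swin configuration. It only computes how much padding an arbitrary input would need at each stage, and how a DETR-style padding mask (True = padded pixel) could be resized to a feature-map grid with nearest-neighbour sampling.

```python
def pad_amount(size, multiple):
    """Extra cells needed so that `size` becomes divisible by `multiple`."""
    return (multiple - size % multiple) % multiple


def stage_shapes(height, width, patch_size=4, window_size=7, num_stages=4):
    """Per stage, return ((H, W) token grid, (pad_h, pad_w) needed for windowing).

    Assumed defaults only; a real backbone config may differ.
    """
    # Patch embedding: pad the image up to a multiple of patch_size, then split.
    h = (height + pad_amount(height, patch_size)) // patch_size
    w = (width + pad_amount(width, patch_size)) // patch_size
    shapes = []
    for _ in range(num_stages):
        # Window attention needs the token grid divisible by window_size.
        pads = (pad_amount(h, window_size), pad_amount(w, window_size))
        shapes.append(((h, w), pads))
        # Patch merging halves the grid; odd sizes are padded by one first.
        h, w = (h + 1) // 2, (w + 1) // 2
    return shapes


def downsample_mask(mask, out_h, out_w):
    """Nearest-neighbour resize of a 2D boolean padding mask (True = padded)."""
    in_h, in_w = len(mask), len(mask[0])
    return [
        [mask[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
        for i in range(out_h)
    ]


if __name__ == "__main__":
    # An input whose width is not a multiple of the patch size.
    for grid, pads in stage_shapes(800, 1333):
        print("token grid", grid, "window padding", pads)
```

In an actual backbone the padding would be applied to tensors inside the patch embedding and attention blocks, and the mask would be resized once per output feature map. The sketch only shows the arithmetic that makes arbitrary input sizes work.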