Add model and architecture for Omnivore model #43
Conversation
- some tests are failing on CI, can you take a look
- I left some comments around passing in instantiated modules, but realized you are following torchvision style, so if we plan to upstream this it's okay to leave them as is
from torchvision.ops.stochastic_depth import StochasticDepth


def _compute_pad_size_3d(
Why do you need this and the next function to be global?
For _compute_pad_size_3d: it is needed by 2 different classes, so we put it at module level but keep it private.
For _compute_attention_mask_3d: we made it a standalone function so we can cache it. We don't put it inside the function shifted_window_attention_3d because we avoid nested functions, as they are not supported by JIT script.
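For illustration, a minimal sketch of what such a module-level private helper might look like (the exact signature and body in this PR may differ):

```python
from typing import Tuple

# Hypothetical sketch of a module-level, private padding helper: compute how
# much to pad each of (D, H, W) so it becomes divisible by the window/patch
# size. Not necessarily the PR's exact implementation.
def _compute_pad_size_3d(
    size_dhw: Tuple[int, int, int], patch_size_dhw: Tuple[int, int, int]
) -> Tuple[int, int, int]:
    pad_d, pad_h, pad_w = (
        (patch - size % patch) % patch
        for size, patch in zip(size_dhw, patch_size_dhw)
    )
    return pad_d, pad_h, pad_w
```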
image = torch.randn(1, 3, 1, 112, 112)  # B C D H W
image_score = model(image, input_type="image")
self.assertEqual(image_score.size(), torch.Size((1, 1000)))
Please use our test utility to assert on values and shapes (we need to assert values too).
Thanks for the suggestion! Will do.
Use the utility for assertion:
Line 74 in e857541
def assert_expected(
Ah okay, will change self.assertEqual to assert_expected then.
@langong347 after reading the utility assert_expected, it seems to compare two tensors of float type. In this case, I think assertEqual is better for comparing the size, since it is an integer?
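As an illustration of asserting on both shape and values, here is a minimal sketch using torch.testing.assert_close as a stand-in for the repo's assert_expected utility (the fixture path and tolerances are assumptions):

```python
import torch

def check_image_output(model):
    image = torch.randn(1, 3, 1, 112, 112)  # B C D H W
    image_score = model(image, input_type="image")
    # Exact, integer-based shape check.
    assert image_score.size() == torch.Size((1, 1000))
    # Tolerance-based value check against a stored expected tensor
    # (hypothetical fixture file).
    expected = torch.load("expected_image_score.pt")
    torch.testing.assert_close(image_score, expected, rtol=1e-5, atol=1e-5)
```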
Q: how did you make sure that the order of pretrained model keys is the same as that of your model keys?
Sometimes this can be a problem, especially if we have a branching operation. For instance in Omnivore, the head that is used branches on the input type. However, from what I understand, if there is no branching (like the swin transformer encoder only; in fact the implementation here is different from the original one: https://github.com/facebookresearch/omnivore/blob/main/models/swin_transformer_3d.py), the layout will usually follow the same order in which the input is processed, and in this case it should be okay. One problem that may occur is that we might store the same data with different dimensions, like 100 x 100 (2D) versus 10000 (1D); in that case we could modify our model a bit to follow how the original one stores the data.
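A small sketch of the kind of sanity check being described, comparing the ordered parameters of the converted checkpoint against the new model (names are illustrative, not from this PR):

```python
from torch import nn

# Illustrative helper: walk both state dicts in order and flag any shape
# mismatch. Keys may be renamed after conversion, but if there is no
# branching the order and shapes should line up one-to-one.
def compare_state_dict_layout(model: nn.Module, pretrained_sd: dict) -> None:
    for (k_new, v_new), (k_old, v_old) in zip(
        model.state_dict().items(), pretrained_sd.items()
    ):
        if v_new.shape != v_old.shape:
            print(f"mismatch: {k_new} {tuple(v_new.shape)} vs {k_old} {tuple(v_old.shape)}")
```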
Adding here some notes based on an offline discussion I had with @YosuaMichael today.
# LICENSE file in the root directory of this source tree.

# Modified from 2d Swin Transformers in torchvision:
# https://github.com/pytorch/vision/blob/main/torchvision/models/swin_transformer.py
@YosuaMichael Copy-pasting the Swin implementation and making it 3D looks suboptimal. We should examine the possibility of adding native 3d support to TorchVision's implementation, or refactor to make more components shareable.
Hi @datumbox, I plan to upstream the SwinTransformer3d to video_classification in torchvision after finishing Omnivore first. Do you think this plan sounds okay? Or maybe it is better to put this in torchvision first?
@YosuaMichael I recommend taking the time to create a plan for this and similar models. Not only will it deliver better code quality, it will also sketch out how we will handle supporting multimodal more effectively in the future. Copy-pasting is a shortcut that kicks the problem down the line and only makes it harder to solve.
@ankitade @langong347 Sharing here what we discussed offline with @YosuaMichael and @kartikayk.
The adaptation of Swin 2d to 3d highlighted some things we can improve in the original TorchVision implementation. One of them is the use of single integers instead of tuples/lists for the sizes; another is the fact that some modules can easily be reused if we make minor adaptations (candidates are PatchMerging and SwinTransformerBlock). It would be really nice if we could make these changes in TorchVision's code now to minimize the degree of copy-pasting. The reason for the urgency is the upcoming release and the fact that Swin is a brand new class which can easily be modified without worrying about BC now. After the release, there will be all sorts of BC considerations.
The timing is tight and we definitely don't want to delay you, so we will just give it 1-2 days max to see if that's possible. If we are successful in refactoring, we will be able to simplify this implementation and reduce the copy-paste. If not, we will merge this and review improvements to TorchVision in the future.
Just to update here: the change to the torchvision 2d swin transformer is at pytorch/vision#6088
self.heads = heads
self.input_types = set(heads.keys())

def forward(self, x: torch.Tensor, input_type: str):
The idiom of passing the input_type as a string and then choosing which head to forward the data to is not very common. Why can't we instantiate Omnivore modules with the same encoder but different heads, and push to the training loop the decision of which module to use depending on the input? This would move the complexity from the nn.Module to the loop, and it would be beneficial for use-cases where the loss is different for each head.
Speaking offline with @YosuaMichael, there are some concerns about whether that whole approach will work well in a distributed setup.
Hi @datumbox, currently I plan to try doing the training first and see if there are any particular problems with the architectures (whether the current one or multiple models with a shared encoder). In particular, I need to check the behaviour of having 2 models A and B that share the same encoder, wrapped individually into DDP(A) and DDP(B).
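A rough sketch of the setup being described, assuming a generic encoder and two illustrative heads (not the PR's actual modules):

```python
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP

def build_shared_encoder_ddp_pair():
    # Assumes torch.distributed.init_process_group(...) has already been
    # called by the launcher; all module shapes here are illustrative.
    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 112 * 112, 512))
    model_a = nn.Sequential(encoder, nn.Linear(512, 1000))  # e.g. image head
    model_b = nn.Sequential(encoder, nn.Linear(512, 400))   # e.g. video head
    # Each model is wrapped separately; whether gradients on the shared
    # encoder stay in sync in this configuration is what needs verifying.
    return DDP(model_a), DDP(model_b)
```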
This is an interesting comment. Confirming: does the ckpt include all the heads?
@ankitade yes, the original checkpoint includes all the heads
@ankitade @YosuaMichael It's worth keeping in mind that this pattern won't support FX.
FX does tracing, which means the flow of execution of the model should not depend on the input. By branching on the string, FX won't know what to execute. Even though you might not be interested right now in making it FX traceable, in the future you might want to adopt FX quantization or other FX-based utils.
This is the reason my advice is to split this module into submodules depending on the head. It's a safer idiom that would be forward compatible with future core expansions. It might require some massaging of the original weights to fix, but I think that's straightforward to do and worth it.
Up to you. :)
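A hedged sketch of the suggested idiom, where each head gets its own small module sharing the encoder and the forward no longer branches on a string (class and attribute names are illustrative):

```python
import torch
from torch import nn

class OmnivoreForTask(nn.Module):
    # One wrapper per modality; the forward is branch-free, so FX tracing
    # and other graph-based tooling can handle it.
    def __init__(self, encoder: nn.Module, head: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.head = head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(x))

# Usage sketch: build one wrapper per head, all sharing the same encoder
# object, and let the training loop pick the wrapper for each batch.
# image_model = OmnivoreForTask(encoder, heads["image"])
# video_model = OmnivoreForTask(encoder, heads["video"])
```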
return x


class SwinTransformer3dEncoder(nn.Module):
I agree with @datumbox regarding the adaptation to 3d. I think it would be preferable to add it to torchvision directly. This main class can be a good entry point for the unification effort. Looking into it, I think we can have common logic under SwinTransformer with extra parameters in order to cater for the 2d (SwinTransformer2d) and 3d (SwinTransformer3d) versions. For BC, the default values should be for the 2d version. From this main class we could drill down and look into more unification efforts (probably not possible in all cases). For example, we have ConvNormActivationBlock and then 2d and 3d versions (see https://github.com/pytorch/vision/blob/49496c4f6201f05f2351788389a0863b087e78f4/torchvision/ops/misc.py#L68). This was a simpler example, but could be used as inspiration.
Let me know your thoughts @YosuaMichael. I might have overlooked some of the differences which make this even harder than I think.
@jdsgomes I agree with adding this to torchvision (planned to do later). But as for having a generic SwinTransformer for both 2d and 3d, I have a feeling this could be difficult, especially making it BC. Here are some reasons I can think of (let me know if you have a good solution):
- Some of the input params differ in type. For instance, window_size for 2d is int (they assume it is always a symmetric window like 16x16), whereas in 3d we use Tuple[int, int, int] since we have windows like (8, 7, 7).
- In this 3d version, I modularize the patch_embed layer because Omnivore actually uses a different patch_embed method.
There are some options. For the window_size we can use a union type for the parameter. For patch_embed we can have it as an optional pre-step.
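An illustrative sketch of how both options could look in a constructor signature (this is an assumption about a possible API, not the actual torchvision or torchmultimodal code):

```python
from typing import Callable, Optional, Tuple, Union

from torch import nn

def _to_3tuple(window_size: Union[int, Tuple[int, int, int]]) -> Tuple[int, int, int]:
    # Accept the 2d-style single int (symmetric window) or an explicit
    # per-dimension tuple such as (8, 7, 7).
    if isinstance(window_size, int):
        return (window_size, window_size, window_size)
    return window_size

class SwinEncoderSketch(nn.Module):
    def __init__(
        self,
        window_size: Union[int, Tuple[int, int, int]] = 7,
        patch_embed: Optional[Callable[..., nn.Module]] = None,
    ):
        super().__init__()
        self.window_size = _to_3tuple(window_size)
        # Optional pre-step: fall back to an identity module when no
        # patch_embed callable is supplied.
        self.patch_embed = patch_embed() if patch_embed is not None else nn.Identity()
```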
norm_layer=nn.LayerNorm,
patch_embed=PatchEmbedOmnivore,
)
if encoder_only:
Instead of this, separate it out into a function that returns the encoder and call that function here.
Ok, I think we can create a function for the encoder only and call it here.
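A hedged sketch of that builder pattern, with illustrative names and head sizes (not the PR's final API):

```python
from torch import nn

def omnivore_swin_t_encoder() -> nn.Module:
    # In the real code this would construct SwinTransformer3dEncoder(...)
    # with patch_embed=PatchEmbedOmnivore, norm_layer=nn.LayerNorm, etc.;
    # a stand-in module is used here to keep the sketch self-contained.
    return nn.Sequential(nn.Flatten(), nn.LazyLinear(768))

def omnivore_swin_t(encoder_only: bool = False) -> nn.Module:
    encoder = omnivore_swin_t_encoder()
    if encoder_only:
        return encoder
    # Head names and output sizes are illustrative assumptions.
    heads = nn.ModuleDict(
        {"image": nn.LazyLinear(1000), "video": nn.LazyLinear(400), "rgbd": nn.LazyLinear(19)}
    )
    # The real builder would combine encoder and heads in the Omnivore model class.
    return nn.ModuleDict({"encoder": encoder, "heads": heads})
```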
Hi @ankitade, thanks for your comment and sorry for the very late feedback. I missed the notification on your comment...
I will update the code to reflect some of your comments, thanks!
Codecov Report
@@            Coverage Diff             @@
##             main      #43      +/-   ##
==========================================
+ Coverage   88.96%   89.27%   +0.31%
==========================================
  Files          33       38       +5
  Lines        1722     2061     +339
==========================================
+ Hits         1532     1840     +308
- Misses        190      221      +31
Continue to review full report at Codecov.
@YosuaMichael has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary:
Omnivore is a multimodal vision model that is able to classify RGB images, videos, and depth images (RGBD) with shared trunk parameters.
With this PR, we aim to add the model and architecture classes for Omnivore, able to load the converted pretrained weights from the original authors (https://github.com/facebookresearch/omnivore) and produce the same results.
Test plan:
A notebook, examples/omnivore/LoadOriginalPretrainedWeightAndCompare.ipynb, that loads the original Omnivore pretrained weights from torchhub and makes sure the output is the same.
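A rough sketch of the comparison done in that notebook, loading the original model via torch.hub and checking our output against it (the hub model name follows the upstream project's README; the tolerance and the torchmultimodal side are assumptions):

```python
import torch

# Load the original Omnivore model from torch.hub (per the upstream README).
original = torch.hub.load("facebookresearch/omnivore:main", model="omnivore_swinB")
original.eval()

x = torch.randn(1, 3, 1, 112, 112)  # B C D H W
with torch.no_grad():
    expected = original(x, input_type="image")

# Then compare against this PR's model on the same input, e.g.:
# ours = omnivore_model(x, input_type="image")
# torch.testing.assert_close(ours, expected, rtol=1e-4, atol=1e-4)
```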