Update VITDet to conform to KerasCV scaling standards #2086
Conversation
/gcbrun
/gcbrun
/gcbrun
@@ -123,6 +124,11 @@ def __init__(
# Use common rescaling strategy across keras_cv
x = keras.layers.Rescaling(1.0 / 255.0)(x)

# VITDet scales inputs based on the standard ImageNet mean/stddev.
So we want to apply both types of rescaling? I'm a bit confused.
The `include_rescaling` check and associated rescaling layer make sure that the inputs are scaled from 0-1. The subsequent bit rescales that using the mean and stddev of ImageNet, with the prior assumption that the inputs are scaled 0-1.
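For concreteness, here is a minimal sketch of the two-step preprocessing described above (illustrative only; it assumes Keras 3's `keras.ops`, and the standalone `preprocess` helper is not the actual KerasCV code path):

```python
import keras
import numpy as np
from keras import ops


def preprocess(x):
    # Step 1: scale raw pixel values from [0, 255] down to [0, 1].
    x = keras.layers.Rescaling(1.0 / 255.0)(x)
    # Step 2: normalize with ImageNet statistics (which assume a 0-1 range).
    mean = ops.array([0.485, 0.456, 0.406], dtype="float32")
    std = ops.array([0.229, 0.224, 0.225], dtype="float32")
    return (x - mean) / std


images = np.random.randint(0, 256, size=(1, 64, 64, 3)).astype("float32")
out = preprocess(images)  # ImageNet-normalized tensor, same shape as `images`
```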
Sounds like you already verified, but wouldn't rescaling by 255 affect the ImageNet rescaling?
Correct -- see comment inline.
# VITDet scales inputs based on the standard ImageNet mean/stddev.
x = (x - ops.array([0.229, 0.224, 0.225], dtype=x.dtype)) / (
    ops.array([0.485, 0.456, 0.406], dtype=x.dtype)
)
Two things:

- I think you got the mean and std mixed up :) From the SAM repo, mean is `[123.675, 116.28, 103.53]` (`[0.485, 0.456, 0.406]` when normalized) and std is `[58.395, 57.12, 57.375]` (`[0.229, 0.224, 0.225]` when normalized).
- We should not do this because the SAM model first normalizes, then pads. If padded inputs are passed, the preprocessing operation would not remain the same (illustrated below). It's giving me suboptimal outputs when I run the demos.
@ianstenbit Let's revert this and document the preprocess step, what do you think?
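To make the second point concrete, here is a toy numpy illustration (not SAM or KerasCV code) of why the order of zero-padding and normalization matters: padding before normalization turns the pad region into `-mean/std` instead of leaving it at zero.

```python
import numpy as np

mean = np.array([0.485, 0.456, 0.406], dtype="float32")
std = np.array([0.229, 0.224, 0.225], dtype="float32")
image = np.random.rand(8, 8, 3).astype("float32")  # already scaled to 0-1

# SAM's order: normalize the real pixels first, then zero-pad.
norm_then_pad = np.pad((image - mean) / std, ((0, 4), (0, 4), (0, 0)))

# Normalization inside the backbone sees the already-padded tensor,
# so the padded region becomes (0 - mean) / std rather than staying 0.
pad_then_norm = (np.pad(image, ((0, 4), (0, 4), (0, 0))) - mean) / std

print(pad_then_norm[-1, -1])  # roughly [-2.12, -2.04, -1.80], not zeros
print(norm_then_pad[-1, -1])  # exactly [0., 0., 0.]
```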
Yes this does look like the mean/std are backwards -- I'll fix that. Can you send your demos so that I can run them?
My test demo looked fine with this change, which makes me think it's not sensitive enough.
> Yes this does look like the mean/std are backwards -- I'll fix that. Can you send your demos so that I can run them?
Here: https://colab.research.google.com/drive/1wHTsYfmmZVuC71I4St1NshaAOQ6nUPFg?usp=sharing
I ran them with the patch in #2087 and the outputs are looking better. Output masks are now much closer to the demo on the original repo, with slight noise here and there (I think this is because the padded outlines have a non-zero value after the normalization step). Nothing big though.
> My test demo looked fine with this change, which makes me think it's not sensitive enough.
I agree, it's not super-sensitive, which is good! I just thought there might be cases where this could lead to a huge difference. What do you think about making the normalization step opt-in rather than always having it on?
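A sketch of what that opt-in could look like (the `include_imagenet_normalization` flag and `build_preprocessing` helper are hypothetical, not existing KerasCV API; assumes Keras 3):

```python
import keras
from keras import ops


def build_preprocessing(x, include_rescaling=True,
                        include_imagenet_normalization=False):
    if include_rescaling:
        # Map raw 0-255 pixel values into the 0-1 range.
        x = keras.layers.Rescaling(1.0 / 255.0)(x)
    if include_imagenet_normalization:
        # Only applied when the caller opts in, so pipelines that pad
        # before calling the backbone can skip it and normalize themselves.
        mean = ops.array([0.485, 0.456, 0.406], dtype="float32")
        std = ops.array([0.229, 0.224, 0.225], dtype="float32")
        x = (x - mean) / std
    return x
```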
* Update VITDet to conform to KerasCV scaling standards
* dtype fix
I've verified that the numerics for SAM match correctly after this change.
This removes the requirement that users do the awkward ImageNet mean/stddev scaling themselves -- now the model rescales inputs correctly, as it should (this is how we did the scaling for ViT as well).
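From a user's perspective, the change looks roughly like this (a sketch only; `backbone` stands in for a constructed ViTDet backbone with rescaling enabled):

```python
import numpy as np

images = np.random.randint(0, 256, size=(1, 1024, 1024, 3)).astype("float32")

# Before: callers had to replicate the ImageNet scaling by hand.
mean = np.array([0.485, 0.456, 0.406], dtype="float32")
std = np.array([0.229, 0.224, 0.225], dtype="float32")
manually_scaled = (images / 255.0 - mean) / std
# features = backbone(manually_scaled)

# After: the backbone rescales and normalizes internally (matching how
# ViT already handles scaling), so raw 0-255 images can be passed directly.
# features = backbone(images)
```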