Migrate mnist dataset from np.frombuffer #4598

Merged — 4 commits merged into pytorch:main on Oct 21, 2021

Conversation

@lc0 (Contributor) commented Oct 12, 2021

Following pytorch/pytorch#59077, this PR migrates from np.frombuffer to torch.frombuffer and closes #4552.

This PR closes the last remaining piece, the MNIST dataset.

So far I have run all the MNIST tests:

[screenshot: MNIST tests passing]

cc @pmeier @datumbox

@datumbox (Contributor) left a comment

Thanks for the PR @lc0!

This looks like a good improvement to me, but I'll let @pmeier make the final call, as he knows the datasets domain much better than I do. Note that the linter is currently failing; could you please fix it?

@lc0 (Contributor, Author) commented Oct 12, 2021

> Thanks for the PR @lc0!
>
> This looks like a good improvement to me, but I'll let @pmeier make the final call, as he knows the datasets domain much better than I do. Note that the linter is currently failing; could you please fix it?

Good point. Somehow the first format run produced a lot of changes; after rebasing, it all looks good now. Sorry for the inconvenience 🙈

@datumbox (Contributor):

No worries at all. I have similar issues all the time ;)

@pmeier (Collaborator) commented Oct 12, 2021

Could you test this against the original data? IIRC, there was something about the byte order, which is why I implemented it in the new prototype like this:

yield np.frombuffer(chunk, dtype=in_dtype).astype(out_dtype).reshape(shape)

If this works with torch.frombuffer I'm all in favor of removing the numpy dependency. I'm on PTO right now, so it will take me some time to test this properly.

@datumbox (Contributor):

@pmeier We are happy to wait until you are back from PTO to allow for more thorough tests. Would you mind marking this with "Request changes" so that it's not accidentally merged?

@pmeier (Collaborator) left a review.

@pmeier (Collaborator) commented Oct 18, 2021

It seems to work just fine, but if we use it, we get an "old" warning back:

UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/autograd/python_torch_functions_manual.cpp:489.)
  parsed = torch.frombuffer(data, dtype=torch_type, offset=(4 * (nd + 1)))

This warning was only recently removed in #4184. Maybe we can send a patch upstream to fix it, but the warning seems justified to me.

@NicolasHug (Member):

According to the docs of torch.frombuffer:

> The returned tensor and buffer share the same memory. Modifications to the tensor will be reflected in the buffer and vice versa. The returned tensor is not resizable.

so I'm not sure we can avoid the warning if we switch to torch.frombuffer, unless we copy the buffer beforehand.
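For illustration, here is a minimal sketch (not part of this PR) of the memory sharing the docs describe, using a writable bytearray so that no warning is emitted:

import torch

buf = bytearray(b"\x01\x02\x03\x04")          # writable buffer
t = torch.frombuffer(buf, dtype=torch.uint8)  # the tensor shares memory with buf

t[0] = 255   # writing through the tensor...
print(buf)   # ...is visible in the buffer: bytearray(b'\xff\x02\x03\x04')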

I would suggest either:

  • if we really want to use torch.frombuffer, just silence the warning by wrapping the call in a context manager. For this specific buffer, I really doubt the "writeable tensor on a read-only buffer" will ever cause a problem (FWIW, we used to ignore the warning before we started making a copy in #4184, "Fixed MNIST Download Raises 'UserWarning: The given NumPy array is not writeable'")
  • not care about all this and keep the code as-is (i.e. not merge this PR). The code-quality improvement is pretty marginal anyway.

@datumbox (Contributor):

Agreed about the options. I would go with option 1 as this aligns with our previous decision to copy instead of ignore the warning.

@pmeier (Collaborator) commented Oct 19, 2021

> Agreed about the options. I would go with option 1 as this aligns with our previous decision to copy instead of ignore the warning.

Not sure what you mean here. If we go with option 1, we need to suppress the warning, because we can only copy (.clone()) after we load the data with torch.frombuffer.

@datumbox (Contributor):

@pmeier Apologies, I misread the previous comment. TBH I would prefer to merge the PR because it removes the extra code for juggling between numpy and pytorch. Concerning fixing the warning, we could either call clone() or suppress it. Copying doesn't sound too bad given that the dataset is tiny, and it avoids issues with in-place modifications. It would also align the behaviour with #4578, but I have no strong opinions.

@pmeier (Collaborator) commented Oct 19, 2021

> Concerning fixing the warning, we could either call clone() or suppress it.

Sorry to be pedantic here, but we need to suppress and clone. torch.frombuffer emits the warning unconditionally. After we have loaded the buffer, we can then clone to avoid problems with in-place modifications. For example:

with suppress_read_only_warning():  # hypothetical helper that filters the UserWarning
    data = torch.frombuffer(buffer, dtype=dtype).clone()

This issue did not arise before, because numpy either can deal with read-only buffers or simply doesn't warn you about it.
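For concreteness, here is a minimal, self-contained sketch of that suppress-and-clone approach, using warnings.catch_warnings in place of the hypothetical suppress_read_only_warning() helper (the buffer contents below are made up for illustration):

import warnings

import torch

buffer = b"\x00\x01\x02\x03"  # read-only bytes, e.g. as returned by file.read()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the non-writable-buffer UserWarning
    data = torch.frombuffer(buffer, dtype=torch.uint8).clone()  # clone() detaches the result from the buffer

print(data)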

@NicolasHug (Member) commented Oct 19, 2021

Let's do that. @lc0, would you mind updating the PR with something similar to what @pmeier suggested above (#4598 (comment)) or what @datumbox suggested below (#4598 (comment))?
Thanks!

@datumbox (Contributor) commented Oct 19, 2021

I might be missing something here, but why can't we copy the buffer and make it mutable?

Something like the following should work because the bytearray is mutable:

parsed = torch.frombuffer(bytearray(data), dtype=torch_type, offset=(4 * (nd + 1)))

We do copy but we don't have to suppress warnings etc.
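As a quick illustration of that idea (the byte values are made up), wrapping the raw bytes in a bytearray hands torch.frombuffer a writable copy, so no UserWarning is emitted and the original bytes stay untouched:

import torch

raw = b"\x01\x02\x03\x04"  # immutable bytes: passing these directly would trigger the warning
parsed = torch.frombuffer(bytearray(raw), dtype=torch.uint8)  # writable copy, so no warning
parsed[0] = 255  # modifies only the copy; raw is unchanged
print(raw, parsed)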

@NicolasHug (Member):

Yes, we can either copy the buffer (and not suppress the warning) or copy the resulting tensor (and suppress).

Both are the same in terms of complexity / time overhead (i.e. minimal), so we can go with whichever we want. But I'd go with the first one that we're all happy with :)

@pmeier (Collaborator) commented Oct 19, 2021

I like torch.frombuffer(bytearray(buffer), dtype=dtype) best, because it is quite concise and we avoid writing extra functionality to suppress a warning that we can't control directly. Thanks @datumbox for the suggestion.

@lc0 (Contributor, Author) commented Oct 20, 2021

I fixed it according to the suggestion from @datumbox, and there are no more warnings :)

[screenshot: MNIST tests passing with no warnings]

@pmeier (Collaborator) left a comment

@lc0 The MNIST files are stored in big endian byte order, but torch has no notion of byte order AFAIK. Thus, if your system uses little endian, we will get wrong results if we just read the data with torch.frombuffer. In numpy you can specify the byte order by prefixing the type with > to indicate big endian. If you don't specify anything, it will use the system default, as torch does.

By removing numpy, we need to take care of that manually. See #4651 for an example where I did this for a prototype implementation of MNIST datasets.
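For illustration, a rough sketch of one way the byte order could be handled manually without numpy (this is an assumption, not the code from #4651; the helper name and the view/flip trick are made up): read the data with torch.frombuffer and, on little-endian hosts, reverse the bytes within each element:

import sys

import torch

def read_big_endian(data: bytes, dtype: torch.dtype) -> torch.Tensor:
    # Interpret big-endian raw bytes as a 1-D tensor of `dtype`, regardless of the host byte order.
    t = torch.frombuffer(bytearray(data), dtype=dtype)
    if sys.byteorder == "little" and t.element_size() > 1:
        # Reinterpret each element as its raw bytes, reverse them, and view the result back as `dtype`.
        raw = t.view(torch.uint8).reshape(-1, t.element_size())
        t = raw.flip(1).reshape(-1).view(dtype)
    return t

print(read_big_endian(b"\x00\x00\x00\x01", torch.int32))  # tensor([1], dtype=torch.int32)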

@lc0 (Contributor, Author) commented Oct 21, 2021

Hi @pmeier, that's a good point. It feels like we are maintaining two implementations of the dataset API. Does it make sense to somehow abstract it away so we don't have to implement and maintain it twice? :)

Also, if this is currently broken, I guess it's subtle enough that it would be useful to cover it with a test. What do you think?

@pmeier (Collaborator) commented Oct 21, 2021

> It feels like we are maintaining two implementations of the dataset API. Does it make sense to somehow abstract it away so we don't have to implement and maintain it twice? :)

Yes, currently there are two APIs for datasets: torchvision.datasets and torchvision.prototype.datasets. As soon as the prototype one is stable, and after a deprecation period, it will replace the current one. Until that happens we unfortunately need to maintain both.

> Also, if this is currently broken, I guess it's subtle enough that it would be useful to cover it with a test. What do you think?

It is not currently broken. If you look at the dtype dict, you see that all dtypes with more than one byte have a variant that is prefixed with >. This is numpy's way of indicating big endian.
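For reference, a tiny example of numpy's byte-order prefixes (the bytes are made up): the same four bytes decode differently depending on whether big-endian or little-endian int32 is requested:

import numpy as np

be = np.frombuffer(b"\x00\x00\x00\x01", dtype=">i4")  # big-endian int32    -> value 1
le = np.frombuffer(b"\x00\x00\x00\x01", dtype="<i4")  # little-endian int32 -> value 16777216
print(be, le)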

Or do you mean that this PR is currently broken, but our tests are passing? In that case I agree.

@lc0 (Contributor, Author) commented Oct 21, 2021

> Or do you mean that this PR is currently broken, but our tests are passing? In that case I agree.

Correct, this PR (or the sibling PR from you) :) Btw, do you plan to add a test for this byte order in the prototypes?

@lc0 requested a review from pmeier — October 21, 2021 13:39
@pmeier (Collaborator) left a comment

LGTM, thanks @lc0!

@pmeier requested a review from datumbox — October 21, 2021 15:14
@datumbox merged commit d605d7d into pytorch:main — Oct 21, 2021
@github-actions (bot):

Hey @datumbox!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@lc0 deleted the frombuffer branch — October 22, 2021 16:56
facebook-github-bot pushed a commit that referenced this pull request Oct 26, 2021
Summary:
* Migrate mnist dataset from np.frombuffer

* Add a copy with bytearray for non-writable buffers

* Add byte reversal for mnist

Reviewed By: NicolasHug

Differential Revision: D31916333

fbshipit-source-id: 9af4d8dc4fb7e11c63fcc82c87004314fbc2d225

Co-authored-by: Vasilis Vryniotis <[email protected]>
cyyever pushed a commit to cyyever/vision that referenced this pull request Nov 16, 2021
* Migrate mnist dataset from np.frombuffer

* Add a copy with bytearray for non-writable buffers

* Add byte reversal for mnist

Co-authored-by: Vasilis Vryniotis <[email protected]>
Linked issue closed by this PR: Use torch.frombuffer instead of np.frombuffer (#4552)