Migrate mnist dataset from np.frombuffer #4598

Merged — 4 commits merged into pytorch:main on Oct 21, 2021

Conversation

@lc0 (Contributor) commented Oct 12, 2021

Following pytorch/pytorch#59077, this PR migrates from np.frombuffer to torch.frombuffer and closes #4552.

This PR closes the last remaining piece, the MNIST dataset.

So far I have run all the MNIST tests:

[screenshot: MNIST tests passing]

cc @pmeier @datumbox

@datumbox (Contributor) left a comment

Thanks for the PR @lc0!

This looks like a good improvement to me, but I'll let @pmeier make the final call, as he knows the datasets domain much better than I do. Note that the linter is currently failing; could you please fix it?

@lc0 (Contributor, Author) commented Oct 12, 2021

> Thanks for the PR @lc0!
>
> This looks like a good improvement to me, but I'll let @pmeier make the final call, as he knows the datasets domain much better than I do. Note that the linter is currently failing; could you please fix it?

Good point. Somehow the first format run produced a lot of changes; after rebasing, it all looks good now. Sorry for the inconvenience 🙈

@datumbox (Contributor):

No worries at all. I have similar issues all the time ;)

@pmeier (Collaborator) commented Oct 12, 2021

Could you test this against the original data? IIRC, there was something about the byte order, which is why I implemented it in the new prototype like this:

yield np.frombuffer(chunk, dtype=in_dtype).astype(out_dtype).reshape(shape)

If this works with torch.frombuffer I'm all in favor of removing the numpy dependency. I'm on PTO right now, so it will take me some time to test this properly.

@datumbox (Contributor):

@pmeier We are happy to wait until you are back from PTO to allow for more thorough tests. Would you mind marking this with "Request changes" so that it's not accidentally merged?

@pmeier (Collaborator) left a review.

@pmeier (Collaborator) commented Oct 18, 2021

It seems to work just fine, but if we use it, we get an "old" warning back:

UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at  ../torch/csrc/autograd/python_torch_functions_manual.cpp:489.)
  parsed = torch.frombuffer(data, dtype=torch_type, offset=(4 * (nd + 1)))

This warning was only recently removed in #4184. Maybe we can send a patch upstream to fix it, but the warning seems justified to me.

@NicolasHug (Member):

According to the docs of torch.frombuffer:

> The returned tensor and buffer share the same memory. Modifications to the tensor will be reflected in the buffer and vice versa. The returned tensor is not resizable.

so I'm not sure we can avoid the warning if we switch to torch.frombuffer, unless we copy the buffer beforehand.
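For illustration, here is a minimal sketch (not part of this PR) of the memory sharing the docs describe, using a writable bytearray so that no warning is emitted:

import torch

buf = bytearray(b"\x01\x02\x03\x04")          # writable buffer
t = torch.frombuffer(buf, dtype=torch.uint8)  # the tensor shares memory with buf

t[0] = 255   # writing through the tensor...
print(buf)   # ...is visible in the buffer: bytearray(b'\xff\x02\x03\x04')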

I would suggest either:

  • if we really want to use torch.frombuffer, just silence the warning by wrapping the call in a context manager. For this specific buffer, I really doubt the "writeable tensor on a read-only buffer" will ever cause a problem (FWIW, we used to ignore the warning before we started making a copy in #4184, "Fixed MNIST Download Raises 'UserWarning: The given NumPy array is not writeable'")
  • not care about all this and keep the code as-is (i.e. not merge this PR). The code-quality improvement is pretty marginal anyway.

@datumbox (Contributor):

Agreed about the options. I would go with option 1 as this aligns with our previous decision to copy instead of ignore the warning.

@pmeier (Collaborator) commented Oct 19, 2021

> Agreed about the options. I would go with option 1 as this aligns with our previous decision to copy instead of ignore the warning.

Not sure what you mean here. If we go with option 1, we need to suppress the warning, because we can only copy (.clone()) after we load the data with torch.frombuffer.

@datumbox (Contributor):

@pmeier Apologies, I misread the previous comment. TBH I would prefer to merge the PR because it removes the extra code for juggling between numpy and pytorch. Concerning fixing the warning, we could either call clone() or suppress it. Copying doesn't sound too bad given that the dataset is tiny, and it avoids issues with in-place modifications. It would also align the behaviour with #4578, but I have no strong opinions.

@pmeier (Collaborator) commented Oct 19, 2021

> Concerning fixing the warning, we could either call clone() or suppress it.

Sorry to be pedantic here, but we need to suppress and clone. torch.frombuffer emits the warning unconditionally. After we have loaded the buffer, we can then clone to avoid problems with in-place modifications. For example:

with suppress_read_only_warning():  # hypothetical helper that filters the UserWarning
    data = torch.frombuffer(buffer, dtype=dtype).clone()

This issue did not arise before, because numpy either can deal with read-only buffers or simply doesn't warn you about it.
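For concreteness, here is a minimal, self-contained sketch of that suppress-and-clone approach, using warnings.catch_warnings in place of the hypothetical suppress_read_only_warning() helper (the buffer contents below are made up for illustration):

import warnings

import torch

buffer = b"\x00\x01\x02\x03"  # read-only bytes, e.g. as returned by file.read()

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the non-writable-buffer UserWarning
    data = torch.frombuffer(buffer, dtype=torch.uint8).clone()  # clone() detaches the result from the buffer

print(data)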

@NicolasHug (Member) commented Oct 19, 2021

Let's do that. @lc0, would you mind updating the PR with something similar to what @pmeier suggested above (#4598 (comment)) or what @datumbox suggested below (#4598 (comment))?
Thanks!

@datumbox (Contributor) commented Oct 19, 2021

I might be missing something here, but why can't we copy the buffer and make it mutable?

Something like the following should work because the bytearray is mutable:

parsed = torch.frombuffer(bytearray(data), dtype=torch_type, offset=(4 * (nd + 1)))

We do copy but we don't have to suppress warnings etc.
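As a quick illustration of that idea (the byte values are made up), wrapping the raw bytes in a bytearray hands torch.frombuffer a writable copy, so no UserWarning is emitted and the original bytes stay untouched:

import torch

raw = b"\x01\x02\x03\x04"  # immutable bytes: passing these directly would trigger the warning
parsed = torch.frombuffer(bytearray(raw), dtype=torch.uint8)  # writable copy, so no warning
parsed[0] = 255  # modifies only the copy; raw is unchanged
print(raw, parsed)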

@NicolasHug (Member):

Yes, we can either copy the buffer (and not suppress the warning) or copy the resulting tensor (and suppress).

Both are the same in terms of complexity / time overhead (i.e. minimal), so we can go with whichever we want. But I'd go with the first one that we're all happy with :)

@pmeier (Collaborator) commented Oct 19, 2021

I like torch.frombuffer(bytearray(buffer), dtype=dtype) best, because it is quite concise and we avoid writing extra functionality to suppress a warning that we can't control directly. Thanks @datumbox for the suggestion.

@lc0 (Contributor, Author) commented Oct 20, 2021

I fixed it according to the suggestion from @datumbox, and there are no more warnings :)

[screenshot: MNIST tests passing with no warnings]

@pmeier (Collaborator) left a comment

@lc0 The MNIST files are stored in big endian byte order, but torch has no notion of byte order AFAIK. Thus, if your system uses little endian, we will get wrong results if we just read the data with torch.frombuffer. In numpy you can specify the byte order by prefixing the type with > to indicate big endian. If you don't specify anything, it will use the system default, as torch does.

By removing numpy, we need to take care of that manually. See #4651 for an example where I did this for a prototype implementation of MNIST datasets.
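For illustration, a rough sketch of one way the byte order could be handled manually without numpy (this is an assumption, not the code from #4651; the helper name and the view/flip trick are made up): read the data with torch.frombuffer and, on little-endian hosts, reverse the bytes within each element:

import sys

import torch

def read_big_endian(data: bytes, dtype: torch.dtype) -> torch.Tensor:
    # Interpret big-endian raw bytes as a 1-D tensor of `dtype`, regardless of the host byte order.
    t = torch.frombuffer(bytearray(data), dtype=dtype)
    if sys.byteorder == "little" and t.element_size() > 1:
        # Reinterpret each element as its raw bytes, reverse them, and view the result back as `dtype`.
        raw = t.view(torch.uint8).reshape(-1, t.element_size())
        t = raw.flip(1).reshape(-1).view(dtype)
    return t

print(read_big_endian(b"\x00\x00\x00\x01", torch.int32))  # tensor([1], dtype=torch.int32)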

@lc0 (Contributor, Author) commented Oct 21, 2021

Hi @pmeier, that's a good point. It feels like we are maintaining two implementations of the dataset API. Does it make sense to somehow abstract it away so we don't have to implement and maintain it twice? :)

Also, if this is currently broken, I guess it's subtle enough that it would be useful to cover it with a test. What do you think?

@pmeier (Collaborator) commented Oct 21, 2021

> It feels like we are maintaining two implementations of the dataset API. Does it make sense to somehow abstract it away so we don't have to implement and maintain it twice? :)

Yes, currently there are two APIs for datasets: torchvision.datasets and torchvision.prototype.datasets. As soon as the prototype one is stable, and after a deprecation period, it will replace the current one. Until that happens we unfortunately need to maintain both.

> Also, if this is currently broken, I guess it's subtle enough that it would be useful to cover it with a test. What do you think?

It is not currently broken. If you look at the dtype dict, you see that all dtypes with more than one byte have a variant that is prefixed with >. This is numpy's way of indicating big endian.
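For reference, a tiny example of numpy's byte-order prefixes (the bytes are made up): the same four bytes decode differently depending on whether big-endian or little-endian int32 is requested:

import numpy as np

be = np.frombuffer(b"\x00\x00\x00\x01", dtype=">i4")  # big-endian int32    -> value 1
le = np.frombuffer(b"\x00\x00\x00\x01", dtype="<i4")  # little-endian int32 -> value 16777216
print(be, le)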

Or do you mean that this PR is currently broken, but our tests are passing? In that case I agree.

@lc0 (Contributor, Author) commented Oct 21, 2021

> Or do you mean that this PR is currently broken, but our tests are passing? In that case I agree.

Correct, this PR (or the sibling PR from you) :) Btw, do you plan to add a test for this byte order in the prototypes?

@lc0 requested a review from pmeier — October 21, 2021 13:39
@pmeier (Collaborator) left a comment

LGTM, thanks @lc0!

@pmeier requested a review from datumbox — October 21, 2021 15:14
@datumbox merged commit d605d7d into pytorch:main — Oct 21, 2021
@github-actions (bot):

Hey @datumbox!

You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py

@lc0 deleted the frombuffer branch — October 22, 2021 16:56
facebook-github-bot pushed a commit that referenced this pull request Oct 26, 2021
Summary:
* Migrate mnist dataset from np.frombuffer

* Add a copy with bytearray for non-writable buffers

* Add byte reversal for mnist

Reviewed By: NicolasHug

Differential Revision: D31916333

fbshipit-source-id: 9af4d8dc4fb7e11c63fcc82c87004314fbc2d225

Co-authored-by: Vasilis Vryniotis <[email protected]>
cyyever pushed a commit to cyyever/vision that referenced this pull request Nov 16, 2021
* Migrate mnist dataset from np.frombuffer

* Add a copy with bytearray for non-writable buffers

* Add byte reversal for mnist

Co-authored-by: Vasilis Vryniotis <[email protected]>
Linked issue closed by this PR: Use torch.frombuffer instead of np.frombuffer (#4552)