
add download functionality to prototype datasets #5035


Merged

pmeier merged 8 commits into pytorch:main on Dec 8, 2021

Conversation

Collaborator

@pmeier pmeier commented Dec 6, 2021

This PR adds download functionality to all resources and in turn to all prototype datasets.

There are two main changes here:

  1. The OnlineResource class now takes decompress and extract keyword arguments that apply the corresponding action to the resource after download. This makes it easy for a contributor to "patch" I/O performance bottlenecks. For example, while benchmarking "caltech101" we saw an almost 20x drop in performance compared to the legacy datasets, since the image archive is compressed. This is now easily fixed by setting decompress=True when defining the resource.
  2. Instead of always returning the "raw" datapipe, the resource now tries to guess from its suffixes whether the file is an archive and applies the appropriate archive reader datapipe. Before, the first step in almost every dataset was to call (Tar|Zip)ArchiveReader to get access to the files; this is no longer needed after this PR. (This is also the reason this PR touches almost all datasets.)
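The suffix-based guessing in point 2 can be sketched as follows. This is a minimal illustration with hypothetical names (read_tar, read_zip, guess_loader), not the actual torchvision implementation:

```python
# Minimal sketch of suffix-based archive-loader guessing. All names here
# are hypothetical stand-ins for the real datapipe-returning loaders.
import pathlib
from typing import Callable, Dict, Optional

def read_tar(path: str) -> str:
    return f"tar:{path}"  # stand-in for a TarArchiveReader datapipe

def read_zip(path: str) -> str:
    return f"zip:{path}"  # stand-in for a ZipArchiveReader datapipe

_LOADERS: Dict[str, Callable[[str], str]] = {
    ".tar": read_tar,
    ".zip": read_zip,
}

def guess_loader(path: str) -> Optional[Callable[[str], str]]:
    # pathlib splits "images.tar.gz" into suffixes [".tar", ".gz"],
    # so a compressed tarball is still recognized as a tar archive.
    for suffix in pathlib.Path(path).suffixes:
        if suffix in _LOADERS:
            return _LOADERS[suffix]
    return None  # not a known archive: fall back to the raw datapipe
```

When no loader matches, the caller keeps the raw datapipe, which mirrors the fallback behavior described above.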

cc @pmeier @bjuncek

@facebook-github-bot
Copy link

facebook-github-bot commented Dec 6, 2021

💊 CI failures summary and remediations

As of commit e7f61d8 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

1 failure not recognized by patterns:

Job: CircleCI binary_linux_conda_py3.6_cu111
Step: packaging/build_conda.sh

1 job timed out:

  • binary_linux_conda_py3.6_cu111

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Member

@fmassa fmassa left a comment

Thanks for the PR!

I have some minor comments, none of which are merge blocking, so I'm approving to move forward here.

Comment on lines +75 to +88
_ARCHIVE_LOADERS = {
    ".tar": TarArchiveReader,
    ".zip": ZipArchiveReader,
    ".rar": RarArchiveLoader,
}

def _guess_archive_loader(
    self, path: pathlib.Path
) -> Optional[Callable[[IterDataPipe[Tuple[str, IO]]], IterDataPipe[Tuple[str, IO]]]]:
    try:
        _, archive_type, _ = _detect_file_type(path.name)
    except RuntimeError:
        return None
    return self._ARCHIVE_LOADERS.get(archive_type)  # type: ignore[arg-type]
Member

nit: we should probably have a dedicated function in torchdata to dispatch to different readers

Collaborator Author

See facebookexternal/torchdata#23. cc @ejguan

Contributor


IMHO, it's different. One option, as you have pointed out, is to have a single class that reads files from all the different archive types.
The other option is to return an archive reader for a specific archive type.
From my perspective, the latter solution is more suitable, because the first one relies on all file handles opened from tar, zip, and rar behaving the same. Otherwise, it's going to be hard to debug which file handles cause a problem. Also, such a combined DataPipe class would need additional dependencies like rarfile: users who only need to read from tar or zip would still have to install rarfile to use it.

Collaborator Author

rely on all file handles opened from tar, zip and rar behaving the same.

I'm not sure I understand. If we had an ArchiveReader datapipe, in its __iter__ method it could infer the archive format from the path and then use whatever functionality it needs to iterate the archive. Something along the lines of

def iterate_tar(path, file):
    ...

def iterate_zip(path, file):
    ...

class ArchiveReader(IterDataPipe):
    ...

    def infer_archive_type(self, path):
        ...

    def __iter__(self):
        for path, file in self.datapipe:
            archive_type = self.infer_archive_type(path)
            if archive_type == "tar":
                yield from iterate_tar(path, file)
            elif archive_type == "zip":
                yield from iterate_zip(path, file)
            else:
                raise ValueError("Unknown archive type!")

class TarReader(IterDataPipe):
    ...

    def __iter__(self):
        for path, file in self.datapipe:
            yield from iterate_tar(path, file)
            
class ZipReader(IterDataPipe):
    ...

    def __iter__(self):
        for path, file in self.datapipe:
            yield from iterate_zip(path, file)

If the user wants full control, they could still use a specific archive reader. Otherwise they can use the convenience of not needing to specify the archive type by using ArchiveReader.
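The sketch above can be made concrete with plain Python and the standard library. This is a runnable miniature, not the torchdata API: for simplicity the readers here yield (path, member name) pairs instead of file handles.

```python
# Runnable miniature of the proposal: a generic reader infers the archive
# type per path and dispatches, while type-specific readers remain
# available for full control. Names and behavior are illustrative only.
import tarfile
import zipfile

def iterate_tar(path):
    with tarfile.open(path) as tf:
        for name in tf.getnames():
            yield path, name

def iterate_zip(path):
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            yield path, name

def infer_archive_type(path):
    if path.endswith(".tar"):
        return "tar"
    if path.endswith(".zip"):
        return "zip"
    raise ValueError(f"Unknown archive type: {path!r}")

def archive_reader(paths):
    # The generic reader: dispatch per path, so one pipeline can mix
    # archive types without the user naming them up front.
    for path in paths:
        if infer_archive_type(path) == "tar":
            yield from iterate_tar(path)
        else:
            yield from iterate_zip(path)
```

A caller who knows the archive type can use iterate_tar or iterate_zip directly; archive_reader only adds the type inference on top.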

And, for the first solution, such DataPipe class would need more dependencies like rarfile. If users only need to read from tar or zip, by using such DataPipe, they have to install rarfile.

The dependencies could be checked at runtime. To stick with the rarfile example, we could lazily import it in the archive_type == "rar" branch. Meaning, if the user never hits a rar archive, everything works fine.
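The lazy-import idea can be sketched like this (an assumed structure, not actual torchdata code; rarfile is only imported when a .rar file is actually encountered):

```python
# Sketch of lazy dependency checking: the optional third-party dependency
# (rarfile) is imported only inside the branch that needs it, so users who
# never touch a .rar archive never need it installed.
def iterate_archive(path):
    if path.endswith(".tar"):
        import tarfile  # stdlib, always available
        with tarfile.open(path) as tf:
            yield from tf.getnames()
    elif path.endswith(".rar"):
        import rarfile  # third-party; ImportError is raised only here
        with rarfile.RarFile(path) as rf:
            yield from rf.namelist()
    else:
        raise ValueError(f"Unknown archive type: {path!r}")
```

The trade-off raised below still applies: a missing dependency surfaces mid-iteration rather than at construction time.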

Contributor


rely on all file handles opened from tar, zip and rar behaving the same.

I am wondering whether there is a difference between the behaviors of file handles yielded from such a DataPipe. If so, having all file handles come from a single DataPipe can cause confusion.

The dependencies could be checked at runtime. To stick with the rarfile example, we could lazily import it in the archive_type == "rar" branch. Meaning, if the user never hits a rar archive, everything works fine.

But then we cannot raise an error at construction time; the pipeline would fail mid-iteration, in the middle of training.

The main benefit of the first solution would be that a single DataPipe handles all types of archives. This would only be useful if a domain has a use case where a single dataset owns different archive types.
The latter solution is the same as what you are currently doing: returning a specific archive reader per dataset.

Collaborator Author


I am wondering whether there is a difference between the behaviors of file handles yielded from such a DataPipe. If so, having all file handles come from a single DataPipe can cause confusion.

IMO when a user opts to use a datapipe like ArchiveReader, they enter a contract that says "I'm not going to do anything with the file handles that is specific to the archive type". They all behave the same for simply reading the data, right?

@pmeier pmeier merged commit 4d00ae0 into pytorch:main Dec 8, 2021
@pmeier pmeier deleted the datasets/download branch December 8, 2021 07:55
facebook-github-bot pushed a commit that referenced this pull request Dec 9, 2021
Summary:
* add download functionality to prototype datasets

* fix annotation

* fix test

* remove iopath

* add comments

Reviewed By: NicolasHug

Differential Revision: D32950933

fbshipit-source-id: 042183130e1a10891d7663487900c64607a324a3
4 participants