
add download functionality to prototype datasets #5035


Merged

pmeier merged 8 commits into pytorch:main on Dec 8, 2021

Conversation

Collaborator

@pmeier pmeier commented Dec 6, 2021

This PR adds download functionality to all resources and in turn to all prototype datasets.

There are two main changes here:

  1. The OnlineResource class now takes decompress and extract keyword arguments that apply the corresponding action to the resource after download. This makes it easy for a contributor to "patch" I/O performance bottlenecks. For example, while benchmarking "caltech101" we saw an almost 20x drop in performance compared to the legacy datasets, since the image archive is compressed. This is now easily fixed by setting decompress=True when defining the resource.
  2. Instead of always returning the "raw" datapipe, the resource now tries to guess from its suffixes whether the file is an archive and applies the appropriate archive reader datapipe. Before, the first step in almost every dataset was to call (Tar|Zip)ArchiveReader to get access to the files; this is no longer needed after this PR. (This is also the reason this PR touches almost all datasets.)
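The suffix-based guessing in point 2 can be sketched as follows. This is a minimal illustration with hypothetical names (read_tar, read_zip, guess_loader), not the actual torchvision implementation:

```python
# Minimal sketch of suffix-based archive-loader guessing. All names here
# are hypothetical stand-ins for the real datapipe-returning loaders.
import pathlib
from typing import Callable, Dict, Optional

def read_tar(path: str) -> str:
    return f"tar:{path}"  # stand-in for a TarArchiveReader datapipe

def read_zip(path: str) -> str:
    return f"zip:{path}"  # stand-in for a ZipArchiveReader datapipe

_LOADERS: Dict[str, Callable[[str], str]] = {
    ".tar": read_tar,
    ".zip": read_zip,
}

def guess_loader(path: str) -> Optional[Callable[[str], str]]:
    # pathlib splits "images.tar.gz" into suffixes [".tar", ".gz"],
    # so a compressed tarball is still recognized as a tar archive.
    for suffix in pathlib.Path(path).suffixes:
        if suffix in _LOADERS:
            return _LOADERS[suffix]
    return None  # not a known archive: fall back to the raw datapipe
```

When no loader matches, the caller keeps the raw datapipe, which mirrors the fallback behavior described above.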

cc @pmeier @bjuncek

@facebook-github-bot
Copy link

facebook-github-bot commented Dec 6, 2021

💊 CI failures summary and remediations

As of commit e7f61d8 (more details on the Dr. CI page):


  • 1/1 failures introduced in this PR

1 failure not recognized by patterns:

Job: CircleCI binary_linux_conda_py3.6_cu111
Step: packaging/build_conda.sh

1 job timed out:

  • binary_linux_conda_py3.6_cu111

This comment was automatically generated by Dr. CI.

Please report bugs/suggestions to the (internal) Dr. CI Users group.

Member

@fmassa fmassa left a comment

Thanks for the PR!

I have some minor comments, none of which are merge blocking, so I'm approving to move forward here.

Comment on lines +75 to +88
_ARCHIVE_LOADERS = {
    ".tar": TarArchiveReader,
    ".zip": ZipArchiveReader,
    ".rar": RarArchiveLoader,
}

def _guess_archive_loader(
    self, path: pathlib.Path
) -> Optional[Callable[[IterDataPipe[Tuple[str, IO]]], IterDataPipe[Tuple[str, IO]]]]:
    try:
        _, archive_type, _ = _detect_file_type(path.name)
    except RuntimeError:
        return None
    return self._ARCHIVE_LOADERS.get(archive_type)  # type: ignore[arg-type]
Member

nit: we should probably have a dedicated function in torchdata to dispatch to different readers

Collaborator Author

See facebookexternal/torchdata#23. cc @ejguan

Contributor


IMHO, it's different. One option, as you have pointed out, is to have a single class that reads files from all the different archive types.
The other option is to return an archive reader for a specific archive type.
From my perspective, the latter solution is more suitable, because the first one relies on all file handles opened from tar, zip, and rar behaving the same. Otherwise, it's going to be hard to debug which file handles cause a problem. Also, such a combined DataPipe class would need additional dependencies like rarfile: users who only need to read from tar or zip would still have to install rarfile to use it.

Collaborator Author

rely on all file handles opened from tar, zip and rar behaving the same.

I'm not sure I understand. If we had an ArchiveReader datapipe, in its __iter__ method it could infer the archive format from the path and then use whatever functionality it needs to iterate the archive. Something along the lines of

def iterate_tar(path, file):
    ...

def iterate_zip(path, file):
    ...

class ArchiveReader(IterDataPipe):
    ...

    def infer_archive_type(self, path):
        ...

    def __iter__(self):
        for path, file in self.datapipe:
            archive_type = self.infer_archive_type(path)
            if archive_type == "tar":
                yield from iterate_tar(path, file)
            elif archive_type == "zip":
                yield from iterate_zip(path, file)
            else:
                raise ValueError("Unknown archive type!")

class TarReader(IterDataPipe):
    ...

    def __iter__(self):
        for path, file in self.datapipe:
            yield from iterate_tar(path, file)
            
class ZipReader(IterDataPipe):
    ...

    def __iter__(self):
        for path, file in self.datapipe:
            yield from iterate_zip(path, file)

If the user wants full control, they could still use a specific archive reader. Otherwise they can use the convenience of not needing to specify the archive type by using ArchiveReader.
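The sketch above can be made concrete with plain Python and the standard library. This is a runnable miniature, not the torchdata API: for simplicity the readers here yield (path, member name) pairs instead of file handles.

```python
# Runnable miniature of the proposal: a generic reader infers the archive
# type per path and dispatches, while type-specific readers remain
# available for full control. Names and behavior are illustrative only.
import tarfile
import zipfile

def iterate_tar(path):
    with tarfile.open(path) as tf:
        for name in tf.getnames():
            yield path, name

def iterate_zip(path):
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            yield path, name

def infer_archive_type(path):
    if path.endswith(".tar"):
        return "tar"
    if path.endswith(".zip"):
        return "zip"
    raise ValueError(f"Unknown archive type: {path!r}")

def archive_reader(paths):
    # The generic reader: dispatch per path, so one pipeline can mix
    # archive types without the user naming them up front.
    for path in paths:
        if infer_archive_type(path) == "tar":
            yield from iterate_tar(path)
        else:
            yield from iterate_zip(path)
```

A caller who knows the archive type can use iterate_tar or iterate_zip directly; archive_reader only adds the type inference on top.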

And, for the first solution, such DataPipe class would need more dependencies like rarfile. If users only need to read from tar or zip, by using such DataPipe, they have to install rarfile.

The dependencies could be checked at runtime. To stick with the rarfile example, we could lazily import it in the archive_type == "rar" branch. Meaning, if the user never hits a rar archive, everything works fine.
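The lazy-import idea can be sketched like this (an assumed structure, not actual torchdata code; rarfile is only imported when a .rar file is actually encountered):

```python
# Sketch of lazy dependency checking: the optional third-party dependency
# (rarfile) is imported only inside the branch that needs it, so users who
# never touch a .rar archive never need it installed.
def iterate_archive(path):
    if path.endswith(".tar"):
        import tarfile  # stdlib, always available
        with tarfile.open(path) as tf:
            yield from tf.getnames()
    elif path.endswith(".rar"):
        import rarfile  # third-party; ImportError is raised only here
        with rarfile.RarFile(path) as rf:
            yield from rf.namelist()
    else:
        raise ValueError(f"Unknown archive type: {path!r}")
```

The trade-off raised below still applies: a missing dependency surfaces mid-iteration rather than at construction time.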

Contributor


rely on all file handles opened from tar, zip and rar behaving the same.

I am wondering whether there is a difference between the behaviors of file handles yielded from such a DataPipe. If so, having all file handles come from a single DataPipe can cause confusion.

The dependencies could be checked at runtime. To stick with the rarfile example, we could lazily import it in the archive_type == "rar" branch. Meaning, if the user never hits a rar archive, everything works fine.

But then we cannot raise an error at construction time; the pipeline would fail mid-iteration, in the middle of training.

The main benefit of the first solution would be that a single DataPipe handles all types of archives. This would only be useful if a domain has a use case where a single dataset owns different archive types.
The latter solution is the same as what you are currently doing: returning a specific archive reader per dataset.

Collaborator Author


I am wondering whether there is a difference between the behaviors of file handles yielded from such a DataPipe. If so, having all file handles come from a single DataPipe can cause confusion.

IMO when a user opts to use a datapipe like ArchiveReader, they enter a contract that says "I'm not going to do anything with the file handles that is specific to the archive type". They all behave the same for simply reading the data, right?

@pmeier pmeier merged commit 4d00ae0 into pytorch:main Dec 8, 2021
@pmeier pmeier deleted the datasets/download branch December 8, 2021 07:55
facebook-github-bot pushed a commit that referenced this pull request Dec 9, 2021
Summary:
* add download functionality to prototype datasets

* fix annotation

* fix test

* remove iopath

* add comments

Reviewed By: NicolasHug

Differential Revision: D32950933

fbshipit-source-id: 042183130e1a10891d7663487900c64607a324a3
4 participants