add download functionality to prototype datasets #5035
Conversation
💊 CI failures summary (as of commit e7f61d8): 1 failure not recognized by patterns; 1 job timed out. (This comment was automatically generated by Dr. CI.)
Thanks for the PR!
I have some minor comments, none of which are merge blocking, so I'm approving to move forward here
```python
_ARCHIVE_LOADERS = {
    ".tar": TarArchiveReader,
    ".zip": ZipArchiveReader,
    ".rar": RarArchiveLoader,
}

def _guess_archive_loader(
    self, path: pathlib.Path
) -> Optional[Callable[[IterDataPipe[Tuple[str, IO]]], IterDataPipe[Tuple[str, IO]]]]:
    try:
        _, archive_type, _ = _detect_file_type(path.name)
    except RuntimeError:
        return None
    return self._ARCHIVE_LOADERS.get(archive_type)  # type: ignore[arg-type]
```
nit: we should probably have a dedicated function in torchdata to dispatch to different readers
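Such a dispatch helper might look like the following sketch. The registry and function names (`_READERS`, `dispatch_archive_reader`) are hypothetical stand-ins, not part of torchdata's actual API; the lambdas stand in for the real archive-reading datapipes.

```python
import os
from typing import Callable, Dict, Optional

# Hypothetical registry; in torchdata this would map archive suffixes
# to the corresponding archive-reading datapipe constructors.
_READERS: Dict[str, Callable[[str], str]] = {
    ".tar": lambda path: f"reading {path} as tar",
    ".zip": lambda path: f"reading {path} as zip",
}

def dispatch_archive_reader(path: str) -> Optional[Callable[[str], str]]:
    """Look up a reader by the archive suffix of ``path``; None if unknown."""
    _, suffix = os.path.splitext(path)
    return _READERS.get(suffix)
```

A caller can then fall back to a default reader (or raise) when `None` is returned, which mirrors the `_guess_archive_loader` pattern above.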
See facebookexternal/torchdata#23. cc @ejguan
IMHO, it's different. One option, as you have pointed out, is to have a single class that reads files from all the different archive types.
The other option is to return an archive reader for a specific archive type.
From my perspective, the latter solution is more suitable, because the first solution relies on all file handles opened from tar, zip, and rar behaving the same. Otherwise, it's going to be hard to debug which file handles cause a problem. Also, with the first solution, such a DataPipe class would need extra dependencies like rarfile. If users only need to read from tar or zip, they would still have to install rarfile to use that DataPipe.
> rely on that all file handles opened from tar, zip and rar would behave same.

I'm not sure I understand. If we had an `ArchiveReader` datapipe, in its `__iter__` method it could infer the archive type from the path and then use whatever functionality it needs to iterate the archive. Something along the lines of:
```python
def iterate_tar(path, file):
    ...

def iterate_zip(path, file):
    ...

class ArchiveReader(IterDataPipe):
    ...

    def infer_archive_type(self, path):
        ...

    def __iter__(self):
        for path, file in self.datapipe:
            archive_type = self.infer_archive_type(path)
            if archive_type == "tar":
                yield from iterate_tar(path, file)
            elif archive_type == "zip":
                yield from iterate_zip(path, file)
            else:
                raise ValueError("Unknown archive type!")

class TarReader(IterDataPipe):
    ...

    def __iter__(self):
        for path, file in self.datapipe:
            yield from iterate_tar(path, file)

class ZipReader(IterDataPipe):
    ...

    def __iter__(self):
        for path, file in self.datapipe:
            yield from iterate_zip(path, file)
```

If the user wants full control, they could still use a specific archive reader. Otherwise, they can use the convenience of not needing to specify the archive type by using `ArchiveReader`.
> And, for the first solution, such DataPipe class would need more dependencies like rarfile. If users only need to read from tar or zip, by using such DataPipe, they have to install rarfile.
The dependencies could be checked at runtime. To stick with the `rarfile` example, we could lazily import it in the `archive_type == "rar"` branch. Meaning, if the user never hits a rar archive, everything works fine.
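The lazy-import pattern described above can be sketched as follows; the helper name `lazy_import` is illustrative, not an existing torchdata utility:

```python
import importlib

def lazy_import(name: str):
    """Import an optional dependency on first use, failing with an
    actionable message instead of a bare ImportError mid-iteration."""
    try:
        return importlib.import_module(name)
    except ImportError as error:
        raise ModuleNotFoundError(
            f"Optional dependency '{name}' is required to read this "
            f"archive type. Install it with 'pip install {name}'."
        ) from error
```

Inside the reader, the `archive_type == "rar"` branch would call `lazy_import("rarfile")`, so users who never touch a rar archive never pay the dependency cost.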
> rely on that all file handles opened from tar, zip and rar would behave same.

I am wondering whether there is a difference between the behavior of the file handles yielded from such a DataPipe. If so, yielding all of them from a single DataPipe can cause confusion.
> The dependencies could be checked at runtime. To stick with the rarfile example, we could lazily import it in the `archive_type == "rar"` branch. Meaning, if the user never hits a rar archive, everything works fine.
But then we cannot raise an error at construction time; the iteration of the pipeline would stop in the middle of training.
The main benefit of the first solution would be that a single DataPipe handles all types of archives. This would only be useful if a domain has a use case where a single dataset owns different archive types.
The latter solution, however, is the same as what you are currently doing: returning a specific archive reader per dataset.
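A minimal sketch of the construction-time check being argued for here, assuming a hypothetical `RarReader` datapipe and a `check_dependency` helper (neither is an existing torchdata API):

```python
import importlib.util

def check_dependency(name: str) -> None:
    """Fail fast: raise at pipeline-construction time if an optional
    dependency is missing, instead of mid-training during iteration."""
    if importlib.util.find_spec(name) is None:
        raise ModuleNotFoundError(
            f"This datapipe requires the optional package '{name}'."
        )

class RarReader:
    # Illustrative sketch: the dependency check runs in __init__, so a
    # missing 'rarfile' surfaces as soon as the pipeline is built.
    def __init__(self, datapipe):
        check_dependency("rarfile")
        self.datapipe = datapipe
```

Per-type readers get this fail-fast behavior for free, which is the core of the argument above.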
> I am thinking if there is difference between the behavior of file handles yielded from such DataPipe. If so, all file handles from a single DataPipe can cause confusion.
IMO, when a user opts to use a datapipe like `ArchiveReader`, they enter a contract that says "I'm not going to do anything with the file handles that is specific to the archive type". They all behave the same for simply reading the data, right?
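This claim is easy to check with the standard library alone: members extracted from a zip archive and a tar archive both expose plain file objects whose `.read()` returns the same bytes.

```python
import io
import tarfile
import zipfile

payload = b"hello archive"

# Build a one-member zip archive in memory.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as zf:
    zf.writestr("data.txt", payload)
zip_buffer.seek(0)

# Build a one-member tar archive in memory.
tar_buffer = io.BytesIO()
with tarfile.open(fileobj=tar_buffer, mode="w") as tf:
    info = tarfile.TarInfo("data.txt")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
tar_buffer.seek(0)

# Both archive types yield file objects that support a plain .read().
with zipfile.ZipFile(zip_buffer) as zf:
    zip_bytes = zf.open("data.txt").read()
with tarfile.open(fileobj=tar_buffer) as tf:
    tar_bytes = tf.extractfile("data.txt").read()
```

The handles do differ in archive-specific extras (e.g. seeking behavior and metadata), which is exactly what the proposed contract rules out.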
Summary:
* add download functionality to prototype datasets
* fix annotation
* fix test
* remove iopath
* add comments

Reviewed By: NicolasHug
Differential Revision: D32950933
fbshipit-source-id: 042183130e1a10891d7663487900c64607a324a3
This PR adds download functionality to all resources and, in turn, to all prototype datasets.
There are two main changes here:

1. The `OnlineResource` class now takes the `decompress` and `extract` keyword arguments, which apply the action given by their name to the resource after download. This makes it really easy for a contributor to "patch" I/O performance bottlenecks. For example, while benchmarking `"caltech101"` we saw almost a 20x drop in performance compared to the legacy datasets, since the image archive is compressed. This is now easily fixed by setting `decompress=True` while defining the resource.
2. Datasets previously had to use `(Tar|Zip)ArchiveReader` to get access to the files, which is no longer needed after this PR. (This is also the reason why this PR touches almost all datasets.)

cc @pmeier @bjuncek