-
Notifications
You must be signed in to change notification settings - Fork 7.1k
add lazily filled dict for prototype datasets #5219
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
💊 CI failures summary and remediationsAs of commit ff79a88 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
image_files_map = dict( | ||
(image_id, rel_posix_path.rsplit("/", maxsplit=1)[1]) for image_id, rel_posix_path in image_files_dp | ||
) | ||
image_files_dp = Mapper(image_files_dp, self._2011_image_key, input_col=1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What we can do here is a in_memory_cache
over image_files_dp
. Then, we could add an API to convert a IterDataPipe
to a lazy loaded MapDataPipe
to represent the LazyDict
.
If we use LazyDict
here, I have concern that image_files_map
would be missing from the DataLoader graph.
cc: @VitalyFedyunin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides, I think the DataLoader would complain this datapipe graph in the second epoch because image_files_dp
is never used after the first epoch then Demux
would also be non-serializable same as the comment I made in the other PR.
So, a fix from Demux
is not avoidable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a good thing to test in general. What should a test look like. Is something like
for _ in dataset.cycle(2):
pass
enough? If yes, my proposal passes this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean if we put the dataset (datapipes) into DataLoader, the second epoch of DataLoader would break.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So something like
data_loader = DataLoader2(dataset)
for epoch in range(2):
for sample in data_loader:
pass
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I have asked Kevin to fix such issue in demux.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My proposal still works. I've pushed the test I'm running against. There are multiple failures for other datasets, but cub200
is not one of them.
@@ -173,9 +177,8 @@ def _make_datapipe( | |||
) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A second thought. Could we simply filter image_files_dp
from archive_dp
here and create the image_files_map
dictionary?
Then, we can do demux over archive_dp
again and drop data in image_files_dp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So basically splitting of image_files_dp
from the graph?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. Then, we can materialize the data from it like a meta-datapipe.
The agreed upon idiom is that we can iterate before the full pipeline is built, but we have to exhaust it completely. See #6065 for fixes to |
This is my proposed solution to #5187 (comment). Since we only need the mapping during iteration, we can also delay its instantiation until then. Thoughts?
cc @pmeier @bjuncek