add lazily filled dict for prototype datasets #5219

pmeier · 2022-01-19T14:01:01Z

This is my proposed solution to #5187 (comment). Since we only need the mapping during iteration, we can also delay its instantiation until then. Thoughts?
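For readers following along, here is a minimal sketch of what a lazily filled dict could look like. It is not the implementation from this PR: the class name `LazyDict` is taken from the review discussion, but its constructor and internals here are assumptions made for illustration, and it uses only the standard library rather than datapipes.

```python
from collections.abc import Mapping


class LazyDict(Mapping):
    """Read-only mapping that is filled from (key, value) pairs on first access.

    Hypothetical sketch: the source iterable is only consumed when the mapping
    is queried for the first time, and it is then consumed completely.
    """

    def __init__(self, pairs):
        self._pairs = pairs  # any iterable of (key, value) tuples
        self._data = None    # filled on first access

    def _load(self):
        if self._data is None:
            self._data = dict(self._pairs)
        return self._data

    def __getitem__(self, key):
        return self._load()[key]

    def __iter__(self):
        return iter(self._load())

    def __len__(self):
        return len(self._load())
```

Constructing the object does not touch the underlying iterable; the first lookup does, which is the "delay until iteration" behaviour described above.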

cc @pmeier @bjuncek

facebook-github-bot · 2022-01-19T14:01:10Z

💊 CI failures summary and remediations

As of commit ff79a88 (more details on the Dr. CI page):

💚 💚 Looks good so far! There are no failures yet. 💚 💚


ejguan · 2022-01-19T15:01:44Z

torchvision/prototype/datasets/_builtin/cub200.py

-            image_files_map = dict(
-                (image_id, rel_posix_path.rsplit("/", maxsplit=1)[1]) for image_id, rel_posix_path in image_files_dp
-            )
+            image_files_dp = Mapper(image_files_dp, self._2011_image_key, input_col=1)


What we could do here is put an in_memory_cache over image_files_dp. Then we could add an API that converts an IterDataPipe into a lazily loaded MapDataPipe to represent the LazyDict.
If we use LazyDict here, I am concerned that image_files_map would be missing from the DataLoader graph.
cc: @VitalyFedyunin

Besides, I think the DataLoader would complain about this datapipe graph in the second epoch, because image_files_dp is never used after the first epoch. Demux would then also be non-serializable, as in the comment I made in the other PR.

So a fix in Demux is unavoidable.

This seems like a good thing to test in general. What should such a test look like? Is something like

for _ in dataset.cycle(2): pass

enough? If yes, my proposal passes this test.

I mean that if we put the dataset (datapipes) into a DataLoader, the second epoch of the DataLoader would break.

So something like

data_loader = DataLoader2(dataset)
for epoch in range(2):
    for sample in data_loader:
        pass

?

Yeah. I have asked Kevin to fix this issue in demux.

My proposal still works. I've pushed the test I'm running against. There are multiple failures for other datasets, but cub200 is not one of them.
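To illustrate why iterating for two epochs is a meaningful check, here is a small stand-alone sketch. It does not use DataLoader2 or datapipes; the helper `samples_per_epoch` and the `Reiterable` wrapper are hypothetical names introduced only for this example. A source that can be consumed only once silently yields nothing in the second epoch, while a restartable one yields the same number of samples each time.

```python
def samples_per_epoch(dataset, num_epochs=2):
    """Iterate `dataset` for several epochs and count the samples seen in each."""
    return [sum(1 for _ in dataset) for _ in range(num_epochs)]


class Reiterable:
    """Wraps a factory so that every epoch starts from a fresh iterator."""

    def __init__(self, factory):
        self._factory = factory

    def __iter__(self):
        return iter(self._factory())


one_shot = (i for i in range(3))                         # exhausted after one pass
restartable = Reiterable(lambda: (i for i in range(3)))  # fresh iterator per pass

print(samples_per_epoch(one_shot))     # [3, 0]
print(samples_per_epoch(restartable))  # [3, 3]
```

A real multi-epoch test would assert that every epoch sees the expected number of samples instead of printing the counts.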

ejguan · 2022-01-19T15:07:30Z

torchvision/prototype/datasets/_builtin/cub200.py

@@ -173,9 +177,8 @@ def _make_datapipe(
            )



On second thought, could we simply filter image_files_dp out of archive_dp here and create the image_files_map dictionary?

Then we can do the demux over archive_dp again and drop the data in image_files_dp.

So basically splitting off image_files_dp from the graph?

Yeah. Then we can materialize the data from it, like a meta-datapipe.
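The two-pass idea from this thread can be sketched with plain Python in place of datapipes. Everything about the archive layout below is an assumption for illustration: entries are modelled as hypothetical `(kind, data)` tuples, and the function names are not from torchvision. Only the `rsplit("/", maxsplit=1)[1]` key extraction is taken from the diff above.

```python
def build_image_files_map(archive):
    """First pass: keep only the image-file entries and materialize the dict."""
    image_files_map = {}
    for kind, data in archive:
        if kind != "image_files":  # hypothetical tag for metadata entries
            continue
        image_id, rel_posix_path = data
        image_files_map[image_id] = rel_posix_path.rsplit("/", maxsplit=1)[1]
    return image_files_map


def iter_remaining(archive):
    """Second pass: go over the archive again and drop the image-file entries."""
    for kind, data in archive:
        if kind != "image_files":
            yield data


# Hypothetical archive content; a list is used so it can be iterated twice.
archive = [
    ("image_files", ("1", "001.some_class/a.jpg")),
    ("image", ("a.jpg", b"raw bytes")),
    ("image_files", ("2", "002.other_class/b.jpg")),
    ("image", ("b.jpg", b"more bytes")),
]

image_files_map = build_image_files_map(archive)  # fully materialized up front
remaining = list(iter_remaining(archive))
```

Because the mapping is built in its own complete pass, the second pass no longer depends on a branch of the graph that is consumed only once.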

pmeier · 2022-05-24T09:03:22Z

The agreed-upon idiom is that we can iterate before the full pipeline is built, but we have to exhaust it completely. See #6065 for fixes to ImageNet and CUB200.

add lazily filled dict for prototype datasets

373579d

pmeier added module: datasets prototype labels Jan 19, 2022

pmeier requested review from NivekT and ejguan January 19, 2022 14:01

pytorch-probot bot added the ciflow/default label Jan 19, 2022

facebook-github-bot added the cla signed label Jan 19, 2022

ejguan reviewed Jan 19, 2022

pmeier added 2 commits January 19, 2022 20:25

add cycle test

c91f421

adapt multi epoch test

ff79a88

pmeier closed this May 24, 2022

This was referenced May 24, 2022

expand prototype test matrix to different Python versions #6065

Open

Fully exhaust datapipes that are needed to construct a dataset #6076

Merged
