add lazily filled dict for prototype datasets #5219


Closed · wants to merge 3 commits
28 changes: 12 additions & 16 deletions test/test_prototype_builtin_datasets.py
@@ -3,6 +3,7 @@
 import pytest
 import torch
 from builtin_dataset_mocks import parametrize_dataset_mocks, DATASET_MOCKS
+from torch.utils.data.dataloader_experimental import DataLoader2
 from torch.utils.data.datapipes.iter.grouping import ShardingFilterIterDataPipe as ShardingFilter
 from torch.utils.data.graph import traverse
 from torchdata.datapipes.iter import IterDataPipe, Shuffler
@@ -80,27 +81,13 @@ def test_transformable(self, dataset_mock, config):

         next(iter(dataset.map(transforms.Identity())))

-    @parametrize_dataset_mocks(
-        DATASET_MOCKS,
-        marks={
-            "cub200": pytest.mark.xfail(
-                reason="See https://github.com/pytorch/vision/pull/5187#issuecomment-1015479165"
-            )
-        },
-    )
+    @parametrize_dataset_mocks(DATASET_MOCKS)
     def test_traversable(self, dataset_mock, config):
         dataset, _ = dataset_mock.load(config)

         traverse(dataset)

-    @parametrize_dataset_mocks(
-        DATASET_MOCKS,
-        marks={
-            "cub200": pytest.mark.xfail(
-                reason="See https://github.com/pytorch/vision/pull/5187#issuecomment-1015479165"
-            )
-        },
-    )
+    @parametrize_dataset_mocks(DATASET_MOCKS)
     @pytest.mark.parametrize("annotation_dp_type", (Shuffler, ShardingFilter), ids=lambda type: type.__name__)
     def test_has_annotations(self, dataset_mock, config, annotation_dp_type):
         def scan(graph):
@@ -116,6 +103,15 @@ def scan(graph):
         else:
             raise AssertionError(f"The dataset doesn't comprise a {annotation_dp_type.__name__}() datapipe.")

+    @parametrize_dataset_mocks(DATASET_MOCKS)
+    def test_multi_epoch(self, dataset_mock, config):
+        dataset, _ = dataset_mock.load(config)
+        data_loader = DataLoader2(dataset)
+
+        for epoch in range(2):
+            for _ in data_loader:
+                pass
+

 @parametrize_dataset_mocks(DATASET_MOCKS["qmnist"])
 class TestQMNIST:
9 changes: 6 additions & 3 deletions torchvision/prototype/datasets/_builtin/cub200.py
@@ -31,6 +31,7 @@
     getitem,
     path_comparator,
     path_accessor,
+    LazyDict,
 )
 from torchvision.prototype.features import Label, BoundingBox, Feature

@@ -94,6 +95,9 @@ def _2011_classify_archive(self, data: Tuple[str, Any]) -> Optional[int]:
         else:
             return None

+    def _2011_image_key(self, rel_posix_path: str) -> str:
+        return rel_posix_path.rsplit("/", 1)[1]
+
     def _2011_filter_split(self, row: List[str], *, split: str) -> bool:
         _, split_id = row
         return {
@@ -173,9 +177,8 @@ def _make_datapipe(
         )

Contributor:
On second thought: could we simply filter image_files_dp out of archive_dp here and create the image_files_map dictionary eagerly?

Then we can demux archive_dp again and drop the data in image_files_dp.

Collaborator (Author):
So, basically splitting image_files_dp off from the graph?

Contributor:
Yeah. Then, we can materialize the data from it like a meta-datapipe.

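For orientation, a rough sketch of the eagerly materialized alternative discussed above, reusing the helpers already imported in cub200.py. The assumption that the image list lives in images.txt, and the exact filter predicate, are illustrative rather than taken from the PR:

    # Split the image-files list out of the archive and materialize it eagerly.
    image_files_dp = Filter(archive_dp, path_comparator("name", "images.txt"))
    image_files_dp = CSVParser(image_files_dp, dialect="cub200")
    image_files_map = {
        image_id: rel_posix_path.rsplit("/", maxsplit=1)[1]
        for image_id, rel_posix_path in image_files_dp
    }
    # archive_dp can then be demuxed again for the remaining streams, and the
    # images.txt branch dropped, since its content now lives in image_files_map.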
         image_files_dp = CSVParser(image_files_dp, dialect="cub200")
-        image_files_map = dict(
-            (image_id, rel_posix_path.rsplit("/", maxsplit=1)[1]) for image_id, rel_posix_path in image_files_dp
-        )
+        image_files_dp = Mapper(image_files_dp, self._2011_image_key, input_col=1)
Contributor:
What we can do here is an in_memory_cache over image_files_dp. Then we could add an API to convert an IterDataPipe into a lazily loaded MapDataPipe to represent the LazyDict.
If we use LazyDict here, my concern is that image_files_map would be missing from the DataLoader graph.
cc: @VitalyFedyunin

Contributor:
Besides, I think the DataLoader would complain about this datapipe graph in the second epoch: since image_files_dp is never consumed again after the first epoch, Demux would also become non-serializable, same as the comment I made in the other PR.

So a fix in Demux is unavoidable.

Collaborator (Author):
This seems like a good thing to test in general. What should such a test look like? Is something like

for _ in dataset.cycle(2):
    pass

enough? If yes, my proposal passes this test.

Contributor:
I mean that if we put the dataset (datapipes) into the DataLoader, the second epoch of the DataLoader would break.

Collaborator (Author):
So something like

data_loader = DataLoader2(dataset)

for epoch in range(2):
    for sample in data_loader:
        pass

?

Contributor (@ejguan, Jan 19, 2022):
Yeah. I have asked Kevin to fix this issue in demux.

Collaborator (Author):
My proposal still works. I've pushed the test I'm running against. There are multiple failures for other datasets, but cub200 is not one of them.

+        image_files_map = LazyDict(image_files_dp)

         split_dp = CSVParser(split_dp, dialect="cub200")
         split_dp = Filter(split_dp, functools.partial(self._2011_filter_split, split=config.split))
20 changes: 20 additions & 0 deletions torchvision/prototype/datasets/utils/_internal.py
@@ -9,6 +9,7 @@
 import pathlib
 import pickle
 import platform
+from collections import UserDict
 from typing import BinaryIO
 from typing import (
     Sequence,
@@ -49,6 +50,7 @@
     "fromfile",
     "read_flo",
     "hint_sharding",
+    "LazyDict",
 ]

 K = TypeVar("K")
@@ -345,3 +347,21 @@ def hint_sharding(datapipe: IterDataPipe[D]) -> IterDataPipe[D]:

 def hint_shuffling(datapipe: IterDataPipe[D]) -> IterDataPipe[D]:
     return Shuffler(datapipe, default=False, buffer_size=INFINITE_BUFFER_SIZE)
+
+
+class LazyDict(UserDict):
+    def __init__(self, datapipe: IterDataPipe[Tuple[K, D]], *args: Any, **kwargs: Any) -> None:
+        super().__init__(*args, **kwargs)
+        self.datapipe = datapipe
+        self.loaded = False
+
+    def load(self) -> None:
+        for key, value in self.datapipe:
+            self.data[key] = value
+        self.loaded = True
+
+    def __getitem__(self, item: K) -> D:
+        if not self.loaded:
+            self.load()
+
+        return self.data[item]
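To illustrate how LazyDict behaves, a minimal usage sketch (the IterableWrapper input and keys are made up; any IterDataPipe of (key, value) tuples works). The wrapped datapipe is left untouched at construction time, and the first lookup triggers load(), which drains it into the underlying dict:

    from torchdata.datapipes.iter import IterableWrapper

    # Hypothetical (image_id, file_name) pairs.
    image_files_dp = IterableWrapper([("0001", "img_0001.jpg"), ("0002", "img_0002.jpg")])
    image_files_map = LazyDict(image_files_dp)

    # Nothing has been iterated yet; this first access calls load().
    assert image_files_map["0002"] == "img_0002.jpg"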