[DataLoader] Close open in DataPipe streams on best effort basis #78952
Conversation
✅ No Failures (0 Pending) as of commit c9c6bf8. This comment was automatically generated by Dr. CI.
… basis"

Adding ability to:
- Track open StreamWrappers with `StreamWrapper.session_streams`
- Automatically close parent StreamWrapper (e.g. a torchdata tar is the parent and extracted file streams are children)
- Close streams in structures discarded by filtering
What happens if someone does `fork` and then `filter` (or `zip`)? Would this close the stream held by the other ChildDataPipe?
```python
stream_dp = ...
cdp1, cdp2 = stream_dp.fork(2)
cdp2 = cdp2.filter(...)
list(cdp1.readlines())
```
I think we have to increment the counter when `fork` is used, and only close a stream when there is no other reference to it. Ideally, we should …
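As a minimal sketch of that reference-counting idea (the class and method names here are hypothetical, not part of this PR):

```python
# Hypothetical sketch of per-stream reference counting; names are illustrative only.
class RefCountedStream:
    def __init__(self, file_obj):
        self.file_obj = file_obj
        self.ref_count = 1  # the DataPipe that opened the stream holds one reference

    def acquire(self):
        # Called once per additional holder, e.g. for each ChildDataPipe created by fork().
        self.ref_count += 1

    def release(self):
        # Close only when the last holder lets go of the stream.
        self.ref_count -= 1
        if self.ref_count == 0:
            self.file_obj.close()
```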
```python
iterators = [iter(datapipe) for datapipe in self.datapipes]
for data in zip(*iterators):
    yield data

unused = []
for iterator in iterators:
    unused += list(iterator)

# TODO(VitalyFedyunin): This should be Exception or warning when torchdata.debug is enabled
for item in unused:
    StreamWrapper.close_streams(item)
```
I am not sure about the behavior of generator functions. When an iterator that is not fully consumed is deconstructed, I think the generator would be destroyed and this cleanup code would never run, so these `StreamWrapper`s are never closed. Do we need to handle this case?
I think there are two options here:

- Do `try...finally` within `__iter__`, where the `finally` clause does the cleanup.
  - This should work regardless of how or why the deconstruction of the iterator happens. I just tried this with `IterableWrapper`:
```python
def __iter__(self):
    try:
        source_data = self.iterable
        if self.deepcopy:
            try:
                source_data = copy.deepcopy(self.iterable)
            except TypeError:
                warnings.warn("....")
        for data in source_data:
            yield data
    finally:
        print("FINALLLY")
```
```python
dp = IterableWrapper(range(10))
it = iter(dp)
print(next(it))
print("About to create a new iterator")
it = iter(dp)
print("Created a new iterator")
```

Output:

```
0
About to create a new iterator
FINALLLY
Created a new iterator
```
- We can put the cleanup logic in `hook_iterator` or the `reset` method of `IterDataPipe`, whenever that is triggered.
I like the idea of `finally:`; it will produce readable code and can be established as a development pattern.
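A sketch of what that pattern could look like for the zip-style `__iter__` quoted above; it reuses the names from that snippet but is not necessarily the merged code:

```python
def __iter__(self):
    iterators = [iter(datapipe) for datapipe in self.datapipes]
    try:
        for data in zip(*iterators):
            yield data
    finally:
        # Runs whether the loop finishes or the generator is destroyed early,
        # so leftover items in partially consumed iterators still get cleaned up.
        for iterator in iterators:
            for item in iterator:
                StreamWrapper.close_streams(item)
```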
Done
```diff
@@ -157,13 +157,59 @@ class StreamWrapper:
     DataPipe operation like `FileOpener`. StreamWrapper would guarantee
     the wrapped file handler is closed when it's out of scope.
     '''
-    def __init__(self, file_obj):
+    session_streams: Dict[Any, int] = {}
```
We can use a `set` in this case. Do you think we might also register a function with `atexit` to make sure all `session_streams` are closed?
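If such a hook were added, a minimal sketch could look like the following; it assumes the `session_streams` keys are the open `StreamWrapper` instances and that `close()` removes an instance from the dict, which is why the keys are copied first:

```python
import atexit

# Adjust the import to wherever StreamWrapper lives in your version.
from torch.utils.data.datapipes.utils.common import StreamWrapper


@atexit.register
def _close_leftover_session_streams():
    # Copy the keys because closing a stream is expected to drop its entry.
    for stream in list(StreamWrapper.session_streams.keys()):
        try:
            stream.close()
        except Exception:
            # Best-effort cleanup at interpreter shutdown; ignore failures.
            pass
```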
Oddly, it doesn't allow putting `self` into a `Set`. Switching back to `Dict`.
…d streams"

Blocked by pytorch/pytorch#78952
- Automatically cleans StreamWrappers from various buffers
- Automatically closes parent unarchive streams
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
- I think we need some tests to reason through the common and complicated cases.
- Question: when a buffer is cleared (say in `Shuffler`), do we need to attempt to call `close_streams`?
```python
        self.name = name
        if parent_stream is not None:
            if not isinstance(parent_stream, StreamWrapper):
                raise RuntimeError('Parent steam should be StreamWrapper, {} was given'.format(type(parent_stream)))
```
Suggested change:

```diff
-                raise RuntimeError('Parent steam should be StreamWrapper, {} was given'.format(type(parent_stream)))
+                raise RuntimeError('Parent stream should be StreamWrapper, {} was given'.format(type(parent_stream)))
```
```python
    def autoclose(self):
        '''
        Marks Steam to close automatically as soon as all child streams are closed.
        '''
        if self.child_counter == 0:
            self.close()
        self.close_on_last_child = True
```
What will be the usage of this? To provide an option to mark streams as `autoclose`? If that is the case, maybe it can be an optional argument to `__init__`?
It will close the stream if there are no active children, or mark it to be closed automatically if there is at least one.
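For illustration, a hypothetical usage sketch; it assumes the constructor accepts an optional `parent_stream` (as in the diff above) and increments the parent's `child_counter` when a child is created:

```python
import io

# Adjust the import to wherever StreamWrapper lives in your version.
from torch.utils.data.datapipes.utils.common import StreamWrapper

# Illustrative stand-ins: in torchdata the parent would be a tar stream and the
# child an extracted member stream.
parent = StreamWrapper(io.BytesIO(b"archive bytes"))
child = StreamWrapper(io.BytesIO(b"member bytes"), parent_stream=parent)

parent.autoclose()  # parent still has a child, so it is only marked close_on_last_child
child.close()       # closing the last child now closes the parent as well
```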
```python
        if self.child_counter == 0:
            self.close()
        self.close_on_last_child = True
```
Do we need to change the implementation of `__del__` here to remove `self` from `StreamWrapper.session_streams`?
I agree with this comment.
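For reference, a minimal sketch of the kind of `__del__` being discussed; it is not the PR's final code and assumes `session_streams` is keyed by the wrapper instances:

```python
def __del__(self):
    # Drop the tracking entry so session_streams does not keep a stale reference.
    StreamWrapper.session_streams.pop(self, None)
    try:
        self.close()
    except Exception:
        # Best-effort cleanup while the object is being garbage collected.
        pass
```

One caveat: as long as `session_streams` holds an ordinary (strong) reference to the wrapper, `__del__` will not fire until that entry is removed somewhere else (e.g. in `close()`), so a weak-reference registry may be the more robust option.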
```python
        if isinstance(v, dict):
            for kk, vv in v.items():
                cls.close_streams(vv, depth=depth + 1)
        elif isinstance(v, list) or isinstance(v, tuple):
```
nit: `isinstance(v, (list, tuple))`
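Putting the nit together with the snippet above, a sketch of the full recursive helper; the depth cutoff value and the direct `StreamWrapper` branch are assumptions, not the PR's exact code:

```python
    @classmethod
    def close_streams(cls, v, depth=0):
        # Recursively walk common container types and close any wrapped streams.
        if depth > 4:
            return  # assumed guard against descending into arbitrarily nested structures
        if isinstance(v, cls):
            v.close()
        elif isinstance(v, dict):
            for vv in v.values():
                cls.close_streams(vv, depth=depth + 1)
        elif isinstance(v, (list, tuple)):
            for vv in v:
                cls.close_streams(vv, depth=depth + 1)
```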
```python
            self.parent_stream.child_counter -= 1
            if not self.parent_stream.child_counter and self.parent_stream.close_on_last_child:
                self.parent_stream.close()
        self.file_obj.close(*args, **kwargs)
```
Suggested change:

```diff
-        self.file_obj.close(*args, **kwargs)
+        try:
+            self.file_obj.close(*args, **kwargs)
+        except AttributeError:
+            pass
```
@VitalyFedyunin has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@pytorchbot merge -g
@pytorchbot successfully started a merge job. Check the current status here.
@VitalyFedyunin your PR has been successfully merged.
[DataLoader] Close open in DataPipe streams on best effort basis (#78952)

Summary:
Adding ability to:
- Track open StreamWrappers with `StreamWrapper.session_streams`
- Automatically close parent StreamWrapper (e.g. a torchdata tar is the parent and extracted file streams are children)
- Close streams in structures discarded by filtering

Pull Request resolved: #78952
Approved by: https://github.com/ejguan
Test Plan: contbuild & OSS CI, see https://hud.pytorch.org/commit/pytorch/pytorch/331c0c18033bf4138c9ea468a8c759865dd8ffff
Reviewed By: ejguan
Differential Revision: D37489935
fbshipit-source-id: 6c0fa4d03b1d957cae9ab6062a00b27b120d68e6
Adding ability to:
- Track open StreamWrappers with `StreamWrapper.session_streams`
- Automatically close parent StreamWrapper (e.g. a torchdata tar is the parent and extracted file streams are children)
- Close streams in structures discarded by filtering

Stack from ghstack (oldest at bottom):

Differential Revision: D37489935