fix(RemoteStore): avoid listing all objects in remote store in empty() method #2312
src/zarr/storage/remote.py
Outdated
-        return not await self.fs._find(self.path, withdirs=True)
+        async for _path, _dirs, files in self.fs._walk(self.path):
+            # stop once a file is found
+            if files:
Do included directories count as content? The original implied yes.
We only check for files in the LocalStore:
zarr-python/src/zarr/storage/local.py
Lines 101 to 111 in 5a134bf
    async def empty(self) -> bool:
        try:
            with os.scandir(self.root) as it:
                for entry in it:
                    if entry.is_file():
                        # stop once a file is found
                        return False
        except FileNotFoundError:
            return True
        else:
            return True
src/zarr/storage/remote.py
Outdated
@@ -94,7 +94,11 @@ async def clear(self) -> None:
             pass

     async def empty(self) -> bool:
-        return not await self.fs._find(self.path, withdirs=True)
+        async for _path, _dirs, files in self.fs._walk(self.path):
It cannot be guaranteed that this doesn't do a `find()` too... Would `_ls()` be enough? There was talk of making `ls()` and friends in fsspec be generators, but not yet.
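For illustration, a minimal sketch of what an `_ls()`-based check could look like: list one level at a time with `detail=True` and return as soon as an entry of type `"file"` turns up. `FakeFS` and `empty_via_ls` are hypothetical names, not part of zarr-python or fsspec; the fake only assumes the listing returns dicts with `name` and `type` keys, and real backends may behave differently.

```python
import asyncio
from collections import deque


class FakeFS:
    """Hypothetical in-memory stand-in for an async filesystem (illustration only)."""

    def __init__(self, tree):
        # tree maps a directory path to its entries as (name, type) pairs
        self.tree = tree
        self.ls_calls = 0  # counts how many listings were issued

    async def _ls(self, path, detail=True):
        self.ls_calls += 1
        if path not in self.tree:
            raise FileNotFoundError(path)
        return [{"name": f"{path}/{name}", "type": kind} for name, kind in self.tree[path]]


async def empty_via_ls(fs, root):
    """Return True if no file exists under root, listing one level at a time."""
    pending = deque([root])
    while pending:
        path = pending.popleft()
        try:
            entries = await fs._ls(path, detail=True)
        except FileNotFoundError:
            continue  # a missing prefix holds no files
        for entry in entries:
            if entry["type"] == "file":
                return False  # stop once a file is found
        # only descend when this level held no files at all
        pending.extend(e["name"] for e in entries if e["type"] == "directory")
    return True


fs = FakeFS({
    "store": [("zarr.json", "file"), ("a", "directory")],
    "store/a": [("c", "directory")],
    "store/a/c": [("0", "file")],
})
print(asyncio.run(empty_via_ls(fs, "store")), fs.ls_calls)  # prints: False 1
```

In this sketch a store with a file at the top level costs a single listing; a store made only of nested empty directories still costs one listing per directory.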
gotcha... This was an improvement using gcsfs but I can see how it wouldn't be for others. I'll take your lead on the best path forward here. All we need to know is if there are any files in the store (ignoring empty directories).
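A rough sketch of the early-exit idea, assuming the walk yields `(path, dirs, files)` lazily, one directory at a time. `FakeFS` is a hypothetical stand-in that only mimics that shape; as noted above, a real backend's `_walk` may still list everything internally, in which case the early return saves nothing.

```python
import asyncio


class FakeFS:
    """Hypothetical stand-in; mimics only the (path, dirs, files) shape of a walk."""

    def __init__(self, tree):
        # tree maps a directory path to (subdirectory names, file names)
        self.tree = tree
        self.visited = 0  # counts directories actually listed

    async def _walk(self, path):
        dirs, files = self.tree.get(path, ([], []))
        self.visited += 1
        yield path, dirs, files
        for d in dirs:
            async for item in self._walk(f"{path}/{d}"):
                yield item


async def empty(fs, root):
    """Return True if the walk finds no files; empty directories are ignored."""
    async for _path, _dirs, files in fs._walk(root):
        if files:
            return False  # stop once a file is found
    return True


tree = {
    "store": (["a", "b"], []),
    "store/a": ([], ["0"]),
    "store/b": ([], ["0", "1"]),
}
fs = FakeFS(tree)
print(asyncio.run(empty(fs, "store")), fs.visited)  # prints: False 2
```

Here the walk stops after the second directory and never lists `store/b`, which is the saving this change relies on.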
Given the very many files that can in theory be in a zarr hierarchy, maybe we can keep with this strategy for now. Having said that, in the zarr case, we do know which subdir ought to have the fewest items in it, right?
thanks for this. perennial question: what kind of performance monitoring should we have in place to flag a regression here (or anywhere)?
…nto fix/remote-store-empty-speedup-walk
…man/zarr-python into fix/remote-store-empty-speedup-walk
Some initial exploration with the RemoteStore found that the `empty()` check was taking a very long time. This sped it up considerably.
TODO: