fix(RemoteStore): avoid listing all objects in remote store in empty() method #2312

Merged

@jhamman (Member) commented Oct 8, 2024

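Initial exploration with the RemoteStore showed that the empty() check was taking a very long time because it listed every object in the store. This change walks the store and stops at the first file found, which sped the check up considerably.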

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@jhamman requested a review from martindurant on October 8, 2024 19:40
-        return not await self.fs._find(self.path, withdirs=True)
+        async for _path, _dirs, files in self.fs._walk(self.path):
+            # stop once a file is found
+            if files:

Member

Do included directories count as content? The original implied yes.

Member Author

We only check for files in the LocalStore:

    async def empty(self) -> bool:
        try:
            with os.scandir(self.root) as it:
                for entry in it:
                    if entry.is_file():
                        # stop once a file is found
                        return False
        except FileNotFoundError:
            return True
        else:
            return True

@@ -94,7 +94,11 @@ async def clear(self) -> None:
             pass

     async def empty(self) -> bool:
-        return not await self.fs._find(self.path, withdirs=True)
+        async for _path, _dirs, files in self.fs._walk(self.path):

Member

It cannot be guaranteed that this doesn't do a find() too... Would _ls() be enough?

There was talk of making ls() and friends in fsspec be generators, but not yet.
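
To illustrate the concern, here is a minimal sketch that uses a toy in-memory tree instead of fsspec. `CountingFS`, `walk_lazy`, `walk_eager` and `TREE` are hypothetical names invented for this example and say nothing about how any particular fsspec backend implements `_walk()`. It only shows that an `async for` loop with an early exit saves work when the walker lists lazily, and saves nothing when the walker gathers the full listing before yielding.

```python
import asyncio

# Hypothetical in-memory tree: directory -> (subdirectories, files).
TREE = {
    "root": (["root/a", "root/b"], ["root/zarr.json"]),
    "root/a": ([], ["root/a/c0"]),
    "root/b": ([], ["root/b/c0"]),
}


class CountingFS:
    """Toy filesystem that counts directory listings (not an fsspec class)."""

    def __init__(self):
        self.listings = 0

    def _list(self, path):
        self.listings += 1
        return TREE[path]

    async def walk_lazy(self, path):
        # Lists one directory per step, so the caller can stop early.
        dirs, files = self._list(path)
        yield path, dirs, files
        for d in dirs:
            async for item in self.walk_lazy(d):
                yield item

    async def walk_eager(self, path):
        # Gathers the whole tree first, then yields from the finished result.
        pending, results = [path], []
        while pending:
            p = pending.pop()
            dirs, files = self._list(p)
            results.append((p, dirs, files))
            pending.extend(dirs)
        for item in results:
            yield item


async def listings_until_first_file(walk_name):
    fs = CountingFS()
    async for _path, _dirs, files in getattr(fs, walk_name)("root"):
        if files:
            break  # same early exit as in the patched empty()
    return fs.listings


lazy_count = asyncio.run(listings_until_first_file("walk_lazy"))
eager_count = asyncio.run(listings_until_first_file("walk_eager"))
print(lazy_count, eager_count)
```

In this toy setup the lazy walker lists a single directory before the loop exits, while the eager walker lists all three, even though the calling code is identical.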

Member Author

gotcha... This was an improvement using gcsfs but I can see how it wouldn't be for others. I'll take your lead on the best path forward here. All we need to know is if there are any files in the store (ignoring empty directories).
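
As a rough local-filesystem analogue of that requirement (any file counts, empty directories do not), the following sketch uses only the standard library. `has_any_file` is a hypothetical helper written for this illustration, not part of zarr or fsspec, and it searches recursively, unlike the top-level-only LocalStore check quoted earlier.

```python
import os
import tempfile


def has_any_file(root: str) -> bool:
    """Return True as soon as any file is found under root (hypothetical helper)."""
    # os.walk is a lazy generator: directories are scanned only as iteration advances.
    for _path, _dirs, files in os.walk(root):
        if files:
            return True
    # No files anywhere (a missing root also yields nothing).
    return False


with tempfile.TemporaryDirectory() as tmp:
    # Only empty directories exist at this point.
    os.makedirs(os.path.join(tmp, "a", "b"))
    only_dirs = has_any_file(tmp)

    # Add a single nested file and check again.
    with open(os.path.join(tmp, "a", "b", "chunk"), "w") as f:
        f.write("x")
    with_file = has_any_file(tmp)
```

Here `only_dirs` is False and `with_file` is True, so a store under this definition is empty exactly when `has_any_file` returns False.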

Member

Given the very many files that can in theory be in a zarr hierarchy, maybe we can keep with this strategy for now. Having said that, in the zarr case, we do know which subdir ought to have the fewest items in it, right?

Contributor

@d-v-b left a comment

thanks for this. perennial question: what kind of performance monitoring should we have in place to flag a regression here (or anywhere)?

@jhamman added the V3 label Oct 10, 2024
@jhamman added this to the 3.0.0 milestone Oct 10, 2024
@jhamman merged commit cef4552 into zarr-developers:v3 Oct 10, 2024
21 checks passed
Projects
Status: Done
3 participants