-
Notifications
You must be signed in to change notification settings - Fork 49
Parse kerchunk dict using memorystore #631
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Parse kerchunk dict using memorystore #631
Conversation
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
assert manifeststore._store_registry._stores == {"": memory_store} | ||
# TODO should it be this instead? | ||
# assert manifeststore._store_registry._stores == {"memory://": memory_store} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maxrjones this is the only part of this PR that I'm not sure about
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that doesn't look great, I think you don't actually want the MemoryStore in the ObjectStoreRegistry since it should only be useful for parsing the Kerchunk dict rather than loading data, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory there could be actual data in the kerchunk references (if they are inlined). I also feel like if we explicitly pass an ObjectStore
that store should always end up in the registry?
It was more the prefix being ""
that I was asking about
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory there could be actual data in the kerchunk references (if they are inlined). I also feel like if we explicitly pass an ObjectStore that store should always end up in the registry?
It was more the prefix being "" that I was asking about
My intuition is that having a MemoryStore
as the fall-back is not a good idea. ObjectStoreRegistry
will search for stores in the registry that match the start of the path
indexed from ChunkManifest
. Since all strings start with ""
, the ManifestStore will use the MemoryStore
for any requests that aren't first matched by another store. I think this will lead to confusing errors.
What is the path
in the ChunkManifest
for a inlined variable? Does it start with memory://
(in which case that would work as a prefix for the ObjectStoreRegistry
)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the path in the ChunkManifest for a inlined variable? Does it start with
memory://
?
No-one's ever made one, because we never fixed #489 😅 But I think we should, and I think memory://
would make sense as a prefix for that. To make that work I guess we would need the kerchunk parser to know that if it finds inlined data it should put that data into a MemoryStore
and then create a chunk reference that refers to it?
that would work as a prefix for the
ObjectStoreRegistry
That idea works in the sense that if I manually create and pass ObjectStoreRegistry({"memory://": memory_store})
then I can have the memory_store
and optionally additional stores for actually getting referenced chunks.
But it doesn't work in the sense that if I do
memory_store = obstore.store.MemoryStore()
parser = KerchunkJSONParser(store_registry=None)
parser("refs.json", memory_store)
then the get_store_prefix
call occurs before we get anywhere near creating a chunkmanifest, instead happening at a point that causes get_store_prefix
to (incorrectly) have to guess which prefix to use.
I think the better solution to that would be to move the fs_root
logic to be earlier in the parser, so that the local filepaths for chunks in the kerchunk references are disambiguated before get_store_prefix
is called. This is consistent with what I said above - if the kerchunk parser finds inlined data it should create a chunk reference with a memory://
prefix, if it finds an ambiguous path like data.nc
it should prepend it with fs_root
. Then get_store_prefix
will have enough information to work with.
But now I have a working example so I'm tempted to punt on that follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See 3641679
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to merge this. It barely changes existing behaviour, so if it's wrong some how we can fix it later. I'll make an issue to track the fs_root
and inlined kerchunk data questions above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to merge this. It doesn't change existing behaviour, so if it's wrong some how we can fix it later. I'll make an issue to track the
fs_root
and inlined kerchunk data questions above.
sounds good, I won't have time for a review today and don't want you to be blocked.
Ensures that the new
KerchunkJSONParser
can be used on in-memory python dictionaries containing JSON, not just on-disk JSON files. This is a really nice demonstration of the power of delegating all IO code to obstore.docs/releases.rst
New functions/methods are listed inapi.rst