Skip to content

Conversation

TomNicholas
Copy link
Member

@TomNicholas TomNicholas commented Jun 25, 2025

Ensures that the new KerchunkJSONParser can be used on in-memory python dictionaries containing JSON, not just on-disk JSON files. This is a really nice demonstration of the power of delegating all IO code to obstore.

  • Closes #xxxx
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

@TomNicholas TomNicholas added Kerchunk Relating to the kerchunk library / specification itself obstore labels Jun 25, 2025
Comment on lines 361 to 363
assert manifeststore._store_registry._stores == {"": memory_store}
# TODO should it be this instead?
# assert manifeststore._store_registry._stores == {"memory://": memory_store}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxrjones this is the only part of this PR that I'm not sure about

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that doesn't look great, I think you don't actually want the MemoryStore in the ObjectStoreRegistry since it should only be useful for parsing the Kerchunk dict rather than loading data, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory there could be actual data in the kerchunk references (if they are inlined). I also feel like if we explicitly pass an ObjectStore that store should always end up in the registry?

It was more the prefix being "" that I was asking about

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory there could be actual data in the kerchunk references (if they are inlined). I also feel like if we explicitly pass an ObjectStore that store should always end up in the registry?

It was more the prefix being "" that I was asking about

My intuition is that having a MemoryStore as the fall-back is not a good idea. ObjectStoreRegistry will search for stores in the registry that match the start of the path indexed from ChunkManifest. Since all strings start with "", the ManifestStore will use the MemoryStore for any requests that aren't first matched by another store. I think this will lead to confusing errors.

What is the path in the ChunkManifest for a inlined variable? Does it start with memory:// (in which case that would work as a prefix for the ObjectStoreRegistry)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the path in the ChunkManifest for a inlined variable? Does it start with memory://?

No-one's ever made one, because we never fixed #489 😅 But I think we should, and I think memory:// would make sense as a prefix for that. To make that work I guess we would need the kerchunk parser to know that if it finds inlined data it should put that data into a MemoryStore and then create a chunk reference that refers to it?

that would work as a prefix for the ObjectStoreRegistry

That idea works in the sense that if I manually create and pass ObjectStoreRegistry({"memory://": memory_store}) then I can have the memory_store and optionally additional stores for actually getting referenced chunks.

But it doesn't work in the sense that if I do

memory_store = obstore.store.MemoryStore()
parser = KerchunkJSONParser(store_registry=None)
parser("refs.json", memory_store)

then the get_store_prefix call occurs before we get anywhere near creating a chunkmanifest, instead happening at a point that causes get_store_prefix to (incorrectly) have to guess which prefix to use.

I think the better solution to that would be to move the fs_root logic to be earlier in the parser, so that the local filepaths for chunks in the kerchunk references are disambiguated before get_store_prefix is called. This is consistent with what I said above - if the kerchunk parser finds inlined data it should create a chunk reference with a memory:// prefix, if it finds an ambiguous path like data.nc it should prepend it with fs_root. Then get_store_prefix will have enough information to work with.

But now I have a working example so I'm tempted to punt on that follow-up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See 3641679

Copy link
Member Author

@TomNicholas TomNicholas Jun 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to merge this. It barely changes existing behaviour, so if it's wrong some how we can fix it later. I'll make an issue to track the fs_root and inlined kerchunk data questions above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to merge this. It doesn't change existing behaviour, so if it's wrong some how we can fix it later. I'll make an issue to track the fs_root and inlined kerchunk data questions above.

sounds good, I won't have time for a review today and don't want you to be blocked.

@TomNicholas TomNicholas merged commit a061184 into zarr-developers:develop Jun 27, 2025
11 checks passed
@TomNicholas TomNicholas deleted the parse_kerchunk_dict_using_memorystore branch June 27, 2025 13:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Kerchunk Relating to the kerchunk library / specification itself obstore
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants