-
Notifications
You must be signed in to change notification settings - Fork 46
Description
I'm working through trying to update the "fill value" (quoted to denote the massive ambiguity around this loaded term as described by @rabernat in pydata/xarray#5475 (comment)) for the HDFVirtualBackend
but it seems this will be a multi-level issue. My hope is our logic can follow the interpretation that @rabernat set out in pydata/xarray#5475 (comment) so that the reader will use the HDF dataset fillvalue
property for the the Zarr fill_value
and if the HDF dataset has a _FillValue
attr
that will be preserved as an array attr
for use by xarray's decoding logic to represent masked values. This is a transition from the the old kerchunk
logic that was discussed and updated in fsspec/kerchunk#177.
This presents a bit of a chicken and egg problem for our transition to Zarr V3 #392 and specifically our transition to ArrayV3Metadata
#411 as an internal representation. As outlined by @rabernat, Zarr v2 uses different semantics for fill_value
and the spec is less rigorous about its encoding. The Zarr V3 specification has adopted @rabernat's new semantics and has more detailed guidelines for fill_value encoding.
IIUC as we transition the HDFVirtualBackend
to use the new Zarr v3 fill_value
semantics and encoding our kerchunk
based roundtrip tests will be broken unless we add specific logic to our to_kerchunk
serialization that transform v3 -> v2 semantics. I have not searched the new kerchunk v3 compatibility PR fsspec/kerchunk#516 (comment) exhaustively but it appears that the semantic fill_value logic used by the HDF reader has not changed and is still using the v2 logic so even serializing to v3 compatible kerchunk dictionaries would fail equivalence assertions.
I'll be working through a PR to update the HDFVirtualBackend semantics in our current codebase and I'll assess the extent of test failures and how they can be addressed.
Looking through all of the Zarr v3 PRs it looks like @TomAugspurger has been tackling a lot of the fill_value
updates. Maybe he can weigh in here with some advice on the best path forward to adhere to the correct v3 behavior while still maintaining v2 semantic compatibility during our transition period.