Updating fill_value logic for HDFVirtualBackend #414

@sharkinsspatial

I'm trying to update the "fill value" handling for the HDFVirtualBackend (quoted to denote the massive ambiguity around this loaded term, as described by @rabernat in pydata/xarray#5475 (comment)), but it seems this will be a multi-level issue. My hope is that our logic can follow the interpretation @rabernat set out in pydata/xarray#5475 (comment): the reader uses the HDF dataset's fillvalue property for the Zarr fill_value, and if the HDF dataset has a _FillValue attr, it is preserved as an array attr so that xarray's decoding logic can use it to represent masked values. This is a transition from the old kerchunk logic that was discussed and updated in fsspec/kerchunk#177.
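To make the intended mapping concrete, here is a minimal sketch of that interpretation. It is not the HDFVirtualBackend code: `HDFDatasetInfo` and `to_zarr_array_fields` are hypothetical names, and a plain dataclass stands in for whatever the reader actually extracts from an HDF dataset.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class HDFDatasetInfo:
    """Hypothetical stand-in for the fields a reader pulls from an HDF dataset."""

    fillvalue: Any  # the dataset-level fill value property
    attrs: dict[str, Any] = field(default_factory=dict)  # user attributes


def to_zarr_array_fields(ds: HDFDatasetInfo) -> dict[str, Any]:
    """Map HDF fill information onto Zarr array metadata fields (sketch only)."""
    # The storage-level fill (what uninitialized chunks read as) comes from
    # the HDF dataset's fillvalue property.
    fill_value = ds.fillvalue
    # Attributes are copied through unchanged, so a _FillValue attr (if any)
    # stays available for xarray's CF decoding to mask values.
    attributes = dict(ds.attrs)
    return {"fill_value": fill_value, "attributes": attributes}


example = HDFDatasetInfo(fillvalue=0.0, attrs={"_FillValue": -9999.0, "units": "K"})
fields = to_zarr_array_fields(example)
```

The point of the sketch is that the two values are kept separate: `fields["fill_value"]` carries the storage fill, while `fields["attributes"]["_FillValue"]` carries the masking sentinel, and neither overwrites the other.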

This presents a bit of a chicken-and-egg problem for our transition to Zarr V3 (#392), and specifically for our transition to ArrayV3Metadata (#411) as an internal representation. As outlined by @rabernat, Zarr v2 uses different semantics for fill_value, and the v2 spec is less rigorous about its encoding. The Zarr V3 specification has adopted @rabernat's new semantics and has more detailed guidelines for fill_value encoding.

IIUC, as we transition the HDFVirtualBackend to the new Zarr v3 fill_value semantics and encoding, our kerchunk-based roundtrip tests will break unless we add specific logic to our to_kerchunk serialization that transforms v3 -> v2 semantics. I have not searched the new kerchunk v3 compatibility PR fsspec/kerchunk#516 (comment) exhaustively, but it appears that the fill_value logic used by the HDF reader there has not changed and still follows the v2 semantics, so even serializing to v3-compatible kerchunk dictionaries would fail equivalence assertions.
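As a rough illustration of the encoding half of that v3 -> v2 transform, the sketch below decodes a v3-style JSON float fill_value (a number, one of "NaN" / "Infinity" / "-Infinity", or a hex byte string such as "0x7fc00000") and re-encodes it in the v2 JSON form. The helper names are hypothetical and are not part of to_kerchunk or any library. It covers floats only and does not address the semantic question of which HDF value should end up in fill_value.

```python
import math
import struct


def decode_v3_float_fill(value, bits=64):
    """Decode a Zarr v3 JSON float fill_value into a Python float (sketch)."""
    if isinstance(value, str):
        if value == "NaN":
            return math.nan
        if value == "Infinity":
            return math.inf
        if value == "-Infinity":
            return -math.inf
        if value.startswith("0x"):
            # Hex form: the float's byte pattern written as an unsigned integer.
            raw = int(value, 16).to_bytes(bits // 8, "big")
            fmt = {16: ">e", 32: ">f", 64: ">d"}[bits]
            return struct.unpack(fmt, raw)[0]
        raise ValueError(f"unrecognized v3 float fill_value: {value!r}")
    return float(value)


def encode_v2_float_fill(value):
    """Encode a Python float as a Zarr v2 JSON fill_value (sketch)."""
    if value is None:
        return None  # v2 permits a null fill_value
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


# A float32 NaN written in v3 hex form becomes the v2 string "NaN".
v2_fill = encode_v2_float_fill(decode_v3_float_fill("0x7fc00000", bits=32))
```

Note that re-encoding a hex NaN as "NaN" discards any specific NaN payload, which is one place where a strict roundtrip equivalence assertion could still differ.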

I'll be working through a PR to update the HDFVirtualBackend semantics in our current codebase and I'll assess the extent of test failures and how they can be addressed.

Looking through all of the Zarr v3 PRs it looks like @TomAugspurger has been tackling a lot of the fill_value updates. Maybe he can weigh in here with some advice on the best path forward to adhere to the correct v3 behavior while still maintaining v2 semantic compatibility during our transition period.

Labels: HDF parser (Non-kerchunk-based HDF parser)