Regression in 0.0.8-0.0.9 release causes race condition & segfault in eccodes grib_string_length #328

@emfdavid

After upgrading from kerchunk==0.0.8 to kerchunk==0.0.9 I get an intermittent segfault reading my HRRR grib files. The problem persists in kerchunk==0.1.0.

GDB shows:

Thread 7 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff7da0e120 (LWP 20659)]
0x0000ffff820b3450 in grib_string_length () from /lib/aarch64-linux-gnu/libeccodes.so.0

It appears to be a race condition in the dask workers when I call to_dataframe on a slice of the dataset. It happens only about one run in five. I tried wrapping the call in a for loop that runs until the fault occurs, but I can't seem to reset the state of the dask workers sufficiently to trigger it that way.

hrrr_repro.py, mzz.zarr (the multizarr file built from HRRR grib files), and the terminal output of the repro case are in this gist, along with all the library version details.
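
Since an in-process loop does not seem to reset the worker state, one option might be to launch each attempt in a fresh interpreter, so every run starts clean. The following is only a minimal sketch under that assumption: `run_until_crash` and the attempt limit are hypothetical names and values chosen for illustration, not part of kerchunk or of the gist. It also sets `PYTHONFAULTHANDLER=1`, which asks CPython to dump the Python-level stack on SIGSEGV and could complement the GDB output above.

```python
import os
import signal
import subprocess
import sys


def run_until_crash(cmd, max_attempts=50):
    """Run cmd in a fresh process per attempt until a signal kills it.

    Returns (attempt_number, returncode) for the first crash, or None
    if every attempt exits on its own. (Hypothetical helper.)
    """
    # PYTHONFAULTHANDLER=1 enables CPython's faulthandler in the child,
    # so a segfault also prints the Python tracebacks to stderr.
    env = dict(os.environ, PYTHONFAULTHANDLER="1")
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, env=env)
        # On POSIX, a negative return code means the child was
        # terminated by that signal number (SIGSEGV gives -11 on Linux).
        if result.returncode < 0:
            return attempt, result.returncode
    return None
```

Assuming the script from the gist is in the working directory, it could be driven with `run_until_crash([sys.executable, "hrrr_repro.py"])`; on a crash, `signal.Signals(-returncode).name` gives the signal name.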

I can try rerunning scan_grib to regenerate the input artifacts with the new library versions; I have not done that yet. However, we have several years of HRRR surface output scanned and aggregated, which I hope to keep using until I have time to replace them with the new parquet format.
