Regression in 0.0.8-0.0.9 release causes race condition & segfault in eccodes grib_string_length #328

After upgrading from kerchunk==0.0.8 to kerchunk==0.0.9, I get an intermittent segfault when reading my HRRR grib files. The problem persists in kerchunk==0.1.0.

GDB shows:
```
Thread 7 "python3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xffff7da0e120 (LWP 20659)]
0x0000ffff820b3450 in grib_string_length () from /lib/aarch64-linux-gnu/libeccodes.so.0
```
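The GDB frame only names the C function, not the Python call that led to it. One way to get the Python side might be the standard-library `faulthandler` module, which dumps a traceback for every thread when the process receives SIGSEGV. This is a minimal sketch that I have not run against the repro; the helper name `enable_crash_tracebacks` and the log file name are made up for illustration.

```python
import faulthandler


def enable_crash_tracebacks(log_path="segfault_traceback.log"):
    """Write the Python traceback of every thread to log_path on a fatal signal.

    Hypothetical helper: the name and default path are illustrative only.
    """
    log_file = open(log_path, "w")
    # all_threads=True matters here because the crash is in a worker thread
    # ("Thread 7" in the GDB output), not necessarily the main thread.
    faulthandler.enable(file=log_file, all_threads=True)
    # The file must stay open for as long as the handler is enabled,
    # so hand the reference back to the caller.
    return log_file
```

Calling this at the top of hrrr_repro.py, before the dataset is opened, should leave the per-thread tracebacks in the log file the next time the segfault hits.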

It appears to be a race condition in the dask workers when I call `to_dataframe` on a slice of the dataset. It only happens about one time in five. I tried wrapping the call in a for loop that runs until it produces the fault, but I can't seem to reset the state of the dask workers sufficiently within one process to make that happen.
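Since the in-process loop does not reset worker state, an alternative might be to launch the whole script in a fresh interpreter for each attempt, so every run starts with new dask workers. The sketch below assumes a POSIX system, where a child killed by a signal reports a negative return code; `run_until_segfault` is a hypothetical helper, not something from the gist.

```python
import signal
import subprocess
import sys


def run_until_segfault(cmd, max_attempts=50):
    """Run cmd in a new process per attempt.

    Returns the 1-based attempt number that died with SIGSEGV,
    or None if no attempt did within max_attempts.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd)
        # On POSIX, a child terminated by signal N has returncode -N.
        if result.returncode == -signal.SIGSEGV:
            return attempt
    return None


# Example call (assumes hrrr_repro.py from the gist is in the current directory):
# run_until_segfault([sys.executable, "hrrr_repro.py"])
```

At roughly one failure in five runs, a few dozen attempts should be enough to hit the fault, which would also make it practical to bisect between 0.0.8 and 0.0.9.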

hrrr_repro.py, mzz.zarr (multizarr file from HRRR grib) and the terminal output of the repro case are in this [gist](https://gist.github.com/emfdavid/3bea332406088b2ff36dec32e212d2a3), including all the library version details.

I can try rerunning scangrib to produce the input artifacts with the new library versions. I have not done that yet, but we have several years of HRRR surface output scanned and aggregated that I hope to keep using until I have time to replace them with the new parquet format.

