Help with Parquet storage #345

@norlandrhagen

Hi there Martin,

I'm trying to learn how to use Kerchunk's parquet reference storage so that I can add a section demonstrating it to the Pythia cookbook. I'm working from this test and the code block in the Kerchunk docs.

My initial attempt is below, but when I try to read the parquet references back I get a FileNotFoundError:

FileNotFoundError: [Errno 2] No such file or directory: '/Users/nrhagen/Documents/carbonplan/pythia/kerchunk-cookbook/notebooks/foundations/combined.parq/TSLB/refs.0.parq'

The only file in the combined.parq directory is ['.zmetadata']. So it seems like I'm not writing the combined reference correctly.
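For reference, a minimal sketch of how the directory contents can be checked, using only the standard library. It assumes the layout implied by the error message above, i.e. one sub-directory per variable holding `refs.N.parq` files; the helper names `top_level_files` and `list_reference_files` are hypothetical and not part of kerchunk or fsspec.

```python
from pathlib import Path


def top_level_files(root):
    """Return the sorted names of plain files directly under `root`."""
    return sorted(p.name for p in Path(root).iterdir() if p.is_file())


def list_reference_files(root):
    """Map each sub-directory of `root` to the sorted names of its refs.*.parq files.

    Assumes one sub-directory per variable (e.g. TSLB/refs.0.parq), as the
    FileNotFoundError path suggests.
    """
    found = {}
    for child in Path(root).iterdir():
        if child.is_dir():
            found[child.name] = sorted(p.name for p in child.glob("refs.*.parq"))
    return found


if __name__ == "__main__":
    root = Path("combined.parq")
    if root.exists():
        print("top-level files:", top_level_files(root))
        print("per-variable reference files:", list_reference_files(root))
```

In my case the first call reports only `.zmetadata` and the second finds no variable directories at all.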

Also, I'm not clear on this line from the test: LazyReferenceMapper.create(10, temp_dir, fs). Is the first argument of LazyReferenceMapper.create the number of input files, or something else?
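One possible reading, stated as an assumption to be confirmed rather than an answer: the first argument might be a record size, meaning the number of chunk references stored per `refs.N.parq` file. Under that assumption, and further assuming chunk indices are flattened in row-major order, the mapping from a chunk key to a file would look roughly like the sketch below. `record_for_chunk` is a hypothetical helper and does not call fsspec.

```python
def record_for_chunk(chunk_key, chunks_per_dim, record_size):
    """Return (record, row) for a Zarr chunk key such as "1.2.3".

    chunks_per_dim is the number of chunks along each dimension.
    Assumption: the chunk index is flattened in row-major (C) order and
    each refs.N.parq file holds `record_size` consecutive references.
    """
    idx = [int(part) for part in chunk_key.split(".")]
    flat = 0
    for i, n in zip(idx, chunks_per_dim):
        flat = flat * n + i  # row-major flattening of the chunk index
    return flat // record_size, flat % record_size


# Example: a variable split into 2 x 3 x 4 chunks, 10 references per file
record, row = record_for_chunk("1.2.3", (2, 3, 4), record_size=10)
print(f"TSLB/refs.{record}.parq, row {row}")
```

If that reading is right, the value 10 in the test would be unrelated to the number of input files.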

Thanks again for the help! It would be great to figure out how to use the parquet functionality.

import os
from tempfile import TemporaryDirectory

import fsspec
import xarray as xr
from fsspec.implementations.reference import LazyReferenceMapper, ReferenceFileSystem
from kerchunk import df, hdf
from kerchunk.combine import MultiZarrToZarr

file_pattern = [
  's3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc',
  's3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/WRFDS_2060-01-02.nc'
]

single_ref_sets = [hdf.SingleHdf5ToZarr(_).translate() for _ in file_pattern]

fs = fsspec.filesystem("file")
td = TemporaryDirectory()
temp_dir = td.name  # already a str
# Lazy, parquet-backed reference store rooted at temp_dir
out = LazyReferenceMapper.create(10, temp_dir, fs)

mzz = MultiZarrToZarr(
    single_ref_sets,
    remote_protocol="memory",
    concat_dims=["Time"],
    identical_dims=["south_north", "west_east", "interp_levels", "soil_layers_stag"],
    out=out,
).translate()

# Create the output directory if needed, then write the combined references
os.makedirs("combined.parq", exist_ok=True)
df.refs_to_dataframe(mzz, "combined.parq")

# Read the references back lazily and open them with xarray
fs = ReferenceFileSystem("combined.parq", lazy=True)
ds = xr.open_dataset(
    fs.get_mapper(),
    engine="zarr",
    backend_kwargs={"consolidated": False},
)

cc @rsignell-usgs if you've already figured this out!
