Description
Hi there Martin,
I'm trying to learn how to use the parquet storage in the hope of creating a section in the Pythia cookbook to demonstrate this. I'm going off of this test and the code block in the Kerchunk docs.
My initial attempt is below, but when I try to read the parquet I get a `FileNotFoundError`:

```
FileNotFoundError: [Errno 2] No such file or directory: '/Users/nrhagen/Documents/carbonplan/pythia/kerchunk-cookbook/notebooks/foundations/combined.parq/TSLB/refs.0.parq'
```
The only file in the `combined.parq` directory is `.zmetadata`, so it seems like I'm not writing the combined references correctly.
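For anyone else poking at this, here is a small stdlib-only helper I've been using to see what actually lands in the reference directory. The per-variable layout (one subdirectory per variable, each holding `refs.N.parq` partition files, plus a top-level `.zmetadata`) is just my reading of the error message above, not documented behavior:

```python
import os


def list_ref_partitions(root):
    """List the .parq partition files under each variable subdirectory
    of a parquet reference store (hypothetical layout inferred from the
    FileNotFoundError path: <root>/<variable>/refs.N.parq)."""
    layout = {}
    for entry in sorted(os.listdir(root)):
        path = os.path.join(root, entry)
        if os.path.isdir(path):
            layout[entry] = sorted(
                f for f in os.listdir(path) if f.endswith(".parq")
            )
    return layout
```

In my case this returns an empty dict, which matches the `.zmetadata`-only contents described above.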
Also, I'm not clear on this line from the test: `LazyReferenceMapper.create(10, temp_dir, fs)`. Is the first argument of `LazyReferenceMapper.create` the number of input files, or something else?
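To make the question concrete: my working (unverified) assumption is that the first argument is a record size, i.e. how many references are stored per `refs.N.parq` partition file, rather than the number of input files. Under that assumption, the number of partition files per variable would be:

```python
import math


def expected_partitions(n_refs, record_size):
    """Partition count per variable, ASSUMING the first argument to
    LazyReferenceMapper.create is a per-file record size (my guess,
    not confirmed): ceil(number of references / record_size)."""
    return math.ceil(n_refs / record_size)


# e.g. a variable with 25 chunk references and record_size=10
# would span 3 partition files under this assumption
expected_partitions(25, 10)
```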
Thanks again for the help! It would be great to figure out how to use the parquet functionality.
```python
import os
from tempfile import TemporaryDirectory

import fsspec
import xarray as xr
from fsspec.implementations.reference import LazyReferenceMapper, ReferenceFileSystem
from kerchunk import df, hdf
from kerchunk.combine import MultiZarrToZarr

file_pattern = [
    "s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/WRFDS_2060-01-01.nc",
    "s3://wrf-se-ak-ar5/ccsm/rcp85/daily/2060/WRFDS_2060-01-02.nc",
]

# Build single-file reference sets
single_ref_sets = [hdf.SingleHdf5ToZarr(f).translate() for f in file_pattern]

# Combine, writing references through a LazyReferenceMapper
fs = fsspec.filesystem("file")
td = TemporaryDirectory()
temp_dir = str(td.name)
out = LazyReferenceMapper.create(10, temp_dir, fs)

mzz = MultiZarrToZarr(
    single_ref_sets,
    remote_protocol="memory",
    concat_dims=["Time"],
    identical_dims=["south_north", "west_east", "interp_levels", "soil_layers_stag"],
    out=out,
).translate()

# Write the combined references to parquet
if not os.path.exists("combined.parq"):
    os.makedirs("combined.parq")
df.refs_to_dataframe(mzz, "combined.parq")

# Try to read the parquet references back -- this is where it fails
fs = ReferenceFileSystem("combined.parq", lazy=True)
ds = xr.open_dataset(
    fs.get_mapper(), engine="zarr", backend_kwargs={"consolidated": False}
)
```
cc @rsignell-usgs if you've already figured this out!