
Challenge 12 - Compression of Geospatial Data with Varying Information Density #3

@EsperanzaCuartero

Description


Stream 1 - Software Development for Earth Sciences

Goal

Develop an information-density-adapting compression scheme

  • Implement compression and bitinformation retrieval on a per-chunk basis
  • Analyse the information density of climate variables across time and space
  • Study the optimal chunk size for different features (hurricanes, the Gulf Stream, precipitating clouds) and resolutions (large-eddy simulation vs. GCM)
  • Generally improve xbitinfo performance
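The core operation behind these goals — rounding away low-information mantissa bits so that a lossless codec can squeeze the data — can be sketched in plain NumPy. This is a simplified stand-in for xbitinfo's bitrounding, not its actual implementation; the function name `bitround` is illustrative:

```python
import numpy as np

def bitround(a, keepbits):
    """Round float32 values to `keepbits` explicit mantissa bits.

    Trailing mantissa bits are zeroed (with round-to-nearest), which
    makes the array far more compressible for any lossless codec
    applied afterwards (e.g. zlib or zstd inside netCDF/Zarr).
    """
    assert 0 < keepbits <= 23          # float32 has 23 explicit mantissa bits
    drop = 23 - keepbits
    if drop == 0:
        return np.asarray(a, dtype=np.float32)
    b = np.ascontiguousarray(a, dtype=np.float32).view(np.uint32)
    half = np.uint32(1 << (drop - 1))               # for round-to-nearest
    mask = np.uint32(0xFFFFFFFF) ^ np.uint32((1 << drop) - 1)
    return ((b + half) & mask).view(np.float32)

x = np.array([0.1, 1.2345678, 100.5], dtype=np.float32)
print(bitround(x, 7))   # roughly 2 significant decimal digits survive
```

With keepbits=7 the relative rounding error stays below 2⁻⁷; xbitinfo's contribution is choosing `keepbits` automatically from the data's real information content rather than by hand.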

Mentors and skills

  • Mentors: Miha Razinger, Juan Jose Dominguez (both ECMWF), Milan Klöwer (MIT), Hauke Schulz (University of Washington)
  • Skills required:
    • Python
    • Git
    • Familiarity with xarray
    • Zarr and Dask are beneficial

Note: Only nationals or residents from the ECMWF Member States and Co-operating States are eligible to participate (see Terms and Conditions).


Challenge description

Geospatial data can vary in its information density from one part of the world to another. A dataset containing streets will be very dense in cities but contain little information in remote places like the Alps or even the ocean. The same is true for datasets about the ocean or the atmosphere. The variability of sea surface temperatures and currents is much larger in the vicinity of the Gulf Stream than in the middle of the Atlantic basin. This variability can also change in time: a hurricane, for example, has a lot of variability in winds, temperature and rain rates, and additionally travels across entire ocean basins.

The challenge of this project is to improve xbitinfo so that it preserves the natural variability of these features without saving random noise where the real information density is low. In particular, the number of bits that need to be preserved during compression changes with location: a hurricane has a different information density than a same-sized area in the steadily blowing trade-wind regimes. The compressibility of climate data can therefore change drastically in time and space, which we want to exploit.

Currently, in the bitinformation framework, preserving all real information requires applying the maximum information content calculated by xbitinfo to the entire dataset. However, bitinformation can also be calculated on subsets, so that the ‘boring’ parts can be compressed more efficiently.
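As a toy illustration of subset-dependent compressibility — not xbitinfo's actual mutual-information estimator, and `chunk_keepbits` with its heuristic is invented here — one could size the per-chunk bit budget so that the rounding error stays well below the chunk's own variability:

```python
import numpy as np

def chunk_keepbits(chunk, headroom=3, lo=2, hi=23):
    """Crude per-chunk mantissa-bit budget (illustrative only).

    Keeps enough bits that the rounding error is ~2**headroom times
    smaller than the chunk's standard deviation. xbitinfo instead
    estimates real information content from the data's bit patterns,
    but the idea of a per-chunk budget is the same.
    """
    chunk = np.asarray(chunk, dtype=np.float32)
    scale = float(np.abs(chunk).max())
    spread = float(chunk.std())
    if scale == 0.0 or spread == 0.0:
        return lo                      # (near-)constant chunk: almost free
    kb = int(np.ceil(np.log2(scale / spread))) + headroom
    return int(np.clip(kb, lo, hi))

# A calm region: tiny real variations on a large mean (needs many bits)...
calm = 300.0 + 0.01 * np.sin(np.linspace(0.0, 6.28, 100, dtype=np.float32))
# ...versus an active region with O(1) relative variability (needs fewer).
active = 10.0 * np.sin(np.linspace(0.0, 31.4, 100, dtype=np.float32))
print(chunk_keepbits(calm), chunk_keepbits(active))
```

A per-chunk budget like this is what lets the ‘boring’ chunks be stored with far fewer retained bits than a single dataset-wide maximum would demand.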

Xbitinfo is an open-source Python package that enables lossy compression of geospatial data based on its information content. Embedded in the Pangeo ecosystem, xbitinfo builds on top of xarray and dask and allows fast compression and analysis of various data formats, including netCDF and Zarr. Xbitinfo addresses the challenge of the increasingly large, chunked datasets created by ever-growing compute power. Climate simulations at sub-km resolution with petabytes of output are just one example where xbitinfo can help keep datasets manageable.

The successful applicant will refine the application of xbitinfo to data subsections (chunks) and improve our ability to compress spatially and temporally varying fields. Furthermore, the applicant will learn about information theory and software engineering from international mentors.

