Description
(Follow-up of the discussion here: #1024 (comment).)
xarray + dask.array successfully enable out-of-core computation for very large variables that don't fit in memory. One current limitation is that the indexes of a Dataset or DataArray, which rely on pandas.Index, are still fully loaded into memory (and they will even be loaded eagerly after #1024). In many cases this is not a problem, as 1-dimensional indexes are usually much smaller than the n-dimensional variables or coordinates they index.
However, this may be problematic in some specific cases where we have to deal with very large indexes. For example, big unstructured meshes often have coordinates (x, y, z) arranged as 1-d arrays whose length equals the number of nodes, which can be very large! (See, e.g., the ugrid conventions.)
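To give a rough idea of what such a dataset looks like, here is a minimal, made-up example (the dimension name, sizes and chunking are arbitrary) with dask-backed node coordinates:

```python
import dask.array as da
import xarray as xr

# Hypothetical unstructured-mesh dataset: node coordinates stored as long
# 1-d dask arrays along a single "node" dimension.
n_nodes = 50_000_000
ds = xr.Dataset(
    coords={
        "x": ("node", da.random.random(n_nodes, chunks=5_000_000)),
        "y": ("node", da.random.random(n_nodes, chunks=5_000_000)),
        "z": ("node", da.random.random(n_nodes, chunks=5_000_000)),
    }
)

# The coordinate data itself stays lazy, but turning one of these coordinates
# into an index for label-based selection (e.g. via ds.set_index(node="x"))
# currently means materializing it as an in-memory pandas.Index.
print(type(ds["x"].data))  # dask.array.core.Array (still lazy)
```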
It would be very nice if xarray could also help with these use cases. I'm therefore wondering whether (and how) out-of-core support could be extended to indexes and indexing.
I've briefly looked at the dask.dataframe documentation, and a first naive approach I have in mind would be to allow partitioning an index into multiple, contiguous indexes. For label-based indexing, we could then, for example, map indexing.convert_label_indexer over each partition and combine the returned indexers (see the sketch below).
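To make this slightly more concrete, here is a very rough sketch of what such a lookup could look like. The helper partitioned_get_indexer and the partition layout are made up for illustration, and pandas.Index.get_indexer stands in for what would really be a call to indexing.convert_label_indexer on each partition (possibly as dask tasks):

```python
import numpy as np
import pandas as pd

def partitioned_get_indexer(partitions, labels):
    """Map a label lookup over each partition and combine the results.

    ``partitions`` is a list of contiguous pandas.Index pieces that together
    form one large index; each piece could live in its own dask chunk.
    """
    # Global start position of each partition within the full index.
    offsets = np.cumsum([0] + [len(p) for p in partitions[:-1]])
    positions = []
    for offset, part in zip(offsets, partitions):
        local = part.get_indexer(labels)          # -1 where a label is absent
        found = local != -1
        positions.append(local[found] + offset)   # shift to global positions
    return np.concatenate(positions)

# One big index split into two contiguous partitions.
partitions = [pd.Index([10, 20, 30]), pd.Index([40, 50, 60])]
print(partitioned_get_indexer(partitions, [20, 50]))  # -> [1 4]
```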
My knowledge of dask is very limited, though, so I have no doubt that this suggestion is simplistic and probably inefficient, or that there are better approaches. I'm also certainly missing other issues not directly related to indexing.
Any thoughts?