
shallow copies become deep copies when pickling #1058

Closed

Description

@crusaderky

Whenever xarray performs a shallow copy of any object (DataArray, Dataset, Variable), it creates a view of the underlying numpy arrays.
This design fails when the object is pickled.

Whenever a numpy view is pickled, it becomes a regular array, so the underlying data gets serialized again:

>>> import numpy, pickle
>>> a = numpy.arange(2**26)
>>> print(len(pickle.dumps(a)) / 2**20)
256.00015354156494
>>> b = a.view()
>>> print(len(pickle.dumps((a, b))) / 2**20)
512.0001964569092
>>> b.base is a
True
>>> a2, b2 = pickle.loads(pickle.dumps((a, b)))
>>> b2.base is a2
False
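Note that pickle does deduplicate *identical* objects via its memo, so substituting a full view with its .base before dumping restores the sharing (a minimal numpy-only sketch of the idea, independent of xarray):

```python
import pickle

import numpy

a = numpy.arange(2**20)
b = a.view()

# Pickling the view alongside its base serializes the data twice...
doubled = len(pickle.dumps((a, b)))
# ...but pickle memoizes by object identity, so substituting the view's
# base (which *is* the object `a`) stores the data only once.
shared = len(pickle.dumps((a, b.base)))

print(doubled / 2**20, shared / 2**20)
```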

This has devastating effects in my use case. I start from a dask-backed DataArray with a dimension of 500,000 elements and no coord, so xarray auto-assigns an incremental integer coord.
Then I perform ~3000 transformations and dump the resulting dask-backed array with pickle; for audit purposes, I also have to dump all of the intermediate steps. This means that xarray invokes numpy.arange once to create (500k * 4 bytes) ~2MB worth of coord, then creates ~3000 views of it, which expand into 3000 independent copies, several GB in total, the moment they are pickled.

I see a few possible solutions to this:

  1. Implement pandas range indexes in xarray. This would be nice as a general thing and would solve my specific problem, but anybody who does not fall in my very specific use case won't benefit from it.
  2. Do not auto-generate a coord with numpy.arange() if the user doesn't explicitly ask for it; just leave a None and maybe generate it on the fly when requested. Again, this would solve my specific problem but not other people's.
  3. Force the coord to be a dask.array.arange. Properly supporting unconverted dask arrays as coordinates would take a considerable amount of work (they would get converted to numpy several times, among other issues), and again it wouldn't solve the general problem.
  4. Fix the issue upstream in numpy. I haven't looked into it yet and it's definitely worth investigating, but I found out about this behaviour as early as 2012, so I suspect there might be some pretty good reason why it works like that...
  5. Whenever xarray performs a shallow copy, take the numpy array instead of creating a view.
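For context on option (1), pandas already ships a RangeIndex whose pickle payload is essentially just the (start, stop, step) triple, whereas a materialized integer index scales with its length (a sketch, assuming pandas is installed):

```python
import pickle

import numpy as np
import pandas as pd

# A materialized integer index pickles all of its values...
materialized = pickle.dumps(pd.Index(np.arange(500_000)))
# ...while a RangeIndex pickles only its start/stop/step.
lazy = pickle.dumps(pd.RangeIndex(500_000))

print(len(materialized) / 2**20)  # several MB
print(len(lazy))                  # a few hundred bytes
```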

I implemented (5) as a workaround in my __getstate__ method.
Before:

%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
2.535497265867889
Wall time: 33.3 s

Workaround:

def get_base(array):
    # Replace a numpy view with its base, but only when the view
    # spans the base exactly (same dtype and same shape).
    if not isinstance(array, numpy.ndarray):
        return array
    elif array.base is None:
        return array
    elif array.base.dtype != array.dtype:
        return array
    elif array.base.shape != array.shape:
        return array
    else:
        return array.base

for v in cache.values():
    if isinstance(v, xarray.DataArray):
        v.data = get_base(v.data)
        for coord in v.coords.values():
            coord.data = get_base(coord.data)
    elif isinstance(v, xarray.Dataset):
        for var in v.variables.values():
            var.data = get_base(var.data)
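A quick self-contained check of the heuristic above (re-declared here so it runs standalone), confirming that only exact full views are replaced by their base:

```python
import numpy

def get_base(array):
    # Same heuristic as the workaround: swap a view for its base only
    # when the view covers the base exactly.
    if (isinstance(array, numpy.ndarray)
            and array.base is not None
            and array.base.dtype == array.dtype
            and array.base.shape == array.shape):
        return array.base
    return array

a = numpy.arange(10)
assert get_base(a.view()) is a                    # full view -> base
assert get_base(a[2:]) is not a                   # partial slice kept as-is
assert get_base(a.view(numpy.float64)) is not a   # dtype-changing view kept
assert get_base([1, 2, 3]) == [1, 2, 3]           # non-ndarray passed through
```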

After:

%%time
print(len(pickle.dumps(cache, pickle.HIGHEST_PROTOCOL)) / 2**30)
0.9733252348378301
Wall time: 21.1 s
