As suggested by @jakirkham in #1942, we can improve the registration of Dask serialization and deserialization functions, including making them available in `distributed`. This could be done with `NDArrayITKBase`, but we can also have optimized serializers for common `itk.DataObject`'s.
As @jakirkham noted:
If one communicates objects that Dask already knows how to work with (like NumPy arrays), then one will get optimized serialization for free. This may involve a little (or maybe no) code in ITK to make sure NumPy arrays are handled reasonably well and handed to Dask when functions return. The benefit here seems useful even outside the Dask context.
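As a minimal, hypothetical sketch of this first option: an object that exposes its pixel buffer through `__array__` is seen by `np.asarray()` (and hence by anything built on NumPy, including Dask) as a plain ndarray. The `ImageLike` class below is illustrative only — ITK's actual `NDArrayITKBase` takes a different route (it subclasses `np.ndarray`), but the effect on callers is similar.

```python
import numpy as np

# Hypothetical ITK-style wrapper: the pixel buffer is exposed via
# __array__, so NumPy-based tooling treats it as an ndarray without
# copying when the dtype already matches.
class ImageLike:
    def __init__(self, pixels):
        self._pixels = pixels

    def __array__(self, dtype=None, copy=None):
        # Hand back the underlying buffer; convert only if a dtype is requested.
        if dtype is not None:
            return self._pixels.astype(dtype)
        return self._pixels

img = ImageLike(np.arange(6.0).reshape(2, 3))
view = np.asarray(img)  # plain ndarray view of the wrapped buffer
```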
If one adds pickle protocol 5 support to objects (possibly extending to older Pythons with `pickle5`, or leveraging NumPy to do this indirectly ( numpy/numpy#12091 )), then one can leverage Dask's own support for pickle protocol 5 ( dask/distributed#3784 , dask/distributed#3849 ), which should yield a performance boost similar to Dask's own custom serialization (as the underlying principles are the same). This would merely require updating how pickling works in ITK. It will also work with anything else that supports pickle protocol 5.
Alternatively, one can use Dask's custom serialization by following this example on how to extend to custom types. The usual strategy here is to put these registration bits in their own module. This softens the Distributed dependency and provides a single point for registering these functions. The latter is useful because one needs to import this module in `distributed` for it to work. This is doable with a bit of work in ITK and Distributed, but not too difficult.
Any of these options seems fine. Possibly it's worth doing multiple or even all of them (this is what we did in RAPIDS: rapidsai/cudf#5139 ). Just trying to show what's possible and what it would entail 🙂
Originally posted by @jakirkham in #1942 (comment)