Describe the feature you'd like
Support for the Parquet MIME type in the SageMaker inference toolkit. For example, the README of this repo includes an example `default_input_fn()`:
```python
def default_input_fn(self, input_data, content_type, context=None):
    """A default input_fn that can handle JSON, CSV and NPZ formats.

    Args:
        input_data: the request payload serialized in the content_type format
        content_type: the request content_type
        context (obj): the request context (default: None).

    Returns: input_data deserialized into torch.FloatTensor or
        torch.cuda.FloatTensor depending if cuda is available.
    """
    return decoder.decode(input_data, content_type)
```
Looking into `decoder.decode`, I see the following MIME types are supported:
```python
_decoder_map = {
    content_types.NPY: _npy_to_numpy,
    content_types.CSV: _csv_to_numpy,
    content_types.JSON: _json_to_numpy,
    content_types.NPZ: _npz_to_sparse,
}
```
It should not be too hard to add Parquet here. Parquet is a columnar data format commonly used with large datasets, and it is already supported in other SageMaker services, for example Autopilot.
How would this feature be used? Please describe.
Reduced storage costs, lower data I/O costs, and faster processing.
Describe alternatives you've considered
CSV is the standard, but it is a much less efficient format for storing, reading and writing column-oriented data.