[Variant] Support variant_get kernel for shredded variants #7941

@alamb

I just realized something -- when dealing with shredded variants, this value method will do a lot of work to unshred and encode the whole thing (see e.g. #7915 (comment)). And that work is not memoized anywhere unless the caller is able to do so. For efficiency reasons we should strongly consider some kind of direct pathing support in VariantArray itself. Otherwise, it would be far too easy for a caller to accidentally do quadratic work when repeatedly calling value+get_path pairs for different paths.

Originally posted by @scovich in #7919 (comment)
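To make the cost concern concrete, here is a minimal sketch using a toy model rather than the real arrow-rs types. `ShreddedRow`, `fields_touched`, and both helper functions are hypothetical names invented for illustration; the only assumption carried over from the comment above is that `value` rebuilds the whole row on every call.

```rust
use std::cell::Cell;
use std::collections::BTreeMap;

/// Toy stand-in for one row of a shredded variant (hypothetical type,
/// not the arrow-rs `VariantArray`).
struct ShreddedRow {
    fields: BTreeMap<String, i64>,
    /// Number of field values touched so far, used as a proxy for work done.
    fields_touched: Cell<usize>,
}

impl ShreddedRow {
    fn new(fields: &[(&str, i64)]) -> Self {
        ShreddedRow {
            fields: fields.iter().map(|(k, v)| (k.to_string(), *v)).collect(),
            fields_touched: Cell::new(0),
        }
    }

    /// Models `value`: reassembles the entire object on every call.
    fn value(&self) -> BTreeMap<String, i64> {
        self.fields_touched
            .set(self.fields_touched.get() + self.fields.len());
        self.fields.clone()
    }

    /// Models direct pathing: reads only the requested field.
    fn get_path(&self, name: &str) -> Option<i64> {
        self.fields_touched.set(self.fields_touched.get() + 1);
        self.fields.get(name).copied()
    }
}

/// Work done when every path lookup goes through a fresh `value()` call.
fn work_via_value(row: &ShreddedRow, paths: &[&str]) -> usize {
    row.fields_touched.set(0);
    for p in paths {
        let full = row.value();
        let _ = full.get(*p);
    }
    row.fields_touched.get()
}

/// Work done when each path is looked up directly.
fn work_via_direct(row: &ShreddedRow, paths: &[&str]) -> usize {
    row.fields_touched.set(0);
    for p in paths {
        let _ = row.get_path(p);
    }
    row.fields_touched.get()
}
```

In this model, reading all n fields of a row through `value` touches n × n field values, while direct pathing touches n.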


@Samyak2 in #7919 (comment)

Just to understand this -- is this what you're suggesting?

  • We add a new get_path method in VariantArray that does this:
    • For each row, look up the type of the variant and perform the pathing without decoding. So, for example, if it's a VariantObject and there's a VariantPathElement::Field path, it would get the offset for the given field (not sure how yet) and advance the value slice by that much.
    • We would then create a new VariantArray with the metadata directly copied over and the value copied starting from the advanced slice.
    • For shredded variants, if the path ends up on a shredded value, what would be the expected behavior? I'm guessing that the shredded fields will be represented as an Array of the concrete type (an Int32Array for example) and not a VariantArray. Will we wrap these in a VariantArray and send it back? This is one case where having the path + cast in the same operation would help.
  • This variant_get would then simply re-use VariantArray::get_path and perform the appropriate cast.
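The proposal above can be sketched roughly as follows. This is a simplified illustration, not the arrow-rs API: `Value`, `PathElement`, `get_path`, and `variant_get_i64` are hypothetical stand-ins, and the toy `Value` is an in-memory tree rather than the binary variant encoding, so "advancing the value slice" becomes following a reference.

```rust
/// Hypothetical in-memory stand-in for a decoded variant value.
#[derive(Debug, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
    Object(Vec<(String, Value)>),
    List(Vec<Value>),
}

/// Hypothetical analogue of `VariantPathElement`.
enum PathElement {
    Field(String),
    Index(usize),
}

/// Follows `path` from `root` by reference, without copying any value.
/// Returns `None` if a step does not match the shape of the data.
fn get_path<'a>(root: &'a Value, path: &[PathElement]) -> Option<&'a Value> {
    let mut current = root;
    for element in path {
        current = match (current, element) {
            (Value::Object(fields), PathElement::Field(name)) => {
                &fields.iter().find(|(k, _)| k == name)?.1
            }
            (Value::List(items), PathElement::Index(i)) => items.get(*i)?,
            _ => return None,
        };
    }
    Some(current)
}

/// Path plus cast in one step: rows whose target is missing or not an
/// integer become `None`.
fn variant_get_i64(rows: &[Value], path: &[PathElement]) -> Vec<Option<i64>> {
    rows.iter()
        .map(|row| match get_path(row, path) {
            Some(Value::Int(v)) => Some(*v),
            _ => None,
        })
        .collect()
}
```

Doing the cast in the same pass is what would let a shredded column be returned as a concrete typed array instead of being re-wrapped as a variant.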

@alamb in #7919 (comment)

> For efficiency reasons we should strongly consider some kind of direct pathing support in VariantArray itself. Otherwise, it would be far too easy for a caller to accidentally do quadratic work when repeatedly calling value+get_path pairs for different paths.

I think @scovich is saying that the variant_get kernel (on VariantArray) should have a special case that knows how to look for a shredded sub-field. If, for example, it is asking for a and the typed_value.a column exists, variant_get could simply return that a column (already an Arrow array, no actual Variant manipulation required).
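A minimal sketch of that special case, using a toy column layout rather than the real arrow-rs types: `ShreddedColumn`, `Lookup`, and this `variant_get` signature are assumptions made for illustration. Only the idea of checking `typed_value` first comes from the discussion.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for a shredded variant column.
struct ShreddedColumn {
    /// Shredded fields (e.g. `typed_value.a`) stored as plain typed columns.
    typed_value: HashMap<String, Vec<Option<i64>>>,
    /// Unshredded remainder: one list of (field, value) pairs per row.
    value: Vec<Vec<(String, i64)>>,
}

/// Records which route produced the result.
#[derive(Debug, PartialEq)]
enum Lookup<'a> {
    /// The existing shredded column, borrowed as-is.
    Shredded(&'a [Option<i64>]),
    /// A column built by inspecting every row of the unshredded data.
    Decoded(Vec<Option<i64>>),
}

fn variant_get<'a>(col: &'a ShreddedColumn, field: &str) -> Lookup<'a> {
    // Fast path: the field was shredded, so hand back its column directly.
    if let Some(column) = col.typed_value.get(field) {
        return Lookup::Shredded(column.as_slice());
    }
    // Slow path: search each row of the unshredded remainder.
    let decoded = col
        .value
        .iter()
        .map(|row| {
            row.iter()
                .find(|(k, _)| k.as_str() == field)
                .map(|(_, v)| *v)
        })
        .collect();
    Lookup::Decoded(decoded)
}
```

The fast path does no per-row work at all, which is the property the comment above is after.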

Labels: parquet (Changes to the parquet crate)
