-
Notifications
You must be signed in to change notification settings - Fork 1.4k
feat: Add Avro reader #15790
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
feat: Add Avro reader #15790
Conversation
✅ Deploy Preview for meta-velox canceled.
|
| readerOptions.setFileFormat(hiveSplit->fileFormat); | ||
| } | ||
|
|
||
| readerOptions.serDeOptions().parameters = hiveSplit->serdeParameters; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't forward this blindly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I've updated this so that only a whitelisted set of keys is forwarded, and only when format == AVRO.
|
This is a large feature that needs maintainers. Do we have candidate for maintainers for this? |
Thanks for raising this. This feature is actively used and maintained by our team internally. |
This reverts commit e5b611d.
|
The |
|
The |
@izchen Welcome to the community. Which company are you from? |
Can we avoid introducing a new dependency and instead implement the reader in Velox directly? Dependency management is hard and makes building Velox difficult. CC: @pedroerp |
Thanks! I'm from xiaohongshu.com We've been running Velox in production for about 1.5 years to accelerate Spark workloads and have seen significant compute cost savings. We appreciate the community and look forward to contributing back. |
@mbasmanova Agreed — dependency management is painful. Avro itself is fairly complex:
Using avro-cpp helps us avoid re-implementing the Avro spec and reduces the risk of diverging from upstream behavior. avro-cpp is also actively maintained. In 2025, the Apache Avro community merged 36 commits related to avro-cpp. A fully Velox-native Avro reader would be a significant development and maintenance effort. Given this, we believe avro-cpp is the better choice. We’ve already mitigated the impact by:
Open to further discussion. |
|
@izchen Thank you for clarifying. What are the limitations of use avro-cpp? How are we tracking memory usage? Are filter pushdown and lazy loading supported? Are we forced to copy data from the format produced by the reader into Velox vectors? Is it feasible to address these? This PR is quite large. Is there a way to break it up into smaller pieces? CI is red. Are you working on fixing it? |
| try { | ||
| ::avro::compileJsonSchema(schemaStream, schema); | ||
| } catch (const std::exception& e) { | ||
| VELOX_USER_FAIL( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we swallowing this exception?
| VELOX_USER_CHECK_GT( | ||
| parsedValue, | ||
| 0, | ||
| "{} must be a positive integer, got {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to include parsedValue in the message; it is included automatically
| bool lastColumnTakesRest; | ||
| uint8_t escapeChar; | ||
| bool isEscaped; | ||
| std::unordered_map<std::string, std::string> parameters; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should just use RowReaderOptions::serdeParameters
| ::avro::ValidSchema schema; | ||
| std::istringstream schemaStream(schemaIt->second); | ||
| try { | ||
| ::avro::compileJsonSchema(schemaStream, schema); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to accept external schema separate from file? Is it the evolved current table schema? If so, you need to get the requested types as a Velox type (RowReaderOptions::requestedType) and construct the Avro schema inside the reader.
|
Thanks for the detailed comments! I’ll follow up on these. It may take some time. |
Implemented a native Avro reader based on Apache avro-cpp (release-1.12.1).