Conversation

@izchen
Contributor

@izchen izchen commented Dec 17, 2025

Implemented a native Avro reader based on Apache avro-cpp (release-1.12.1).

@netlify

netlify bot commented Dec 17, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit 27aae8f
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/694be023c2797b0008ba8e96

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 17, 2025
readerOptions.setFileFormat(hiveSplit->fileFormat);
}

readerOptions.serDeOptions().parameters = hiveSplit->serdeParameters;
Contributor

We shouldn't forward this blindly.

Contributor Author

Good point. I've updated this so that only a whitelisted set of keys is forwarded, and only when format == AVRO.
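
A minimal sketch of the filtering described above, not the PR's actual code. The helper name `filterSerdeParameters`, the `FileFormat` enum, and the key names in the allow-list are assumptions for illustration; the real set of forwarded keys is whatever the PR defines.

```cpp
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-in for the file format enum; only the Avro case matters.
enum class FileFormat { kParquet, kOrc, kAvro };

using SerdeParameters = std::unordered_map<std::string, std::string>;

// Returns only the allow-listed serde parameters, and only for Avro splits.
// The key names below are illustrative assumptions, not the PR's actual list.
SerdeParameters filterSerdeParameters(
    FileFormat format,
    const SerdeParameters& splitParameters) {
  static const std::unordered_set<std::string> kAllowedKeys = {
      "avro.schema.literal",
      "avro.schema.url",
  };

  SerdeParameters forwarded;
  if (format != FileFormat::kAvro) {
    return forwarded; // Other formats receive no serde parameters.
  }
  for (const auto& [key, value] : splitParameters) {
    if (kAllowedKeys.count(key) > 0) {
      forwarded.emplace(key, value);
    }
  }
  return forwarded;
}
```

With this shape, an unrelated key on an Avro split is dropped, and a non-Avro split forwards nothing at all.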

@Yuhta
Contributor

Yuhta commented Dec 17, 2025

This is a large feature that needs maintainers. Do we have candidate maintainers for it?

@izchen
Contributor Author

izchen commented Dec 18, 2025

> This is a large feature that needs maintainers. Do we have candidate maintainers for it?

Thanks for raising this.

This feature is actively used and maintained by our team internally.
I will be the primary point of contact and maintainer, and we have multiple engineers familiar with the code who can provide backup support.

@izchen
Contributor Author

izchen commented Dec 19, 2025

The Ubuntu Benchmark CI failed due to a false positive from static analysis, which is the same issue as reported in #15245.
I plan to add VELOX_SUPPRESS_STRINGOP_OVERFLOW_WARNING to fix it.

@izchen
Contributor Author

izchen commented Dec 19, 2025

The Collect Build Metrics and Run Checks / clang-tidy CI jobs failed because avro-cpp is not installed in the ghcr.io/facebookincubator/velox-dev:adapters image. (On Ubuntu, it can be installed via ./scripts/setup-ubuntu.sh install_avro)

@mbasmanova
Contributor

> This feature is actively used and maintained by our team internally.

@izchen Welcome to the community. Which company are you from?

@mbasmanova
Contributor

> based on avro-cpp (release-1.12.1).

Can we avoid introducing a new dependency and instead implement the reader in Velox directly? Dependency management is hard and makes building Velox difficult.

CC: @pedroerp

@izchen
Contributor Author

izchen commented Dec 20, 2025

> Welcome to the community. Which company are you from?

Thanks! I'm from xiaohongshu.com

We've been running Velox in production for about 1.5 years to accelerate Spark workloads and have seen significant compute cost savings. We appreciate the community and look forward to contributing back.

@izchen
Contributor Author

izchen commented Dec 22, 2025

> based on avro-cpp (release-1.12.1).
>
> Can we avoid introducing a new dependency and instead implement the reader in Velox directly? Dependency management is hard and makes building Velox difficult.
>
> CC: @pedroerp

@mbasmanova Agreed — dependency management is painful.

Avro itself is fairly complex:

  • The avro-cpp implementation is ~19k LOC (C++ sources and headers, excluding blank lines and comments).
  • The Avro spec contains many implicit rules, especially around schema parsing and schema evolution.

Using avro-cpp helps us avoid re-implementing the Avro spec and reduces the risk of diverging from upstream behavior.

avro-cpp is also actively maintained. In 2025, the Apache Avro community merged 36 commits related to avro-cpp. A fully Velox-native Avro reader would be a significant development and maintenance effort.

Given this, we believe avro-cpp is the better choice. We’ve already mitigated the impact by:

  • keeping the dependency optional (VELOX_ENABLE_AVRO via EXTRA_CMAKE_FLAGS, default OFF)
  • providing an install_avro helper script.

Open to further discussion.

@mbasmanova
Contributor

@izchen Thank you for clarifying. What are the limitations of using avro-cpp? How are we tracking memory usage? Are filter pushdown and lazy loading supported? Are we forced to copy data from the format produced by the reader into Velox vectors? Is it feasible to address these?

This PR is quite large. Is there a way to break it up into smaller pieces?

CI is red. Are you working on fixing it?

try {
  ::avro::compileJsonSchema(schemaStream, schema);
} catch (const std::exception& e) {
  VELOX_USER_FAIL(
Contributor

Why are we swallowing this exception?
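
For context on the question above: a catch block only swallows an exception if the original cause is lost. Below is a sketch of translating a library exception into a user-facing error while keeping the underlying message. `UserError`, `parseSchema`, and `compileSchemaOrFail` are hypothetical stand-ins, not avro-cpp or Velox APIs, and the hunk under review is truncated, so this does not claim to show what the PR does.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical user-facing error type, standing in for a Velox user error.
struct UserError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for a third-party schema parser that throws on
// malformed input.
void parseSchema(const std::string& json) {
  if (json.empty() || json.front() != '{') {
    throw std::invalid_argument("expected '{' at start of schema");
  }
}

// Translates the library exception into a user error and keeps the original
// message, so the root cause stays visible to the caller.
void compileSchemaOrFail(const std::string& json) {
  try {
    parseSchema(json);
  } catch (const std::exception& e) {
    throw UserError(std::string("Failed to parse Avro schema: ") + e.what());
  }
}
```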

VELOX_USER_CHECK_GT(
    parsedValue,
    0,
    "{} must be a positive integer, got {}",
Contributor

No need to include parsedValue in the message; it is included automatically.
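
To illustrate the point, here is a simplified sketch of a comparison check that reports both operands itself, so the caller's message only has to describe the constraint. `checkGreaterThan` is a hypothetical helper written for this example; it is not Velox's macro, and the exact message layout is an assumption.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical simplified check helper; not Velox's VELOX_USER_CHECK_GT.
// On failure it prepends both operands to the caller's message, so the
// caller does not need to repeat the offending value.
template <typename T>
void checkGreaterThan(
    const T& value,
    const T& bound,
    const std::string& message) {
  if (!(value > bound)) {
    std::ostringstream out;
    out << "(" << value << " vs. " << bound << ") " << message;
    throw std::invalid_argument(out.str());
  }
}
```

A call such as `checkGreaterThan(parsedValue, 0, "batch size must be a positive integer")` would then already carry the failing value in the error text; the option name used here is made up for the example.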

bool lastColumnTakesRest;
uint8_t escapeChar;
bool isEscaped;
std::unordered_map<std::string, std::string> parameters;
Contributor

This should just use RowReaderOptions::serdeParameters.

::avro::ValidSchema schema;
std::istringstream schemaStream(schemaIt->second);
try {
  ::avro::compileJsonSchema(schemaStream, schema);
Contributor

@Yuhta Yuhta Dec 22, 2025

Why do we need to accept external schema separate from file? Is it the evolved current table schema? If so, you need to get the requested types as a Velox type (RowReaderOptions::requestedType) and construct the Avro schema inside the reader.
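
A rough sketch of what "construct the Avro schema inside the reader" could look like, under heavy simplification. `Kind`, `Column`, `avroPrimitiveName`, and `toAvroRecordSchema` are hypothetical names for this example, not Velox types; only flat rows of primitive columns are handled, every field is made nullable by assumption, and names are assumed to need no JSON escaping. The Avro primitive type names themselves come from the Avro specification.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, heavily simplified stand-in for a requested row type.
enum class Kind {
  kBoolean, kInteger, kBigint, kReal, kDouble, kVarchar, kVarbinary
};

struct Column {
  std::string name;
  Kind kind;
};

// Maps each simplified kind to the matching Avro primitive type name.
std::string avroPrimitiveName(Kind kind) {
  switch (kind) {
    case Kind::kBoolean: return "boolean";
    case Kind::kInteger: return "int";
    case Kind::kBigint: return "long";
    case Kind::kReal: return "float";
    case Kind::kDouble: return "double";
    case Kind::kVarchar: return "string";
    case Kind::kVarbinary: return "bytes";
  }
  return "null"; // Not reached for valid enum values.
}

// Builds an Avro record schema as JSON text. Each field is declared as a
// ["null", T] union so the column is nullable (an assumption of this sketch).
std::string toAvroRecordSchema(
    const std::string& recordName,
    const std::vector<Column>& columns) {
  std::string json =
      "{\"type\":\"record\",\"name\":\"" + recordName + "\",\"fields\":[";
  for (std::size_t i = 0; i < columns.size(); ++i) {
    if (i > 0) {
      json += ",";
    }
    json += "{\"name\":\"" + columns[i].name + "\",\"type\":[\"null\",\"" +
        avroPrimitiveName(columns[i].kind) + "\"]}";
  }
  json += "]}";
  return json;
}
```

The resulting string could then be fed through a `std::istringstream` to `::avro::compileJsonSchema`, as the hunk above already does for the externally supplied schema.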

@izchen
Copy link
Contributor Author

izchen commented Dec 24, 2025

Thanks for the detailed comments! I’ll follow up on these. It may take some time.
