feat: arrow convenience extensions #827
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #827      +/-   ##
==========================================
- Coverage   85.01%   84.99%   -0.02%
==========================================
  Files          84       86       +2
  Lines       20656    20699      +43
  Branches    20656    20699      +43
==========================================
+ Hits        17561    17594      +33
- Misses       2228     2229       +1
- Partials      867      876       +9
scovich
left a comment
It would be nice if all the new extension methods had actual use sites, to give a better sense of how useful they are? Right now only execute_arrow has a real use site.
    fn evaluate_arrow(&self, batch: RecordBatch) -> DeltaResult<RecordBatch>;
}

impl<T: ExpressionEvaluator + ?Sized> ExpressionEvaluatorExt for T {
Why ?Sized? Are there dyn impl somewhere?
Or do we need that in order to invoke the associated function T::evaluate?
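For context on the `?Sized` question, here is a minimal sketch using hypothetical stand-in traits (`Evaluator` and `EvaluatorExt` are not the kernel's types). It illustrates that the bound only lifts the implicit `Sized` requirement on `T`: calling methods through `&self` does not need it, but it is what lets the blanket impl cover `dyn Evaluator`, so the extension method can be called on `&dyn`, `Box<dyn>` or `Arc<dyn>` values.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the base trait.
trait Evaluator {
    fn evaluate(&self, input: i64) -> i64;
}

// Hypothetical stand-in for the extension trait.
trait EvaluatorExt {
    fn evaluate_twice(&self, input: i64) -> i64;
}

// `?Sized` allows T = dyn Evaluator. Without it, this impl would only
// cover concrete (sized) types and `run_dyn` below would not compile.
impl<T: Evaluator + ?Sized> EvaluatorExt for T {
    fn evaluate_twice(&self, input: i64) -> i64 {
        self.evaluate(self.evaluate(input))
    }
}

struct AddOne;

impl Evaluator for AddOne {
    fn evaluate(&self, input: i64) -> i64 {
        input + 1
    }
}

// The receiver derefs to `dyn Evaluator`, which is unsized.
fn run_dyn(evaluator: &dyn Evaluator, input: i64) -> i64 {
    evaluator.evaluate_twice(input)
}

// Same idea through a shared trait object.
fn run_arc(evaluator: Arc<dyn Evaluator>, input: i64) -> i64 {
    evaluator.evaluate_twice(input)
}

fn main() {
    println!("{}", run_dyn(&AddOne, 1));
    println!("{}", run_arc(Arc::new(AddOne), 10));
}
```

So the bound matters mainly if evaluators are handed around as trait objects; for purely generic call sites it is harmless but unnecessary.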
    let record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
    mask.map(|m| Ok(filter_record_batch(&record_batch, &m.into())?))
        .unwrap_or(Ok(record_batch))
Is this a good use for Option::map_or_else?
-    let record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
-    mask.map(|m| Ok(filter_record_batch(&record_batch, &m.into())?))
-        .unwrap_or(Ok(record_batch))
+    let record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
+    mask.map_or_else(
+        || Ok(record_batch),
+        |m| Ok(filter_record_batch(&record_batch, &m.into())?),
+    )
Tho simple imperative code probably wins on readability:
-    let record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
-    mask.map(|m| Ok(filter_record_batch(&record_batch, &m.into())?))
-        .unwrap_or(Ok(record_batch))
+    let record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
+    Ok(match mask {
+        Some(m) => filter_record_batch(&record_batch, &m.into())?,
+        None => record_batch,
+    })
or even
-    let record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
-    mask.map(|m| Ok(filter_record_batch(&record_batch, &m.into())?))
-        .unwrap_or(Ok(record_batch))
+    let mut record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
+    if let Some(m) = mask {
+        record_batch = filter_record_batch(&record_batch, &m.into())?;
+    }
+    Ok(record_batch)
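To make the `match` variant concrete, here is a self-contained sketch with stand-in types: `Batch`, `filter_batch` and `apply_mask` are hypothetical and only mimic the shape of `RecordBatch` and `filter_record_batch`. One thing worth checking with the `map_or_else` form is that its two closures both capture the batch, one by value and one by reference, which the borrow checker typically rejects for a non-`Copy` type; the `match` form avoids that because each arm uses the value independently.

```rust
// Hypothetical stand-in for an Arrow RecordBatch.
#[derive(Debug, Clone, PartialEq)]
struct Batch(Vec<i32>);

// Hypothetical stand-in for `filter_record_batch`: keeps the rows whose
// mask entry is true and fails on a length mismatch.
fn filter_batch(batch: &Batch, mask: &[bool]) -> Result<Batch, String> {
    if batch.0.len() != mask.len() {
        return Err(format!(
            "mask length {} does not match batch length {}",
            mask.len(),
            batch.0.len()
        ));
    }
    let kept = batch
        .0
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| *value)
        .collect();
    Ok(Batch(kept))
}

// Applies the optional mask. The `Some` arm only borrows `batch`,
// while the `None` arm moves it out unchanged.
fn apply_mask(batch: Batch, mask: Option<Vec<bool>>) -> Result<Batch, String> {
    Ok(match mask {
        Some(m) => filter_batch(&batch, &m)?,
        None => batch,
    })
}

fn main() {
    let batch = Batch(vec![1, 2, 3]);
    println!("{:?}", apply_mask(batch.clone(), Some(vec![true, false, true])));
    println!("{:?}", apply_mask(batch, None));
}
```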
    .map_ok(TryFrom::try_from)
    .flatten())
IIRC, map_ok and flatten are a bad combination -- Err cases are silently dropped because they are treated as empty iterators. Does this work?
-    .map_ok(TryFrom::try_from)
-    .flatten())
+    .map(|result| Ok(result?.try_into()?))
+    .flatten_ok()
(depending on the error types, you might be able to drop the Ok(...?) wrapper)
(again below)
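The concern about dropped errors can be reproduced with the standard library alone. In this sketch, `.map(|item| item.map(..))` stands in for itertools' `map_ok`, and string parsing stands in for the fallible `TryFrom` conversion; the function names and sample data are made up for illustration. Because `Result` implements `IntoIterator` and yields nothing for `Err`, `flatten()` silently discards the upstream error.

```rust
use std::num::ParseIntError;

// Mirrors `map_ok(convert).flatten()`: each item becomes a nested
// Result, and `flatten()` iterates the outer Result, so an outer `Err`
// contributes zero items and disappears.
fn lossy(upstream: Vec<Result<&str, String>>) -> Vec<Result<i32, ParseIntError>> {
    upstream
        .into_iter()
        .map(|item| item.map(|s| s.parse::<i32>()))
        .flatten()
        .collect()
}

// Keeps every item: the conversion error is folded into the same error
// type with `and_then`, so upstream errors are passed through.
fn strict(upstream: Vec<Result<&str, String>>) -> Vec<Result<i32, String>> {
    upstream
        .into_iter()
        .map(|item| item.and_then(|s| s.parse::<i32>().map_err(|e| e.to_string())))
        .collect()
}

// Sample input: one upstream failure and one value that fails to convert.
fn sample() -> Vec<Result<&'static str, String>> {
    vec![Ok("1"), Err("read failed".to_string()), Ok("x"), Ok("3")]
}

fn main() {
    println!("lossy kept {} of 4 items", lossy(sample()).len());
    println!("strict kept {} of 4 items", strict(sample()).len());
}
```

With the sample input, the lossy pipeline yields three items (the upstream "read failed" error is gone), while the strict one yields all four.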
We currently maintain these in delta-rs, where they also support transitioning to kernel. Once things stabilize we could revisit whether we want to upstream them, but for now I am closing this PR.
What changes are proposed in this pull request?
The PR introduces some convenience APIs for engines working with Arrow data. Specifically, we define and implement ScanExt and ExpressionEvaluatorExt, which define variants of the main APIs for Scan and ExpressionEvaluator, respectively, in terms of Arrow RecordBatches.

PR #621 contains some similar work in defining a convenience function to handle Scan::execute results. In this PR a TryFrom impl is used; I was a bit unsure which approach would be better.

see: #826

Also includes one cargo clippy fix.

This PR affects the following public APIs
New public methods when traits are in scope: Scan::scan_metadata_arrow, Scan::evaluate_arrow and ExpressionEvaluator::evaluate_arrow.

How was this change tested?
Additional unit tests for new APIs.