-
Notifications
You must be signed in to change notification settings - Fork 28.7k
[SPARK-22772][SQL] Use splitExpressionsWithCurrentInputs to split codes in elt #19964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
""".stripMargin, | ||
foldFunctions = funcs => s"UTF8String $stringVal = ${funcs.last};", | ||
makeFunctionCallback = f => prevFunc = s"$f(${ctx.INPUT_ROW}, $indexVal)", | ||
mergeSplit = false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to and can't merge split functions in inner classes.
We don't need to to do it because the split functions are not call in a sequence like this:
eltFunc_1(...)
eltFunc_2(...)
...
The calls are embedded in the default branch in each split function. So we won't call all split inner functions in outer class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't merge them because the makeSplitFunction
will create invalid merged function if used with the given foldFunctions
:
private UTF8String eltFunc(InternalRow i, int index) {
UTF8String stringVal = null;
switch (index) {
UTF8String stringVal = eltFunc_999(i, index);
default:
return nestedClassInstance.eltFunc_999(i, index);
}
return stringVal;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes but in this way we can hit the 64KB limit. Moreover I think that the current implementation is quite complex. What about making it similar to any other implementations using a while
loop instead of a switch
?
In this way we can ensure the 64KB limit won't be a problem and the code would be easier to understand IMHO.
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why we can hit the 64kb limit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have thought about it. Other implementation needs to introduce at least one global variable such as case when case. If we can tolerate it, it is ok for me. Let's see what other reviewers think about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's not complicated the already-complex splitExpressions
, I'm ok to use some global variables to simplify the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya I think we can hit it, with an outstanding number of parameters to the function. I am not saying that it is likely to happen, but IMHO it is feasible to make it happening
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Let me replace it with simpler codes. Thanks.
Test build #84848 has finished for PR 19964 at commit
|
Test build #84847 has finished for PR 19964 at commit
|
val NOT_MATCHED = -1 | ||
// 0 means the given index matches one of indices of strings in split function. | ||
val MATCHED = 0 | ||
val resultState = ctx.freshName("eltResultState") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be a boolean
instead of a byte
IMHO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, right. :-)
|
||
// -1 means the given index doesn't match indices of strings in split function. | ||
val NOT_MATCHED = -1 | ||
// 0 means the given index matches one of indices of strings in split function. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only 2 possible values, we can use boolean
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, missing it.
@@ -289,53 +289,56 @@ case class Elt(children: Seq[Expression]) | |||
val index = indexExpr.genCode(ctx) | |||
val strings = stringExprs.map(_.genCode(ctx)) | |||
val indexVal = ctx.freshName("index") | |||
val resultState = ctx.freshName("eltResultState") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe this can have a better name now that it is a boolean.... I am not very good at naming, but something like indexFound
or anything you feel appropriate...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
|${index.code} | ||
|final int $indexVal = ${index.value}; | ||
|${ctx.JAVA_BOOLEAN} $indexMatched = false; | ||
|$stringVal = ${ctx.defaultValue(dataType)}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I would prefer $stringVal = null
to enforce this, Because later we rely on stringVal
to be init to null
. Anyway the current implementation is right. If we have a UT which checks that it returns null
when it should, we should be safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we have required tests in StringExpressionsSuite
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, since at the end we do final boolean ${ev.isNull} = ${ev.value} == null;
.
} | ||
|
||
val cases = ctx.buildCodeBlocks(assignStringValue) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now ctx.buildCodeBlock
doesn't need to be a separate method, can we revert that change and inline buildCodeBlock
to splitExpressions
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
splitExpressions
is quite complicated. I think it is still good to have buildCodeBlock
as a separate method. Maybe makes it as private
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good
LGTM, thanks |
LGTM |
Test build #84857 has finished for PR 19964 at commit
|
Test build #84859 has finished for PR 19964 at commit
|
Test build #84860 has finished for PR 19964 at commit
|
Test build #84863 has finished for PR 19964 at commit
|
Thanks! Merged to master. |
What changes were proposed in this pull request?
In SPARK-22550 which fixes 64KB JVM bytecode limit problem with elt,
buildCodeBlocks
is used to split codes. However, we should usesplitExpressionsWithCurrentInputs
because it considers both normal and wholestage codgen (it is not supported yet, so it simply doesn't split the codes).How was this patch tested?
Existing tests.