Skip to content

[SPARK-22772][SQL] Use splitExpressionsWithCurrentInputs to split codes in elt #19964

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 5 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Dec 13, 2017

What changes were proposed in this pull request?

In SPARK-22550 which fixes 64KB JVM bytecode limit problem with elt, buildCodeBlocks is used to split codes. However, we should use splitExpressionsWithCurrentInputs because it considers both normal and wholestage codgen (it is not supported yet, so it simply doesn't split the codes).

How was this patch tested?

Existing tests.

""".stripMargin,
foldFunctions = funcs => s"UTF8String $stringVal = ${funcs.last};",
makeFunctionCallback = f => prevFunc = s"$f(${ctx.INPUT_ROW}, $indexVal)",
mergeSplit = false)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to and can't merge split functions in inner classes.

We don't need to to do it because the split functions are not call in a sequence like this:

eltFunc_1(...)
eltFunc_2(...)
...

The calls are embedded in the default branch in each split function. So we won't call all split inner functions in outer class.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't merge them because the makeSplitFunction will create invalid merged function if used with the given foldFunctions:

private UTF8String eltFunc(InternalRow i, int index) { 

  UTF8String stringVal = null;
  switch (index) {
    UTF8String stringVal = eltFunc_999(i, index);
    default:
      return nestedClassInstance.eltFunc_999(i, index);
  }
  return stringVal;
}

Copy link
Contributor

@mgaido91 mgaido91 Dec 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but in this way we can hit the 64KB limit. Moreover I think that the current implementation is quite complex. What about making it similar to any other implementations using a while loop instead of a switch?
In this way we can ensure the 64KB limit won't be a problem and the code would be easier to understand IMHO.
WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why we can hit the 64kb limit?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have thought about it. Other implementation needs to introduce at least one global variable such as case when case. If we can tolerate it, it is ok for me. Let's see what other reviewers think about it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's not complicated the already-complex splitExpressions, I'm ok to use some global variables to simplify the code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya I think we can hit it, with an outstanding number of parameters to the function. I am not saying that it is likely to happen, but IMHO it is feasible to make it happening

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Let me replace it with simpler codes. Thanks.

@viirya
Copy link
Member Author

viirya commented Dec 13, 2017

cc @cloud-fan @kiszk @mgaido91

@SparkQA
Copy link

SparkQA commented Dec 13, 2017

Test build #84848 has finished for PR 19964 at commit c677aed.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 13, 2017

Test build #84847 has finished for PR 19964 at commit c40488e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val NOT_MATCHED = -1
// 0 means the given index matches one of indices of strings in split function.
val MATCHED = 0
val resultState = ctx.freshName("eltResultState")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be a boolean instead of a byte IMHO

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, right. :-)


// -1 means the given index doesn't match indices of strings in split function.
val NOT_MATCHED = -1
// 0 means the given index matches one of indices of strings in split function.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only 2 possible values, we can use boolean

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, missing it.

@@ -289,53 +289,56 @@ case class Elt(children: Seq[Expression])
val index = indexExpr.genCode(ctx)
val strings = stringExprs.map(_.genCode(ctx))
val indexVal = ctx.freshName("index")
val resultState = ctx.freshName("eltResultState")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this can have a better name now that it is a boolean.... I am not very good at naming, but something like indexFound or anything you feel appropriate...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

|${index.code}
|final int $indexVal = ${index.value};
|${ctx.JAVA_BOOLEAN} $indexMatched = false;
|$stringVal = ${ctx.defaultValue(dataType)};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I would prefer $stringVal = null to enforce this, Because later we rely on stringVal to be init to null. Anyway the current implementation is right. If we have a UT which checks that it returns null when it should, we should be safe.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have required tests in StringExpressionsSuite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, since at the end we do final boolean ${ev.isNull} = ${ev.value} == null;.

}

val cases = ctx.buildCodeBlocks(assignStringValue)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

now ctx.buildCodeBlock doesn't need to be a separate method, can we revert that change and inline buildCodeBlock to splitExpressions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

splitExpressions is quite complicated. I think it is still good to have buildCodeBlock as a separate method. Maybe makes it as private?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good

@mgaido91
Copy link
Contributor

LGTM, thanks

@cloud-fan
Copy link
Contributor

LGTM

@SparkQA
Copy link

SparkQA commented Dec 13, 2017

Test build #84857 has finished for PR 19964 at commit ac41620.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 13, 2017

Test build #84859 has finished for PR 19964 at commit f6a4a54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 13, 2017

Test build #84860 has finished for PR 19964 at commit 1563a46.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 13, 2017

Test build #84863 has finished for PR 19964 at commit 69aab61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

Thanks! Merged to master.

@asfgit asfgit closed this in ba0e79f Dec 13, 2017
@viirya viirya deleted the SPARK-22772 branch December 27, 2023 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants