[SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1) #26003


Closed · wants to merge 6 commits

Conversation

@dongjoon-hyun (Member) commented Oct 2, 2019

What changes were proposed in this pull request?

This PR regenerates the sql/core benchmarks on JDK8 and JDK11 to compare the results. In general, we compare the ratio instead of the absolute time; in this PR, however, the average time is compared, so this should be considered a rough comparison.

A. EXPECTED CASES (JDK11 is faster in general)

  • BloomFilterBenchmark (JDK11 is faster except one case)
  • BuiltInDataSourceWriteBenchmark (JDK11 is faster at CSV/ORC)
  • CSVBenchmark (JDK11 is faster except five cases)
  • ColumnarBatchBenchmark (JDK11 is faster at boolean/string and some cases in int/array)
  • DatasetBenchmark (JDK11 is faster with string, but is slower for long type)
  • ExternalAppendOnlyUnsafeRowArrayBenchmark (JDK11 is faster except two cases)
  • ExtractBenchmark (JDK11 is faster except HOUR/MINUTE/SECOND/MILLISECONDS/MICROSECONDS)
  • HashedRelationMetricsBenchmark (JDK11 is faster)
  • JSONBenchmark (JDK11 is much faster except eight cases)
  • JoinBenchmark (JDK11 is faster except five cases)
  • OrcNestedSchemaPruningBenchmark (JDK11 is faster in nine cases)
  • PrimitiveArrayBenchmark (JDK11 is faster)
  • SortBenchmark (JDK11 is faster except Arrays.sort case)
  • UDFBenchmark (N/A, values are too small)
  • UnsafeArrayDataBenchmark (JDK11 is faster except one case)
  • WideTableBenchmark (JDK11 is faster except two cases)

B. CASES WE NEED TO INVESTIGATE MORE LATER

  • AggregateBenchmark (JDK11 is slower in general)
  • CompressionSchemeBenchmark (JDK11 is slower in general except string)
  • DataSourceReadBenchmark (JDK11 is slower in general)
  • DateTimeBenchmark (JDK11 is slightly slower in general except parsing)
  • MakeDateTimeBenchmark (JDK11 is slower except two cases)
  • MiscBenchmark (JDK11 is slower except ten cases)
  • OrcV2NestedSchemaPruningBenchmark (JDK11 is slower)
  • ParquetNestedSchemaPruningBenchmark (JDK11 is slower except six cases)
  • RangeBenchmark (JDK11 is slower except one case)

FilterPushdownBenchmark/InExpressionBenchmark/WideSchemaBenchmark will be compared later because they take a long time to run.
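
For context, these result files come from Spark's benchmark harness; a suite is typically regenerated with `SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <benchmark class>"`. Below is a minimal sketch of how such a suite produces the Best/Avg/Stdev/Rate/Per-Row/Relative columns, assuming the `org.apache.spark.benchmark.Benchmark` utility's API of this era; the case names and bodies are placeholders, not the real suite.

```scala
import org.apache.spark.benchmark.Benchmark

// Sketch only: this runs inside Spark's own build (Benchmark lives in the
// core module's test sources); the case bodies below are placeholders.
val N = 500L * 1000 * 1000
val benchmark = new Benchmark("agg w/o group", valuesPerIteration = N, minNumIters = 2)

benchmark.addCase("agg w/o group wholestage off") { _ =>
  // e.g. run the aggregate with spark.sql.codegen.wholeStage=false
  ()
}
benchmark.addCase("agg w/o group wholestage on") { _ =>
  // e.g. run the aggregate with whole-stage codegen enabled
  ()
}

// Prints the Best/Avg/Stdev/Rate/Per Row/Relative table seen in the
// *-results.txt files; with SPARK_GENERATE_BENCHMARK_FILES=1 set, the
// surrounding BenchmarkBase machinery writes it under benchmarks/ instead.
benchmark.run()
```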

Why are the changes needed?

According to the results, there are some differences between JDK8 and JDK11.
This will be a baseline for future improvements and comparisons. For reproducibility, the following environment was used.

  • Instance: r3.xlarge
  • OS: CentOS Linux release 7.5.1804 (Core)
  • JDK:
    • OpenJDK Runtime Environment (build 1.8.0_222-b10)
    • OpenJDK Runtime Environment 18.9 (build 11.0.4+11-LTS)

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This is a test-only PR. We need to run the benchmarks.

@dongjoon-hyun changed the title from “[SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1)” to “[WIP][SPARK-29320][TESTS] Compare sql/core module in JDK8/11” on Oct 2, 2019
@MaxGekk (Member) commented Oct 2, 2019

I am just wondering why the PR removes almost twice as many lines as it adds: +1,675 −3,064?

@SparkQA commented Oct 2, 2019

Test build #111697 has finished for PR 26003 at commit b4bbff4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 2, 2019

Test build #111699 has finished for PR 26003 at commit d7d665c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu (Member) commented Oct 2, 2019

For a better commit log, how about putting the benchmark environment info below in the PR description?

OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz

@SparkQA commented Oct 3, 2019

Test build #111703 has finished for PR 26003 at commit b399f0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 3, 2019

Test build #111706 has finished for PR 26003 at commit 9eb1e83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 3, 2019

Test build #111718 has finished for PR 26003 at commit d2e561b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.
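
(For reference, an exit code of −9 usually means the JVM was killed with signal 9/SIGKILL, e.g. by the OS out-of-memory killer, rather than an actual test failure.)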

@dongjoon-hyun changed the title from “[WIP][SPARK-29320][TESTS] Compare sql/core module in JDK8/11” to “[SPARK-29320][TESTS] Compare sql/core module in JDK8/11 (Part 1)” on Oct 3, 2019
@dongjoon-hyun (Author):

Hi @wangyum, @srowen, @maropu, @MaxGekk, @HyukjinKwon. Could you review this one? I'll make another PR for FilterPushdownBenchmark/InExpressionBenchmark/WideSchemaBenchmark.

Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
agg w/o group: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
agg w/o group wholestage off 55644 59484 NaN 37.7 26.5 1.0X
Reviewer (Member):

NaN Stdev? Is it constantly NaN?

Author:

Hmm, I'm not sure, but it's the generated result of the current benchmark suite.
It has existed in the master branch since your commit 93a264d, too.

Reviewer (Member):

This would suggest only 1 trial was run on this one. Not wrong per se but ideally these benchmarks all run a few iterations.

Author:

Oh, right.
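
As a side note on the NaN: a sample standard deviation divides by n − 1, so a single iteration produces a 0.0/0 floating-point division. A minimal generic-Scala illustration (the 59484 ms value is taken from the quoted row; this is an illustration, not Spark's exact code):

```scala
// With one iteration, the (n - 1) denominator is 0, and 0.0 / 0 is NaN.
val times = Seq(59484.0)                       // a single avg-time sample (ms)
val mean  = times.sum / times.size
val stdev = math.sqrt(
  times.map(t => (t - mean) * (t - mean)).sum / (times.size - 1))
println(stdev)                                 // NaN
```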

aggregate without grouping
================================================================================================

OpenJDK 64-Bit Server VM 11.0.4+11-LTS on Linux 3.10.0-862.3.2.el7.x86_64
Reviewer (Member):

Just in case: the 3.10 kernel (if that is the Linux kernel version) is 2 years old. Are we sure we are interested in results on this old version?

Author:

@MaxGekk, this is a Java comparison, not an OS comparison. We didn't care about the Mac OS benchmark results either, did we?

Reviewer (Member):

I think the point is that the benchmark ratios could be different on another OS, but this is fine. Eventually we can try benchmarking consistently on a later distro / 4.x kernel.

Baseline 20 22 1 4,9 204,3 1,0X
With identity UDF 24 26 2 4,1 241,3 0,8X
Baseline 53 58 4 1.9 525.6 1.0X
With identity UDF 38 38 0 2.7 376.3 1.4X
Reviewer (Member):

It was faster before, but now it is slower than the baseline.

Author:

For JDK8 degradation, we may need to file another JIRA.

------------------------------------------------------------------------------------------------------------------------
reference TimSort key prefix array 15736 15778 59 1.6 629.4 1.0X
reference Arrays.sort 3051 3057 10 8.2 122.0 5.2X
radix sort one byte 442 453 10 56.6 17.7 35.6X
Reviewer (Member):

Radix sort became 3 times slower.

Author:

ditto.

Author:

Ur, wait. Are you comparing between Mac and Linux? You should not do that. It's meaningless.

Java HotSpot(TM) 64-Bit Server VM 1.8.0_162-b12 on Mac OS X 10.13.6
 Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
...
 OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
 Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz

Reviewer (Member):

I compared the ratios.

Author:

No. The ratio comparison is only meaningful when you compare on a similar machine.
A Mac and a Linux server have totally different CPU/cache/RAM/disk characteristics, and therefore different ratios.

@dongjoon-hyun (Author):

Thank you for the review, @MaxGekk ~

@SparkQA commented Oct 3, 2019

Test build #111728 has finished for PR 26003 at commit f06a25f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen (Member) left a review:

Looks OK if it's reestablishing new baselines.


@dongjoon-hyun (Author):

Thank you, @srowen and @MaxGekk. There are many factors to consider, and usually we had better change only one factor at a time. I'll move forward for now with your advice in mind (the number of iterations and the kernel version).

Merged to master.

@dongjoon-hyun deleted the SPARK-29320 branch on October 3, 2019 15:59
@dongjoon-hyun (Author):

Until now, I used the same official CentOS AMI and didn't change it, in order to compare only the code changes. As a follow-up to this PR, I took a look at the most recent AMI, per the review comments.

The latest CentOS AMI is ami-01ed306a12b7d1c96 (CentOS Linux 7.6.1810 x86_64 HVM EBS ENA 1901_01-b7). Unfortunately, it's also using Linux version 3.10.0-957.1.3.el7.x86_64.

To use a 4.x kernel, it seems that we need to choose Ubuntu or Amazon Linux from now on. I'll try to use one of them for the remaining benchmarks.

dongjoon-hyun added a commit that referenced this pull request Jan 12, 2020
…dk11-results.txt

### What changes were proposed in this pull request?

This PR removes a dangling test result file, `JSONBenchmark-jdk11-results.txt`, which causes a case-sensitivity issue on Mac:

```
$ git clone https://gitbox.apache.org/repos/asf/spark.git spark-gitbox
Cloning into 'spark-gitbox'...
remote: Counting objects: 671717, done.
remote: Compressing objects: 100% (258021/258021), done.
remote: Total 671717 (delta 329181), reused 560390 (delta 228097)
Receiving objects: 100% (671717/671717), 149.69 MiB | 950.00 KiB/s, done.
Resolving deltas: 100% (329181/329181), done.
Updating files: 100% (16090/16090), done.
warning: the following paths have collided (e.g. case-sensitive paths
on a case-insensitive filesystem) and only one from the same
colliding group is in the working tree:

  'sql/core/benchmarks/JSONBenchmark-jdk11-results.txt'
  'sql/core/benchmarks/JsonBenchmark-jdk11-results.txt'
```

### Why are the changes needed?

Previously, since the file name didn't match `object JSONBenchmark`, it caused confusion when we ran the benchmark, so 4e0e4e5 renamed `JSONBenchmark` to `JsonBenchmark`. However, in the same time frame, #26003 regenerated this file.

Recently, #27078 regenerated the results with the correct file name, `JsonBenchmark-jdk11-results.txt`, so we can remove the old one.

### Does this PR introduce any user-facing change?

No. This is a test result.

### How was this patch tested?

Manually checked the following correctly generated files in master, and verified that this PR removes the dangling one:
- https://github.com/apache/spark/blob/master/sql/core/benchmarks/JsonBenchmark-results.txt
- https://github.com/apache/spark/blob/master/sql/core/benchmarks/JsonBenchmark-jdk11-results.txt

Closes #27180 from dongjoon-hyun/SPARK-REMOVE.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
cloud-fan pushed a commit that referenced this pull request Mar 9, 2021
### What changes were proposed in this pull request?

`BroadcastNestedLoopJoinExec` does not have code-gen, and we can potentially boost CPU performance for this operator by adding code-gen for it. https://databricks.com/blog/2017/02/16/processing-trillion-rows-per-second-single-machine-can-nested-loop-joins-fast.html also showed evidence of this in one fork.

The codegen for `BroadcastNestedLoopJoinExec` shares some code with `HashJoin`, so the interface `JoinCodegenSupport` is introduced to hold that common logic. This PR only supports inner and cross joins; other join types will be added in follow-up PRs.
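
To make the shared-logic point concrete, here is a rough sketch of the shape of such an interface; apart from the name `JoinCodegenSupport` itself (which the PR introduces), the member and parameter names are illustrative, not the exact Spark source:

```scala
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.execution.CodegenSupport

// Illustrative shape only: helpers needed by both HashJoin and
// BroadcastNestedLoopJoinExec when generating Java for the join, such as
// binding the build-side row's columns to codegen variables.
trait JoinCodegenSupport extends CodegenSupport {
  protected def genBuildSideVars(ctx: CodegenContext, buildRow: String): Seq[ExprCode]
}
```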

Example query and generated code:

```
val df1 = spark.range(4).select($"id".as("k1"))
val df2 = spark.range(3).select($"id".as("k2"))
df1.join(df2, $"k1" + 1 =!= $"k2").explain("codegen")
```
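
(Here `=!=` is the `Column` inequality operator; since the condition is a non-equi join and one side is small enough to broadcast, the planner picks `BroadcastNestedLoopJoin`, as the plan below shows.)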

```
== Subtree 2 / 2 (maxMethodCodeSize:282; maxConstantPoolSize:203(0.31% used); numInnerClasses:0) ==
*(2) BroadcastNestedLoopJoin BuildRight, Inner, NOT ((k1#2L + 1) = k2#6L)
:- *(2) Project [id#0L AS k1#2L]
:  +- *(2) Range (0, 4, step=1, splits=2)
+- BroadcastExchange IdentityBroadcastMode, [id=#22]
   +- *(1) Project [id#4L AS k2#6L]
      +- *(1) Range (0, 3, step=1, splits=2)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage2(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=2
/* 006 */ final class GeneratedIteratorForCodegenStage2 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private boolean range_initRange_0;
/* 010 */   private long range_nextIndex_0;
/* 011 */   private TaskContext range_taskContext_0;
/* 012 */   private InputMetrics range_inputMetrics_0;
/* 013 */   private long range_batchEnd_0;
/* 014 */   private long range_numElementsTodo_0;
/* 015 */   private InternalRow[] bnlj_buildRowArray_0;
/* 016 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[4];
/* 017 */
/* 018 */   public GeneratedIteratorForCodegenStage2(Object[] references) {
/* 019 */     this.references = references;
/* 020 */   }
/* 021 */
/* 022 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 023 */     partitionIndex = index;
/* 024 */     this.inputs = inputs;
/* 025 */
/* 026 */     range_taskContext_0 = TaskContext.get();
/* 027 */     range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 028 */     range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 029 */     range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 030 */     range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 031 */     bnlj_buildRowArray_0 = (InternalRow[]) ((org.apache.spark.broadcast.TorrentBroadcast) references[1] /* broadcastTerm */).value();
/* 032 */     range_mutableStateArray_0[3] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(2, 0);
/* 033 */
/* 034 */   }
/* 035 */
/* 036 */   private void bnlj_doConsume_0(long bnlj_expr_0_0) throws java.io.IOException {
/* 037 */     for (int bnlj_arrayIndex_0 = 0; bnlj_arrayIndex_0 < bnlj_buildRowArray_0.length; bnlj_arrayIndex_0++) {
/* 038 */       UnsafeRow bnlj_buildRow_0 = (UnsafeRow) bnlj_buildRowArray_0[bnlj_arrayIndex_0];
/* 039 */
/* 040 */       long bnlj_value_1 = bnlj_buildRow_0.getLong(0);
/* 041 */
/* 042 */       long bnlj_value_4 = -1L;
/* 043 */
/* 044 */       bnlj_value_4 = bnlj_expr_0_0 + 1L;
/* 045 */
/* 046 */       boolean bnlj_value_3 = false;
/* 047 */       bnlj_value_3 = bnlj_value_4 == bnlj_value_1;
/* 048 */       boolean bnlj_value_2 = false;
/* 049 */       bnlj_value_2 = !(bnlj_value_3);
/* 050 */       if (!(false || !bnlj_value_2))
/* 051 */       {
/* 052 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* numOutputRows */).add(1);
/* 053 */
/* 054 */         range_mutableStateArray_0[3].reset();
/* 055 */
/* 056 */         range_mutableStateArray_0[3].write(0, bnlj_expr_0_0);
/* 057 */
/* 058 */         range_mutableStateArray_0[3].write(1, bnlj_value_1);
/* 059 */         append((range_mutableStateArray_0[3].getRow()).copy());
/* 060 */
/* 061 */       }
/* 062 */     }
/* 063 */
/* 064 */   }
/* 065 */
/* 066 */   private void initRange(int idx) {
/* 067 */     java.math.BigInteger index = java.math.BigInteger.valueOf(idx);
/* 068 */     java.math.BigInteger numSlice = java.math.BigInteger.valueOf(2L);
/* 069 */     java.math.BigInteger numElement = java.math.BigInteger.valueOf(4L);
/* 070 */     java.math.BigInteger step = java.math.BigInteger.valueOf(1L);
/* 071 */     java.math.BigInteger start = java.math.BigInteger.valueOf(0L);
/* 072 */     long partitionEnd;
/* 073 */
/* 074 */     java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start);
/* 075 */     if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 076 */       range_nextIndex_0 = Long.MAX_VALUE;
/* 077 */     } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 078 */       range_nextIndex_0 = Long.MIN_VALUE;
/* 079 */     } else {
/* 080 */       range_nextIndex_0 = st.longValue();
/* 081 */     }
/* 082 */     range_batchEnd_0 = range_nextIndex_0;
/* 083 */
/* 084 */     java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice)
/* 085 */     .multiply(step).add(start);
/* 086 */     if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 087 */       partitionEnd = Long.MAX_VALUE;
/* 088 */     } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 089 */       partitionEnd = Long.MIN_VALUE;
/* 090 */     } else {
/* 091 */       partitionEnd = end.longValue();
/* 092 */     }
/* 093 */
/* 094 */     java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract(
/* 095 */       java.math.BigInteger.valueOf(range_nextIndex_0));
/* 096 */     range_numElementsTodo_0  = startToEnd.divide(step).longValue();
/* 097 */     if (range_numElementsTodo_0 < 0) {
/* 098 */       range_numElementsTodo_0 = 0;
/* 099 */     } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) {
/* 100 */       range_numElementsTodo_0++;
/* 101 */     }
/* 102 */   }
/* 103 */
/* 104 */   protected void processNext() throws java.io.IOException {
/* 105 */     // initialize Range
/* 106 */     if (!range_initRange_0) {
/* 107 */       range_initRange_0 = true;
/* 108 */       initRange(partitionIndex);
/* 109 */     }
/* 110 */
/* 111 */     while (true) {
/* 112 */       if (range_nextIndex_0 == range_batchEnd_0) {
/* 113 */         long range_nextBatchTodo_0;
/* 114 */         if (range_numElementsTodo_0 > 1000L) {
/* 115 */           range_nextBatchTodo_0 = 1000L;
/* 116 */           range_numElementsTodo_0 -= 1000L;
/* 117 */         } else {
/* 118 */           range_nextBatchTodo_0 = range_numElementsTodo_0;
/* 119 */           range_numElementsTodo_0 = 0;
/* 120 */           if (range_nextBatchTodo_0 == 0) break;
/* 121 */         }
/* 122 */         range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
/* 123 */       }
/* 124 */
/* 125 */       int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L);
/* 126 */       for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) {
/* 127 */         long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0;
/* 128 */
/* 129 */         // common sub-expressions
/* 130 */
/* 131 */         bnlj_doConsume_0(range_value_0);
/* 132 */
/* 133 */         if (shouldStop()) {
/* 134 */           range_nextIndex_0 = range_value_0 + 1L;
/* 135 */           ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localIdx_0 + 1);
/* 136 */           range_inputMetrics_0.incRecordsRead(range_localIdx_0 + 1);
/* 137 */           return;
/* 138 */         }
/* 139 */
/* 140 */       }
/* 141 */       range_nextIndex_0 = range_batchEnd_0;
/* 142 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0);
/* 143 */       range_inputMetrics_0.incRecordsRead(range_localEnd_0);
/* 144 */       range_taskContext_0.killTaskIfInterrupted();
/* 145 */     }
/* 146 */   }
/* 147 */
/* 148 */ }
```

### Why are the changes needed?

Improve query CPU performance. A micro benchmark query was added in `JoinBenchmark.scala`;
it shows a 1x run-time improvement (i.e., a 2x speedup):

```
OpenJDK 64-Bit Server VM 11.0.9+11-LTS on Linux 4.14.219-161.340.amzn2.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
broadcast nested loop join:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
broadcast nested loop join wholestage off          62922          63052         184          0.3        3000.3       1.0X
broadcast nested loop join wholestage on           30946          30972          26          0.7        1475.6       2.0X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

* Added a unit test in `WholeStageCodegenSuite.scala`, and relied on existing unit tests for `BroadcastNestedLoopJoinExec`.
* Updated golden files for several TPCDS query plans, as whole-stage code-gen for `BroadcastNestedLoopJoinExec` is now triggered.
* Updated `JoinBenchmark-jdk11-results.txt` and `JoinBenchmark-results.txt` with the new benchmark results, following previous benchmark PRs #27078 and #26003 in using the same type of machine:

```
Amazon AWS EC2
type: r3.xlarge
region: us-west-2 (Oregon)
OS: Linux
```

Closes #31736 from c21/nested-join-exec.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>