[SPARK-4092] [CORE] Fix InputMetrics for coalesce'd Rdds #3120
Changes from all commits
@@ -109,18 +109,19 @@ class NewHadoopRDD[K, V](
       logInfo("Input split: " + split.serializableHadoopSplit)
       val conf = confBroadcast.value.value

-      val inputMetrics = new InputMetrics(DataReadMethod.Hadoop)
+      val inputMetrics = context.taskMetrics
+        .getInputMetricsForReadMethod(DataReadMethod.Hadoop)

       // Find a function that will return the FileSystem bytes read by this thread. Do this before
       // creating RecordReader, because RecordReader's constructor might read some bytes
-      val bytesReadCallback = if (split.serializableHadoopSplit.value.isInstanceOf[FileSplit]) {
-        SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(
-          split.serializableHadoopSplit.value.asInstanceOf[FileSplit].getPath, conf)
-      } else {
-        None
-      }
-      if (bytesReadCallback.isDefined) {
-        context.taskMetrics.inputMetrics = Some(inputMetrics)
-      }
+      val bytesReadCallback = inputMetrics.bytesReadCallback.orElse(
+        split.serializableHadoopSplit.value match {
+          case split: FileSplit =>
+            SparkHadoopUtil.get.getFSBytesReadOnThreadCallback(split.getPath, conf)
+          case _ => None
+        }
+      )
+      inputMetrics.setBytesReadCallback(bytesReadCallback)

       val attemptId = newTaskAttemptID(jobTrackerId, id, isMap = true, split.index, 0)
       val hadoopAttemptContext = newTaskAttemptContext(conf, attemptId)
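Editorial note: the hunk above relies on the InputMetrics object being able to store and re-expose the bytes-read callback, and later (in close()) to either refresh the total from that callback or accumulate a fallback value. The following is a minimal, self-contained sketch of that idea; the class and its members are simplified stand-ins, not the actual Spark implementation.

```scala
// Hypothetical, simplified stand-in for the InputMetrics callback surface used above.
class SketchInputMetrics {
  private var _bytesRead = 0L
  private var _bytesReadCallback: Option[() => Long] = None

  // The callback, if any, that reports total FileSystem bytes read by this thread.
  def bytesReadCallback: Option[() => Long] = _bytesReadCallback
  def setBytesReadCallback(callback: Option[() => Long]): Unit = {
    _bytesReadCallback = callback
  }

  // Refresh the running total from the callback (used when the reader is closed).
  def updateBytesRead(): Unit = _bytesReadCallback.foreach(cb => _bytesRead = cb())

  // Accumulate a delta (used for the split-size fallback) rather than overwriting,
  // so several reads in one task add up instead of clobbering each other.
  def addBytesRead(bytes: Long): Unit = { _bytesRead += bytes }

  def bytesRead: Long = _bytesRead
}
```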
@@ -153,34 +154,19 @@ class NewHadoopRDD[K, V](
          throw new java.util.NoSuchElementException("End of stream")
        }
        havePair = false
-
-       // Update bytes read metric every few records
-       if (recordsSinceMetricsUpdate == HadoopRDD.RECORDS_BETWEEN_BYTES_READ_METRIC_UPDATES
-           && bytesReadCallback.isDefined) {
-         recordsSinceMetricsUpdate = 0
-         val bytesReadFn = bytesReadCallback.get
-         inputMetrics.bytesRead = bytesReadFn()
-       } else {
-         recordsSinceMetricsUpdate += 1
-       }
-
        (reader.getCurrentKey, reader.getCurrentValue)
      }
This was done intentionally to help keep the callback updates out of the …

@pwendell There is a long thread in this PR between @sryza and @kayousterhout about why we need to add the callback to the InputMetrics. The reason is to prevent clobbering between different HadoopRDDs, for example CartesianRDD; this is why there is a specific unit test for that case. I don't think we can do anything correctly if we don't have the callbacks in the InputMetrics.

Okay, that's fine then. I looked and it's all private[spark], so actually there is no change to visibility.
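Editorial note: to make the clobbering argument concrete, when two Hadoop reads happen inside one task (as in a CartesianRDD, where both parent partitions are read by the same task), each read must get the same per-task InputMetrics rather than installing its own. The sketch below illustrates that sharing behaviour with hypothetical, simplified classes; it is not the actual TaskMetrics implementation.

```scala
// Hypothetical model of per-task InputMetrics sharing.
object ReadMethod extends Enumeration { val Hadoop, Memory = Value }

class DemoInputMetrics(val readMethod: ReadMethod.Value) {
  private var _bytesRead = 0L
  def addBytesRead(bytes: Long): Unit = { _bytesRead += bytes }
  def bytesRead: Long = _bytesRead
}

class DemoTaskMetrics {
  private var _inputMetrics: Option[DemoInputMetrics] = None

  // Hand out one shared InputMetrics per read method for the whole task.
  def getInputMetricsForReadMethod(method: ReadMethod.Value): DemoInputMetrics = synchronized {
    _inputMetrics match {
      case Some(existing) if existing.readMethod == method => existing
      case _ =>
        val created = new DemoInputMetrics(method)
        _inputMetrics = Some(created)
        created
    }
  }
}

object SharedMetricsDemo extends App {
  val taskMetrics = new DemoTaskMetrics
  // Both sides of a cartesian product, computed in the same task, ask for Hadoop metrics.
  val left = taskMetrics.getInputMetricsForReadMethod(ReadMethod.Hadoop)
  val right = taskMetrics.getInputMetricsForReadMethod(ReadMethod.Hadoop)
  left.addBytesRead(100)
  right.addBytesRead(50)
  println(left.bytesRead) // 150: the two reads accumulate instead of clobbering each other
}
```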
      private def close() {
        try {
          reader.close()

          // Update metrics with final amount
          if (bytesReadCallback.isDefined) {
-           val bytesReadFn = bytesReadCallback.get
-           inputMetrics.bytesRead = bytesReadFn()
+           inputMetrics.updateBytesRead()
          } else if (split.serializableHadoopSplit.value.isInstanceOf[FileSplit]) {
            // If we can't get the bytes read from the FS stats, fall back to the split size,
            // which may be inaccurate.
            try {
-             inputMetrics.bytesRead = split.serializableHadoopSplit.value.getLength
-             context.taskMetrics.inputMetrics = Some(inputMetrics)
+             inputMetrics.addBytesRead(split.serializableHadoopSplit.value.getLength)
            } catch {
              case e: java.io.IOException =>
                logWarning("Unable to get input size to set InputMetrics for task", e)
In your next PR, can you fix this by adding an explicit return type?

This follows the method above, updateShuffleReadMetrics, which doesn't have an explicit return type either. Should I change both then?

Would be great to do that!
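Editorial note: the change being asked for is simply to annotate such methods with their return type instead of relying on inference. A small illustrative example, with hypothetical method names rather than the exact Spark signatures:

```scala
object ReturnTypeExample {
  // Before: return type left to inference
  def updateMetricsImplicit() = {
    // ...
  }

  // After: explicit Unit return type, as the reviewer asks
  def updateMetricsExplicit(): Unit = {
    // ...
  }
}
```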