[SPARK-2261] Make event logger use a single file. #1222
Conversation
Note this change makes the HS URLs a little ugly. #1218 fixes that (when both are merged).
Can one of the admins verify this patch?
add to whitelist
QA tests have started for PR 1222. This patch DID NOT merge cleanly!
logCheckingThread.start()

// Treat 0 as "disable the background thread", mostly for testing.
if (UPDATE_INTERVAL_MS > 0) {
Doesn't an interval of 0 technically mean it's constantly checking? It might make sense to make the test value Integer.MAX_VALUE or something huge.
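For reference, a minimal sketch of the semantics being debated; the names come from the diff above, but the thread body here is a stand-in, not the history server's actual loop:

```scala
// Sketch only: 0 is used as a sentinel meaning "disable the thread".
val UPDATE_INTERVAL_MS: Long = 0

val logCheckingThread = new Thread(new Runnable {
  override def run(): Unit = while (true) {
    Thread.sleep(UPDATE_INTERVAL_MS)  // poll for new logs after each sleep
  }
})

// With 0 the thread never starts, so it can't mean "check constantly";
// the alternative suggested above is a huge value such as Long.MaxValue,
// which starts the thread but leaves it effectively sleeping forever.
if (UPDATE_INTERVAL_MS > 0) {
  logCheckingThread.start()
}
```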
@vanzin It seems that with the changes in this PR the new filename will look something like
This is pretty hard to read, and the file name regex becomes somewhat complicated. If the app name is very long and we decide to add more fields in the future, we may hit a different limitation of the file system, i.e. the length of the file name. I think what we want to do instead is have two files: (1) the event log and (2) the metadata, instead of trying to encode all the information in the file names (as we have done even before your PR).

The other thing is that I notice you pass a lot of

Also, if you could up-merge this once you have a chance, that would be great. Thanks.
QA results for PR 1222:
Yeah, encoding attributes in the file name is not optimal. But I was trying to avoid having to write a separate resource containing the metadata, to avoid too many round trips to HDFS. I was thinking that extended attributes (HDFS 2.5) could be a solution, but they still require a separate call to HDFS to retrieve them, and we still need to support older versions. I need to think more about this to see what a good solution would be here.

Regarding FileLogger, I removed it because it only had one user, and the only functionality it really provided on top of the raw streams (the "new file with incremented index" method) is not needed anymore. So I didn't see the need to keep it around; it's not really providing anything useful anymore.
Currently the event logger uses a directory and several files to describe an app's event log, all but one of which are empty. This is not very HDFS-friendly, since creating lots of nodes in HDFS (especially when they don't contain any data) is frowned upon due to the node metadata being kept in the NameNode's memory. Instead, all the metadata needed for the app log file can be encoded in the file name itself. (HDFS is adding extended attributes which could be used for this, but we need to support older versions.)

This change implements that approach, and also gets rid of FileLogger, which was only used by EventLoggingListener; the little functionality it provided can be much more concisely implemented inside the listener itself. With the new approach, aside from reducing the load on the NN, there are also far fewer remote calls needed when reading the log directory.
Spark 1.0 will generate log directories instead of single log files for applications, so it's nice to have the history server understand both styles.
Work around a SparkUI issue where the name to show has to be provided in the constructor. Also remove explicit flushes from logging code, since they're not really useful now that the HS only reads data from finished apps (and the API used does not exist in Hadoop trunk).
Checking that events have been written to the log file while the logger is running is brittle; instead, check that expected events show up in the file after the job is done, since that's really the functionality we care about. Also add another name parsing test, just for completeness.
Actual fixes don't matter much since I will be changing a lot of this code anyway.
That makes file names too complicated and makes it harder to add more metadata later. Instead, change the log format so that it has a header containing the metadata. The header is always uncompressed, while the data after the header may be compressed. EventLoggingListener provides two methods that help in making sense of the new log files. It also avoids exposing too many details about what goes on under the hood, so overall it's a better interface than before.

The header is also binary, so just "cat"ing the log file will probably result in some garbage in the output. But that was the case with compressed logs anyway, and the log format is not supposed to be public in any case, as far as I can tell.

Note that while possible, the code does not add extra metadata that is currently missing, such as compression block sizes. That's pretty trivial to add later, though.
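A rough illustration of that layout; this is only a sketch with assumed names and framing, not the format Spark actually writes: an uncompressed, length-prefixed key=value header, followed by event data that may be wrapped in a compression codec.

```scala
import java.io.{DataOutputStream, OutputStream}
import java.nio.charset.StandardCharsets

// Sketch only: write metadata as an uncompressed, length-prefixed header.
// The binary length prefix is why "cat"ing the file shows some garbage.
def writeHeader(out: OutputStream, meta: Map[String, String]): Unit = {
  val bytes = meta.map { case (k, v) => s"$k=$v" }
    .mkString("\n")
    .getBytes(StandardCharsets.UTF_8)
  val dout = new DataOutputStream(out)
  dout.writeInt(bytes.length)
  dout.write(bytes)
  dout.flush()
}

// Everything after the header may then be wrapped in a codec, e.g.:
//   val eventOut = codecOpt.map(_.compressedOutputStream(out)).getOrElse(out)
```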
Force-pushed from 508f028 to 16661a3.
QA tests have started for PR 1222 at commit
QA tests have finished for PR 1222 at commit
Test build #24617 has finished for PR 1222 at commit
Test PASSed.
Hey here's a thought, why don't we pass in an iterator of string to the
I don't follow. How would that work? How would you apply the compression codec to the BufferedReader instance?
Ah yes, that won't work straight out of the box when there's compression. However, I still think it makes sense to pass it an iterator rather than an input stream. What we could do is read lines manually without buffering, e.g. something like the following
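The snippet itself didn't survive in this thread; a minimal sketch of the idea, assuming plain UTF-8 text, might look like:

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.nio.charset.StandardCharsets

// Read one byte at a time so nothing past the newline is consumed; the
// stream then sits exactly where a compression codec can take over.
def readLine(in: InputStream): String = {
  val buf = new ByteArrayOutputStream()
  var b = in.read()
  while (b != -1 && b != '\n') {
    buf.write(b)
    b = in.read()
  }
  new String(buf.toByteArray, StandardCharsets.UTF_8)
}
```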
Then when the return value equals
I tried this locally and it does what I expect. This doesn't seem super complicated to me.
Ok, I'll do the manual line parsing. I don't really see the benefits in readability that you see here, and would much rather prefer the simpler code, but this review has already gone on for too long. As for the ReplayListenerBus interface, it should probably take an
Test build #24659 has started for PR 1222 at commit
The event logs are actually user facing to a certain extent so I think it's important to present a nice format to the user, in case they want to consume the logs independently of the history server or the standalone master. I think it's worth the extra code to keep this exterior user-friendly, and IMO the extra logic is fairly simple to reason about.
But that's actually part of my point. A format that is simpler to parse (length + bytes) is easier to consume externally. The line-based approach is only better for humans, which I don't think are the main target of these files. In any case, moot point, since I implemented what you suggest anyway.
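To make the comparison concrete, the "length + bytes" framing amounts to something like the following; this is illustrative only, not Spark's actual on-disk format:

```scala
import java.io.{DataInputStream, InputStream}
import java.nio.charset.StandardCharsets

// A length-prefixed record is one readInt plus one readFully; an external
// consumer never has to worry about newlines inside the payload.
def readFrame(in: InputStream): String = {
  val din = new DataInputStream(in)
  val len = din.readInt()
  val payload = new Array[Byte](len)
  din.readFully(payload)
  new String(payload, StandardCharsets.UTF_8)
}
```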
Test build #24659 has finished for PR 1222 at commit
Test PASSed.
}

val in = new BufferedInputStream(fs.open(log))
def readLine() = {
When I merge this I'll add a big comment here on why this is necessary (no action needed on your part)
My point is that if we leave random bytes in the header then it's impossible for the user to write code to parse it themselves without digging into our code. I actually think we should expose the parsing code as a developer API once we have a stable format, in which case either of our designs is fine.

Anyway, I'm merging this into master. I'll add a few comments myself, but thanks for keeping this one updated for a long time.
I think that this PR broke the Hadoop 1 Maven build (link):

[INFO] --- scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first) @ spark-core_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
[INFO] Using incremental compilation
[INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
[INFO] Compiling 124 Scala sources and 4 Java sources to /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/1.0.4/label/centos/core/target/scala-2.10/test-classes...
[ERROR] /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/1.0.4/label/centos/core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala:68: value isFile is not a member of org.apache.hadoop.fs.FileStatus
[ERROR] assert(logStatus.isFile)
[ERROR] ^
[ERROR] /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/1.0.4/label/centos/core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala:72: value isFile is not a member of org.apache.hadoop.fs.FileStatus
[ERROR] assert(fileSystem.getFileStatus(new Path(eventLogger.logPath)).isFile())
[ERROR] ^
[ERROR] /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/1.0.4/label/centos/core/src/test/scala/org/apache/spark/scheduler/ReplayListenerSuite.scala:115: value isFile is not a member of org.apache.hadoop.fs.FileStatus
[ERROR] assert(eventLog.isFile)
[ERROR] ^
[ERROR] three errors found
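For context: `FileStatus.isFile()` only exists in Hadoop 2, while `isDir()` exists in both 1.x and 2.x (deprecated in 2.x). A hedged sketch of a version-safe check, using the names from the error output above; the actual follow-up fix may differ:

```scala
// Negating isDir avoids the isFile() method that is missing on Hadoop 1:
assert(!logStatus.isDir)
assert(!fileSystem.getFileStatus(new Path(eventLogger.logPath)).isDir)
```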
Hmmm. I'll look at this Monday morning. It's not breaking anything for Jenkins, right?
Thanks Josh!
try {
  val replayBus = new ReplayListenerBus(eventLogPaths, fileSystem, compressionCodec)
  val fs = Utils.getHadoopFileSystem(eventLogFile, hadoopConf)
  val (logInput, sparkVersion) = EventLoggingListener.openEventLog(new Path(eventLogFile), fs)
@vanzin @andrewor14 Cannot see the "Completed Applications" UI now. The eventLogFile is still a base dir path like "hdfs://****/1.3.0/"; maybe we still need EventLoggingListener.getLogPath(eventLogFile, app.id)?

I think we should rename the current eventLogFile back to eventLogDir, and make a new var eventLogFile by using getLogPath.
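A sketch of what that suggestion would look like, with variable names as proposed above; the exact signatures are assumptions based on the snippet earlier in the thread:

```scala
// Keep the configured directory, derive the per-application file from it.
val eventLogDir = "hdfs://****/1.3.0/"  // base dir, as reported above
val eventLogFile = EventLoggingListener.getLogPath(eventLogDir, app.id)
val fs = Utils.getHadoopFileSystem(eventLogFile, hadoopConf)
val (logInput, sparkVersion) =
  EventLoggingListener.openEventLog(new Path(eventLogFile), fs)
```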
Currently the event logger uses a directory and several files to
describe an app's event log, all but one of which are empty. This
is not very HDFS-friendly, since creating lots of nodes in HDFS
(especially when they don't contain any data) is frowned upon due
to the node metadata being kept in the NameNode's memory.
Instead, add a header section to the event log file that contains metadata
needed to read the events. This metadata includes things like the Spark
version (for future code that may need it for backwards compatibility) and
the compression codec used for the event data.
With the new approach, aside from reducing the load on the NN, there are
also far fewer remote calls needed when reading the log directory.
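A read-side counterpart to the sketch earlier in this thread, again with assumed names and framing rather than Spark's actual API: parse the uncompressed header, then hand the rest of the stream to whatever codec the header names.

```scala
import java.io.{DataInputStream, InputStream}
import java.nio.charset.StandardCharsets

// Sketch only: read the length-prefixed header, parse key=value metadata,
// and return the stream positioned at the (possibly compressed) event data.
def readHeader(in: InputStream): (Map[String, String], InputStream) = {
  val din = new DataInputStream(in)
  val len = din.readInt()
  val buf = new Array[Byte](len)
  din.readFully(buf)
  val meta = new String(buf, StandardCharsets.UTF_8)
    .split("\n")
    .map { line =>
      val Array(k, v) = line.split("=", 2)
      k -> v
    }.toMap
  // meta.get("codec") would then select e.g. codec.compressedInputStream(in)
  (meta, in)
}
```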