[SPARK-2261] Make event logger use a single file. #1222

Closed
vanzin wants to merge 29 commits from the hist-server-single-log branch

Conversation

vanzin
Contributor

@vanzin vanzin commented Jun 25, 2014

Currently the event logger uses a directory and several files to
describe an app's event log, all but one of which are empty. This
is not very HDFS-friendly, since creating lots of nodes in HDFS
(especially when they don't contain any data) is frowned upon due
to the node metadata being kept in the NameNode's memory.

Instead, add a header section to the event log file that contains metadata
needed to read the events. This metadata includes things like the Spark
version (for future code that may need it for backwards compatibility) and
the compression codec used for the event data.

With the new approach, aside from reducing the load on the NN, there are
also far fewer remote calls needed when reading the log directory.
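A minimal sketch of that layout, assuming an illustrative KEY=VALUE header (the real header's keys and encoding may differ; the end marker string is the one discussed later in this thread, and GZIP stands in for Spark's codecs):

import java.io.{ByteArrayOutputStream, PrintWriter}
import java.util.zip.GZIPOutputStream

val out = new ByteArrayOutputStream()

// Header: always uncompressed, so any reader can discover how to
// interpret the rest of the file. The key names are made up here.
val header = new PrintWriter(out)
header.println("SPARK_VERSION=1.2.0")
header.println("COMPRESSION_CODEC=gzip")
header.println("=== HEADER END MARKER ===")
header.flush()

// Event data: everything after the header goes through the codec
// that the header names.
val events = new PrintWriter(new GZIPOutputStream(out))
events.println("""{"Event":"SparkListenerApplicationStart"}""")
events.close()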

@vanzin
Contributor Author

vanzin commented Jun 25, 2014

Note this change makes the HS URLs a little ugly. #1218 fixes that (when both are merged).

@AmplabJenkins

Can one of the admins verify this patch?

@andrewor14
Contributor

add to whitelist

@SparkQA

SparkQA commented Jul 28, 2014

QA tests have started for PR 1222. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17286/consoleFull

logCheckingThread.start()

// Treat 0 as "disable the background thread", mostly for testing.
if (UPDATE_INTERVAL_MS > 0) {
Contributor


Doesn't an interval of 0 technically mean it's constantly checking? It might make sense to make the test value MAX_INTEGER or something huge.

@andrewor14
Contributor

@vanzin It seems that with the changes in this PR the new filename will look something like

app-spark-shell-201407281106.1.0.0.LZFCompression.inprogress

This is pretty hard to read, and the file name regex becomes somewhat complicated. If the app name is very long and we decide to add more fields in the future, we may hit a different limitation of the file system: the maximum length of a file name. I think what we want to do instead is have two files, (1) the event log and (2) the metadata, rather than trying to encode all the information in the file names (as we have done even before your PR).

The other thing is that I notice you pass a lot of nulls around. It would be good if you could avoid these when initializing variables or returning from methods, either through Options or other more concise ways of rewriting the logic.
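For illustration, the kind of rewrite being asked for might look like this (these names are made up, not from the patch):

// Prefer Option over null for values that may legitimately be absent.
def codecName(header: Map[String, String]): Option[String] =
  header.get("COMPRESSION_CODEC")               // None instead of null

// Callers then handle the absent case explicitly:
val display = codecName(Map.empty).getOrElse("uncompressed")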

Also, if you could up-merge this once you have a chance that would be great. Thanks.

@SparkQA

SparkQA commented Jul 28, 2014

QA results for PR 1222:
- This patch FAILED unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17286/consoleFull

@vanzin
Contributor Author

vanzin commented Jul 29, 2014

Yeah, encoding attributes in the file name is not optimal. But I was trying to avoid writing a separate resource containing the metadata, since that means extra round trips to HDFS. I was thinking that extended attributes (HDFS 2.5) could be a solution, but they still require a separate call to HDFS to retrieve them, and we still need to support older versions. I need to think more about what a good solution would be here.

Regarding FileLogger, I removed it because it only had one user, and the only functionality it really provided on top of the raw streams (the "new file with incremented index" method) is not needed anymore. So I didn't see the need to keep it around - it's not really providing anything useful anymore.

@vanzin vanzin changed the title [SPARK-2261] Make event logger use a single file. [wip][SPARK-2261] Make event logger use a single file. Aug 14, 2014
Marcelo Vanzin added 9 commits September 15, 2014 15:41
Currently the event logger uses a directory and several files to
describe an app's event log, all but one of which are empty. This
is not very HDFS-friendly, since creating lots of nodes in HDFS
(especially when they don't contain any data) is frowned upon due
to the node metadata being kept in the NameNode's memory.

Instead, all the metadata needed for the app log file can be
encoded in the file name itself. (HDFS is adding extended attributes
which could be used for this, but we need to support older versions.)

This change implements that approach, and also gets rid of FileLogger,
which was only used by EventLoggingListener and the little functionality
it provided can be much more concisely implemented inside the listener
itself.

With the new approach, aside from reducing the load on the NN, there are
also far fewer remote calls needed when reading the log directory.

Spark 1.0 will generate log directories instead of single log
files for applications; so it's nice to have the history server
understand both styles.

Work around a SparkUI issue where the name to show has to be
provided in the constructor.

Also remove explicit flushes from logging code, since they're
not really useful now that the HS only reads data from finished
apps (and the API used does not exist in Hadoop trunk).
Checking that events have been written to the log file while the
logger is running is brittle; instead, check that expected events
show up in the file after the job is done, since that's really
the functionality we care about.

Also add another name parsing test, just for completeness.

Actual fixes don't matter much since I will be changing a lot of this
code anyway.

That makes file names too complicated and makes it harder to add more
metadata later. Instead, change the log format so that it has a header
containing the metadata.

The header is always uncompressed, while the data after the header may
be compressed. EventLoggingListener provides two methods that help in
making sense of the new log files. It also avoids exposing too many
details about what goes on under the hood, so overall it's a better
interface than before.

The header is also binary, so just "cat"ing the log file will probably
produce some garbage in the output. But that was the case with
compressed logs anyway. The log format is not supposed to be public
in any case, as far as I can tell.

Note that while possible, the code does not add extra metadata that is
currently missing, such as compression block sizes. That's pretty trivial
to add later, though.
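As a rough usage sketch of those helper methods, inferred from a call site quoted later in this review (eventLogFile and hadoopConf are assumed to be in scope; treat the details as approximate):

import org.apache.hadoop.fs.Path

// openEventLog appears to parse the header and hand back a stream
// positioned at the event data, with compression already unwrapped,
// along with the Spark version recorded in the header.
val fs = Utils.getHadoopFileSystem(eventLogFile, hadoopConf)
val (logInput, sparkVersion) = EventLoggingListener.openEventLog(new Path(eventLogFile), fs)
try {
  // feed logInput's JSON lines to a ReplayListenerBus here
} finally {
  logInput.close()
}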
@vanzin vanzin force-pushed the hist-server-single-log branch from 508f028 to 16661a3 Compare September 17, 2014 00:06
@vanzin vanzin changed the title [wip][SPARK-2261] Make event logger use a single file. [SPARK-2261] Make event logger use a single file. Sep 17, 2014
@SparkQA

SparkQA commented Sep 17, 2014

QA tests have started for PR 1222 at commit 16661a3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 17, 2014

QA tests have finished for PR 1222 at commit 16661a3.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 19, 2014

Test build #24617 has finished for PR 1222 at commit c7e6123.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24617/

@andrewor14
Contributor

Hey, here's a thought: why don't we pass an iterator of strings to the ReplayListenerBus instead of a stream? The replay bus is always going to read JSON line by line, right? If we do that, then we can just do a simple readLine() with BufferedReader without worrying about the length of the header, buffering, and other low-level concerns.

@vanzin
Contributor Author

vanzin commented Dec 19, 2014

I don't follow. How would that work? How would you apply the compression codec to the BufferedReader instance?

@andrewor14
Contributor

Ah yes, that won't work straight out of the box when there's compression. However, I still think it makes sense to pass it an iterator rather than an input stream. What we could do is read lines manually without buffering, e.g. something like the following

def readLine(in: java.io.InputStream): String = {
  val line = new StringBuilder()
  var b = in.read()
  while (b != '\n' && b != -1) { // stop at newline (10) or end of stream (-1)
    line.append(b.toChar)
    b = in.read()
  }
  line.toString
}

Then when the return value equals === HEADER END MARKER ===, we wrap the stream in a buffered stream and then a compressed stream (as we already do in ReplayListenerBus) and extract the lines from this into an iterator, e.g.

val fstream = // file stream after reading in the header end marker
val bstream = new BufferedInputStream(fstream)
val cstream = codec.compressedInputStream(bstream)
val jsonEvents: Iterator[String] = scala.io.Source.fromInputStream(cstream).getLines
val replayBus = new ReplayListenerBus(jsonEvents)

I tried this locally and it does what I expect. This doesn't seem super complicated to me.

@vanzin
Contributor Author

vanzin commented Dec 19, 2014

Ok, I'll do the manual line parsing. I don't really see the readability benefit you see here, and would much prefer the simpler code, but this review has already gone on for too long.

As for the ReplayListenerBus interface, it should probably take an Iterable[SparkListenerEvent], not any kind of input stream...

@SparkQA

SparkQA commented Dec 19, 2014

Test build #24659 has started for PR 1222 at commit cc8f5de.

  • This patch merges cleanly.

@andrewor14
Contributor

The event logs are actually user-facing to a certain extent, so I think it's important to present a nice format to the user in case they want to consume the logs independently of the history server or the standalone master. I think it's worth the extra code to keep this format user-friendly, and IMO the extra logic is fairly simple to reason about.

@vanzin
Contributor Author

vanzin commented Dec 19, 2014

in case they want to consume the logs independently of the history server

But that's actually part of my point. A format that is simpler to parse (length + bytes) is easier to consume externally. The line-based approach is only better for humans, who I don't think are the main target of these files.

In any case, moot point, since I implemented what you suggest anyway.
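For reference, the "length + bytes" framing described above would look roughly like this (a sketch only; the merged code uses the line-based approach instead):

import java.io.{DataInputStream, DataOutputStream}

def writeFrame(out: DataOutputStream, payload: Array[Byte]): Unit = {
  out.writeInt(payload.length)    // 4-byte length prefix
  out.write(payload)
}

def readFrame(in: DataInputStream): Array[Byte] = {
  val len = in.readInt()
  val buf = new Array[Byte](len)
  in.readFully(buf)               // read exactly len bytes, no delimiter scanning
  buf
}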

@SparkQA

SparkQA commented Dec 20, 2014

Test build #24659 has finished for PR 1222 at commit cc8f5de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24659/

}

val in = new BufferedInputStream(fs.open(log))
def readLine() = {
Contributor


When I merge this I'll add a big comment here on why this is necessary (no action needed on your part)

@andrewor14
Contributor

My point is that if we leave random bytes in the header then it's impossible for the user to write code to parse it themselves without digging into our code. I actually think we should expose the parsing code as a developer API once we have a stable format, in which case either of our designs is fine.

Anyway, I'm merging this into master. I'll add a few comments myself, but thanks for keeping this one updated for a long time.

@JoshRosen
Contributor

I think that this PR broke the Hadoop 1 Maven build (link):

[INFO] --- scala-maven-plugin:3.2.0:testCompile (scala-test-compile-first) @ spark-core_2.10 ---
[WARNING] Zinc server is not available at port 3030 - reverting to normal incremental compile
[INFO] Using incremental compilation
[INFO] compiler plugin: BasicArtifact(org.scalamacros,paradise_2.10.4,2.0.1,null)
[INFO] Compiling 124 Scala sources and 4 Java sources to /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/1.0.4/label/centos/core/target/scala-2.10/test-classes...
[ERROR] /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/1.0.4/label/centos/core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala:68: value isFile is not a member of org.apache.hadoop.fs.FileStatus
[ERROR]     assert(logStatus.isFile)
[ERROR]                      ^
[ERROR] /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/1.0.4/label/centos/core/src/test/scala/org/apache/spark/scheduler/EventLoggingListenerSuite.scala:72: value isFile is not a member of org.apache.hadoop.fs.FileStatus
[ERROR]     assert(fileSystem.getFileStatus(new Path(eventLogger.logPath)).isFile())
[ERROR]                                                                    ^
[ERROR] /home/jenkins/workspace/Spark-Master-Maven-pre-YARN/hadoop.version/1.0.4/label/centos/core/src/test/scala/org/apache/spark/scheduler/ReplayListenerSuite.scala:115: value isFile is not a member of org.apache.hadoop.fs.FileStatus
[ERROR]     assert(eventLog.isFile)
[ERROR]                     ^
[ERROR] three errors found

@vanzin
Contributor Author

vanzin commented Dec 21, 2014

Hmmm. I'll look at this Monday morning. It's not breaking anything for Jenkins, right?

@JoshRosen
Contributor

@vanzin It didn't break the pull request builder, but it did cause the Hadoop 1.x master builds to fail. There's a fix in #3754 which I've merged, so this should be fixed now.
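For context, a Hadoop-1-compatible phrasing of those assertions might look like this (a guess at the shape of the fix; see #3754 for the actual change):

// FileStatus.isFile only exists in Hadoop 2+; Hadoop 1 exposes isDir,
// and FileSystem.isFile(Path) is available in both lines.
assert(!fileSystem.getFileStatus(new Path(eventLogger.logPath)).isDir)
assert(fileSystem.isFile(new Path(eventLogger.logPath)))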

@vanzin
Contributor Author

vanzin commented Dec 22, 2014

Thanks Josh!

try {
val replayBus = new ReplayListenerBus(eventLogPaths, fileSystem, compressionCodec)
val fs = Utils.getHadoopFileSystem(eventLogFile, hadoopConf)
val (logInput, sparkVersion) = EventLoggingListener.openEventLog(new Path(eventLogFile), fs)
Contributor


@vanzin @andrewor14 I can no longer see the UI for "Completed Applications". The eventLogFile is still a base directory path like "hdfs://****/1.3.0/"; maybe we still need EventLoggingListener.getLogPath(eventLogFile, app.id)? I think we should rename the current eventLogFile back to eventLogDir, and create a new eventLogFile variable using getLogPath.

[screenshot: no-ui-for-application]
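What the comment proposes, as a sketch (the config key below is an assumption, not from the patch):

// Keep the base directory, and derive the per-application file from it.
val eventLogDir = conf.get("spark.eventLog.dir")   // assumed config key
val eventLogFile = EventLoggingListener.getLogPath(eventLogDir, app.id)
val fs = Utils.getHadoopFileSystem(eventLogFile, hadoopConf)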

@vanzin
Contributor Author

vanzin commented Dec 23, 2014

@Earne If that's the master UI, there's a PR to handle that bug (#3755).

If it's the history server, I'll take a look tomorrow morning. The code was working fine last time I tested it, but recent changes may have affected it.

@Earne
Contributor

Earne commented Dec 23, 2014

@vanzin This bug happened in the master UI. Sorry, I didn't see #3755; I did something similar to #3755 in my local repo. Thanks.

@vanzin vanzin deleted the hist-server-single-log branch December 23, 2014 18:53