[SPARK-3377] [Metrics] Metrics can be accidentally aggregated #2250


Closed
wants to merge 23 commits

Conversation

sarutak
Member

@sarutak sarutak commented Sep 3, 2014

I'm using Spark's Codahale-based MetricsSystem with JMX and Graphite, and I've run into the following two problems.

(1) When applications with the same spark.app.name run on a cluster at the same time, their metric names get mixed together. For instance, if two or more such applications are running concurrently, each emits identically named metrics like "SparkPi.DAGScheduler.stage.failedStages", and Graphite cannot tell which application a metric came from.

(2) When two or more executors run on the same machine, their JVM metrics get mixed together. For instance, executors running on the same node each emit a metric named "jvm.memory", and Graphite cannot tell which executor a metric came from.

I think the main issue that #1067 tried to resolve is subsumed by this PR.

Closes #1067
Closes #2432
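One way out of both problems is to prefix every metric with identifiers that really are unique per application and per executor. A minimal sketch of such a naming scheme (the helper name is illustrative, not Spark's actual API at the time of this PR):

```scala
// Sketch of a collision-free registry-name scheme (illustrative, not Spark's code).
object MetricsNaming {
  // Prefix each metric with ids that are actually unique; with no ids we are
  // back to the bare source name, which is exactly what lets
  // "SparkPi.DAGScheduler.stage.failedStages" or "jvm.memory" collide.
  def buildRegistryName(appId: Option[String],
                        executorId: Option[String],
                        sourceName: String): String =
    (appId, executorId) match {
      case (Some(app), Some(exec)) => s"$app.$exec.$sourceName"
      case _                       => sourceName
    }
}
```

With a master-assigned id, two executors on one node would emit e.g. app-20140916115848-0000.1.jvm.memory and app-20140916115848-0000.2.jvm.memory, which Graphite can tell apart.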

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2250 at commit 6fc5560.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2250 at commit 6f7dcd4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2250 at commit 6fc5560.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected trait YarnAllocateResponse

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2250 at commit 6f7dcd4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2250 at commit 6f7dcd4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected trait YarnAllocateResponse

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2250 at commit 6f7dcd4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2250 at commit 15f88a3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2250 at commit 15f88a3.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2250 at commit 15f88a3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2250 at commit 15f88a3.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected trait YarnAllocateResponse

@sarutak
Member Author

sarutak commented Sep 4, 2014

retest this please.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have started for PR 2250 at commit 15f88a3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have finished for PR 2250 at commit 15f88a3.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BlockManagerMaster(
    • class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

@sarutak
Member Author

sarutak commented Sep 4, 2014

retest this please.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have started for PR 2250 at commit 15f88a3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have finished for PR 2250 at commit 15f88a3.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak sarutak changed the title [SPARK-3377] [Metrics] codahale base Metrics data between applications can jumble up together [SPARK-3377] [Metrics] Don't mix metrics from different applications Sep 4, 2014
@SparkQA

SparkQA commented Sep 12, 2014

QA tests have started for PR 2250 at commit 45bd33d.

  • This patch merges cleanly.

@sarutak sarutak changed the title [SPARK-3377] [Metrics] Don't mix metrics from different applications otherwise we cannot distinguish [SPARK-3377] [Metrics] Metrics can be accidentally aggregated Sep 12, 2014
@SparkQA

SparkQA commented Sep 12, 2014

QA tests have finished for PR 2250 at commit 45bd33d.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 13, 2014

QA tests have started for PR 2250 at commit 7b67f5a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 13, 2014

QA tests have finished for PR 2250 at commit 7b67f5a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class JavaSparkContext(val sc: SparkContext)
    • class JavaStreamingContext(val ssc: StreamingContext) extends Closeable

@davies
Contributor

davies commented Sep 15, 2014

Could you clean up the changes? It's confusing to see that a bunch of debugging changes were left in.

@sarutak sarutak force-pushed the metrics-structure-improvement branch from b5c907d to ead8966 on September 15, 2014 05:05
@sarutak
Member Author

sarutak commented Sep 15, 2014

Sorry, I've just cleaned it up.

@SparkQA

SparkQA commented Sep 15, 2014

QA tests have started for PR 2250 at commit ead8966.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 15, 2014

QA tests have finished for PR 2250 at commit ead8966.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak
Member Author

sarutak commented Sep 16, 2014

Could anyone review this when you have time?

@JoshRosen
Contributor

This seems like a good idea; I can see how the current behavior is confusing, especially since I think it might be common for multiple apps to be running with the same name (e.g. two copies of spark-shell). Do you know if there are any other places where we incorrectly treat application names as unique?

I'm not sure that calling System.currentTimeMillis() is the most intuitive way to give applications unique ids, though. The Spark master already gives applications unique ids, such as app-20140916115848-0000, and it would be nice if we used these same IDs in the metrics system.

The Master-assigned application ID is exposed through TaskScheduler.applicationId() and comes from SchedulerBackend.applicationId(); technically this method could return None but it looks like none of the current implementations do. (This id was exposed pretty recently; see #1218). This id is only available after the application is registered with the master, but that shouldn't be a problem since initDriverMetrics() is called after the task scheduler has been initialized.
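The ordering constraint above can be sketched as follows; the names here mirror, but are not, Spark's actual internals:

```scala
// Illustrative sketch of deriving the metrics id from the master-assigned
// application id, as suggested above (not Spark's actual code).
object DriverMetricsInit {
  trait Scheduler {
    // Mirrors TaskScheduler.applicationId(): the master-assigned id is only
    // meaningful once the application has registered with the master.
    def applicationId(): Option[String]
  }

  // Intended to be called only after taskScheduler.start(), so the id should
  // normally be present; the timestamp fallback covers a backend that
  // returns None, which no current implementation does.
  def driverMetricsId(scheduler: Scheduler): String =
    scheduler.applicationId()
      .getOrElse(s"spark-application-${System.currentTimeMillis()}")
}
```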

@SparkQA

SparkQA commented Sep 17, 2014

QA tests have started for PR 2250 at commit cfe8027.

  • This patch merges cleanly.

@sarutak
Member Author

sarutak commented Sep 17, 2014

@JoshRosen Thanks for your advice. I tried to use the application id in metric names and ran into some difficulties.

Problem 1: we need the application id before creating SparkEnv.
On the driver, we need the application id before creating SparkEnv, because some metrics sources are loaded and registered inside SparkEnv.create. To be exact, SparkEnv.create instantiates MetricsSystem, and the MetricsSystem constructor invokes the registerSource method, which loads sources from metrics.properties.
Unfortunately, we cannot delay creating SparkEnv until after the application id is obtained. The application id comes from SchedulerBackend (or its subclasses), but instances of SchedulerBackend cannot be created before SparkEnv; for instance, TaskSchedulerImpl needs SparkEnv, and TaskSchedulerImpl and SchedulerBackend are created at the same time.

Problem 2: it is difficult to pass the application id to executors via SparkConf.
Considering all the implementations of SchedulerBackend, we can only get the application id after invoking taskScheduler.start() in SparkContext.
But before taskScheduler.start() finishes, executors are launched and fetch SparkConf from DriverActor. In other words, executors fetch SparkConf before the application id has been set in it.

So I have two solutions.
The first is this PR, which is a compromise. In YARN cluster mode, we can get the application id from SparkConf.get("spark.yarn.app.id") before SparkEnv is created; in other modes, we use System.currentTimeMillis instead.

The second is #2432.
To register metrics sources only after the application id is available, SparkEnv neither registers metrics sources nor starts the MetricsSystem inside SparkEnv#create when the creator is a driver; instead, once the application id has been obtained, the sources are registered and the MetricsSystem is started. This addresses problem 1.

And for problem 2, when launching ExecutorBackends, the launcher passes the application id to them. This doesn't cover Mesos, because MesosSchedulerBackend doesn't return an application id, so on Mesos System.currentTimeMillis is used instead.
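The compromise in this PR can be sketched like this (the helper name is hypothetical; only the conf key comes from the discussion above):

```scala
// Sketch of the fallback in this PR: in YARN cluster mode the id is already
// in the conf before SparkEnv is created; elsewhere fall back to a timestamp,
// which is unique but not correlated with any other Spark identifier.
object MetricsAppId {
  def resolve(conf: Map[String, String]): String =
    conf.getOrElse("spark.yarn.app.id", System.currentTimeMillis.toString)
}
```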

@SparkQA

SparkQA commented Sep 17, 2014

QA tests have finished for PR 2250 at commit cfe8027.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

I feel strongly that we should use the same application ID to refer to the application in every context, since creating a different id based off of System.currentTimeMillis could be very confusing for users. As a user, I'd like to be able to grep logs / metrics / web UIs for my application data using one application id; displaying some other unique but random value is confusing because I have to compare timestamps, etc. to correlate the ids.

This is tricky, though, since we have a "chicken and egg" initialization problem, as you've described. I like the approach that you've suggested in #2432, so I'm going to continue review over there. Feel free to leave this PR open, though, so that it shows up in our PR dashboard and invites discussion; it will be automatically closed if I merge your other PR.

@asfgit asfgit closed this in 79e45c9 Oct 3, 2014
@sarutak sarutak deleted the metrics-structure-improvement branch April 11, 2015 05:22