[SPARK-3377] [Metrics] Metrics can be accidentally aggregated #2250


Closed
wants to merge 23 commits

Conversation

sarutak
Member

@sarutak sarutak commented Sep 3, 2014

I'm using Spark's Codahale-based MetricsSystem with JMX and Graphite, and I've run into the following two problems.

(1) When applications with the same spark.app.name run on a cluster at the same time, their metric names get mixed together. For instance, if two or more such applications are running concurrently, each emits identically named metrics like "SparkPi.DAGScheduler.stage.failedStages", and Graphite cannot tell which application a metric came from.

(2) When two or more executors run on the same machine, their JVM metrics get mixed together. For instance, executors running on the same node each emit a metric named "jvm.memory", and Graphite cannot tell which executor a metric came from.

I think the main issue that #1067 tried to resolve is subsumed by this PR.

Closes #1067
Closes #2432
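One way out of both problems is to prefix every metric with identifiers that really are unique per application and per executor. A minimal sketch of such a naming scheme (the helper name is illustrative, not Spark's actual API at the time of this PR):

```scala
// Sketch of a collision-free registry-name scheme (illustrative, not Spark's code).
object MetricsNaming {
  // Prefix each metric with ids that are actually unique; with no ids we are
  // back to the bare source name, which is exactly what lets
  // "SparkPi.DAGScheduler.stage.failedStages" or "jvm.memory" collide.
  def buildRegistryName(appId: Option[String],
                        executorId: Option[String],
                        sourceName: String): String =
    (appId, executorId) match {
      case (Some(app), Some(exec)) => s"$app.$exec.$sourceName"
      case _                       => sourceName
    }
}
```

With a master-assigned id, two executors on one node would emit e.g. app-20140916115848-0000.1.jvm.memory and app-20140916115848-0000.2.jvm.memory, which Graphite can tell apart.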

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2250 at commit 6fc5560.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2250 at commit 6f7dcd4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2250 at commit 6fc5560.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected trait YarnAllocateResponse

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2250 at commit 6f7dcd4.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2250 at commit 6f7dcd4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected trait YarnAllocateResponse

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2250 at commit 6f7dcd4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2250 at commit 15f88a3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2250 at commit 15f88a3.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have started for PR 2250 at commit 15f88a3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 3, 2014

QA tests have finished for PR 2250 at commit 15f88a3.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected trait YarnAllocateResponse

@sarutak
Member Author

sarutak commented Sep 4, 2014

retest this please.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have started for PR 2250 at commit 15f88a3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have finished for PR 2250 at commit 15f88a3.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class BlockManagerMaster(
    • class AttributeMap[A](baseMap: Map[ExprId, (Attribute, A)])

@sarutak
Member Author

sarutak commented Sep 4, 2014

retest this please.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have started for PR 2250 at commit 15f88a3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 4, 2014

QA tests have finished for PR 2250 at commit 15f88a3.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak sarutak changed the title [SPARK-3377] [Metrics] codahale base Metrics data between applications can jumble up together [SPARK-3377] [Metrics] Don't mix metrics from different applications Sep 4, 2014
@SparkQA

SparkQA commented Sep 12, 2014

QA tests have started for PR 2250 at commit 45bd33d.

  • This patch merges cleanly.

@sarutak sarutak changed the title [SPARK-3377] [Metrics] Don't mix metrics from different applications otherwise we cannot distinguish [SPARK-3377] [Metrics] Metrics can be accidentally aggregated Sep 12, 2014
@SparkQA

SparkQA commented Sep 12, 2014

QA tests have finished for PR 2250 at commit 45bd33d.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 13, 2014

QA tests have started for PR 2250 at commit 7b67f5a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 13, 2014

QA tests have finished for PR 2250 at commit 7b67f5a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class JavaSparkContext(val sc: SparkContext)
    • class JavaStreamingContext(val ssc: StreamingContext) extends Closeable

@davies
Contributor

davies commented Sep 15, 2014

Could you clean up the changes? It's confusing to see that a bunch of debugging changes were left in.

@sarutak sarutak force-pushed the metrics-structure-improvement branch from b5c907d to ead8966 on September 15, 2014 05:05
@sarutak
Member Author

sarutak commented Sep 15, 2014

Sorry, I've just cleaned it up.

@SparkQA

SparkQA commented Sep 15, 2014

QA tests have started for PR 2250 at commit ead8966.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Sep 15, 2014

QA tests have finished for PR 2250 at commit ead8966.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sarutak
Member Author

sarutak commented Sep 16, 2014

Could anyone review this when you have time?

@JoshRosen
Contributor

This seems like a good idea; I can see how the current behavior is confusing, especially since I think it might be common for multiple apps to be running with the same name (e.g. two copies of spark-shell). Do you know if there are any other places where we incorrectly treat application names as unique?

I'm not sure that calling System.currentTimeMillis() is the most intuitive way to give applications unique ids, though. The Spark master already gives applications unique ids, such as app-20140916115848-0000, and it would be nice if we used these same IDs in the metrics system.

The Master-assigned application ID is exposed through TaskScheduler.applicationId() and comes from SchedulerBackend.applicationId(); technically this method could return None but it looks like none of the current implementations do. (This id was exposed pretty recently; see #1218). This id is only available after the application is registered with the master, but that shouldn't be a problem since initDriverMetrics() is called after the task scheduler has been initialized.
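The ordering constraint above can be sketched as follows; the names here mirror, but are not, Spark's actual internals:

```scala
// Illustrative sketch of deriving the metrics id from the master-assigned
// application id, as suggested above (not Spark's actual code).
object DriverMetricsInit {
  trait Scheduler {
    // Mirrors TaskScheduler.applicationId(): the master-assigned id is only
    // meaningful once the application has registered with the master.
    def applicationId(): Option[String]
  }

  // Intended to be called only after taskScheduler.start(), so the id should
  // normally be present; the timestamp fallback covers a backend that
  // returns None, which no current implementation does.
  def driverMetricsId(scheduler: Scheduler): String =
    scheduler.applicationId()
      .getOrElse(s"spark-application-${System.currentTimeMillis()}")
}
```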

@SparkQA

SparkQA commented Sep 17, 2014

QA tests have started for PR 2250 at commit cfe8027.

  • This patch merges cleanly.

@sarutak
Member Author

sarutak commented Sep 17, 2014

@JoshRosen Thanks for your advice. I tried to use the application id in metric names and ran into some difficulties.

Problem 1: we need the application id before creating SparkEnv.
On the driver, we need the application id before creating SparkEnv, because some metrics sources are loaded and registered inside SparkEnv.create. To be exact, SparkEnv.create instantiates MetricsSystem, and the MetricsSystem constructor invokes the registerSource method, which loads sources from metrics.properties.
Unfortunately, we cannot delay creating SparkEnv until after the application id is obtained. The application id comes from SchedulerBackend (or its subclasses), but instances of SchedulerBackend cannot be created before SparkEnv; for instance, TaskSchedulerImpl needs SparkEnv, and TaskSchedulerImpl and SchedulerBackend are created at the same time.

Problem 2: it is difficult to pass the application id to executors via SparkConf.
Considering all the implementations of SchedulerBackend, we can only get the application id after invoking taskScheduler.start() in SparkContext.
But before taskScheduler.start() finishes, executors are launched and fetch SparkConf from DriverActor. In other words, executors fetch SparkConf before the application id has been set in it.

So I have two solutions.
The first is this PR, which is a compromise. In YARN cluster mode, we can get the application id from SparkConf.get("spark.yarn.app.id") before SparkEnv is created; in other modes, we use System.currentTimeMillis instead.

The second is #2432.
To register metrics sources only after the application id is available, SparkEnv neither registers metrics sources nor starts the MetricsSystem inside SparkEnv#create when the creator is a driver; instead, once the application id has been obtained, the sources are registered and the MetricsSystem is started. This addresses problem 1.

And for problem 2, when launching ExecutorBackends, the launcher passes the application id to them. This doesn't cover Mesos, because MesosSchedulerBackend doesn't return an application id, so on Mesos System.currentTimeMillis is used instead.
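The compromise in this PR can be sketched like this (the helper name is hypothetical; only the conf key comes from the discussion above):

```scala
// Sketch of the fallback in this PR: in YARN cluster mode the id is already
// in the conf before SparkEnv is created; elsewhere fall back to a timestamp,
// which is unique but not correlated with any other Spark identifier.
object MetricsAppId {
  def resolve(conf: Map[String, String]): String =
    conf.getOrElse("spark.yarn.app.id", System.currentTimeMillis.toString)
}
```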

@SparkQA

SparkQA commented Sep 17, 2014

QA tests have finished for PR 2250 at commit cfe8027.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor

I feel strongly that we should use the same application ID to refer to the application in every context, since creating a different id based off of System.currentTimeMillis could be very confusing for users. As a user, I'd like to be able to grep logs / metrics / web UIs for my application data using one application id; displaying some other unique but random value is confusing because I have to compare timestamps, etc. to correlate the ids.

This is tricky, though, since we have a "chicken and egg" initialization problem, as you've described. I like the approach that you've suggested in #2432, so I'm going to continue review over there. Feel free to leave this PR open, though, so that it shows up in our PR dashboard and invites discussion; it will be automatically closed if I merge your other PR.

@asfgit asfgit closed this in 79e45c9 Oct 3, 2014
@sarutak sarutak deleted the metrics-structure-improvement branch April 11, 2015 05:22