Description
Is your feature request related to a problem?
Sometimes we see spans that should take milliseconds take multiple seconds, and it's difficult to know whether the operation was just slow or the JVM was not responding. This is often caused by GC (which we could correlate, but it's not particularly easy), but GC is not the only reason for pauses. It'd be great to track the pause time, which would catch stop-the-world GC time and other hiccups.
Describe the solution you'd like
I'd like to have a metric that shows when the JVM is not responsive, for how long (was it 100ms or 10 seconds?), and how often.
There are some existing implementations that revolve around the same idea: sleep for a fixed interval and measure how much longer than requested you actually slept (e.g. if you sleep for 10ms but wake up 3 seconds later, you know something's up):
https://github.com/giltene/jHiccup/blob/master/src/main/java/org/jhiccup/HiccupMeter.java
https://github.com/apache/zookeeper/blob/c74658d398cdc1d207aa296cb6e20de00faec03e/zookeeper-server/src/main/java/org/apache/zookeeper/server/util/JvmPauseMonitor.java
https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/util/JvmPauseMonitor.java
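To make the idea concrete, here is a minimal sketch of the sleep-and-measure approach. It is not taken from any of the projects linked above, and it is not an existing API of this agent: the class name PauseMonitorSketch, the threshold parameter, and the three counters (count, total, max) are all assumptions chosen for illustration. A real implementation would presumably publish these values through whatever metrics mechanism the agent already uses.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a background task that sleeps for a fixed interval and
// records how much longer than requested each sleep actually took.
final class PauseMonitorSketch implements Runnable {
    private final long sleepMillis;      // requested sleep per iteration (assumed, e.g. 10)
    private final long thresholdMillis;  // overshoots below this are ignored (assumed, e.g. 100)
    private final AtomicLong pauseCount = new AtomicLong();
    private final AtomicLong totalPauseMillis = new AtomicLong();
    private final AtomicLong maxPauseMillis = new AtomicLong();

    public PauseMonitorSketch(long sleepMillis, long thresholdMillis) {
        this.sleepMillis = sleepMillis;
        this.thresholdMillis = thresholdMillis;
    }

    // Extra time spent beyond the requested sleep, never negative.
    public static long overshootMillis(long startNanos, long endNanos, long sleepMillis) {
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMillis - sleepMillis);
    }

    // Counts an overshoot as a pause only if it reaches the threshold.
    public void record(long overshootMillis) {
        if (overshootMillis < thresholdMillis) {
            return;
        }
        pauseCount.incrementAndGet();
        totalPauseMillis.addAndGet(overshootMillis);
        maxPauseMillis.accumulateAndGet(overshootMillis, Math::max);
    }

    @Override
    public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            long start = System.nanoTime(); // monotonic clock, unaffected by wall-clock changes
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop
                return;
            }
            record(overshootMillis(start, System.nanoTime(), sleepMillis));
        }
    }

    public long getPauseCount() { return pauseCount.get(); }
    public long getTotalPauseMillis() { return totalPauseMillis.get(); }
    public long getMaxPauseMillis() { return maxPauseMillis.get(); }
}
```

It could be started on a daemon thread, for example with new Thread(new PauseMonitorSketch(10, 100), "pause-monitor"), calling setDaemon(true) before start(). Note that this measures any delay in waking the thread, so besides GC it also picks up things like OS scheduling delays, which is the point of the request.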
Describe alternatives you've considered
I considered using -XX:+PrintGCApplicationStoppedTime, but I would need to parse the log, and it's quite verbose, while I'd like just a few metrics that are easy to graph.