Cache Workers getting stuck?

I first saw this problem in version 2.3.1, via VisualVM: after a few hours, a bunch of ForkJoinWorkers would be stuck in the WriteBuffer.poll() function in the RUNNABLE state. I even hit an OOM when the CPU utilization became too high and my events started to back up. I recently upgraded to 2.3.3 and thought the problem was fixed until I took the screenshots below.

When I started up the application (10/19, 7:34 am) there was one FJW in WriteBuffer.poll(), which is fine.
<img width="998" alt="screen shot 2016-10-19 at 7 34 41 am" src="https://cloud.githubusercontent.com/assets/14983089/19569558/0a6ad1bc-96ab-11e6-9706-3505dd75cae3.png">

This morning (10/20, 9:34 am) it is in a constant RUNNABLE state. The number of workers stuck in this RUNNABLE state will also slowly grow over time, because I saw that behavior with 2.3.1.
<img width="993" alt="screen shot 2016-10-20 at 9 34 02 am" src="https://cloud.githubusercontent.com/assets/14983089/19569552/01d2bcfe-96ab-11e6-8d04-0b0936cf43a9.png">
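
To track this over time without relying on VisualVM screenshots, the same observation could be sampled from inside the JVM. The following is only a minimal sketch, not part of Caffeine: the class `StuckWorkerProbe` and the method `countRunnableIn` are hypothetical names, and the `"WriteBuffer"` / `"poll"` strings are simply taken from the stack frames described above. It counts live threads that are RUNNABLE and have a matching frame anywhere in their stack.

```java
import java.util.Map;

// Hypothetical helper (not part of Caffeine): samples all live threads once.
final class StuckWorkerProbe {

    /**
     * Counts threads in RUNNABLE state whose stack trace contains a frame
     * whose class name contains classNameFragment and whose method name
     * equals methodName.
     */
    static int countRunnableIn(String classNameFragment, String methodName) {
        int count = 0;
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            if (entry.getKey().getState() != Thread.State.RUNNABLE) {
                continue; // only RUNNABLE threads are of interest here
            }
            for (StackTraceElement frame : entry.getValue()) {
                if (frame.getClassName().contains(classNameFragment)
                        && frame.getMethodName().equals(methodName)) {
                    count++;
                    break; // count each thread at most once
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Frame names assumed from the report above.
        System.out.println(countRunnableIn("WriteBuffer", "poll"));
    }
}
```

Logging this count periodically next to the CPU metrics would show whether the number of workers sitting in that frame really grows. A single sample cannot tell a stuck thread from one that is briefly doing legitimate work, so only the trend across many samples is meaningful.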

The CPU utilization and CPU load metrics in Grafana show an overall increase in work, as if the workers are busy-waiting/spin-looping or something.

There are maybe 1k or so instances of the cache data structure in the application. This server has the most instances of the cache, and the issue appears to manifest faster on it than on the other servers. All of the servers have a 64GB heap.

I will have to switch back to Guava considering this is a production application.

Thanks for your effort though!


Cache Workers getting stuck? #127
