[BUG] Monitors only execute on 2 nodes #1992

@BlaiseSaunders

Describe the bug

OpenSearch executes roughly 99% of monitors on only 2 nodes: the nodes that hold the primary and replica shards of the .opendistro-alerting-config index.

On 2.18 this behaviour doesn't seem to cause too many issues. On 3.0+ it leads to regular crashes of the 2 nodes holding these shards, which often cascade into larger cluster issues.
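To confirm which nodes hold the .opendistro-alerting-config shards, something like the following could be used. It is a minimal sketch that reads the JSON output of the `_cat/shards` API. The helper names `nodes_holding_index` and `fetch_shard_rows` are made up for illustration, and the fetch helper assumes an unauthenticated HTTP endpoint, which will not match a cluster with the security plugin enabled.

```python
import json
import urllib.request

ALERTING_INDEX = ".opendistro-alerting-config"


def nodes_holding_index(shard_rows, index=ALERTING_INDEX):
    """Map node name -> list of shard roles for started shards of `index`.

    `shard_rows` is the parsed JSON returned by `_cat/shards?format=json`,
    i.e. a list of dicts with keys such as "index", "prirep", "state", "node".
    """
    holders = {}
    for row in shard_rows:
        if row.get("index") != index or row.get("state") != "STARTED":
            continue
        role = "primary" if row.get("prirep") == "p" else "replica"
        holders.setdefault(row["node"], []).append(role)
    return holders


def fetch_shard_rows(base_url, index=ALERTING_INDEX):
    """Fetch shard rows for `index` from a cluster (hypothetical helper).

    Assumption: plain HTTP without authentication or TLS verification needs.
    """
    url = f"{base_url}/_cat/shards/{index}?format=json"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)
```

For example, `nodes_holding_index(fetch_shard_rows("http://localhost:9200"))` would list the nodes to compare against the nodes seen running monitors in the logs.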

Related component

Search:Performance

To Reproduce

  1. Deploy an OpenSearch cluster with a significant number of nodes (16+)
  2. Run a high number of regularly scheduled monitors (100+, ideally 1k+) with ~10 min frequency
  3. Parse the OpenSearch logs and observe which nodes are running these monitors (for example, search for the string "Executing scheduled monitor")
  4. Observe that across 100k+ monitor runs, 99% execute on only 2 nodes in the cluster: the 2 nodes that hold the .opendistro-alerting-config primary and replica shards
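Steps 3 and 4 can be sketched roughly as below. This is not the exact tooling used for this report, only an illustration. It assumes the log lines follow the default OpenSearch layout of `[timestamp][level][logger] [node-name] message`. The regex and the function names `count_runs_per_node` and `top_n_share` are assumptions and may need adjusting for a custom log4j2 pattern.

```python
import re
from collections import Counter

# Assumed default layout: [timestamp][LEVEL][logger] [node-name] message
NODE_RE = re.compile(r"^\[[^\]]*\]\[[^\]]*\]\[[^\]]*\]\s*\[([^\]]+)\]")
MARKER = "Executing scheduled monitor"


def count_runs_per_node(lines):
    """Count log lines containing MARKER, grouped by the emitting node."""
    counts = Counter()
    for line in lines:
        if MARKER not in line:
            continue
        match = NODE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def top_n_share(counts, n=2):
    """Fraction of all counted runs that came from the n busiest nodes."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(n)) / total
```

Feeding the concatenated logs of all nodes into `count_runs_per_node` and then calling `top_n_share(counts, 2)` gives the share of runs handled by the two busiest nodes, which in the scenario described here would be about 0.99.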

This behaviour has been observed on 2.18, 3.1, and 3.3.

Expected behavior

Monitors execute on all nodes in the cluster, OR it is easy to increase the shard count of .opendistro-alerting-config.

Additional Details

Plugins
Standard RPM install of OpenSearch exhibits this issue

Screenshots
Cannot provide

Host/Environment (please complete the following information):

  • OS: RHEL
  • Version: 2.18, 3.1, 3.3, untested on others

Additional context
N/A
