fix: cost autoscaler flaky test #19594

Open
Shekharrajak wants to merge 2 commits into apache:master from Shekharrajak:fix-cost-autoscaler-flaky-test

Conversation

@Shekharrajak

Contributor

Ref. #19517

Description

Fixes flaky CostBasedAutoScalerIntegrationTest scale-up wait.

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@Shekharrajak changed the title from "Fix cost autoscaler flaky test" to "fix: cost autoscaler flaky test" on Jun 17, 2026

cluster.callApi().postSupervisor(supervisor.createSuspendedSpec());
cluster.callApi().waitForAllSegmentsToBeAvailable(dataSource, coordinator, broker);
Assertions.assertEquals("10000", cluster.runSql("SELECT COUNT(*) FROM %s", dataSource));

Contributor Author

Assert final row count from the actual number of published records.

);
}
finally {
keepPublishing.set(false);

Contributor Author

Clean up the publisher and suspend the supervisor.
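The cleanup order described in this comment can be sketched with plain `java.util.concurrent` types. This is a minimal illustration, not the PR's code: the helper name `stopPublisher` and the 30-second timeout are assumptions, and the supervisor suspension is left as a comment because the actual call is not shown in this hunk.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class PublisherCleanupSketch
{
  // Hypothetical helper: tell the loop to stop, wait for it, then release the thread.
  static void stopPublisher(
      AtomicBoolean keepPublishing,
      Future<?> publisherFuture,
      ExecutorService publisher
  ) throws Exception
  {
    keepPublishing.set(false);
    try {
      // Waiting on the future surfaces any exception thrown inside the publishing task.
      publisherFuture.get(30, TimeUnit.SECONDS);
    }
    finally {
      publisher.shutdownNow();
      // In the real test, the supervisor would be suspended at this point.
    }
  }

  public static void main(String[] args) throws Exception
  {
    final AtomicBoolean keepPublishing = new AtomicBoolean(true);
    final ExecutorService publisher = Executors.newSingleThreadExecutor();
    // Stand-in for the publishing loop: spins until the flag is cleared.
    final Future<?> publisherFuture = publisher.submit(() -> {
      while (keepPublishing.get()) {
        Thread.yield();
      }
    });
    stopPublisher(keepPublishing, publisherFuture, publisher);
    System.out.println("publisher done: " + publisherFuture.isDone());
  }
}
```

Clearing the flag before waiting on the future matters: waiting first would block until the loop hit its own batch limit.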

overlord.latchableEmitter().waitForEvent(
event -> event.hasMetricName(OPTIMAL_TASK_COUNT_METRIC)
.hasDimension(DruidMetrics.SUPERVISOR_ID, supervisor.getId())
.hasValueMatching(Matchers.greaterThan(1L))

Contributor Author

Wait until the cost-based autoscaler computes an optimal task count greater than 1. This is only the autoscaler's recommendation.

overlord.latchableEmitter().waitForEvent(
event -> event.hasMetricName("task/autoScaler/updatedCount")
.hasDimension(DruidMetrics.SUPERVISOR_ID, supervisor.getId())
.hasValueMatching(Matchers.greaterThan(1L))

Contributor Author

This is the applied scale-up event.
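To illustrate why the test waits on two separate metrics, here is a self-contained sketch that does not use Druid's emitter at all. `MetricEvent`, the queue-based `waitForEvent`, and the `OPTIMAL` metric name are all hypothetical stand-ins (the value of `OPTIMAL_TASK_COUNT_METRIC` is not shown in this conversation); only `task/autoScaler/updatedCount` comes from the diff. Unlike a real emitter, this stand-in discards events that do not match.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Predicate;

class TwoStageWaitSketch
{
  // Hypothetical stand-in for an emitted metric event.
  static final class MetricEvent
  {
    final String name;
    final long value;

    MetricEvent(String name, long value)
    {
      this.name = name;
      this.value = value;
    }
  }

  // Hypothetical name for the recommendation metric; the real constant's value is not shown.
  static final String OPTIMAL = "hypothetical/optimalTaskCount";
  static final String UPDATED = "task/autoScaler/updatedCount";

  // Consumes events until one matches the condition or the timeout elapses.
  static MetricEvent waitForEvent(
      BlockingQueue<MetricEvent> events,
      Predicate<MetricEvent> condition,
      long timeoutMillis
  ) throws InterruptedException, TimeoutException
  {
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (true) {
      final long remaining = deadline - System.nanoTime();
      final MetricEvent event = remaining > 0 ? events.poll(remaining, TimeUnit.NANOSECONDS) : null;
      if (event == null) {
        throw new TimeoutException("no matching event before timeout");
      }
      if (condition.test(event)) {
        return event;
      }
    }
  }

  public static void main(String[] args) throws Exception
  {
    final BlockingQueue<MetricEvent> events = new LinkedBlockingQueue<>();
    events.add(new MetricEvent(OPTIMAL, 1L)); // recommendation still at 1: ignored
    events.add(new MetricEvent(OPTIMAL, 3L)); // recommendation to scale up
    events.add(new MetricEvent(UPDATED, 3L)); // scale-up actually applied

    // Stage 1: the autoscaler recommends more than one task.
    waitForEvent(events, e -> e.name.equals(OPTIMAL) && e.value > 1L, 1000);
    // Stage 2: the new task count has been applied.
    final MetricEvent applied = waitForEvent(events, e -> e.name.equals(UPDATED) && e.value > 1L, 1000);
    System.out.println("applied task count: " + applied.value);
  }
}
```

Passing stage 1 alone would not show that the supervisor changed its task count, which is why the applied-count metric is checked separately.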

final AtomicInteger totalRecords = new AtomicInteger();
final ExecutorService publisher = Executors.newSingleThreadExecutor();
final Future<?> publisherFuture = publisher.submit(() -> {
for (int i = 0; i < MAX_SCALE_UP_RECORD_BATCHES && keepPublishing.get(); ++i) {

Contributor Author

Instead of publishing 10k records upfront and then waiting, the test keeps publishing records in the background while the autoscaler is running. This gives the autoscaler a stable lag signal to observe.
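The pattern can be sketched in isolation as follows. This is an assumption-laden illustration rather than the PR's implementation: `publishUntil`, the `Condition` interface, the `batchSink` stand-in for the real produce call, and both constants are hypothetical. The method returns the number of records actually published, which is the value a final row-count assertion could be derived from.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntConsumer;

class BackgroundPublisherSketch
{
  // Hypothetical limits; the PR's real constants are not shown in this conversation.
  static final int MAX_BATCHES = 1000;
  static final int RECORDS_PER_BATCH = 100;

  // Hypothetical hook for "block until the scale-up has been observed".
  interface Condition
  {
    void await() throws Exception;
  }

  // Publishes batches in the background until the condition is met or the batch cap is hit.
  static int publishUntil(IntConsumer batchSink, Condition scaleUpObserved) throws Exception
  {
    final AtomicBoolean keepPublishing = new AtomicBoolean(true);
    final AtomicInteger totalRecords = new AtomicInteger();
    final ExecutorService publisher = Executors.newSingleThreadExecutor();
    final Future<?> publisherFuture = publisher.submit(() -> {
      for (int i = 0; i < MAX_BATCHES && keepPublishing.get(); ++i) {
        batchSink.accept(RECORDS_PER_BATCH); // stand-in for the real produce call
        totalRecords.addAndGet(RECORDS_PER_BATCH);
      }
    });
    try {
      scaleUpObserved.await();
    }
    finally {
      keepPublishing.set(false);
      publisher.shutdown();
    }
    // Wait for the in-flight batch so the count below is final.
    publisherFuture.get(30, TimeUnit.SECONDS);
    return totalRecords.get();
  }

  public static void main(String[] args) throws Exception
  {
    // Pretend the scale-up is observed once three batches have been published.
    final CountDownLatch threeBatches = new CountDownLatch(3);
    final int total = publishUntil(n -> threeBatches.countDown(), threeBatches::await);
    System.out.println("expected row count: " + total);
  }
}
```

Because the publisher may complete extra batches between the condition firing and the flag being read, the total is only known after the task finishes; a hard-coded count would not be reliable here.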
