feat: Avoid small batches in Exchange #12010
Conversation
This pull request was exported from Phabricator. Differential Revision: D67615570
Summary: X-link: facebookincubator/velox#12010

Prevent the exchange client from unblocking too early. Unblocking too early impedes the effectiveness of page merging. When the cost of creating a vector is high (for example, for data sets with a high number of columns), creating small pages can make queries significantly less efficient. For example, it was observed that when the network is congested and Exchange buffers are not filled up as fast, a query may experience a CPU efficiency drop of up to 4x: T211034421

Differential Revision: D67615570
xiaoxmeng left a comment:

@arhimondr thanks for the change % comments.
xiaoxmeng left a comment:

@arhimondr LGTM. Thanks for the update!
xiaoxmeng left a comment:

@arhimondr thanks for the iterations % nits.
```cpp
          operatorCtx_->driverCtx()->queryConfig(),
          serdeKind_)},
      processSplits_{operatorCtx_->driverCtx()->driverId == 0},
      driverId_{driverCtx->driverId},
```

nit: we can just fetch from operatorCtx_->driverCtx()->driverId; it is not necessary to save a copy of it.

Just wanted to save a couple of extra dereferences :-)
velox/exec/ExchangeQueue.h (outdated):

```cpp
      uint32_t maxBytes,
      bool* atEnd,
      ContinueFuture* future,
      std::vector<ContinuePromise>& promises);
```

s/promises/staledPromises/
Maybe put a comment? And at the caller, we expect it to have at most one promise?

Actually yeah, let me use a pointer instead of a list. It should never be more than one.
Reviewed By: xiaoxmeng

This pull request has been merged in 121b230.
```cpp
      queue_.push_back(std::move(page));
      if (!promises_.empty()) {
        const auto minBatchSize = minOutputBatchBytesLocked();
        while (!promises_.empty()) {
```

@arhimondr I have a question to confirm: there is actually no need for a loop here, because only one page is added at a time, so at most one consumer should be awakened, right?
Looking forward to your guidance.
cc @xiaoxmeng
@lingbin I think you are right. I don't think a single page can be consumed by more than a single consumer today (even if it is large). The loop does not seem to be necessary.
@arhimondr Thanks for your quick reply.
Meanwhile, I think it can be simplified to the following code, because the resumed driver will try to read as much data as possible (as long as it does not exceed 'maxBytes').
What do you think? If you think it's OK, I can create a PR to make this change.
Before:

```cpp
while (!promises_.empty()) {
  VELOX_CHECK_LE(promises_.size(), numberOfConsumers_);
  const int32_t unblockedConsumers = numberOfConsumers_ - promises_.size();
  const int64_t unasignedBytes =
      totalBytes_ - unblockedConsumers * minBatchSize;
  if (unasignedBytes < minBatchSize) {
    break;
  }
  // Resume one of the waiting drivers.
  auto it = promises_.begin();
  promises.push_back(std::move(it->second));
  promises_.erase(it);
}
```

After:

```cpp
if (!promises_.empty() && totalBytes_ >= minBatchSize) {
  // Resume one of the waiting drivers.
  auto it = promises_.begin();
  promises.push_back(std::move(it->second));
  promises_.erase(it);
}
```
@lingbin you need to take into account how many consumers are in flight to avoid unblocking too many.
Consider receiving minBatchSize and unblocking one consumer. Then receiving minBatchSize / 2 worth of data would increase totalBytes_ to 1.5 * minBatchSize and unblock one more consumer (which is not desired).
@arhimondr Thank you for your explanation. Now I understand the purpose of unasignedBytes.
However, I noticed that under the current strategy, unblocking is only performed when totalBytes_ >= (unblockedConsumers + 1) * minBatchSize. Will this lead to too few unblocks?
Consider numberOfConsumers_ = 10 and promises_.size() = 2: according to the current strategy, it is necessary to receive 9 pages (totalBytes_ == 9 * minBatchSize) before unblocking one consumer.
> Consider numberOfConsumers_ = 10 and promises_.size() = 2: according to the current strategy, it is necessary to receive 9 pages (totalBytes_ == 9 * minBatchSize) before unblocking one consumer.

Yes, this is correct.
When there are 8 consumers unblocked, to unblock the next one you need to receive (8 * minBatchSize) + minBatchSize, so that every consumer has minBatchSize to process (on average).
> When there are 8 consumers unblocked, to unblock the next one you need to receive (8 * minBatchSize) + minBatchSize, so that every consumer has minBatchSize to process (on average).

@arhimondr Do you mean that you originally wanted to unblock 8 consumers (so that each consumes minBatchSize of data)? But in fact, only one will be unblocked here, and it will then consume 9 * minBatchSize of data at a time. (The default preferred_output_batch_bytes size is 10MB.)
@arhimondr There seem to be two separate concerns here: "the number of unblocks" and "whether a small vector is generated".
These two issues do not seem to be causally related: as long as every consumer is guaranteed to consume at least minBatchSize of data each time, we can guarantee that no small vector is generated.
This is already guaranteed in dequeueLocked(), right? I see:

velox/velox/exec/ExchangeQueue.cpp, lines 146 to 151 in d8cac2f:

```cpp
  // If we don't have enough bytes to return, we wait for more data to be
  // available
  if (totalBytes_ < minOutputBatchBytesLocked()) {
    addPromiseLocked(consumerId, future, stalePromise);
    return {};
  }
```
For your example:

t1: add one page: totalBytes_ = minBatchSize, unblock consumer-1
t2: add one page: totalBytes_ = 1.5 * minBatchSize, unblock consumer-2
t3: consumer-1 consumes data of size 1.5 * minBatchSize at one time and generates a RowVector
t4: consumer-2 finds that there is no data to consume and is blocked again; no new RowVector is generated.

That is, although consumers are unblocked twice, only one RowVector is generated in the end.
> This is already guaranteed in dequeueLocked(), right?

Yes, there is a second check in dequeueLocked. The idea is not to unblock too many consumers. For example, when you have totalBytes_ = 10 * minBatchSize, unblocking more than 10 consumers does not make sense, as at least one of them will get blocked again in dequeueLocked.