Conversation

@Yuhta (Contributor) commented Sep 22, 2023

Differential Revision: D49501856

Currently, when we skip bytes in `PagedInputStream`, we decompress the data unconditionally, which is expensive. This PR adds the following optimizations (a rough sketch follows the list):

1. Skip decompression of a whole block (a frame, in the case of ZSTD) if
   1. we can determine the precise decompressed size, and
   2. the decompressed size is no larger than the number of bytes to skip.
2. Accumulate contiguous skip calls into a larger skip region (delayed skipping).
3. Fix `ByteRleDecoder::skipBytes` so that it does not read data and break up contiguous skips.
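
The sketch below illustrates the intent of items 1 and 2, assuming each compressed page is a single ZSTD frame that records its content size in the frame header. The class and member names (`SkippingPagedStream`, `pendingSkip_`, and so on) are illustrative only and do not mirror the actual Velox `PagedInputStream`.

```cpp
#include <zstd.h>

#include <cstdint>
#include <vector>

// Illustrative sketch only; not the Velox PagedInputStream API.
class SkippingPagedStream {
 public:
  explicit SkippingPagedStream(std::vector<std::vector<char>> frames)
      : frames_(std::move(frames)) {}

  // Delayed skipping: just record how many bytes the caller wants to skip.
  // Contiguous skip calls accumulate into one larger skip region, and the
  // actual work happens lazily on the next read().
  void skip(uint64_t bytes) {
    pendingSkip_ += bytes;
  }

  size_t read(char* out, size_t size) {
    applyPendingSkip();
    // Decompress the current frame and copy into 'out' (omitted for brevity).
    (void)out;
    return size;
  }

 private:
  void applyPendingSkip() {
    while (pendingSkip_ > 0 && frameIndex_ < frames_.size()) {
      const auto& frame = frames_[frameIndex_];
      // ZSTD frames can carry their decompressed size in the frame header.
      const unsigned long long contentSize =
          ZSTD_getFrameContentSize(frame.data(), frame.size());
      if (contentSize != ZSTD_CONTENTSIZE_UNKNOWN &&
          contentSize != ZSTD_CONTENTSIZE_ERROR &&
          contentSize <= pendingSkip_) {
        // The precise decompressed size is known and the whole frame lies
        // inside the skip region, so drop it without decompressing.
        pendingSkip_ -= contentSize;
        ++frameIndex_;
        continue;
      }
      // Otherwise part of this frame is still needed: decompress it and
      // discard the remaining pendingSkip_ bytes from its front (omitted).
      break;
    }
  }

  std::vector<std::vector<char>> frames_;
  size_t frameIndex_ = 0;
  uint64_t pendingSkip_ = 0;
};
```

Item 3 matters because a `skipBytes` that eagerly reads the underlying stream would both force the decompression that the skip path is trying to avoid and break the accumulation of contiguous skips.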

@netlify netlify bot commented Sep 22, 2023

Deploy Preview for meta-velox canceled.

Latest commit: f0ea3ac
Latest deploy log: https://app.netlify.com/sites/meta-velox/deploys/6512021885007900083f5fd7

@facebook-github-bot added the CLA Signed label Sep 22, 2023
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D49501856

Yuhta added a commit to Yuhta/velox that referenced this pull request Sep 25, 2023

@facebook-github-bot (Contributor)

This pull request has been merged in f6e9b76.

@facebook-github-bot (Contributor)

This pull request has been reverted by d08ab02.

ericyuliu pushed a commit to ericyuliu/velox that referenced this pull request Oct 12, 2023
Summary:
Pull Request resolved: facebookincubator#6699

Currently, when we skip bytes in `PagedInputStream`, we decompress the data unconditionally, which is expensive. Some optimizations are added to address this:
1. Skip decompression of a whole block (a frame, in the case of ZSTD) if
   1. we can determine the precise decompressed size, and
   2. the decompressed size is no larger than the number of bytes to skip.
2. Accumulate contiguous skip calls into a larger skip region (delayed skipping).

Reviewed By: oerling

Differential Revision: D49501856

fbshipit-source-id: 07241aaf71e83f0f491050a9be6075dd5500dd52
kletkavrubashku added a commit to kletkavrubashku/velox that referenced this pull request Dec 23, 2025
Summary:
## History
Some time ago, a similar PR was landed (facebookincubator#6699, D49501856) and caused SEV S369242 at Meta. At the time, the `ZSTD_DCtx` context was created and reused at the decompressor level. The optimization was reverted due to OOMs.

## Getting back to it again
The optimization still makes sense. For example:
- In Presto Adhoc we spend 0.07% of CPU cycles in `ZSTD_createDCtx_internal`: https://fburl.com/strobelight/2lt1wn4v
- In Presto batch we spend 0.1% of CPU cycles in `ZSTD_createDCtx_internal`: https://fburl.com/strobelight/js4kn8za

## The fix
Instead of creating a `ZSTD_DCtx` per decompressor, we should create one per thread. That lets us reuse the allocation and avoid consuming so much memory in FlatMaps (see the sketch after this summary).

Differential Revision: D89716393
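
A minimal sketch of the per-thread context reuse described above, written against plain libzstd; the helper names are hypothetical and this is not the actual change in D89716393.

```cpp
#include <zstd.h>

#include <memory>
#include <stdexcept>

namespace {

// One ZSTD_DCtx per thread, created lazily and freed when the thread exits,
// so repeated decompressions on the same thread reuse its internal buffers.
ZSTD_DCtx* threadLocalDCtx() {
  thread_local std::unique_ptr<ZSTD_DCtx, decltype(&ZSTD_freeDCtx)> dctx(
      ZSTD_createDCtx(), &ZSTD_freeDCtx);
  return dctx.get();
}

} // namespace

// Decompresses one ZSTD frame with the shared per-thread context instead of
// allocating a fresh ZSTD_DCtx for every decompressor instance.
size_t zstdDecompressWithThreadLocalCtx(
    void* dst,
    size_t dstCapacity,
    const void* src,
    size_t srcSize) {
  const size_t result =
      ZSTD_decompressDCtx(threadLocalDCtx(), dst, dstCapacity, src, srcSize);
  if (ZSTD_isError(result)) {
    throw std::runtime_error(ZSTD_getErrorName(result));
  }
  return result;
}
```

Compared with a context per decompressor, this bounds the number of live `ZSTD_DCtx` allocations by the thread count rather than by the number of open streams, which is the property the fix relies on to avoid the FlatMap memory blow-up.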