SPY-1394: CSD caching policy at block level #189


Merged

merged 2 commits into alteryx:csd-1.6 from ianlcsd:csd-1.6
Jul 25, 2017

Conversation

@ianlcsd ianlcsd commented Jul 21, 2017

The problem originates from the fact that Spark cannot read a cached block larger than 2G from the disk store (disk reads go through a single ByteBuffer, which is capped at Integer.MAX_VALUE bytes). When data is highly skewed across partitions, we see the problems recorded in SPY-1394 with any RDD cached on disk.

The ultimate solution, adapting the partitioning to the skew, is a long shot,
but imposing a CSD policy on cache block size is feasible.

This PR proposes the following policy:

  1. We cache a block in memory if it is smaller than 2G and there is available storage memory.
  2. We don't cache any block over 2G in size.
  3. We drop the block to the disk store when it is smaller than 2G but there is insufficient storage memory.
  4. Introduce a Spark configuration, spark.storage.MemoryStore.csdCacheBlockSizeLimit.
     A negative value disables the policy.
     The default is Integer.MAX_VALUE, but we could choose to make it smaller.
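
To make the decision order concrete, here is a minimal Scala sketch of the policy. CsdCachePolicy, Decision, and hasStorageMemory are hypothetical names for illustration only, not the actual MemoryStore change:

    object CsdCachePolicy {
      // Hypothetical stand-in for reading spark.storage.MemoryStore.csdCacheBlockSizeLimit;
      // the default mirrors the PR (Integer.MAX_VALUE).
      val csdCacheBlockSizeLimit: Long = Int.MaxValue

      sealed trait Decision
      case object CacheInMemory extends Decision
      case object DropToDisk extends Decision
      case object DoNotCache extends Decision

      // Rules 1-3 above; a negative limit disables the policy (rule 4).
      def decide(blockSize: Long, hasStorageMemory: Boolean): Decision = {
        val limitDisabled = csdCacheBlockSizeLimit < 0
        if (!limitDisabled && blockSize > csdCacheBlockSizeLimit) DoNotCache
        else if (hasStorageMemory) CacheInMemory
        else DropToDisk
      }
    }

Checking blockSize against the limit first keeps rule 2 absolute: an oversized block is never cached, regardless of how much storage memory happens to be free.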

By applying the above policy, we ensure every cached block is less than 2G in size, whether it lives in memory or on disk. This braces us against skewed data and other abnormal caching patterns.

@davidnavas @markhamstra

@ianlcsd ianlcsd changed the title SPY-1394: Csd policy on caching block size SPY-1394: CSD caching policy at block level Jul 21, 2017
@ianlcsd ianlcsd closed this Jul 21, 2017
@ianlcsd ianlcsd reopened this Jul 21, 2017
 * sufficient memory in spark's user memory space to be set apart:
 * (2G + overhead) per thread and per operator/RDD compute.
 */
def fetchUntilCsdBlockSizeLimit[T](
ianlcsd (Author) commented on the diff:
This API is challenging because we need to keep all the seen values until reaching the 2G size limit. It is a potential OOM threat and hard to test.
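
A rough sketch of that exposure, with a made-up signature (only the method name comes from the diff; the parameters, size estimator, and Either return type are assumptions):

    import scala.collection.mutable.ArrayBuffer

    // Every value seen so far must stay buffered until the running size
    // estimate crosses the limit, which is the OOM risk described above.
    def fetchUntilCsdBlockSizeLimit[T](
        values: Iterator[T],
        sizeLimit: Long,
        estimateSize: T => Long): Either[Seq[T], Iterator[T]] = {
      val seen = new ArrayBuffer[T]  // grows without bound until the limit is hit
      var total = 0L
      while (values.hasNext && total < sizeLimit) {
        val v = values.next()
        total += estimateSize(v)
        seen += v
      }
      if (total < sizeLimit) Left(seen)     // everything fit under the limit
      else Right(seen.iterator ++ values)   // over the limit: replay buffered values, then the rest
    }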

ianlcsd commented Jul 24, 2017

jttp

@alteryx alteryx deleted a comment from ianlcsd Jul 25, 2017
if (!shouldCache) {
  logBlockSizeLimitMessage(blockId, vector.estimateSize())
} else {
  // We ran out of space while unrolling the values for this block

Not sure I understand this comment for this case. Wouldn't this be the case where it's still cacheable?
Oh I see, this is the original comment. Maybe we need to put a separating line before it to distinguish the cache-limit check from the unroll handling.
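
Something like the following separation, sketched here with hypothetical stand-ins (finishUnroll and the println calls substitute for the real logging and drop-to-disk paths in MemoryStore):

    // Stand-ins for the diff's context; only the branch structure and the
    // original comment come from the hunk above.
    def finishUnroll(shouldCache: Boolean, blockId: String, estimatedSize: Long): Unit = {
      if (!shouldCache) {
        // CSD size-limit case: the block tripped csdCacheBlockSizeLimit, so skip caching.
        println(s"Block $blockId ($estimatedSize bytes) exceeds the cache block size limit")
      } else {
        // Original case: we ran out of space while unrolling the values for this block.
        println(s"Ran out of storage memory while unrolling block $blockId")
      }
    }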

@markhamstra markhamstra merged commit 2b5e422 into alteryx:csd-1.6 Jul 25, 2017
@ianlcsd ianlcsd deleted the csd-1.6 branch April 22, 2018 22:06