[SPARK-22033][CORE] BufferHolder, other size checks should account for the specific VM array size limitations #19266


Closed
wants to merge 3 commits

Conversation

srowen
Member

@srowen srowen commented Sep 18, 2017

What changes were proposed in this pull request?

Try to avoid allocating an array bigger than Integer.MAX_VALUE - 8, which is the actual max size on some JVMs, in several places
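For illustration, a minimal sketch of the guard pattern involved (the constant name ARRAY_MAX appears in the patch, but checkedArrayLength is a hypothetical helper used here only to show the idea):

val ARRAY_MAX: Int = Int.MaxValue - 8 // some JVMs cannot allocate arrays all the way up to Int.MaxValue

// Hypothetical helper: validate a requested length before allocating.
def checkedArrayLength(requested: Long): Int = {
  require(requested <= ARRAY_MAX, s"Cannot allocate an array of $requested elements; max is $ARRAY_MAX")
  requested.toInt
}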

How was this patch tested?

Existing tests

…hich is the actual max size on some JVMs, in several places
@srowen
Member Author

srowen commented Sep 18, 2017

CC @maropu

@maropu
Member

maropu commented Sep 18, 2017

I thought, if this limit depends heavily on JVM implementations, wouldn't it be better to put the limit in a global constant somewhere (e.g., ARRAY_INT_MAX in spark.util.Utils or another place)? As another option, how about adding an internal option to SparkConf for it?

Also, how about making a parent JIRA ticket to track similar issues, since it seems difficult to cover all the possible places in this PR.

@srowen
Member Author

srowen commented Sep 18, 2017

Yeah, agreed, it could be some global constant. I don't think it should be configurable. Ideally it would be determined from the JVM, but I don't know a way to do that.

In many cases, assuming Int.MaxValue is the max array size when it's really Int.MaxValue - 8 doesn't matter much. For example, arguably I should leave the ML changes alone here, because in the very rare case that a matrix size falls somewhere between Int.MaxValue - 8 and Int.MaxValue, it will fail anyway, and that's not avoidable given the user input. It's also, arguably, more conservative not to assume that anything beyond Int.MaxValue - 8 will always fail, and not to "proactively" fail at this cutoff.

However, I think there are a smallish number of identifiable cases where Spark can very much avoid the failure (like BufferHolder), and they're the instances where an array size keeps doubling. Maybe we can stick to those clear cases, especially any that seem to have triggered the original error?

Those cases are few enough and related enough that I'm sure they're just one issue, not several.
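As a concrete illustration of the doubling case described above, a hedged sketch (not the actual BufferHolder code) of growing a byte buffer while capping the new length at the JVM-safe limit:

val ARRAY_MAX = Int.MaxValue - 8

def grow(current: Array[Byte], neededSize: Int): Array[Byte] = {
  if (neededSize > ARRAY_MAX) {
    throw new UnsupportedOperationException(
      s"Cannot grow buffer to $neededSize bytes; max is $ARRAY_MAX")
  }
  if (neededSize <= current.length) {
    current
  } else {
    // Double the length, but never beyond the cap; naive doubling past
    // Int.MaxValue - 8 is what can fail on some JVMs.
    val newLength =
      math.min(math.max(current.length.toLong * 2, neededSize.toLong), ARRAY_MAX.toLong).toInt
    val grown = new Array[Byte](newLength)
    System.arraycopy(current, 0, grown, 0, current.length)
    grown
  }
}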

@@ -39,7 +39,7 @@
private final long length;

public LongArray(MemoryBlock memory) {
assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 4 billion elements";
assert memory.size() < (long) Integer.MAX_VALUE * 8: "Array size > 2.1 billion elements";

@ascii766164696D ascii766164696D Sep 18, 2017


assert memory.size() <= (long) (Integer.MAX_VALUE - 8) * 8

Member


Do we need the same assert below?

public static MemoryBlock fromLongArray(final long[] array) {


maybe also add the exact number in the error message instead of 2.1 billion

Member Author


I wasn't sure whether the JVM array size limit applied here, because this represents a non-native array. I still wanted to fix the comment while I was at it. If an array exists, its length is valid, so I didn't think that part of MemoryBlock represented an issue.

@@ -30,11 +30,15 @@
HashMapGrowthStrategy DOUBLING = new Doubling();

class Doubling implements HashMapGrowthStrategy {

private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;

@ascii766164696D ascii766164696D Sep 18, 2017


Maybe worth adding a comment explaining why this value is chosen as the max, like here:
http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l223

}
val capacity = if (otherElements != null) otherElements.length + 2 else 2
if (newSize > capacity) {
var newArrayLen = 8
var newArrayLen = 8L
while (newSize - 2 > newArrayLen) {


I think we can remove - 2 now since newArrayLen is a Long


@ascii766164696D ascii766164696D Sep 19, 2017


Ah, I see that it's reserved. It wasn't clear to me why the - 2; it seemed like a magic number, and I see it in other places too.
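For context, a hedged sketch of the growth loop under discussion, assuming a CompactBuffer-style structure that keeps its first two elements in dedicated fields (which is what the reserved "- 2" accounts for), with the counter widened to Long so the doubling cannot overflow before the cap is applied:

val ARRAY_MAX = Int.MaxValue - 8

// newSize counts all elements; two of them live in fields rather than the
// backing array, so the array only needs newSize - 2 slots.
def newArrayLength(newSize: Int): Int = {
  var newArrayLen = 8L // Long, so repeated doubling cannot wrap around Int
  while (newSize - 2 > newArrayLen) {
    newArrayLen *= 2
  }
  math.min(newArrayLen, ARRAY_MAX.toLong).toInt
}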

@SparkQA

SparkQA commented Sep 18, 2017

Test build #81882 has finished for PR 19266 at commit 8dbfa30.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -96,5 +96,5 @@ private[spark] class PartitionedPairBuffer[K, V](initialCapacity: Int = 64)
}

private object PartitionedPairBuffer {
val MAXIMUM_CAPACITY = Int.MaxValue / 2 // 2 ^ 30 - 1
val MAXIMUM_CAPACITY = (Int.MaxValue - 8) / 2


Maybe worth adding the Int type explicitly, even though it's already an Int.

Also, the comment at line 28 should be changed to 1073741819, i.e. (Int.MaxValue - 8) / 2.
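For reference, a hedged sketch of why the cap is halved here: the buffer stores each record in two adjacent slots of its backing array, so the array length is 2 * capacity and must itself stay under the JVM array limit (growArray below is illustrative, not the exact PartitionedPairBuffer code):

val MAXIMUM_CAPACITY: Int = (Int.MaxValue - 8) / 2 // 1073741819

def growArray(data: Array[AnyRef], capacity: Int): (Array[AnyRef], Int) = {
  require(capacity < MAXIMUM_CAPACITY, s"Can't grow buffer beyond $MAXIMUM_CAPACITY elements")
  // Doubling is clamped so that 2 * newCapacity never exceeds Int.MaxValue - 8.
  val newCapacity = math.min(capacity * 2L, MAXIMUM_CAPACITY.toLong).toInt
  val newData = new Array[AnyRef](2 * newCapacity)
  System.arraycopy(data, 0, newData, 0, 2 * capacity)
  (newData, newCapacity)
}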

@@ -344,7 +344,7 @@ class Word2Vec extends Serializable with Logging {
val newSentences = sentences.repartition(numPartitions).cache()
val initRandom = new XORShiftRandom(seed)

if (vocabSize.toLong * vectorSize >= Int.MaxValue) {
if (vocabSize.toLong * vectorSize >= Int.MaxValue - 8) {


I think this should be just >
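In other words, with a strict > the boundary value itself stays allowed; a hedged sketch of the suggested check (vocabSize and vectorSize are the surrounding Word2Vec fields):

// A table of exactly Int.MaxValue - 8 floats is still allocatable, so only reject sizes above it.
if (vocabSize.toLong * vectorSize > Int.MaxValue - 8) {
  throw new RuntimeException(
    s"vocabSize ($vocabSize) * vectorSize ($vectorSize) is too large; " +
    s"the vector table cannot exceed ${Int.MaxValue - 8} values")
}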

@@ -304,8 +304,8 @@ class BlockMatrix @Since("1.3.0") (
s"Int.MaxValue. Currently numRows: ${numRows()}")
require(numCols() < Int.MaxValue, "The number of columns of this matrix should be less than " +
s"Int.MaxValue. Currently numCols: ${numCols()}")
require(numRows() * numCols() < Int.MaxValue, "The length of the values array must be " +
s"less than Int.MaxValue. Currently numRows * numCols: ${numRows() * numCols()}")
require(numRows() * numCols() < Int.MaxValue - 8, "The length of the values array must be " +


<=

…mediately; stick to 'avoidable' allocations that are too large
@SparkQA

SparkQA commented Sep 19, 2017

Test build #81921 has finished for PR 19266 at commit 4594b55.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 20, 2017

Test build #81984 has finished for PR 19266 at commit 9bb9e77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@jiangxb1987 jiangxb1987 left a comment


LGTM

@srowen
Member Author

srowen commented Sep 23, 2017

Merged to master

@asfgit asfgit closed this in 50ada2a Sep 23, 2017
@srowen srowen deleted the SPARK-22033 branch September 24, 2017 16:35

// Some JVMs can't allocate arrays of length Integer.MAX_VALUE; actual max is somewhat
// smaller. Be conservative and lower the cap a little.
private static final int ARRAY_MAX = Integer.MAX_VALUE - 8;
Member


Just curious, how do we arrive at this value of -8?

Also cc @liufengdb

Member


Why not also fix line 54?


@srowen I think we can use Integer.MAX_VALUE - 7 instead of Integer.MAX_VALUE - 8 to make the size align with words; otherwise, this check will fail: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/UnsafeRow.java#L170.

This is the reason why all the size inputs to the methods are rounded, for example, https://github.com/apache/spark/blob/master/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java#L216.
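For illustration, the rounding referred to looks roughly like this (a sketch of word-alignment arithmetic, not the exact UnsafeRowWriter code):

// Round a byte count up to the next multiple of 8 (one word).
def roundToWord(numBytes: Int): Int = (numBytes + 7) / 8 * 8

// Int.MaxValue - 8 is not itself a multiple of 8, so a buffer capped at exactly
// that size would fail a sizeInBytes % 8 == 0 check; Int.MaxValue - 7 is 8-aligned.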

Member Author


@gatorsmile have a look at the JIRA for some detail; you can see a similar limit in the JDK at for example http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/util/ArrayList.java#l229

You are right, I think around line 54 needs to be something straightforward like:

long totalSize = initialSize + bitsetWidthInBytes + 8L * row.numFields();
if (totalSize > ARRAY_MAX) { ...error... }
this.buffer = new byte[(int) totalSize];

Yes, I agree with your new JIRA @liufengdb, though I think we'll need to go the other way, to Integer.MAX_VALUE - 15, where the value must be divisible by 8.
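For what it's worth, Integer.MAX_VALUE - 15 is the largest multiple of 8 that stays at or below the Integer.MAX_VALUE - 8 allocation cap, e.g.:

val ARRAY_MAX = Int.MaxValue - 8      // JVM-safe allocation cap: 2147483639
val WORD_ALIGNED_MAX = ARRAY_MAX & ~7 // largest multiple of 8 under the cap: 2147483632 (Int.MaxValue - 15)
assert(WORD_ALIGNED_MAX % 8 == 0 && WORD_ALIGNED_MAX <= ARRAY_MAX)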

@gatorsmile
Member

@srowen Thanks!

@liufengdb Could you submit a separate PR to fix the issues and also please include the test cases?
