[SPARK-21766][SQL] Convert nullable int columns to float columns in toPandas #18945
Conversation
…as to prevent needless crashes.
@logannc, thanks for this. You bring up a big issue here that I think was overlooked when this code was added to Spark. I filed a JIRA for this, SPARK-21766, which generally comes before the PR. Please see the contributing guide here.
There should be no way an error like this is raised during the call.
I read the contributing guide. It said that simple changes didn't need a JIRA. Certainly this code change is quite simple; I just wasn't sure if there would be enough discussion to warrant a JIRA. Now I know. So, rather than return np.float32, return None? That would probably also work, though the bug might get reintroduced by someone unfamiliar with the problem. That is why I preferred the explicitness of returning a type.
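For reference, the crash boils down to pandas refusing to cast a column that contains None/NaN to an integer dtype. A minimal sketch outside Spark, illustrative only; the exact exception type and message depend on the pandas version:

import numpy as np
import pandas as pd

# A nullable int column collected from Spark can arrive with None in it.
pdf = pd.DataFrame.from_records([(1,), (None,)], columns=["age"])

# Applying the "corrected" integer dtype, as toPandas does, then blows up,
# because an integer dtype has no way to represent a missing value.
try:
    pdf["age"].astype(np.int32, copy=False)
except (TypeError, ValueError) as exc:
    print("cast failed:", exc)

# Casting to a float dtype works, since the missing value becomes NaN.
print(pdf["age"].astype(np.float64).dtype)  # float64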
I agree with @BryanCutler in general. Another rough thought on a feasible way to keep the current behaviour (more specifically, to match the types with / without the Arrow optimization, IIUC) is to make a generator wrapper to check if …
@logannc, mind adding the JIRA number in this PR title as described in the guideline?
python/pyspark/sql/dataframe.py
Outdated
@@ -1731,7 +1731,7 @@ def toDF(self, *cols):
         return DataFrame(jdf, self.sql_ctx)

     @since(1.3)
-    def toPandas(self):
+    def toPandas(self, strict=True):
BTW, this change looks useless when the optimization is enabled, if I understood correctly. I wouldn't add an option and make it complicated.
You're referring to the Arrow optimization, right? I also agree that we should not add this option, rather just handle all this automatically.
@@ -1762,7 +1762,7 @@ def toPandas(self):
         else:
If we wanted to check that a nullable int field actually has null values, we could do it here and then not change the type if there are null values. We would have to construct the pandas DataFrame first, though.
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
dtype = {}
for field in self.schema:
    if not (type(field.dataType) == IntegerType and field.nullable
            and pdf[field.name].isnull().any()):
        pandas_type = _to_corrected_pandas_type(field.dataType)
        if pandas_type is not None:
            dtype[field.name] = pandas_type
for f, t in dtype.items():
    pdf[f] = pdf[f].astype(t, copy=False)
return pdf
This does make an extra pass over the data for the check, but it is not much overhead.
If we use this approach, how about the following to check if the type corrections are needed:
dtype = {}
for field in self.schema:
    pandas_type = _to_corrected_pandas_type(field.dataType)
    if pandas_type is not None and not (field.nullable and pdf[field.name].isnull().any()):
        dtype[field.name] = pandas_type
I'm not sure I follow, @HyukjinKwon; we can just look at the …
Ah, I should have been clearer. I was thinking of something like:

dtype = {}
for field in self.schema:
    pandas_type = _to_corrected_pandas_type(field.dataType)
    if pandas_type is not None:
        dtype[field.name] = pandas_type

# Columns with int + nullable from the schema.
int_null_cols = [...]
# Columns with int + nullable that contain an actual None. This will be set in `check_nulls`.
int_null_cols_with_none = []

# This generator checks each row for None values.
def check_nulls(rows):
    for row in rows:
        # Check against int_null_cols and add to int_null_cols_with_none if there is a None.
        ...
        yield row

# Don't check anything if there are no int + nullable columns.
if len(int_null_cols) > 0:
    check_func = check_nulls
else:
    check_func = lambda r: r

pdf = pd.DataFrame.from_records(check_func(self.collect()), columns=self.columns)

# Replace int32 -> float by checking int_null_cols_with_none.
dtype = ...
for f, t in dtype.items():
    pdf[f] = pdf[f].astype(t, copy=False)
return pdf

So, I was thinking that checking the actual values in the data might be a way forward if we can't resolve this with the schema alone.
Yeah, I think it is basically a similar idea to #18945 (comment).
Thanks for clarifying, @HyukjinKwon, I see what you mean now. Since pandas will iterate over … Just to sum things up - @logannc, does this still meet your requirements?
I'm also guessing we will have the same problem with nullable ShortType - maybe others?
I think it'd be nicer if we can go with the approach above ^ (checking for nulls in the data and setting the correct type). I am okay with any form of the approach above for now if it makes sense, as we have a decent Arrow optimization now for the performance aspect.
Sorry for the delay. Things got busy and now there is the storm in Houston. Will update this per these suggestions soon.
Hey @logannc, have you had some time to work on this? I want to fix this issue ASAP. Otherwise, would anyone here be interested in submitting another PR for the other approach?
python/pyspark/sql/dataframe.py
Outdated
if type(dt) == ByteType:
    return np.int8
elif type(dt) == ShortType:
    return np.int16
elif type(dt) == IntegerType:
    if not strict and field.nullable:
        return np.float32
Is loss of precision a concern here? Some integers from the original dataset will now be rounded to the nearest representable float32 if I'm not mistaken.
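A quick illustration of that concern, assuming nothing beyond NumPy: 2**24 = 16777216 is the largest magnitude up to which every integer is exactly representable in float32.

import numpy as np

assert np.float32(16777216) == 16777216   # 2**24 is exact
assert np.float32(16777217) == 16777216   # 2**24 + 1 silently rounds down
assert np.float64(16777217) == 16777217   # float64 stays exact far beyond this range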
gentle ping @logannc.
@BryanCutler, @a10y and @viirya, would you guys be interested in this and have some time to take it over with the different approach we discussed above - #18945 (comment) and #18945 (comment)? I could take over this too if you guys are currently busy.
…n toPandas to prevent needless crashes." This reverts commit bceeefc.
Sorry I fell off the face of the earth. I finally had some time to sit down and do this. I took your suggestions but implemented it a little differently. Unless I've made a dumb mistake, I think I improved on it a bit.
python/pyspark/sql/dataframe.py
Outdated
if val is not None:
    if abs(val) > 16777216:  # Max value before np.float32 loses precision.
        val = np.float64(val)
        if np.float64 != dt:
Is this if totally necessary, or can we just move the two assignments up?
(Also, thanks for solving the precision issue!)
No, not strictly necessary, but it's also hardly harmful and it may future-proof things a bit...? Anyway, it can be removed if you think it should be.
python/pyspark/sql/dataframe.py
Outdated
for field in self.schema:
    pandas_type = _to_corrected_pandas_type(field.dataType)
    if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
        columns_with_null_int.add(field.name)
>>> columns_with_null_int = {}
>>> columns_with_null_int.add("test")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'dict' object has no attribute 'add'
Am I missing something?
Ack, I noticed I did that then forgot to change it. On it...
Fixed...
python/pyspark/sql/dataframe.py
Outdated
for column in columns_with_null_int:
    val = row[column]
    dt = dtype[column]
    if val is not None:
Don't we want to change the data type for None values? I don't see you doing that.
They are handled by Pandas already, so I am just letting them pass through.
Don't we want to fix the issue where, when the pandas type is in (np.int8, np.int16, np.int32) and the field is nullable, the dtype we get will cause an exception later when converting a None to an int type such as np.int16?
You can follow what https://github.com/apache/spark/pull/18945/files#r134033952 suggested.
If pandas_type in (np.int8, np.int16, np.int32) and field.nullable and there are ANY non-null values, the dtype of the column is changed to np.float32 or np.float64, both of which properly handle None values.
That said, if the entire column was None, it would fail. Therefore I have preemptively changed the type on line 1787 to np.float32. Per null_handler, it may still change to np.float64 if needed.
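To make that concrete, here is a rough sketch of the kind of null_handler generator being described; the names and exact structure are reconstructed from the diff fragments above, not the literal patch:

import numpy as np

def null_handler(rows, columns_with_null_int, dtype, requires_double_precision):
    # Stream rows through unchanged while watching the nullable int columns.
    for row in rows:
        for column in columns_with_null_int:
            val = row[column]
            if val is None:
                # A null was actually seen: widen this column to a float dtype.
                dtype[column] = (np.float64 if column in requires_double_precision
                                 else np.float32)
            elif abs(val) > 16777216:
                # Value too large for an exact float32: remember float64 is needed.
                requires_double_precision.add(column)
                if dtype[column] == np.float32:
                    dtype[column] = np.float64
        yield row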
We also need a proper test for this.
@HyukjinKwon I can take over this if @logannc can't find time to continue it.
Hm. Where would I add tests?
@logannc There are pandas-related tests in …
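Something along these lines could work as a regression test. This is only a sketch; the test method name, the self.spark fixture, and the assertions are assumptions modelled on the existing PySpark SQL test suite rather than the exact file:

def test_to_pandas_with_nullable_int(self):
    import numpy as np
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),  # nullable int column
    ])
    df = self.spark.createDataFrame([("alice", 1), ("bob", None)], schema)

    pdf = df.toPandas()  # must not raise

    # The null survives, and the column is not forced to an integer dtype.
    self.assertTrue(pdf["age"].isnull().iloc[1])
    self.assertFalse(np.issubdtype(pdf["age"].dtype, np.integer))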
@logannc, mind adding the JIRA number in this PR title as described in the guideline? Please take a look - http://spark.apache.org/contributing.html. I'd read carefully the comments above, e.g., adding a test, #18945 (comment), fixing the PR title, #18945 (comment), and following the suggestion, #18945 (comment).
ok to test
Test build #82024 has finished for PR 18945 at commit
Test build #82025 has finished for PR 18945 at commit
Test build #82060 has finished for PR 18945 at commit
python/pyspark/sql/dataframe.py
Outdated
if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
    columns_with_null_int.add(field.name)
    row_handler = null_handler
    pandas_type = np.float32
I don't think this is a correct fix.
Can you elaborate? I believe it is, per my reply to your comment in the null_handler.
Have you carefully read the comments in #18945 (comment) and #18945 (comment)? They are good suggestions for this issue. I don't know why you don't want to follow them to check null values with Pandas...
A simple problem with this line is that even if this condition is met, it doesn't necessarily mean there are null values in the column. But you forcibly set the type to np.float32.
Ah, I see where I got confused. I had started with @ueshin's suggestion but abandoned it because I didn't want to create the DataFrame before the type correction, since I was also looking at @HyukjinKwon's suggestion. I somehow ended up combining them incorrectly.
I will take my suggestion back. I think their suggestions are better than mine.
Test build #82061 has finished for PR 18945 at commit
Test build #82062 has finished for PR 18945 at commit
I've continued to use @HyukjinKwon's suggestion because it should be more performant and is capable of handling this without loss of precision. I believe I've addressed your concerns by only changing the type when we encounter a null (duh).
Test build #82063 has finished for PR 18945 at commit
        dt = np.float64 if column in requires_double_precision else np.float32
        dtype[column] = dt
    elif val is not None:
        if abs(val) > 16777216:  # Max value before np.float32 loses precision.
Why do we need this?
Values above this cannot be represented losslessly as an np.float32.
I think they are represented as np.float64. I added a test in #19319 which follows the previous suggestion with a little tweak.
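For what it's worth, plain pandas already performs that upcast on its own (a minimal check, assuming nothing beyond pandas and NumPy):

import numpy as np
import pandas as pd

s = pd.Series([1, None])
print(s.dtype)    # float64: pandas upcasts int + missing to float64
assert np.isnan(s.iloc[1])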
Test build #82066 has finished for PR 18945 at commit
Hey @logannc, let's not make this complicated for now and go with their approaches first - #18945 (comment) and #18945 (comment). Maybe we can do a follow-up later with some small benchmark results for the performance aspect and the precision concern (I guess the precision concern is not a regression, BTW?). I think we should first match it with when …
Test build #82067 has finished for PR 18945 at commit
…as to prevent needless Exceptions during routine use.
Add the strict=True kwarg to DataFrame.toPandas to allow for a non-strict interpretation of the schema of a DataFrame. This is currently limited to allowing a nullable int column to be interpreted as a float column (because that is the only way Pandas supports nullable int columns, and toPandas actually crashes without this). I consider this small change to be a massive quality-of-life improvement for DataFrames with lots of nullable int columns, which would otherwise need a litany of df.withColumn(name, F.col(name).cast(DoubleType())), etc., just to view them easily or interact with them in-memory.
Possible Objections
… the nullable property of the schema to False.
Alternatives
nullable_int_to_float instead of strict, or some other, similar name.
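For context, the workaround litany mentioned above looks roughly like this today; df is assumed to be an existing DataFrame with several nullable int columns, and the snippet is illustrative rather than part of this patch:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, IntegerType

# Manually cast every nullable int column to double before calling toPandas,
# purely so that None values can be represented (as NaN) on the pandas side.
for field in df.schema:
    if isinstance(field.dataType, IntegerType) and field.nullable:
        df = df.withColumn(field.name, F.col(field.name).cast(DoubleType()))

pdf = df.toPandas()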