[SPARK-21766][SQL] Convert nullable int columns to float columns in toPandas #18945
Conversation
…as to prevent needless crashes.
@logannc, thanks for this. You bring up a big issue here that I think was overlooked when this code was added to Spark. I filed a JIRA for this, SPARK-21766, which generally comes before the PR. Please see the contributing guide here.
There should be no way an error like this is raised during the call.
I read the contributing guide. It said that simple changes didn't need a JIRA. Certainly this code change is quite simple; I just wasn't sure if there would be enough discussion to warrant a JIRA. Now I know. So, rather than return np.float32, return None? That would probably also work, though the bug might get reintroduced by someone unfamiliar with the problem. That is why I preferred the explicitness of returning a type.
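For reference, the crash boils down to pandas refusing to cast a column that contains None/NaN to an integer dtype. A minimal sketch outside Spark, illustrative only; the exact exception type and message depend on the pandas version:

import numpy as np
import pandas as pd

# A nullable int column collected from Spark can arrive with None in it.
pdf = pd.DataFrame.from_records([(1,), (None,)], columns=["age"])

# Applying the "corrected" integer dtype, as toPandas does, then blows up,
# because an integer dtype has no way to represent a missing value.
try:
    pdf["age"].astype(np.int32, copy=False)
except (TypeError, ValueError) as exc:
    print("cast failed:", exc)

# Casting to a float dtype works, since the missing value becomes NaN.
print(pdf["age"].astype(np.float64).dtype)  # float64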
I agree with @BryanCutler in general. Another rough thought on a feasible way to keep the current behaviour (more specifically, to match the types with / without the Arrow optimization, IIUC) is to make a generator wrapper to check if …
@logannc, mind adding the JIRA number in this PR title as described in the guideline?
python/pyspark/sql/dataframe.py
Outdated
@@ -1731,7 +1731,7 @@ def toDF(self, *cols):
         return DataFrame(jdf, self.sql_ctx)

     @since(1.3)
-    def toPandas(self):
+    def toPandas(self, strict=True):
BTW, this change looks useless when the optimization is enabled, if I understood correctly. I wouldn't add an option and make it complicated.
You're referring to the Arrow optimization, right? I also agree that we should not add this option, rather just handle all this automatically.
@@ -1762,7 +1762,7 @@ def toPandas(self):
         else:
If we wanted to check that a nullable int field actually has null values, we could do it here and then not change the type if there are null values. We would have to construct the pandas DataFrame first, though.
pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
dtype = {}
for field in self.schema:
    if not (type(field.dataType) == IntegerType and field.nullable
            and pdf[field.name].isnull().any()):
        pandas_type = _to_corrected_pandas_type(field.dataType)
        if pandas_type is not None:
            dtype[field.name] = pandas_type
for f, t in dtype.items():
    pdf[f] = pdf[f].astype(t, copy=False)
return pdf
This does make an extra pass over the data for the check, but it is not much overhead.
If we use this approach, how about the following to check if the type corrections are needed:
dtype = {}
for field in self.schema:
    pandas_type = _to_corrected_pandas_type(field.dataType)
    if pandas_type is not None and not (field.nullable and pdf[field.name].isnull().any()):
        dtype[field.name] = pandas_type
I'm not sure I follow, @HyukjinKwon; we can just look at the …
Ah, I should have been clearer. I was thinking of something like:

dtype = {}
for field in self.schema:
    pandas_type = _to_corrected_pandas_type(field.dataType)
    if pandas_type is not None:
        dtype[field.name] = pandas_type

# Columns with int + nullable from the schema.
int_null_cols = [...]
# Columns with int + nullable that contain an actual None. This will be set in `check_nulls`.
int_null_cols_with_none = []

# This generator checks each row for None values.
def check_nulls(rows):
    for row in rows:
        # Check against int_null_cols and add to int_null_cols_with_none if there is a None.
        ...
        yield row

# Don't check anything if there are no int + nullable columns.
if len(int_null_cols) > 0:
    check_func = check_nulls
else:
    check_func = lambda r: r

pdf = pd.DataFrame.from_records(check_func(self.collect()), columns=self.columns)

# Replace int32 -> float by checking int_null_cols_with_none.
dtype = ...
for f, t in dtype.items():
    pdf[f] = pdf[f].astype(t, copy=False)
return pdf

So, I was thinking that checking the actual values in the data might be a way forward if we can't resolve this with the schema alone.
Yeah, I think it is basically a similar idea to #18945 (comment).
Thanks for clarifying, @HyukjinKwon, I see what you mean now. Since pandas will iterate over … Just to sum things up - @logannc, does this still meet your requirements?
I'm also guessing we will have the same problem with nullable ShortType - maybe others?
I think it'd be nicer if we can go with the approach above ^ (checking for nulls in the data and setting the correct type). I am okay with any form of the approach above for now if it makes sense, as we have a decent Arrow optimization now for the performance aspect.
Sorry for the delay. Things got busy and now there is the storm in Houston. Will update this per these suggestions soon.
Hey @logannc, have you had some time to work on this? I want to fix this issue ASAP. Otherwise, would anyone here be interested in submitting another PR for the other approach?
python/pyspark/sql/dataframe.py
Outdated
if type(dt) == ByteType:
    return np.int8
elif type(dt) == ShortType:
    return np.int16
elif type(dt) == IntegerType:
    if not strict and field.nullable:
        return np.float32
Is loss of precision a concern here? Some integers from the original dataset will now be rounded to the nearest representable float32 if I'm not mistaken.
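A quick illustration of that concern, assuming nothing beyond NumPy: 2**24 = 16777216 is the largest magnitude up to which every integer is exactly representable in float32.

import numpy as np

assert np.float32(16777216) == 16777216   # 2**24 is exact
assert np.float32(16777217) == 16777216   # 2**24 + 1 silently rounds down
assert np.float64(16777217) == 16777217   # float64 stays exact far beyond this range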
gentle ping @logannc.
@BryanCutler, @a10y and @viirya, would you guys be interested in this and have some time to take it over with the different approach we discussed above - #18945 (comment) and #18945 (comment)? I could take over this too if you guys are currently busy.
…n toPandas to prevent needless crashes." This reverts commit bceeefc.
Sorry I fell off the face of the earth. I finally had some time to sit down and do this. I took your suggestions but implemented it a little differently. Unless I've made a dumb mistake, I think I improved on it a bit.
python/pyspark/sql/dataframe.py
Outdated
if val is not None:
    if abs(val) > 16777216:  # Max value before np.float32 loses precision.
        val = np.float64(val)
        if np.float64 != dt:
Is this if totally necessary, or can we just move the two assignments up?
(Also, thanks for solving the precision issue!)
No, not strictly necessary, but it's also hardly harmful and it may future-proof things a bit...? Anyway, it can be removed if you think it should be.
python/pyspark/sql/dataframe.py
Outdated
for field in self.schema:
    pandas_type = _to_corrected_pandas_type(field.dataType)
    if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
        columns_with_null_int.add(field.name)
>>> columns_with_null_int = {}
>>> columns_with_null_int.add("test")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'dict' object has no attribute 'add'
Am I missing something?
Ack, I noticed I did that then forgot to change it. On it...
Fixed...
python/pyspark/sql/dataframe.py
Outdated
for column in columns_with_null_int:
    val = row[column]
    dt = dtype[column]
    if val is not None:
Don't we want to change the data type for None values? I don't see you doing that.
They are handled by Pandas already, so I am just letting them pass through.
Don't we want to fix the issue where, when the pandas type is in (np.int8, np.int16, np.int32) and the field is nullable, the dtype we get will cause an exception later when converting a None to an int type such as np.int16?
You can follow what https://github.com/apache/spark/pull/18945/files#r134033952 suggested.
If pandas_type in (np.int8, np.int16, np.int32) and field.nullable and there are ANY non-null values, the dtype of the column is changed to np.float32 or np.float64, both of which properly handle None values.
That said, if the entire column was None, it would fail. Therefore I have preemptively changed the type on line 1787 to np.float32. Per null_handler, it may still change to np.float64 if needed.
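To make that concrete, here is a rough sketch of the kind of null_handler generator being described; the names and exact structure are reconstructed from the diff fragments above, not the literal patch:

import numpy as np

def null_handler(rows, columns_with_null_int, dtype, requires_double_precision):
    # Stream rows through unchanged while watching the nullable int columns.
    for row in rows:
        for column in columns_with_null_int:
            val = row[column]
            if val is None:
                # A null was actually seen: widen this column to a float dtype.
                dtype[column] = (np.float64 if column in requires_double_precision
                                 else np.float32)
            elif abs(val) > 16777216:
                # Value too large for an exact float32: remember float64 is needed.
                requires_double_precision.add(column)
                if dtype[column] == np.float32:
                    dtype[column] = np.float64
        yield row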
We also need a proper test for this.
@HyukjinKwon I can take over this if @logannc can't find time to continue it.
Hm. Where would I add tests?
@logannc There are pandas-related tests in …
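Something along these lines could work as a regression test. This is only a sketch; the test method name, the self.spark fixture, and the assertions are assumptions modelled on the existing PySpark SQL test suite rather than the exact file:

def test_to_pandas_with_nullable_int(self):
    import numpy as np
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),  # nullable int column
    ])
    df = self.spark.createDataFrame([("alice", 1), ("bob", None)], schema)

    pdf = df.toPandas()  # must not raise

    # The null survives, and the column is not forced to an integer dtype.
    self.assertTrue(pdf["age"].isnull().iloc[1])
    self.assertFalse(np.issubdtype(pdf["age"].dtype, np.integer))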
@logannc, mind adding the JIRA number in this PR title as described in the guideline? Please take a look - http://spark.apache.org/contributing.html. I'd read carefully the comments above, e.g., adding a test, #18945 (comment), fixing the PR title, #18945 (comment), and following the suggestion, #18945 (comment).
ok to test
Test build #82024 has finished for PR 18945 at commit
Test build #82025 has finished for PR 18945 at commit
Test build #82060 has finished for PR 18945 at commit
python/pyspark/sql/dataframe.py
Outdated
if pandas_type in (np.int8, np.int16, np.int32) and field.nullable:
    columns_with_null_int.add(field.name)
    row_handler = null_handler
    pandas_type = np.float32
I don't think this is a correct fix.
Can you elaborate? I believe it is, per my reply to your comment in the null_handler.
Have you carefully read the comments in #18945 (comment) and #18945 (comment)? They are good suggestions for this issue. I don't know why you don't want to follow them to check null values with Pandas...
A simple problem with this line is that even if this condition is met, it doesn't necessarily mean there are null values in the column. But you forcibly set the type to np.float32.
Ah, I see where I got confused. I had started with @ueshin's suggestion but abandoned it because I didn't want to create the DataFrame before the type correction, since I was also looking at @HyukjinKwon's suggestion. I somehow ended up combining them incorrectly.
I will take my suggestion back. I think their suggestions are better than mine.
Test build #82061 has finished for PR 18945 at commit
Test build #82062 has finished for PR 18945 at commit
I've continued to use @HyukjinKwon's suggestion because it should be more performant and is capable of handling this without loss of precision. I believe I've addressed your concerns by only changing the type when we encounter a null (duh).
Test build #82063 has finished for PR 18945 at commit
        dt = np.float64 if column in requires_double_precision else np.float32
        dtype[column] = dt
    elif val is not None:
        if abs(val) > 16777216:  # Max value before np.float32 loses precision.
Why do we need this?
Values above this cannot be represented losslessly as an np.float32.
I think they are represented as np.float64. I added a test in #19319 which follows the previous suggestion with a little tweak.
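For what it's worth, plain pandas already performs that upcast on its own (a minimal check, assuming nothing beyond pandas and NumPy):

import numpy as np
import pandas as pd

s = pd.Series([1, None])
print(s.dtype)    # float64: pandas upcasts int + missing to float64
assert np.isnan(s.iloc[1])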
Test build #82066 has finished for PR 18945 at commit
Hey @logannc, let's not make this complicated for now and go with their approaches first - #18945 (comment) and #18945 (comment). Maybe we can do a follow-up later with some small benchmark results for the performance aspect and the precision concern (I guess the precision concern is not a regression, BTW?). I think we should first match it with when …
Test build #82067 has finished for PR 18945 at commit
…as to prevent needless Exceptions during routine use.
Add the strict=True kwarg to DataFrame.toPandas to allow for a non-strict interpretation of the schema of a DataFrame. This is currently limited to allowing a nullable int column to be interpreted as a float column (because that is the only way Pandas supports nullable int columns, and toPandas actually crashes without this). I consider this small change to be a massive quality-of-life improvement for DataFrames with lots of nullable int columns, which would otherwise need a litany of df.withColumn(name, F.col(name).cast(DoubleType())), etc., just to view them easily or interact with them in-memory.
Possible Objections
… the nullable property of the schema to False.
Alternatives
nullable_int_to_float instead of strict, or some other, similar name.
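For context, the workaround litany mentioned above looks roughly like this today; df is assumed to be an existing DataFrame with several nullable int columns, and the snippet is illustrative rather than part of this patch:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, IntegerType

# Manually cast every nullable int column to double before calling toPandas,
# purely so that None values can be represented (as NaN) on the pandas side.
for field in df.schema:
    if isinstance(field.dataType, IntegerType) and field.nullable:
        df = df.withColumn(field.name, F.col(field.name).cast(DoubleType()))

pdf = df.toPandas()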