
[SPARK-9014][SQL] Allow Python spark API to use built-in exponential operator #8658


Closed
wants to merge 2 commits

Conversation

@0x0FFF 0x0FFF commented Sep 8, 2015

This PR addresses [SPARK-9014](https://issues.apache.org/jira/browse/SPARK-9014)
Added functionality: the Column object in Python now supports the exponentiation operator **
Example:

from pyspark.sql import *
df = sqlContext.createDataFrame([Row(a=2)])
df.select(3 ** df.a, df.a ** 3, df.a ** df.a).collect()

Outputs:

[Row(POWER(3.0, a)=9.0, POWER(a, 3.0)=8.0, POWER(a, a)=4.0)]

@@ -151,6 +162,8 @@ def __init__(self, jc):
__rdiv__ = _reverse_op("divide")
__rtruediv__ = _reverse_op("divide")
__rmod__ = _reverse_op("mod")
__pow__ = _bin_func_op("pow")
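
For context, the helper referenced in this hunk could look roughly like the sketch below at this stage of the review (an illustration pieced together from the lines quoted later in the thread, not the verbatim patch; the reverse flag is assumed to handle the reflected operator):

def _bin_func_op(name, reverse=False, doc="binary function"):
    def _(self, other):
        sc = SparkContext._active_spark_context
        # look up the named function on org.apache.spark.sql.functions via Py4J
        fn = getattr(sc._jvm.functions, name)
        # at this stage of the review, non-Column operands are coerced to float
        jc = other._jc if isinstance(other, Column) else float(other)
        # reverse=True swaps the operands, covering e.g. 3 ** df.a
        njc = fn(self._jc, jc) if not reverse else fn(jc, self._jc)
        return Column(njc)
    _.__doc__ = doc
    return _

__pow__ = _bin_func_op("pow")
__rpow__ = _bin_func_op("pow", reverse=True)
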
Contributor
So this doesn't quite match the Scala API (which I suppose isn't the end of the world), but would it possibly make sense to have a similar functions.py file to match the Scala API?

Contributor
We already have pow in pyspark.sql.functions.

Here, it would be easy to do it like this:

from pyspark.sql.functions import pow
__pow__ = pow
__rpow__ = lambda c, other: pow(other, c)

Contributor Author
@davies, the pow function in pyspark.sql.functions is created with the helper _create_binary_mathfunction, which uses Column internally, so it cannot simply be imported from pyspark.sql.functions.
So, as I outlined below, there are two options: do it like I did, or add pow and ** implementations to Column in Scala. In Scala you can reuse the same Pow class, as it does not depend on Column.
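
To illustrate the point, here is a rough sketch of the shape of that helper (not the verbatim Spark source): it unwraps and wraps Column objects on both sides, and pyspark.sql.functions itself imports Column from pyspark.sql.column, so importing pow back into column.py would be awkward.

def _create_binary_mathfunction(name, doc=""):
    def _(col1, col2):
        sc = SparkContext._active_spark_context
        # both operands are unwrapped from / wrapped into Column objects here
        jcol1 = col1._jc if isinstance(col1, Column) else float(col1)
        jcol2 = col2._jc if isinstance(col2, Column) else float(col2)
        return Column(getattr(sc._jvm.functions, name)(jcol1, jcol2))
    _.__doc__ = doc
    return _

pow = _create_binary_mathfunction("pow", "Returns the value of the first argument raised to the power of the second.")
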

Contributor
I see, I'd like to go with the current approach.

We could rename _bin_func_op to _pow and use it for both __pow__ and __rpow__; that would be clearer.

Contributor Author
I can rename it, but what if we later implement something like __lshift__ or __rshift__ to allow the syntax df.a << 2? Then we would have to either rename _pow back to _bin_func_op and use it, or add one more function. As it is, it's clear that _bin_func_op lets you use a binary function. What do you think?

Contributor
That makes sense, but _bin_func_op expects other to be a Column or a float; it's not in a shape that we could easily reuse for others.

Contributor Author
Agreed. What if I replace

jc = other._jc if isinstance(other, Column) else float(other)

with

jc = other._jc if isinstance(other, Column) else _create_column_from_literal(other)

Would it still be worth renaming it to _pow?

Contributor
This sounds better, go for it, thanks!

0x0FFF commented Sep 9, 2015

@holdenk, there are two ways of implementing this:

  1. The one I've done in this PR: add functionality to column.py to call a binary function from org.apache.spark.sql.functions, and use the pow() function from there for __pow__ and __rpow__. Pros: with this approach you can expose other functions from org.apache.spark.sql.functions in column.py as well. Cons: the Column APIs are not consistent between Python and Scala (to be fair, Scala does not have ** out of the box, so it is a question whether this is really an inconsistency).
  2. Add implementations of ** and pow to org.apache.spark.sql.Column and use them in column.py with the existing _bin_op and _reverse_op. Pros: the ** operation is added to the Scala org.apache.spark.sql.Column as well, and it results in slightly less new code. Cons: this covers only the pow() implementation; if in the future you need to expose another function from org.apache.spark.sql.functions in Python, you would have to change org.apache.spark.sql.Column first once again. Also, in Scala the case 3 ** col("a") will not work, as it would look for a ** implementation on Int.

In my opinion both options are viable, and the second one might even be a bit better. I can easily switch to the second option; I've already implemented it locally. What is your view?
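
For comparison, the Python side of option 2 would reduce to a couple of lines reusing the existing helpers (a hypothetical sketch; it assumes a pow method were first added to the Scala org.apache.spark.sql.Column):

# hypothetical: works only if org.apache.spark.sql.Column gains a pow method
__pow__ = _bin_op("pow")
__rpow__ = _reverse_op("pow")
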

0x0FFF commented Sep 10, 2015

@davies, could you take a look, please?

davies commented Sep 10, 2015

Jenkins, OK to test.

0x0FFF commented Sep 10, 2015

Please also check out the implementation from the last commit. In my opinion it is much more consistent. I just can't bring myself to implement _pow in column.py; it looks too much like duct tape. A general _bin_func_op function that converts the second parameter to a Column via a lit call is better, but I still think the implementation from commit 40cbcb4 is much better: the code is easier to read, more consistent, and it adds less code (only 4 lines, excluding comments and tests).

davies commented Sep 10, 2015

@0x0FFF I think ** only makes sense in Python, so we should not introduce ** into Scala (nor pow on Column).

davies commented Sep 10, 2015

cc @rxin

rxin commented Sep 10, 2015

+1 on not having this for Scala. There is already a pow function that does pow(x, y).

We should just do this for Python.

0x0FFF commented Sep 11, 2015

Agreed; with commit aecc0c2 I reverted to the first option and replaced the float conversion with _create_column_from_literal, as proposed.
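
Concretely (a sketch, not the verbatim diff), the net effect is that inside _bin_func_op the float coercion becomes a literal-Column wrap, and Column exposes both operator directions:

# inside _bin_func_op: wrap non-Column operands as literal Columns instead of coercing to float
jc = other._jc if isinstance(other, Column) else _create_column_from_literal(other)

# on Column:
__pow__ = _bin_func_op("pow")
__rpow__ = _bin_func_op("pow", reverse=True)
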

davies commented Sep 11, 2015

LGTM, waiting for tests.

SparkQA commented Sep 11, 2015

Test build #1745 has finished for PR 8658 at commit aecc0c2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit closed this in c34fc19 on Sep 11, 2015