[SPARK-9014][SQL] Allow Python spark API to use built-in exponential operator #8658

Closed
wants to merge 2 commits into from
13 changes: 13 additions & 0 deletions python/pyspark/sql/column.py
@@ -91,6 +91,17 @@ def _(self):
return _


def _bin_func_op(name, reverse=False, doc="binary function"):
Contributor:

If we are just looking at adding pow then it might make more sense to add pow to column, but if we are looking at making it easy to access the rest of the functions in functions.scala this sounds good (although maybe add a tiny comment explaining the purpose).

def _(self, other):
sc = SparkContext._active_spark_context
fn = getattr(sc._jvm.functions, name)
jc = other._jc if isinstance(other, Column) else _create_column_from_literal(other)
njc = fn(self._jc, jc) if not reverse else fn(jc, self._jc)
return Column(njc)
_.__doc__ = doc
return _
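The `_bin_func_op` helper above is a method factory: it builds a dunder method that forwards both operands to a named binary function on the JVM side, swapping them when `reverse=True`. A minimal, Spark-free sketch of the same pattern (the `FakeColumn` class and `_FUNCTIONS` registry are illustrative stand-ins, not PySpark's real `Column` or JVM bridge):

```python
# Stand-in for sc._jvm.functions: maps a function name to a callable.
_FUNCTIONS = {"pow": lambda a, b: a ** b}

def _bin_func_op(name, reverse=False, doc="binary function"):
    # Build a method that looks up `name` and applies it to (self, other),
    # or (other, self) when reverse=True -- mirroring the PR's helper.
    def _(self, other):
        fn = _FUNCTIONS[name]
        other_val = other.value if isinstance(other, FakeColumn) else other
        result = fn(other_val, self.value) if reverse else fn(self.value, other_val)
        return FakeColumn(result)
    _.__doc__ = doc
    return _

class FakeColumn:
    def __init__(self, value):
        self.value = value
    __pow__ = _bin_func_op("pow")
    __rpow__ = _bin_func_op("pow", reverse=True)

print((FakeColumn(3) ** 2).value)  # 9: fn(self, other)
print((2 ** FakeColumn(3)).value)  # 8: fn(other, self), via reverse=True
```

The factory keeps each generated method closed over `name` and `reverse`, so one helper covers both the normal and reflected operator.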


def _bin_op(name, doc="binary operator"):
""" Create a method for given binary operator
"""
@@ -151,6 +162,8 @@ def __init__(self, jc):
__rdiv__ = _reverse_op("divide")
__rtruediv__ = _reverse_op("divide")
__rmod__ = _reverse_op("mod")
__pow__ = _bin_func_op("pow")
Contributor:

So this doesn't quite match the Scala API (which I suppose isn't the end of the world), but would it possibly make sense to have a similar functions.py file to match the Scala API?

Contributor:

We already have pow in pyspark.sql.functions.

For here, it's easy to do like this:

from pyspark.sql.functions import pow
__pow__ = pow
__rpow__ = lambda c, other: pow(other, c)

Contributor (author):

@davies, the pow function in pyspark.sql.functions is created with _create_binary_mathfunction, which uses Column internally, so it cannot simply be imported from pyspark.sql.functions.
So, as I outlined below, there are two options: do it like I did, or add pow and ** implementations to Column in Scala. In Scala you can reuse the same Pow class, as it does not depend on Column.

Contributor:

I see, I'd like to go with the current approach.

We could rename _bin_func_op to _pow and use it for both __pow__ and __rpow__; that would be clearer.

Contributor (author):

I can rename, but what if we later implement something like __lshift__ or __rshift__ to allow the syntax df.a << 2? Then we would have to either rename _pow back to _bin_func_op and reuse it, or add one more function. As it stands, _bin_func_op makes it clear that it wraps an arbitrary binary function. What do you think?

Contributor:

That makes sense, but _bin_func_op expects other to be a Column or a float; it's not in a shape that could easily be reused for other operators.

Contributor (author):

Agree. What if I replace

jc = other._jc if isinstance(other, Column) else float(other)

with

jc = other._jc if isinstance(other, Column) else _create_column_from_literal(other)

Would it still be worth renaming to _pow?

Contributor:

This sounds better, go for it, thanks!

__rpow__ = _bin_func_op("pow", reverse=True)

# logistic operators
__eq__ = _bin_op("equalTo")
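The `reverse=True` flag on `__rpow__` exists because of how Python dispatches reflected operators: for `1 ** col`, Python first tries `int.__pow__(1, col)`, which returns `NotImplemented`, then falls back to `col.__rpow__(1)`. Inside `__rpow__`, `self` is the right-hand operand, so the underlying function must be called with the arguments swapped. A toy class (not PySpark's Column) tracing the dispatch:

```python
calls = []

class Traced:
    def __pow__(self, other):
        # Invoked for `self ** other`: operands in natural order.
        calls.append(("__pow__", other))
        return NotImplemented is None  # dummy return, value unused here

    def __rpow__(self, other):
        # Invoked for `other ** self` when other's type can't handle it:
        # operands arrive swapped, which is what reverse=True accounts for.
        calls.append(("__rpow__", other))
        return None

t = Traced()
t ** 2   # left operand defines __pow__: normal order
1 ** t   # int.__pow__(1, t) returns NotImplemented, so Traced.__rpow__ runs
print(calls)  # [('__pow__', 2), ('__rpow__', 1)]
```

This is why the test below exercises both `(1 ** ci)` and `(ci ** 1)`: they reach the Column through two different dunder methods.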
2 changes: 1 addition & 1 deletion python/pyspark/sql/tests.py
@@ -568,7 +568,7 @@ def test_column_operators(self):
cs = self.df.value
c = ci == cs
self.assertTrue(isinstance((- ci - 1 - 2) % 3 * 2.5 / 3.5, Column))
rcc = (1 + ci), (1 - ci), (1 * ci), (1 / ci), (1 % ci)
rcc = (1 + ci), (1 - ci), (1 * ci), (1 / ci), (1 % ci), (1 ** ci), (ci ** 1)
self.assertTrue(all(isinstance(c, Column) for c in rcc))
cb = [ci == 5, ci != 0, ci > 3, ci < 4, ci >= 0, ci <= 7]
self.assertTrue(all(isinstance(c, Column) for c in cb))