[SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods. #1624
…regression method.
Can one of the admins verify this patch?
Jenkins, test this please.
@@ -120,6 +120,23 @@ def train(cls, data, iterations=100, step=1.0,
                d._jrdd, iterations, step, miniBatchFraction, i)
        return _regression_train_wrapper(sc, train_f, LinearRegressionModel, data, initialWeights)

    @classmethod
    def trainL2Opt(cls, data, iterations=100, step=1.0, regParam=1.0,
Instead of adding new methods, we can add optional parameters to the original train method. For example, `regType` and `regParam`. User can set `regType` to `l1`, `l2`, or `none` (default).
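A minimal sketch of what this suggestion amounts to on the Python side, assuming a plain function rather than the real classmethod. The function below is a hypothetical stand-in: it does not call into the JVM and only returns the settings it would forward. The parameter names follow the ones discussed in this PR.

```python
def train(data, iterations=100, step=1.0, regParam=1.0, regType="none",
          intercept=False, miniBatchFraction=1.0, initialWeights=None):
    """Hypothetical stand-in for a single train entry point.

    It only collects the settings that would be passed on to the
    Scala side; no model is actually trained here.
    """
    return {
        "numExamples": len(data),
        "iterations": iterations,
        "step": step,
        "regParam": regParam,
        "regType": regType,
        "intercept": intercept,
        "miniBatchFraction": miniBatchFraction,
        "initialWeights": initialWeights,
    }


# Made-up (label, features) pairs, for illustration only.
points = [(1.0, [0.0]), (2.0, [1.0])]

# Callers that do not care about regularization need no changes.
legacy = train(points, iterations=10)

# Callers that want lasso with an intercept opt in through keywords.
lasso = train(points, iterations=10, regParam=0.1, regType="l1",
              intercept=True)
```

Because every new parameter has a default, one method covers what would otherwise need a separate `trainL2Opt`-style method per regularizer.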
OK! I am working on this issue!
QA tests have started for PR 1624. This patch merges cleanly.
QA results for PR 1624:
…in only one function.
  val L2 : Int = 0
  val L1 : Int = 1
  val NONE : Int = 2
}
I used an enumeration-like object to distinguish between the update methods (regularizers) the user can choose when training a model. I tried to make this object extend Enumeration, but from what I have seen it relies heavily on reflection and does not work well with serialized objects or with py4j...
Using strings with a clear doc should be sufficient. Then you can map the string to `L1Updater` or `SquaredUpdater` inside `PythonMLLibAPI`.
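The string-to-updater mapping can be sketched as a lookup table. The classes below are pure-Python stand-ins that borrow the updater names used in this PR; `SimpleUpdater`, `UPDATERS`, and `make_updater` are hypothetical names introduced for illustration. The update rules are simplified (constant step size, no decay) and are not MLlib's implementation.

```python
import math


class SimpleUpdater:
    """Plain gradient step with no regularization."""

    def update(self, weights, gradient, step, reg_param):
        return [w - step * g for w, g in zip(weights, gradient)]


class SquaredL2Updater:
    """Gradient step plus L2 (ridge) shrinkage of the weights."""

    def update(self, weights, gradient, step, reg_param):
        return [w * (1.0 - step * reg_param) - step * g
                for w, g in zip(weights, gradient)]


class L1Updater:
    """Gradient step followed by soft thresholding (lasso)."""

    def update(self, weights, gradient, step, reg_param):
        shrink = step * reg_param
        stepped = [w - step * g for w, g in zip(weights, gradient)]
        # Pull each weight toward zero by `shrink`, clamping at zero.
        return [math.copysign(max(abs(w) - shrink, 0.0), w)
                for w in stepped]


# One documented string per regularizer, mapped to its updater class.
UPDATERS = {
    "none": SimpleUpdater,
    "l1": L1Updater,
    "l2": SquaredL2Updater,
}


def make_updater(reg_type):
    """Return a new updater instance for the given regType string."""
    return UPDATERS[reg_type]()


weights = [1.0, -0.5]
zero_gradient = [0.0, 0.0]
l2_result = make_updater("l2").update(weights, zero_gradient, 0.5, 0.5)
l1_result = make_updater("l1").update(weights, zero_gradient, 0.5, 0.5)
```

With a zero gradient the two regularizers differ visibly: L2 scales each weight by a constant factor, while L1 subtracts a fixed amount from each magnitude.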
OK! I will use strings in both Python and Scala.
Jenkins, test this please.
…ues of 'regType' parameter.
            d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
    else:
        raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " +
                         "using the following string values [L1Updater, SquaredUpdater, NONE].")
I am no longer using enumerations for the `regType` parameter. I switched to string values.
@miccagiann It may be easier if you send the string directly to `PythonMLLibAPI().trainLinearRegressionModelWithSGD` and implement the logic there.
In the current version, all branches in the if-else block are essentially the same.
Yes! I fixed it in the `regression.py` file, where I was calling the same function again and again. As for `PythonMLLibAPI().trainLinearRegressionModelWithSGD`, I am implementing the logic there as well. I am building right now and will commit right away.
@miccagiann For
      miniBatchFraction: Double,
      initialWeightsBA: Array[Byte]): java.util.List[java.lang.Object] = {
    val lrAlg = new LinearRegressionWithSGD()
    lrAlg.setIntercept(intercept)
    lrAlg.optimizer.
We usually put `.` at the beginning of the line:
    lrAlg.optimizer
      .setNumIterations(numIterations)
      .setRegParam(regParam)
      .setStepSize(stepSize)
Not at all! I am going to change them! Thanks!
    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
              intercept=False, miniBatchFraction=1.0, initialWeights=None):
        """Train a linear regression model on the given data. The 'regType' parameter can take
        one from the following string values: "L1Updater" for invoking the lasso regularizer,
In Python, the line width for docs should be less than 80 (or 78 to be safe).
Btw, you can use
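One way to check the width limit before committing is to rewrap the docstring text with the standard-library `textwrap` module. This is only a sketch: the docstring text below is a made-up example, not the one from the patch, and the 78-column limit follows the "to be safe" figure from the comment above.

```python
import textwrap

# Made-up docstring text, used only to demonstrate the wrapping.
DOC_TEXT = (
    "Train a linear regression model on the given data. The regType "
    "parameter selects the regularizer: 'l1' for lasso, 'l2' for ridge, "
    "and 'none' for no regularization. The regParam parameter sets the "
    "regularization strength."
)

MAX_WIDTH = 78

# Rewrap the paragraph so that no line is wider than MAX_WIDTH.
wrapped = textwrap.fill(DOC_TEXT, width=MAX_WIDTH)

# Collect any line that still exceeds the limit (expected to be empty).
too_long = [line for line in wrapped.splitlines() if len(line) > MAX_WIDTH]

print(wrapped)
```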
I have applied the suggested changes! Please let me know if any more modifications are needed. Thanks for all your help, Xiangrui.
    if (regType == "l2")
      lrAlg.optimizer.setUpdater(new SquaredL2Updater)
    else if (regType == "l1")
      lrAlg.optimizer.setUpdater(new L1Updater)
It is safer to add
    else if (regType != "none")
      throw new IllegalArgumentException("...")
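The point of the extra branch is that an unrecognized string should fail loudly instead of silently training without regularization. A Python rendering of the same control flow, as a sketch only: `select_updater_name` is a hypothetical helper, and the returned strings merely name the Scala classes from the diff.

```python
def select_updater_name(reg_type):
    """Return the updater class name for reg_type.

    Returns None for "none", meaning the optimizer's default updater is
    left in place. Any other unknown string is rejected.
    """
    if reg_type == "l2":
        return "SquaredL2Updater"
    elif reg_type == "l1":
        return "L1Updater"
    elif reg_type != "none":
        # Without this branch, a typo such as "L2" would fall through
        # and quietly behave like "none".
        raise ValueError(
            "Invalid value for 'regType' parameter: %r. "
            "Expected one of 'l1', 'l2', 'none'." % (reg_type,))
    return None


try:
    select_updater_name("L2")  # wrong case, so it is rejected
    rejected = False
except ValueError:
    rejected = True
```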
Since the exception is now thrown in the Scala code, I am going to remove the `ValueError` exception from the Python code.
Jenkins, add to whitelist.
Jenkins, test this please.
QA tests have started for PR 1624. This patch DID NOT merge cleanly!
Xiangrui, after the tests are finished, should I merge my local branch with upstream/master so that this patch merges cleanly?
Yes, you need to merge the latest master and resolve conflicts first.
I have done it. Thanks for all your help! Now, I suppose that I need to call Jenkins again, right?
QA tests have started for PR 1624. This patch merges cleanly.
LGTM. Waiting for Jenkins ....
I added you to the whitelist. Jenkins should be triggered automatically for changes from you.
Nice! Thanks for everything! Tomorrow I am going to search for new issues.
Great! Do you mind adding regularization type and intercept to other linear methods? For example,
Yes! I can do this. Is there an issue created in JIRA, or would it be part of the same PR?
It should be part of the same JIRA. But let's do that in a separate PR.
OK!
QA results for PR 1624:
QA results for PR 1624:
Merged into master. Thanks!
Alright, I was fixing my branches so that my new commits are included correctly in the new PR I am going to create.
Xiangrui, I see that the JIRA issue is closed. Should we create a new one for the
I re-opened the JIRA. Please use the same JIRA number for your new PR. Thanks!
…t in pyspark's linear methods.

Related to issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC).

Author: Michael Giannakopoulos <[email protected]>

Closes apache#1624 from miccagiann/new-branch and squashes the following commits:

c02e5f5 [Michael Giannakopoulos] Merge cleanly with upstream/master.
8dcb888 [Michael Giannakopoulos] Putting the if/else if statements in brackets.
fed8eaa [Michael Giannakopoulos] Adding a space in the message related to the IllegalArgumentException.
44e6ff0 [Michael Giannakopoulos] Adding a blank line before python class LinearRegressionWithSGD.
8eba9c5 [Michael Giannakopoulos] Change function signatures. Exception is thrown from the scala component and not from the python one.
638be47 [Michael Giannakopoulos] Modified code to comply with code standards.
ec50ee9 [Michael Giannakopoulos] Shorten the if-elif-else statement in regression.py file
b962744 [Michael Giannakopoulos] Replaced the enum classes, with strings-keywords for defining the values of 'regType' parameter.
78853ec [Michael Giannakopoulos] Providing intercept and regualizer functionallity for linear methods in only one function.
3ac8874 [Michael Giannakopoulos] Added support for regularizer and intercection parameters for linear regression method.
Upgrade callhomeservice to 0.2.20 Co-authored-by: Ling Yuan <[email protected]>
Related to issue: SPARK-2550.