Skip to content

[SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods. #1624

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from

Conversation

miccagiann
Copy link
Contributor

Related to issue: SPARK-2550.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@miccagiann miccagiann changed the title Added support for regularizer and intercection parameters for linear reg... [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods. Jul 28, 2014
@mengxr
Copy link
Contributor

mengxr commented Jul 29, 2014

Jenkins, test this please.

@@ -120,6 +120,23 @@ def train(cls, data, iterations=100, step=1.0,
d._jrdd, iterations, step, miniBatchFraction, i)
return _regression_train_wrapper(sc, train_f, LinearRegressionModel, data, initialWeights)

@classmethod
def trainL2Opt(cls, data, iterations=100, step=1.0, regParam=1.0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of adding new methods, we can add optional parameters to the original train method. For example, regType and regParam. User can set regType to l1, l2, or none (default).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! I am working in this issue!

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA tests have started for PR 1624. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17336/consoleFull

@SparkQA
Copy link

SparkQA commented Jul 29, 2014

QA results for PR 1624:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17336/consoleFull

val L2 : Int = 0
val L1 : Int = 1
val NONE : Int = 2
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used a type of Enumeration in order to separate between the different types of Update Methods [Regularizers] with which the user wants to create the model from training data. I tried to extend this object from Enumeration but from what I have seen it uses reflection heavily and it does not work well with serialized objects and with py4j...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using strings with a clear doc should be sufficient. Then you can map the string to L1Updater or SquaredUpdater inside PythonMLLibAPI.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok! I will do it with strings both in python and in scala.

@miccagiann
Copy link
Contributor Author

Jenkins, test this please.

d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
else:
raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " +
"using the following string values [L1Updater, SquaredUpdater, NONE].")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not using enumerations for regType parameter anymore. Switched to string values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@miccagiann It may be easier if you send the string directly to PythonMLLibAPI().trainLinearRegressionModelWithSGD and implement the logic there.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current version, all branches in the if-else block are essentially the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! I fixed it in the regression.py file where I was calling the same function again and again. As far as PythonMLLibAPI().trainLinearRegressionModelWithSGD I implement there the logic as well... I am building right now and I will commit instantly.

@mengxr
Copy link
Contributor

mengxr commented Jul 31, 2014

@miccagiann For regType, I think people are more familiar with l1 and l2 than L1Updater and SquaredUpdater. Do you mind changing them? Lowercase names are preferred because we use lowercase method names in other places.

miniBatchFraction: Double,
initialWeightsBA: Array[Byte]): java.util.List[java.lang.Object] = {
val lrAlg = new LinearRegressionWithSGD()
lrAlg.setIntercept(intercept)
lrAlg.optimizer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We usually put . at the beginning of the line:

lrAlg.optimizer
  .setNumIterations(numIterations)
  .setRegParam(regParam)
  .setStepSize(stepSize)

@miccagiann
Copy link
Contributor Author

Not at all! I am going to change them! Thanks!

def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
intercept=False, miniBatchFraction=1.0, initialWeights=None):
"""Train a linear regression model on the given data. The 'regType' parameter can take
one from the following string values: "L1Updater" for invoking the lasso regularizer,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Python, the line width for docs should be less than 80 (or 78 to be safe).

@mengxr
Copy link
Contributor

mengxr commented Jul 31, 2014

Btw, you can use @param for the doc for an argument. An example can be found at https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg.py#L41

@miccagiann
Copy link
Contributor Author

I have applied the suggested changes! Please notify me if any more modifications should be performed. Thanks for all your help Xiangrui.

if (regType == "l2")
lrAlg.optimizer.setUpdater(new SquaredL2Updater)
else if (regType == "l1")
lrAlg.optimizer.setUpdater(new L1Updater)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is safer to add

    else if (regType != "none")
      throw IllegalArgumentException("...")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By adding the exception to the scala code, I am going to remove the ValueError exception used in the python code.

@mengxr
Copy link
Contributor

mengxr commented Aug 2, 2014

Jenkins, add to whitelist.

@mengxr
Copy link
Contributor

mengxr commented Aug 2, 2014

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Aug 2, 2014

QA tests have started for PR 1624. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17729/consoleFull

@miccagiann
Copy link
Contributor Author

Xiangrui,

After the tests are finished, should I merge my local branch with the upstream/master so as to make this patch merging smoothly?

@mengxr
Copy link
Contributor

mengxr commented Aug 2, 2014

Yes, you need to merge the latest master and resolve conflicts first.

@miccagiann
Copy link
Contributor Author

I have done it. Thanks for all your help! Now, I suppose that I need to call Jenkins again, right?

@SparkQA
Copy link

SparkQA commented Aug 2, 2014

QA tests have started for PR 1624. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17740/consoleFull

@mengxr
Copy link
Contributor

mengxr commented Aug 2, 2014

LGTM. Waiting for Jenkins ....

@mengxr
Copy link
Contributor

mengxr commented Aug 2, 2014

I added you to the whitelist. Jenkins should be triggered automatically for changes from you.

@miccagiann
Copy link
Contributor Author

Nice! Thanks for everything! Tomorrow I am going to search for new issues
under your supervision. You have helped me a lot!

On Fri, Aug 1, 2014 at 10:55 PM, Xiangrui Meng [email protected]
wrote:

I added you to the whitelist. Jenkins should be triggered automatically
for changes from you.


Reply to this email directly or view it on GitHub
#1624 (comment).

@mengxr
Copy link
Contributor

mengxr commented Aug 2, 2014

Great! Do you mind adding regularization type and intercept to other linear methods? For example, LogisticRegressionWithSGD and SVMWithSGD.

@miccagiann
Copy link
Contributor Author

Yes! I can do this. Is there an issue created in JIRA or it would be part of the same PR?

@mengxr
Copy link
Contributor

mengxr commented Aug 2, 2014

It should be part of the same JIRA. But let's do that in a separate PR.

@miccagiann
Copy link
Contributor Author

OK!

@SparkQA
Copy link

SparkQA commented Aug 2, 2014

QA results for PR 1624:
- This patch PASSES unit tests.

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17729/consoleFull

@SparkQA
Copy link

SparkQA commented Aug 2, 2014

QA results for PR 1624:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test ouptut:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17740/consoleFull

@mengxr
Copy link
Contributor

mengxr commented Aug 2, 2014

Merged into master. Thanks!

@asfgit asfgit closed this in c281189 Aug 2, 2014
@miccagiann
Copy link
Contributor Author

Alright, I was fixing my branches so as my new commits to be included correctly in the new PR I am going to create.

@miccagiann
Copy link
Contributor Author

Xiangrui,

I see that the JIRA issue is closed. Should we create a new one for the LogisticRegressionWithSGD and for SVMWithSGD?

@mengxr
Copy link
Contributor

mengxr commented Aug 2, 2014

I re-opened the JIRA. Please use the same JIRA number for your new PR. Thanks!

@miccagiann miccagiann deleted the new-branch branch August 25, 2014 16:27
xiliu82 pushed a commit to xiliu82/spark that referenced this pull request Sep 4, 2014
…t in pyspark's linear methods.

Related to issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC).

Author: Michael Giannakopoulos <[email protected]>

Closes apache#1624 from miccagiann/new-branch and squashes the following commits:

c02e5f5 [Michael Giannakopoulos] Merge cleanly with upstream/master.
8dcb888 [Michael Giannakopoulos] Putting the if/else if statements in brackets.
fed8eaa [Michael Giannakopoulos] Adding a space in the message related to the IllegalArgumentException.
44e6ff0 [Michael Giannakopoulos] Adding a blank line before python class LinearRegressionWithSGD.
8eba9c5 [Michael Giannakopoulos] Change function signatures. Exception is thrown from the scala component and not from the python one.
638be47 [Michael Giannakopoulos] Modified code to comply with code standards.
ec50ee9 [Michael Giannakopoulos] Shorten the if-elif-else statement in regression.py file
b962744 [Michael Giannakopoulos] Replaced the enum classes, with strings-keywords for defining the values of 'regType' parameter.
78853ec [Michael Giannakopoulos] Providing intercept and regualizer functionallity for linear methods in only one function.
3ac8874 [Michael Giannakopoulos] Added support for regularizer and intercection parameters for linear regression method.
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
Upgrade callhomeservice to 0.2.20

Co-authored-by: Ling Yuan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants