[SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods. #1624

miccagiann · 2014-07-28T23:49:31Z

Related to issue: SPARK-2550.

…regression method.

AmplabJenkins · 2014-07-28T23:52:20Z

Can one of the admins verify this patch?

mengxr · 2014-07-29T05:19:36Z

Jenkins, test this please.

mengxr · 2014-07-29T05:22:29Z

python/pyspark/mllib/regression.py

@@ -120,6 +120,23 @@ def train(cls, data, iterations=100, step=1.0,
            d._jrdd, iterations, step, miniBatchFraction, i)
        return _regression_train_wrapper(sc, train_f, LinearRegressionModel, data, initialWeights)

+    @classmethod
+    def trainL2Opt(cls, data, iterations=100, step=1.0, regParam=1.0,


Instead of adding new methods, we can add optional parameters to the original train method, for example regType and regParam. Users can set regType to l1, l2, or none (default).

Ok! I am working on this issue!
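A minimal sketch of the single-method design suggested above, written in plain Python so it runs without Spark. It is not the MLlib implementation: it assumes one numeric feature, full-batch gradient descent, and a constant step size, and the update rules are only loosely modeled on an L2 shrinkage step and an L1 soft-thresholding step. The function name, defaults, and data are illustrative.

```python
def _soft_threshold(w, shrink):
    """Move w toward zero by `shrink`, stopping at zero (L1 step)."""
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0


def train(data, iterations=100, step=0.1, regParam=1.0, regType="none",
          intercept=False):
    """Fit y ~ w * x + b on (x, y) pairs; hypothetical stand-in for train."""
    if regType not in ("l1", "l2", "none"):
        raise ValueError(
            "regType must be 'l1', 'l2' or 'none', got %r" % (regType,))
    w, b = 0.0, 0.0
    n = float(len(data))
    for _ in range(iterations):
        # Gradients of the mean of 0.5 * (w * x + b - y) ** 2.
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        if regType == "l2":
            # Shrink the weight, then take the gradient step.
            w = w * (1.0 - step * regParam) - step * grad_w
        elif regType == "l1":
            # Take the gradient step, then soft-threshold the result.
            w = _soft_threshold(w - step * grad_w, step * regParam)
        else:
            w = w - step * grad_w
        if intercept:
            # The intercept is only updated when requested; it is not
            # regularized in this sketch.
            b = b - step * grad_b
    return w, b


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # points on y = 2x
w_none, _ = train(data)
w_l2, _ = train(data, regType="l2")
w_l1, _ = train(data, regType="l1")
```

With one entry point, callers choose the regularizer by keyword, and both regularized fits end with a smaller weight than the unregularized one.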

SparkQA · 2014-07-29T05:23:55Z

QA tests have started for PR 1624. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17336/consoleFull

SparkQA · 2014-07-29T06:11:49Z

QA results for PR 1624:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17336/consoleFull

…in only one function.

miccagiann · 2014-07-30T05:01:38Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

+    val L2 : Int = 0
+    val L1 : Int = 1
+    val NONE : Int = 2
+  }


I used an enumeration-like object to distinguish between the update methods (regularizers) the user can choose when training a model. I tried to make this object extend Enumeration, but from what I have seen Enumeration relies heavily on reflection and does not work well with serialized objects or with py4j...

Using strings with a clear doc should be sufficient. Then you can map the string to L1Updater or SquaredUpdater inside PythonMLLibAPI.

Ok! I will do it with strings both in python and in scala.

miccagiann · 2014-07-30T05:04:43Z

Jenkins, test this please.

…ues of 'regType' parameter.

miccagiann · 2014-07-31T00:27:26Z

python/pyspark/mllib/regression.py

+                d._jrdd, iterations, step, regParam, regType, intercept, miniBatchFraction, i)
+        else:
+            raise ValueError("Invalid value for 'regType' parameter. Can only be initialized " +
+                             "using the following string values [L1Updater, SquaredUpdater, NONE].")


Not using enumerations for regType parameter anymore. Switched to string values.

@miccagiann It may be easier if you send the string directly to PythonMLLibAPI().trainLinearRegressionModelWithSGD and implement the logic there.

In the current version, all branches in the if-else block are essentially the same.

Yes! I fixed it in regression.py, where I was calling the same function again and again. The logic is now implemented in PythonMLLibAPI().trainLinearRegressionModelWithSGD as well. I am building right now and will commit right away.

mengxr · 2014-07-31T01:51:41Z

@miccagiann For regType, I think people are more familiar with l1 and l2 than L1Updater and SquaredUpdater. Do you mind changing them? Lowercase names are preferred because we use lowercase method names in other places.

mengxr · 2014-07-31T01:54:23Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

      miniBatchFraction: Double,
      initialWeightsBA: Array[Byte]): java.util.List[java.lang.Object] = {
+    val lrAlg = new LinearRegressionWithSGD()
+    lrAlg.setIntercept(intercept)
+    lrAlg.optimizer.


We usually put . at the beginning of the line:

    lrAlg.optimizer
      .setNumIterations(numIterations)
      .setRegParam(regParam)
      .setStepSize(stepSize)

miccagiann · 2014-07-31T01:57:26Z

Not at all! I am going to change them! Thanks!

mengxr · 2014-07-31T02:01:21Z

python/pyspark/mllib/regression.py

+    def train(cls, data, iterations=100, step=1.0, regParam=1.0, regType=None,
+              intercept=False, miniBatchFraction=1.0, initialWeights=None):
+        """Train a linear regression model on the given data. The 'regType' parameter can take
+           one from the following string values: "L1Updater" for invoking the lasso regularizer,


In Python, the line width for docs should be less than 80 (or 78 to be safe).

mengxr · 2014-07-31T02:19:36Z

Btw, you can use @param for the doc for an argument. An example can be found at https://github.com/apache/spark/blob/master/python/pyspark/mllib/linalg.py#L41
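A sketch of what a docstring following both suggestions might look like: one @param field per argument and every line under 80 characters. The signature is taken from the diff hunk above with the cls argument dropped so it runs standalone; the parameter descriptions are illustrative wording, not the text that was merged, and the body is a placeholder.

```python
def train(data, iterations=100, step=1.0, regParam=1.0, regType=None,
          intercept=False, miniBatchFraction=1.0, initialWeights=None):
    """
    Train a linear regression model on the given data.

    @param data:              The training data.
    @param iterations:        The number of iterations.
    @param step:              The step size used in SGD.
    @param regParam:          The regularization parameter.
    @param regType:           The regularizer type: "l1", "l2", or
                              None for no regularization.
    @param intercept:         Whether to fit an intercept term.
    @param miniBatchFraction: Fraction of data used per iteration.
    @param initialWeights:    The initial weights.
    """
    # Documentation sketch only; no training logic is implemented here.
    raise NotImplementedError("documentation sketch only")
```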

miccagiann · 2014-07-31T03:19:56Z

I have applied the suggested changes! Please notify me if any more modifications should be performed. Thanks for all your help Xiangrui.

mengxr · 2014-07-31T06:12:30Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

+    if (regType == "l2")
+      lrAlg.optimizer.setUpdater(new SquaredL2Updater)
+    else if (regType == "l1")
+      lrAlg.optimizer.setUpdater(new L1Updater)


It is safer to add

else if (regType != "none") throw new IllegalArgumentException("...")

Since the exception will now be thrown from the Scala code, I am going to remove the ValueError from the Python code.
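The string dispatch discussed in this thread, including the failure on unknown values, can be sketched as a lookup table. This is a Python illustration of the idea only; the PR implements it in Scala inside PythonMLLibAPI. The three classes below are empty placeholders named after the updaters mentioned above, and mapping "none" to a SimpleUpdater placeholder is an assumption of this sketch.

```python
class SimpleUpdater(object):
    """Placeholder for an update rule without regularization."""


class L1Updater(object):
    """Placeholder for the L1 (lasso) update rule."""


class SquaredL2Updater(object):
    """Placeholder for the L2 (ridge) update rule."""


# Hypothetical table from the user-facing string to an updater class.
_UPDATERS = {
    "l1": L1Updater,
    "l2": SquaredL2Updater,
    "none": SimpleUpdater,
}


def updater_for(reg_type):
    """Return a new updater for reg_type, rejecting unknown strings."""
    try:
        return _UPDATERS[reg_type]()
    except KeyError:
        raise ValueError(
            "Invalid value for 'regType' parameter. Can only be "
            "initialized using the following string values: "
            "[l1, l2, none].")
```

Keeping the check in one place means the caller can pass the string through unchanged, which is the simplification requested earlier in the review.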

mengxr · 2014-08-02T01:18:32Z

Jenkins, add to whitelist.

mengxr · 2014-08-02T01:18:39Z

Jenkins, test this please.

SparkQA · 2014-08-02T01:24:13Z

QA tests have started for PR 1624. This patch DID NOT merge cleanly!
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17729/consoleFull

miccagiann · 2014-08-02T02:10:31Z

Xiangrui,

After the tests finish, should I merge upstream/master into my local branch so that this patch merges cleanly?

mengxr · 2014-08-02T02:30:01Z

Yes, you need to merge the latest master and resolve conflicts first.

miccagiann · 2014-08-02T02:51:18Z

I have done it. Thanks for all your help! Now, I suppose that I need to call Jenkins again, right?

SparkQA · 2014-08-02T02:54:13Z

QA tests have started for PR 1624. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17740/consoleFull

mengxr · 2014-08-02T02:54:22Z

LGTM. Waiting for Jenkins ....

mengxr · 2014-08-02T02:55:43Z

I added you to the whitelist. Jenkins should be triggered automatically for changes from you.

miccagiann · 2014-08-02T02:57:05Z

Nice! Thanks for everything! Tomorrow I am going to search for new issues
under your supervision. You have helped me a lot!


mengxr · 2014-08-02T03:06:14Z

Great! Do you mind adding regularization type and intercept to other linear methods? For example, LogisticRegressionWithSGD and SVMWithSGD.

miccagiann · 2014-08-02T03:11:06Z

Yes! I can do this. Is there an issue created in JIRA, or would it be part of the same PR?

mengxr · 2014-08-02T03:14:25Z

It should be part of the same JIRA. But let's do that in a separate PR.

miccagiann · 2014-08-02T03:14:56Z

OK!

SparkQA · 2014-08-02T03:17:35Z

QA results for PR 1624:
- This patch PASSES unit tests.

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17729/consoleFull

SparkQA · 2014-08-02T03:58:51Z

QA results for PR 1624:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17740/consoleFull

mengxr · 2014-08-02T04:01:27Z

Merged into master. Thanks!

miccagiann · 2014-08-02T04:30:25Z

Alright, I was fixing my branches so that my new commits are included correctly in the new PR I am going to create.

miccagiann · 2014-08-02T04:32:32Z

Xiangrui,

I see that the JIRA issue is closed. Should we create a new one for the LogisticRegressionWithSGD and for SVMWithSGD?

mengxr · 2014-08-02T04:36:40Z

I re-opened the JIRA. Please use the same JIRA number for your new PR. Thanks!

…t in pyspark's linear methods.

Related to issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC).

Author: Michael Giannakopoulos <[email protected]>

Closes apache#1624 from miccagiann/new-branch and squashes the following commits:

c02e5f5 [Michael Giannakopoulos] Merge cleanly with upstream/master.
8dcb888 [Michael Giannakopoulos] Putting the if/else if statements in brackets.
fed8eaa [Michael Giannakopoulos] Adding a space in the message related to the IllegalArgumentException.
44e6ff0 [Michael Giannakopoulos] Adding a blank line before python class LinearRegressionWithSGD.
8eba9c5 [Michael Giannakopoulos] Change function signatures. Exception is thrown from the scala component and not from the python one.
638be47 [Michael Giannakopoulos] Modified code to comply with code standards.
ec50ee9 [Michael Giannakopoulos] Shorten the if-elif-else statement in regression.py file
b962744 [Michael Giannakopoulos] Replaced the enum classes, with strings-keywords for defining the values of 'regType' parameter.
78853ec [Michael Giannakopoulos] Providing intercept and regualizer functionallity for linear methods in only one function.
3ac8874 [Michael Giannakopoulos] Added support for regularizer and intercection parameters for linear regression method.


Added support for regularizer and intercection parameters for linear …

3ac8874

…regression method.

miccagiann changed the title ~~Added support for regularizer and intercection parameters for linear reg...~~ [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods. Jul 28, 2014

mengxr reviewed Jul 29, 2014

Providing intercept and regualizer functionallity for linear methods …

78853ec

…in only one function.

miccagiann reviewed Jul 30, 2014

Replaced the enum classes, with strings-keywords for defining the val…

b962744

…ues of 'regType' parameter.

miccagiann reviewed Jul 31, 2014

Shorten the if-elif-else statement in regression.py file

ec50ee9

mengxr reviewed Jul 31, 2014

Modified code to comply with code standards.

638be47

mengxr reviewed Jul 31, 2014

Merge cleanly with upstream/master.

c02e5f5

asfgit closed this in c281189 Aug 2, 2014

miccagiann deleted the new-branch branch August 25, 2014 16:27

sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023

rdar://102058124 Add PeakMem usage (apache#1624) (apache#1629)

e7b19a1

Upgrade callhomeservice to 0.2.20 Co-authored-by: Ling Yuan <[email protected]>
