
[SPARK-4907][MLlib] Inconsistent loss and gradient in LeastSquaresGradient compared with R #3746


Closed
wants to merge 2 commits

Conversation

@dbtsai (Member) commented Dec 19, 2014

In most academic papers and algorithm implementations, the least-squares loss is
written as L = 1/(2n) ||A weights - y||^2 rather than L = 1/n ||A weights - y||^2.
See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf

Since MLlib uses a different convention, the residuals and all the statistical
properties of the model differ from those of the glmnet package in R.

The model coefficients remain the same under this change.
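As a sketch of the point, using hypothetical data and plain NumPy rather than MLlib code: the minimizer is identical under either convention, while the reported loss (and hence every residual-based statistic derived from it) scales by a constant factor of 2.

```python
import numpy as np

# Hypothetical data for illustration (not from the PR).
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = A @ true_w + rng.normal(scale=0.1, size=100)
n = len(y)

# The least-squares minimizer is the same regardless of 1/n vs 1/(2n) scaling,
# since multiplying a loss by a positive constant does not move its argmin.
w, *_ = np.linalg.lstsq(A, y, rcond=None)

loss_mllib = np.sum((A @ w - y) ** 2) / n         # L = 1/n    ||A w - y||^2
loss_glmnet = np.sum((A @ w - y) ** 2) / (2 * n)  # L = 1/(2n) ||A w - y||^2
```

Here `loss_mllib` is exactly twice `loss_glmnet`, which is why stats computed from the loss value disagree with R even though the coefficients match.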

@SparkQA commented Dec 19, 2014

Test build #24656 has started for PR 3746 at commit 0b2c29c.

  • This patch merges cleanly.

@srowen (Member) commented Dec 19, 2014

Seems reasonable to me.

@SparkQA commented Dec 19, 2014

Test build #24656 has finished for PR 3746 at commit 0b2c29c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24656/

@SparkQA commented Dec 19, 2014

Test build #24657 has started for PR 3746 at commit 19c2e85.

  • This patch merges cleanly.

@SparkQA commented Dec 19, 2014

Test build #24657 has finished for PR 3746 at commit 19c2e85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24657/

@bryanyang0528

In my opinion, whether the cost function uses 1/n or 1/(2n) is not the critical difference.
With the cost function L = alpha * 1/(2n) ||A weights - y||^2 (where alpha is the learning rate), we can adjust the learning rate to obtain the same result no matter whether 1/n or 1/(2n) is used.

@srowen (Member) commented Dec 22, 2014

@bryanyang0528 I don't think anyone's suggesting that the extra factor of 1/2 is more or less correct or desirable per se. The solution doesn't depend on the absolute value of the loss function, only on where its minimum is. But the question here is being consistent with the loss function as implemented by other software packages, so that the absolute value can be compared for the same settings of learning rate, regularization parameter, etc.

@bryanyang0528

@srowen I agree that we need an absolute value that can be compared with other software. Maybe a parameter could be added to control the extra factor?

@dbtsai (Member, Author) commented Dec 22, 2014

@bryanyang0528 The learning-rate issue here is a different story. With modern optimization algorithms like LBFGS and OWLQN, no learning rate is required; the algorithm finds the step size in its line-search step. As @srowen pointed out, without the 1/2 factor the statistical properties of the model will differ from those of other packages. At Alpine Data Labs, I implemented a generalized linear model with elastic net (mixing L1 and L2) using OWLQN in Spark, and I can train and get exactly the same coefficients and the same statistical properties for the model, including standard errors, p-values, t-values, the residual plot, the Q-Q plot, etc. For many of our customers in the financial industry, those statistics are very important, and it's required to match the well-known R reference implementation while remaining scalable. Without this PR, the coefficients can be the same, but the statistics will differ.

Although I have only limited time to contribute to open source projects, I'll try to have most of my work available in Spark 1.3.
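A minimal sketch of the line-search point, using SciPy's L-BFGS rather than Spark's LBFGS/OWLQN and hypothetical data: no learning rate is supplied anywhere, and rescaling the loss (1/n vs 1/(2n)) leaves the fitted coefficients unchanged, because the optimizer picks its own step size at each iteration.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data for illustration (not from the PR).
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 2))
y = A @ np.array([2.0, -1.0]) + rng.normal(scale=0.05, size=50)
n = len(y)

def loss(w, scale):
    # Least-squares loss with a configurable leading constant.
    return scale * np.sum((A @ w - y) ** 2)

# L-BFGS determines the step size via line search, so there is no learning
# rate to tune, and the minimizer is invariant to the constant scale factor.
w_n = minimize(loss, np.zeros(2), args=(1.0 / n,), method="L-BFGS-B").x
w_2n = minimize(loss, np.zeros(2), args=(1.0 / (2 * n),), method="L-BFGS-B").x
```

The coefficients `w_n` and `w_2n` agree to numerical tolerance; only the loss value, and thus any statistic reported from it, changes with the convention.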

@bryanyang0528

@dbtsai Thank you for your clear explanation; it helps me a lot!

@mengxr (Contributor) commented Dec 23, 2014

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in a96b727 Dec 23, 2014
@dbtsai dbtsai deleted the lir branch January 14, 2015 22:07