[SPARK-4907][MLlib] Inconsistent loss and gradient in LeastSquaresGradient compared with R #3746
Conversation
Test build #24656 has started for PR 3746 at commit
Seems reasonable to me.
Test build #24656 has finished for PR 3746 at commit
Test FAILed.
Test build #24657 has started for PR 3746 at commit
Test build #24657 has finished for PR 3746 at commit
Test PASSed.
In my opinion, I don't think the choice between a 1/m and a 1/2m factor in the cost function is the critical difference.
@bryanyang0528 I don't think anyone's suggesting that the extra factor of 1/2 is more or less correct or desirable per se. The solution doesn't depend on the absolute value of the loss function, only on where its minimum lies. But the question here is consistency with the loss function as implemented by other software packages, so that the absolute value can be compared for the same settings of learning rate, regularization parameter, etc.
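To make this concrete, here is a minimal sketch in plain Scala (a hypothetical toy example, not MLlib code) showing that the 1/2 factor leaves the minimizer unchanged while the reported loss values differ by exactly a factor of 2:

```scala
object LossScalingDemo {
  // Squared-error loss for a one-feature model y ≈ w * x,
  // with or without the extra 1/2 factor.
  def loss(w: Double, xs: Array[Double], ys: Array[Double], half: Boolean): Double = {
    val n = xs.length
    val sse = xs.zip(ys).map { case (x, y) =>
      val d = w * x - y
      d * d
    }.sum
    if (half) sse / (2.0 * n) else sse / n
  }

  def main(args: Array[String]): Unit = {
    val xs = Array(1.0, 2.0, 3.0)
    val ys = Array(2.0, 4.1, 5.9)
    // Grid search over w; both conventions pick the same minimizer.
    val ws = (0 to 400).map(_ * 0.01)
    val wNoHalf = ws.minBy(w => loss(w, xs, ys, half = false))
    val wHalf = ws.minBy(w => loss(w, xs, ys, half = true))
    println(s"argmin without 1/2: $wNoHalf, with 1/2: $wHalf") // identical
    // At any fixed w, the two loss values differ by exactly 2x.
    println(s"ratio: ${loss(wHalf, xs, ys, half = false) / loss(wHalf, xs, ys, half = true)}")
  }
}
```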
@srowen I agree that the absolute value needs to be comparable with other software. Maybe we could add a parameter to control the extra factor?
@bryanyang0528 The learning rate issue is a different story here. With modern optimization algorithms like L-BFGS and OWLQN, a learning rate is not required; the algorithms find the step size in the line-search step. As @srowen pointed out, the statistical properties of the model will differ from those of other packages without the 1/2 factor. At Alpine Data Labs, I implemented a generalized linear model with elastic net (mixing L1 and L2 penalties) using OWLQN in Spark, and I can train and get exactly the same coefficients and the same statistical properties for the model, including std error, p-value, t-value, residual plot, QQ plot, etc. For many of our customers in the financial industry, those statistics are very important, and it's a requirement to get the same solution as the well-known reference implementation in R, with scalability. Without this PR, the coefficients can be the same, but the statistics will differ. Although I have only limited time for contributing to open source, I'll try to have most of this work available in Spark 1.3.
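For readers unfamiliar with the elastic net mentioned above, here is a minimal sketch (an assumed form following the glmnet paper linked below, not the Alpine or MLlib implementation) of the penalty term, where alpha mixes the L1 and L2 pieces: alpha = 1 gives the lasso and alpha = 0 gives ridge regression.

```scala
// Elastic-net penalty: lambda * (alpha * ||w||_1 + (1 - alpha)/2 * ||w||_2^2),
// following Eq. (1) in the glmnet paper. Names here are illustrative only.
def elasticNetPenalty(weights: Array[Double], lambda: Double, alpha: Double): Double = {
  require(alpha >= 0.0 && alpha <= 1.0, "alpha must be in [0, 1]")
  val l1 = weights.map(math.abs).sum          // L1 term: sum of |w_j|
  val l2 = weights.map(w => w * w).sum        // L2 term: sum of w_j^2
  lambda * (alpha * l1 + (1.0 - alpha) / 2.0 * l2)
}
```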
@dbtsai Thank you for your clear explanation; it helps me a lot!
LGTM. Merged into master. Thanks! |
In most academic papers and algorithm implementations, people use
L = 1/(2n) ||A * weights - y||^2 instead of L = 1/n ||A * weights - y||^2
for the least-squares loss. See Eq. (1) in http://web.stanford.edu/~hastie/Papers/glmnet.pdf
Since MLlib uses a different convention, the residuals will differ, and
all the statistical properties will differ from those of the glmnet package in R.
The model coefficients will still be the same under this change.
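For illustration, here is a minimal sketch in plain Scala (an assumed paraphrase, not the actual MLlib patch) of the per-example loss and gradient under the 1/(2n) convention; the 1/2 factor cancels the 2 produced by differentiating diff^2, so the gradient simplifies to diff * x:

```scala
// With diff = weights^T x - y, the per-example contributions are
// loss_i = diff^2 / 2 and gradient_i = diff * x; averaging over n
// examples then yields L = 1/(2n) ||A * weights - y||^2.
def leastSquaresLossAndGradient(
    features: Array[Double],
    label: Double,
    weights: Array[Double]): (Array[Double], Double) = {
  require(features.length == weights.length, "dimension mismatch")
  // diff = weights^T x - y
  val diff = features.zip(weights).map { case (x, w) => x * w }.sum - label
  val gradient = features.map(_ * diff) // d/dw (diff^2 / 2) = diff * x
  (gradient, diff * diff / 2.0)         // 1/2 factor matches the glmnet convention
}
```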