Regularized interval regression
Interval regression is a class of machine learning models which is useful when predicted values should be real numbers, but outputs in the training data set may be partially observed. A common example is survival analysis, in which data are patient survival times.
For example, say that Alice and Bob came into the hospital and were treated for cancer on the same day in 2000. Now we are in 2016 and we would like to study the treatment efficacy. Say Alice died in 2010, and Bob is still alive. The survival time for Alice is 10 years, and although we do not know Bob’s survival time, we know it is in the interval (16, Infinity).
Say that we also measured some covariates (input variables) for Alice and Bob (age, sex, gene expression). We can fit an Accelerated Failure Time (AFT) model which takes those input variables and outputs a predicted survival time (Simple 1-page explanation). L1 regularized AFT models are of interest when there are many input variables and we would like the model to automatically ignore those which are uninformative (do not help predict survival time). Several papers describe L1 regularized AFT models:
- Tech report on L1 regularization for AFT models, Huang et al 2005.
- PubMed article on L1 regularization for AFT models, Cai et al 2011.
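For orientation, an AFT model is commonly written as a linear model for the log of the survival time. The notation here is an assumption made for illustration and is not taken from the papers above: $T_i$ is the survival time of patient $i$, $x_i$ the covariate vector, $\beta$ the weight vector, $\sigma > 0$ a scale parameter, and $\varepsilon_i$ a noise term.

$$
\log T_i = x_i^\top \beta + \sigma \varepsilon_i
$$

The distribution assumed for $\varepsilon_i$ determines the specific model (logistic noise gives the log-logistic AFT model, for example), and an L1 penalty on $\beta$ can set the weights of uninformative covariates to exactly zero.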
Interval regression (or interval censoring) is a generalization in which any kind of interval is an acceptable output in the training data:
- exactly 10: (10, 10)
- at least 16: (16, Infinity)
- at most 3: (-Infinity, 3)
- between -4 and 5: (-4, 5)
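As a minimal sketch of how such interval outputs can be scored, the Python code below stores each output as a (lower, upper) pair and computes a negative log-likelihood under logistic noise. Several things here are assumptions for illustration only: the function names (`logistic_cdf`, `interval_nll`), the fixed `scale` parameter, and the choice to treat the outputs as real-valued (for example log survival times), so that logistic noise corresponds to a log-logistic AFT model.

```python
import math

def logistic_cdf(z):
    """CDF of the standard logistic distribution, written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def interval_nll(lower, upper, pred, scale=1.0):
    """Negative log-likelihood of one output interval (hypothetical helper).

    Assumes output = pred + scale * noise, with standard logistic noise.
    lower == upper means the output was observed exactly.
    Note: if pred is extremely far from the interval, the probability can
    round to zero and math.log raises ValueError; a real solver would need
    a numerically safer formulation.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if lower == upper:
        # Exact observation: use the logistic density p * (1 - p) / scale.
        p = logistic_cdf((lower - pred) / scale)
        return -math.log(p * (1.0 - p) / scale)
    # Censored observation: probability mass that falls inside the interval.
    # Infinite limits work directly: the CDF returns 0.0 or 1.0 for them.
    prob = logistic_cdf((upper - pred) / scale) - logistic_cdf((lower - pred) / scale)
    return -math.log(prob)

# The four kinds of interval listed above.
intervals = [
    (10.0, 10.0),        # exactly 10
    (16.0, math.inf),    # at least 16
    (-math.inf, 3.0),    # at most 3
    (-4.0, 5.0),         # between -4 and 5
]

for lower, upper in intervals:
    print((lower, upper), interval_nll(lower, upper, pred=4.0))
```

A prediction inside the interval gets a lower loss than a prediction far outside it, which is the behavior an interval regression solver would exploit.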
As far as I know, there is not yet any way to fit an L1 regularized model for this more general interval output data. The closest existing R packages are:
- AdapEnetClass::WEnetCC.aft (arXiv paper) fits a model with AFT loss and elastic net regularization.
- glmnet fits models for elastic net regularization with several loss functions, but neither AFT nor interval regression losses are supported.
- interval::icfit and survival::survreg provide solvers for non-regularized interval regression models.
Implement the first R package to support both
- general interval regression loss functions (not just right-censored survival data), and
- elastic net regularization.
| loss | no regularization | elastic net regularization |
|---|---|---|
| Cox | coxph | glmnet |
| AFT | icfit, survreg | WEnetCC.aft |
| interval regression | icfit, survreg | THIS PROJECT |
There are two possible coding strategies:
- Fork the solver from the glmnet source code and adapt it to work with the interval regression loss. This should be possible if you understand their FORTRAN code.
- Read Simon et al. (JSS) and implement a coordinate descent solver from scratch in C code. (This coding strategy is preferred.)
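To give a feel for what a coordinate descent solver for the elastic net looks like, here is a minimal Python sketch. It is not the solver this project asks for: it swaps in the ordinary squared error loss in place of the interval regression loss, it is written in Python rather than C, and the function names (`soft_threshold`, `elastic_net_cd`) and the penalty parameterization (`lam` for overall strength, `alpha` for the L1/L2 mix, as in glmnet) are choices made for illustration.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def elastic_net_cd(X, y, lam, alpha=1.0, max_iter=1000, tol=1e-8):
    """Cyclic coordinate descent for the squared-loss elastic net (sketch).

    Minimizes (1/(2n)) * sum_i (y_i - x_i . beta)^2
              + lam * (alpha * |beta|_1 + (1 - alpha) / 2 * |beta|_2^2).
    X is a list of n rows, each a list of p floats; no intercept is fit.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, with beta starting at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(max_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column cannot change the fit
            # Average product of column j with the partial residual
            # (the residual with coordinate j's contribution added back).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_beta = soft_threshold(rho, lam * alpha) / (col_sq[j] + lam * (1.0 - alpha))
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_beta
                max_change = max(max_change, abs(delta))
        # Stop when a full pass over the coordinates barely changes the weights.
        if max_change < tol:
            break
    return beta

# Tiny example: the second feature is only weakly related to y,
# so the L1 penalty sets its weight to exactly zero.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.1, 3.0, 0.1]
print(elastic_net_cd(X, y, lam=0.5, alpha=1.0))
```

For the interval regression loss the per-coordinate update no longer has this closed form, so a solver would need something like a quadratic approximation of the loss inside each pass; working that out is part of the project.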
Mentors:
- Toby Dylan Hocking proposed this project, would be a user of this package, and could mentor.
- Noah Simon implemented the elastic net for the Cox model, and said he could help out informally, but he can NOT commit to formal co-mentoring.
- Need a co-mentor with experience implementing convex optimization algorithms! Students, if you are interested in this project, then you need to find another mentor! Maybe email the authors of the articles referenced above?
Tests:
- Easy: write a knitr document in which you perform cross-validation to compare predictions from WEnetCC.aft/icfit/survreg models. Divide the AdapEnetClass::MCLcleaned data into train/test, then fit models to the train set, and compute error of each model with respect to the test set. Which model is most accurate?
- Medium: show that you know how to include FORTRAN/C code in an R package.
- Hard: write down the mathematical optimization problem for elastic net regularized interval regression using the loss function which corresponds to a log-logistic AFT model. Derive the sub-differential, coordinate descent updates, and a stopping criterion.
Students, please post a link to your test results here.