I know MLJ will inevitably have some overhead since it wraps other code, but I believe this overhead can be reduced below the current level with careful design. Avoidable overhead that is neglected now may come back to haunt us when the code is scaled up. Here are a few things I discovered that seem important.
- As pointed out in #151 (reduce overhead of fit! over update), `selectrows` causes overhead when used in the `evaluate` method, which calls it many times depending on the `repeats` parameter (see the first sketch below).
- In the `MLJBase.fit` method, the `matrix` method (which, to my understanding, copies data) is called on the given table `X`. Of course this isn't bad if I call the method only once with the same data while changing a couple of model parameters, but it becomes important in the `evaluate` method, which calls `fit` several times depending on `repeats`; if `X` is large this isn't nice, and I don't think the `update` method does enough to help here. Other methods also call these methods repeatedly, copying `X` each time (see the second sketch below).
- The return type of `MLJBase.predict` for probabilistic classifiers (I can't find the link to the issue).
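To make the first point concrete, here is a rough illustration of the repeated row selection, not MLJ's actual `evaluate!` internals; the fold ranges and the `repeats` value are made up:

```julia
using MLJBase

X = (x1 = rand(10^6), x2 = rand(10^6))    # a large column table
folds = [1:500_000, 500_001:1_000_000]    # stand-in for CV fold indices
repeats = 6

@time for _ in 1:repeats, rows in folds
    # a new sub-table is constructed on every call (copying or viewing
    # the columns, depending on the table type)
    Xtrain = selectrows(X, rows)
end
```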
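And for the second point, a minimal sketch of the pattern, assuming a hypothetical `ToyRegressor` model (the type, its `fit`/`predict` methods, and the data below are mine, not MLJ source), just to show that `evaluate` triggers the table-to-matrix conversion once per fold per repeat:

```julia
using MLJBase

# hypothetical model type, for illustration only
mutable struct ToyRegressor <: MLJBase.Deterministic end

function MLJBase.fit(::ToyRegressor, verbosity::Int, X, y)
    Xmat = MLJBase.matrix(X)   # converts the table to a dense Matrix (a copy) on every call
    coefs = Xmat \ y           # e.g. a least-squares fit on the copied matrix
    return coefs, nothing, NamedTuple()
end

MLJBase.predict(::ToyRegressor, coefs, Xnew) = MLJBase.matrix(Xnew) * coefs

# with CV(nfolds=6) and repeats=5, `fit` runs 30 times, so the (mostly
# overlapping) training data is converted and copied 30 times:
X, y = make_regression(10_000, 10)
evaluate(ToyRegressor(), X, y;
         resampling=CV(nfolds=6), repeats=5, measure=rms, verbosity=0)
```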
These are just some of the points; I believe there are other things that affect scalability, and it would be better to start treating these issues more seriously. Imagine someone entering a Kaggle competition on a large dataset only to find that the overhead is unbearable. Please correct me if I'm missing something.
@ablaom , @tlienart