I know MLJ will inevitably have some overhead since it wraps other code, but I believe this overhead can be reduced below the current level with careful design. Avoidable overhead that is neglected now may come back to haunt us when the code is scaled up. Here are a few things I discovered that seem important.
- As pointed out in #151 (reduce overhead of fit! over update), `selectrows` causes overhead when used in the `evaluate` method, which calls it many times depending on the `repeats` parameter (see the first sketch below).
- In the `MLJBase.fit` method, the `matrix` method (which, to my understanding, copies data) is called on the given table `X`. Of course this isn't bad if I call the method only once with the same data while changing a couple of model parameters, but it becomes important in the `evaluate` method, which calls `fit` several times depending on `repeats`; if `X` is large this isn't nice, and I don't think the `update` method does enough to help here. Other methods also call these methods repeatedly, copying `X` each time (see the second sketch below).
- The return type of `MLJBase.predict` for probabilistic classifiers (I can't find the link to the issue).
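To make the first point concrete, here is a rough illustration of the repeated row selection, not MLJ's actual `evaluate!` internals; the fold ranges and the `repeats` value are made up:

```julia
using MLJBase

X = (x1 = rand(10^6), x2 = rand(10^6))    # a large column table
folds = [1:500_000, 500_001:1_000_000]    # stand-in for CV fold indices
repeats = 6

@time for _ in 1:repeats, rows in folds
    # a new sub-table is constructed on every call (copying or viewing
    # the columns, depending on the table type)
    Xtrain = selectrows(X, rows)
end
```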
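And for the second point, a minimal sketch of the pattern, assuming a hypothetical `ToyRegressor` model (the type, its `fit`/`predict` methods, and the data below are mine, not MLJ source), just to show that `evaluate` triggers the table-to-matrix conversion once per fold per repeat:

```julia
using MLJBase

# hypothetical model type, for illustration only
mutable struct ToyRegressor <: MLJBase.Deterministic end

function MLJBase.fit(::ToyRegressor, verbosity::Int, X, y)
    Xmat = MLJBase.matrix(X)   # converts the table to a dense Matrix (a copy) on every call
    coefs = Xmat \ y           # e.g. a least-squares fit on the copied matrix
    return coefs, nothing, NamedTuple()
end

MLJBase.predict(::ToyRegressor, coefs, Xnew) = MLJBase.matrix(Xnew) * coefs

# with CV(nfolds=6) and repeats=5, `fit` runs 30 times, so the (mostly
# overlapping) training data is converted and copied 30 times:
X, y = make_regression(10_000, 10)
evaluate(ToyRegressor(), X, y;
         resampling=CV(nfolds=6), repeats=5, measure=rms, verbosity=0)
```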
These are just some of the points; I believe there are other things that affect scalability, and it would be better to start treating these issues more seriously. Imagine someone entering a Kaggle competition on a large dataset only to find that the overhead is unbearable. Please correct me if I'm missing something.
@ablaom , @tlienart