Taking performance issues more seriously #309

@OkonSamuel

I know MLJ will definitely have some overhead since it wraps other code, but I believe this overhead can be reduced below its current level with careful design. Avoidable overhead that is neglected may come back to haunt us when the code is scaled up. Here are a few things I found that matter:

  1. selectrows, as pointed out in the issue "reduce overhead of fit! over update" #151. It causes overhead when used in the evaluate method, which calls it many times depending on the repeats parameter.
  2. In the MLJBase.fit method, the matrix method (which, to my understanding, copies data) is called on the given table X. This isn't bad if I call fit only once on the same data while changing a couple of model parameters, but it becomes important in the evaluate method, which calls fit several times depending on repeats. If X is large, this is costly. (I don't think the update method does enough to address this.) Other methods also call these methods repeatedly, copying X each time.
  3. The return type of MLJBase.predict for probabilistic classifiers. (I can't find the link to the issue.)
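
To make points 1 and 2 concrete, here is a minimal sketch in plain Julia (Base only). It is not MLJ code: `to_matrix`, `col_means`, and `folds` are hypothetical stand-ins for the table-to-matrix conversion, a model fit, and the resampling folds. It only contrasts converting and copying inside every fit call with converting once and passing views of the selected rows.

```julia
# Hypothetical stand-ins for illustration; none of these names are MLJ functions.

# A small column table: a NamedTuple of equal-length vectors.
table = (x1 = collect(1.0:6.0), x2 = collect(7.0:12.0))

# Count how often the table is converted, to make the repeated work visible.
conversions = Ref(0)
function to_matrix(t)
    conversions[] += 1
    return hcat(values(t)...)   # allocates a new 6x2 Matrix on every call
end

# Toy "fit": column means of whatever rows it is given.
col_means(A) = vec(sum(A; dims = 1)) ./ size(A, 1)

# Two resampling folds, given as row indices.
folds = [[1, 2, 3], [4, 5, 6]]

# Pattern 1: convert the table and copy the selected rows inside every fit.
conversions[] = 0
naive = [col_means(to_matrix(table)[rows, :]) for rows in folds]
naive_conversions = conversions[]

# Pattern 2: convert once, then pass views of the selected rows.
conversions[] = 0
A = to_matrix(table)
cached = [col_means(view(A, rows, :)) for rows in folds]
cached_conversions = conversions[]

# A view shares memory with its parent, whereas A[rows, :] makes a copy.
v = view(A, folds[1], :)
```

Both patterns give the same results, but the first converts the table once per fold while the second converts it once in total. Whether views are acceptable in practice depends on what the wrapped model accepts as input, which this sketch does not address.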

These are just some of the points. I believe there are other things that affect scalability. It is better that we start treating these issues more seriously. Imagine someone entering a Kaggle competition with a large dataset, only to find that the overhead is unbearable. Correct me if I'm missing something.
@ablaom, @tlienart
