Skip to content

Control randomness of Kfold,StratifiedKfold,RandomSub #57

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Apr 10, 2023
Merged

Control randomness of Kfold,StratifiedKfold,RandomSub #57

merged 5 commits into from
Apr 10, 2023

Conversation

ngiann
Copy link
Contributor

@ngiann ngiann commented Jan 21, 2023

I modified the code so that the randomness of Kfold, StratifiedKfold and RandomSub can be controlled.

This is done by passing an optional object of type AbstractRNG via the new argument rng.

The default value of the new argument rng is MersenneTwister().

Example: Kfold(10,2, MersenneTwister(1))

I did not modify StratifiedRandomSub as I don't understand where random indices are samples.

I also did not modify LOOCV as there seems to be naturally no need for it.

Fixes #56

@codecov-commenter
Copy link

codecov-commenter commented Jan 21, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: -0.03 ⚠️

Comparison is base (25f578b) 78.77% compared to head (ff41fb8) 78.75%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
- Coverage   78.77%   78.75%   -0.03%     
==========================================
  Files           6        7       +1     
  Lines         655      659       +4     
==========================================
+ Hits          516      519       +3     
- Misses        139      140       +1     
Impacted Files Coverage Δ
src/crossval.jl 98.93% <100.00%> (+0.03%) ⬆️

... and 2 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@AsafManela
Copy link
Contributor

Thanks. Looks great. Maybe add some tests to test/crossval.jl that provide an rng?

@ngiann
Copy link
Contributor Author

ngiann commented Jan 22, 2023

Good idea, I will also add a test then.

Copy link
Contributor

@AsafManela AsafManela left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. Thanks taking this on!

@ngiann
Copy link
Contributor Author

ngiann commented Jan 23, 2023

Looks great. Thanks taking this on!

Thank you for your input. I will update the docs soon.

@ngiann ngiann closed this Jan 23, 2023
@ngiann ngiann deleted the ControlRandomnessCrossvalidation branch January 23, 2023 08:04
@AsafManela
Copy link
Contributor

@ararslan Can you please merge and release this PR?

@ngiann ngiann restored the ControlRandomnessCrossvalidation branch January 24, 2023 12:54
@ngiann
Copy link
Contributor Author

ngiann commented Jan 24, 2023

closed this by accident

@ngiann ngiann reopened this Jan 24, 2023
@ngiann ngiann mentioned this pull request Jan 24, 2023
src/crossval.jl Outdated
@@ -11,9 +11,9 @@ struct Kfold <: CrossValGenerator
k::Int
coeff::Float64

function Kfold(n::Int, k::Int)
function Kfold(n::Int, k::Int, rng::AbstractRNG = MersenneTwister())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The general convention seems to be to put the RNG argument first for methods that accept one. I guess it's a bit less clear whether it applies in this case given that it's for a constructor for a type that doesn't hold onto the RNG but I think it'd still probably be good to follow the general pattern. It is admittedly more annoying though because there isn't convenient syntax for it. But to that end, you could do

function Kfold(rng::AbstractRNG, n::Int, k::Int)
    # existing contents
end
Kfold(n::Int, k::Int) = Kfold(MersenneTwister(), n, k)

and likewise for StratifiedKfold.

RandomSub is interesting though because it does hold onto the RNG. I guess it should stay in the corresponding position then, which means that it's different from (Stratified)Kfold. Maybe that's okay though? Curious to hear your thoughts on that.

I also wonder whether there could be any benefits to having (Stratified)Kfold also hold onto the RNG.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a very important detail that never occurred to me. If this is the general convention, I think it would be good to adhere to it.

One question: what do you mean with "holding onto" the RNG? I suppose you mean storing it in a field, right?

My (non-expert) opinion would be then to indeed put the RNG as the first argument in Kfold, StratifiedKfold and in RandomSub.

I am very happy to change the code accordingly. Should I push additional changes to this pull request?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean with "holding onto" the RNG? I suppose you mean storing it in a field, right?

Ah sorry about that, my choice of words was not particularly clear. 😅 Your interpretation is correct, I meant storing the RNG object as one of the fields.

My (non-expert) opinion would be then to indeed put the RNG as the first argument in Kfold, StratifiedKfold and in RandomSub.

Just to clarify, you would have them all take the RNG as the first argument, even though for RandomSub that doesn't correspond to position in the fields?

Another thing that occurred to me is that typically the default value for RNG arguments is Random.GLOBAL_RNG but currently the PR uses MersenneTwister(). I think it would probably be good to use the global RNG as the default, which will also match the behavior of these constructors as they exist on master currently. Or was the choice of MersenneTwister deliberate?

Copy link
Contributor Author

@ngiann ngiann Jan 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, many thanks for your time and valuable input. This is my first ever contribution to a Julia package. I thought I should start with tackling something relatively easy and did not anticipate all these important details to crop up. I hope in the future I can make more careful and efficient contributions ;)

Just to clarify, you would have them all take the RNG as the first argument, even though for RandomSub that doesn't correspond to position in the fields?

Indeed, I would have all of them taking RNG as the first argument including RandomSub. For RandomSub, I would move the corresponding newly introduced field rng::AbstractRNG to the first position. I hope this is clearer now.

Another thing that occurred to me is that typically the default value for RNG arguments is Random.GLOBAL_RNG but currently the PR uses MersenneTwister(). I think it would probably be good to use the global RNG as the default, which will also match the behavior of these constructors as they exist on master currently. Or was the choice of MersenneTwister deliberate?

There is absolutely no deliberation behind the choice of MersenneTwister. Again, I think we should use Random.GLOBAL_RNG as the default option as you suggest.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, many thanks for your time and valuable input.

I'm more than happy to help, I just hope my comments have been more useful than annoying. 😅 This is looking great so far and I appreciate the contribution!

I hope this is clearer now.

👍 Yes indeed, thanks!

That all sounds good to me. Would you be willing to make those changes in this PR then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am very happy to do so. I will get on with it early next week.

@ngiann
Copy link
Contributor Author

ngiann commented Mar 28, 2023

Sorry that this took far too long. I incorporated the above discussed changes and updated the tests accordingly. In hindsight, I see now that I should have committed the updated code and the corresponding tests in a single commit. I did it separately, and I think that this is why the commit above failed: the updated code was automatically tested using the old test code and I provided the new test code in a separate, subsequent commit. I hope what I wrote make sense. Sorry for the confusion!

Also: I updated the documentation to reflect the changes in these functions. Please see #58.

src/crossval.jl Outdated
@@ -98,13 +102,21 @@ struct RandomSub <: CrossValGenerator
n::Int # total length
sn::Int # length of each subset
k::Int # number of subsets
rng::AbstractRNG # Random number generator
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For performance reasons, it's generally not recommended to use abstract types on fields. Whether it would really make a difference here in practice I'm not sure. An alternative is to make the type itself parametric on the type of this field, i.e.

struct RandomSub{T<:AbstractRNG}
    n::Int
    sn::Int
    k::Int
    rng::T

    RandomSub(rng::T, n, sn, k) where {T<:AbstractRNG} = new{T}(n, sn, k, rng)
end

RandomSub(n, sn, k) = RandomSub(Random.GLOBAL_RNG, n, sn, k)

That too has its own tradeoffs though, e.g. RandomSub on its own is then a UnionAll rather than a DataType. Do you have any thoughts on what the most ergonomic approach would be for users?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My opinion is that this is not a performance critical code, as you also point out, and hence we could use the code in the proposed commit. However: if this is not good practice, then we should avoid it. On the other hand, I see that e.g. in the package Distributions.jl the function rand does accept as its first argument an AbstractRNG. This make me more inclined to use indeed rng::AbstractRNG as currently specified in the proposed commit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think having the field abstractly typed is fine for this case.

Btw, something you could do to make things a little easier would be to put rng as the first field, which then only requires a single additional method definition:

struct RandomSub <: CrossValGenerator
    rng::AbstractRNG
    n::Int
    sn::Int
    k::Int
end

RandomSub(n, sn, k) = RandomSub(Random.GLOBAL_RNG, n, sn, k)

The approach you've already implemented works as well though!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good suggestion. I will make the corresponding changes very soon.

Copy link
Member

@ararslan ararslan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work here, thanks for your patience and persistence!

@ararslan ararslan merged commit ccc8d57 into JuliaStats:master Apr 10, 2023
@ngiann ngiann deleted the ControlRandomnessCrossvalidation branch April 10, 2023 18:44
@ngiann
Copy link
Contributor Author

ngiann commented Apr 10, 2023

Great work here, thanks for your patience and persistence!

Thanks for your guidance!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Control randomness of Kfold
4 participants