-
Notifications
You must be signed in to change notification settings - Fork 62
Control randomness of Kfold,StratifiedKfold,RandomSub #57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Control randomness of Kfold,StratifiedKfold,RandomSub #57
Conversation
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## master #57 +/- ##
==========================================
- Coverage 78.77% 78.75% -0.03%
==========================================
Files 6 7 +1
Lines 655 659 +4
==========================================
+ Hits 516 519 +3
- Misses 139 140 +1
... and 2 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report in Codecov by Sentry. |
Thanks. Looks great. Maybe add some tests to |
Good idea, I will also add a test then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Thanks taking this on!
Thank you for your input. I will update the docs soon. |
@ararslan Can you please merge and release this PR? |
closed this by accident |
src/crossval.jl
Outdated
@@ -11,9 +11,9 @@ struct Kfold <: CrossValGenerator | |||
k::Int | |||
coeff::Float64 | |||
|
|||
function Kfold(n::Int, k::Int) | |||
function Kfold(n::Int, k::Int, rng::AbstractRNG = MersenneTwister()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The general convention seems to be to put the RNG argument first for methods that accept one. I guess it's a bit less clear whether it applies in this case given that it's for a constructor for a type that doesn't hold onto the RNG but I think it'd still probably be good to follow the general pattern. It is admittedly more annoying though because there isn't convenient syntax for it. But to that end, you could do
function Kfold(rng::AbstractRNG, n::Int, k::Int)
# existing contents
end
Kfold(n::Int, k::Int) = Kfold(MersenneTwister(), n, k)
and likewise for StratifiedKfold
.
RandomSub
is interesting though because it does hold onto the RNG. I guess it should stay in the corresponding position then, which means that it's different from (Stratified)Kfold
. Maybe that's okay though? Curious to hear your thoughts on that.
I also wonder whether there could be any benefits to having (Stratified)Kfold
also hold onto the RNG.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a very important detail that never occurred to me. If this is the general convention, I think it would be good to adhere to it.
One question: what do you mean with "holding onto" the RNG? I suppose you mean storing it in a field, right?
My (non-expert) opinion would be then to indeed put the RNG as the first argument in Kfold
, StratifiedKfold
and in RandomSub
.
I am very happy to change the code accordingly. Should I push additional changes to this pull request?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what do you mean with "holding onto" the RNG? I suppose you mean storing it in a field, right?
Ah sorry about that, my choice of words was not particularly clear. 😅 Your interpretation is correct, I meant storing the RNG object as one of the fields.
My (non-expert) opinion would be then to indeed put the RNG as the first argument in
Kfold
,StratifiedKfold
and inRandomSub
.
Just to clarify, you would have them all take the RNG as the first argument, even though for RandomSub
that doesn't correspond to position in the fields?
Another thing that occurred to me is that typically the default value for RNG arguments is Random.GLOBAL_RNG
but currently the PR uses MersenneTwister()
. I think it would probably be good to use the global RNG as the default, which will also match the behavior of these constructors as they exist on master
currently. Or was the choice of MersenneTwister
deliberate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, many thanks for your time and valuable input. This is my first ever contribution to a Julia package. I thought I should start with tackling something relatively easy and did not anticipate all these important details to crop up. I hope in the future I can make more careful and efficient contributions ;)
Just to clarify, you would have them all take the RNG as the first argument, even though for
RandomSub
that doesn't correspond to position in the fields?
Indeed, I would have all of them taking RNG as the first argument including RandomSub
. For RandomSub
, I would move the corresponding newly introduced field rng::AbstractRNG
to the first position. I hope this is clearer now.
Another thing that occurred to me is that typically the default value for RNG arguments is
Random.GLOBAL_RNG
but currently the PR usesMersenneTwister()
. I think it would probably be good to use the global RNG as the default, which will also match the behavior of these constructors as they exist onmaster
currently. Or was the choice ofMersenneTwister
deliberate?
There is absolutely no deliberation behind the choice of MersenneTwister
. Again, I think we should use Random.GLOBAL_RNG
as the default option as you suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, many thanks for your time and valuable input.
I'm more than happy to help, I just hope my comments have been more useful than annoying. 😅 This is looking great so far and I appreciate the contribution!
I hope this is clearer now.
👍 Yes indeed, thanks!
That all sounds good to me. Would you be willing to make those changes in this PR then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I am very happy to do so. I will get on with it early next week.
and RandomSub work when specifying as the first argument a RNG.
Sorry that this took far too long. I incorporated the above discussed changes and updated the tests accordingly. In hindsight, I see now that I should have committed the updated code and the corresponding tests in a single commit. I did it separately, and I think that this is why the commit above failed: the updated code was automatically tested using the old test code and I provided the new test code in a separate, subsequent commit. I hope what I wrote make sense. Sorry for the confusion! Also: I updated the documentation to reflect the changes in these functions. Please see #58. |
src/crossval.jl
Outdated
@@ -98,13 +102,21 @@ struct RandomSub <: CrossValGenerator | |||
n::Int # total length | |||
sn::Int # length of each subset | |||
k::Int # number of subsets | |||
rng::AbstractRNG # Random number generator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For performance reasons, it's generally not recommended to use abstract types on fields. Whether it would really make a difference here in practice I'm not sure. An alternative is to make the type itself parametric on the type of this field, i.e.
struct RandomSub{T<:AbstractRNG}
n::Int
sn::Int
k::Int
rng::T
RandomSub(rng::T, n, sn, k) where {T<:AbstractRNG} = new{T}(n, sn, k, rng)
end
RandomSub(n, sn, k) = RandomSub(Random.GLOBAL_RNG, n, sn, k)
That too has its own tradeoffs though, e.g. RandomSub
on its own is then a UnionAll
rather than a DataType
. Do you have any thoughts on what the most ergonomic approach would be for users?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My opinion is that this is not a performance critical code, as you also point out, and hence we could use the code in the proposed commit. However: if this is not good practice, then we should avoid it. On the other hand, I see that e.g. in the package Distributions.jl the function rand
does accept as its first argument an AbstractRNG
. This make me more inclined to use indeed rng::AbstractRNG
as currently specified in the proposed commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think having the field abstractly typed is fine for this case.
Btw, something you could do to make things a little easier would be to put rng
as the first field, which then only requires a single additional method definition:
struct RandomSub <: CrossValGenerator
rng::AbstractRNG
n::Int
sn::Int
k::Int
end
RandomSub(n, sn, k) = RandomSub(Random.GLOBAL_RNG, n, sn, k)
The approach you've already implemented works as well though!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good suggestion. I will make the corresponding changes very soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work here, thanks for your patience and persistence!
Thanks for your guidance! |
I modified the code so that the randomness of
Kfold
,StratifiedKfold
andRandomSub
can be controlled.This is done by passing an optional object of type
AbstractRNG
via the new argumentrng
.The default value of the new argument
rng
isMersenneTwister()
.Example:
Kfold(10,2, MersenneTwister(1))
I did not modify
StratifiedRandomSub
as I don't understand where random indices are samples.I also did not modify
LOOCV
as there seems to be naturally no need for it.Fixes #56