Consensus Motifs: Ostinato algorithm; most central motif; reproduce Fig 1, 2, 9 from paper Matrix Profile XV #279
Conversation
This uses the mtDNA example. Ostinato was transcribed from Table 2 in the paper. Performance is really slow. Related to stumpy-dev#278
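For readers following along, here is a rough, pure-NumPy sketch of the brute-force idea behind Ostinato (hypothetical helper names, and without any of the paper's speed-ups): for every subsequence of every time series, the "radius" is the largest nearest-neighbor distance across all the other series, and the consensus motif is the subsequence that minimizes that radius.

```python
import numpy as np

def z_norm_dist_profile(Q, T):
    # Naive z-normalized distance profile of query Q against every
    # length-m subsequence of T (assumes no window of T is constant)
    m = len(Q)
    Qz = (Q - Q.mean()) / Q.std()
    out = np.empty(len(T) - m + 1)
    for i in range(len(out)):
        S = T[i : i + m]
        out[i] = np.linalg.norm(Qz - (S - S.mean()) / S.std())
    return out

def ostinato_naive(tss, m):
    # Brute force: for each candidate subsequence, take the max of its
    # nearest-neighbor distances to all *other* series, then keep the
    # candidate with the smallest such radius (the consensus motif)
    bsf_rad, ts_ind, ss_ind = np.inf, 0, 0
    k = len(tss)
    for j in range(k):
        for q in range(len(tss[j]) - m + 1):
            Q = tss[j][q : q + m]
            rad = max(
                z_norm_dist_profile(Q, tss[i]).min()
                for i in range(k)
                if i != j
            )
            if rad < bsf_rad:
                bsf_rad, ts_ind, ss_ind = rad, j, q
    return bsf_rad, ts_ind, ss_ind
```

The nested loops make the cost quadratic in the series lengths, which is why the repeated MASS calls dominate the runtime.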
Hi @seanlaw, here's a first draft that reproduces figure 9 of the paper. It's of course still very rough around the edges, not deserving the name 'tutorial' yet. I simply transcribed the pseudocode in table 2. Somehow it's really slow. I think this is because of the repeated calls to the MASS algorithm. |
OK, in the Matlab implementation they actually use something very close to the MASS2 algorithm which is an optimised version of MASS using frequency domain multiplication (so ffts). Instead of repeatedly calculating the fft for the time series T^i they precompute it once. |
@DanBenHa Firstly, thank you for taking the time to put this together! You've done a wonderful job! I have a tendency to be rather direct in my feedback, so just know that my comments are meant to be constructive and open for discussion/push back. I really appreciate everything that you've done! Regarding the slowness, I don't see anything particularly obvious. In fact, when I look at the length of each time series:
It looks like it is actually faster to concatenate all four time series (length of
Okay, I just noticed that you are calling
I find it really strange that, in their paper,
So, I remember reading in one of the papers that when a subsequence of length
Or, in our case, I think we can/should initialize
Below are some additional comments for you to consider:
What do you think? |
No offense taken. I'm here to learn.
How did you do it? Compute the matrix profile for the concatenated sequence and four times
For me
Yes, that's what they did in the Matlab implementation, too. Doesn't seem to speed up the calculation, though.
Sure. It was just for me because the naive implementation took so long to compute ;)
I wasn't aware of that. Thanks.
Done :)
So, I'm not sure if this is actually worth pursuing any further if your brute-force approach is so fast. I'll test it with more/longer time series tomorrow. Maybe it scales better?
Sorry, I should've been a bit more explicit. My "15 seconds" was simply calling
Honestly, from what I can tell, you aren't doing anything that isn't already implemented in
Unfortunately, this means that all of the means/stddevs and array copies are performed over and over again for each call to
I think the secret may be to use the lower level
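For illustration only (this is not stumpy's actual internals), lower-level helpers typically avoid this repeated work by precomputing the rolling means/stddevs of every window once per time series, e.g. with a cumulative-sum trick:

```python
import numpy as np

def rolling_mean_std(T, m):
    # O(n) rolling mean/std for every length-m window of T via cumulative
    # sums, computed once per time series instead of once per MASS call
    T = np.asarray(T, dtype=np.float64)
    cs = np.concatenate(([0.0], np.cumsum(T)))
    cs2 = np.concatenate(([0.0], np.cumsum(T * T)))
    mu = (cs[m:] - cs[:-m]) / m
    var = (cs2[m:] - cs2[:-m]) / m - mu * mu
    # clamp tiny negative values caused by floating-point cancellation
    return mu, np.sqrt(np.maximum(var, 0.0))
```

Note that the cumulative-sum formulation can lose precision on very long series; that is why the variance is clamped at zero before the square root.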
I think that as the time series lengths get longer (100K+), the brute-force approach will perform significantly worse and, again, I apologize for making the false claim about the speed of the approximate "brute-force" approach above. That was a distraction. |
I love your attitude! I am learning a lot from this conversation |
Codecov Report
@@ Coverage Diff @@
## master #279 +/- ##
===========================================
- Coverage 100.00% 99.82% -0.18%
===========================================
Files 18 19 +1
Lines 1644 1688 +44
===========================================
+ Hits 1644 1685 +41
- Misses 0 3 +3
@seanlaw Thanks, I appreciate your feedback very much.
The only difference is that I'm precomputing
A quick test of convolving two vectors of length 2**18 at least shows that
Can you tell me if you are using the latest version of NumPy? I recently (10 days ago) replaced
Now I regret doing it as we may need to revert back to
Exactly! First, you'd take each time series and do something like:
The reason we use
I am not so well versed with FFT and haven't looked at that part of your new code closely but, since the sliding dot product is always between two different time series (i.e., Q from one time series and T from another time series), is it actually possible to precompute FFTs for each of the time series independently and reuse them once you know which Q it should be paired with? If so, assuming we replace it with
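As a sketch of the idea (hypothetical helper names): since only Q changes between calls, the FFT of each time series can indeed be computed once up front and reused for every MASS-style sliding dot product:

```python
import numpy as np

def sliding_dot_product(Q, T_fft, n, n_fft):
    # MASS-style sliding dot product: multiply the (precomputed) FFT of T
    # by the FFT of the reversed query, then inverse-transform
    m = len(Q)
    Q_fft = np.fft.rfft(Q[::-1], n_fft)
    conv = np.fft.irfft(Q_fft * T_fft, n_fft)
    return conv[m - 1 : n]  # dot product of Q with every length-m window

rng = np.random.default_rng(0)
T = rng.standard_normal(1000)
n, m = len(T), 50
n_fft = 1 << (n + m - 1).bit_length()  # pad so circular conv == linear conv
T_fft = np.fft.rfft(T, n_fft)          # computed once, reused for every Q

Q = rng.standard_normal(m)
qt = sliding_dot_product(Q, T_fft, n, n_fft)
```

With T_fft cached per time series, each additional query only costs one forward and one inverse FFT of the query.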
My previous |
I can reproduce your findings:
The discrepancy is because I was using vectors of the same order of magnitude:
I was about to say that it's alright to use NumPy, since the query will typically be shorter than
Wow, that's wild! Thanks for digging into this. If I understand your numbers correctly, it seems as though
It's interesting that the length described is very much aligned with your findings, where the break-even happens around
Perhaps we just update
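One option (just a sketch, with arbitrary sizes): SciPy already does this kind of dispatch itself; scipy.signal.convolve with method='auto' uses choose_conv_method to pick direct vs. FFT convolution based on the input sizes, and both routes agree numerically:

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
Q = rng.standard_normal(100)      # short query
T = rng.standard_normal(10_000)   # long time series

# SciPy's size-based heuristic for choosing direct vs. FFT convolution
method = signal.choose_conv_method(Q, T, mode="full")

# Both methods compute the same linear convolution
out_fft = signal.fftconvolve(Q, T, mode="full")
out_direct = np.convolve(Q, T, mode="full")
```

Leaning on SciPy's heuristic avoids hard-coding a break-even length that may vary across machines and BLAS/FFT builds.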
That first conditional should be
I don't know if SciPy's direct method is the same as NumPy's |
Oops, you're right! Thanks for catching that
Honestly, I've gone back and forth between using
How does that sound? If we don't want to handle the
Okay, I just ran a quick test with
and the results were:
For short
And then, when
we get
Based on this, I think I'm pretty convinced that
Given that the outputs are the same (based on the assert statements above) I don't think this matters. Unless I'm missing anything obvious? |
So, after switching
This code executed in around 6-7 seconds for the DNA inputs using
I think 6-7 seconds is "good enough". @DanBenHa Do you mind testing this out?
Yes, switching to |
Awesome! I'm super interested in hearing about how this goes with your larger data set and how much time it ends up taking. I'm guessing that since the
If you have the time, would you mind:
Additionally:
Would it be possible to add an additional argument to
Also, I think the DNA example is good as a more complex example (for later in the notebook) but I think it would be really helpful to start by reproducing Figure 1 and Figure 2 from the paper. I feel like these two basic figures really keep the point simple in that you are trying to find the subsequence that exists in ALL of the 9 time series, and then you show how to do it by simply calling the
I understand that you are probably super busy so please let me know and I can probably handle it afterward. Absolutely no pressure! You've already contributed a lot of your time and I really appreciate it.
I was busy with other stuff, so I only had time to start the job around 5.5 hours ago. If I reuse FFTs, as in my second version, it takes around 80 minutes. The other version (convolution changed) is still computing and has taken more than three times as long. It seems like the repeated FFT computation takes a big toll when the number of time series is increased that much.
Do you mean just for this notebook or should I make a PR for the change in
After seeing the performance drop for bigger datasets we should consider recycling the time-series' FFTs. What do you think? Doesn't need to be in that object-oriented style that I went with in version 2.
In the paper's companion code it's a separate function. There's probably a way to combine them into one function but it doesn't look so simple. I think it's faster if we first implement the special case and later add the generalisation. BTW, in their readme they state
Agreed. After all, it's supposed to be a tutorial ;)
I appreciate your empathy! Yes, I'm busy actually applying stumpy to my data :) But there are going to be downtimes during which I can work on the notebook. It's just gonna take a bit more time. |
It just completed, taking 212 minutes. So 2.65 times more than with the FFT recycling.
I don't mind if you changed it in this PR.
It sounds like recycling FFTs may be worth it. Is it possible (or enough) to modify
Having said this, I think we should be pragmatic and get this implementation merged as it is definitely "good enough" and we don't have to strive for the fastest implementation right out of the gate. I'd rather have smaller focused PRs than one giant PR with many little related things. Let's celebrate this win! And then, we can start a new separate issue that is focused on recycling FFTs and work on it when we are able to find more time. How does that sound to you?
Thank you for following up on this. It's not horrible. Let's create a fresh issue that focuses on improving the speed/performance. |
Yes, thanks for bringing in the pragmatic perspective. Being in academia I tend to over-analyse/-optimise things 😆
And the optimisation in a fresh issue/PR. |
I've taken the liberty to add a little more detail
Please let me know if you'd like me to help with any/all of these tasks |
A couple of minor suggestions but looks good otherwise.
I added some additional thoughts on one of the points
tests/test_ostinato.py
Outdated
ts_ind = j
ss_ind = min_rad_index

return _get_central_motif(tss, rad, ts_ind, ss_ind, m)
Instead of using _get_central_motif, is it possible to create a naive_get_central_motif instead? And then, we should test _get_central_motif against naive_get_central_motif. I think it is important to have this test since it is so core to ostinato
I don't fully understand this. Can you elaborate on this idea? How would a naive_get_central_motif work? I feel like my implementation of _get_central_motif is already quite naive :P
Sorry for the initial vagueness and I agree that ostinato._get_central_motif is already quite simple but here is my concern. What happens when somebody incorrectly modifies or changes ostinato._get_central_motif? Since this test calls ostinato._get_central_motif then we have no way of verifying the "correctness" of ostinato._get_central_motif. So, if somebody decides to alter ostinato._get_central_motif to not compute anything and to always return the set of values 1, 2, 3 (for rad, ts_ind, ss_ind) then this test will always pass since the test is calling ostinato._get_central_motif. In other words, to be properly covered, we should have a separate test to verify the outputs of ostinato._get_central_motif. Does that make sense?
Sometimes, when the functions are already simple, I create a "naive" version by replacing all of the major NumPy calls with a more verbose/explicit for-loop (i.e., replace np.argmin, np.isclose, np.max, etc.). I know that I am being rather stubborn/persistent here and so I can handle this afterward in #285
Yes that makes sense. I'll give it a shot.
tests/test_ostinato.py
Outdated
radii = np.zeros(len(tss[j]) - m + 1)
for i in range(k):
    if i != j:
        mp = stumpy.stump(tss[j], m, tss[i], ignore_trivial=False)
You can import tests/naive.py into this test with import naive and then replace stumpy.stump with naive.stump
This comment refers to the idea of naive_get_central_motif, right?
No. Instead of using:
mp = stumpy.stump(tss[j], m, tss[i], ignore_trivial=False)
Please replace this with:
import naive
mp = naive.stump(tss[j], m, tss[i], ignore_trivial=False)
This naive.stump implementation is slower but helps us ensure that all of our fast implementations produce the same results as all of our slow (naive) implementations.
I've added some more comments around the tests
Apologies for the delay - I've been quite busy the last few days. I've implemented your suggestions to the notebook. The data was uploaded to Zenodo but, as you saw, not in the proper way. I'll leave it like that for now and see if I have time to do it properly later. |
No worries! What a coincidence, I found a little bit of time to upload the data to Zenodo in the "correct" format and I've provided some notebook updates that reflect those changes so, hopefully, it should be a quick fix.
No rush! Thank you for following up on this |
Great, cheers for that! Looks much cleaner now 😃 |
@DanBenHa Do you think this is ready to be merged or are there any remaining tasks to be done? |
Almost. I'll resolve your remaining comments around the tests first thing tomorrow morning :) |
OK, barring the |
Thanks for this idea! Seems like snippets are conceptually similar to a PCA (thinking very naively) in that you're trying to find components/snippets that best explain the variance in the data. I don't think it would help with the central motifs but it's definitely something that I will look into for my data. |
stumpy/ostinato.py
Outdated
ts_ind_alt = ts_ind_alt[alt_same]
ss_ind_alt = ss_ind_alt[alt_same]
i_alt_first = np.argmin(ts_ind_alt)
if ts_ind_alt[i_alt_first] < ts_ind:
This amount of nested branching logic concerns me (from a testing standpoint). I can try to spend some time in the future to see if it can be simplified but just leaving a note for my future self here.
See #285
Oh, in case it matters, I would not put naive_get_central_motif in naive.py because naive_get_central_motif is only used by test_ostinato.py and no other test file. Functions in naive.py are strictly for naive implementations that are used across multiple different test files.
Yeah, when I wrote this I also got that code smell. I'll see if I can resolve it ;)
Minor comment
Cool, I'll try to implement/change what you suggested. |
I have implemented the naive version of
Also removed the nested conditional.
Ohh no, that wasn't my intention. :( I apologize as I was likely not clear enough on what I was asking. I think you've done a tonne so I can go over it after it gets merged.
Sounds good! I will take some time to go through it. Thank you so much for working with me on all of this! I really appreciate your help and I certainly learned a lot. Once this gets released, I'll be sure to use your @DanielHaehnke Twitter handle to acknowledge your contribution! |
Cool, thanks for accepting the pull request into master! Don't worry about the testing confusion on my side. My knowledge of testing isn't that deep, so I'm really curious about what you meant. Will subscribe to #285 to see your solution :)
resolves #277, resolves #278
Pull Request Checklist
Below is a simple checklist but please do not hesitate to ask for assistance!
black (i.e., python -m pip install black or conda install black)
flake8 (i.e., python -m pip install flake8 or conda install flake8)
pytest-cov (i.e., python -m pip install pytest-cov or conda install pytest-cov)
black . in the root stumpy directory
flake8 . in the root stumpy directory
./setup.sh && ./test.sh in the root stumpy directory