
[ADD] Added first draft notebook for the multidimensional motif and matches discovery tutorial #557


Merged
merged 38 commits into stumpy-dev:main on Oct 31, 2022

Conversation

@SaVoAMP
Contributor

SaVoAMP commented Mar 2, 2022

Added first draft notebook for the multidimensional motif and matches discovery tutorial.

Pull Request Checklist

Below is a simple checklist but please do not hesitate to ask for assistance!

  • Fork, clone, and checkout the newest version of the code
  • Create a new branch
  • Make necessary code changes
  • Install black (i.e., python -m pip install black or conda install -c conda-forge black)
  • Install flake8 (i.e., python -m pip install flake8 or conda install -c conda-forge flake8)
  • Install pytest-cov (i.e., python -m pip install pytest-cov or conda install -c conda-forge pytest-cov)
  • Run black . in the root stumpy directory
  • Run flake8 . in the root stumpy directory
  • Run ./setup.sh && ./test.sh in the root stumpy directory
  • Reference a Github issue (and create one if one doesn't already exist)
    Add Top-K Multi-dimensional Motif Discovery Tutorial #518

@review-notebook-app

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@SaVoAMP
Contributor Author

SaVoAMP commented Mar 2, 2022

I was not sure whether I was supposed to run black and flake8 on Jupyter notebooks too, so I skipped those steps.

@codecov-commenter

codecov-commenter commented Mar 2, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.89%. Comparing base (c01c04c) to head (d2d8876).
Report is 249 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #557   +/-   ##
=======================================
  Coverage   99.89%   99.89%           
=======================================
  Files          80       80           
  Lines       11399    11453   +54     
=======================================
+ Hits        11387    11441   +54     
  Misses         12       12           


@seanlaw
Contributor

seanlaw commented Mar 3, 2022

I was not sure whether I was supposed to run black and flake8 on Jupyter notebooks too, so I skipped those steps.

Nope. black/flake8 are not needed for notebooks. Thanks for checking!

@seanlaw
Contributor

seanlaw commented Mar 6, 2022

@SaVoAMP Said:

Alright!
My problem is that setting the values with zero standard deviation to np.nan doesn't work and I can't explain why. I wrote two if-conditions and, when I debug, I do get into them. But setting the values to np.nan has no effect, even though the constant_idx vector is populated. I probably need to take a closer look at the DataFrame operations here.
Could this be because the index now contains the timestamps in the left column and no longer the data point positions 0, 1, 2, ..., n?
But I think constant_idx is giving me the data point positions, since constant_idx = [239, 240, 241, ..., 4650, 4651, 4652]. Is there a simple (data science) trick? I would insert another column that counts the data points and work with that instead of the timestamps, but that can certainly be done more elegantly.

I will take a look at the notebook (click on the big dark purple button at the top of the page that says "ReviewNB"). Your best bet is to make your changes and push the notebook so that I can directly see what you are saying and we can comment (roughly) inline.

Contributor

So, it looks like you are manipulating the values within a column and then trying to stuff the results back into the same column of your DataFrame. Unfortunately, you cannot do that via df.values[:, i] = X, as .values only gives you access to a copy of the values as a NumPy array and, since it's a copy, writing into it does not write back into the original Pandas DataFrame. Would something like this work for what you are trying to achieve?

# m was set to 4 hours
m = 4 * 60

padded_stddev = np.zeros(len(df))  # reusable array filled with zeros
for colname in df.columns:
    stddev = core.rolling_nanstd(df.loc[:, colname].values, m)  # Compute rolling stddev. Constant regions will have a 0.0 stddev
    padded_stddev[-len(stddev):] = stddev  # Copy stddev into the array that is pre-padded with (m-1) zeros
    modified_values = df.loc[:, colname].where(padded_stddev != 0.0, np.nan)  # Replace the original value with np.nan where the condition is False
    df = df.assign(**{colname: modified_values})

Stepping through each line of the for-loop:

  1. This should look familiar as we are computing the stddev for a particular column. In Pandas, the preferred way to access a column (or Pandas Series) by name is via the loc accessor, where the first element corresponds to a list of named indices (in this case, our index is a timestamp) followed by a list of column/Series names. In our case, we want to grab all rows (i.e., all timestamps) but only a single column at a time, so df.loc[:, colname] specifies that all rows (:) and only the column colname should be analyzed. The .values converts that column/Series into a copy of a NumPy array. Then, we pass that copy of the values into core.rolling_nanstd (a STUMPY function) and compute the rolling standard deviation. Of course, it is shorter by m-1 values and so we'll eventually need to pad it with some additional values
  2. Here, we use our pre-generated array that has been padded with m-1 zeros at the start and copy our standard deviation values over to fill in the rest of the array. So, the first m-1 values are all zeros, followed by all of the stddev values
  3. Then, we create a new Series of modified_values from the df.loc[:, colname] Series based on whether each row of padded_stddev is equal/not equal to 0.0. Note that the Pandas Series where function assesses whether the condition provided as the first argument is True or False. When True, it keeps the value from the Pandas Series. However, if the condition is found to be False, then it assigns the value in the second argument, which is np.nan in this case. So, for each row of padded_stddev, if that row has a value that is not equal to 0.0 (i.e., True), then the value from that row of the Pandas Series is kept. However, if the row in padded_stddev is equal to 0.0 (i.e., False), then the value in that row of the Pandas Series is replaced with np.nan. Note that all of this is written out to a new Pandas Series called modified_values
  4. Finally, we assign our new Pandas Series, modified_values, back to the column named colname in df

We could have combined steps 3 and 4 into one:

df = df.assign(**{colname: df.loc[:, colname].where(padded_stddev != 0.0, np.nan)})

but this is a bit harder to digest/consume.

Please ask for any clarification and I would be happy to discuss this further.




Contributor Author

Thank you very much! This was pretty useful for my understanding.

Contributor

Great!

@seanlaw
Contributor

seanlaw commented Mar 7, 2022

@SaVoAMP I came across this paper by Jessica Lin (who has worked with Eamonn Keogh in the past). If you look at Section D, I think that the PAMAP2 dataset (Overview found here - Download zip file here) may be good for our tutorial. In the paper, they focused on the "ironing" activity and used a 12.5 second (12500 data point) window size, which they claimed allowed them to identify relevant hand motions (from the Y and Z hand coordinates) that are consistent with ironing while, presumably, other coordinates (from the chest and ankles) may have been irrelevant for ironing.

The dataset is supposed to be 1.3 million data points in length, which may be too much for a tutorial, but I wonder if we could downsample the data by 10x (i.e., only look at every 10th data point and thereby analyze only 130K data points for each dimension) and still be able to convey our point. I wanted to bring it to your attention in case it may be useful.
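For illustration, a minimal sketch of that kind of 10x downsampling, assuming the data has already been loaded into a pandas DataFrame called df (the variable names here are only placeholders):

df_downsampled = df.iloc[::10]                          # keep only every 10th data point
df_downsampled = df_downsampled.reset_index(drop=True)  # optional: renumber the rows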

@SaVoAMP
Contributor Author

SaVoAMP commented Mar 7, 2022

The dataset is supposed to be 1.3 million data points in length, which may be too much for a tutorial, but I wonder if we could downsample the data by 10x (i.e., only look at every 10th data point and thereby analyze only 130K data points for each dimension) and still be able to convey our point. I wanted to bring it to your attention in case it may be useful.

I'm not sure which dataset would be better 🤔 I just added your suggestions to eliminate constant regions. Now we have much better results. But I'm not sure why more of the 'zero' regions in the Tumble Dryer and Washing Machine data aren't eliminated. Maybe we would get good results if I also eliminated the regions where the power demand is zero, since those regions are not important? Or do you think that would remove too much?
Do you think we can still get something out of the dataset, or would you rather switch to the other one?

@seanlaw
Contributor

seanlaw commented Mar 8, 2022

The dataset is supposed to be 1.3 million data points in length, which may be too much for a tutorial, but I wonder if we could downsample the data by 10x (i.e., only look at every 10th data point and thereby analyze only 130K data points for each dimension) and still be able to convey our point. I wanted to bring it to your attention in case it may be useful.

I'm not sure which dataset would be better 🤔 I just added your suggestions to eliminate constant regions. Now we have much better results. But I'm not sure why more of the 'zero' regions in the Tumble Dryer and Washing Machine data aren't eliminated. Maybe we would get good results if I also eliminated the regions where the power demand is zero, since those regions are not important? Or do you think that would remove too much? Do you think we can still get something out of the dataset, or would you rather switch to the other one?

I think we should exhaust our options fully before moving on. The key thing is that there is no way for anybody (including you or me) to know what the best option is until we try. The good thing is that there is no rush here. We should assess as we go. If you are willing, let's "try something and see what happens". Let me know if you are getting frustrated by anything or if there are any other ways that I can help. However, I see that you are a good problem solver too. Know that we are in this together!

@SaVoAMP
Contributor Author

SaVoAMP commented Mar 9, 2022

I also tried to eliminate the regions that are zero since they aren't interesting. Unfortunately, this eliminates too much and the program therefore doesn't work anymore. I think it might generally be difficult to find more than one motif in this dataset. Preprocessing alone is a major problem here: either you eliminate too much or too little 😞

@seanlaw
Contributor

seanlaw commented Mar 9, 2022

Let me take a look

Contributor

@seanlaw seanlaw Mar 9, 2022

What if, instead of setting things to np.nan, we set them to random noise (adjust the low/high bounds for np.random.uniform accordingly)?

m = 4 * 60
padded_stddev = np.zeros(len(df))

for colname in df.columns:
    noise = np.random.uniform(0.0, 0.1, size=len(df))
    stddev = core.rolling_nanstd(df.loc[:, colname].values, m)  # Constant regions will have a 0.0 stddev
    padded_stddev[-len(stddev):] = stddev  # Pre-padded with (m-1) zeros
    modified_values = df.loc[:, colname].where(padded_stddev != 0.0, noise)  # Replace constant regions with noise
    df = df.assign(**{colname: modified_values})
    nonzero_values = df.loc[:, colname].where(df.loc[:, colname] != 0, noise)  # Also replace exact zeros with noise
    df = df.assign(**{colname: nonzero_values})


Contributor

After this, I did:

mps, indices = stumpy.mstump(df.astype(float), m)
motifs_distances, motifs_indices, motifs_subspaces, motifs_mdls = mmotifs(
    df.astype(float), mps, indices, max_distance=np.inf, max_matches=2
)

I think we need to allow max_motifs=5 if we want to find more motifs, as the default is max_motifs=1
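For illustration, a minimal sketch of that call, reusing df, mps, and indices from the snippet above and only adding max_motifs (the other parameter values are carried over, not prescribed):

# Ask for up to 5 motifs instead of only the top one (the default is max_motifs=1)
motifs_distances, motifs_indices, motifs_subspaces, motifs_mdls = mmotifs(
    df.astype(float), mps, indices, max_distance=np.inf, max_matches=2, max_motifs=5
)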

Contributor Author

What if, instead of setting things to np.nan, we set them to random noise

That's a pretty good idea! Thank you!

I think we need to allow max_motifs=5 if we want to find more motifs, as the default is max_motifs=1

Yes, I agree. But since we want to show as much of the mmotifs function as we can, I would do everything step by step. What would you say?

My suggestion would be to proceed as follows:
I would try to set different parameter combinations in mmotifs and try to find different things. For example, in a first step we could try to find the first motif unconstrained (as it is now), explain how to choose the parameters in mmotifs to do so, plot the found motif, and explain what we have done. Afterwards we could do exactly the same steps but, for example, try to find a 2-dimensional motif (instead of the unconstrained one). Once we have the 2-dimensional motif, we could search for its nearest neighbors (matches), and so on. After that I would try to find some kind of story that explains what we are doing in each step and what we are finding / trying to find.

That means I would try to call mmotifs in different ways in order to explain our input parameters. Then I would plot what we have found and explain our findings. We could also show the MDL results after the unconstrained search, to show the user what mmotifs is doing if no k is set, and explain that. And so on ...
I think it might be easiest to do the programming steps first, see which parameter combinations lead to meaningful results, and write the context/explanation for the Jupyter notebook at the end, since it depends on the parameter settings.

What do you think?

num_motifs = motifs_distances.shape[0]

for motif_num in range(num_motifs):
    motif_indices = motifs_indices[motif_num]
    mdl = motifs_mdls[motif_num]
    S = motifs_subspaces[motif_num]
    subspace = [df.columns.values[s] for s in S]

    k = len(S)
    f, axs = plt.subplots(k, 1, figsize=(20, 20), sharex=True)
    plt.suptitle(f"The {k}-dimensional motif pair")

    for j, s in enumerate(S):
        time_series = df[df.columns[s]]
        if k > 1:
            ax = axs[j]
        else:
            ax = axs
        ax.plot(time_series, c="0.75")
        ax.set_title(df.columns.values[s])

        motif_idx = motif_indices[0]
        nn_idx = motif_indices[1]
        ax.plot(time_series[motif_idx : motif_idx + m], c="r", linewidth=4)
        ax.plot(time_series[nn_idx : nn_idx + m], c="r", linewidth=4)
        # ax.axvline(x=motif_idx, linestyle='dashed', c='k', linewidth=2)
        # ax.axvline(x=nn_idx, linestyle='dashed', c='k', linewidth=2)

        # Only relevant if you want to find further matches
        # Set motif_indices[2:] to avoid double counting the motif pair
        [ax.plot(time_series[match : match + m]) for match in motif_indices[2:]]

I think it might be useful to wrap that code in a separate function for this tutorial since it will be repeated many times. I'm asking because I haven't seen any function definitions in your tutorials yet. I would define the function at the beginning of the tutorial and explain that it will be used many times to visualize our results/findings.

Contributor

@seanlaw seanlaw Mar 9, 2022

I think the first demonstration should always be calling the function with default parameters as it is the easiest and most straightforward:

mps, indices = stumpy.mstump(df.astype(float), m)
motifs_distances, motifs_indices, motifs_subspaces, motifs_mdls = mmotifs(df.astype(float), mps, indices)

So, you get something reasonable back with very little effort. Then, I would motivate the next step with "but this only gave us one motif and, at most, 10 matches", so how would we find more motifs beyond this first "top" one? And, by asking logical questions, we can motivate the introduction/need for the other parameters. I trust your judgement and don't want to overthink this. If you are open to feedback then I propose that we write the tutorial and then edit it as we go. It will take time but, ultimately, our users will benefit!

I'm asking because I haven't seen any function definitions in your tutorials yet. I would define the function at the beginning of the tutorial and explain that it will be used many times to visualize our results/findings.

You are very observant and thank you for asking! In general, for STUMPY code, I completely agree that we should write reusable functions rather than repeating ourselves everywhere. However, for tutorials, I personally hate needing to scroll all the way up to the top of the notebook in order to remember what the function does. I just can't keep it all in my head and I'd rather have all of the relevant code in one place for a tutorial. Let me be absolutely clear that I'm not claiming to be "right" here, but I think that not defining functions in tutorials makes the barrier to entry a little lower for those who are new to Python. So the repeated code in our tutorials was not a lazy choice but one that aimed to provide a more "inclusive"/"locally self-contained" programming approach. Considering that you've been a consumer of the tutorials, I'm open to your feedback and criticism as well! How do you feel about it? What are your thoughts?

Contributor Author

You are very observant and thank you for asking! In general, for STUMPY code, I completely agree that we should write reusable functions rather than repeating ourselves everywhere. However, for tutorials, I personally hate needing to scroll all the way up to the top of the notebook in order to remember what the function does. I just can't keep it all in my head and I'd rather have all of the relevant code in one place for a tutorial. Let me be absolutely clear that I'm not claiming to be "right" here, but I think that not defining functions in tutorials makes the barrier to entry a little lower for those who are new to Python. So the repeated code in our tutorials was not a lazy choice but one that aimed to provide a more "inclusive"/"locally self-contained" programming approach. Considering that you've been a consumer of the tutorials, I'm open to your feedback and criticism as well! How do you feel about it? What are your thoughts?

Since different visualizations were always chosen in the previous tutorials, I thought that made sense. However, in a tutorial where you want to explain the functionality of a function by varying its parameters, it may be clearer to define one visualization function, since we use exactly the same code many times (which was not the case in the previous tutorials) in order to explain how everything works and what the differences between the function calls are.

However, for tutorials, I personally hate needing to scroll all the way up to the top of the notebook in order to remember what the function does. I just can't keep it all in my head and I'd rather have all of the relevant code in one place for a tutorial.

Generally I agree with you here. But in this specific case I think it might be easier/clearer to define a visualize() or show_matches() function. Such a function is so simple that I think the name alone says enough that everyone knows what is happening without having to scroll up 🤔

I think the first demonstration should always be calling the function with default parameters as it is the easiest and most straightforward:

Yes, that sounds reasonable! I wanted to make sure at the beginning that I actually get exactly one motif pair. I'll check whether we get a solution with the default parameters (with my test data I had to at least set the max_distance parameter in order to find a motif).

Contributor

However, in a tutorial where you want to explain the functionality of a function by varying its parameters, it may be clearer to define one visualization function, since we use exactly the same code many times (which was not the case in the previous tutorials) in order to explain how everything works and what the differences between the function calls are.
Such a function is so simple that I think the name alone says enough that everyone knows what is happening without having to scroll up

@SaVoAMP I trust your judgement. Go for it and let's see how it turns out!

Yes, that sounds reasonable! I wanted to make sure at the beginning that I actually get exactly one motif pair. I'll check whether we get a solution with the default parameters (with my test data I had to at least set the max_distance parameter in order to find a motif).

Sounds good!

Contributor Author

I'm not sure why, but every time I rerun the notebook I get different outputs, for example:
[two screenshots showing the differing outputs]

Contributor

This does sound strange. It implies that you may be overwriting your original data structures. Are your motifs_distances, motifs_indices, motifs_subspaces, motifs_mdls still the same when you re-run the notebook, or are they changing after each re-run?
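As an aside, one low-tech way to answer that question is to persist the outputs from one run and compare them on the next (a sketch; the file name is illustrative). It may also be worth noting that the preprocessing sketched earlier injects unseeded np.random.uniform noise, so fixing a seed (e.g., with np.random.seed before the preprocessing) is one variable worth ruling out:

import numpy as np

# First run: save the results to disk (illustrative file name)
np.savez("mmotifs_run.npz", distances=motifs_distances, indices=motifs_indices)

# Later run: compare against the saved results
saved = np.load("mmotifs_run.npz")
print(np.array_equal(saved["distances"], motifs_distances))
print(np.array_equal(saved["indices"], motifs_indices))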

Contributor

@seanlaw seanlaw Mar 11, 2022

I don't think it is wise to have a function with no inputs as this can cause unintended behavior. For example, in motif_indices = motifs_indices[motif_num] you are overwriting the global motif_indices that exists outside of your function. You should really make a copy first, and you should always pass in the latest motif_distances, motif_indices, df, etc.

def show_motifs_matches(df, motif_distances, motif_indices, ...)
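For illustration, here is a fuller sketch of such a function, with the body adapted from the plotting loop earlier in this thread; the extra m and motif_subspaces parameters (and the squeeze=False handling of the one-dimensional case) are assumptions standing in for the "..." above, not part of the suggestion itself:

import matplotlib.pyplot as plt

def show_motifs_matches(df, m, motif_distances, motif_indices, motif_subspaces):
    # m and motif_subspaces are assumed additional parameters (the "..." above)
    # One figure per discovered motif; plot only the dimensions in that motif's subspace
    for motif_num in range(motif_distances.shape[0]):
        S = motif_subspaces[motif_num]      # dimensions spanned by this motif
        indices = motif_indices[motif_num]  # motif, nearest neighbor, further matches
        fig, axs = plt.subplots(len(S), 1, figsize=(20, 20), sharex=True, squeeze=False)
        fig.suptitle(f"The {len(S)}-dimensional motif pair")
        for j, s in enumerate(S):
            ax = axs[j, 0]
            time_series = df[df.columns[s]]
            ax.plot(time_series, c="0.75")
            ax.set_title(df.columns[s])
            for idx in indices[:2]:          # the motif pair
                ax.plot(time_series[idx : idx + m], c="r", linewidth=4)
            for match in indices[2:]:        # any additional matches
                ax.plot(time_series[match : match + m])
    plt.show()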


Contributor Author

Ok, thank you, I'll change it!

Contributor Author

Unfortunately this doesn't change anything and I still get different motifs. I'll have to take a closer look at this in the next few days. I don't like Jupyter notebooks since I often debug my way through all the code, which is inconvenient with them in my opinion. So I resort to copy-pasting the code into PyCharm and back as needed 😆

@seanlaw
Contributor

seanlaw commented Mar 11, 2022

[MOD] Rap visualization in a fuction

I can't lie, @SaVoAMP, I was really looking forward to seeing how you combine "Rap" and "visualization" 😸

@SaVoAMP
Contributor Author

SaVoAMP commented Mar 11, 2022

I can't lie, @SaVoAMP, I was really looking forward to seeing how you combine "Rap" and "visualization" 😸

Oh god, that is very embarrassing 😆 It's supposed to mean 'wrap' 🤣

@SaVoAMP
Contributor Author

SaVoAMP commented Apr 20, 2022

Yes, that makes sense!
So we could first finish the tutorial with an easier concept and I could dive into the other problem afterwards.

@seanlaw
Contributor

seanlaw commented Jun 22, 2022

@SaVoAMP Just checking in here. How are things going?

@SaVoAMP
Contributor Author

SaVoAMP commented Jun 22, 2022

Hey, sorry, I was ill for a while 😞

I have already integrated your suggested Annotation Vector code and updated all of the text according to the motifs that are found with this new approach. I think that the new motifs make more sense than the old ones. However, the three-dimensional motif that is found in the "Constrained Search" section doesn't look very meaningful since many no-action regions are found for the matches.

So far I haven't found out why the "Comparing the motif with its nearest neighbors" plot only works sometimes. That's pretty strange 🤔 I think I have to check this with another IDE since I don't like to debug directly in Jupyter notebooks 😆

By the way, do you have a programming task I could contribute to? Maybe something like developing the mmotifs function? I would love to contribute some code and improve my programming skills, but Jupyter notebooks drive me crazy 😆

@seanlaw
Contributor

seanlaw commented Jun 22, 2022

Hey, sorry, I was ill for a while 😞

Oh no! Sorry to hear that. I hope you are getting better

I have already integrated your suggested Annotation Vector code and updated all of the text according to the motifs that are found with this new approach. I think that the new motifs make more sense than the old ones. However, the three-dimensional motif that is found in the "Constrained Search" section doesn't look very meaningful since many no-action regions are found for the matches.

At some point, if the process and results aren't clear, then maybe we merge the tutorial and simply accept it in a "work-in-progress" state and allow people to discover it (or we point people who ask in the discussions to it) but, due to the lack of clarity, we don't add it to the Tutorial section. Maybe you'd agree that while the tools are reasonable, the example use case will likely cause more confusion. I think you've done a tonne of work and I also want to give you an opportunity to take a break from it as well. We've done this many, many times with other tutorials where we just didn't publish it (not because of a lack of effort) but because it wasn't "ready" yet and, sometimes, it may never be. What do you think?

So far I haven't found out why the "Comparing the motif with its nearest neighbors" plot only works sometimes. That's pretty strange 🤔 I think I have to check this with another IDE since I don't like to debug directly in Jupyter notebooks 😆

Cool.

By the way, do you have a programming task I could contribute to? Maybe something like developing the mmotifs function? I would love to contribute some code and improve my programming skills, but Jupyter notebooks drive me crazy 😆

I totally understand. I just looked at the open issues (and I recommend you take a look as well) but I don't see anything as interesting/similar as developing mmotifs. I will try to keep it in mind. The closest thing that we have right now is to try and reproduce the applications found in #107 or also from this paper. It would be really great to demonstrate how to recreate some, if not all, of this work using STUMPY. Just throwing the idea out there. I'd love to add a new section to our documentation that is shorter and to the point, along the lines of "10 Useful Things You Can Do with STUMPY in a Few Lines of Code"

@seanlaw
Contributor

seanlaw commented Sep 28, 2022

Hi @SaVoAMP! How are you? I just wanted to check in here and see how things are going.

@SaVoAMP
Contributor Author

SaVoAMP commented Sep 30, 2022

Hey 😃

At some point, if the process and results aren't clear, then maybe we merge the tutorial and simply accept it in a "work-in-progress" state and allow people to discover it (or we point people who ask in the discussions to it) but, due to the lack of clarity, we don't add it to the Tutorial section. Maybe you'd agree that while the tools are reasonable, the example use case will likely cause more confusion. I think you've done a tonne of work and I also want to give you an opportunity to take a break from it as well. We've done this many, many times with other tutorials where we just didn't publish it (not because of a lack of effort) but because it wasn't "ready" yet and, sometimes, it may never be. What do you think?

Sorry, I interpreted this message to mean that further work on this tutorial section might not be worth the effort.

The only idea that comes to my mind regarding this tutorial would be to change the data set and thus redo the text of the tutorial section as well.
I might have a suggestion for an appropriate data set (I think this one should be suitable for the multidimensional pattern matching tutorial) here. It is a data set about human activity recognition and consists of 30-dimensional time series data. Different participants perform various exercises (such as Squats, Mountain Climbers, Push-Ups and so on) in three sets. I already used this data set for my Bachelor's thesis. Unfortunately, the data is stored in .zip archives, split into separate files per sensor (where one sensor file contains the time series data for the three spatial dimensions), and would require some preprocessing to bring it into a suitable .csv format. What do you suggest?

@seanlaw
Contributor

seanlaw commented Oct 2, 2022

@SaVoAMP I think the pragmatic thing to do is:

  1. Declare this notebook as a "work-in-progress"
  2. Clean up this notebook so that we emphasize the parts that worked well (i.e., the tools/functions made it easy for the user to find what they need in the data) and the parts that didn't work well and need better data/examples
  3. In other words, we clearly state what is "done" and what "isn't done" (also reminding the reader that we might need to change the data and try again)
  4. Merge the notebook as-is

This way, your work can still be captured and anybody (including you) can take another stab at improving it in a future PR to this file (whether it's using the same dataset or something else). If you want to try it with another dataset then I would keep the PR separate. How does that sound?

@SaVoAMP
Contributor Author

SaVoAMP commented Oct 4, 2022

All right, that sounds good!

Unfortunately I'll probably only get to it on the weekend, since I'm busy with my master's degree and work on weekdays.

@seanlaw
Contributor

seanlaw commented Oct 4, 2022

No problem! Thank you

@SaVoAMP
Contributor Author

SaVoAMP commented Oct 11, 2022

Hey, I'm sorry, but I was very busy studying over the weekend. But I promise that I'll get to it as soon as I have some free time. 😃

@seanlaw
Contributor

seanlaw commented Oct 11, 2022

Hey, I'm sorry, but I was very busy studying over the weekend. But I promise that I'll get to it as soon as I have some free time. 😃

No worries! Thank you for letting me know @SaVoAMP

@seanlaw
Contributor

seanlaw commented Oct 18, 2022

@SaVoAMP This might be relevant

@SaVoAMP
Contributor Author

SaVoAMP commented Oct 18, 2022

Uhh, this looks very interesting, especially since I particularly like to work with sports data.

I'm very sorry that I have not found the time to go through the tutorial again yet. In addition to everyday life currently being quite stressful, I was also sick last week. I really hope to get to it this coming weekend!

It would probably make the most sense to work through the steps you mentioned (17 days ago) first and experiment with the new data set afterwards, as you already said?

Declare this notebook as a "work-in-progress"
Clean up this notebook so that we emphasize the parts that worked well (i.e., the tools/functions made it easy for the user to find what they need in the data) and the parts that didn't work well and need better data/examples
In other words, we clearly state what is "done" and what "isn't done" (also reminding the reader that we might need to change the data and try again)
Merge the notebook as-is

I think (and hope) I should be able to do these steps soon.

If you want to try it with another data set then I would keep the PR separate. How does that sound?

This will probably take a while. I find the data set really exciting, but unfortunately I can't guarantee how fast I will make progress alongside studying and working. If it continues to be this stressful, I may not be able to work on it properly until my exam period in late January / early February is over. Unless you just happen to have an idea of how to combine some machine learning stuff with matrix profiles so that it has some use for stumpy (I'm currently looking for a project idea for a machine learning subject I have chosen) 😆.

@seanlaw
Contributor

seanlaw commented Oct 18, 2022

I'm very sorry that I have not found the time to go through the tutorial again yet. In addition to everyday life currently being quite stressful, I was also sick last week. I really hope to get to it this coming weekend!

No worries! This is all volunteer work and you are allowed to say, "I don't have the time" too.

It would probably make the most sense to work through the steps you mentioned (17 days ago) first and experiment with the new data set afterwards, as you already said?

Yes, I think that is the right plan. Finish this up and, separately (whether it is you or somebody else), we can explore the other dataset.

This will probably take a while. I find the data set really exciting, but unfortunately I can't guarantee how fast I will make progress alongside studying and working. If it continues to be this stressful, I may not be able to work on it properly until my exam period in late January / early February is over.

This is why I created a new issue. Sometimes, issues sit there and nobody has time/interest to work on them and that is okay. It's just a place to track thoughts/ideas and anybody can give it a try if they have the availability. No pressure.

Unless you just happen to have an idea of how to combine some machine learning stuff with matrix profiles so that it has some use for stumpy (I'm currently looking for a project idea for a machine learning subject I have chosen)

I recommend that you post something in our Discussion section as others in the community may have some thoughts/suggestions!

@SaVoAMP
Contributor Author

SaVoAMP commented Oct 23, 2022

Today I took a closer look at the "Comparing the motif with its nearest neighbors" plot. It seems to be correct after all. I enabled %matplotlib notebook and zoomed into the plot. It turned out that there are 10 matches plotted, but they look so similar that it's not possible to distinguish them without zooming in very closely. I think everything should fit now, if one disregards the fact that the data set is not well suited to finding meaningful motifs.

@seanlaw
Contributor

seanlaw commented Oct 24, 2022

Great! If you are satisfied with it, then I will merge this notebook after the tests complete. I will await your final confirmation, @SaVoAMP.

@SaVoAMP
Contributor Author

SaVoAMP commented Oct 24, 2022

Yes, I guess it's only worth putting more effort into it with a new data set (maybe one that doesn't have so many constant regions without a signal and therefore doesn't need as much preprocessing).

@seanlaw seanlaw merged commit ed84d3f into stumpy-dev:main Oct 31, 2022
@seanlaw
Contributor

seanlaw commented Oct 31, 2022

Thank you for this work @SaVoAMP! We really appreciate it!
