Replies: 10 comments
-
In fact, I would consider
And when you do the same for the
You see that both of their z-normalized Euclidean distances are the same. So, it is important to keep in mind that z-normalization basically removes the vertical shifts in your data and only compares the shape of your data. In this case, they are indeed identical in shape! However, they are certainly different in magnitude. Of course, finding a matching shape is the harder part and, in your case, you can simply account for this magnitude difference in a post-processing step.
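As a minimal sketch of this point (the numbers below are made up, and I'm calling STUMPY's core.mass):

import numpy as np
from stumpy import core

Q = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
T = np.array([0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 0.0]) + 100.0  # same shape inside, shifted up by 100

# The subsequence T[1:6] matches Q in shape, so its z-normalized distance is ~0
# even though the raw values differ by 100.
print(core.mass(Q, T))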
Yes, having a pattern of length 3 is likely too small to be meaningful.

P.S. This is a judgement-free zone so please ask away! I'm learning as much as you are.
-
Thank you a lot! By binning, you mean my attempt to normalize the data by assigning it to values between -3 and 3, right?
-
Yes, basically, by assigning values between -3 and 3, you are applying a manual normalization step (which could be biased depending on how you choose the bin edges). Unlike other machine learning methods that typically require your data to be standardized or normalized, this is not necessary with matrix profiles, since STUMPY will already perform a local z-normalization of each subsequence before computing the Euclidean distance. By assigning the values to a "bin", you've essentially removed information that would help you differentiate between the matching patterns. I would generally recommend just running STUMPY on your raw data whenever possible (i.e., make little to no assumptions) and then only make changes to your process in a data-driven way (i.e., only "fix" something if you see that it is causing a problem, rather than fixing a non-existent problem and potentially creating more issues). Again, to be clear, a window size of 3 is not likely going to be meaningful, so I strongly recommend looking at larger window sizes (or you may explore hourly data??)
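A rough sketch of what running STUMPY on the raw data with a larger window might look like (the series and m=48 below are only placeholders, not recommendations):

import numpy as np
import stumpy

raw_prices = np.random.random(1_000)   # stand-in for your raw (un-binned) price series
m = 48                                 # illustrative window size, larger than 3

mp = stumpy.stump(raw_prices, m)       # matrix profile over the raw data
motif_idx = np.argsort(mp[:, 0])[0]    # subsequence with the smallest distance to its neighbor
print(motif_idx, mp[motif_idx, 1])     # its position and the position of its nearest neighbor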
-
Thank you a lot! In this case I will do it without that "normalization". I think another thing interfering here might be the fact that I'm using the variation between the prices instead of the actual prices:
-
Another thing to potentially experiment with is looking at the percentage change between neighboring values. Good luck!
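A rough sketch of computing that percentage change (made-up prices):

import numpy as np

prices = np.array([100.0, 102.0, 101.0, 105.0])
pct_change = 100.0 * np.diff(prices) / prices[:-1]   # percentage change between neighboring values
print(pct_change)                                    # [ 2.         -0.98039216  3.96039604]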
-
This is absolutely true. What I'm trying to understand now is how much the scale of the numbers interferes with my MASS output; I've noticed that, for example, when the data in my target dataset is made of small values compared to the pattern data, the output tends to have higher values (all around 3). For example, the following code:
Gives the following output:
Even though from index
I'm trying to work on this, for now. Thank you a lot!
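As a small check (made-up numbers, again using STUMPY's core.mass): z-normalization also removes the scale, not just the vertical shift, so a small-valued target whose shape matches a large-valued pattern still gives a distance of ~0:

import numpy as np
from stumpy import core

pattern = np.array([10.0, 20.0, 15.0])              # "large" values
target = np.array([0.10, 0.20, 0.15, 0.05, 0.30])   # "small" values

# target[0:3] is just pattern / 100, i.e. the same shape at a different scale,
# so the first entry of the distance profile is ~0 despite the size difference.
print(core.mass(pattern, target))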
-
I'll have to look into that. I would've expected the identical match to produce a distance of zero. We may be bumping up against machine precision.
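In the meantime, one way to treat such a tiny distance as an effective zero (just a sketch; the tolerance here is arbitrary):

import numpy as np

d = 1.46000966e-07                      # the distance reported for the identical match
print(np.isclose(d, 0.0, atol=1e-6))    # True: indistinguishable from zero at this tolerance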
-
Honored I was able to help you find a bug!
-
@Jacks349 This is fixed in the master branch now. Are you able to clone the repository and re-install from source? Otherwise, you'll have to wait for the next release.
-
Yes, I can do that! Thank you for fixing the issue so quickly!
-
I am the same guy from this issue! I have another (noob) question:
In my dataset, I have a set of local minima where each element of the array is a percentage that stands for the variation from the previous element. This means that the price (not asking for financial advice) of that asset is going up when there is a series of positive percentages, and down when there is a series of negative percentages.
Here is a sample:
[1.8867924528301907, 2.5462962962962914, 0.45146726862302056, 0.6741573033707992, 2.2321428571428545, 0.8733624454148388, 7.575757575757577, 1.810865191146881, 2.173913043478257, 2.514506769825928, 5.094339622641508, -7.719928186714545, -2.3346303501945465, -0.5976095617529992, -3.40681362725451, 0.0, -1.2448132780082868, -1.6806722689075646, -2.7777777777777874, -0.4395604395604354, 6.401766004415006]
In order to avoid noise and get better results with my distance profile, I normalized this data by converting every percentage to a value between -3 and 3, where 0 is a variation of 0, 0 to 0.5 is a variation of less than 2, 0.5 to 1 is a variation of less than 4, and so on.
So the previous dataset becomes this:
Target = [0.5, 1, 0.5, 0.5, 1, 0.5, 2, 0.5, 1, 1, 1.5, -2, -1, -0.5, -1, 0.0, -0.5, -0.5, -1, -0.5, 1.5]
Now I specified a small pattern, which is the following:
Pattern = [-0.5, -1, -1]
This pattern is made of only negative values, which means that the price is going down (gradually, since the values are not high). So I computed the distance profile:
from stumpy.core import mass   # MASS distance profile from STUMPY
distance_profile = mass(Pattern, Target)
Which gave me the following output:
What I don't understand is the second element of the output, 1.46000966e-07, which is a very low value, so it suggests strong correlation for that part of the dataset. What I don't get is why the value is so low when the data on that part of the dataset are all positive values ([..0.5, 1, 0.5, 0.5, 1, 0.5, 2...]), whereas my Pattern array only has negative values. One of the hypotheses I made is that the Pattern array is too small. Any advice on this is appreciated! I apologize if anything should sound confusing.
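For reference, a quick way to check this by hand (just a sketch; I'm assuming the second output element corresponds to Target[1:4] and that MASS uses the population standard deviation):

import numpy as np

Pattern = np.array([-0.5, -1.0, -1.0])
window = np.array([1.0, 0.5, 0.5])   # Target[1:4], assumed to be behind the second output element

def znorm(x):
    return (x - x.mean()) / x.std()

# Both z-normalize to the same "high, low, low" shape relative to their own mean,
# so their z-normalized Euclidean distance is ~0 even though the raw signs differ.
print(znorm(Pattern))                                   # [ 1.4142 -0.7071 -0.7071]
print(znorm(window))                                    # [ 1.4142 -0.7071 -0.7071]
print(np.linalg.norm(znorm(Pattern) - znorm(window)))   # ~0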