Changing the anomaly score formula #568
Conversation
```python
sub = pd.DataFrame({"magpsf": magpsf, "sigmapsf": sigmapsf, "jd": jd, "cfid": cfid})
sub = sub.sort_values("jd", ascending=True)
sub = sub.drop_duplicates(subset=['jd'])
```
According to the documentation, this will drop all lines with the same jd, keeping only the first in each group.
At the same time we have a cfid column, and it is valid to have both filters at the same jd. I am not sure that it is technically possible, but still. Don't you think that subset=['jd', 'cfid'] would be a better approach?
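To illustrate the difference between the two subsets, here is a minimal sketch with hypothetical values (the jd and magpsf numbers are made up for the example):

```python
import pandas as pd

# Hypothetical light curve with two filters (cfid 1 and 2) observed at the same jd.
sub = pd.DataFrame({
    "jd": [2459000.5, 2459000.5, 2459001.5],
    "cfid": [1, 2, 1],
    "magpsf": [18.2, 18.9, 18.1],
})

# subset=["jd"] keeps only the first row per jd, silently dropping the
# second filter's measurement at that epoch.
only_jd = sub.drop_duplicates(subset=["jd"])

# subset=["jd", "cfid"] keeps one row per (jd, filter) pair.
jd_and_cfid = sub.drop_duplicates(subset=["jd", "cfid"])
```

With subset=["jd"], the cfid=2 measurement at jd 2459000.5 is lost; with subset=["jd", "cfid"], all three rows survive.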
> it is valid to have both filters at the same jd

For a given object, there must be only one filter for a given jd? And I do not understand why there could be duplicated jd values for an object?
> it is valid to have both filters at the same jd
>
> For a given object, there must be only one filter for a given jd? And I do not understand why there could be duplicated jd values for an object?
Me too :-)
But there are duplicated jd lines with the same filter in the wild. We found this because the lc_features code breaks under these circumstances.
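Such artifacts can be spotted before they break downstream code. A minimal sketch with made-up values, flagging rows that share both jd and cfid:

```python
import pandas as pd

# Hypothetical alert data containing a genuine (jd, cfid) duplicate,
# like the artifacts found in the wild.
sub = pd.DataFrame({
    "jd": [2459000.5, 2459000.5, 2459001.5],
    "cfid": [1, 1, 1],
    "magpsf": [18.2, 18.3, 18.1],
})

# keep=False flags every member of a duplicated (jd, cfid) group,
# not just the later occurrences.
dupes = sub[sub.duplicated(subset=["jd", "cfid"], keep=False)]
```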
Ah, that's true, some artifacts were found. I'm not sure this is the majority, but ok, I understand the logic.
Done
I am a little confused by how get_key works:
```python
def get_key(x: dict, band: int):
    if (
        len(x) != 2
        or x is None
        or any(
            map(  # noqa: W503, C417
                lambda fs: (fs is None or len(fs) == 0), x.values()
            )
        )
    ):
        return pd.Series({k: np.nan for k in MODEL_COLUMNS}, dtype=np.float64)
    elif band in x:
        return pd.Series(x[band])
    else:
        raise IndexError("band {} not found in {}".format(band, x))
```

Imagine two cases. First, x has only the key g. Second, x has g and i. Then we ask for the r key in both scenarios.
Then you will have a table of NaNs in the first case, but an exception in the second. It seems to be inconsistent behavior, doesn't it?
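The inconsistency can be reproduced in isolation. A self-contained sketch (MODEL_COLUMNS and the feature names are stand-ins, and bands are encoded as integers here for illustration):

```python
import numpy as np
import pandas as pd

# Stand-in for the real model feature list (hypothetical names).
MODEL_COLUMNS = ["feature_a", "feature_b"]

def get_key(x: dict, band: int):
    # Logic copied from the function under review.
    # Note: the "x is None" check is unreachable, since len(None) raises first.
    if (
        len(x) != 2
        or x is None
        or any(map(lambda fs: fs is None or len(fs) == 0, x.values()))
    ):
        return pd.Series({k: np.nan for k in MODEL_COLUMNS}, dtype=np.float64)
    elif band in x:
        return pd.Series(x[band])
    else:
        raise IndexError("band {} not found in {}".format(band, x))

# Case 1: only band 1 present, ask for band 2 -> silently returns all-NaN series.
res = get_key({1: {"feature_a": 0.1, "feature_b": 0.2}}, band=2)

# Case 2: bands 1 and 3 present, ask for band 2 -> raises IndexError instead.
raised = False
try:
    get_key({1: {"feature_a": 0.1}, 3: {"feature_a": 0.3}}, band=2)
except IndexError:
    raised = True
```

The same missing band yields NaNs or an exception depending only on how many other bands happen to be present.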
Yes, I agree, it does look strange. This is not my code, so I can't say exactly why it was done this way, but it seems that in the current implementation it is no longer necessary. Fixed it
```python
# Case 4 (both are invalid) is already handled by the zero initialization.
return final_scores
```
It seems that employing a numpy masked array would help reduce the code size here:
```python
scores_g = ma.array(np.transpose(scores_g_raw[-1])[0], mask=mask_g.to_numpy())
scores_r = ma.array(np.transpose(scores_r_raw[-1])[0], mask=mask_r.to_numpy())
final_scores = ma.column_stack([scores_g, scores_r]).min(axis=1).filled(0)
```
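For readers unfamiliar with the pattern, here is a self-contained sketch of the masked-array trick with hypothetical score values (a True mask entry marks an object with no valid data in that filter):

```python
import numpy as np
import numpy.ma as ma

# Hypothetical per-filter anomaly scores for three objects.
scores_g = ma.array([-0.3, 0.1, 0.0], mask=[False, False, True])
scores_r = ma.array([-0.1, 0.0, 0.0], mask=[False, True, True])

# Per-object minimum over the two filters; masked entries are ignored,
# and rows masked in both filters fall back to the fill value.
final_scores = ma.column_stack([scores_g, scores_r]).min(axis=1).filled(0.0)
```

Object 0 gets min(-0.3, -0.1) = -0.3, object 1 keeps its only valid score 0.1, and object 2 (invalid in both filters) receives the fill value 0.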
Done
```python
# Initialize the final score array with zeros.
# This handles the case where data in both filters is NaN by default.
final_scores = np.zeros_like(scores_g, dtype=np.float64)
```
I am not sure that zero is a good default anomaly score. We should be able to somehow distinguish between a true zero score and an unknown score. Note that it is not guaranteed that anomalies will even have negative scores; anomalies only have scores lower than nominal objects.
@pruzhinskaya and I decided to return nan instead to mark an unknown score.
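With the masked-array formulation above, this is a one-token change: fill with nan instead of 0. A minimal sketch with made-up scores:

```python
import numpy as np
import numpy.ma as ma

# Two hypothetical objects: the first valid in both filters,
# the second invalid in both.
scores_g = ma.array([-0.3, 0.0], mask=[False, True])
scores_r = ma.array([-0.1, 0.0], mask=[False, True])

# Filling with nan keeps "unknown" distinguishable from a genuine zero score.
final_scores = ma.column_stack([scores_g, scores_r]).min(axis=1).filled(np.nan)

# Downstream consumers can then filter out unknown scores explicitly.
known = final_scores[~np.isnan(final_scores)]
```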
@JulienPeloton currently, we know that we are unable to calculate the score for some objects. From the broker point of view, does it make any sense to filter such objects before they enter this module? Are you happy with nan as an indicator of an unknown anomaly score? Or is there another option?
Yes, it makes sense to filter such objects before they enter the module. I'm ok with returning a default value that indicates the module was unable to calculate the score. nan is a good indicator.
Done
Thanks @Knispel2 for the code, and @matwey for the review! I have one question to confirm. In the current release, models are about 50MB in size:

```
ls -lth fink_science/data/models/anomaly_detection/
total 258M
-rw-rw-r-- 1 peloton  51M Jul 22 10:58 anomaly_detection_forest_AAD_julien.zip
-rw-rw-r-- 1 peloton  51M Jul 22 10:58 anomaly_detection_forest_AAD_emille_30days.zip
-rw-rw-r-- 1 peloton  52M Jul 22 10:58 anomaly_detection_forest_AAD_emille.zip
-rw-rw-r-- 1 peloton  51M Jul 22 10:58 anomaly_detection_forest_AAD_anais.zip
-rw-rw-r-- 1 peloton  52M Jul 22 10:58 anomaly_detection_forest_AAD.zip
-rw-rw-r-- 1 peloton 2.4M Jul  4 07:47 anomaly_detection_forest_AAD_beta.zip
-rw-rw-r-- 1 peloton  660 Feb  9  2025 g_means.csv
-rw-rw-r-- 1 peloton  665 Feb  9  2025 r_means.csv
-rw-rw-r-- 1 peloton 103K Dec 20  2024 anomaly_detection_forest_AAD_maria.zip
```

while in this PR, model sizes shrink to a couple of MB:

```
ls -lth fink_science/data/models/anomaly_detection/
total 11M
-rw-rw-r-- 1 peloton 1.7M Oct 17 14:49 anomaly_detection_forest_AAD_julien.zip
-rw-rw-r-- 1 peloton 1.7M Oct 17 14:49 anomaly_detection_forest_AAD_emille_30days.zip
-rw-rw-r-- 1 peloton 1.7M Oct 17 14:49 anomaly_detection_forest_AAD_emille.zip
-rw-rw-r-- 1 peloton 1.7M Oct 17 14:49 anomaly_detection_forest_AAD_anais.zip
-rw-rw-r-- 1 peloton 1.7M Oct 17 14:49 anomaly_detection_forest_AAD.zip
-rw-rw-r-- 1 peloton 2.4M Jul  4 07:47 anomaly_detection_forest_AAD_beta.zip
-rw-rw-r-- 1 peloton  660 Feb  9  2025 g_means.csv
-rw-rw-r-- 1 peloton  665 Feb  9  2025 r_means.csv
-rw-rw-r-- 1 peloton 103K Dec 20  2024 anomaly_detection_forest_AAD_maria.zip
```

Expected?
In the previous version, we greatly increased the depth of the trees in the hope that this would have a positive effect. However, we didn't see any improvement, so we returned to the standard depth.
Perfect @Knispel2 -- this makes sense!