Conversation

@Knispel2
Contributor

No description provided.

sub = pd.DataFrame({"magpsf": magpsf, "sigmapsf": sigmapsf, "jd": jd, "cfid": cfid})

sub = sub.sort_values("jd", ascending=True)
sub = sub.drop_duplicates(subset=['jd'])
Contributor
According to the documentation, this will drop all rows with the same jd, keeping only the first in each group.
At the same time, we have the cfid column, and it is valid to have both filters at the same jd. I am not sure it is technically possible, but still. Don't you think that subset=['jd', 'cfid'] would be a better approach?
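For illustration, a minimal sketch of the difference (the column values are hypothetical):

```python
# Hypothetical data: two alerts share a jd but come from different
# filters (cfid); a third row is a distinct epoch.
import pandas as pd

sub = pd.DataFrame({
    "magpsf": [19.1, 19.3, 19.2],
    "sigmapsf": [0.10, 0.11, 0.12],
    "jd": [2460000.5, 2460000.5, 2460001.5],
    "cfid": [1, 2, 1],
})

# Deduplicating on jd alone silently drops the second filter's point.
only_jd = sub.drop_duplicates(subset=["jd"])

# Deduplicating on (jd, cfid) keeps one point per filter per epoch.
jd_and_cfid = sub.drop_duplicates(subset=["jd", "cfid"])
```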

Member
> it is valid to have both filters at the same jd

For a given object, there must be only one filter for a given jd, no? And I do not understand why there could be duplicated jd values for an object.

Contributor
> > it is valid to have both filters at the same jd
>
> For a given object, there must be only one filter for a given jd, no? And I do not understand why there could be duplicated jd values for an object.

Me too :-)
But there are duplicated jd lines with the same filter in the wild. We found this because the lc_features code breaks under these circumstances.

Member
Ah, that's true, some artifacts were found. I'm not sure they are the majority, but OK, I understand the logic.

Contributor Author
Done

Contributor
I am a little confused with how get_key works:

    def get_key(x: dict, band: int):
        if (
            len(x) != 2
            or x is None
            or any(
                map(  # noqa: W503, C417
                    lambda fs: (fs is None or len(fs) == 0), x.values()
                )
            )
        ):
            return pd.Series({k: np.nan for k in MODEL_COLUMNS}, dtype=np.float64)
        elif band in x:
            return pd.Series(x[band])
        else:
            raise IndexError("band {} not found in {}".format(band, x))

Imagine two cases. In the first, x has only the key g; in the second, x has g and i. Then we ask for the r key in both scenarios.
You will get a table of NaNs in the first case, but an exception in the second. That seems to be inconsistent behavior, doesn't it?
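The inconsistency can be reproduced with a small stub (MODEL_COLUMNS and the feature names below are hypothetical stand-ins for the module's real values; the None check is also moved before len(x), since len(None) would itself raise):

```python
import numpy as np
import pandas as pd

MODEL_COLUMNS = ["feat_a", "feat_b"]  # stub; the real module defines these

def get_key(x, band):
    # None check first: calling len(None) would raise TypeError.
    if (
        x is None
        or len(x) != 2
        or any(fs is None or len(fs) == 0 for fs in x.values())
    ):
        return pd.Series({k: np.nan for k in MODEL_COLUMNS}, dtype=np.float64)
    elif band in x:
        return pd.Series(x[band])
    else:
        raise IndexError("band {} not found in {}".format(band, x))

# Case 1: x has only 'g' -> len(x) != 2, so asking for 'r' silently
# yields a Series of NaNs.
case1 = get_key({"g": {"feat_a": 1.0, "feat_b": 2.0}}, "r")

# Case 2: x has 'g' and 'i' -> len(x) == 2, so asking for 'r' raises.
try:
    get_key({"g": {"feat_a": 1.0}, "i": {"feat_a": 3.0}}, "r")
    raised = False
except IndexError:
    raised = True
```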

Contributor Author
Yes, I agree, it does look strange. This is not my code, so I can't say exactly why it was done this way, but it seems that in the current implementation it is no longer necessary. Fixed it.


# Case 4 (both are invalid) is already handled by the zero initialization.

return final_scores
Contributor
It seems that employing a NumPy masked array would help reduce the code size here:

import numpy.ma as ma

scores_g = ma.array(np.transpose(scores_g_raw[-1])[0], mask=mask_g.to_numpy())
scores_r = ma.array(np.transpose(scores_r_raw[-1])[0], mask=mask_r.to_numpy())

final_scores = ma.column_stack([scores_g, scores_r]).min(axis=1).filled(0)
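A self-contained toy version of this pattern (the score values are hypothetical): invalid entries are masked per filter, the row-wise minimum ignores masked entries, and rows masked in both filters are filled with the default.

```python
# Toy demonstration of the masked-array pattern: per-filter scores with
# invalid entries masked, row-wise minimum, fully-masked rows filled
# with a default value (0 here, as in the snippet above).
import numpy as np
import numpy.ma as ma

scores_g = ma.array([0.1, 0.5, 0.9], mask=[False, True, True])
scores_r = ma.array([0.2, 0.3, 0.8], mask=[False, False, True])

final_scores = ma.column_stack([scores_g, scores_r]).min(axis=1).filled(0)
```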

Contributor Author
Done


# Initialize the final score array with zeros.
# This handles the case where data in both filters is NaN by default.
final_scores = np.zeros_like(scores_g, dtype=np.float64)
Contributor
@matwey Sep 24, 2025
I am not sure that zero is a good default anomaly score. We should be able to somehow distinguish between a true zero score and an unknown score. Note that it is not guaranteed that anomalies will even have negative scores; anomalies merely have scores lower than those of nominal objects.

@pruzhinskaya and I decided to return NaN instead to mark an unknown score.
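A minimal sketch of this NaN-as-unknown convention (the array values are hypothetical):

```python
# NaN marks "score could not be computed"; 0.0 remains a legitimate,
# distinguishable score value.
import numpy as np

final_scores = np.full(3, np.nan)  # unknown by default
final_scores[0] = 0.0              # a true zero score, distinct from unknown

known = ~np.isnan(final_scores)    # mask of objects with a computed score
```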

Contributor
@JulienPeloton currently, we know that we are unable to calculate the score for some objects. From the broker point of view, does it make any sense to filter such objects before they enter this module? Are you happy with NaN as an indicator of an unknown anomaly score? Or is there another option?

Member
Yes, it makes sense to filter such objects before they enter the module. I'm OK with returning a default value that indicates the module was unable to calculate the score. NaN is a good indicator.

Contributor Author
Done

@JulienPeloton
Member

Thanks @Knispel2 for the code, and @matwey for the review!

I have one question to confirm. In the current release, models are about 50MB in size:

ls -lth fink_science/data/models/anomaly_detection/
total 258M
-rw-rw-r-- 1 peloton  51M Jul 22 10:58 anomaly_detection_forest_AAD_julien.zip
-rw-rw-r-- 1 peloton  51M Jul 22 10:58 anomaly_detection_forest_AAD_emille_30days.zip
-rw-rw-r-- 1 peloton  52M Jul 22 10:58 anomaly_detection_forest_AAD_emille.zip
-rw-rw-r-- 1 peloton  51M Jul 22 10:58 anomaly_detection_forest_AAD_anais.zip
-rw-rw-r-- 1 peloton  52M Jul 22 10:58 anomaly_detection_forest_AAD.zip
-rw-rw-r-- 1 peloton 2.4M Jul  4 07:47 anomaly_detection_forest_AAD_beta.zip
-rw-rw-r-- 1 peloton  660 Feb  9  2025 g_means.csv
-rw-rw-r-- 1 peloton  665 Feb  9  2025 r_means.csv
-rw-rw-r-- 1 peloton 103K Dec 20  2024 anomaly_detection_forest_AAD_maria.zip

while in this PR, model sizes shrink to a couple of MB:

ls -lth fink_science/data/models/anomaly_detection/
total 11M
-rw-rw-r-- 1 peloton 1.7M Oct 17 14:49 anomaly_detection_forest_AAD_julien.zip
-rw-rw-r-- 1 peloton 1.7M Oct 17 14:49 anomaly_detection_forest_AAD_emille_30days.zip
-rw-rw-r-- 1 peloton 1.7M Oct 17 14:49 anomaly_detection_forest_AAD_emille.zip
-rw-rw-r-- 1 peloton 1.7M Oct 17 14:49 anomaly_detection_forest_AAD_anais.zip
-rw-rw-r-- 1 peloton 1.7M Oct 17 14:49 anomaly_detection_forest_AAD.zip
-rw-rw-r-- 1 peloton 2.4M Jul  4 07:47 anomaly_detection_forest_AAD_beta.zip
-rw-rw-r-- 1 peloton  660 Feb  9  2025 g_means.csv
-rw-rw-r-- 1 peloton  665 Feb  9  2025 r_means.csv
-rw-rw-r-- 1 peloton 103K Dec 20  2024 anomaly_detection_forest_AAD_maria.zip

Expected?

@Knispel2
Contributor Author

> I have one question to confirm. In the current release, models are about 50MB in size, while in this PR, model sizes shrink to a couple of MB. Expected?

In the previous version, we greatly increased the depth of the trees in the hope that this would have a positive effect. However, we didn't see anything good, so we returned to the standard depth.

@JulienPeloton
Member

Perfect @Knispel2 -- this makes sense!

@JulienPeloton JulienPeloton merged commit 75b36e0 into astrolabsoftware:master Oct 17, 2025
4 checks passed