-
Notifications
You must be signed in to change notification settings - Fork 43
Prevent duplicates in the result bdf of simplify_index
#1385
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
simplify_index
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR explicitly raises an error in case the result contains duplicate indices, which is good, because that should always be an unexpected result.
You also added functionality for whoever is calling this function to clean up duplicates, which is nice, but I feel it's too much responsibility for this function. I suggest to simply raise instead:
if bdf.index.duplicated().any():
logging.debug(f"bdf with duplicates: {bdf}")
raise ValueError("Duplicates found in index after processing.")
and have the code that calls this function deal with filtering by e.g. a specific source.
Preferably, though, the error message in this PR should contain information about the reason for the duplicates, which will help with debugging what went wrong. It could happen because of three things: multiple cumulative probability values per event, multiple belief times/horizons per event, or multiple sources per event. When we still have the BeliefsDataFrame
, we could use check for these cases, using, respectively:
if bdf.lineage.number_of_events == len(bdf): # we won't end up with duplicate indices -> reindex
elif bdf.lineage.number_of_beliefs < len(bdf): # points to probabilistic beliefs -> raise informatively
elif bdf.lineage.number_of_events < bdf.lineage.number_of_beliefs and bdf.lineage.number_of_sources == 1: # points to multiple belief times/horizons per event -> raise informatively
elif bdf.lineage.number_of_events < bdf.lineage.number_of_beliefs and bdf.lineage.number_of_sources > 1: # points (most likely) to multiple sources per event (but theoretically could still be a case of multiple belief times/horizons per event, in combination with a switch from one source to another) -> raise informatively
flexmeasures/data/queries/utils.py
Outdated
bdf: tb.BeliefsDataFrame, | ||
index_levels_to_columns: list[str] | None = None, | ||
keep_duplicate_value: str | None = None, | ||
keep_duplicate_column: GenericAsset | GenericAssetType | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be a str | None
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed it as you suggested here
flexmeasures/data/queries/utils.py
Outdated
bdf: tb.BeliefsDataFrame, index_levels_to_columns: list[str] | None = None | ||
bdf: tb.BeliefsDataFrame, | ||
index_levels_to_columns: list[str] | None = None, | ||
keep_duplicate_value: str | None = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you are using this new functionality solely to filter by source. It's presented here as if it could be useful for any available column (i.e. "event_value"
and the ones passed in index_levels_to_columns
), but I don't think that is the case.
In any case, none of the values in any of the columns is a str
type, so Any
would be more fitting.
…ns. To handle duplicate where simplify_index method is used
@@ -241,6 +251,10 @@ def simplify_index( | |||
else: | |||
raise KeyError(f"Level {col} not found") | |||
bdf.index = bdf.index.get_level_values("event_start") | |||
if bdf.index.duplicated().any(): | |||
logging.debug(f"bdf with duplicates: {bdf}") | |||
# raise ValueError("Duplicates found in index after processing.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't raise a ValueError
here because duplicate removal is handled by the caller. Logging a message should be sufficient right?.
Description
The function
simplify_index
does not prevent duplicates .Changes:
Added two optional parameters to
simplify_index
to extend its functionality without breaking existing code that calls the function:keep_duplicate_column
:Specifies the column to filter by.keep_duplicate_value
: The value in the specified column to keep.Ensured duplicates are either removed (if possible) or an exception is raised.
How to Test
Steps:
0. Make sure you have a
pandas version >= 2.2.1
fix/keep-beliefs-from-Simulation-source-in-evse-power-sensor
branch, as it includes adaptations for this change.Related Items
This PR closes #750 and enables the merging of future PR in smart-buildings.