@hanachaari hanachaari commented Mar 22, 2025

Description

The function simplify_index does not prevent duplicates in the index of its result.
Changes:

  • Added two optional parameters to simplify_index to extend its functionality without breaking existing code that calls the function:

  • keep_duplicate_column: Specifies the column to filter by.

  • keep_duplicate_value: The value in the specified column to keep.

  • Ensured duplicates are either removed (if possible) or an exception is raised.

How to Test

Steps:
  0. Make sure your pandas version is >= 2.2.1.

  1. Use the fix/keep-beliefs-from-Simulation-source-in-evse-power-sensor branch, as it includes adaptations for this change.
  2. Select a project, complete the setup phase, and run the smart charging scenario.
  3. Ensure that no warning related to fill_null_values appears, unlike the previous issue:

[image: screenshot from the previous issue]

Related Items

This PR closes #750 and enables the merging of future PRs in smart-buildings.

@hanachaari hanachaari changed the title cherry-pick commit 1bbc7513 ee2f1a0a from fix/prevent-duplicates-afte… Prevent duplicates in the result bdf of simplify_index Mar 22, 2025
@hanachaari hanachaari requested a review from Flix6x March 22, 2025 07:05
Contributor

@Flix6x Flix6x left a comment

This PR explicitly raises an error in case the result contains duplicate indices, which is good, because that should always be an unexpected result.

You also added functionality that lets callers of this function clean up duplicates, which is nice, but I feel it's too much responsibility for this function. I suggest simply raising instead:

if bdf.index.duplicated().any():
    logging.debug(f"bdf with duplicates: {bdf}")
    raise ValueError("Duplicates found in index after processing.")

and have the code that calls this function deal with filtering by e.g. a specific source.

Preferably, though, the error message in this PR should contain information about the reason for the duplicates, which will help with debugging what went wrong. Duplicates can arise for three reasons: multiple cumulative probability values per event, multiple belief times/horizons per event, or multiple sources per event. While we still have the BeliefsDataFrame, we could check for these cases (after first ruling out the case without duplicates) as follows:

  1. if bdf.lineage.number_of_events == len(bdf): # we won't end up with duplicate indices -> reindex
  2. elif bdf.lineage.number_of_beliefs < len(bdf): # points to probabilistic beliefs -> raise informatively
  3. elif bdf.lineage.number_of_events < bdf.lineage.number_of_beliefs and bdf.lineage.number_of_sources == 1: # points to multiple belief times/horizons per event -> raise informatively
  4. elif bdf.lineage.number_of_events < bdf.lineage.number_of_beliefs and bdf.lineage.number_of_sources > 1: # points (most likely) to multiple sources per event (but theoretically could still be a case of multiple belief times/horizons per event, in combination with a switch from one source to another) -> raise informatively
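
The checks listed above can be sketched as a small helper. This is only an illustration under stated assumptions: the function name explain_duplicates and its message texts are hypothetical, and it takes plain integers rather than a BeliefsDataFrame so that it runs without timely_beliefs installed. Inside simplify_index, the counts would presumably come from len(bdf) and the bdf.lineage attributes named above.

```python
from __future__ import annotations


def explain_duplicates(
    n_rows: int, n_events: int, n_beliefs: int, n_sources: int
) -> str | None:
    """Return None if reindexing by event is safe, else the likely cause of duplicates.

    Hypothetical helper: the arguments stand in for len(bdf),
    bdf.lineage.number_of_events, bdf.lineage.number_of_beliefs and
    bdf.lineage.number_of_sources, in that order.
    """
    if n_events == n_rows:
        # One row per event: no duplicate indices after reindexing.
        return None
    if n_beliefs < n_rows:
        # More rows than beliefs points to probabilistic beliefs.
        return "multiple cumulative probability values per event"
    if n_events < n_beliefs and n_sources == 1:
        return "multiple belief times/horizons per event"
    if n_events < n_beliefs and n_sources > 1:
        # Could in theory still be multiple belief times/horizons
        # combined with a switch from one source to another.
        return "most likely multiple sources per event"
    return "unknown cause"


def raise_on_duplicates(
    n_rows: int, n_events: int, n_beliefs: int, n_sources: int
) -> None:
    """Raise a ValueError that names the likely cause, if there is one."""
    cause = explain_duplicates(n_rows, n_events, n_beliefs, n_sources)
    if cause is not None:
        raise ValueError(f"Duplicates found in index after processing: {cause}.")
```

With a helper like this, the caller still decides how to filter (for example by a specific source), but the error message already indicates which kind of filtering is needed.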

bdf: tb.BeliefsDataFrame,
index_levels_to_columns: list[str] | None = None,
keep_duplicate_value: str | None = None,
keep_duplicate_column: GenericAsset | GenericAssetType | None = None,
Contributor

Shouldn't this be a str | None?

Author

I removed it as you suggested here

bdf: tb.BeliefsDataFrame, index_levels_to_columns: list[str] | None = None
bdf: tb.BeliefsDataFrame,
index_levels_to_columns: list[str] | None = None,
keep_duplicate_value: str | None = None,
Contributor

I think you are using this new functionality solely to filter by source. It's presented here as if it could be useful for any available column (i.e. "event_value" and the ones passed in index_levels_to_columns), but I don't think that is the case.

In any case, none of the values in any of the columns is a str type, so Any would be more fitting.

…ns. To handle duplicate where simplify_index method is used
@@ -241,6 +251,10 @@ def simplify_index(
        else:
            raise KeyError(f"Level {col} not found")
    bdf.index = bdf.index.get_level_values("event_start")
    if bdf.index.duplicated().any():
        logging.debug(f"bdf with duplicates: {bdf}")
        # raise ValueError("Duplicates found in index after processing.")
Author

I didn't raise a ValueError here because duplicate removal is handled by the caller. Logging a message should be sufficient, right?
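
One consideration for this question is whether the debug message would be seen at all. The sketch below uses only the standard logging module and assumes the logger is left at the WARNING threshold, which is the root logger's default. The project's actual logging configuration is not shown in this thread, so that is an assumption, and the logger name and ListHandler class are hypothetical.

```python
import logging


class ListHandler(logging.Handler):
    """Collects emitted log messages in a list so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


logger = logging.getLogger("simplify_index_demo")  # hypothetical logger name
logger.propagate = False  # keep the demo isolated from the root logger
handler = ListHandler()
logger.addHandler(handler)

# Assumed threshold: WARNING, the same as the root logger's default.
logger.setLevel(logging.WARNING)
logger.debug("bdf with duplicates: ...")
seen_at_warning = list(handler.messages)  # the debug record is dropped

# Only after lowering the threshold does the message get through.
logger.setLevel(logging.DEBUG)
logger.debug("bdf with duplicates: ...")
seen_at_debug = list(handler.messages)
```

Under that assumption, a caller that forgets to remove duplicates would get no visible signal from the debug message alone, whereas a raised ValueError cannot be missed.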

@hanachaari hanachaari requested a review from Flix6x May 14, 2025 13:38