Trailing Slash in create_dataset output_path causes issues with joining results #4569

@mkkatica

Describe the bug
When a pandas DataFrame is used as the base for a Feature Store dataset, a trailing slash in the output_path parameter causes objects to be uploaded to S3 with a double slash in their path. This conflicts with the definition of the temp table, which uses the correct location without the double slash.
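
A minimal sketch of how such a mismatch could arise. It assumes the upload URI is built by naive string concatenation while the table location is normalized. The helper names, the bucket, and the "run-001" subfolder are hypothetical and are not taken from the SDK's source:

```python
# Hypothetical helpers: the SDK's real internals are not shown in this issue.
def upload_uri(output_path: str, run_id: str, file_name: str) -> str:
    # Naive join: always inserts "/" even if output_path already ends with one.
    return f"{output_path}/{run_id}/{file_name}"


def table_location(output_path: str, run_id: str) -> str:
    # Assumed to normalize the prefix, matching "the correct location" in the report.
    return f"{output_path.rstrip('/')}/{run_id}"


output_path = "s3://example-bucket/fs-output/"  # trailing slash, as in the report
uploaded = upload_uri(output_path, "run-001", "base.csv")
location = table_location(output_path, "run-001")

print(uploaded)  # s3://example-bucket/fs-output//run-001/base.csv
print(location)  # s3://example-bucket/fs-output/run-001
# S3 keys are literal strings, so the uploaded object is not under the table's prefix.
print(uploaded.startswith(location + "/"))  # False
```

If the real code paths behave like this, the temp table would point at a prefix that contains no objects, which would be consistent with the empty result described below.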

To reproduce
Try to create a dataset with code similar to:

df = feature_store.create_dataset(
    base=base_df,
    output_path="s3://sagemaker-us-east-1-<acct num>/fs-output/",  # <-- trailing slash triggers the bug
    event_time_identifier_feature_name="index_dt",
    record_identifier_feature_name="id",
).with_feature_group(
    feature_group, "id", ["my_feature_1"]
).to_dataframe()

The returned dataset is empty, even when rows of the base dataframe have valid inner join matches in the feature group.

Expected behavior
The returned dataset includes all rows that matched the inner join. Ideally, the SDK would drop a trailing slash from output_path, if one is specified, before building the S3 upload path.
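
Until a fix lands, one possible workaround is to normalize the path before passing it in. This is a sketch only: strip_trailing_slash is a hypothetical helper, not part of the SageMaker SDK, and the bucket name is a placeholder.

```python
def strip_trailing_slash(output_path: str) -> str:
    """Remove trailing slashes from an S3 URI without touching the scheme."""
    scheme, sep, rest = output_path.partition("://")
    if not sep:
        # No scheme present; treat the whole string as a plain prefix.
        return output_path.rstrip("/")
    return f"{scheme}{sep}{rest.rstrip('/')}"


print(strip_trailing_slash("s3://example-bucket/fs-output/"))
print(strip_trailing_slash("s3://example-bucket/fs-output"))
```

The cleaned value could then be passed as output_path in the create_dataset call from the reproduction above.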

Screenshots or logs
Screenshot 2024-04-09 at 5 07 09 PM
Screenshot 2024-04-09 at 5 07 26 PM
Screenshot 2024-04-09 at 5 07 51 PM

System information
A description of your system. Please provide:

  • SageMaker Python SDK version: 2.214.3
  • Framework name (eg. PyTorch) or algorithm (eg. KMeans): N/A
  • Framework version: N/A
  • Python version: 3.11.6
  • CPU or GPU: Apple M2
  • Custom Docker image (Y/N): N/A

Additional context
I'll submit a fix PR, just opening an issue to track.
