
[WIP][SQL] Clarify schema mismatch types in insertInto error #51446


Open · wants to merge 5 commits into master from fix-insert-schema-error

Conversation


@a1noh a1noh commented Jul 10, 2025

What changes were proposed in this pull request?

This PR improves error reporting for INSERT INTO operations when there is a schema mismatch between the input DataFrame and the target table. Specifically, it fixes cases where Spark misreports the type of the DataFrame column in the exception raised for a type mismatch.

Previously, the exception message did not clearly reflect the actual types involved, sometimes implying that the DataFrame column had a different type than it actually did. This patch ensures the reported types are correct and produces a clearer message that includes (see the illustrative sketch after this list):

  • The name of the column
  • The type in the DataFrame
  • The type in the target table
  • Whether the columns were matched by position
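
For illustration, here is a minimal Scala sketch of how such a message could be assembled from the source (DataFrame) and target (table) fields. This is not the actual patch; the helper name, the byPosition flag, and the exact wording are assumptions:

import org.apache.spark.sql.types.StructField

// Illustrative only: build the proposed message from the DataFrame (source)
// field and the table (target) field. `byPosition` marks positional matching.
def mismatchMessage(src: StructField, tgt: StructField, byPosition: Boolean): String = {
  s"""InsertInto schema mismatch at column '${tgt.name}':
     |  - DataFrame column has type ${src.dataType.simpleString}
     |  - Target table column '${tgt.name}' has type ${tgt.dataType.simpleString}
     |  - Columns matched by ${if (byPosition) "position" else "name"}""".stripMargin
}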

Why are the changes needed?

This change addresses confusing error reporting during schema mismatches in insert operations. It improves the developer experience by giving precise, helpful diagnostics, which is especially important when debugging complex ETL pipelines or schema evolution issues.

Without this fix, developers may misinterpret the root cause of an error due to incorrect or vague type information in the exception message.

Does this PR introduce any user-facing change?

Yes.

This PR changes the error message users see when they attempt to insert a DataFrame into a table with a mismatched schema. The functionality remains the same, but the error message is more descriptive and accurate. A small sketch for inspecting such a mismatch manually is shown after the examples below.

Before:
val df = Seq((2025, "Monaco GP")).toDF("race_year", "race_name") // race_year: INT
df.write.insertInto("target_table") // target_table expects race_year as STRING

Cannot safely cast 'race_year': string to int

After:
InsertInto schema mismatch at column 'race_year':

  • DataFrame column has type int
  • Target table column 'race_year' has type string
  • Columns matched by position
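
As a workaround while debugging, the same information can be obtained by comparing the two schemas before inserting. A minimal sketch, assuming an active SparkSession named spark, the df and target_table from the example above, and positional column matching:

val tableSchema = spark.table("target_table").schema
df.schema.fields.zip(tableSchema.fields).foreach { case (src, tgt) =>
  if (src.dataType != tgt.dataType) {
    println(s"Column '${tgt.name}': DataFrame has ${src.dataType.simpleString}, " +
      s"table expects ${tgt.dataType.simpleString}")
  }
}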

How was this patch tested?

  • Manually verified with a DataFrame insert that causes a schema mismatch between IntegerType and StringType, which previously misreported the input type.
  • Ensured existing test suites (sql/catalyst, sql/core) still pass (a sketch of a possible regression test follows this list).
    [SELF-TEST] InsertInto error message fix a1noh/spark#1
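
A regression test along these lines could also lock in the new message. This is only a sketch, assuming Spark's usual SQL test utilities (withTable, sql, intercept) and that an AnalysisException is thrown; the exact exception type and wording depend on the final implementation:

test("insertInto reports DataFrame and table types on schema mismatch") {
  withTable("target_table") {
    sql("CREATE TABLE target_table (race_year STRING, race_name STRING) USING parquet")
    val df = Seq((2025, "Monaco GP")).toDF("race_year", "race_name")
    val e = intercept[org.apache.spark.sql.AnalysisException] {
      df.write.insertInto("target_table")
    }
    assert(e.getMessage.contains("race_year"))
    assert(e.getMessage.contains("int"))
    assert(e.getMessage.contains("string"))
  }
}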

@github-actions github-actions bot added the SQL label Jul 10, 2025
@a1noh a1noh changed the title [SQL] Clarify schema mismatch types in insertInto error [WIP][SQL] Clarify schema mismatch types in insertInto error Jul 10, 2025
@github-actions github-actions bot added the DOCS label Jul 10, 2025
@a1noh a1noh force-pushed the fix-insert-schema-error branch from 899d0cf to ea74e90 on July 10, 2025 19:38