Auto-detection of nullable columns in local.order_by incorrectly applies allow_nullable_key to Distributed table instead of local tables

                                                                                                                                                                                                                                                                                                                       
  When creating a Distributed table with local.order_by containing nullable columns, the connector's auto-detection logic adds settings.allow_nullable_key=1 to the Distributed table instead of the local MergeTree tables, causing a failure because Distributed tables don't support this setting.                  
                                                                                                                                                                                                                                                                                                                       
  Environment:                                                                                                                                                                                                                                                                                                         
  - ClickHouse version: 24.8.14.1                                                                                                                                                                                                                                                                                      
  - Spark connector version: 0.10.0                                                                                                                                                                                                                                                                                    
  - Spark version: 3.5.6                                                                                                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                       
  Steps to Reproduce:                                                                                                                                                                                                                                                                                                  
```python
from pyspark.sql import SparkSession

# Parentheses let the builder chain span multiple lines without a syntax error
spark = (
    SparkSession.builder
    .config("spark.jars.packages", "com.clickhouse.spark:clickhouse-spark-runtime-3.5_2.12:0.10.0,com.clickhouse:clickhouse-jdbc:0.9.4::all")
    .getOrCreate()
)

# Create DataFrame with nullable string columns
data = [
    ("v1", "v2"),
    ("v3", "v4"),
]
df = spark.createDataFrame(data, ["col1", "col2"])

# Write with Distributed engine and local.order_by containing nullable columns
(
    df.write.format("clickhouse")
    .option("host", "clickhouse-host")
    .option("database", "test_db")
    .option("table", "test_table")
    .option("cluster", "cluster1shards")
    .option("engine", "Distributed")
    .option("local.order_by", "col1")
    .option("local.settings.allow_nullable_key", "1")
    .save()
)
```
                                                                                                                                                                                                                                                                                                                       
  Expected Behavior:                                                                                                                                                                                                                                                                                                   
                                                                                                                                                                                                                                                                                                                       
  The connector should:                                                                                                                                                                                                                                                                                                
  1. Detect nullable columns in local.order_by                                                                                                                                                                                                                                                                         
  2. Apply settings.allow_nullable_key=1 to the local MergeTree tables being created on cluster nodes                                                                                                                                                                                                                  
  3. Successfully create both local and distributed tables                                                                                                                                                                                                                                                             
                                                                                                                                                                                                                                                                                                                       
  Actual Behavior:                                                                                                                                                                                                                                                                                                     
                                                                                                                                                                                                                                                                                                                       
  The connector logs show:                                                                                                                                                                                                                                                                                             
  INFO ClickHouseTableProvider: ORDER BY contains nullable columns, adding settings.allow_nullable_key=1                                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                                       
  But the setting is applied to the Distributed table instead of the local tables, causing:                                                                                                                                                                                                                            
 ```                                                                                                                                                                                                                                                                                                                      
  com.clickhouse.spark.exception.CHServerException: Code: 115. DB::Exception:                                                                                                                                                                                                                                          
  Unknown setting 'allow_nullable_key': for storage Distributed. (UNKNOWN_SETTING)                                                                                                                                                                                                                                     
  ```                                                                                                                                                                                                                                                                                                                 
  Root Cause:                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                       
  The ClickHouseTableProvider detects nullable columns from the local.order_by parameter but adds the setting globally (without the local. prefix), so it is applied when creating the Distributed table rather than only when creating the local tables.

  The issue appears to be in this function: https://github.com/ClickHouse/spark-clickhouse-connector/blob/a1d8b7b32cae27e2fe38133c10ce10613b824013/spark-3.5/clickhouse-spark/src/main/scala/com/clickhouse/spark/ClickHouseTableProvider.scala#L160
                                                                                                                                                                                                                                                                                                                       
  Suggested Fix:                                                                                                                                                                                                                                                                                                       
                                                                                                                                                                                                                                                                                                                       
  When engine="Distributed" is detected and nullable columns are found in local.order_by, the auto-added setting should be prefixed with local. (i.e., local.settings.allow_nullable_key=1) so it only applies to the local MergeTree tables.                                                                          
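
  The suggested behavior can be sketched as follows. This is only a Python model of the option handling, not the connector's code (which is Scala). The function name `add_nullable_key_setting`, the plain-dict representation of the write options, and the comma-split parsing of the ORDER BY value are all assumptions made for illustration.

```python
def add_nullable_key_setting(options, nullable_columns):
    """Return a copy of `options` with allow_nullable_key added next to the ORDER BY.

    Hypothetical helper: `options` is a dict of writer options and
    `nullable_columns` is the set of nullable column names in the schema.
    """
    is_distributed = options.get("engine", "").lower() == "distributed"
    # For a Distributed table, the sort key and its settings belong to the
    # local MergeTree tables, so both option names carry the "local." prefix.
    prefix = "local." if is_distributed else ""
    order_by = options.get(prefix + "order_by", "")
    # Simplified parsing: assumes a plain comma-separated list of column names.
    key_columns = [c.strip() for c in order_by.split(",") if c.strip()]

    result = dict(options)
    if any(c in nullable_columns for c in key_columns):
        # setdefault keeps a value the user already supplied explicitly.
        result.setdefault(prefix + "settings.allow_nullable_key", "1")
    return result


opts = add_nullable_key_setting(
    {"engine": "Distributed", "local.order_by": "col1"},
    nullable_columns={"col1", "col2"},
)
```

  With the options from the reproduction, the sketch adds only local.settings.allow_nullable_key, so nothing unsupported would reach the Distributed table definition.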
                                                                                                                                                                                                                                                                                                                       
  Workaround:                                                                                                                                                                                                                                                                                                          
                                                                                                                                                                                                                                                                                                                       
  Currently, the only workaround is to ensure all columns in local.order_by are non-nullable by casting or coalescing nulls before writing.     
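
  A minimal sketch of the coalescing workaround, assuming the connector takes nullability from the DataFrame schema and that an empty string is an acceptable stand-in for NULL in the key column. The helper `non_null_key_exprs` is hypothetical; it only builds Spark SQL expressions that can be passed to `DataFrame.selectExpr`.

```python
def non_null_key_exprs(all_columns, key_columns, default=""):
    """Build select expressions that coalesce the key columns to `default`.

    Hypothetical helper: returns Spark SQL expression strings, one per column,
    in the same order as `all_columns`.
    """
    keys = set(key_columns)
    # Escape backslashes and single quotes for a Spark SQL string literal.
    literal = "'" + default.replace("\\", "\\\\").replace("'", "\\'") + "'"
    exprs = []
    for name in all_columns:
        # Backtick-quote identifiers; a literal backtick is doubled.
        quoted = "`" + name.replace("`", "``") + "`"
        if name in keys:
            exprs.append(f"coalesce({quoted}, {literal}) AS {quoted}")
        else:
            exprs.append(quoted)
    return exprs


exprs = non_null_key_exprs(["col1", "col2"], ["col1"])
# With the DataFrame from the reproduction:
#   df_fixed = df.selectExpr(*exprs)
#   df_fixed.printSchema()  # check that col1 is no longer reported as nullable
```

  Writing `df_fixed` instead of `df` should keep the auto-detection from triggering, provided the schema check confirms that col1 is no longer nullable.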

Auto-detection of nullable columns in local.order_by incorrectly applies allow_nullable_key to Distributed table instead of local tables #520