Support text types in sklearn predictors #39

@popescu-v

Description

Khiops 11 supports Text columns, which receive a specialized AutoML treatment, as opposed to normal strings (Categorical). The sklearn predictors should also support this type.

Questions/Ideas

  • Do pandas and/or numpy have a specialized Text type?
  • Option 1: Implement it as a Dataset property
    • Either add an optional field text_columns to the table specification tuple, holding the names of the text columns, or
    • add another field table_text_columns to the spec, indexed by table name, whose values are the names of the text columns (I prefer this one)
    • When creating the dictionary, the Dataset object will have all the necessary info to add the specified columns as Text. Note: the Dataset API is not exposed.
  • Option 2: Implement it as a fit parameter
    • As above, add table_text_columns, but as an optional fit parameter
      • It works, but the fact that a column is a Text is part of the description of the dataset
      • This parameter should be fed to the dictionary creation routine. Note: the Dataset API is not exposed.
  • [later edit: 2025/08/01] Option 3: Add 2 extra parameters n_text_features and text_feature_type to:
    • Option 3.1: the KhiopsPredictor estimator initializer (__init__ method)
    • Option 3.2: the KhiopsPredictor's fit method
    • Note: text_columns needs to be passed as well:
      • either to the KhiopsPredictor initializer
      • or to the estimator's fit method.
    • Note 2:
  • Expose the Dataset API; it will have two init patterns:
    • Big Constructor
    • Builder pattern
  • The big constructor uses the builder
  • The dict interface should be maintained. Note: the Dataset API is not exposed.
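
To make the table_text_columns idea from Option 1 more concrete, here is a minimal sketch of how a dict spec could carry the text column names and how a dictionary creation step might read them. Everything except the name table_text_columns is an assumption made for illustration: the spec layout ("main_table", and "tables" mapping each table name to a (DataFrame, key) tuple) only imitates a dict interface, and infer_column_types is a hypothetical helper, not part of khiops-python.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def infer_column_types(spec):
    """Hypothetical helper: map every column of every table to a Khiops type name.

    Columns listed under "table_text_columns" become "Text"; the remaining
    columns are "Numerical" when their dtype is numeric, else "Categorical".
    """
    text_columns = spec.get("table_text_columns", {})
    column_types = {}
    for table_name, (frame, _key) in spec["tables"].items():
        declared = set(text_columns.get(table_name, []))
        # Fail early if the user declares a text column that does not exist.
        missing = declared - set(frame.columns)
        if missing:
            raise ValueError(
                f"Unknown text columns for table '{table_name}': {sorted(missing)}"
            )
        table_types = {}
        for column in frame.columns:
            if column in declared:
                table_types[column] = "Text"
            elif is_numeric_dtype(frame[column]):
                table_types[column] = "Numerical"
            else:
                table_types[column] = "Categorical"
        column_types[table_name] = table_types
    return column_types


# Illustrative data: "country" is an ordinary string column, "comment" is free text.
reviews = pd.DataFrame(
    {
        "id": [1, 2],
        "stars": [5, 2],
        "country": ["FR", "RO"],
        "comment": ["Great product, fast delivery", "Stopped working after a week"],
    }
)

# Assumed spec layout; only "table_text_columns" comes from the proposal above.
spec = {
    "main_table": "reviews",
    "tables": {"reviews": (reviews, "id")},
    "table_text_columns": {"reviews": ["comment"]},
}

column_types = infer_column_types(spec)
```

Without the explicit declaration, "country" and "comment" would be indistinguishable by dtype, which is why the text columns have to be named somewhere. The same mapping could be passed as a fit parameter (Option 2) instead of living in the spec; the helper would not change.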

Metadata

Assignees

Labels

Priority/0-High (To do now)
Status/ReadyForDev (The issue is ready to be developed or to be investigated deeply)

Projects

No projects

Relationships

None yet

Development

No branches or pull requests
