Open
Labels: Priority/0-High (To do now); Status/ReadyForDev (The issue is ready to be developed or to be investigated deeply)
Description
Khiops 11 supports `Text` columns, which receive a specialized AutoML treatment as opposed to normal strings (`Categorical`). Sklearn predictors should also support this type.
Questions/Ideas
- Do pandas and/or numpy have a specialized `Text` type?
- Option 1: Implement it as a `Dataset` property
  - The table specification tuple should have an optional field `text_columns` with the names of the text fields, or
  - Add another field to the spec, `table_text_columns`, indexed by the table name and whose values are the names of the text columns (I prefer this one)
  - When creating the dictionary, the `Dataset` object will have all the necessary info to add the specified columns as `Text`
  - The Dataset API is not exposed.
- Option 2: Implement it as a `fit` parameter
  - As above, add `table_text_columns`, but as an optional `fit` parameter
  - It works, but the fact that a column is `Text` is part of the description of the dataset
  - This parameter should be fed to the dictionary creation routine
  - The Dataset API is not exposed.
- [later edit: 2025/08/01] Option 3: Add 2 extra parameters, `n_text_features` and `text_feature_type`, to:
  - Option 3.1: the `KhiopsPredictor` estimator initializer (`__init__` method)
  - Option 3.2: the `KhiopsPredictor`'s `fit` method
  - Note: `text_columns` needs to be passed as well:
    - either to the `KhiopsPredictor` initializer
    - or to the estimator's `fit` method.
  - Note 2:
    - The Pandas `StringDtype` should be used for columns whose Khiops type is `Text`; see https://pandas.pydata.org/docs/user_guide/text.html#working-with-text-data.
    - Lists of Pandas `StringDtype` values should be used for `TextList` (to clarify).
- Expose the Dataset API; it will have two init patterns:
  - Big constructor
  - Builder pattern
  - The big constructor uses the builder
  - And the `dict` interface should be maintained
  - The Dataset API is not exposed.
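To make the options above a bit more concrete, here is a minimal sketch of how text columns could be identified on the pandas side, combining an explicit `text_columns` list with detection of the pandas string extension dtype. The helper `infer_khiops_types` is hypothetical and not part of khiops-python; only the pandas calls are real. How pandas infers a dtype for plain Python strings may vary across pandas versions, so the example sets dtypes explicitly.

```python
import pandas as pd


def infer_khiops_types(df, text_columns=None):
    """Hypothetical helper: map each DataFrame column to a Khiops type name.

    A column is labeled "Text" if it is listed in `text_columns` or if it is
    stored with the pandas string extension dtype. Numeric columns are labeled
    "Numerical"; everything else falls back to "Categorical".
    """
    explicit_text = set(text_columns or [])
    unknown = explicit_text - set(df.columns)
    if unknown:
        raise ValueError(f"Unknown text columns: {sorted(unknown)}")

    types = {}
    for name in df.columns:
        dtype = df[name].dtype
        if name in explicit_text or isinstance(dtype, pd.StringDtype):
            types[name] = "Text"
        elif pd.api.types.is_numeric_dtype(dtype):
            types[name] = "Numerical"
        else:
            types[name] = "Categorical"
    return types


df = pd.DataFrame(
    {
        "age": [31, 45],
        # Explicit object dtype: a "normal" string column
        "city": pd.Series(["Paris", "Lyon"], dtype=object),
        # Explicit pandas string extension dtype
        "review": pd.array(["Great product", "Too slow"], dtype="string"),
        # Plain strings, declared as text through `text_columns` below
        "comment": ["n/a", "would buy again"],
    }
)

column_types = infer_khiops_types(df, text_columns=["comment"])
```

Whether such a mapping would live in the `Dataset` object (Option 1), be driven by a `fit` parameter (Option 2), or by estimator parameters (Option 3) is exactly the open question; the sketch only shows that both the explicit list and the dtype-based detection are cheap to compute from a `DataFrame`.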