Description
Separate model config into `estimator`/`model`, and improve inputs
Design
- `_default`/`_optional`:
  - `_default` defaults to `None` in any case (scalar/list/map)
  - `_default` can never be set to `None` explicitly
  - If the user sets `_default`, `_optional` is implicitly set to `True`
  - If a `*_COLUMN` type is anywhere in or under the type, a default is not an option
- For list and map types, `_min_count` is an option (and defaults to `0`)
- User-provided map types cannot have keys that start with `_`
- Denote cortex values with `@` throughout (i.e. all `input`, also e.g. `model: @dnn` in `api`); see the sketch after this list
- Add `training_input` to models
- Remove `type: classification` in model
- Rename "inputs" to "input"
- `input: INT` is supported
- Can't mix value/column types in a single type (e.g. `FLOAT_COLUMN|INT` is not allowed)
- `{arg: INT}`, `{STRING: INT}`, `{STRING: INT_COLUMN}`, `{STRING_COLUMN: INT}`, `{STRING_COLUMN: INT_COLUMN}`, `{STRING_COLUMN: [INT_COLUMN]}`, etc. are all supported; `{[STRING_COLUMN]: INT_COLUMN}` is not supported
- Cast `INT` -> `FLOAT` for inputs, and `INT` -> `FLOAT` / `INT_COLUMN` -> `FLOAT_COLUMN` for outputs
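For example, marking a cortex value with `@` in an `api` config might look like the following sketch (the `api` kind's exact field set is an assumption):

```yaml
- kind: api
  name: iris
  model: @iris-dnn  # "@" marks iris-dnn as a reference to a cortex resource
```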
Structure
```yaml
input: <TYPE>  # (short form)

input:         # (long form)
  _type: <TYPE>
  _default: <>
  ...
```
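For instance, a long form combining these options might look like the following sketch (the argument names are hypothetical):

```yaml
input:
  num_buckets:
    _type: INT
    _default: 10      # setting _default implicitly sets _optional: true
  cols:
    _type: [FLOAT_COLUMN]
    _min_count: 1     # only available for list/map types; defaults to 0
                      # no _default allowed here, since the type contains a *_COLUMN
```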
Examples
Aggregators
```yaml
# mean
- kind: aggregator
  name: mean
  output_type: FLOAT
  input: FLOAT_COLUMN|INT_COLUMN

- kind: aggregate
  name: sepal_length_mean
  aggregator: cortex.mean
  input: sepal_length

# bucket_boundaries
- kind: aggregator
  name: bucket_boundaries
  output_type: [FLOAT]
  input:
    col: FLOAT_COLUMN|INT_COLUMN
    num_buckets:
      _type: INT
      _default: 10

- kind: aggregate
  name: sepal_length_bucket
  aggregator: cortex.bucket_boundaries
  input:
    col: sepal_length
    num_buckets: 5
```
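To illustrate map types keyed by columns, a hypothetical user-defined aggregator (not one of the built-ins) could declare `{FLOAT_COLUMN: FLOAT}` and be invoked with a literal map:

```yaml
# weighted_mean (hypothetical, for illustration only)
- kind: aggregator
  name: weighted_mean
  output_type: FLOAT
  input: {FLOAT_COLUMN: FLOAT}

- kind: aggregate
  name: sepal_weighted_mean
  aggregator: weighted_mean
  input: {sepal_length: 0.25, sepal_width: 0.75}
```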
Transformers
```yaml
# normalize
- kind: transformer
  name: normalize
  output_type: FLOAT_COLUMN
  input:
    col: FLOAT_COLUMN|INT_COLUMN
    mean: INT|FLOAT
    stddev: INT|FLOAT

- kind: transformed_column
  name: sepal_length_normalized
  transformer: cortex.normalize
  input:
    col: sepal_length
    mean: sepal_length_mean
    stddev: sepal_length_stddev

# weight
- kind: transformer
  name: weight
  output_type: FLOAT_COLUMN
  input:
    col: INT_COLUMN
    class_distribution: {INT: FLOAT}

- kind: transformed_column
  name: weight_column
  transformer: weight
  input:
    col: class
    class_distribution: class_distribution
```
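Since typing (rather than separate keys) distinguishes values from columns, the map argument could presumably also be supplied as a literal instead of an aggregate reference; per the casting rules, an `INT` value would be cast to `FLOAT` where `FLOAT` is expected (the weights below are made up):

```yaml
- kind: transformed_column
  name: weight_column
  transformer: weight
  input:
    col: class
    class_distribution: {0: 1, 1: 13.5}  # literal {INT: FLOAT}; the INT value 1 is cast to FLOAT
```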
Models
```yaml
# iris
- kind: estimator
  name: dnn
  target_column: INT_COLUMN
  input:
    normalized_columns: [INT_COLUMN|FLOAT_COLUMN]
    num_classes: INT
  hparams:
    hidden_layers: [INT]
    learning_rate:
      _type: FLOAT
      _default: 0.01

- kind: model
  name: iris-dnn
  estimator: dnn
  target_column: class_indexed
  input:
    normalized_columns:
      - sepal_length_normalized
      - sepal_width_normalized
      - petal_length_normalized
      - petal_width_normalized
    num_classes: 2
  hparams:
    hidden_layers: [4, 4]
  data_partition_ratio: ...
  training: ...

# Fraud
- kind: estimator
  name: dnn
  target_column: INT_COLUMN
  input:
    normalized_columns: [INT_COLUMN|FLOAT_COLUMN]
  training_input:
    weight_column: FLOAT_COLUMN

- kind: model
  name: fraud-dnn
  estimator: dnn
  target_column: class
  input:
    normalized_columns: [time_normalized, v1_normalized, ...]
  training_input:
    weight_column: weights

# Insurance
- kind: estimator
  name: dnn
  target_column: FLOAT_COLUMN
  input:
    categorical:
      _type:
        - feature: INT_COLUMN
          categories: [STRING]
      _min_count: 1  # This wouldn't actually be necessary, it's just to demonstrate
    bucketized:
      - feature: INT_COLUMN|FLOAT_COLUMN
        buckets: [INT]

- kind: model
  name: insurance-dnn
  estimator: dnn
  target_column: charges_normalized
  input:
    categorical:
      - feature: gender
        categories: ["female", "male"]
      - feature: smoker
        categories: ["yes", "no"]
      - feature: region
        categories: ["northwest", "northeast", "southwest", "southeast"]
      - feature: children
        categories: children_set
    bucketized:
      - feature: age
        buckets: [15, 20, 25, 35, 40, 45, 50, 55, 60, 65]
      - feature: bmi
        buckets: [15, 20, 25, 35, 40, 45, 50, 55]

# Poker
- kind: estimator
  name: dnn
  target_column: INT_COLUMN
  input:
    suit_columns: [INT_COLUMN]
    rank_columns: [INT_COLUMN]

- kind: model
  name: poker-dnn
  estimator: dnn
  target_column: class
  input:
    suit_columns: [card_1_suit, card_2_suit, card_3_suit, card_4_suit, card_5_suit]
    rank_columns: [card_1_rank, card_2_rank, card_3_rank, card_4_rank, card_5_rank]

# Mnist
- kind: estimator
  name: t2t
  target_column: INT_COLUMN
  input: FLOAT_LIST_COLUMN
  prediction_key: outputs

- kind: model
  name: mnist-t2t
  estimator: t2t
  target_column: label
  input: image_pixels
```
TensorFlow feature_column examples
For reference, review `tf.feature_column`.
```python
import tensorflow as tf

feature_columns = [
    # one-hot encoded categorical vocabulary column (insurance "smoker")
    tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list("smoker", ["yes", "no"])
    ),
    # numeric column bucketized at explicit boundaries (insurance "age")
    tf.feature_column.bucketized_column(
        tf.feature_column.numeric_column("age"), [15, 20, 25, 35, 40, 45, 50, 55, 60, 65]
    ),
    tf.feature_column.numeric_column("sepal_length_normalized"),  # plain numeric column (iris)
    tf.feature_column.numeric_column("image_pixels", shape=model_config["hparams"]["input_shape"]),  # mnist
]

weight_column = "class_weight"  # weighted training (fraud)
```
```python
# cloudml-template
categorical_columns_with_identity = {
    item[0]: tf.feature_column.categorical_column_with_identity(item[0], item[1])
    for item in categorical_feature_names_with_identity.items()
}
categorical_columns_with_vocabulary = {
    item[0]: tf.feature_column.categorical_column_with_vocabulary_list(item[0], item[1])
    for item in metadata.INPUT_CATEGORICAL_FEATURE_NAMES_WITH_VOCABULARY.items()
}
categorical_columns_with_hash_bucket = {
    item[0]: tf.feature_column.categorical_column_with_hash_bucket(item[0], item[1])
    for item in metadata.INPUT_CATEGORICAL_FEATURE_NAMES_WITH_HASH_BUCKET.items()
}

age_buckets = tf.feature_column.bucketized_column(
    feature_columns["age"], boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
)
education_X_occupation = tf.feature_column.crossed_column(
    ["education", "occupation"], hash_bucket_size=int(1e4)
)
native_country_embedded = tf.feature_column.embedding_column(
    feature_columns["native_country"], dimension=task.HYPER_PARAMS.embedding_size
)
```
Design goals
- As concise as possible
- Be able to copy-paste the schema into the manifestation when possible, with clear rules for the cases where that's not possible
- No separate keys needed for values vs. columns; typing makes that distinction
- Use the same input config for models, transformers, and aggregates
Motivation
- Consistency with `transformer`/`transformed_column` and `aggregator`/`aggregate`
- Enable built-in trainers, so users don't need any TensorFlow code for canned estimators
- Enable re-use of model implementations