
Create estimator and improve inputs #72

Closed
@deliahu

Description


Separate model config into estimator / model, and improve inputs

Design

  • _default / _optional:
    • _default defaults to None in every case (scalar / list / map)
    • _default can never be explicitly set to None
    • If the user sets _default, _optional is implicitly set to True
    • If a *_COLUMN type appears anywhere in or under the type, _default is not an option
    • For list and map types, _min_count is an option (and defaults to 0)
  • User-provided map types cannot have keys that start with _
  • Denote Cortex values with @ throughout (i.e. in all inputs, and also e.g. model: @dnn in the API)
  • Add training_input to models
  • Remove type: classification in model
  • Rename "inputs" to "input"
  • input: INT is supported
  • Can't mix value / column types in a single type (e.g. FLOAT_COLUMN|INT is not allowed)
  • {arg: INT}, {STRING: INT}, {STRING: INT_COLUMN}, {STRING_COLUMN: INT}, {STRING_COLUMN: INT_COLUMN}, {STRING_COLUMN: [INT_COLUMN]}, etc... are all supported. {[STRING_COLUMN]: INT_COLUMN} is not supported.
  • Cast INT -> FLOAT for inputs, INT -> FLOAT and INT_COLUMN -> FLOAT_COLUMN for outputs
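The union and casting rules above can be sketched in Python (helper names like validate_union and cast_value are illustrative, not actual Cortex code):

```python
# Illustrative sketch of the proposed typing rules; the type lists and helper
# names are assumptions for demonstration, not part of the codebase.
COLUMN_TYPES = {"INT_COLUMN", "FLOAT_COLUMN", "STRING_COLUMN", "FLOAT_LIST_COLUMN"}
VALUE_TYPES = {"INT", "FLOAT", "STRING", "BOOL"}

def is_column_type(t):
    return t in COLUMN_TYPES

def validate_union(type_str):
    """Reject unions that mix value and column types, e.g. FLOAT_COLUMN|INT."""
    parts = type_str.split("|")
    kinds = {is_column_type(p) for p in parts}
    if len(kinds) > 1:
        raise ValueError("cannot mix value and column types: " + type_str)
    return parts

def cast_value(value, target_type):
    """Apply the INT -> FLOAT cast when a FLOAT-typed input receives an int."""
    if target_type == "FLOAT" and isinstance(value, int):
        return float(value)
    return value
```

The analogous INT_COLUMN -> FLOAT_COLUMN cast for outputs would follow the same pattern at the column level.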

Structure

input: <TYPE>     # (short form)

input:            # (long form)
  _type: <TYPE>
  _default: <>
  ...
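Canonicalizing the short form into the long form, together with the _default / _optional rules from the design section, could look roughly like this (a sketch; the real implementation may differ):

```python
def canonicalize(input_spec):
    """Expand the short form (a bare type) into the long form, and apply the
    _default / _optional / _min_count rules from the design section."""
    if not (isinstance(input_spec, dict) and "_type" in input_spec):
        input_spec = {"_type": input_spec}  # short form -> long form
    spec = dict(input_spec)
    if "_default" in spec and spec["_default"] is None:
        raise ValueError("_default can never be set to None explicitly")
    spec.setdefault("_default", None)
    if spec["_default"] is not None:
        spec["_optional"] = True  # setting _default implies _optional
    spec.setdefault("_optional", False)
    if isinstance(spec["_type"], (list, dict)):
        spec.setdefault("_min_count", 0)  # only for list and map types
    return spec
```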

Examples

Aggregators

# mean

- kind: aggregator
  name: mean
  output_type: FLOAT
  input: FLOAT_COLUMN|INT_COLUMN

- kind: aggregate
  name: sepal_length_mean
  aggregator: cortex.mean
  input: sepal_length

# bucket_boundaries

- kind: aggregator
  name: bucket_boundaries
  output_type: [FLOAT]
  input:
    col: FLOAT_COLUMN|INT_COLUMN
    num_buckets:
      _type: INT
      _default: 10

- kind: aggregate
  name: sepal_length_bucket
  aggregator: cortex.bucket_boundaries
  input:
    col: sepal_length
    num_buckets: 5
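As a sketch of what an aggregator body consuming this input might look like (the plain-Python signature is an assumption; the actual runtime presumably passes resolved column data in its own representation):

```python
def mean(input):
    # `input` is the resolved value of the aggregator's `input` config:
    # for cortex.mean above, the referenced column's values
    # (modeled here as a plain Python list for illustration).
    return sum(input) / len(input)
```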

Transformers

# normalize

- kind: transformer
  name: normalize
  output_type: FLOAT_COLUMN
  input:
    col: FLOAT_COLUMN|INT_COLUMN
    mean: INT|FLOAT
    stddev: INT|FLOAT

- kind: transformed_column
  name: sepal_length_normalized
  transformer: cortex.normalize
  input:
    col: sepal_length
    mean: sepal_length_mean
    stddev: sepal_length_stddev
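A possible per-value body for the normalize transformer, assuming the resolved input arrives as a map mirroring the config (col as a single column value, mean / stddev as resolved aggregates):

```python
def normalize(input):
    # Standard score: subtract the column mean, divide by the
    # standard deviation (both resolved from aggregates).
    return (input["col"] - input["mean"]) / input["stddev"]
```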

# weight

- kind: transformer
  name: weight
  output_type: FLOAT_COLUMN
  input:
    col: INT_COLUMN
    class_distribution: {INT: FLOAT}

- kind: transformed_column
  name: weight_column
  transformer: weight
  input:
    col: class
    class_distribution: class_distribution
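The weight transformer could be inverse-frequency class weighting, with the {INT: FLOAT} map shape matching the config above (again a hypothetical body, not the actual implementation):

```python
def weight(input):
    # Rows of rare classes get proportionally larger weights:
    # weight = 1 / class frequency.
    return 1.0 / input["class_distribution"][input["col"]]
```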

Models

# iris

- kind: estimator
  name: dnn
  target_column: INT_COLUMN
  input:
    normalized_columns: [INT_COLUMN|FLOAT_COLUMN]
    num_classes: INT
  hparams:
    hidden_layers: [INT]
    learning_rate:
      _type: FLOAT
      _default: 0.01

- kind: model
  name: iris-dnn
  estimator: dnn
  target_column: class_indexed
  input:
    normalized_columns:
      - sepal_length_normalized
      - sepal_width_normalized
      - petal_length_normalized
      - petal_width_normalized
    num_classes: 3
  hparams:
    hidden_layers: [4, 4]
  data_partition_ratio: ...
  training: ...
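One payoff of the estimator / model split is that the platform can check a model's concrete input against its estimator's declared schema. A minimal key-coverage check (type checking and _optional handling omitted; all names hypothetical):

```python
def check_model_input(estimator_input, model_input):
    # Both arguments are the parsed `input` maps from the configs above.
    missing = set(estimator_input) - set(model_input)
    extra = set(model_input) - set(estimator_input)
    if missing or extra:
        raise ValueError(f"missing inputs: {missing}, unknown inputs: {extra}")
    return True
```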

# Fraud

- kind: estimator
  name: dnn
  target_column: INT_COLUMN
  input:
    normalized_columns: [INT_COLUMN|FLOAT_COLUMN]
  training_input:
    weight_column: FLOAT_COLUMN

- kind: model
  name: fraud-dnn
  estimator: dnn
  target_column: class
  input:
    normalized_columns: [time_normalized, v1_normalized, ...]
  training_input:
    weight_column: weights

# Insurance

- kind: estimator
  name: dnn
  target_column: FLOAT_COLUMN
  input:
    categorical:
      _type:
        - feature: INT_COLUMN
          categories: [STRING]
      _min_count: 1  # This wouldn't actually be necessary, it's just to demonstrate
    bucketized:
      - feature: INT_COLUMN|FLOAT_COLUMN
        buckets: [INT]

- kind: model
  name: insurance-dnn
  estimator: dnn
  target_column: charges_normalized
  input:
    categorical:
      - feature: gender
        categories: ["female", "male"]
      - feature: smoker
        categories: ["yes", "no"]
      - feature: region
        categories: ["northwest", "northeast", "southwest", "southeast"]
      - feature: children
        categories: children_set
    bucketized:
      - feature: age
        buckets: [15, 20, 25, 35, 40, 45, 50, 55, 60, 65]
      - feature: bmi
        buckets: [15, 20, 25, 35, 40, 45, 50, 55]

# Poker

- kind: estimator
  name: dnn
  target_column: INT_COLUMN
  input:
    suit_columns: [INT_COLUMN]
    rank_columns: [INT_COLUMN]

- kind: model
  name: poker-dnn
  estimator: dnn
  target_column: class
  input:
    suit_columns: [card_1_suit, card_2_suit, card_3_suit, card_4_suit, card_5_suit]
    rank_columns: [card_1_rank, card_2_rank, card_3_rank, card_4_rank, card_5_rank]

# Mnist

- kind: estimator
  name: t2t
  target_column: INT_COLUMN
  input: FLOAT_LIST_COLUMN
  prediction_key: outputs

- kind: model
  name: mnist-t2t
  estimator: t2t
  target_column: label
  input: image_pixels

TensorFlow feature_column examples

Review of representative tf.feature_column usage:

tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list("smoker", ["yes", "no"])
),

tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column("age"), [15, 20, 25, 35, 40, 45, 50, 55, 60, 65]
),

tf.feature_column.numeric_column("sepal_length_normalized"),

tf.feature_column.numeric_column("image_pixels", shape=model_config["hparams"]["input_shape"])

weight_column = "class_weight"


# cloudml-template

categorical_columns_with_identity = {
    item[0]: tf.feature_column.categorical_column_with_identity(item[0], item[1])
    for item in categorical_feature_names_with_identity.items()
}

categorical_columns_with_vocabulary = {
    item[0]: tf.feature_column.categorical_column_with_vocabulary_list(item[0], item[1])
    for item in metadata.INPUT_CATEGORICAL_FEATURE_NAMES_WITH_VOCABULARY.items()
}

categorical_columns_with_hash_bucket = {
    item[0]: tf.feature_column.categorical_column_with_hash_bucket(item[0], item[1])
    for item in metadata.INPUT_CATEGORICAL_FEATURE_NAMES_WITH_HASH_BUCKET.items()
}

age_buckets = tf.feature_column.bucketized_column(
    feature_columns["age"], boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
)

education_X_occupation = tf.feature_column.crossed_column(
    ["education", "occupation"], hash_bucket_size=int(1e4)
)

native_country_embedded = tf.feature_column.embedding_column(
    feature_columns["native_country"], dimension=task.HYPER_PARAMS.embedding_size
)

Design goals

  • As concise as possible
  • Be able to copy-paste the schema into the manifestation when possible; have clear rules for when that's not possible
  • No need for separate keys for values vs. columns; the typing handles that
  • Use the same input config for models, transformers, and aggregates

Motivation

  • Consistency with transformer / transformed_column and aggregator / aggregate
  • Enable built-in trainers, so users don't need any TensorFlow code for canned estimators
  • Enable re-use of model implementations

Labels

enhancement (New feature or request)
