# TensorFlow Data Validation: Checking and analyzing your data

Once your data is in a TFX pipeline, you can use TFX components to
analyze and transform it. You can use these tools even before you train
a model.

There are many reasons to analyze and transform your data:

- To find problems in your data. Common problems include:
    - Missing data, such as features with empty values.
    - Labels treated as features, so that your model gets to peek at
      the right answer during training.
    - Features with values outside the range you expect.
    - Data anomalies.
    - A transfer-learned model whose preprocessing does not match the
      training data.
- To engineer more effective feature sets. For example, you can
  identify:
    - Especially informative features.
    - Redundant features.
    - Features that vary so widely in scale that they may slow
      learning.
    - Features with little or no unique predictive information.

TFX tools can both help find data bugs and help with feature
engineering.

## TensorFlow Data Validation

- [Overview](#overview)
- [Schema Based Example Validation](#schema-based-example-validation)
- [Training-Serving Skew Detection](#training-serving-skew-detection)
- [Drift Detection](#drift-detection)

### Overview

TensorFlow Data Validation identifies anomalies in training and serving
data, and can automatically create a schema by examining the data. The
component can be configured to detect different classes of anomalies in
the data. It can

1. Perform validity checks by comparing data statistics against a
    schema that codifies expectations of the user.
2. Detect training-serving skew by comparing examples in training and
    serving data.
3. Detect data drift by looking at a series of data.

We document each of these functionalities independently:

- [Schema Based Example Validation](#schema-based-example-validation)
- [Training-Serving Skew Detection](#training-serving-skew-detection)
- [Drift Detection](#drift-detection)

### Schema Based Example Validation

TensorFlow Data Validation identifies any anomalies in the input data by
comparing data statistics against a schema. The schema codifies
properties which the input data is expected to satisfy, such as data
types or categorical values, and can be modified or replaced by the
user.

TensorFlow Data Validation is typically invoked multiple times within
the context of the TFX pipeline: (i) for every split obtained from
ExampleGen, (ii) for all pre-transformed data used by Transform, and
(iii) for all post-transform data generated by Transform. When invoked
in the context of Transform (ii-iii), statistics options and
schema-based constraints can be set by defining the
[`stats_options_updater_fn`](https://tensorflow.github.io/transform).
This is particularly useful when validating unstructured data (e.g. text
features). See the [user
code](https://github.com/tensorflow/tfx/blob/master/tfx/examples/bert/mrpc/bert_mrpc_utils.py)
for an example.

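Outside a pipeline, the same check can be run directly with the TFDV
library. The following is a minimal sketch, assuming a curated `schema`
is already in hand; the TFRecord path is a placeholder:

```python
import tensorflow_data_validation as tfdv

# Compute statistics over a split of the data.
eval_stats = tfdv.generate_statistics_from_tfrecord(
    data_location='path/to/eval/*.tfrecord')

# Compare the statistics against the schema and report any anomalies,
# such as unexpected types, values, or valencies.
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)
tfdv.display_anomalies(anomalies)
```
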
#### Advanced Schema Features

This section covers more advanced schema configuration that can help
with special setups.

##### Sparse Features

Encoding sparse features in Examples usually introduces multiple
Features that are expected to have the same valency for all Examples.
For example the sparse feature:

```python
WeightedCategories = [('CategoryA', 0.3), ('CategoryX', 0.7)]
```

would be encoded using separate Features for index and value:

```python
WeightedCategoriesIndex = ['CategoryA', 'CategoryX']
WeightedCategoriesValue = [0.3, 0.7]
```

with the restriction that the valency of the index and value feature
should match for all Examples. This restriction can be made explicit in
the schema by defining a sparse_feature:

```proto
sparse_feature {
  name: 'WeightedCategories'
  index_feature { name: 'WeightedCategoriesIndex' }
  value_feature { name: 'WeightedCategoriesValue' }
}
```

The sparse feature definition requires one or more index features and
one value feature, all of which must refer to features that exist in the
schema. Explicitly defining sparse features enables TFDV to check that
the valencies of all referred features match.

Some use cases introduce similar valency restrictions between Features,
but do not necessarily encode a sparse feature. Using a sparse feature
should unblock you, but is not ideal.

##### Schema Environments

By default, validations assume that all Examples in a pipeline adhere to
a single schema. In some cases introducing slight schema variations is
necessary; for instance, features used as labels are required during
training (and should be validated), but are missing during serving.
Environments can be used to express such requirements, in particular
`default_environment()`, `in_environment()`, and `not_in_environment()`.

For example, assume a feature named `'LABEL'` is required for training,
but is expected to be missing from serving. This can be expressed by:

- Defining two distinct environments in the schema, `["SERVING",
  "TRAINING"]`, and associating `'LABEL'` only with environment
  `"TRAINING"`.
- Associating the training data with environment `"TRAINING"` and the
  serving data with environment `"SERVING"`.

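A minimal sketch of the corresponding code, assuming `serving_stats` has
already been computed and `'LABEL'` exists in the schema:

```python
# By default, every feature belongs to both environments.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

# Declare that 'LABEL' is absent from the SERVING environment.
tfdv.get_feature(schema, 'LABEL').not_in_environment.append('SERVING')

# Validate serving data against the schema as seen by SERVING; the
# missing label no longer shows up as an anomaly.
serving_anomalies = tfdv.validate_statistics(
    serving_stats, schema, environment='SERVING')
```
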
##### Schema Generation

The input data schema is specified as an instance of the TensorFlow
Metadata
[Schema](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto)
proto.

Instead of constructing a schema manually from scratch, a developer can
rely on TensorFlow Data Validation's automatic schema construction.
Specifically, TensorFlow Data Validation automatically constructs an
initial schema based on statistics computed over training data available
in the pipeline. Users can simply review this autogenerated schema,
modify it as needed, check it into a version control system, and push it
explicitly into the pipeline for further validation.

TFDV includes `infer_schema()` to generate a schema automatically. For
example:

```python
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
```

This triggers an automatic schema generation based on the following
rules:

- If a schema has already been auto-generated then it is used as is.

- Otherwise, TensorFlow Data Validation examines the available data
  statistics and computes a suitable schema for the data.

*Note: The auto-generated schema is best-effort and only tries to infer
basic properties of the data. It is expected that users review and
modify it as needed.*

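Reviewing the auto-generated schema usually means tightening or relaxing
a few constraints and persisting the result. A sketch, where the feature
name `'Company'` and the file path are hypothetical:

```python
# Relax the minimum fraction of values that must come from the
# feature's domain.
company = tfdv.get_feature(schema, 'Company')
company.distribution_constraints.min_domain_mass = 0.9

# Persist the curated schema so it can be checked into version control.
tfdv.write_schema_text(schema, 'path/to/schema.pbtxt')
```
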
### Training-Serving Skew Detection

#### Overview

TensorFlow Data Validation can detect distribution skew between training
and serving data. Distribution skew occurs when the distribution of
feature values for training data is significantly different from serving
data. One of the key causes of distribution skew is using a completely
different corpus to generate training data, to overcome the lack of
initial data in the desired corpus. Another reason is a faulty sampling
mechanism that only chooses a subsample of the serving data to train on.

##### Example Scenario

For instance, in order to compensate for an underrepresented slice of
data, if biased sampling is used without upweighting the downsampled
examples appropriately, the distribution of feature values between
training and serving data gets artificially skewed.

See the [TensorFlow Data Validation Get Started
Guide](get_started#checking-data-skew-and-drift)
for information about configuring training-serving skew detection.

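At the library level, skew checks are configured by attaching a
`skew_comparator` to a feature in the schema and passing serving
statistics to validation. A sketch; the feature name and threshold are
illustrative:

```python
# Flag the feature if the L-infinity distance between its training and
# serving value distributions exceeds the threshold.
payment_type = tfdv.get_feature(schema, 'payment_type')
payment_type.skew_comparator.infinity_norm.threshold = 0.01

skew_anomalies = tfdv.validate_statistics(
    statistics=train_stats, schema=schema,
    serving_statistics=serving_stats)
```
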
### Drift Detection

Drift detection is supported between consecutive spans of data (i.e.,
between span N and span N+1), such as between different days of training
data. We express drift in terms of [L-infinity
distance](https://en.wikipedia.org/wiki/Chebyshev_distance) for
categorical features and approximate [Jensen-Shannon
divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence)
for numeric features. You can set the threshold distance so that you
receive warnings when the drift is higher than is acceptable. Setting
the correct distance is typically an iterative process requiring domain
knowledge and experimentation.

See the [TensorFlow Data Validation Get Started
Guide](get_started#checking-data-skew-and-drift)
for information about configuring drift detection.

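A corresponding sketch for drift, comparing statistics from two
consecutive spans of training data (the feature name and threshold are
again illustrative):

```python
# Flag the feature if drift between span N and span N+1 exceeds the
# threshold.
payment_type = tfdv.get_feature(schema, 'payment_type')
payment_type.drift_comparator.infinity_norm.threshold = 0.01

drift_anomalies = tfdv.validate_statistics(
    statistics=train_stats_span_2, schema=schema,
    previous_statistics=train_stats_span_1)
```
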
## Using Visualizations to Check Your Data

TensorFlow Data Validation provides tools for visualizing the
distribution of feature values. By examining these distributions in a
Jupyter notebook using [Facets](https://pair-code.github.io/facets/) you
can catch common problems with data.

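In a notebook, the Facets view is rendered directly from the computed
statistics. A minimal sketch, assuming `train_stats` was computed as in
the earlier snippets:

```python
# Renders an interactive Facets Overview of the statistics in the
# notebook output cell.
tfdv.visualize_statistics(train_stats)
```
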
### Identifying Suspicious Distributions

You can identify common bugs in your data by using a Facets Overview
display to look for suspicious distributions of feature values.

#### Unbalanced Data

An unbalanced feature is a feature for which one value predominates.
Unbalanced features can occur naturally, but if a feature always has the
same value you may have a data bug. To detect unbalanced features in a
Facets Overview, choose "Non-uniformity" from the "Sort by"
dropdown.

The most unbalanced features will be listed at the top of each
feature-type list. For example, the following screenshot shows one
feature that is all zeros, and a second that is highly unbalanced, at
the top of the "Numeric Features" list:

#### Uniformly Distributed Data

A uniformly distributed feature is one for which all possible values
appear with close to the same frequency. As with unbalanced data, this
distribution can occur naturally, but can also be produced by data bugs.

To detect uniformly distributed features in a Facets Overview, choose
"Non-uniformity" from the "Sort by" dropdown and check the
"Reverse order" checkbox:

String data is represented using bar charts if there are 20 or fewer
unique values, and as a cumulative distribution graph if there are more
than 20 unique values. So for string data, uniform distributions can
appear as either flat bar graphs like the one above or straight lines
like the one below:

##### Bugs That Can Produce Uniformly Distributed Data

Here are some common bugs that can produce uniformly distributed data:

- Using strings to represent non-string data types such as dates. For
  example, you will have many unique values for a datetime feature
  with representations like `2017-03-01-11-45-03`. Unique values
  will be distributed uniformly.

- Including indices like "row number" as features. Here again you
  have many unique values.

#### Missing Data

To check whether a feature is missing values entirely:

1. Choose "Amount missing/zero" from the "Sort by" drop-down.
2. Check the "Reverse order" checkbox.
3. Look at the "missing" column to see the percentage of instances
    with missing values for a feature.

A data bug can also cause incomplete feature values. For example you may
expect a feature's value list to always have three elements and
discover that sometimes it only has one. To check for incomplete values
or other cases where feature value lists don't have the expected number
of elements:

1. Choose "Value list length" from the "Chart to show" drop-down
    menu on the right.

2. Look at the chart to the right of each feature row. The chart shows
    the range of value list lengths for the feature. For example, the
    highlighted row in the screenshot below shows a feature that has
    some zero-length value lists:

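Beyond the visual check, presence and valency requirements can be
encoded in the schema so that TFDV flags missing or incomplete values
automatically. A sketch, with a hypothetical feature name and counts:

```python
# Require 'trip_distance' to be present in every example, with exactly
# three values per example.
feature = tfdv.get_feature(schema, 'trip_distance')
feature.presence.min_fraction = 1.0
feature.value_count.min = 3
feature.value_count.max = 3
```
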
#### Large Differences in Scale Between Features

If your features vary widely in scale, then the model may have
difficulty learning. For example, if some features vary from 0 to 1
and others vary from 0 to 1,000,000,000, you have a big difference in
scale. Compare the "max" and "min" columns across features to find
widely varying scales.

Consider normalizing feature values to reduce these wide variations.

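If your pipeline uses Transform, one common approach is to normalize in
the `preprocessing_fn`. A minimal sketch, where the feature name
`'amount'` is hypothetical:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Scale a wide-ranging numeric feature to zero mean and unit
  # variance, using statistics computed over the full dataset.
  return {
      'amount_scaled': tft.scale_to_z_score(inputs['amount']),
  }
```
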
#### Labels with Invalid Values

TensorFlow's Estimators have restrictions on the type of data they
accept as labels. For example, binary classifiers typically only work
with {0, 1} labels.

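If your labels arrive as strings, a remapping step such as the following
hypothetical sketch (label names are illustrative) brings them into the
expected form:

```python
# Map string labels to the {0, 1} values a binary classifier expects.
LABEL_MAP = {'negative': 0, 'positive': 1}

raw_labels = ['positive', 'negative', 'positive']  # example input
labels = [LABEL_MAP[value] for value in raw_labels]  # -> [1, 0, 1]
```
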
Review the label values in the Facets Overview and make sure they
conform to the [requirements of
Estimators](https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/feature_columns.md).