
Commit 59bb2c8

Add documentation with MkDocs (#268)
* Rename docs directory
* Add documentation dependencies
* Add documentation dependencies to `setup.py`
* Add javascripts
* Add CSS stylesheets
* Add TF logo
* Add readme as landing page
* Add basic mkdocs config
* Use correct index page
* Add images for index page
* Add install and getting started pages to navigation
* Add API docs
* Add docs workflow
* Remove deprecated code
* Check links with lychee in docs workflow
* Revert "Check links with lychee in docs workflow" (reverts commit 21b3486)
* Add anomalies reference
* Fix bad links in docs
* Fix links that should be internal
* Run pre-commit
1 parent 34c32af commit 59bb2c8

25 files changed: +640 -10 lines changed

.github/workflows/docs.yml

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
name: Deploy docs
on:
  workflow_dispatch:
  push:
    branches:
      - 'master'
  pull_request:
permissions:
  contents: write
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v4

      - name: Configure Git Credentials
        run: |
          git config user.name github-actions[bot]
          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
        if: (github.event_name != 'pull_request')

      - name: Set up Python 3.9
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
          cache: 'pip'
          cache-dependency-path: |
            setup.py
            requirements-docs.txt

      - name: Save time for cache for mkdocs
        run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV

      - name: Caching
        uses: actions/cache@v4
        with:
          key: mkdocs-material-${{ env.cache_id }}
          path: .cache
          restore-keys: |
            mkdocs-material-

      - name: Install Dependencies
        run: pip install -r requirements-docs.txt

      - name: Deploy to GitHub Pages
        run: mkdocs gh-deploy --force
        if: (github.event_name != 'pull_request')

      - name: Build docs to check for errors
        run: mkdocs build
        if: (github.event_name == 'pull_request')
File renamed without changes.
File renamed without changes.

docs/api.md

Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
# TensorFlow Data Validation API Documentation

::: tensorflow_data_validation

g3doc/custom_data_validation.md renamed to docs/custom_data_validation.md

Lines changed: 2 additions & 2 deletions
@@ -6,9 +6,9 @@ freshness: { owner: 'kuochuntsai' reviewed: '2022-11-29' }
 
 TFDV supports custom data validation using SQL. You can run custom data
 validation using
-[validate_statistics](https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/api/validation_api.py;l=236;rcl=488721853)
+[validate_statistics](https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/api/validation_api.py#L236)
 or
-[custom_validate_statistics](https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/api/validation_api.py;l=535;rcl=488721853).
+[custom_validate_statistics](https://github.com/tensorflow/data-validation/blob/master/tensorflow_data_validation/api/validation_api.py#L535).
 Use `validate_statistics` to run standard, schema-based data validation along
 with custom validation. Use `custom_validate_statistics` to run only custom
 validation.
File renamed without changes.
File renamed without changes.

docs/images/feature_stats.png

108 KB
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.

docs/images/unbalanced.png

40.9 KB

docs/images/uniform.png

40.7 KB

docs/images/uniform_cumulative.png

12.2 KB

docs/images/zero_length.png

72.3 KB

docs/index.md

Lines changed: 319 additions & 0 deletions
@@ -0,0 +1,319 @@
# TensorFlow Data Validation: Checking and analyzing your data

Once your data is in a TFX pipeline, you can use TFX components to analyze and transform it. You can use these tools even before you train a model.

There are many reasons to analyze and transform your data:

- To find problems in your data. Common problems include:
    - Missing data, such as features with empty values.
    - Labels treated as features, so that your model gets to peek at the right answer during training.
    - Features with values outside the range you expect.
    - Data anomalies.
    - A transfer-learned model whose preprocessing does not match the training data.
- To engineer more effective feature sets. For example, you can identify:
    - Especially informative features.
    - Redundant features.
    - Features that vary so widely in scale that they may slow learning.
    - Features with little or no unique predictive information.

TFX tools can both help find data bugs and help with feature engineering.
## TensorFlow Data Validation

- [Overview](#overview)
- [Schema Based Example Validation](#schema-based-example-validation)
- [Training-Serving Skew Detection](#training-serving-skew-detection)
- [Drift Detection](#drift-detection)

### Overview

TensorFlow Data Validation identifies anomalies in training and serving data, and can automatically create a schema by examining the data. The component can be configured to detect different classes of anomalies in the data. It can:

1. Perform validity checks by comparing data statistics against a schema that codifies the expectations of the user.
2. Detect training-serving skew by comparing examples in training and serving data.
3. Detect data drift by looking at a series of data.

We document each of these functionalities independently:

- [Schema Based Example Validation](#schema-based-example-validation)
- [Training-Serving Skew Detection](#training-serving-skew-detection)
- [Drift Detection](#drift-detection)
### Schema Based Example Validation

TensorFlow Data Validation identifies any anomalies in the input data by comparing data statistics against a schema. The schema codifies properties which the input data is expected to satisfy, such as data types or categorical values, and can be modified or replaced by the user.
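As a minimal sketch of this flow outside a pipeline (the data path and a previously curated `schema` are illustrative assumptions), validation boils down to computing statistics and comparing them against the schema:

```python
import tensorflow_data_validation as tfdv

# Compute statistics over an evaluation split (hypothetical path).
eval_stats = tfdv.generate_statistics_from_tfrecord(
    data_location='eval_data.tfrecord')

# Compare the statistics against the curated schema.
anomalies = tfdv.validate_statistics(statistics=eval_stats, schema=schema)

# In a notebook, render any anomalies that were found.
tfdv.display_anomalies(anomalies)
```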
TensorFlow Data Validation is typically invoked multiple times within the context of the TFX pipeline: (i) for every split obtained from ExampleGen, (ii) for all pre-transform data used by Transform, and (iii) for all post-transform data generated by Transform. When invoked in the context of Transform (ii-iii), statistics options and schema-based constraints can be set by defining the [`stats_options_updater_fn`](https://tensorflow.github.io/transform). This is particularly useful when validating unstructured data (e.g. text features). See the [user code](https://github.com/tensorflow/tfx/blob/master/tfx/examples/bert/mrpc/bert_mrpc_utils.py) for an example.
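A minimal sketch of such a hook in the Transform module file might look like the following; the specific option being tweaked is an illustrative assumption:

```python
def stats_options_updater_fn(unused_stats_type, stats_options):
  """Updates the tfdv.StatsOptions used for pre- and post-transform statistics."""
  # Illustrative tweak: use coarser histograms for wide-ranging numeric features.
  stats_options.num_histogram_buckets = 10
  return stats_options
```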
#### Advanced Schema Features

This section covers more advanced schema configuration that can help with special setups.
##### Sparse Features

Encoding sparse features in Examples usually introduces multiple Features that are expected to have the same valency for all Examples. For example the sparse feature:

```python
WeightedCategories = [('CategoryA', 0.3), ('CategoryX', 0.7)]
```

would be encoded using separate Features for index and value:

```python
WeightedCategoriesIndex = ['CategoryA', 'CategoryX']
WeightedCategoriesValue = [0.3, 0.7]
```

with the restriction that the valency of the index and value feature should match for all Examples. This restriction can be made explicit in the schema by defining a sparse_feature:

```python
sparse_feature {
  name: 'WeightedCategories'
  index_feature { name: 'WeightedCategoriesIndex' }
  value_feature { name: 'WeightedCategoriesValue' }
}
```

The sparse feature definition requires one or more index features and one value feature, all of which refer to features that exist in the schema. Explicitly defining sparse features enables TFDV to check that the valencies of all referred features match.

Some use cases introduce similar valency restrictions between Features, but do not necessarily encode a sparse feature. Using a sparse feature should unblock you, but is not ideal.
##### Schema Environments

By default, validations assume that all Examples in a pipeline adhere to a single schema. In some cases introducing slight schema variations is necessary; for instance, features used as labels are required during training (and should be validated), but are missing during serving. Environments can be used to express such requirements, in particular `default_environment()`, `in_environment()`, and `not_in_environment()`.

For example, assume a feature named `'LABEL'` is required for training, but is expected to be missing from serving. This can be expressed by (see the sketch below):

- Defining two distinct environments in the schema, `["SERVING", "TRAINING"]`, and associating `'LABEL'` only with environment `"TRAINING"`.
- Associating the training data with environment `"TRAINING"` and the serving data with environment `"SERVING"`.
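A minimal sketch of that setup, assuming `serving_stats` has already been computed for the serving data and `schema` was inferred from training statistics:

```python
# All features are in both environments by default.
schema.default_environment.append('TRAINING')
schema.default_environment.append('SERVING')

# 'LABEL' should not be expected in the SERVING environment.
tfdv.get_feature(schema, 'LABEL').not_in_environment.append('SERVING')

# Validate the serving statistics against the SERVING environment only.
serving_anomalies = tfdv.validate_statistics(
    serving_stats, schema, environment='SERVING')
```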
##### Schema Generation

The input data schema is specified as an instance of the TensorFlow [Schema](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto).

Instead of constructing a schema manually from scratch, a developer can rely on TensorFlow Data Validation's automatic schema construction. Specifically, TensorFlow Data Validation automatically constructs an initial schema based on statistics computed over training data available in the pipeline. Users can simply review this autogenerated schema, modify it as needed, check it into a version control system, and push it explicitly into the pipeline for further validation.

TFDV includes `infer_schema()` to generate a schema automatically. For example:

```python
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
```

This triggers an automatic schema generation based on the following rules:

- If a schema has already been auto-generated then it is used as is.

- Otherwise, TensorFlow Data Validation examines the available data statistics and computes a suitable schema for the data.

*Note: The auto-generated schema is best-effort and only tries to infer basic properties of the data. It is expected that users review and modify it as needed.*
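As one sketch of that review step (the feature name `'payment_type'` is a hypothetical example), you might tighten a constraint and then persist the schema so it can be checked into version control:

```python
# Require 'payment_type' to be present in every example (hypothetical feature).
tfdv.get_feature(schema, 'payment_type').presence.min_fraction = 1.0

# Write the reviewed schema out as a text proto for version control.
tfdv.write_schema_text(schema, 'schema.pbtxt')
```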
### Training-Serving Skew Detection

#### Overview

TensorFlow Data Validation can detect distribution skew between training and serving data. Distribution skew occurs when the distribution of feature values for training data is significantly different from serving data. One key cause of distribution skew is using a completely different corpus to generate the training data, in order to overcome the lack of initial data in the desired corpus. Another reason is a faulty sampling mechanism that only chooses a subsample of the serving data to train on.

##### Example Scenario

For instance, in order to compensate for an underrepresented slice of data, if biased sampling is used without upweighting the downsampled examples appropriately, the distribution of feature values between training and serving data gets artificially skewed.

See the [TensorFlow Data Validation Get Started Guide](get_started#checking-data-skew-and-drift) for information about configuring training-serving skew detection.
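As a brief sketch (the feature name `'payment_type'` and the precomputed `train_stats`/`serving_stats` are illustrative assumptions), skew detection is configured per feature via a comparator on the schema:

```python
# Flag skew when the L-infinity distance between the training and serving
# value distributions exceeds the threshold (hypothetical feature name).
tfdv.get_feature(schema, 'payment_type').skew_comparator.infinity_norm.threshold = 0.01

skew_anomalies = tfdv.validate_statistics(
    statistics=train_stats, schema=schema, serving_statistics=serving_stats)
```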
### Drift Detection

Drift detection is supported between consecutive spans of data (i.e., between span N and span N+1), such as between different days of training data. We express drift in terms of [L-infinity distance](https://en.wikipedia.org/wiki/Chebyshev_distance) for categorical features and approximate [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) for numeric features. You can set the threshold distance so that you receive warnings when the drift is higher than is acceptable. Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation.

See the [TensorFlow Data Validation Get Started Guide](get_started#checking-data-skew-and-drift) for information about configuring drift detection.
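Configuration follows the same comparator pattern as skew detection; a minimal sketch (the feature name and `previous_day_stats` are illustrative assumptions):

```python
# Warn when drift between consecutive spans of data exceeds the threshold
# (hypothetical feature name and statistics).
tfdv.get_feature(schema, 'company').drift_comparator.infinity_norm.threshold = 0.001

drift_anomalies = tfdv.validate_statistics(
    statistics=train_stats, schema=schema, previous_statistics=previous_day_stats)
```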
## Using Visualizations to Check Your Data

TensorFlow Data Validation provides tools for visualizing the distribution of feature values. By examining these distributions in a Jupyter notebook using [Facets](https://pair-code.github.io/facets/) you can catch common problems with data.
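For example, a one-line sketch in a notebook (assuming `train_stats` was computed as above) renders a Facets Overview like the one shown below:

```python
# Render an interactive Facets Overview of the computed statistics.
tfdv.visualize_statistics(train_stats)
```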
![Feature stats](images/feature_stats.png)
### Identifying Suspicious Distributions

You can identify common bugs in your data by using a Facets Overview display to look for suspicious distributions of feature values.

#### Unbalanced Data

An unbalanced feature is a feature for which one value predominates. Unbalanced features can occur naturally, but if a feature always has the same value you may have a data bug. To detect unbalanced features in a Facets Overview, choose "Non-uniformity" from the "Sort by" dropdown.

The most unbalanced features will be listed at the top of each feature-type list. For example, the following screenshot shows one feature that is all zeros, and a second that is highly unbalanced, at the top of the "Numeric Features" list:

![Visualization of unbalanced data](images/unbalanced.png)
#### Uniformly Distributed Data

A uniformly distributed feature is one for which all possible values appear with close to the same frequency. As with unbalanced data, this distribution can occur naturally, but can also be produced by data bugs.

To detect uniformly distributed features in a Facets Overview, choose "Non-uniformity" from the "Sort by" dropdown and check the "Reverse order" checkbox:

![Histogram of uniform data](images/uniform.png)

String data is represented using bar charts if there are 20 or fewer unique values, and as a cumulative distribution graph if there are more than 20 unique values. So for string data, uniform distributions can appear as either flat bar graphs like the one above or straight lines like the one below:

![Line graph: cumulative distribution of uniform data](images/uniform_cumulative.png)
##### Bugs That Can Produce Uniformly Distributed Data
264+
265+
Here are some common bugs that can produce uniformly distributed data:
266+
267+
- Using strings to represent non-string data types such as dates. For
268+
example, you will have many unique values for a datetime feature
269+
with representations like `2017-03-01-11-45-03`. Unique values
270+
will be distributed uniformly.
271+
272+
- Including indices like "row number" as features. Here again you
273+
have many unique values.
274+
275+
#### Missing Data

To check whether a feature is missing values entirely:

1. Choose "Amount missing/zero" from the "Sort by" drop-down.
2. Check the "Reverse order" checkbox.
3. Look at the "missing" column to see the percentage of instances with missing values for a feature.

A data bug can also cause incomplete feature values. For example, you may expect a feature's value list to always have three elements and discover that sometimes it only has one. To check for incomplete values or other cases where feature value lists don't have the expected number of elements:

1. Choose "Value list length" from the "Chart to show" drop-down menu on the right.

2. Look at the chart to the right of each feature row. The chart shows the range of value list lengths for the feature. For example, the highlighted row in the screenshot below shows a feature that has some zero-length value lists:

![Facets Overview display with feature with zero-length feature value lists](images/zero_length.png)
#### Large Differences in Scale Between Features

If your features vary widely in scale, then the model may have difficulties learning. For example, if some features vary from 0 to 1 and others vary from 0 to 1,000,000,000, you have a big difference in scale. Compare the "max" and "min" columns across features to find widely varying scales.

Consider normalizing feature values to reduce these wide variations.
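If you are already using TensorFlow Transform in the same pipeline, one common fix is to rescale such features in the `preprocessing_fn`; a sketch, assuming a hypothetical numeric feature named `'amount'`:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Rescale a wide-ranging numeric feature to zero mean and unit variance
  # (hypothetical feature name; extend for your own features).
  return {'amount_scaled': tft.scale_to_z_score(inputs['amount'])}
```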
#### Invalid Labels

TensorFlow's Estimators have restrictions on the type of data they accept as labels. For example, binary classifiers typically only work with {0, 1} labels.

Review the label values in the Facets Overview and make sure they conform to the [requirements of Estimators](https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/feature_columns.md).
