# Add documentation with MkDocs #268

**Open**: smokestacklightnin wants to merge 21 commits into `tensorflow:master` from `smokestacklightnin:ci/docs/add-mkdocs` (base: master)
Changes from all 21 commits:

- `71f1181` Rename docs directory
- `39cc387` Add documentation dependencies
- `ed4f78c` Add documentation dependencies to `setup.py`
- `0b5baad` Add javascripts
- `16ffa86` Add CSS stylesheets
- `2738349` Add TF logo
- `de89981` Add readme as landing page
- `8b3d0b0` Add basic mkdocs config
- `62e63d4` Use correct index page
- `5c255d0` Add images for index page
- `7196217` Add install and getting started pages to navigation
- `763353e` Add API docs
- `6689e0a` Add docs workflow
- `33066a1` Remove deprecated code
- `21b3486` Check links with lychee in docs workflow
- `7d76562` Revert "Check links with lychee in docs workflow"
- `418b861` Add anomalies reference
- `3943409` Fix bad links in docs
- `6c532cd` Fix links that should be internal
- `787d909` Merge remote-tracking branch 'upstream/master' into ci/docs/add-mkdocs
- `8a96cb4` Run pre-commit
New workflow file (`Deploy docs`):

```yaml
name: Deploy docs

on:
  workflow_dispatch:
  push:
    branches:
      - 'master'
  pull_request:

permissions:
  contents: write

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repo
        uses: actions/checkout@v4

      - name: Configure Git Credentials
        run: |
          git config user.name github-actions[bot]
          git config user.email 41898282+github-actions[bot]@users.noreply.github.com
        if: (github.event_name != 'pull_request')

      - name: Set up Python 3.9
        uses: actions/setup-python@v5
        with:
          python-version: '3.9'
          cache: 'pip'
          cache-dependency-path: |
            setup.py
            requirements-docs.txt

      - name: Save time for cache for mkdocs
        run: echo "cache_id=$(date --utc '+%V')" >> $GITHUB_ENV

      - name: Caching
        uses: actions/cache@v4
        with:
          key: mkdocs-material-${{ env.cache_id }}
          path: .cache
          restore-keys: |
            mkdocs-material-

      - name: Install Dependencies
        run: pip install -r requirements-docs.txt

      - name: Deploy to GitHub Pages
        run: mkdocs gh-deploy --force
        if: (github.event_name != 'pull_request')

      - name: Build docs to check for errors
        run: mkdocs build
        if: (github.event_name == 'pull_request')
```
File renamed without changes.
New file:

```md
# TensorFlow Data Validation API Documentation

::: tensorflow_data_validation
```
# TensorFlow Data Validation: Checking and analyzing your data

Once your data is in a TFX pipeline, you can use TFX components to analyze and transform it. You can use these tools even before you train a model.

There are many reasons to analyze and transform your data:

- To find problems in your data. Common problems include:
    - Missing data, such as features with empty values.
    - Labels treated as features, so that your model gets to peek at the right answer during training.
    - Features with values outside the range you expect.
    - Data anomalies.
    - A transfer-learned model whose preprocessing does not match the training data.
- To engineer more effective feature sets. For example, you can identify:
    - Especially informative features.
    - Redundant features.
    - Features that vary so widely in scale that they may slow learning.
    - Features with little or no unique predictive information.

TFX tools can both help find data bugs and help with feature engineering.
## TensorFlow Data Validation

- [Overview](#overview)
- [Schema Based Example Validation](#schema-based-example-validation)
- [Training-Serving Skew Detection](#training-serving-skew-detection)
- [Drift Detection](#drift-detection)

### Overview

TensorFlow Data Validation identifies anomalies in training and serving data, and can automatically create a schema by examining the data. The component can be configured to detect different classes of anomalies in the data. It can:

1. Perform validity checks by comparing data statistics against a schema that codifies expectations of the user.
2. Detect training-serving skew by comparing examples in training and serving data.
3. Detect data drift by looking at a series of data.

We document each of these functionalities independently:

- [Schema Based Example Validation](#schema-based-example-validation)
- [Training-Serving Skew Detection](#training-serving-skew-detection)
- [Drift Detection](#drift-detection)
### Schema Based Example Validation

TensorFlow Data Validation identifies any anomalies in the input data by comparing data statistics against a schema. The schema codifies properties which the input data is expected to satisfy, such as data types or categorical values, and can be modified or replaced by the user.

TensorFlow Data Validation is typically invoked multiple times within the context of a TFX pipeline: (i) for every split obtained from ExampleGen, (ii) for all pre-transformed data used by Transform, and (iii) for all post-transform data generated by Transform. When invoked in the context of Transform (ii-iii), statistics options and schema-based constraints can be set by defining the [`stats_options_updater_fn`](https://tensorflow.github.io/transform). This is particularly useful when validating unstructured data (e.g., text features). See the [user code](https://github.com/tensorflow/tfx/blob/master/tfx/examples/bert/mrpc/bert_mrpc_utils.py) for an example.
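As a rough illustration of the idea only (this is not the TFDV API; the dict-based schema, feature names, and values below are all hypothetical), schema-based validation amounts to comparing per-feature statistics against codified expectations:

```python
# Hypothetical, deliberately simplified sketch of schema-based validation.
# The real TFDV schema is a protocol buffer, and real statistics come from
# TFDV's statistics generation; everything here is illustrative.
schema = {
    "age": {"type": "INT", "min": 0, "max": 130},
    "country": {"type": "STRING", "domain": {"US", "GB", "DE"}},
}

def validate_statistics(stats, schema):
    """Compare per-feature statistics against the schema; return anomalies."""
    anomalies = []
    for name, observed in stats.items():
        expected = schema.get(name)
        if expected is None:
            anomalies.append(f"{name}: feature not in schema")
        elif expected["type"] == "INT":
            # Check that observed values stay inside the expected range.
            if observed["min"] < expected["min"] or observed["max"] > expected["max"]:
                anomalies.append(f"{name}: values outside [{expected['min']}, {expected['max']}]")
        elif expected["type"] == "STRING":
            # Check that all observed categorical values belong to the domain.
            unseen = observed["values"] - expected["domain"]
            if unseen:
                anomalies.append(f"{name}: unexpected values {sorted(unseen)}")
    return anomalies

# Statistics computed over a batch of examples (illustrative values).
stats = {
    "age": {"min": -1, "max": 42},        # -1 violates the schema's minimum
    "country": {"values": {"US", "FR"}},  # "FR" is outside the domain
}
print(validate_statistics(stats, schema))
```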
#### Advanced Schema Features

This section covers more advanced schema configuration that can help with special setups.
##### Sparse Features

Encoding sparse features in Examples usually introduces multiple Features that are expected to have the same valency for all Examples. For example, the sparse feature:

```python
WeightedCategories = [('CategoryA', 0.3), ('CategoryX', 0.7)]
```

would be encoded using separate Features for index and value:

```python
WeightedCategoriesIndex = ['CategoryA', 'CategoryX']
WeightedCategoriesValue = [0.3, 0.7]
```

with the restriction that the valency of the index and value feature should match for all Examples. This restriction can be made explicit in the schema by defining a sparse_feature:

```python
sparse_feature {
  name: 'WeightedCategories'
  index_feature { name: 'WeightedCategoriesIndex' }
  value_feature { name: 'WeightedCategoriesValue' }
}
```

The sparse feature definition requires one or more index features and one value feature, all of which refer to features that exist in the schema. Explicitly defining sparse features enables TFDV to check that the valencies of all referred features match.

Some use cases introduce similar valency restrictions between Features, but do not necessarily encode a sparse feature. Using a sparse feature should unblock you, but is not ideal.
##### Schema Environments

By default, validations assume that all Examples in a pipeline adhere to a single schema. In some cases introducing slight schema variations is necessary; for instance, features used as labels are required during training (and should be validated), but are missing during serving. Environments can be used to express such requirements, in particular `default_environment()`, `in_environment()`, and `not_in_environment()`.

For example, assume a feature named `'LABEL'` is required for training, but is expected to be missing from serving. This can be expressed by:

- Defining two distinct environments in the schema, `["SERVING", "TRAINING"]`, and associating `'LABEL'` only with environment `"TRAINING"`.
- Associating the training data with environment `"TRAINING"` and the serving data with environment `"SERVING"`.
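As a hedged sketch, this setup might look like the following schema fragment (field names follow the tensorflow_metadata `Schema` proto; the `BYTES` type is illustrative):

```python
default_environment: "TRAINING"
default_environment: "SERVING"

feature {
  name: "LABEL"
  type: BYTES
  not_in_environment: "SERVING"
}
```

Validating the serving data against this schema in the `"SERVING"` environment would then not report the missing `'LABEL'` feature as an anomaly.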
##### Schema Generation

The input data schema is specified as an instance of the TensorFlow [Schema](https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto).

Instead of constructing a schema manually from scratch, a developer can rely on TensorFlow Data Validation's automatic schema construction. Specifically, TensorFlow Data Validation automatically constructs an initial schema based on statistics computed over training data available in the pipeline. Users can simply review this autogenerated schema, modify it as needed, check it into a version control system, and push it explicitly into the pipeline for further validation.

TFDV includes `infer_schema()` to generate a schema automatically. For example:

```python
schema = tfdv.infer_schema(statistics=train_stats)
tfdv.display_schema(schema=schema)
```

This triggers an automatic schema generation based on the following rules:

- If a schema has already been auto-generated then it is used as is.
- Otherwise, TensorFlow Data Validation examines the available data statistics and computes a suitable schema for the data.

*Note: The auto-generated schema is best-effort and only tries to infer basic properties of the data. It is expected that users review and modify it as needed.*
### Training-Serving Skew Detection

#### Overview

TensorFlow Data Validation can detect distribution skew between training and serving data. Distribution skew occurs when the distribution of feature values for training data is significantly different from serving data. One key cause of distribution skew is using a completely different corpus for training data generation, to overcome the lack of initial data in the desired corpus. Another is a faulty sampling mechanism that only chooses a subsample of the serving data to train on.

##### Example Scenario

For instance, in order to compensate for an underrepresented slice of data, if biased sampling is used without upweighting the downsampled examples appropriately, the distribution of feature values between training and serving data gets artificially skewed.

See the [TensorFlow Data Validation Get Started Guide](get_started#checking-data-skew-and-drift) for information about configuring training-serving skew detection.
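Skew detection is configured per feature in the schema. As a hedged sketch (field and message names follow the tensorflow_metadata `Schema` proto; the feature name and threshold are illustrative), a skew comparator might look like:

```python
feature {
  name: "country"
  type: BYTES
  skew_comparator {
    infinity_norm {
      threshold: 0.01
    }
  }
}
```

With such a comparator in place, validating training statistics together with serving statistics would flag the feature when the distance between the two distributions exceeds the threshold.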
### Drift Detection

Drift detection is supported between consecutive spans of data (i.e., between span N and span N+1), such as between different days of training data. We express drift in terms of [L-infinity distance](https://en.wikipedia.org/wiki/Chebyshev_distance) for categorical features and approximate [Jensen-Shannon divergence](https://en.wikipedia.org/wiki/Jensen%E2%80%93Shannon_divergence) for numeric features. You can set the threshold distance so that you receive warnings when the drift is higher than is acceptable. Setting the correct distance is typically an iterative process requiring domain knowledge and experimentation.

See the [TensorFlow Data Validation Get Started Guide](get_started#checking-data-skew-and-drift) for information about configuring drift detection.
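For categorical features, the L-infinity (Chebyshev) distance reduces to the largest absolute difference in any single value's relative frequency between the two spans. A minimal, self-contained sketch (the feature values, frequencies, and threshold are illustrative):

```python
def l_infinity_distance(p, q):
    """Chebyshev (L-infinity) distance between two categorical
    distributions, each given as a value -> relative-frequency mapping."""
    keys = set(p) | set(q)
    return max(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Relative frequencies of a categorical feature in two consecutive spans.
span_n  = {"US": 0.60, "GB": 0.25, "DE": 0.15}
span_n1 = {"US": 0.40, "GB": 0.35, "DE": 0.25}

drift = l_infinity_distance(span_n, span_n1)
threshold = 0.1  # hypothetical acceptable drift
if drift > threshold:
    print(f"Drift {drift:.2f} exceeds threshold {threshold}")
```

The "US" frequency moves by 0.20, the largest single shift, so that shift is the reported distance.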
## Using Visualizations to Check Your Data

TensorFlow Data Validation provides tools for visualizing the distribution of feature values. By examining these distributions in a Jupyter notebook using [Facets](https://pair-code.github.io/facets/) you can catch common problems with data.

![Screenshot of statistics](images/feature_stats.png)

### Identifying Suspicious Distributions

You can identify common bugs in your data by using a Facets Overview display to look for suspicious distributions of feature values.

#### Unbalanced Data

An unbalanced feature is a feature for which one value predominates. Unbalanced features can occur naturally, but if a feature always has the same value you may have a data bug. To detect unbalanced features in a Facets Overview, choose "Non-uniformity" from the "Sort by" dropdown.

The most unbalanced features will be listed at the top of each feature-type list. For example, the following screenshot shows one feature that is all zeros, and a second that is highly unbalanced, at the top of the "Numeric Features" list:

![Visualization of unbalanced data](images/unbalanced.png)
#### Uniformly Distributed Data

A uniformly distributed feature is one for which all possible values appear with close to the same frequency. As with unbalanced data, this distribution can occur naturally, but can also be produced by data bugs.

To detect uniformly distributed features in a Facets Overview, choose "Non-uniformity" from the "Sort by" dropdown and check the "Reverse order" checkbox:

![Histogram of uniform data](images/uniform.png)

String data is represented using bar charts if there are 20 or fewer unique values, and as a cumulative distribution graph if there are more than 20 unique values. So for string data, uniform distributions can appear as either flat bar graphs like the one above or straight lines like the one below:

![Line graph: cumulative distribution of uniform data](images/uniform_cumulative.png)

##### Bugs That Can Produce Uniformly Distributed Data

Here are some common bugs that can produce uniformly distributed data:

- Using strings to represent non-string data types such as dates. For example, you will have many unique values for a datetime feature with representations like `2017-03-01-11-45-03`. Unique values will be distributed uniformly.
- Including indices like "row number" as features. Here again you have many unique values.
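The datetime-as-string bug can be sketched as follows (the timestamps are illustrative): every encoded value is distinct, so each value appears exactly once and the feature's distribution comes out flat.

```python
from datetime import datetime, timedelta

# Encoding datetimes as strings yields a feature where essentially every
# value is unique, producing a uniform distribution over the values.
start = datetime(2017, 3, 1, 11, 45, 3)
values = [
    (start + timedelta(seconds=i)).strftime("%Y-%m-%d-%H-%M-%S")
    for i in range(100)
]
unique_fraction = len(set(values)) / len(values)
print(unique_fraction)  # every value is distinct
```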
#### Missing Data

To check whether a feature is missing values entirely:

1. Choose "Amount missing/zero" from the "Sort by" drop-down.
2. Check the "Reverse order" checkbox.
3. Look at the "missing" column to see the percentage of instances with missing values for a feature.

A data bug can also cause incomplete feature values. For example, you may expect a feature's value list to always have three elements and discover that sometimes it only has one. To check for incomplete values or other cases where feature value lists don't have the expected number of elements:

1. Choose "Value list length" from the "Chart to show" drop-down menu on the right.
2. Look at the chart to the right of each feature row. The chart shows the range of value list lengths for the feature. For example, the highlighted row in the screenshot below shows a feature that has some zero-length value lists:

![Facets Overview display with feature with zero-length feature value lists](images/zero_length.png)

#### Large Differences in Scale Between Features

If your features vary widely in scale, the model may have difficulty learning. For example, if some features vary from 0 to 1 and others vary from 0 to 1,000,000,000, you have a big difference in scale. Compare the "max" and "min" columns across features to find widely varying scales.

Consider normalizing feature values to reduce these wide variations.
#### Labels with Invalid Labels

TensorFlow's Estimators have restrictions on the type of data they accept as labels. For example, binary classifiers typically only work with {0, 1} labels.

Review the label values in the Facets Overview and make sure they conform to the [requirements of Estimators](https://github.com/tensorflow/docs/blob/master/site/en/r1/guide/feature_columns.md).