TESTING ONLY (TPU test) #21528
base: master
Conversation
… hosted TPU based runner
updated build file path
updated build file path
Updated tpu_build job of actions.yml with specific runner label
Added container section
Progress bar would always report the starting batch + 1 at the end of the batch. Now it takes into account `steps_per_execution` for the last batch reported. Fixes keras-team#20861
Using `keras.ops.math.logsumexp` with an int for `axis` in a functional model would throw an error.
…eras-team#21429) Arbitrary functions and classes are not allowed.
- Made `Operation` extend `KerasSaveable`; this required moving imports to avoid circular imports
- `Layer` no longer needs to extend `KerasSaveable` directly
- Made feature space `Cross` and `Feature` extend `KerasSaveable`
- Also disallow public function `enable_unsafe_deserialization`
…developed dtypes_new_test.py to use requires_tpu marker
…workflow and added a step to install docker client
…ames (keras-team#21526)
- [OpenVINO backend] fix openvino model exported names to match keras names
- Update keras/src/export/openvino.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Summary of Changes
Hello @sachinprasadhs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request primarily focuses on establishing and verifying the infrastructure for running Keras tests on Tensor Processing Units (TPUs). I've set up a Dockerized environment, integrated TPU initialization into our testing framework, and introduced new and modified existing tests to specifically validate dtype behavior on TPUs. Additionally, I've updated dependency configurations to support these new testing capabilities. There's also a minor, unrelated enhancement to the OpenVINO export utility.
Highlights
- TPU Test Environment Setup: I've introduced a new Dockerfile to create a standardized environment for running Keras tests on TPUs, ensuring consistent setup across different testing scenarios.
- Pytest TPU Integration: I've integrated robust TPU initialization and management directly into our `pytest` framework via `conftest.py` fixtures, making it easier to run and manage TPU-dependent tests.
- TPU-Specific Dtype Testing: I've added and modified several dtype-related tests to specifically run on TPUs, verifying that Keras's type promotion rules behave as expected in a distributed TPU context.
- Dependency Management for TPU: I've updated our dependency management by adding new requirements files tailored for JAX and TensorFlow TPU environments, streamlining the setup for these specialized configurations.
- OpenVINO Export Enhancement: I've also included an unrelated change to the OpenVINO export functionality, adding the ability to set friendly names for model inputs during conversion.
Code Review
This pull request sets up testing on TPUs. It introduces several new test files and fixtures for this purpose. My main feedback is to consolidate the TPU testing strategy. Currently, there are multiple approaches for TPU initialization (in `conftest.py`, `dtypes_test.py`, `dtypes_TPU_test.py`), leading to duplicated and sometimes inconsistent code. Using the pytest fixture from `conftest.py` seems like the most robust and reusable approach. Additionally, there are several hardcoded values like the TPU name that should be made configurable. Finally, some generated files like logs and test lists seem to have been accidentally committed and should be removed.
```python
os.environ["KERAS_BACKEND"] = "tensorflow"  # Moved to test_case module in Keras

# Set TPU_NAME if connecting to a specific TPU worker
os.environ["TPU_NAME"] = "harshith-tf-4"
# JAX_PLATFORMS is typically for JAX-specific environments, not directly for TF/Keras TPU.
os.environ["JAX_PLATFORMS"] = ""
```
Setting environment variables at the module level can lead to side effects that are hard to debug, as it affects the entire test session globally. It's better to configure these settings outside the test suite or, if necessary, within a test setup method using a context manager or `monkeypatch` to isolate the changes.
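As a sketch of that isolation, the stdlib's `unittest.mock.patch.dict` (an alternative to pytest's `monkeypatch.setenv`) scopes an environment change to a single test; the value `"my-tpu"` below is illustrative, not a name from the PR:

```python
import os
from unittest import mock

def tpu_name_during_test():
    # The env var exists only inside this context manager; on exit,
    # os.environ is restored to exactly its prior state.
    with mock.patch.dict(os.environ, {"TPU_NAME": "my-tpu"}):
        return os.environ["TPU_NAME"]

before = os.environ.get("TPU_NAME")
during = tpu_name_during_test()
after = os.environ.get("TPU_NAME")
print(during)           # my-tpu
print(before == after)  # True: no global side effect leaks out
```

This keeps each test hermetic, which is exactly what a module-level `os.environ[...] = ...` cannot guarantee.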
```python
try:
    tf.config.experimental_disconnect_from_cluster()
except:
    pass
```
Using a bare `except:` can catch system-exiting exceptions like `SystemExit` or `KeyboardInterrupt`, making it harder to debug or stop the program. It's better to catch `Exception` to handle standard errors while letting system-level exceptions propagate.
Suggested change:

```diff
 try:
     tf.config.experimental_disconnect_from_cluster()
-except:
+except Exception:
     pass
```
```python
try:
    tf.config.experimental_reset_memory_stats("TPU_SYSTEM")
except:
    pass
```
Using a bare `except:` can catch system-exiting exceptions like `SystemExit` or `KeyboardInterrupt`, making it harder to debug or stop the program. It's better to catch `Exception` to handle standard errors while letting system-level exceptions propagate.
Suggested change:

```diff
 try:
     tf.config.experimental_reset_memory_stats("TPU_SYSTEM")
-except:
+except Exception:
     pass
```
```python
import tensorflow as tf

os.environ["TPU_NAME"] = "harshith-tf-4"
```
```python
@classmethod
def setUpClass(cls):
    super().setUpClass()
    cls.tpu_available = False
    cls.tpu_strategy = None

    # Only attempt TPU initialization if the Keras backend is TensorFlow
    if backend.backend() == "tensorflow":
        print("\nAttempting TPU initialization from DtypesTest.setUpClass...")
        try:
            # Use empty string '' for auto-detection or 'grpc://<ip_address>:8470'
            # or your specific TPU_NAME from env var
            resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
            tf.config.experimental_connect_to_cluster(resolver)
            tf.tpu.experimental.initialize_tpu_system(resolver)
            cls.tpu_strategy = tf.distribute.TPUStrategy(resolver)
            cls.tpu_available = True
            print("✓ TPU initialization successful from DtypesTest.setUpClass!")
            print(f"Number of TPU devices: {cls.tpu_strategy.num_replicas_in_sync}")
            print(f"Logical TPU devices: {tf.config.list_logical_devices('TPU')}")
        except Exception as e:
            print(f"✗ TPU initialization failed from DtypesTest.setUpClass: {e}")
            print("Falling back to CPU/GPU testing for this class.")
            cls.tpu_available = False
    else:
        print(f"Skipping TPU initialization for backend: {backend.backend()}")
```
This `setUpClass` method duplicates the TPU initialization logic. This is the third implementation of it in this PR. It's important to have a single, reliable way to set up the TPU environment for tests to avoid inconsistencies and maintenance overhead. The pytest fixture in `conftest.py` is the recommended approach.
```python
# Ensure the backend is set to TensorFlow
os.environ["KERAS_BACKEND"] = "tensorflow"

os.environ["TPU_NAME"] = "harshith-tf-4"
```
```python
@classmethod
def setUpClass(cls):
    super().setUpClass()
    cls.tpu_available = False
    cls.tpu_strategy = None

    try:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        cls.tpu_strategy = tf.distribute.TPUStrategy(resolver)
        cls.tpu_available = True
        print("✓ TPU initialization successful!")
        print(f"Number of TPU devices: {cls.tpu_strategy.num_replicas_in_sync}")
        print(f"Logical TPU devices: {tf.config.list_logical_devices('TPU')}")
    except Exception as e:
        print(f"✗ TPU initialization failed: {e}")
        print("Falling back to CPU/GPU testing")
        cls.tpu_available = False
```
This `setUpClass` method duplicates the TPU initialization logic that is also present in `conftest.py` and `keras/src/backend/common/dtypes_test.py`. The logic here is also less robust as it lacks the retry mechanism. To keep the code DRY and maintainable, it would be better to consolidate on a single TPU setup mechanism, preferably the pytest fixture defined in `conftest.py`, and use it across all TPU tests with the `@pytest.mark.requires_tpu` marker.
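A minimal sketch of what that consolidated `conftest.py` could look like, assuming a session-scoped fixture plus marker registration. The names `tpu_strategy` and `requires_tpu` mirror the review's suggestion; the initialization sequence reuses the same TensorFlow APIs shown in the diffs above, but this is an illustrative sketch, not the PR's actual file:

```python
# Hypothetical conftest.py sketch: one session-scoped fixture replaces the
# per-class setUpClass copies. Tests opt in with @pytest.mark.requires_tpu.
import pytest

def pytest_configure(config):
    # Register the marker so pytest does not emit an unknown-marker warning.
    config.addinivalue_line(
        "markers", "requires_tpu: test needs an initialized TPU runtime"
    )

@pytest.fixture(scope="session")
def tpu_strategy():
    # Skip cleanly on machines without TensorFlow or without a reachable TPU.
    tf = pytest.importorskip("tensorflow")
    try:
        # Same initialization sequence as the diffs, done exactly once per session.
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        return tf.distribute.TPUStrategy(resolver)
    except Exception as e:
        pytest.skip(f"TPU unavailable: {e}")
```

A test would then accept `tpu_strategy` as an argument and carry the `@pytest.mark.requires_tpu` marker, so CI runners without TPUs skip it instead of failing.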
```python
from keras.src.testing import test_case
from keras.src.testing.test_utils import named_product

os.environ["TPU_NAME"] = "harshith-tf-4"
```
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master   #21528      +/-   ##
===========================================
- Coverage   82.72%   64.28%   -18.44%
===========================================
  Files         567      568        +1
  Lines       56245    56324       +79
  Branches     8790     8802       +12
===========================================
- Hits        46527    36210    -10317
- Misses       7561    18215    +10654
+ Partials     2157     1899      -258
```
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
Draft setup, testing only.