TESTING ONLY (TPU test) #21528


Draft
wants to merge 55 commits into base: master

Conversation

sachinprasadhs (Collaborator)

Draft setup, testing only.

kharshith-k and others added 30 commits June 16, 2025 15:23
updated build file path
updated build file path
Updated tpu_build job of actions.yml with specific runner label
Added container section
Progress bar would always report the starting batch + 1 at the end of the batch. Now it takes into account `steps_per_execution` for the last batch reported.

Fixes keras-team#20861
Using `keras.ops.math.logsumexp` with an int for `axis` in a functional model would throw an error.
…eras-team#21429)

Arbitrary functions and classes are not allowed.

- Made `Operation` extend `KerasSaveable`; this required moving imports to avoid circular imports
- `Layer` no longer needs to extend `KerasSaveable` directly
- Made feature space `Cross` and `Feature` extend `KerasSaveable`
- Also disallow the public function `enable_unsafe_deserialization`
…developed dtypes_new_test.py to use requires_tpu marker
kharshith-k and others added 22 commits July 28, 2025 13:29
…workflow and added a step to install docker client
…ames (keras-team#21526)

* [OpenVINO backend] fix openvino model exported names to match keras names

* Update keras/src/export/openvino.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* Update keras/src/export/openvino.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @sachinprasadhs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily focuses on establishing and verifying the infrastructure for running Keras tests on Tensor Processing Units (TPUs). I've set up a Dockerized environment, integrated TPU initialization into our testing framework, and introduced new and modified existing tests to specifically validate dtype behavior on TPUs. Additionally, I've updated dependency configurations to support these new testing capabilities. There's also a minor, unrelated enhancement to the OpenVINO export utility.

Highlights

  • TPU Test Environment Setup: I've introduced a new Dockerfile to create a standardized environment for running Keras tests on TPUs, ensuring consistent setup across different testing scenarios.
  • Pytest TPU Integration: I've integrated robust TPU initialization and management directly into our pytest framework via conftest.py fixtures, making it easier to run and manage TPU-dependent tests.
  • TPU-Specific Dtype Testing: I've added and modified several dtype-related tests to specifically run on TPUs, verifying that Keras's type promotion rules behave as expected in a distributed TPU context.
  • Dependency Management for TPU: I've updated our dependency management by adding new requirements files tailored for JAX and TensorFlow TPU environments, streamlining the setup for these specialized configurations.
  • OpenVINO Export Enhancement: I've also included an unrelated change to the OpenVINO export functionality, adding the ability to set friendly names for model inputs during conversion.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request sets up testing on TPUs. It introduces several new test files and fixtures for this purpose. My main feedback is to consolidate the TPU testing strategy. Currently, there are multiple approaches for TPU initialization (in conftest.py, dtypes_test.py, dtypes_TPU_test.py), leading to duplicated and sometimes inconsistent code. Using the pytest fixture from conftest.py seems like the most robust and reusable approach. Additionally, there are several hardcoded values like the TPU name that should be made configurable. Finally, some generated files like logs and test lists seem to have been accidentally committed and should be removed.

Comment on lines +16 to +21
os.environ["KERAS_BACKEND"] = "tensorflow" # Moved to test_case module in Keras

# Set TPU_NAME if connecting to a specific TPU worker
os.environ["TPU_NAME"] = "harshith-tf-4"
# JAX_PLATFORMS is typically for JAX-specific environments, not directly for TF/Keras TPU.
os.environ["JAX_PLATFORMS"] = ""


high

Setting environment variables at the module level can lead to side effects that are hard to debug, as it affects the entire test session globally. It's better to configure these settings outside the test suite or, if necessary, within a test setup method using a context manager or monkeypatch to isolate the changes.
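As a sketch of the suggested isolation, using the stdlib `unittest.mock.patch.dict` (pytest's `monkeypatch` fixture works equally well); the helper name `with_tpu_env` is illustrative, not part of the PR:

```python
import os
from unittest import mock

def with_tpu_env(fn):
    """Run `fn` with the TPU-related variables set, then restore os.environ.

    Patching per-test avoids the global side effects of mutating
    os.environ at module import time. The values mirror the PR's snippet.
    """
    patched = {
        "KERAS_BACKEND": "tensorflow",
        "JAX_PLATFORMS": "",
    }
    with mock.patch.dict(os.environ, patched):
        return fn()
```

`patch.dict` records the original values on entry and restores them on exit, so a failing test cannot leak `KERAS_BACKEND` into the rest of the session.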

Comment on lines +72 to +75
try:
    tf.config.experimental_disconnect_from_cluster()
except:
    pass


medium

Using a bare except: can catch system-exiting exceptions like SystemExit or KeyboardInterrupt, making it harder to debug or stop the program. It's better to catch Exception to handle standard errors while letting system-level exceptions propagate.

Suggested change
-try:
-    tf.config.experimental_disconnect_from_cluster()
-except:
-    pass
+try:
+    tf.config.experimental_disconnect_from_cluster()
+except Exception:
+    pass

Comment on lines +77 to +80
try:
    tf.config.experimental_reset_memory_stats("TPU_SYSTEM")
except:
    pass


medium

Using a bare except: can catch system-exiting exceptions like SystemExit or KeyboardInterrupt, making it harder to debug or stop the program. It's better to catch Exception to handle standard errors while letting system-level exceptions propagate.

Suggested change
-try:
-    tf.config.experimental_reset_memory_stats("TPU_SYSTEM")
-except:
-    pass
+try:
+    tf.config.experimental_reset_memory_stats("TPU_SYSTEM")
+except Exception:
+    pass


import tensorflow as tf

os.environ["TPU_NAME"] = "harshith-tf-4"


medium

The TPU name is hardcoded. This makes the tests less portable and harder to run in different environments. This should be configured via an environment variable set outside the test code, which TPUClusterResolver can then read automatically.
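A minimal sketch of the suggested change: read the name from the environment with an empty-string fallback, which lets `TPUClusterResolver` auto-detect (the helper name `resolve_tpu_name` is illustrative, not part of the PR):

```python
import os

def resolve_tpu_name(default=""):
    # Read the TPU worker name from the environment instead of hardcoding
    # a value like "harshith-tf-4"; the empty-string default lets
    # TPUClusterResolver fall back to auto-detection.
    return os.environ.get("TPU_NAME", default)
```

The test code can then call `TPUClusterResolver(tpu=resolve_tpu_name())` and run unchanged on any worker.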

Comment on lines +73 to +98
@classmethod
def setUpClass(cls):
    super().setUpClass()
    cls.tpu_available = False
    cls.tpu_strategy = None

    # Only attempt TPU initialization if the Keras backend is TensorFlow
    if backend.backend() == "tensorflow":
        print("\nAttempting TPU initialization from DtypesTest.setUpClass...")
        try:
            # Use empty string '' for auto-detection or 'grpc://<ip_address>:8470'
            # or your specific TPU_NAME from env var
            resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
            tf.config.experimental_connect_to_cluster(resolver)
            tf.tpu.experimental.initialize_tpu_system(resolver)
            cls.tpu_strategy = tf.distribute.TPUStrategy(resolver)
            cls.tpu_available = True
            print("✓ TPU initialization successful from DtypesTest.setUpClass!")
            print(f"Number of TPU devices: {cls.tpu_strategy.num_replicas_in_sync}")
            print(f"Logical TPU devices: {tf.config.list_logical_devices('TPU')}")
        except Exception as e:
            print(f"✗ TPU initialization failed from DtypesTest.setUpClass: {e}")
            print("Falling back to CPU/GPU testing for this class.")
            cls.tpu_available = False
    else:
        print(f"Skipping TPU initialization for backend: {backend.backend()}")


medium

This setUpClass method duplicates the TPU initialization logic. This is the third implementation of it in this PR. It's important to have a single, reliable way to set up the TPU environment for tests to avoid inconsistencies and maintenance overhead. The pytest fixture in conftest.py is the recommended approach.
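One way to consolidate the three copies is a single shared setup/teardown helper that both the conftest.py fixture and any setUpClass can delegate to. Below is a sketch under the assumption that tensorflow may be absent or the TPU unreachable; the name `tpu_session` is illustrative, not the PR's actual code:

```python
import contextlib

@contextlib.contextmanager
def tpu_session(tpu_name=""):
    """Yield a TPUStrategy, or None when no TPU is reachable."""
    try:
        import tensorflow as tf  # lazy import so non-TF backends skip cleanly
    except ImportError:
        yield None
        return
    try:
        # Empty name lets TPUClusterResolver auto-detect or read TPU_NAME.
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu=tpu_name)
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        strategy = tf.distribute.TPUStrategy(resolver)
    except Exception:
        yield None
        return
    try:
        yield strategy
    finally:
        try:
            tf.config.experimental_disconnect_from_cluster()
        except Exception:  # narrow catch instead of a bare except
            pass
```

A conftest.py fixture then reduces to `with tpu_session() as strategy: yield strategy`, and a unittest-style class can enter the same context manager in `setUpClass`, so the init/teardown logic lives in exactly one place.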

# Ensure the backend is set to TensorFlow
os.environ["KERAS_BACKEND"] = "tensorflow"

os.environ["TPU_NAME"] = "harshith-tf-4"


medium

The TPU name is hardcoded. This makes the tests less portable and harder to run in different environments. This should be configured via an environment variable set outside the test code.

Comment on lines +49 to +67
@classmethod
def setUpClass(cls):
    super().setUpClass()
    cls.tpu_available = False
    cls.tpu_strategy = None

    try:
        resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
        tf.config.experimental_connect_to_cluster(resolver)
        tf.tpu.experimental.initialize_tpu_system(resolver)
        cls.tpu_strategy = tf.distribute.TPUStrategy(resolver)
        cls.tpu_available = True
        print("✓ TPU initialization successful!")
        print(f"Number of TPU devices: {cls.tpu_strategy.num_replicas_in_sync}")
        print(f"Logical TPU devices: {tf.config.list_logical_devices('TPU')}")
    except Exception as e:
        print(f"✗ TPU initialization failed: {e}")
        print("Falling back to CPU/GPU testing")
        cls.tpu_available = False


medium

This setUpClass method duplicates the TPU initialization logic that is also present in conftest.py and keras/src/backend/common/dtypes_test.py. The logic here is also less robust as it lacks the retry mechanism. To keep the code DRY and maintainable, it would be better to consolidate on a single TPU setup mechanism, preferably the pytest fixture defined in conftest.py, and use it across all TPU tests with the @pytest.mark.requires_tpu marker.

from keras.src.testing import test_case
from keras.src.testing.test_utils import named_product

os.environ["TPU_NAME"] = "harshith-tf-4"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The TPU name is hardcoded. This makes the tests less portable and harder to run in different environments. This should be configured via an environment variable set outside the test code.

@sachinprasadhs sachinprasadhs marked this pull request as draft July 29, 2025 22:59
@codecov-commenter

Codecov Report

❌ Patch coverage is 20.25316% with 63 lines in your changes missing coverage. Please review.
✅ Project coverage is 64.28%. Comparing base (6bc6203) to head (2acd44c).
⚠️ Report is 2 commits behind head on master.

Files with missing lines                      Patch %   Lines
keras/src/backend/common/dtypes_test_TPU.py   0.00%     60 Missing ⚠️
keras/src/export/openvino.py                  84.21%    2 Missing and 1 partial ⚠️

❗ There is a different number of reports uploaded between BASE (6bc6203) and HEAD (2acd44c).

HEAD has 7 uploads less than BASE

Flag             BASE (6bc6203)   HEAD (2acd44c)
keras            5                2
keras-openvino   1                0
keras-numpy      1                0
keras-torch      1                0
keras-jax        1                0
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #21528       +/-   ##
===========================================
- Coverage   82.72%   64.28%   -18.44%     
===========================================
  Files         567      568        +1     
  Lines       56245    56324       +79     
  Branches     8790     8802       +12     
===========================================
- Hits        46527    36210    -10317     
- Misses       7561    18215    +10654     
+ Partials     2157     1899      -258     
Flag               Coverage Δ
keras              64.28% <20.25%> (-18.24%) ⬇️
keras-jax          ?
keras-numpy        ?
keras-openvino     ?
keras-tensorflow   64.28% <20.25%> (-0.07%) ⬇️
keras-torch        ?

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.

6 participants