Skip to content

HeterogeneousCore/SonicTriton: add RetryActionDiffServer; expose connectToServer; update tests #21

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from

Conversation

trevin-lee
Copy link

Title

HeterogeneousCore/SonicTriton: add RetryActionDiffServer; expose connectToServer; update tests

Body

PR description

  • Add RetryActionDiffServer to switch to an alternative Triton server upon failure.
  • Expose TritonClient::connectToServer(std::string url) and add a testing constructor to enable unit tests.
  • Update BuildFile.xml; remove obsolete tritonRetryActionTest_cfg.py.
  • Expected physics/output changes: none in nominal operation; only affects behavior on retry path.
  • Builds on the retry framework work (related: fastmachinelearning/cmssw#19).

PR validation

  • Builds in CMSSW_15_0_0_pre3 (scram b -j).
  • New retry action compiles; SonicTriton tests continue to pass.
  • Manually exercised retry by providing an alternative server URL; logs show server switch and successful inference.
  • No changes observed in nominal outputs.

Backport

  • Not a backport.

Reviewers: @jmduarte @kpedro88

@jmduarte jmduarte requested a review from kpedro88 August 12, 2025 03:12
@kpedro88
Copy link

Preliminary comments:

  1. do you mean 15_1_0_pre3? (not 15_0)
  2. there are a few outstanding review comments on @kakwok's PR that should be addressed: Test PR for new Retry Framework #19 (review)

Comment on lines +9 to +10
alt_server_url_ = conf.getUntrackedParameter<std::string>("altServerUrl", "");
alt_server_token_ = conf.getUntrackedParameter<std::string>("altServerToken", "");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These parameters should be removed. The alternative server URLs should be obtained from the TritonService, which keeps a master list of all known servers (rather than each module/client keeping its own list inside of its RetryAction(s)).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

n.b. the function TritonClient::updateServer() in #19 is provided for this purpose

</export>

<test name="RetryActionDiffServer_test" command="RetryActionDiffServer.cc"/>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tests should be defined in HeterogeneousCore/SonicTriton/test/BuildFile.xml, not the package-level BuildFile.xml.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the correct syntax is:

<bin file="test_RetryActionDiffServer.cc" name="TestHeterogeneousCoreSonicTritonRetryActionDiffServer">
  <use name="catch2"/>
  <use name="FWCore/ParameterSet"/>
  <use name="HeterogeneousCore/SonicTriton"/>
</bin>

@@ -1,5 +1,6 @@
<test name="TestHeterogeneousCoreSonicTritonProducerCPU" command="unittest.sh ${LOCALTOP} CPU"/>
<test name="TestHeterogeneousCoreSonicTritonProducerGPU" command="unittest.sh ${LOCALTOP} GPU"/>
<test name="TestHeterogeneousCoreSonicTritonRetryAction" command="unittest.sh ${LOCALTOP} CPU tritonRetryActionTest_cfg.py"/>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the file tritonRetryActionTest_cfg.py is not committed here. Preferably, the existing tritonTest_cfg.py should be adapted/extended to perform these tests as well, rather than duplicating functionality.

* parameters.
* @param is_testing A boolean flag to select this constructor.
*/
TritonClient(bool is_testing);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this just be a protected default constructor (with no arguments)? is_testing is never actually used.

};
}

TEST_CASE("Test RetryActionDiffServer Logic", "[RetryActionDiffServer]") {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a nice setup, so I think we should keep it (in addition to adding to tritonTest_cfg.py, which is somewhere between a unit test and a full integration test).

@@ -0,0 +1,109 @@
#define CATCH_CONFIG_MAIN

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this file should be renamed test_RetryActionDiffServer.cc or similar

</export>

<test name="RetryActionDiffServer_test" command="RetryActionDiffServer.cc"/>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and the correct syntax is:

<bin file="test_RetryActionDiffServer.cc" name="TestHeterogeneousCoreSonicTritonRetryActionDiffServer">
  <use name="catch2"/>
  <use name="FWCore/ParameterSet"/>
  <use name="HeterogeneousCore/SonicTriton"/>
</bin>

@@ -7,9 +7,13 @@
<use name="HeterogeneousCore/CUDAUtilities"/>
<use name="triton-inference-client"/>
<use name="protobuf"/>
<use name="catch2"/>

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be moved to test/BuildFile.xml as indicated in other comments (i.e. removed from here)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants