Skip to content

Conversation

ZhanruiSunCh
Copy link
Collaborator

@ZhanruiSunCh ZhanruiSunCh commented May 26, 2025

Summary by CodeRabbit

Summary by CodeRabbit

  • New Features

    • Added an optional post-build sanity check for Docker images, including waiting for artifact availability and running automated validation tests.
    • Introduced environment flags to enable or disable sanity checks for development and release images.
    • Improved test job management for different image types and architectures.
  • Enhancements

    • Expanded job name mapping and test stage logic for better pipeline control.

@ZhanruiSunCh ZhanruiSunCh force-pushed the user/zhanruis/0523_add_sanity_check_for_release_image branch from 3e98c7a to 20bbacb Compare May 28, 2025 09:08
@ZhanruiSunCh
Copy link
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6754 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6754 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #4922 (Partly Tested) completed with status: 'FAILURE'

@ZhanruiSunCh ZhanruiSunCh force-pushed the user/zhanruis/0523_add_sanity_check_for_release_image branch from 85176ed to 213ea3e Compare May 29, 2025 03:49
@ZhanruiSunCh
Copy link
Collaborator Author

/bot run --stage-list "Build-Docker-Images"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6865 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6865 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #4994 (Partly Tested) completed with status: 'FAILURE'

@ZhanruiSunCh ZhanruiSunCh marked this pull request as ready for review May 29, 2025 04:06
@tensorrt-cicd
Copy link
Collaborator

PR_Github #6869 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6869 [ run ] completed with state FAILURE
/LLM/PipelineMonitor/L0_MergeRequest_PR pipeline #49 (Partly Tested) completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6872 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6872 [ run ] completed with state FAILURE
/LLM/PipelineMonitor/L0_MergeRequest_PR pipeline #50 (Partly Tested) completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6899 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6899 [ run ] completed with state FAILURE
/LLM/PipelineMonitor/L0_MergeRequest_PR pipeline #51 (Partly Tested) completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6904 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6906 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6904 [ run ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6906 [ run ] completed with state FAILURE
/LLM/PipelineMonitor/L0_MergeRequest_PR pipeline #53 (Partly Tested) completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6911 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6911 [ run ] completed with state FAILURE
/LLM/PipelineMonitor/L0_MergeRequest_PR pipeline #54 (Partly Tested) completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6918 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6918 [ run ] completed with state FAILURE
/LLM/PipelineMonitor/L0_MergeRequest_PR pipeline #55 (Partly Tested) completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6919 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6919 [ run ] completed with state FAILURE
/LLM/PipelineMonitor/L0_MergeRequest_PR pipeline #56 (Partly Tested) completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6990 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #6990 [ run ] completed with state FAILURE
/LLM/PipelineMonitor/L0_MergeRequest_PR pipeline #57 (Partly Tested) completed with status: 'FAILURE'

@tensorrt-cicd
Copy link
Collaborator

PR_Github #7006 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #7009 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #7006 [ run ] completed with state ABORTED

ZhanruiSunCh and others added 10 commits July 18, 2025 03:08
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Co-authored-by: Yanchao Lu <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Co-authored-by: Yanchao Lu <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
Co-authored-by: Yanchao Lu <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
Signed-off-by: ZhanruiSunCh <[email protected]>
@ZhanruiSunCh ZhanruiSunCh force-pushed the user/zhanruis/0523_add_sanity_check_for_release_image branch from 3b37261 to 4584177 Compare July 18, 2025 10:08
@ZhanruiSunCh
Copy link
Collaborator Author

/bot run --extra-stage "Build-Docker-Images" --skip-test

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
jenkins/BuildDockerImage.groovy (1)

516-574: Review artifact polling implementation for potential improvements.

The polling logic is functional but could benefit from some refinements:

  1. Magic numbers: The 60-minute timeout and 60-second polling interval are hardcoded
  2. Error handling: The wget command uses basic error handling but could be more robust
  3. Resource efficiency: The polling approach is reasonable but consider if there are webhook alternatives

Consider extracting the timeout and polling interval as parameters:

-                        def maxWaitMinutes = 60
-                        def pollIntervalSeconds = 60
+                        def maxWaitMinutes = params.artifactWaitTimeoutMinutes ?: 60
+                        def pollIntervalSeconds = params.artifactPollIntervalSeconds ?: 60

Additionally, consider adding more detailed logging for debugging:

+                                echo "Checking ${fileUrl}..."
                                 def exitCode = sh(
                                     script: "wget --spider --quiet --timeout=30 --tries=1 '${fileUrl}'",
                                     returnStatus: true
                                 )
+                                echo "Exit code: ${exitCode}"
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between de3f294 and 4584177.

📒 Files selected for processing (3)
  • jenkins/BuildDockerImage.groovy (8 hunks)
  • jenkins/L0_MergeRequest.groovy (2 hunks)
  • jenkins/L0_Test.groovy (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • jenkins/L0_MergeRequest.groovy
  • jenkins/L0_Test.groovy
🧰 Additional context used
🧠 Learnings (1)
jenkins/BuildDockerImage.groovy (1)
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.374Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (9)
jenkins/BuildDockerImage.groovy (9)

15-15: LGTM: Environment variable initialization follows established pattern.

The ARTIFACT_PATH variable initialization follows the same pattern as UPLOAD_PATH, providing a fallback value when the environment variable is not set.


29-29: LGTM: Boolean flag initialization is appropriate.

The RUN_SANITY_CHECK flag provides a clean way to conditionally enable the new sanity check functionality.


43-50: LGTM: Global variable declaration aligns with pipeline coordination needs.

The addition of IMAGE_KEY_TO_TAG field and its initialization in globalVars follows the established pattern and supports the coordination with test pipelines mentioned in the AI summary.


212-217: LGTM: Conditional logic for post-merge builds is correctly implemented.

The logic correctly gates the NGC artifact usage to post-merge builds only, which prevents potential issues with development builds that might not have the required NGC artifacts available.


271-273: LGTM: imageKeyToTag assignment moved to appropriate location.

Moving the imageKeyToTag assignment inside the build step ensures that the mapping is only created when the build actually succeeds, which aligns with the past review comment about ensuring imageKeyToTag only contains successful builds.


298-300: LGTM: Consistent with the devel image assignment pattern.

The imageKeyToTag assignment for the NGC release image follows the same pattern as the devel image, ensuring consistency in the tagging logic.


440-448: LGTM: Common parameters function reduces code duplication.

The getCommonParameters function appropriately centralizes the parameter mapping that's used for triggering downstream jobs, following DRY principles.


575-608: LGTM: Sanity check stage implementation follows Jenkins best practices.

The stage correctly:

  • Uses conditional execution with when expression
  • Updates globalVars with imageKeyToTag before triggering downstream job
  • Properly handles job parameters and propagation
  • Includes appropriate error handling for failed downstream jobs
  • Uses descriptive logging messages

609-609: LGTM: Stage name updated to reflect security focus.

The stage name change from generic "Register NGC Images" to "Register NGC Images for Security Checks" better describes the purpose of this stage.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12309 [ run ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12309 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #9143 (Partly Tested) completed with status: 'FAILURE'

@ZhanruiSunCh
Copy link
Collaborator Author

/bot skip --comment "Sanity check code passed, failed due to flaky network issue on NSPECT check"

@ZhanruiSunCh ZhanruiSunCh enabled auto-merge (squash) July 21, 2025 07:19
@ZhanruiSunCh
Copy link
Collaborator Author

/bot skip --comment "Sanity check code passed, failed due to flaky network issue on NSPECT check"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12429 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12424 [ skip ] triggered by Bot

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12429 [ skip ] completed with state ABORTED

@tensorrt-cicd
Copy link
Collaborator

PR_Github #12424 [ skip ] completed with state SUCCESS
Skipping testing for commit 4584177

@ZhanruiSunCh ZhanruiSunCh merged commit 3cbc23f into NVIDIA:main Jul 21, 2025
3 checks passed
timlee0212 pushed a commit to timlee0212/TensorRT-LLM that referenced this pull request Jul 21, 2025
…uild wheels for devel image) (NVIDIA#4656)

Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
Co-authored-by: Yanchao Lu <[email protected]>
NVShreyas pushed a commit to NVShreyas/TensorRT-LLM that referenced this pull request Jul 28, 2025
…uild wheels for devel image) (NVIDIA#4656)

Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
Co-authored-by: Yanchao Lu <[email protected]>
Signed-off-by: Shreyas Misra <[email protected]>
Ransiki pushed a commit to Ransiki/TensorRT-LLM that referenced this pull request Jul 29, 2025
…uild wheels for devel image) (NVIDIA#4656)

Signed-off-by: ZhanruiSunCh <[email protected]>
Signed-off-by: Zhanrui Sun <[email protected]>
Co-authored-by: Yanchao Lu <[email protected]>
Signed-off-by: Ransiki Zhang <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants