feat: add upgrade cluster playbook #3102

Draft
redscholar wants to merge 1 commit into kubesphere:main from redscholar:upgrade_cluster

@redscholar

Contributor

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for reviewers:

Does this PR introduce a user-facing change?


Additional documentation, usage docs, etc.:


@kubesphere-prow

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubesphere-prow kubesphere-prow Bot added the do-not-merge/release-note-label-needed and kind/feature (categorizes issue or PR as related to a new feature) labels May 19, 2026
@kubesphere-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: redscholar

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubesphere-prow kubesphere-prow Bot added the approved (indicates a PR has been approved by an approver from all required OWNERS files) and size/XL (denotes a PR that changes 500-999 lines, ignoring generated files) labels May 19, 2026
@redscholar redscholar marked this pull request as draft May 19, 2026 09:14
@kubesphere-prow kubesphere-prow Bot added the do-not-merge/work-in-progress (indicates that a PR should not merge because it is a work in progress) label May 19, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a comprehensive cluster upgrade feature, including a new CLI command, an upgrade playbook, and specialized roles for component backups and version validations. The review identifies several technical improvements: addressing potential 'undefined variable' errors in CRI pre-checks for non-cluster hosts, correcting the use of 'run_once' on tasks relying on host-specific etcd data, removing redundant fact-gathering steps to improve performance, and replacing bash-specific syntax with portable shell scripts for better compatibility across environments.

Comment on lines +19 to +23
when:
  - .cri.container_manager | eq "containerd"
  - .upgrade.cri | not
  - .upgrade | default dict | empty | not
  - .containerd_current_version.stdout | empty | not

high

The variable .containerd_current_version is only registered for nodes in the k8s_cluster group (see builtin/core/roles/defaults/tasks/main.yaml). Since this precheck role is executed for all hosts in the playbook, it will cause an 'undefined variable' error on hosts that are not part of the Kubernetes cluster (e.g., standalone load balancers or external etcd nodes). You should add a check to ensure the host belongs to the k8s_cluster group before accessing this variable.

- name: CRI | Validate installed containerd version when not upgrading cri
  when:
    - .groups.k8s_cluster | default list | has .inventory_hostname
    - .cri.container_manager | eq "containerd"
    - .upgrade.cri | not
    - .upgrade | default dict | empty | not
    - .containerd_current_version.stdout | default "" | empty | not

Comment on lines +29 to +33
run_once: true
when:
  - .upgrade.etcd | not
  - .upgrade | default dict | empty | not
  - .etcd_install_version.error | empty

high

Using run_once: true on a task that relies on host-specific variables like .etcd_install_version is risky. With run_once, the task is evaluated only on the first host in the play. If that host is not an etcd node, .etcd_install_version will be undefined there, so the validation is either skipped entirely or fails with a playbook execution error. It is better to remove run_once: true and scope the task to the etcd group so that every etcd node is validated.

- name: ETCD | Validate installed etcd version is sufficient (when not upgrading etcd)
  when:
    - .groups.etcd | default list | has .inventory_hostname
    - .upgrade.etcd | not
    - .upgrade | default dict | empty | not
    - .etcd_install_version.error | empty

- hosts:
    - etcd
  serial: 1
  gather_facts: true

medium

The gather_facts: true directive is used in multiple plays throughout this playbook. Since facts are gathered for all hosts in the first play (line 4), they are cached and available for the remainder of the playbook execution. Removing redundant fact gathering in subsequent plays (lines 38, 57, 67, 84) will improve the overall performance of the upgrade process.

Comment on lines +53 to +59
for ((i=1; i<=30; i++)); do
  if systemctl is-active kubelet.service >/dev/null 2>&1; then
    echo "✅ kubelet is active"
    exit 0
  fi
  sleep 5
done

medium

The loop uses bash-specific syntax ((i=1; i<=30; i++)). While KubeKey environments typically have bash, it is safer and more portable to use a standard while loop or the seq command to ensure compatibility across different Linux distributions where /bin/sh might not be bash.

    i=1
    while [ $i -le 30 ]; do
      if systemctl is-active kubelet.service >/dev/null 2>&1; then
        echo "✅ kubelet is active"
        exit 0
      fi
      i=$((i+1))
      sleep 5
    done
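
If the same polling pattern recurs elsewhere in the playbook, it could be factored into a small POSIX sh helper. The following is only a sketch: wait_for and its argument order are hypothetical names, not something that exists in the PR. It also makes the timeout case explicit by returning non-zero when every attempt fails, since the excerpt above does not show what happens after the loop ends.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): run a check command up to
# <attempts> times, sleeping <delay> seconds after each failed attempt.
# Returns 0 as soon as the check succeeds, 1 if every attempt fails.
wait_for() {
  attempts=$1
  delay=$2
  shift 2   # the remaining arguments are the check command itself
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

With this helper, the kubelet check from the excerpt would read `wait_for 30 5 systemctl is-active kubelet.service`, and the caller decides how to report a timeout. Only POSIX constructs are used (`[ ]`, `$(( ))`, `"$@"`), so it should behave the same whether /bin/sh is bash, dash, or another POSIX shell.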

@redscholar redscholar force-pushed the upgrade_cluster branch 2 times, most recently from 48b400f to 0cd9d7f Compare May 25, 2026 07:41
@redscholar redscholar force-pushed the upgrade_cluster branch 2 times, most recently from 3536b85 to ac3fe8d Compare June 1, 2026 09:19
Signed-off-by: redscholar <blacktiledhouse@gmail.com>
@sonarqubecloud

sonarqubecloud Bot commented Jun 2, 2026

Quality Gate failed

Failed conditions
9.2% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

@ks-ci-bot

Contributor

@redscholar: PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
do-not-merge/release-note-label-needed
do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
kind/feature: Categorizes issue or PR as related to a new feature.
needs-rebase
size/XL: Denotes a PR that changes 500-999 lines, ignoring generated files.

2 participants