feat: add upgrade cluster playbook#3102
Conversation
|
Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: redscholar The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive cluster upgrade feature, including a new CLI command, an upgrade playbook, and specialized roles for component backups and version validations. The review identifies several technical improvements: addressing potential 'undefined variable' errors in CRI pre-checks for non-cluster hosts, correcting the use of 'run_once' on tasks relying on host-specific etcd data, removing redundant fact-gathering steps to improve performance, and replacing bash-specific syntax with portable shell scripts for better compatibility across environments.
| when: | ||
| - .cri.container_manager | eq "containerd" | ||
| - .upgrade.cri | not | ||
| - .upgrade | default dict | empty | not | ||
| - .containerd_current_version.stdout | empty | not |
There was a problem hiding this comment.
The variable .containerd_current_version is only registered for nodes in the k8s_cluster group (see builtin/core/roles/defaults/tasks/main.yaml). Since this precheck role is executed for all hosts in the playbook, it will cause an 'undefined variable' error on hosts that are not part of the Kubernetes cluster (e.g., standalone load balancers or external etcd nodes). You should add a check to ensure the host belongs to the k8s_cluster group before accessing this variable.
- name: CRI | Validate installed containerd version when not upgrading cri
when:
- .groups.k8s_cluster | default list | has .inventory_hostname
- .cri.container_manager | eq "containerd"
- .upgrade.cri | not
- .upgrade | default dict | empty | not
- .containerd_current_version.stdout | default "" | empty | not| run_once: true | ||
| when: | ||
| - .upgrade.etcd | not | ||
| - .upgrade | default dict | empty | not | ||
| - .etcd_install_version.error | empty |
There was a problem hiding this comment.
Using run_once: true on a task that relies on host-specific variables like .etcd_install_version is risky. If the first host in the play's inventory is not an etcd node, this validation will be skipped entirely. Additionally, if the first host is not an etcd node, .etcd_install_version will be undefined, leading to a playbook execution error. It is better to remove run_once: true and scope the task to the etcd group to ensure every etcd node is validated.
- name: ETCD | Validate installed etcd version is sufficient (when not upgrading etcd)
when:
- .groups.etcd | default list | has .inventory_hostname
- .upgrade.etcd | not
- .upgrade | default dict | empty | not
- .etcd_install_version.error | empty| - hosts: | ||
| - etcd | ||
| serial: 1 | ||
| gather_facts: true |
There was a problem hiding this comment.
The gather_facts: true directive is used in multiple plays throughout this playbook. Since facts are gathered for all hosts in the first play (line 4), they are cached and available for the remainder of the playbook execution. Removing redundant fact gathering in subsequent plays (lines 38, 57, 67, 84) will improve the overall performance of the upgrade process.
| for ((i=1; i<=30; i++)); do | ||
| if systemctl is-active kubelet.service >/dev/null 2>&1; then | ||
| echo "✅ kubelet is active" | ||
| exit 0 | ||
| fi | ||
| sleep 5 | ||
| done |
There was a problem hiding this comment.
The loop uses bash-specific syntax ((i=1; i<=30; i++)). While KubeKey environments typically have bash, it is safer and more portable to use a standard while loop or the seq command to ensure compatibility across different Linux distributions where /bin/sh might not be bash.
i=1
while [ $i -le 30 ]; do
if systemctl is-active kubelet.service >/dev/null 2>&1; then
echo "✅ kubelet is active"
exit 0
fi
i=$((i+1))
sleep 5
donef6922b9 to
4e85d62
Compare
48b400f to
0cd9d7f
Compare
3536b85 to
ac3fe8d
Compare
Signed-off-by: redscholar <blacktiledhouse@gmail.com>
ac3fe8d to
09c4f02
Compare
|
|
@redscholar: PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |


What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for reviewers:
Does this PR introduced a user-facing change?
Additional documentation, usage docs, etc.: