Skip to content

Fix/node check issue#59

Merged
devalpatel67 merged 2 commits intomainfrom
fix/node-check-issue
Oct 28, 2025
Merged

Fix/node check issue#59
devalpatel67 merged 2 commits intomainfrom
fix/node-check-issue

Conversation

@devalpatel67
Copy link
Copy Markdown
Collaborator

@devalpatel67 devalpatel67 commented Oct 28, 2025

Summary by CodeRabbit

  • Bug Fixes

    • Fixed node status parsing to correctly retrieve synchronization information.
  • Chores

    • Updated default version values for core components and related services to latest releases.
    • Added new operator version configuration attributes for enhanced deployment control.
    • Improved pod upgrade reliability with additional stabilization delay.

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Oct 28, 2025

Walkthrough

Configuration version defaults updated for Akash components (v0.38.1→v1.0.0, chart versions bumped). Two new operator version attributes added. JSON parsing key for node sync status renamed (SyncInfo→sync_info). Operator upgrade flow modified to configure image tags from Config and insert 20-second waits for pod settlement.

Changes

Cohort / File(s) Summary
Configuration version and operator attribute updates
application/config/config.py
Bumped AKASH_VERSION from v0.38.1 to v1.0.0, AKASH_NODE_HELM_CHART_VERSION from 12.0.3 to 14.0.0, PROVIDER_SERVICES_VERSION from v0.6.10 to v0.10.1, and PROVIDER_SERVICES_HELM_CHART_VERSION from 11.6.0 to 14.0.3. Added PROVIDER_HOSTNAME_OPERATOR_VERSION and PROVIDER_INVENTORY_OPERATOR_VERSION (both v0.10.0).
Node status sync info key renaming
application/service/provider_service.py
Changed JSON parsing keys from node_status["SyncInfo"] to node_status["sync_info"] and catching_up nested access accordingly, affecting node readiness determination and sync state logging.
Operator upgrade flow with versioning and timing
application/service/upgrade_service.py
Configured hostname and inventory operator upgrades to set image.tag from Config (PROVIDER_HOSTNAME_OPERATOR_VERSION and PROVIDER_INVENTORY_OPERATOR_VERSION). Added 20-second sleep delays before pod version verification in upgrade and verification sections to allow pods to settle.

Sequence Diagram(s)

sequenceDiagram
    participant Upgrade as Upgrade Service
    participant Config as Config
    participant Helm as Helm
    participant Pods as Operator Pods
    participant Verify as Verification

    Upgrade->>Config: Fetch PROVIDER_HOSTNAME_OPERATOR_VERSION
    Config-->>Upgrade: v0.10.0
    Upgrade->>Config: Fetch PROVIDER_INVENTORY_OPERATOR_VERSION
    Config-->>Upgrade: v0.10.0
    
    Upgrade->>Helm: helm upgrade --set image.tag=v0.10.0 (hostname)
    Helm-->>Upgrade: Command executed
    Upgrade->>Helm: helm upgrade --set image.tag=v0.10.0 (inventory)
    Helm-->>Upgrade: Command executed
    
    rect rgb(200, 220, 255)
        Note over Upgrade,Pods: Wait for pod rollout (20s)
        Upgrade->>Pods: sleep 20 seconds
        Pods-->>Upgrade: Ready
    end
    
    Upgrade->>Verify: Verify pod versions
    Verify->>Pods: Capture pod version output
    Pods-->>Verify: Version info
    Verify-->>Upgrade: Verification complete
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Config version changes are straightforward but span multiple component versions—verify version compatibility matrix across Akash and provider services.
  • Node status key renaming requires verification that all downstream code handling sync_info is compatible with the new lowercase snake_case format.
  • Operator upgrade timing logic introduces sleep-based synchronization—confirm 20-second delays are adequate for pod rollout across different environments and verify no race conditions exist between upgrade command execution and pod readiness checks.

Possibly related PRs

  • sync_info key change #58: Directly related—modifies node_status JSON key parsing in provider_service.py (SyncInfo→sync_info) alongside latest_block_height and catching_up access.
  • feat(provider): upgrade for provider version 0.8.2 #50: Related—modifies provider upgrade logic in upgrade_service.py including operator upgrade commands and versioning.
  • Fix/compare version #57: Related—modifies upgrade_service.py operator upgrade flow but with opposite approach (removed vs. added --set image.tag overrides and altered install/uninstall behavior).

Suggested reviewers

  • jigar-arc10

Poem

🐰 Hops with glee through versions bright,
From v0.38.1 to v1.0 in sight!
Operators dressed in fresh new tags,
Twenty-second naps, no more lag bags,
sync_info whispers, lowercase and neat—
This upgrade is quite the treat!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "Fix/node check issue" is related to a significant change in the pull request. The primary functional fix involves correcting JSON parsing in provider_service.py, where node status keys are changed from "SyncInfo" to "sync_info", which directly affects how node readiness is determined. This aligns with the title's reference to fixing a node check issue. However, the title is somewhat vague and doesn't specify what the actual node check problem is, making it difficult for someone scanning the PR history to understand the exact nature of the fix without further investigation. Additionally, the PR includes multiple other substantial changes like version bumps and operator upgrade logic enhancements that the title doesn't acknowledge.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/node-check-issue

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@devalpatel67 devalpatel67 merged commit 6ef1883 into main Oct 28, 2025
1 check was pending
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
application/service/upgrade_service.py (1)

172-174: Bug confirmed: SSH connection is closed prematurely inside check_upgrade_status, then reused by calling methods.

Both upgrade_network (line 178) and upgrade_provider (line 283) call check_upgrade_status(ssh_client), then immediately reuse the same ssh_client in subsequent run_ssh_command() calls (lines 190+ and 297+ respectively). The finally block at lines 172-174 closes the connection, causing these subsequent calls to fail.

Fix: Remove the ssh_client.close() from check_upgrade_status's finally block. Callers already have their own finally blocks (verified at line 279) that properly close the connection after they are done using it.

         finally:
-            ssh_client.close()
+            # Caller is responsible for closing ssh_client
🧹 Nitpick comments (4)
application/config/config.py (1)

37-38: New operator version envs: confirm tag format and centralize canonicalization.
You strip a leading "v" later before --set image.tag. Either store canonical tags here (without "v") or add a small helper (e.g., canonical_tag(v)->strip leading v). Reduces repeated lstrip and avoids mismatches.

If tags in your registry already include "v", confirm whether charts expect "v"-less tags; behavior differs per image.

application/service/provider_service.py (2)

423-430: Make sync parsing backward-compatible and resilient.
Support both "sync_info" and legacy "SyncInfo"; guard against missing/str values to avoid KeyError/ValueError loops.

Apply:

-                    node_height = int(node_status["sync_info"]["latest_block_height"])
+                    sync = node_status.get("sync_info") or node_status.get("SyncInfo") or {}
+                    node_height = int(str(sync.get("latest_block_height", "0")))
...
-                    if node_status["sync_info"]["catching_up"]:
+                    if bool(sync.get("catching_up", True)):

407-411: Avoid TTY for non-interactive exec.
Drop -it to prevent TTY artifacts in JSON parsing and flaky hangs over SSH.

-                    "kubectl exec -it akash-node-1-0 -n akash-services -c akash-node -- akash status",
+                    "kubectl exec akash-node-1-0 -n akash-services -c akash-node -- akash status",
application/service/upgrade_service.py (1)

365-366: Replace fixed sleep with rollout status checks.
Static 20s waits are brittle. Wait for deployments to finish rolling out instead.

-            log.info("Waiting for 20 seconds to allow pods to be upgraded...")
-            time.sleep(20) # Wait for the pods to be upgraded
+            log.info("Waiting for pods to complete rollout...")
+            run_ssh_command(ssh_client, "kubectl -n akash-services rollout status deploy/akash-provider --timeout=2m", True, task_id=task_id)
+            run_ssh_command(ssh_client, "kubectl -n akash-services get pods -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image' | grep -v akash-node", True, task_id=task_id)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between af2d003 and 0d3c572.

📒 Files selected for processing (3)
  • application/config/config.py (1 hunks)
  • application/service/provider_service.py (1 hunks)
  • application/service/upgrade_service.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
application/service/upgrade_service.py (2)
application/config/config.py (1)
  • Config (4-57)
application/utils/ssh_utils.py (1)
  • run_ssh_command (138-204)
🔇 Additional comments (3)
application/service/upgrade_service.py (2)

6-6: No concerns.
Import required for waits.


329-337: Address lstrip edge case and version normalization inconsistency.

The code correctly strips the "v" prefix from config values that default to "v0.10.0" format. However, two issues need attention:

  1. Inconsistent version normalization: Line 607 in provider_service.py uses .replace("v", "") while lines 329 and 335 use .lstrip("v"). The latter is risky—if a version ever contains leading "v" characters (e.g., "vv0.10.0"), lstrip() removes all of them, whereas .replace() is more explicit. Standardize to .replace("v", "") for clarity and robustness.

  2. lstrip() edge case: Although Config defaults include "v" prefix, document or validate that versions cannot start with multiple "v" characters; otherwise, use explicit string operations.

Confirm that the akash-hostname-operator and akash-inventory-operator helm charts (from akash/akash-hostname-operator and akash/akash-inventory-operator repos) expect image.tag without the "v" prefix. Review the helm chart values or chart documentation to ensure this assumption is correct.

application/config/config.py (1)

28-36: Version bumps require manual verification in your cluster—provided script cannot run in sandbox.

The specified Helm chart versions cannot be verified in the sandbox environment due to network restrictions. You must verify these versions exist in your target Helm repositories (akash vs akash-dev) by running the provided verification script in your actual cluster or session to prevent installation failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants