Skip to content

fix(BA-1840): Make cloud provider detection future-proof #5086

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jul 15, 2025

Conversation

achimnol
Copy link
Member

@achimnol achimnol commented Jul 12, 2025

resolves #5085 (BA-1840)

This PR modernizes ai.backend.common.networking.curl() and ai.backend.common.identity.{detect_cloud(), get_instance_*()} functions.

  • Use the officially documented metadata URLs for each cloud provider instead of relying on hacky sysfs-based DMI vendor detection to make it future-proof.
    • Tested actual working on Azure (DSv5), AWS (t3), and a local Linux VM.
  • Update the Azure/GCP instance ID generation logic to combine the VM name and BASE-32 encoded value of the last 5 bytes of VM UUID (e.g., i-my-vm-name-mfqwcylb) as the VM name is only guaranteed to be unique within resource groups where resource groups are just tags.
  • Use asyncio.staggered.staggered_race() to run the happy-eyeballs logic over multiple cloud-provider detection routines.
    • NOTE: You have to re-raise (or "skip catching") exceptions if you want to indicate them as explicit suspension of each coroutine.
  • Use aiohttp.ClientTimeout(connect=...) instead of async with asyncio.timeout() nor classic timeout=... parameter to explicitly set timeouts on the connection attempts only, excluding response parsing and processing.
  • Use raise_for_status=True and catch aiohttp.ClientError which includes aiohttp.ClientResponseError.
    • Update test_curl_* test cases to use aioresponses to make raise_for_status=True option working in tests.
  • Ensure importing ai.backend.common.identity module working regardless whether there is a running event loop or not.
  • Avoid using assert and handling AssertionError for mandated runtime checks as they may be omitted with Python runtime's optimization flags.

Example

AWS

current_provider = amazon
instance_id      = i-010a01d9e13073306
instance_ip      = 172.26.8.181
instance_type    = t3.xlarge
instance_region  = amazon/ap-northeast-2

Azure

current_provider = azure
instance_id      = i-base-image-test-p3l253at
instance_ip      = 10.130.0.5
instance_type    = Standard_D2ads_v5
instance_region  = azure/KoreaCentral

GCP

current_provider = google
instance_id      = i-instance-20250713-054151-663n56lh
instance_ip      = 10.132.0.2
instance_type    = e2-highcpu-8
instance_region  = google/asia-northeast3-c

Local VM

current_provider = None
instance_id      = i-localdev
instance_ip      = 127.0.1.1
instance_type    = default
instance_region  = local

Checklist: (if applicable)

  • Milestone metadata specifying the target backport version
  • Mention to the original issue

@Copilot Copilot AI review requested due to automatic review settings July 12, 2025 14:51
@github-actions github-actions bot added size:L 100~500 LoC comp:common Related to Common component labels Jul 12, 2025
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors cloud provider detection to be async and future-proof, replacing local heuristics with concurrent metadata endpoint probes and standardizing return values using a new CloudProvider enum.

  • Introduces CloudProvider enum and per-provider async detection helpers
  • Replaces synchronous detect_cloud with asyncio.staggered_race for fastest metadata lookup
  • Enhances curl helper with aiohttp timeouts, raise_for_status, and type-safe overloads

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
src/ai/backend/common/networking.py Updated curl to use aiohttp.ClientTimeout, added overloads
src/ai/backend/common/identity.py Rewrote detect_cloud with async sub-detectors, added CloudProvider enum and asyncio.staggered_race
Comments suppressed due to low confidence (2)

src/ai/backend/common/identity.py:97

  • [nitpick] Consider adding unit tests or integration tests for detect_cloud to validate correct provider detection and fallback behavior across AWS, Azure, and GCP.
async def detect_cloud() -> Optional[CloudProvider]:

src/ai/backend/common/networking.py:23

  • The runtime annotation references yarl.URL but yarl is only imported under TYPE_CHECKING. Either import yarl at runtime or enable postponed annotations (from __future__ import annotations) to avoid a NameError.
    url: str | yarl.URL,

@achimnol achimnol requested a review from HyeockJinKim July 12, 2025 16:39
@achimnol achimnol force-pushed the topic/fix-cloud-detection branch from 470e1aa to 2a646f2 Compare July 12, 2025 17:28
@achimnol achimnol added this to the 25.6 milestone Jul 13, 2025
@HyeockJinKim HyeockJinKim added this pull request to the merge queue Jul 15, 2025
Merged via the queue into main with commit b7708f4 Jul 15, 2025
28 checks passed
@HyeockJinKim HyeockJinKim deleted the topic/fix-cloud-detection branch July 15, 2025 10:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
comp:common Related to Common component size:L 100~500 LoC
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix cloud provider detection to be compatible with current Azure and future-proof
2 participants