Test agents sometimes run out of space during PR validation #6906

@lbussell

Summary

PR validation agents occasionally run out of disk space during test execution, causing test failures with "No space left on device" errors.

Problem Description

During test runs, free disk space on test agents drops rapidly as Docker images are pulled and built and SDK archives are downloaded and extracted. The agent starts with limited free space (~8 GB of ~73 GB) and can reach critical levels (<1 GB) within a few minutes of test execution.

Example Failure

Failing test run

From the test logs:

##[debug]Agent environment resources - Disk: / Available 8103.96 MB out of 73325.30 MB
...
##[warning]Free disk space on / is lower than 5%; Currently used: 95.38%
...
##[debug]Agent environment resources - Disk: / Available 126.80 MB out of 73325.30 MB
...
System.IO.IOException : No space left on device : '/tmp/8eedb88b-4e97-4380-95ea-924d6df37d13/packs/Microsoft.AspNetCore.App.Ref/11.0.0-alpha.1.25609.108/ref/net10.0/System.Security.Cryptography.xml'

The test SdkImageTests.VerifyDotnetFolderContents failed because extracting SDK archives to /tmp exhausted disk space.
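
To make the numbers in the log excerpt easier to compare, here is a small sketch in Python (used only for illustration; it is not part of the repository) that parses the two "Agent environment resources" lines quoted above and converts them to a percent-used figure. The names parse_disk_line and percent_used are hypothetical, and the regular expression assumes the line format shown in the excerpt.

```python
import re

# Matches the agent resource line as it appears in the excerpt, e.g.
# "##[debug]Agent environment resources - Disk: / Available 8103.96 MB out of 73325.30 MB"
DISK_LINE = re.compile(r"Disk: (\S+) Available ([\d.]+) MB out of ([\d.]+) MB")


def parse_disk_line(line):
    """Return (mount, available_mb, total_mb), or None if the line does not match."""
    match = DISK_LINE.search(line)
    if match is None:
        return None
    return match.group(1), float(match.group(2)), float(match.group(3))


def percent_used(available_mb, total_mb):
    """Percentage of the disk in use, derived from the available/total pair."""
    return 100.0 * (1.0 - available_mb / total_mb)


# The two samples quoted in the issue.
log_lines = [
    "##[debug]Agent environment resources - Disk: / Available 8103.96 MB out of 73325.30 MB",
    "##[debug]Agent environment resources - Disk: / Available 126.80 MB out of 73325.30 MB",
]

for line in log_lines:
    mount, available, total = parse_disk_line(line)
    print(f"{mount}: {available:.2f} MB free, {percent_used(available, total):.2f}% used")
```

By this arithmetic the first sample is roughly 89% used and the last roughly 99.8% used, so the 95.38% warning falls between the two samples, and the agent was already close to the 5% free-space warning threshold when the job began.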

Root Causes

  1. Docker images accumulate - Building and pulling many images consumes significant disk space
  2. SDK archives are downloaded to temp - Tests download and extract large SDK archives (~200MB+ each) to /tmp
  3. Cleanup only runs at job boundaries - Docker cleanup happens in init-docker-linux.yml and cleanup-docker-linux.yml, but not during test execution
  4. No mid-test cleanup mechanism - There's no way to clean up between individual tests

Proposed Solutions

  1. Add periodic Docker cleanup during tests - Add a PruneUnusedImages() method to DockerHelper.cs that can be called between tests when disk space is low

  2. Clean up SDK archive temp files immediately - Ensure the TempFolderContext in SdkImageTests.cs#L283-L298 is disposed promptly, and consider streaming extraction so archives are not fully written to /tmp first

  3. Delete intermediate test images more aggressively - Add docker image prune -f after deleting tagged images in ProjectTemplateTestScenario.cs#L165

  4. Enhance existing cleanup scripts - Add temp file cleanup to cleanup-docker-linux.yml and Invoke-CleanupDocker.ps1

  5. Add disk space monitoring/early exit - Add a CheckDiskSpace() method to DockerHelper.cs that warns or fails fast if available space is critically low
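
As a rough illustration of how solutions 1 and 5 could fit together, the sketch below checks free space, prunes images when space is low, and fails fast if pruning does not help. It is written in Python for brevity and is not the C# that would live in DockerHelper.cs; check_disk_space and prune_unused_images are hypothetical analogues of the proposed CheckDiskSpace() and PruneUnusedImages() methods, and both thresholds are assumed values, not figures from the pipeline.

```python
import shutil
import subprocess

# Assumed thresholds for illustration only; the issue does not specify values.
WARN_FREE_MB = 4096
FAIL_FREE_MB = 1024


def free_mb(path="/"):
    """Free space on the filesystem containing `path`, in MB."""
    return shutil.disk_usage(path).free / (1024 * 1024)


def classify(free, warn=WARN_FREE_MB, fail=FAIL_FREE_MB):
    """Map a free-space reading to 'ok', 'warn', or 'fail'."""
    if free < fail:
        return "fail"
    if free < warn:
        return "warn"
    return "ok"


def prune_unused_images(run=subprocess.run):
    """Run `docker image prune -f`.

    Without `-a` this removes only dangling images; tagged images that are
    merely unused are left in place.
    """
    return run(["docker", "image", "prune", "-f"], check=False)


def check_disk_space(path="/", run=subprocess.run, measure=free_mb):
    """Prune when space is low, then fail fast if it is still critical.

    `run` and `measure` are injectable so the logic can be exercised
    without Docker or a nearly full disk.
    """
    state = classify(measure(path))
    if state == "ok":
        return state
    prune_unused_images(run)
    state = classify(measure(path))
    if state == "fail":
        raise RuntimeError(f"Free disk space on {path} is critically low after pruning")
    return state
```

A test fixture could call check_disk_space() before each test that pulls images or extracts an SDK archive. Failing with an explicit message at that point is arguably easier to diagnose than an IOException raised midway through extraction.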


Status: Done
