Corrupted downloads when a file on S3 changes mid-download #2321

@chriskuehl

Description

We're seeing aws-cli download a corrupt file from S3 when the object is replaced mid-download. We think this happens because of multipart downloads: each part fetched is consistent with some version of the file, but different parts can come from different versions.

The result is that aws-cli exits zero but leaves behind a corrupted file, which produces errors further along in our processing.
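
The hypothesis above can be illustrated locally without S3. This is only a sketch of the suspected mechanism, not of how aws-cli is actually implemented: two local files stand in for two versions of the object, each "part" is a byte range copied with `dd`, and the function name `simulate_mixed_download` is made up for illustration.

```shell
# Local simulation only; no S3 involved.
# $1, $2: two "versions" of the same object (equal-sized local files)
# $3: output file, $4: part size in bytes
simulate_mixed_download() {
    v1="$1"; v2="$2"; out="$3"; part_size="$4"
    : > "$out"
    # Part 0 is fetched while version 1 is still the current object...
    dd if="$v1" of="$out" bs="$part_size" count=1 skip=0 seek=0 conv=notrunc 2>/dev/null
    # ...then the key is overwritten, so part 1 is served from version 2.
    dd if="$v2" of="$out" bs="$part_size" count=1 skip=1 seek=1 conv=notrunc 2>/dev/null
}
```

Every part is internally consistent, yet the assembled file matches neither version, which is the pattern the hexdump in the reproduction shows.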

Reproduction

This reproduces the issue almost 100% of the time:

  1. Make a virtualenv and install the latest aws-cli:

    virtualenv -ppython2.7 venv && venv/bin/pip install awscli
  2. Make eight files, f0 through f7, each consisting of about 1 GB (1000 MiB) of a single repeated byte:

    for i in {0..7}; do
        dd if=/dev/zero bs=1M count=1000 | tr '\000' "\00${i}" > "f${i}"
    done
  3. Upload some of these files fully, with other uploads still in-progress, to the same key, then start a download of that key using aws-cli. Here's one example script:

    #!/usr/bin/time bash
    set -euo pipefail
    n="$RANDOM"
    key="s3://my-bucket/test-${n}"
    results="result-${n}"
    
    # stagger the uploads a bit, start them all in the background
    for f in f*; do
        venv/bin/aws s3 cp "$f" "$key" &
        sleep 5
    done
    
    # wait for the first three to finish
    wait %1
    wait %2
    wait %3
    
    venv/bin/aws s3 cp "$key" "$results"
    my_hash=$(openssl sha1 "$results" | cut -d' ' -f2)
    echo "my hash is: $my_hash"
  4. Compare the hash of the downloaded file with the hashes of the f files; it usually matches none of them:

    $ openssl sha1 result-3854 f*
    SHA1(result-3854)= 99db13b557cb00b7b15410bad1c360e89b530f58
    SHA1(f0)= cb19f836c2830ff88ff45694565da65be73b7a69
    SHA1(f1)= eee4fdda7e8ac4955b9d4b97fb823c07ba0f73b4
    SHA1(f2)= c4f79272572f3fd74800c2d7b83c936646475c2e
    SHA1(f3)= bc143c1ff8156c7ab8d41f4a700c1f2d16fbadb3
    SHA1(f4)= f035d33802e80e293f9cdcc307474809a6c45ad1
    SHA1(f5)= d406dcd45e1a85050ce0eaa34f708de4cf25b142
    SHA1(f6)= 13f9c9358cd0824395f2890aacdced2e93b33f27
    SHA1(f7)= 77bbea0dad5295f4e67c371825f0b1a857cc4b5b

    Looking at the hexdump, the downloaded file is a mix of f2, f3, f5, f6, and f7:

    $ hd result-3854
    00000000  02 02 02 02 02 02 02 02  02 02 02 02 02 02 02 02  |................|
    *
    0e800000  03 03 03 03 03 03 03 03  03 03 03 03 03 03 03 03  |................|
    *
    16800000  05 05 05 05 05 05 05 05  05 05 05 05 05 05 05 05  |................|
    *
    1a800000  06 06 06 06 06 06 06 06  06 06 06 06 06 06 06 06  |................|
    *
    20800000  07 07 07 07 07 07 07 07  07 07 07 07 07 07 07 07  |................|
    *
    3e800000
    

I did these steps on Debian stretch.
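
Because each f file consists of a single repeated byte, a mixed download in this reproduction can also be detected without hashing. The following is a minimal sketch that only makes sense for these test files; the function name `is_single_version` is hypothetical.

```shell
# Returns 0 if the file consists of one byte value repeated throughout,
# non-zero otherwise (including for an empty file).
is_single_version() {
    file="$1"
    [ -s "$file" ] || return 1
    # Octal value of the first byte, e.g. 002.
    first=$(head -c 1 "$file" | od -An -v -to1 | tr -d ' \n')
    # Delete every occurrence of that byte; anything left over
    # must have come from a different version of the file.
    rest=$(tr -d "\\${first}" < "$file" | wc -c | tr -d ' ')
    [ "$rest" -eq 0 ]
}
```

For example, `is_single_version result-3854 || echo "mixed versions"` would flag the download shown above.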

Expected behavior

It'd be great if aws-cli could either download the file consistently, or at least detect that it has downloaded a corrupted file and exit nonzero to avoid propagating errors.
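
Until something like that exists in aws-cli itself, a caller-side guard is one possible stopgap. The sketch below is a heuristic under stated assumptions, not a fix: it compares the object's ETag before and after the download using `aws s3api head-object`, and treats any change as a failed download. The wrapper name `guarded_download` and its exit code 2 are made up for illustration. It would miss an object that is overwritten and then restored to identical content between the two checks, and on a versioning-enabled bucket, pinning a specific version ID would be a stronger option.

```shell
# Usage: guarded_download BUCKET KEY DEST
# Exits non-zero if the download fails or the ETag changed while it ran.
guarded_download() {
    bucket="$1"; key="$2"; dest="$3"
    before=$(aws s3api head-object --bucket "$bucket" --key "$key" \
        --query ETag --output text) || return 1
    aws s3 cp "s3://${bucket}/${key}" "$dest" || return 1
    after=$(aws s3api head-object --bucket "$bucket" --key "$key" \
        --query ETag --output text) || return 1
    if [ "$before" != "$after" ]; then
        echo "ETag changed during download (${before} -> ${after}); discarding ${dest}" >&2
        rm -f "$dest"
        return 2
    fi
}
```

A caller could retry on exit code 2, since the object was likely being replaced at the time.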

Metadata

    Labels

    feature-request: A feature should be added or improved.
    p3: This is a minor priority issue.
    s3
