-
Notifications
You must be signed in to change notification settings - Fork 4.3k
Description
We're observing behavior where aws-cli downloads a corrupt file from S3 if the file is replaced mid-download. We're thinking this happens because of multipart downloads -- each part fetched is consistent with some version of the file, but some of the parts are coming from different versions of the file.
In the end, we end up with aws-cli exiting zero but a corrupted file that produces errors further along in our processing.
Reproduction
This reproduces the issue almost 100% of the time:
-
Make a virtualenv and install the latest aws-cli:
virtualenv -ppython2.7 venv && venv/bin/pip install awscli
-
Make eight files files
f0
throughf7
which each consist of 1GB of a single byte repeated:for i in {0..7}; do dd if=/dev/zero bs=1M count=1000 | tr '\000' "\00${i}" > "f${i}" done
-
Upload some of these files fully, with other uploads still in-progress, to the same key, then start a download of that key using aws-cli. Here's one example script:
#!/usr/bin/time bash set -euo pipefail n="$RANDOM" key="s3://my-bucket/test-${n}" results="result-${n}" # stagger the uploads a bit, start them all in the background for f in f*; do venv/bin/aws s3 cp "$f" "$key" & sleep 5 done # wait for the first three to finish wait %1 wait %2 wait %3 venv/bin/aws s3 cp "$key" "$results" my_hash=$(openssl sha1 "$results" | cut -d' ' -f2) echo "my hash is: $my_hash"
-
Compare the hash of the uploaded file with the
f
files, and it's usually different:$ openssl sha1 result-3854 f* SHA1(result-3854)= 99db13b557cb00b7b15410bad1c360e89b530f58 SHA1(f0)= cb19f836c2830ff88ff45694565da65be73b7a69 SHA1(f1)= eee4fdda7e8ac4955b9d4b97fb823c07ba0f73b4 SHA1(f2)= c4f79272572f3fd74800c2d7b83c936646475c2e SHA1(f3)= bc143c1ff8156c7ab8d41f4a700c1f2d16fbadb3 SHA1(f4)= f035d33802e80e293f9cdcc307474809a6c45ad1 SHA1(f5)= d406dcd45e1a85050ce0eaa34f708de4cf25b142 SHA1(f6)= 13f9c9358cd0824395f2890aacdced2e93b33f27 SHA1(f7)= 77bbea0dad5295f4e67c371825f0b1a857cc4b5b
Looking at the hexdump, the downloaded file is a mix of f2, f3, f5, f6, and f7:
$ hd result-3854 00000000 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 |................| * 0e800000 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 03 |................| * 16800000 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 05 |................| * 1a800000 06 06 06 06 06 06 06 06 06 06 06 06 06 06 06 06 |................| * 20800000 07 07 07 07 07 07 07 07 07 07 07 07 07 07 07 07 |................| * 3e800000
I did these steps on Debian stretch.
Expected behavior
It'd be great if aws-cli
could either download the file consistently, or at least detect that it has downloaded a corrupted file and exit nonzero to avoid propagating errors.