-
Notifications
You must be signed in to change notification settings - Fork 2.3k
Sidecar and Compact: Out of order block uploads lead to overlaps and Compact critical errors #2753
Description
Thanos, Prometheus and Golang version used:
thanos, version 0.12.2 (branch: HEAD, revision: 52e10c6)
build user: root@c1a6cf60f03d
build date: 20200430-16:24:03
go version: go1.13.6
prometheus, version 2.18.1 (branch: HEAD, revision: ecee9c8abfd118f139014cb1b174b08db3f342cf)
build user: root@2117a9e64a7e
build date: 20200507-16:51:47
go version: go1.14.2
Object Storage Provider:
Google Cloud Storage
What happened:
Compact ran into an overlapped block error:
Jun 05 08:09:29 caller=compact.go:460 msg="critical error detected; halting" err="compaction: group 0@15585302070356120162: pre compaction overlap check: overlaps found while gathering blocks. [mint: 1591185600000, maxt: 1591192800000, range: 2h0m0s, blocks: 2]: <ulid: 01EA1R7GMDZP9XBPQBYBJPES3W, mint: 1591171200000, maxt: 1591200000000, range: 8h0m0s>, <ulid: 01E9XB4ECJ3NPQ08PBS7XHAXZ6, mint: 1591185600000, maxt: 1591192800000, range: 2h0m0s>"
Looking at the source system, it appears that it was was isolated from the internet due to an upstream network failure for several days. Once the network was accessible, it began uploading blocks, but failed on 01E9XB4ECJ3NPQ08PBS7XHAXZ6 at 07:52:19. That block was later uploaded at 08:05:51:
Jun 03 11:00:12 caller=shipper.go:347 msg="upload new block" id=01E9WXCZWJA9TXDEY0JJ6KACNR
Jun 03 13:00:41 caller=sidecar.go:340 err="iter local block metas: check exists: Get https://storage.googleapis.com/storage/v1/b/<bucket>/o/01E9X48Q4J2X91Y6WPDW6ST3T7%2Fmeta.json?alt=json&prettyPrint=false&projection=full: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp: i/o timeout" uploaded=0
...
Jun 05 07:52:19 caller=sidecar.go:340 err="iter local block metas: check exists: Get https://storage.googleapis.com/storage/v1/b/<bucket>/o/01E9X48Q4J2X91Y6WPDW6ST3T7%2Fmeta.json?alt=json&prettyPrint=false&projection=full: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp 172.217.20.106:443: connect: network is unreachable" uploaded=0
Jun 05 07:52:42 caller=shipper.go:347 msg="upload new block" id=01E9X48Q4J2X91Y6WPDW6ST3T7
Jun 05 07:53:10 caller=shipper.go:347 msg="upload new block" id=01E9XB4ECJ3NPQ08PBS7XHAXZ6
Jun 05 07:59:06 caller=shipper.go:313 msg="shipping failed" block=01E9XB4ECJ3NPQ08PBS7XHAXZ6 err="upload chunks: upload file /data/thanos/upload/01E9XB4ECJ3NPQ08PBS7XHAXZ6/chunks/000001 as 01E9XB4ECJ3NPQ08PBS7XHAXZ6/chunks/000001: Post https://storage.googleapis.com/upload/storage/v1/b/<bucket>/o?alt=json&prettyPrint=false&projection=full&uploadType=resumable&upload_id=AAANsUljcCy9pr7Ob_I23MocUc_Iob8dhIv_3X1HwW2PrySiehs-3uXBZLRwAMbkwDh84HqPsxmlgnTopL
Jun 05 07:59:06 caller=shipper.go:347 msg="upload new block" id=01E9XJ05MJNCDZ3WX87WKB85P1
...
Jun 05 08:05:51 caller=sidecar.go:340 err="failed to sync 1 blocks" uploaded=21
Jun 05 08:05:51 caller=shipper.go:347 msg="upload new block" id=01E9XB4ECJ3NPQ08PBS7XHAXZ6
In between the first upload failure of 01E9XB4ECJ3NPQ08PBS7XHAXZ6 and the retry, compact compacted the blocks that were present:
Jun 05 08:05:34 caller=compact.go:633 group="0@{replica=\"oob-osl6508\", site=\"osl65\"}" groupKey=0@15585302070356120162 msg="compaction available and planned; downloading blocks" plan="[/data/compact/0@15585302070356120162/01E9WXCZWJA9TXDEY0JJ6KACNR /data/compact/0@15585302070356120162/01E9X48Q4J2X91Y6WPDW6ST3T7 /data/compact/0@15585302070356120162/01E9XJ05MJNCDZ3WX87WKB85P1]"
Jun 05 08:05:48 caller=fetcher.go:451 component=block.BaseFetcher msg="successfully synchronized block metadata" duration=25.221939741s cached=27921 returned=27921 partial=3
Jun 05 08:05:49 caller=compact.go:692 group="0@{replica=\"oob-osl6508\", site=\"osl65\"}" groupKey=0@15585302070356120162 msg="downloaded and verified blocks; compacting blocks" plan="[/data/compact/0@15585302070356120162/01E9WXCZWJA9TXDEY0JJ6KACNR /data/compact/0@15585302070356120162/01E9X48Q4J2X91Y6WPDW6ST3T7 /data/compact/0@15585302070356120162/01E9XJ05MJNCDZ3WX87WKB85P1]" duration=15.079950156s
Jun 05 08:05:57 caller=compact.go:441 msg="compact blocks" count=3 mint=1591171200000 maxt=1591200000000 ulid=01EA1R7GMDZP9XBPQBYBJPES3W sources="[01E9WXCZWJA9TXDEY0JJ6KACNR 01E9X48Q4J2X91Y6WPDW6ST3T7 01E9XJ05MJNCDZ3WX87WKB85P1]" duration=7.489969252s
Jun 05 08:05:57 caller=compact.go:722 group="0@{replica=\"oob-osl6508\", site=\"osl65\"}" groupKey=0@15585302070356120162 msg="compacted blocks" new=01EA1R7GMDZP9XBPQBYBJPES3W blocks="[/data/compact/0@15585302070356120162/01E9WXCZWJA9TXDEY0JJ6KACNR /data/compact/0@15585302070356120162/01E9X48Q4J2X91Y6WPDW6ST3T7 /data/compact/0@15585302070356120162/01E9XJ05MJNCDZ3WX87WKB85P1]" duration=7.529746624s overlapping_blocks=false
Jun 05 08:06:06 caller=compact.go:764 group="0@{replica=\"oob-osl6508\", site=\"osl65\"}" groupKey=0@15585302070356120162 msg="uploaded block" result_block=01EA1R7GMDZP9XBPQBYBJPES3W duration=8.934421357s
Compact is configured with --consistency-delay=2h --delete-delay=6h
It appears that a transient upload error of one block among many, that were several days old resulted in a race between an uploading sidecar and compact where a compacted block was formed before all blocks that should have been included in it were successfully uploaded.
This overlap issue seems to be at odds with the documented expectation in https://thanos.io/operating/troubleshooting.md/#overlaps
What you expected to happen:
I would expect compact to wait for consistency-delay after a block was uploaded to the object store to make it available for compaction. It appears to rely on the ULID to calculate age rather than the object creation time of metadata.json, which is problematic when the block age does not correspond to its age in the object store.
How to reproduce it (as minimally and precisely as possible):
Set up thanos compact to run against an objstore bucket. Set up a prometheus install with several hours of uncompacted blocks already present. Move one block out of the prometheus data directory. Start thanos sidecar so that it uploads all existing blocks. Watch compact compact the several hours old blocks. Move reserved block back into the prometheus directory and restart thanos sidecar. It should then upload and compact will halt.