block_cache checksum frequently triggered #226

@wangyuyue

Describe the bug
I ran the ssd_perf/graph_cache_leader workload with 64 Navy writer threads and 64 Navy reader threads. After 50 minutes, the terminal output was flooded with "Item header checksum mismatch" errors.

To Reproduce
The configuration file I use:

{
    "cache_config": {
        "cacheSizeMB": 71680,
        "navyReaderThreads": 64,
        "navyWriterThreads": 64,
        "nvmCachePaths": [
            "/dev/nvme0n1"
        ],
        "nvmCacheSizeMB": 1075200,
        "writeAmpDeviceList": [
            "/dev/nvme0n1"
        ],
        "navyBigHashBucketSize": 4096,
        "navyBigHashSizePct": 65,
        "navyBlockSize": 4096,
        "navyParcelMemoryMB": 6048,
        "enableChainedItem": true,
        "htBucketPower": 26,
        "moveOnSlabRelease": false,
        "numPools": 2,
        "poolRebalanceIntervalSec": 5,
        "poolSizes": [
            0.2,
            0.8
        ]
    },
    "test_config": {
        "opDelayNs": 0,
        "opDelayBatch": 3,
        "generator": "online",
        "enableLookaside": true,
        "keyPoolDistribution": [
            0.5228545418419219,
            0.477145458158078
        ],
        "numKeys": 4100000000,
        "numOps": 200000000,
        "numThreads": 16,
        "opPoolDistribution": [
            0.571,
            0.429
        ],
        "poolDistributions": [
            {
                "addChainedRatio": 0.0,
                "delRatio": 0.0,
                "getRatio": 0.7684563460126871,
                "keySizeRange": [
                    8,
                    16
                ],
                "keySizeRangeProbability": [
                    1.0
                ],
                "loneGetRatio": 0.2315436539873129,
                "popDistFile": "fbobj_pop.json",
                "setRatio": 0.0,
                "valSizeDistFile": "fbobj_sizes.json"
            },
            {
                "addChainedRatio": 0.0,
                "delRatio": 0.0,
                "getRatio": 0.8841112719979642,
                "keySizeRange": [
                    8,
                    16
                ],
                "keySizeRangeProbability": [
                    1.0
                ],
                "loneGetRatio": 0.1158887280020357,
                "popDistFile": "assoc_pop.json",
                "setRatio": 0.0,
                "valSizeDistFile": "assoc_sizes.json"
            }
        ]
    }
}
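
Before suspecting the thread counts, it may be worth ruling out a malformed config. The following is a minimal sketch, not part of CacheLib or cachebench. The function name `check_config` is made up for illustration, and it assumes that the per-pool lists should have `numPools` entries, that each distribution list should sum to 1, and that the `*Ratio` fields of each pool are meant to sum to 1.

```python
import math


def check_config(cfg, tol=1e-6):
    """Return a list of human-readable problems found in a config dict.

    Assumption (not confirmed by the source): distributions and per-pool
    operation ratios are expected to sum to 1.0.
    """
    problems = []
    cache = cfg["cache_config"]
    test = cfg["test_config"]
    num_pools = cache["numPools"]

    def check_sum(name, values):
        total = sum(values)
        if not math.isclose(total, 1.0, abs_tol=tol):
            problems.append(f"{name} sums to {total}, expected 1.0")

    def check_len(name, values):
        if len(values) != num_pools:
            problems.append(
                f"{name} has {len(values)} entries, expected {num_pools}"
            )

    # Lists that carry one entry per pool.
    per_pool = {
        "poolSizes": cache["poolSizes"],
        "keyPoolDistribution": test["keyPoolDistribution"],
        "opPoolDistribution": test["opPoolDistribution"],
    }
    for name, values in per_pool.items():
        check_len(name, values)
        check_sum(name, values)

    # One operation mix per pool; collect every field ending in "Ratio".
    check_len("poolDistributions", test["poolDistributions"])
    for i, pool in enumerate(test["poolDistributions"]):
        ratios = [v for k, v in pool.items() if k.endswith("Ratio")]
        check_sum(f"poolDistributions[{i}] op ratios", ratios)

    return problems
```

Loading the file with `json.load` and passing the resulting dict to `check_config` returns an empty list when none of these assumed constraints is violated.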

Expected behavior
The benchmark should output statistics every minute.

Screenshots

E0417 20:53:13.873994 878573 BlockCache.cpp:402] Item header checksum mismatch. Region 5229 is likely corrupted. Expected:1667855729, Actual: 564302642. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:14.049113 878575 BlockCache.cpp:402] Item header checksum mismatch. Region 5225 is likely corrupted. Expected:1634476133, Actual: 1712872625. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:14.211392 878577 BlockCache.cpp:402] Item header checksum mismatch. Region 5228 is likely corrupted. Expected:1868832889, Actual: 3472898925. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:14.413154 878579 BlockCache.cpp:402] Item header checksum mismatch. Region 5221 is likely corrupted. Expected:1948283493, Actual: 367335369. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:14.565924 878581 BlockCache.cpp:402] Item header checksum mismatch. Region 5230 is likely corrupted. Expected:1919033451, Actual: 3110670073. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:14.705444 878583 BlockCache.cpp:402] Item header checksum mismatch. Region 5219 is likely corrupted. Expected:1634476133, Actual: 1712872625. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:14.863216 878585 BlockCache.cpp:402] Item header checksum mismatch. Region 5226 is likely corrupted. Expected:1667855729, Actual: 564302642. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.022443 878587 BlockCache.cpp:402] Item header checksum mismatch. Region 5223 is likely corrupted. Expected:539912047, Actual: 1512533508. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.154551 878589 BlockCache.cpp:402] Item header checksum mismatch. Region 5245 is likely corrupted. Expected:1864397680, Actual: 29886577. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.297171 878591 BlockCache.cpp:402] Item header checksum mismatch. Region 5222 is likely corrupted. Expected:1713401463, Actual: 3011049547. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.436860 878593 BlockCache.cpp:402] Item header checksum mismatch. Region 5224 is likely corrupted. Expected:543516756, Actual: 4032079696. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.586964 878595 BlockCache.cpp:402] Item header checksum mismatch. Region 5243 is likely corrupted. Expected:544763750, Actual: 614490409. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.722491 878597 BlockCache.cpp:402] Item header checksum mismatch. Region 5239 is likely corrupted. Expected:1735353376, Actual: 2068717658. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:15.886340 878599 BlockCache.cpp:402] Item header checksum mismatch. Region 5232 is likely corrupted. Expected:1864397680, Actual: 29886577. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:16.040464 878601 BlockCache.cpp:402] Item header checksum mismatch. Region 5236 is likely corrupted. Expected:2053205024, Actual: 884959818. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:16.173985 878603 BlockCache.cpp:402] Item header checksum mismatch. Region 5231 is likely corrupted. Expected:543516756, Actual: 4032079696. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:16.327759 878605 BlockCache.cpp:402] Item header checksum mismatch. Region 5227 is likely corrupted. Expected:1814062440, Actual: 1958191057. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
E0417 20:53:16.474437 878607 BlockCache.cpp:402] Item header checksum mismatch. Region 5234 is likely corrupted. Expected:544763750, Actual: 614490409. Aborting reclaim. Remaining items in the region will not be cleaned up (destructor won't be invoked).
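
To make a log like the one above easier to reason about, the mismatch lines can be summarized with a short script. This is a minimal sketch, not a CacheLib tool: the names `parse_mismatches`, `summarize`, and `as_text` are made up for illustration, and the regular expression assumes the exact message format shown above.

```python
import re
from collections import Counter

# Assumes the message format shown in the log excerpt above.
PATTERN = re.compile(
    r"Region (\d+) is likely corrupted\. Expected:\s*(\d+), Actual:\s*(\d+)"
)


def parse_mismatches(lines):
    """Yield (region, expected, actual) for each checksum-mismatch line."""
    for line in lines:
        match = PATTERN.search(line)
        if match:
            yield tuple(int(group) for group in match.groups())


def summarize(lines):
    """Return the sorted distinct regions and the (expected, actual)
    pairs that occur more than once, with their counts."""
    records = list(parse_mismatches(lines))
    regions = sorted({region for region, _, _ in records})
    pair_counts = Counter((exp, act) for _, exp, act in records)
    repeated = {pair: n for pair, n in pair_counts.items() if n > 1}
    return regions, repeated


def as_text(value):
    """Render a 32-bit value as its 4 little-endian bytes, with
    non-printable bytes shown as '.'."""
    raw = value.to_bytes(4, "little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)
```

In the excerpt above, several (expected, actual) pairs recur in different regions, for example 1667855729 / 564302642 in regions 5229 and 5226, and `summarize` surfaces such repeats. `as_text` only shows whether a logged value happens to consist of printable ASCII bytes; that is a hint worth following up, not proof of what was written to the device.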

Server:

  • Hardware: Intel Ice Lake, 72 core, 256GB RAM, 1.6TB NVMe
  • OS: Ubuntu 20.04
  • CacheLib Version: 4dc071d

Question:
Can anyone help me analyze the cause of this failure? Is it due to the large number of Navy threads in use, and could it be a CacheLib bug?
