https://github.com/CoffeeBeforeArch/cuda_programming/blob/8711be08e2060680820304594f3691cc0016edd1/03_sum_reduction/diverged/sumReduction.cu#L75

Shouldn't the second kernel launch use <<<1, GRID_SIZE>>> instead of <<<1, TB_SIZE>>>? As I understand it, the first launch writes one partial sum per block, so GRID_SIZE is the number of partial sums the second launch has to reduce.
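
To make the question concrete, here is a minimal sketch of the arithmetic I have in mind, in plain host-side C++ with no kernel. The helper names (`partial_sum_count`, `second_pass_matches`) are hypothetical and not from the repo, and it assumes the first pass is launched as <<<GRID_SIZE, TB_SIZE>>> with GRID_SIZE computed by dividing the input length by TB_SIZE:

```cpp
#include <cassert>

// Hypothetical helper (not from the repo): number of blocks the first pass
// launches, which is also the number of partial sums it writes
// (assuming one partial sum per block).
constexpr int partial_sum_count(int n, int tb_size) {
    return (n + tb_size - 1) / tb_size;  // ceiling division
}

// Hypothetical helper: true when a second pass launched as <<<1, tb_size>>>
// has exactly one thread per partial sum, i.e. GRID_SIZE == TB_SIZE.
constexpr bool second_pass_matches(int n, int tb_size) {
    return partial_sum_count(n, tb_size) == tb_size;
}
```

Under these assumptions, `second_pass_matches(1 << 16, 256)` is true, since 65536 / 256 = 256, so both launch configurations would behave the same for that size. For a smaller input such as `1 << 14` there would be only 64 partial sums, and a block of 256 threads would read past them, which is why <<<1, GRID_SIZE>>> seems to state the intent more directly.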