Adding Compensated Summation to SpM-DV #123
Closed · jlgreathouse wants to merge 25 commits into clMathLibraries:master from jlgreathouse:newer_adaptive_squash
Conversation
…ts us from adding an extra copy of the final workgroup if it turns out that our total number of rows exactly fits within the final rowBlock. This caused problems with global asynchronous reduction.
…lity. Using double precision intermediate calculations for the float answer. Using compensated summation to (in effect) perform quad precision intermediate calculations for the double answer. Calculating ULPs difference between CPU and GPU results.
…oating-point underflow, this calculates answers in twice the precision before rounding to the native precision. In other words, it is as if all of the intermediate results were calculated in double before finally rounding to float. For many matrices, this makes the results bitwise identical to what is calculated on the CPU.
…xpanding the rowBlocks buffer size by 2x in order to have a place to reduce error values between workgroups in the CSR-LongRows case.
…rameters out of the kernel. Made the 2sum algorithm in csrmv_general slightly faster.
…p in CSR-LongRows to work on more than a single block of NNZs. This is more efficient and results in higher performance. Also split up CSR-Vector from the LongRows algorithm. Added some more tuning knobs to CSR-Adaptive with respect to these changes.
…l reduction mechanism when there are relatively few rows within the row block.
… the number of threads assigned to the parallel CSR-Stream reduction on the CPU instead of making each GPU workgroup do it. Changed the operation from a division to faster bit math.
…eries of short rows and then a new long-ish row. CSR-Stream runs into performance issues when trying to reduce multiple rows of extremely varying lengths, so we just put these rows into different workgroups.
… on SpM-DV algorithms.
…ws at a time to CSR-Vector to only 1. It turns out that, after recent modifications to CSR-Stream, it is more efficient to use CSR-Stream for this case.
…pes from size_t to unsigned int, since we currently do not work on extremely large data structures. Changing around some other data types. In general, all of this results in some minor performance gains on Fiji GPUs.
…rformance increases on DPFP-starved GPUs when working in double precision mode.
… of multiplications that snuck their way into the code. 32-bit integer multiply is slow (up to 16x slower than addition on AMD GPUs), so replaced with full-speed addition or full-speed 24-bit multiply when required. Results in major performance gains on Fiji GPUs.
…c work to better decide whether our target hardware supports appropriate atomics. Currently does not work with targets that support fp64 but not 64-bit integer atomics.
… not support 64-bit atomics. Fall back to using only CSR-Vector in this case. Also made some changes to CSR-LongRows beta initialization to fix a memory consistency issue.
Conflicts: src/library/kernels/csrmv_adaptive.cl
…e opaque command structure. Adding it as a command-line option for the test-blas2 program.
… summation. Fixed naming convention for this in test-blas2.
… single-precision compensated summation can have errors due to lack of denorm support.
Fixes Issue #109
This patch is an overarching solution to Issue #109 and also a feature enhancement that allows users to optionally compute SpM-DV using compensated summation.
Compensated summation is a technique for reducing the cumulative error that accrues during floating-point summation. Such errors occur both because of differences in magnitude between the summands (where part or all of the smaller one may be rounded away; in single precision, for example, 1.0e8f + 1.0f rounds right back to 1.0e8f) and because of catastrophic cancellation, where subtracting a large value from a sum makes previously marginal values that were rounded away significant (and now wrong).
It does this, in part, by replacing simple floating-point addition with an error-free transformation such as 2Sum or Fast2Sum. These transformations effectively compute A + B = (Sum, Error), where Sum is the correctly rounded sum and Error is the exact rounding error.
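As a sketch of the idea (these are the standard textbook formulations, not the kernel code in this PR), the two transformations look roughly like this in C. Both assume IEEE-754 arithmetic and a compiler that does not rearrange floating-point expressions (e.g. no -ffast-math):

```c
/* Error-free transformations: sum + err == a + b exactly. */
typedef struct { double sum; double err; } sum_err;

/* 2Sum (Knuth): no precondition on the magnitudes of a and b. */
sum_err two_sum(double a, double b) {
    double s  = a + b;
    double bb = s - a;                      /* the part of b absorbed into s */
    double e  = (a - (s - bb)) + (b - bb);  /* exact rounding error of a + b */
    return (sum_err){ s, e };
}

/* Fast2Sum (Dekker): requires |a| >= |b|, but uses fewer operations. */
sum_err fast_two_sum(double a, double b) {
    double s = a + b;
    double e = b - (s - a);                 /* exact rounding error of a + b */
    return (sum_err){ s, e };
}
```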
The Sum2 algorithm uses this transformation to greatly reduce the final error of a summation. As you continue to add values into the sum, you separately accumulate the error terms; at the end of the summation, you add the final sum and the final error together. The end result is that your final answer is as accurate (0 ULP) as performing the summation in 2x precision and then casting down to your original precision. In other words, if all of the values you are summing are floats, the answer should be exactly the same as if you had performed the summation in double and then cast the final result to float.
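A minimal sequential sketch of Sum2, building on the two_sum() helper above (the PR's kernels are the authoritative implementation):

```c
#include <stddef.h>

/* Sum2: compensated summation built on two_sum(). For well-conditioned
 * sums, the returned value matches summing in twice the working
 * precision and then rounding once at the end. */
double sum2(const double *x, size_t n) {
    double sum = 0.0, err = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum_err se = two_sum(sum, x[i]);
        sum = se.sum;
        err += se.err;   /* accumulate the rounding errors separately */
    }
    return sum + err;    /* fold the accumulated error back in once */
}
```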
This code uses a parallel version of this algorithm, PSum2. This compensated summation algorithm was added to both csrmv_general and csrmv_adaptive, and test-blas2 and clsparse-bench were both modified to expose a command-line option for compensated summation.
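What makes the summation parallelizable is that two partial (sum, error) pairs can be merged with another error-free addition. A hypothetical merge helper (for illustration only; the actual PSum2 reduction lives in the OpenCL kernels) might look like:

```c
/* Merge two compensated partial sums, e.g. from two threads or
 * workgroups, into one pair. Hypothetical helper for illustration. */
sum_err merge_partials(sum_err a, sum_err b) {
    sum_err se = two_sum(a.sum, b.sum);  /* error-free add of the sums */
    se.err += a.err + b.err;             /* carry every error term along */
    return se;
}
```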
When running test-blas2 with compensated summation enabled, all double-precision tests pass with 0 ULP difference compared to a CPU-side implementation of the same algorithm. The single-precision tests sometimes differ from CPU calculations performed in double precision. This is because, on my test platforms, the AMD OpenCL runtime does not support single-precision denormals; instead, it flushes all denormals to zero, introducing a large error that we cannot compensate for.
While this replaces every addition in SpM-DV with several more FLOPs, the overhead of compensated summation is, in most cases, modest. On a FirePro W9100, the average single-precision slowdown of csrmv_adaptive is 5%, and the average double-precision slowdown is 2%. On a Radeon R9 Fury X (with a DPFP rate 1/16 of its SPFP rate, but with much higher bandwidth than the FirePro W9100), the slowdown is 10% for single precision and 28% for double precision.
[Performance graphic omitted. It is not a comparison of the relative performance of the two cards: '1' in each bar is the performance of that category with traditional summation.]
Note that we do not perform a fully compensated dot product, as described in the paper by Yamanaka, Ogita, Rump, and Oishi. Performing an error-free transformation on the products in our SpM-DV is not especially computationally intensive, but carrying the extra error values from the multiplications around is burdensome in csrmv_adaptive. We have seen that the majority of the error in our calculations comes from the summation, so we deemed compensating only the summation good enough. This does mean, however, that we cannot fully guarantee 0 ULP accuracy relative to 2x-precision calculations, because the error from the multiplies may compound.
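For reference, the error-free product that a fully compensated dot product would add is the standard 2Prod transformation, expressible with a single fused multiply-add (a sketch, not code from this PR):

```c
#include <math.h>

/* 2Prod: prod + err == a * b exactly (barring overflow/underflow),
 * using a correctly rounded FMA. A fully compensated dot product
 * would feed err into the error accumulator alongside the
 * summation errors, which is the extra state we chose not to carry. */
sum_err two_prod(double a, double b) {
    double p = a * b;
    double e = fma(a, b, -p);  /* exact rounding error of the multiply */
    return (sum_err){ p, e };
}
```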