Why do the strides work in this way?

I'm opening this issue in a bid to get some help on a vexing code problem in my project, which incorporates PocketFFT.  If this is not the right place to pose a question, I'm happy to take it offline with someone who would offer to help.

The principal problem is that I have Fortran-ordered multi-dimensional arrays, (column-major, the contiguous memory lists all column elements consecutively, in three dimensions {*x*, *y*, *z*} the column being a "vertical" column against *x* / *y* "slabs") , whereas PocketFFT expects C order (row-major).  To help frame the issue, I have been able to carry out forward and backward FFT "round trips" with such data and access the individual elements in an order that displays them in the same output format as GNU Octave performing `fftn(x)` on the same multi-dimensional data. (Whether Octave uses Fortran- or C-ordered arrays, I am entering the data in a script so the rows, slabs, and columns are all implied by the layout of data in the script.) My question is more about why what I'm doing works--I don't want to be caught getting the right answers for the wrong reasons.

What I do for a given **real-to-half complex problem** with { *m*, *n*, *p* } elements:
* Make the input `shape_in` of the problem a `std::vector<size_t>` of { *p*, *n*, *m* }
* Under this logic of reversing the order of lengths fed to the API, the output shape of the problem would seem to be { *p*, *n*, *m*/2 + 1 }.  This is not part of the PocketFFT API calls, however.
* The input strides `stride_in` work out to { *m* x *n* x *R*, *m* x *R*, *R* }, where *R* is `sizeof(your_real_type)`.
* The output strides `stride_out`, which are part of the actual API call, then work out as { (*m* / 2 + 1) x *n* x *C*, (*m* / 2 + 1) x *C*, *C* }, where *C* is `sizeof(std::complex<your_real_type>)`.  

If I then run it like so:
```
r2c(shape_in, stride_in, stride_out, { 0, 1, 2 }, FORWARD, your_real_signal_data, your_complex_xform_data, 1.0);
c2r(shape_in, stride_out, stride_in, { 0, 1, 2 }, BACKWARD, your_complex_xform_data, your_real_signal_data, 1.0 / (m * n * p));
```

The resulting signal will make the correct round-trip.  I can also query the intermediate `complex_xform_data` to produce what looks like the correct result (matching GNU Octave running the same problem--I got over the roundoff problem by rounding all input elements to the nearest 1.0e-8 and printing the input signal as `%12.8lf` so I could copy it into the Octave script).  It took me *a lot* of effort to solve the logical Rubik's cube.

**The insight**, it seems, was that I need to access the elements of the transformed signal as if the result were (or, because it is?) now a problem with shape { *m* / 2 + 1, *n*, *p* }, but again in Fortran order.  This pertains to the result of just the forward FFT, not the round trip. (In order to do a convolution, I will need to get in there, between forward and reverse FFTs, and do some element-wise operations.  That will require me to know what elements of the frequency space I'm accessing.)  In order to access element { *x*, *y*, *z* } of the transformed result, with *x* varying in the range [0, *m*) and *y* between [0, *n*) and *z* between [0, *p*), here's the pseudocode:

```
getFrequencyData(const size_t x_pos, const size_t y_pos, const size_t z_pos) {

  // Apply periodic considerations.  If the FFT is real-to-complex, the following filter will       
  // detect a request for information that might be outside the explicit data array.  A             
  // complex-to-complex transform will have frequencies for each piece of real data.                
  size_t x_act = x_pos - (shape_in[0] * (x_pos / shape_in[0]));
  size_t y_act = y_pos - (shape_in[1] * (y_pos / shape_in[1]));
  size_t z_act = z_pos - (shape_in[2] * (z_pos / shape_in[2]));
  size_t xyz_act;
  if (x_act >= shape_out[2]) {
    x_act = shape_in[0] - x_act;
    if (y_act > 0) {
      y_act = shape_in[1] - y_act;
    }
    if (z_act > 0) {
      z_act = shape_in[2] - z_act;
    }
  }

  // Determine the array index to access.
  size_t xyz_act = (((z_act * shape_out[1]) + y_act) * shape_out[2]) + x_act;
  result = z_frequency[xyz_act];

  // Take the complex conjugate, if necessary.
  if (x_act >= shape_out[2]) {
    result.y = -result.y;
  }
  return result;
}
```

I'm hoping that someone is here who can confirm my reasoning.  This has been quite a puzzle, as I said, but I think I finally have it.  I'm finding that PocketFFT and cuFFT both follow the same data layouts, including for real-to-half complex transforms, so I've been able to extend my own code's API to subsume both packages for FFTs in C++ and CUDA, respectively.  With a few more improvements I should be able to leverage the class containing code such as the above without having to go back through this degree of pain every time I'm faced with a new FFT.  It will also internalize the allocation of out-of-place data, if required, and even automate repeated calls to PocketFFT to mimic the behavior of batched FFT calls in cuFFT.  Very glad to have your code holding up the C++ side of things, providing a critical CPU-only run mode and the ability to write other parts of the software that perform in serial C++ what we *want* all of our CUDA kernels to be doing up to 2000x faster.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Why do the strides work in this way? #25

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Why do the strides work in this way? #25

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions