Skip to content

✨ feat: Support parallel computation for NDT#6376

Merged
mvieth merged 7 commits intoPointCloudLibrary:masterfrom
Tao-Sheng:feat/ndt_openmp
Mar 21, 2026
Merged

✨ feat: Support parallel computation for NDT#6376
mvieth merged 7 commits intoPointCloudLibrary:masterfrom
Tao-Sheng:feat/ndt_openmp

Conversation

@Tao-Sheng
Copy link
Copy Markdown
Contributor

Support parallel computation for NDT by OpenMP with different neighbor search method

Support parallel computation for NDT by OpenMP with different neighbor search method
Use signed std::ptrdiff_t instead of std::size_t
@larshg larshg added module: registration changelog: enhancement Meta-information for changelog generation labels Dec 11, 2025
@larshg larshg added this to the pcl-1.16.0 milestone Dec 11, 2025
@larshg larshg requested review from Copilot and mvieth December 11, 2025 06:56
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds OpenMP-based parallel computation support to the Normal Distributions Transform (NDT) registration algorithm, along with multiple neighbor search methods to optimize performance. The changes introduce thread-level parallelization to speed up the derivative computation phase of NDT.

Key changes:

  • Adds a NeighborSearchMethod enum with four options: KDTREE (original), DIRECT26, DIRECT7, and DIRECT1
  • Implements setNumberOfThreads() method with OpenMP support for parallel execution
  • Parallelizes the main computation loop in computeDerivatives() using OpenMP directives
  • Updates test to verify functionality with the new API

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
registration/include/pcl/registration/ndt.h Adds enum for neighbor search methods, declares thread control API and private member variables for threading configuration
registration/include/pcl/registration/impl/ndt.hpp Implements setNumberOfThreads() with OpenMP detection, parallelizes derivative computation loop, adds switch statement for different search methods
test/registration/test_ndt.cpp Updates test to explicitly configure search method and thread count for validation
Comments suppressed due to low confidence (1)

registration/include/pcl/registration/impl/ndt.hpp:269

  • The OpenMP parallel region has a critical thread safety issue. The variables score, score_gradient, and hessian are marked as shared, but multiple threads will update them concurrently (via score += on line 266 and calls to updateDerivatives which modifies score_gradient and hessian). These concurrent updates to shared variables will cause race conditions, leading to incorrect results. You need to use reduction clauses for score and critical sections or atomic operations for score_gradient and hessian, or alternatively use thread-local accumulators that are combined after the parallel region.
#pragma omp parallel for num_threads(threads_) schedule(dynamic, 32)                   \
    shared(score, score_gradient, hessian) firstprivate(neighborhood, distances)
  // Update gradient and hessian for each point, line 17 in Algorithm 2 [Magnusson 2009]
  for (std::ptrdiff_t idx = 0; idx < static_cast<std::ptrdiff_t>(input_->size());
       idx++) {
    // Transformed Point
    const auto& x_trans_pt = trans_cloud[idx];

    // Find neighbors with different search method
    switch (search_method_) {
    case KDTREE:
      // Radius search has been experimentally faster than direct neighbor checking.
      target_cells_.radiusSearch(x_trans_pt, resolution_, neighborhood, distances);
      break;
    case DIRECT26:
      target_cells_.getNeighborhoodAtPoint(x_trans_pt, neighborhood);
      break;
    case DIRECT7:
      target_cells_.getFaceNeighborsAtPoint(x_trans_pt, neighborhood);
      break;
    case DIRECT1:
      target_cells_.getVoxelAtPoint(x_trans_pt, neighborhood);
      break;
    }

    // Original Point
    const Eigen::Vector3d x = (*input_)[idx].getVector3fMap().template cast<double>();
    const Eigen::Vector3d x_trans_position =
        x_trans_pt.getVector3fMap().template cast<double>();

    for (const auto& cell : neighborhood) {
      // Denorm point, x_k' in Equations 6.12 and 6.13 [Magnusson 2009]
      const Eigen::Vector3d x_trans = x_trans_position - cell->getMean();
      // Inverse Covariance of Occupied Voxel
      // Uses precomputed covariance for speed.
      const Eigen::Matrix3d c_inv = cell->getInverseCov();

      // Compute derivative of transform function w.r.t. transform vector, J_E and H_E
      // in Equations 6.18 and 6.20 [Magnusson 2009]
      computePointDerivatives(x);
      // Update score, gradient and hessian, lines 19-21 in Algorithm 2, according to
      // Equations 6.10, 6.12 and 6.13, respectively [Magnusson 2009]
      score +=
          updateDerivatives(score_gradient, hessian, x_trans, c_inv, compute_hessian);
    }
  }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Apply copilot suggestion to fix some error
Copy link
Copy Markdown
Member

@mvieth mvieth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have to at least put a omp critical region around computePointDerivatives(x) and updateDerivatives(...) because they both use global variables.
Also maybe put NeighborSearchMethod into the NormalDistributionTransform class, to make it clear that they belong together?

@mvieth mvieth added the needs: more work Specify why not closed/merged yet label Feb 8, 2026
@mvieth mvieth requested a review from Copilot March 14, 2026 10:46
- Added the switch to select different search methods to computeHessian() as well (would not make sense to use different search methods in computeDerivatives() and computeHessian())
- Added overloads to functions computePointDerivatives(), updateDerivatives(), and updateHessian() to be able to use local variables instead of the global variables point_jacobian_ and point_hessian_
- Change the parallelized region in computeDerivatives() by using the new function overloads, and by introducing a custom reduction (goal: compatibility with OpenMP 2.0). Synchronization overhead between threads is minimal, with only two critical regions at the end to accumulate, score, gradient, and hessian
- Tested with two larger point clouds, and there is indeed a nice speed-up when increasing the number of threads (however not a perfect 1-to-1 speedup because there are still other functions in NDT that are not parallelized)
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@mvieth mvieth removed the needs: more work Specify why not closed/merged yet label Mar 15, 2026
@mvieth
Copy link
Copy Markdown
Member

mvieth commented Mar 15, 2026

I looked a bit more into this, and I realized that my earlier suggestion to put an omp critical region around computePointDerivatives(x) and updateDerivatives(...) would hinder parallelization a lot because a lot of computation time is spent in those two function. So instead I added function overloads to those two and updateHessian(...) to be able to use local variables (private per-thread) instead of the global class members point_jacobian_ and point_hessian_. I also created a reduction on score, score_gradient, and hessian (with a workaround for the last two to make this compatible with OpenMP 2.0), so now there is only minimal synchronization overhead at the end when the results from all threads are added together. I have tested with two larger point clouds (from the tutorials), and there is indeed a nice speed-up when increasing the number of threads (however not a perfect 1-to-1 speed-up because there are still other functions in NDT that are not parallelized). In a future PR, we could check which other functions of NDT could be parallelized sensibly.

I also added the switch to select different search methods to computeHessian() as well (would not make sense to use different search methods in computeDerivatives() and computeHessian()).

@mvieth mvieth requested a review from larshg March 15, 2026 09:52
@mvieth mvieth merged commit 8273409 into PointCloudLibrary:master Mar 21, 2026
13 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog: enhancement Meta-information for changelog generation module: registration

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants