Skip to content

Conversation

@rezmoss
Copy link
Contributor

@rezmoss rezmoss commented Feb 7, 2025

Problem

When processing Node.js packages with complex symlink structures (particularly .pnpm or turf packages), the _pathsToNode function would enter an exponential path growth pattern:

  • Normal packages process in 0-1ms
  • .pnpm/turf packages cause path counts to explode thousands
  • This leads to severe performance degradation/stuck forever

Root Cause

The issue stems from how parent paths are processed. The combination of:

  1. Getting all constituent paths with p.ConstituentPaths()
  2. Recursively processing each parent path
  3. Returning nil for visited nodes (causing reprocessing)

creates a compounding effect where each level of symlink traversal multiplies the paths to process

Solution

The fix modifies _pathsToNode to:

  1. Process only immediate parent path instead of all constituents
  2. Return accumulated paths for visited nodes instead of nil
  3. Maintain proper path building with basename

This prevents the path explosion while preserving the expected file traversal behavior

@kzantow
Copy link
Contributor

kzantow commented Feb 7, 2025

Thanks for the contribution @rezmoss! Is there a way to add a test illustrating the problem on a small scale and how this change fixes it?

@rezmoss
Copy link
Contributor Author

rezmoss commented Feb 7, 2025

@kzantow Sure! I'll work on that to reproduce the issue with a new Docker image and share it here soon

@rezmoss
Copy link
Contributor Author

rezmoss commented Feb 10, 2025

@kzantow Building a similar structure to a real docker image was really hard, but I finally built a dockerfile that reproduces almost the same behavior, my real docker image would get stuck for hours, but this one took 26 mins and 35 secs on my machine

original-code

After the patch, the same process took just 1 min and 32 secs

patch-v

here’s the dockerfile you can build and test

FROM node:16 AS builder

WORKDIR /app/lib

RUN npm install -g pnpm@8

RUN for dir in metric-analyzer event-processor cache-manager data-validator stream-utils audit-tools metrics-collector token-handler queue-manager workflow-engine service-registry task-scheduler load-balancer trace-analyzer storage-engine pipeline-runner; do \
    mkdir -p $dir && \
    cd $dir && \
    echo "{\"name\":\"$dir\",\"version\":\"1.0.0\",\"dependencies\":{\
        \"@turf/along\":\"6.5.0\",\
        \"@turf/turf\":\"6.5.0\",\
        \"@babel/runtime\":\"7.23.7\",\
        \"tslib\":\"2.5.3\",\
        \"@turf/boolean-point-in-polygon\":\"6.5.0\",\
        \"@turf/center-of-mass\":\"6.5.0\",\
        \"@turf/center-mean\":\"6.5.0\"\
    }}" > package.json && \
    cd .. ; \
done

RUN for dir in metric-analyzer event-processor cache-manager data-validator stream-utils audit-tools metrics-collector token-handler queue-manager workflow-engine service-registry task-scheduler load-balancer trace-analyzer storage-engine pipeline-runner; do \
    cd $dir && \
    pnpm install && \
    cd ..; \
done

# create deeply nested struc
RUN for dir in metric-analyzer event-processor cache-manager data-validator stream-utils audit-tools metrics-collector token-handler queue-manager workflow-engine service-registry task-scheduler load-balancer trace-analyzer storage-engine pipeline-runner; do \
    mkdir -p $dir/deep/nested/path/node_modules/.pnpm/@turf/along/dist/es && \
    echo '{"nested": true}' > $dir/deep/nested/path/node_modules/.pnpm/@turf/along/dist/es/package.json && \
    for pkg in along turf boolean-point-in-polygon center-of-mass center-mean; do \
        pkg_path="$dir/deep/nested/path/node_modules/.pnpm/@turf/$pkg/dist/es" && \
        mkdir -p $pkg_path && \
        echo "{\"name\":\"@turf/$pkg\",\"nested\":true}" > $pkg_path/package.json; \
    done; \
done

# create cross-links between pkgs and dirs
RUN for dir1 in metric-analyzer event-processor cache-manager data-validator stream-utils audit-tools metrics-collector token-handler queue-manager workflow-engine service-registry task-scheduler load-balancer trace-analyzer storage-engine pipeline-runner; do \
    for dir2 in metric-analyzer event-processor cache-manager data-validator stream-utils audit-tools metrics-collector token-handler queue-manager workflow-engine service-registry task-scheduler load-balancer trace-analyzer storage-engine pipeline-runner; do \
        if [ "$dir1" != "$dir2" ]; then \
            mkdir -p $dir1/node_modules/.pnpm/@turf/nested/$dir2 && \
            for pkg in along turf boolean-point-in-polygon center-of-mass center-mean; do \
                ln -s ../../../../../$dir2/node_modules/.pnpm/@turf+${pkg}@6.5.0 $dir1/node_modules/.pnpm/@turf/nested/$dir2/$pkg-link; \
            done; \
        fi; \
    done; \
done

# create deep nested pkgs cross lnks
RUN for dir in metric-analyzer event-processor cache-manager; do \
    cd $dir && \
    for i in $(seq 1 20); do \
        for pkg in along turf boolean-point-in-polygon center-of-mass center-mean; do \
            pkg_dir="deep/level$i/node_modules/.pnpm/@turf/$pkg" && \
            mkdir -p $pkg_dir/dist/es && \
            echo "{\"name\":\"@turf/$pkg\",\"level\":$i}" > $pkg_dir/dist/es/package.json && \
            if [ $i -gt 1 ]; then \
                prev=$((i-1)) && \
                ln -s ../../../../../level$prev/node_modules/.pnpm/@turf/$pkg $pkg_dir/prev-link && \
                for other_pkg in along turf boolean-point-in-polygon center-of-mass center-mean; do \
                    if [ "$pkg" != "$other_pkg" ]; then \
                        ln -s ../../$other_pkg $pkg_dir/cross-link-$other_pkg; \
                    fi; \
                done; \
            fi; \
        done; \
    done && \
    cd ..; \
done

FROM node:16-slim
COPY --from=builder /app /app
WORKDIR /app

CMD ["tail", "-f", "/dev/null"]

@rezmoss
Copy link
Contributor Author

rezmoss commented Feb 18, 2025

Hey @kzantow just following up on this, have you had a chance to look at the Dockerfile I shared? Let me know if you need any more details

@rezmoss
Copy link
Contributor Author

rezmoss commented May 21, 2025

hey @kzantow just checking in again to see if you had a chance to look at the dockerfile i shared, it shows the path explosion issue and how the patch really helps with performance

@kzantow
Copy link
Contributor

kzantow commented May 22, 2025

Sorry this got lost in the noise @rezmoss 🤦 I'll give this a review tomorrow! 👍

@kzantow kzantow self-assigned this May 22, 2025
@kzantow kzantow added this to OSS May 22, 2025
@kzantow kzantow moved this to In Review in OSS May 22, 2025
@kzantow
Copy link
Contributor

kzantow commented May 23, 2025

FYI @rezmoss -- I've been testing this manually and can confirm it seems to solve issues with the image you provided as well as the image here: anchore/syft#3928 (comment). I'm actively trying to get this merged, including some sort of meaningful test. It's just a bit of a challenging thing to test in a unit test scenario, as you are probably aware. Thanks very much for your work on this, it's not a trivial change at all! 😁

@rezmoss
Copy link
Contributor Author

rezmoss commented May 24, 2025

@kzantow thanks for testing it out and confirming, let me know if anything else is needed to move it forward ;)

@kzantow
Copy link
Contributor

kzantow commented Jun 4, 2025

Hey @rezmoss sorry for the delay here. I found a set of cases that this change wasn't accounting for and couldn't account for without restoring some of the prior behavior of walking up the tree, which ended up undoing a lot of the good this did in the first place -- as you said there was a compounding of issues causing this explosion. I spent some time trying to adjust how it works to avoid the performance issues while not missing potential paths and I realized there was a different fundamental change that should happen - instead of returning all the potential paths, we really only need the first matching one from an indexed item -- we're returning "real" files here, as long as they match a requested glob. I ended up implementing an alternate change here: #411.

However your work on identifying the problem and providing a way to reproduce it was invaluable to get it fixed, so I wanted to express my deepest gratitude for your help here, despite not accepting this PR directly. Thank you!

@kzantow kzantow closed this Jun 4, 2025
@github-project-automation github-project-automation bot moved this from In Review to Done in OSS Jun 4, 2025
@rezmoss
Copy link
Contributor Author

rezmoss commented Jun 4, 2025

thanks @kzantow totally get it and just glad the root issue’s sorted, really appreciate the thoughtful followup and kind words, happy could help in any way ;)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

2 participants