Python build file tree support for Code Intelligence#160
Conversation
- Rename ProcessorContext.MavenArtifactIDs to ArtifactIDs for ecosystem-agnostic usage - Replace hand-rolled parseMavenArtifactID with parseArtifactID using packageurl-go - For purls with a namespace (Maven, npm) return "namespace:name"; for others (PyPI) return name only - Add comprehensive test cases for parseArtifactID covering Maven, PyPI, npm, and invalid purls Rationale: The build file tree processor needs to parse artifact IDs from any ecosystem, not just Maven. Using packageurl-go provides correct purl parsing for all ecosystem types and eliminates the Maven-specific string manipulation that would break on non-Maven purls. This commit made by [/dd:git:commit:atomic](https://github.com/DataDog/claude-marketplace/tree/main/dd/commands/git/commit/atomic.md)
- Create SimpleProcessor with identical BFS logic from MavenProcessor - Register SimpleProcessor for FileTypePomXML (replacing MavenProcessor registration) - Delete MavenProcessor as its logic is fully absorbed by SimpleProcessor - Rename test functions from TestMavenProcessor_* to TestSimpleProcessor_Maven_* - Fix test helper mavenFileComponent to produce realistic purls (separate artifactID param) - Update test expected IDs to match proper purl parsing (e.g. "com.example:child" not "com.example:child/pom.xml") Rationale: MavenProcessor's BFS-over-FileDependencies logic is not Maven-specific -- it works for any file type that has dependency edges and optional artifact IDs. Renaming to SimpleProcessor and making it generic allows registering it for requirements.txt and other file types without duplicating the BFS logic. All existing Maven test scenarios pass unchanged under the new name. This commit made by [/dd:git:commit:atomic](https://github.com/DataDog/claude-marketplace/tree/main/dd/commands/git/commit/atomic.md)
|
🎯 Code Coverage (details) 🔗 Commit SHA: e4fcd1d | Docs | Datadog PR Page | Give us feedback! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 972f0499f8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // if a dependency target is a known file-type component that wasn't | ||
| // collected, we add it to the same FileType group as its dependent so | ||
| // the processor can resolve it during BFS. | ||
| if len(filterSet) > 0 { |
There was a problem hiding this comment.
Handle cross-type targets without filters
When callers use the documented unfiltered form GetBuildFileTrees(sbom), requirements SBOMs that depend on a file component such as libs/mylib/setup.py lose that edge: the first pass groups the file component under its real basename (setup.py) with the no-op processor, while the requirements.txt SimpleProcessor only indexes files in the requirements group. This reclassification pass is the only code that adds cross-type targets to the dependent's group, but it is skipped unless a filter is provided, so unfiltered results omit the setup.py/pyproject dependency and its artifact ID even though filtered calls work.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in this PR.
046c3a0 to
c3057ab
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c3057ab528
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ArtifactDetail: models.ArtifactDetail{ | ||
| Name: name, | ||
| Filename: filePath, | ||
| Ecosystem: models.EcosystemPyPI, | ||
| }, |
There was a problem hiding this comment.
Preserve Python artifact versions for pinned dependencies
When a requirements.txt entry pins an internal package, e.g. mylib==1.0.0, parseLine records the package version, and internal/output/sbom/cyclonedx.go only links it to a scanned artifact when the artifact has the same version. This new artifact construction leaves Version empty, so the downstream findArtifact check fails and no requirements.txt -> pyproject.toml/setup.py dependency edge is emitted for pinned local modules, which is a common requirements format.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed: findArtifact now matches when artifact.Version is empty (local modules without a declared version), via the condition (version == "" || artifact.Version == "" || artifact.Version == version).
c3057ab to
ef47bdd
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ef47bddae8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return nil, nil | ||
| } | ||
|
|
||
| name := normalizedRequirementName(pyproject.Project.Name) |
There was a problem hiding this comment.
Read Poetry project names when extracting artifacts
For pyproject.toml files that use Poetry metadata only ([tool.poetry] name = ..., which this extractor already supports for dependencies), this reads only [project].name and returns no scanned artifact. In that common layout, requirements entries for the local module can never match a file component, so the new requirements.txt build-file tree loses the dependency/ID for the Poetry module.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed: GetArtifact now falls back to [tool.poetry].name when [project].name is absent.
| actions := scanner.ScannerActions{ | ||
| DirectoryPaths: dirs, | ||
| ExcludePaths: opts.ExcludePaths, | ||
| Recursive: opts.Recursive, | ||
| ExtractMavenPomArtifactIds: opts.ExtractMavenPomArtifactIds, | ||
| DirectoryPaths: dirs, | ||
| ExcludePaths: opts.ExcludePaths, | ||
| Recursive: opts.Recursive, | ||
| ExtractArtifacts: opts.ExtractArtifacts, |
There was a problem hiding this comment.
Expose manifest parsing through GenerateSBOM
When callers use the documented sbomgen.GenerateSBOM(..., sbomgen.DefaultOptions()) API, this action never enables ScannerActions.ManifestParsers, and DoScan only adds manifest extractors such as pyproject.toml when that flag is set. As a result the new PyProjectTOML artifact extractor cannot run through the public sbomgen API, so requirements.txt → pyproject.toml build-file edges are absent unless users bypass this package and call scanner directly.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed: GenerateSBOM now sets ManifestParsers: opts.ExtractArtifactIds so pyproject.toml runs through the public API when artifact extraction is enabled.
ef47bdd to
584b899
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 584b8994c6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for ft, files := range grouped { | ||
| for _, f := range files { |
There was a problem hiding this comment.
Expand cross-type targets to a fixed point
When a filtered tree has a local module that itself depends on another file component (for example requirements.txt -> libs/a/pyproject.toml -> libs/b/pyproject.toml), this loop only scans the files slice captured before line 177 appends libs/a/pyproject.toml. The newly added target is never examined for its own SBOM dependency edges, so libs/b/pyproject.toml is absent from the processor's filesByPath and SimpleProcessor drops that transitive dependency even though the SBOM contains it. Consider expanding these cross-type targets until no new file components are added.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed: replaced the single-pass loop with a fixed-point loop that keeps iterating until no new cross-type targets are added.
584b899 to
6a9568e
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6a9568e915
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| | ------------------ | ---------- | ------------------------------------------------------------------------------------------------------------------------------------------- | | ||
| | `Recursive` | `bool` | Scan subdirectories recursively (default: `true`) | | ||
| | `ExcludePaths` | `[]string` | Glob patterns to exclude from scanning | | ||
| | `ExtractArtifacts` | `bool` | Emit file-type components and dependency edges for build files, enabling `GetBuildFileTrees` dependency and ID resolution (default: `true`) | |
There was a problem hiding this comment.
Use the actual artifact option name
The README now documents ExtractArtifacts, but the public sbomgen.Options struct exposes ExtractArtifactIds instead. Users following this example to disable artifact extraction with sbomgen.Options{ExtractArtifacts: false} will get a compile error, so the documented API should match the field name in pkg/sbomgen/sbomgen.go.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed: README now documents ExtractArtifactIds (matching the actual struct field).
6a9568e to
c56993c
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c56993c422
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| RegisterBuildFileProcessor(FileTypePomXML, &SimpleProcessor{}) | ||
| RegisterBuildFileProcessor(FileTypeRequirementsTxt, &SimpleProcessor{}) |
There was a problem hiding this comment.
Register pyproject.toml with the simple processor
When a scan emits a pyproject.toml file component from GetArtifact, buildProcessorContext can populate ctx.ArtifactIDs and dependency edges for that file, but this init block only registers the simple processor for POMs and requirements files. As a result, callers using GetBuildFileTrees(sbom, FileTypePyprojectToml) fall through to noopProcessor, so the pyproject build-file tree loses its package ID and any local file dependencies even though the SBOM contains the data; register FileTypePyprojectToml here as well.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed: FileTypePyprojectToml is now registered with SimpleProcessor in the init() block.
…or for build file tree support Adds GetArtifact to PyProjectTOMLExtractor so that pyproject.toml files expose their project identity ([project].name, normalized per PEP 503) as a ScannedArtifact. This enables findArtifact to match packages from sibling lockfiles (e.g. requirements.txt) against the local module, producing inter-module dependency edges in the SBOM — the same mechanism already used for Maven pom.xml. GetArtifact lives on PyProjectTOMLExtractor (the entity that owns pyproject.toml) rather than on RequirementsTxtExtractor; the lockfile parser has no knowledge of pyproject.toml. PyProjectProject gains a Name field so the canonical PyProjectTOML type carries the identity. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…integration tests - Register SimpleProcessor for FileTypeRequirementsTxt alongside FileTypePomXML - Add integration test covering Python requirements.txt build file tree generation with inter-module relationships via PyProjectTOMLExtractor.GetArtifact
The flag gates artifact ID extraction specifically (the name/purl used to identify internal modules), not general artifact data. The previous name caused confusion with CI/CD artifact concepts.
c56993c to
3a33496
Compare
- Register SimpleProcessor for FileTypePyprojectToml so GetBuildFileTrees(sbom, FileTypePyprojectToml) produces results instead of falling through to noopProcessor - Read [tool.poetry].name as fallback in GetArtifact when [project].name is absent (Poetry-only layout) - Enable ManifestParsers in GenerateSBOM so PyProjectTOMLExtractor GetArtifact runs through the public sbomgen API - Match ScannedArtifact with empty Version so pinned requirements entries (mylib==1.0.0) link to internal modules that declare no version - Expand cross-type dependency targets to a fixed point and remove the filter guard so chains (requirements.txt → a/pyproject.toml → b/pyproject.toml) resolve fully with or without a FileType filter
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54c29ea355
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ManifestParsers: opts.ExtractArtifactIds, | ||
| ExtractArtifactIds: opts.ExtractArtifactIds, |
There was a problem hiding this comment.
Decouple manifest scanning from artifact extraction
When a library caller disables ExtractArtifactIds to suppress file-type components/dependency edges, this also leaves ManifestParsers false, so manifest-only extractors such as pyproject.toml and package.json are not enabled by DoScan unless explicitly selected elsewhere. That makes GenerateSBOM(..., Options{ExtractArtifactIds: false}) drop packages from manifest-only projects even though the option is documented as controlling only artifact IDs/build-file edges; keep manifest parser enablement independent from the artifact-output toggle.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed: ManifestParsers is now an independent field in sbomgen.Options. Callers can enable manifest parsing without artifact extraction. Internally, ExtractArtifactIds=true still activates manifest parsers as a side-effect (required for GetArtifact to run on manifest extractors like PyProjectTOMLExtractor), via ManifestParsers: opts.ManifestParsers || opts.ExtractArtifactIds.
When ExtractArtifactIds is enabled, every file that matches a registered extractor is now emitted as a type:file component in the SBOM, even if it produces zero packages and has no artifact identity (GetArtifact returns nil or is not implemented). Previously, a build file only appeared in GetBuildFileTrees if it either had at least one package with role:"manifest", or its extractor implemented GetArtifact. Files like a requirements.txt containing only -e ../base (no parseable packages, no GetArtifact) were invisible. The bare type:file component has no datadog:maven-package property and does not participate in findArtifact matching — it only ensures the file appears as a node in the build file tree.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 13c377dd0a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // Even if GetArtifact is not implemented or returned nil, emit a bare | ||
| // file-type component so every scanned build file appears in GetBuildFileTrees, | ||
| // regardless of whether it has an artifact identity or any packages. | ||
| if parsedLockfile.Artifact == nil { | ||
| parsedLockfile.Artifact = &models.ScannedArtifact{ | ||
| ArtifactDetail: models.ArtifactDetail{Filename: f.Path()}, |
There was a problem hiding this comment.
Avoid emitting lockfiles as build-file components
When ExtractArtifactIds is enabled, this fallback runs for every extractor that does not implement ArtifactExtractor, including lockfile-only parsers such as Cargo.lock, package-lock.json, or yarn.lock. Those bare ScannedArtifacts become CycloneDX file components, and GetBuildFileTrees collects all file components as build files, so the library default can return lockfiles/unknown file types even when no manifest was present. Restrict this fallback to actual manifest/build-file extractors or skip it for lockfile-only parsers.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed: the bare ScannedArtifact fallback is now inside the ArtifactExtractor type-assertion block. Only extractors that explicitly opt in by implementing ArtifactExtractor receive a bare file-type component. Lockfile-only parsers (Cargo.lock, yarn.lock, etc.) are unaffected. RequirementsTxtExtractor was given a trivial GetArtifact returning nil so requirements.txt files still appear in GetBuildFileTrees.
| func findArtifact(name string, version string, artifacts []models.ScannedArtifact) *models.ScannedArtifact { | ||
| for _, artifact := range artifacts { | ||
| if artifact.Name == name && (version == "" || artifact.Version == version) { | ||
| if artifact.Name == name && (version == "" || artifact.Version == "" || artifact.Version == version) { |
There was a problem hiding this comment.
Do not treat unknown artifact versions as universal matches
Fresh evidence: this line now makes an empty artifact version match any versioned package, so a repo-local artifact with the same name but a different or unparsed version is linked anyway. For example, if a pyproject module is version 2.0.0 but a requirements file pins mylib==1.0.0, the empty artifact version currently creates a false requirements.txt -> pyproject.toml dependency edge instead of rejecting the version mismatch. Preserve/extract artifact versions and only wildcard when the dependency itself is unversioned.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Leaving as-is. The condition (version == "" || artifact.Version == "" || artifact.Version == version) is intentional: local modules in Python monorepos commonly omit the version in pyproject.toml, and internal requirements.txt entries for those modules rarely pin an exact version. The scenario in the comment (artifact version 2.0.0 vs pinned 1.0.0) is already handled correctly — the match is rejected when both sides have a non-empty, differing version. The remaining edge case (empty artifact version + pinned requirement) is an acceptable trade-off: false positives are unlikely in practice and false negatives (missing inter-module edges) would be more harmful.
…r implementors Lockfile-only parsers (Cargo.lock, yarn.lock, etc.) were receiving a bare ScannedArtifact and appearing as type:file components in GetBuildFileTrees. Move the fallback inside the ArtifactExtractor type-assertion block so only extractors that opt in by implementing the interface get a bare component. Add a trivial GetArtifact to RequirementsTxtExtractor so requirements.txt files continue to appear in GetBuildFileTrees as intended. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ions ManifestParsers (parse manifests as package sources when no lockfile exists) and ExtractArtifactIds (emit file-type SBOM components for build file trees) are independent capabilities and should be controlled separately. Expose ManifestParsers as its own field in sbomgen.Options. Internally, ExtractArtifactIds still enables manifest parsers as a side-effect (needed to invoke GetArtifact on manifest extractors like PyProjectTOMLExtractor), but callers can now also enable ManifestParsers independently without implying any artifact ID extraction. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e4fcd1d912
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if _, isFileComponent := fileComponentPaths[depPath]; !isFileComponent { | ||
| continue | ||
| } | ||
| bf := BuildFile{FileType: ft, FilePath: depPath} |
There was a problem hiding this comment.
Preserve the target file type for cross-type dependencies
When a requirements file depends on a file component of another type, such as libs/mylib/pyproject.toml or setup.py, this constructs the target BuildFile with the dependent group's ft instead of the target file's actual type. As a result GetBuildFileTrees(sbom, FileTypeRequirementsTxt) can return a pyproject.toml/setup.py path labeled as FileTypeRequirementsTxt, even though the API documents BuildFile.FileType as the file's type; consumers can no longer reliably distinguish the manifest type for these dependency entries.
Useful? React with 👍 / 👎.
Summary
MavenProcessorwith a genericSimpleProcessorthat handles build file tree generation for any file type via BFS overFileDependencies. Adding support for a new file type now requires only a registration call — no new processor code needed.GetArtifacttoPyProjectTOMLExtractorsopyproject.tomlfiles report their module identity (reading[project].nameor[tool.poetry].name) — enabling Python inter-module relationships inGetBuildFileTrees.SimpleProcessorforrequirements.txtandpyproject.tomlso both file types participate in build file tree traversal.ExtractArtifactIds(previouslyExtractArtifacts) for clarity.ManifestParsersas an independent field insbomgen.Options, decoupled fromExtractArtifactIds. Manifest parsing (pyproject.toml as package source when no lockfile exists) and artifact ID extraction are orthogonal capabilities. Internally,ExtractArtifactIds=truestill activates manifest parsers as a side-effect (required forGetArtifactto run on manifest extractors).type:fileSBOM component for every build file that opts in (by implementingArtifactExtractor) whenExtractArtifactIds=true, even ifGetArtifactreturns no identity. Lockfile-only parsers (Cargo.lock, yarn.lock, etc.) that do not implementArtifactExtractorare unaffected.RequirementsTxtExtractorimplementsArtifactExtractorwith a trivialGetArtifactreturningnil, sorequirements.txtfiles appear inGetBuildFileTreeseven when they contain only local editable deps.findArtifactversion matching to also match when the artifact has no declared version — handles local modules that omit the version field.parseArtifactIDfrom Maven-only to any purl type.Changes
pkg/sbomgen/simple_processor.goSimpleProcessor(replacesMavenProcessor); registered forpom.xml,requirements.txt, andpyproject.tomlpkg/sbomgen/build_files.goparseArtifactID; fixed-point cross-type expansionpkg/sbomgen/sbomgen.goManifestParsersexposed as independentOptionsfield;ExtractArtifactIdsrenamedpkg/extractor/python/parse-pyproject-toml.goGetArtifact: reads[project].namethen[tool.poetry].namepkg/extractor/python/parse-requirements-txt.goGetArtifactso requirements.txt opts in to file-type component emissionpkg/extractor/python/types.goNamefield toPyProjectPoetrystructpkg/extractor/extract.goScannedArtifactfallback scoped toArtifactExtractorimplementors onlyinternal/output/sbom/cyclonedx.goaddFileDependencieshandles name-less artifacts;findArtifactallows empty artifact versionpkg/scanner/datadog_sbom_generator.goTest plan
SimpleProcessor(BFS traversal, artifact ID lookup, multi-type registration)PyProjectTOMLExtractor.GetArtifact(PEP project name, poetry name fallback, missing name, parse error)build_files_test.go: cross-type expansion, empty-name artifacts, fixed-point loopaddFileDependenciesbare artifact changeSimpleProcessorgo test ./...passesgolangci-lintpasses clean🤖 Generated with Claude Code