
Commit 91bd8fc

paulOsinski authored and Maffooch committed
[docs] expand deduplication / reimport documentation (DefectDojo#14392)
* edit dedupe reimport docs
* Update docs/content/triage_findings/finding_deduplication/OS__deduplication_tuning.md
* change article name and update links
* Update docs/content/triage_findings/finding_deduplication/PRO__deduplication_tuning.md
* remove weird line

Co-authored-by: Cody Maffucci <46459665+Maffooch@users.noreply.github.com>
1 parent f87f9b7 commit 91bd8fc

File tree

7 files changed: +64 additions, -14 deletions

docs/content/get_started/about/faq.md

Lines changed: 1 addition & 1 deletion
@@ -69,7 +69,7 @@ If you're looking to add a new tool to your suite, we have a list of recommended
  There are two different methods to import a single report from a security tool:

  - **Import** handles the report as a single point-in-time record. Importing a report creates a Test containing the resulting Findings.
- - **[Reimport](/import_data/import_intro/import_vs_reimport/)** is used to update an existing Test with a new set of results. If you have a more open-ended approach to your testing process, you can continuously Reimport the latest version of your report to an existing Test. DefectDojo will compare the results of the incoming report to your existing data, record any changes, and then adjust the Findings in the Test to match the latest report.
+ - **[Reimport](/import_data/import_intro/reimport/)** is used to update an existing Test with a new set of results. If you have a more open-ended approach to your testing process, you can continuously Reimport the latest version of your report to an existing Test. DefectDojo will compare the results of the incoming report to your existing data, record any changes, and then adjust the Findings in the Test to match the latest report.

  To understand the difference, it’s helpful to think of Import as recording a single instance of a scan event, and Reimport as updating a continual record of scanning.

docs/content/get_started/common_use_cases/common_use_cases.md

Lines changed: 1 addition & 1 deletion
@@ -38,7 +38,7 @@ Each of these report categories can be handled by a separate Engagement, with a
  ![image](images/example_product_hierarchy_bigcorp.png)

  - If a Product has a CI/CD pipeline, all of the results from that pipeline can be continually imported into a single open-ended Engagement. Each tool used will create a separate Test within the CI/CD Engagement, which can be continuously updated with new data.
- (See our guide to [Reimport](/import_data/import_intro/import_vs_reimport/))
+ (See our guide to [Reimport](/import_data/import_intro/reimport/))
  - Each Pen Test effort can have a separate Engagement created to contain all of the results: e.g. "Q1 Pen Test 2024," "Q2 Pen Test 2024," etc.
  - BigCorp will likely want to run their own mock PCI audit so that they're prepared for the real thing. The results of those audits can also be stored as a separate Engagement.

docs/content/import_data/import_intro/import_vs_reimport.md renamed to docs/content/import_data/import_intro/reimport.md

Lines changed: 8 additions & 2 deletions
@@ -1,5 +1,5 @@
  ---
- title: "Import vs Reimport"
+ title: "Reimport"
  description: "Learn how to import data manually, through the API, or via a connector"
  weight: 2
  aliases:
@@ -80,7 +80,13 @@ This header indicates the actions taken by an Import/Reimport.
  * **\# left untouched** shows the count of Open Findings which were unchanged by a Reimport (because they also existed in the incoming report).
  * **\# reactivated** shows any Closed Findings which were reopened by an incoming Reimport.

- ## Reimport via API \- special note
+ ## Reimport Deduplication
+
+ Reimport decides whether an incoming item matches an existing Finding using **[Reimport Deduplication](/triage_findings/finding_deduplication/about_deduplication/)** settings. This is separate from “Same Tool Deduplication” and “Cross Tool Deduplication,” which operate after Findings exist.
+
+ If you are seeing Reimport close old Findings and create new Findings when only a minor attribute changes (for example, a line number shift), tune **Reimport Deduplication** for that tool to use stable identifiers that ignore those attributes (such as Unique ID From Tool).
+
+ ## Reimport via API - special note

  Note that the /reimport API endpoint can both **extend an existing Test** (apply the method in this article) **or create a new Test** with new data \- an initial call to `/import`, or setting up a Test in advance is not required.
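As a sketch of that endpoint in practice, the snippet below posts a report to `/api/v2/reimport-scan/` with the `requests` library. The base URL, token, Test id, and the helper names (`build_reimport_payload`, `reimport`) are illustrative placeholders, not part of DefectDojo itself; check your instance's API docs for the full field list.

```python
def build_reimport_payload(test_id: int, scan_type: str) -> dict:
    """Form fields for a reimport that extends an existing Test."""
    return {
        "test": str(test_id),    # extend this existing Test...
        "scan_type": scan_type,  # ...with a report of this scan type
        "active": "true",
        "verified": "false",
    }

def reimport(base_url: str, token: str, test_id: int, scan_type: str, report_path: str):
    """POST a report file to the reimport endpoint (sketch, untested against a live host)."""
    import requests  # assumes the requests package is installed

    with open(report_path, "rb") as report:
        return requests.post(
            f"{base_url}/api/v2/reimport-scan/",
            headers={"Authorization": f"Token {token}"},
            data=build_reimport_payload(test_id, scan_type),
            files={"file": report},
            timeout=60,
        )
```

To let reimport create the Test for you instead, you would omit `test` and send identifiers such as `product_name` and `engagement_name` along with `auto_create_context` enabled.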

docs/content/triage_findings/finding_deduplication/OS__deduplication_tuning.md

Lines changed: 5 additions & 1 deletion
@@ -1,5 +1,5 @@
  ---
- title: "Deduplication Tuning"
+ title: "Deduplication Tuning (Open Source)"
  description: "Configure deduplication in DefectDojo Open Source: algorithms, hash fields, endpoints, and service"
  weight: 5
  audience: opensource
@@ -106,6 +106,10 @@ Notes:

  ## After changing deduplication settings

+ After changing algorithms or Hash computation, you will need to **recompute hashes** for the affected parser/test type before the new matching behavior will apply consistently across existing data.
+
+ Note: Recomputing hashes can lead to long wait times on large instances. Plan maintenance windows accordingly.
+
  - Changes to dedupe configuration (e.g., `HASHCODE_FIELDS_PER_SCANNER`, `HASH_CODE_FIELDS_ALWAYS`, `DEDUPLICATION_ALGORITHM_PER_PARSER`) are not applied retroactively automatically. To re-evaluate existing findings you must run the management command below.
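For context on those settings, here is a minimal sketch of what per-scanner dedupe configuration looks like as plain Python dictionaries. The scanner name "Example Scanner" and its field list are made-up illustrations, not shipped defaults; in a real override you would typically update the existing dicts from DefectDojo's settings module rather than replace them.

```python
# Illustrative sketch of dedupe settings (e.g. in a local settings override).
# "Example Scanner" and its fields are hypothetical; real keys must match
# the parser's scan type name exactly.
HASHCODE_FIELDS_PER_SCANNER = {
    "Example Scanner": ["title", "cwe", "severity", "description"],
}

# Fields appended to every scanner's hash computation.
HASH_CODE_FIELDS_ALWAYS = ["service"]

# Which matching strategy each parser uses.
DEDUPLICATION_ALGORITHM_PER_PARSER = {
    "Example Scanner": "hash_code",
}
```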
Run inside the uwsgi container. Example (hash codes only, no dedupe):

docs/content/triage_findings/finding_deduplication/PRO__deduplication_tuning.md

Lines changed: 15 additions & 2 deletions
@@ -1,11 +1,12 @@
  ---
- title: "Deduplication Tuning"
+ title: "Deduplication Tuning (Pro)"
  description: "Configure how DefectDojo identifies and manages duplicate findings"
  weight: 4
  audience: pro
  aliases:
  - /en/working_with_findings/finding_deduplication/tune_deduplication
  ---
+
  Deduplication Tuning is a DefectDojo Pro feature that gives you fine-grained control over how findings are deduplicated, allowing you to optimize duplicate detection for your specific security testing workflow.

  ## Deduplication Settings
@@ -41,6 +42,8 @@ Uses a combination of selected fields to generate a unique hash. When selected,
  #### Unique ID From Tool
  Leverages the security tool's own internal identifier for findings, ensuring perfect deduplication when the scanner provides reliable unique IDs.

+ This algorithm can be useful when working with SAST scanners, or in situations where a Finding can "move around" in source code as development progresses.
+
  #### Unique ID From Tool or Hash Code
  Attempts to use the tool's unique ID first, then falls back to the hash code if no unique ID is available. This provides the most flexible deduplication option.
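The "Unique ID From Tool or Hash Code" fallback can be sketched as a small matching function. This is an illustration of the decision order only, with hypothetical field names, not DefectDojo's internal code.

```python
# Sketch of "Unique ID From Tool or Hash Code" matching (hypothetical fields).
def is_duplicate(existing: dict, incoming: dict) -> bool:
    # Prefer the scanner's own identifier when both sides carry one...
    if existing.get("unique_id_from_tool") and incoming.get("unique_id_from_tool"):
        return existing["unique_id_from_tool"] == incoming["unique_id_from_tool"]
    # ...otherwise fall back to comparing computed hash codes.
    return existing.get("hash_code") == incoming.get("hash_code")
```

Because the tool's ID wins when present, a Finding whose hash changed (for example, it moved to a different line) still matches as long as the scanner kept its identifier stable.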

@@ -60,7 +63,11 @@ Unlike Same Tool Deduplication, Cross Tool Deduplication only supports the Hash

  ## Reimport Deduplication

- Reimport Deduplication Settings are specifically designed for reimporting data using Universal Parsers or the Generic Parser.
+ **⚠️ Reimport processes can completely discard Findings before they are recorded. This can lead to data loss if set incorrectly, so Reimport Deduplication settings should be adjusted with caution.**
+
+ Reimport Deduplication Settings can be used to set an algorithm for Universal Parsers, or for a Generic Findings Import Parser.
+
+ Reimport Deduplication cannot be adjusted for other tools by default. Users who want to adjust the Reimport Deduplication algorithm for other tools in their instance should reach out to [DefectDojo Support](mailto:support@defectdojo.com) for assistance.

  ![image](images/reimport_deduplication.png)

@@ -74,6 +81,8 @@ The same three algorithm options are available for Reimport Deduplication as for
  - Unique ID From Tool
  - Unique ID From Tool or Hash Code

+ Reimport can completely discard Findings before they are recorded, so Reimport Deduplication settings should be adjusted with caution.
+
  ## Deduplication Best Practices

  For optimal results with Deduplication Tuning:
@@ -85,3 +94,7 @@ For optimal results with Deduplication Tuning:
  - **Avoid overly broad deduplication**: Cross-tool deduplication with too few hash fields may result in false duplicates

  By tuning deduplication settings to your specific tools, you can significantly reduce duplicate noise.
+
+ ## Locked Findings
+
+ Whenever Deduplication Settings are changed for a given tool, Deduplication hashes are re-calculated for that tool across the entire DefectDojo instance.

docs/content/triage_findings/finding_deduplication/about_deduplication.md

Lines changed: 32 additions & 5 deletions
@@ -26,13 +26,29 @@ By default, these Tests would need to be nested under the same Product for Dedup

  Duplicate Findings are set as Inactive by default. This does not mean the Duplicate Finding itself is Inactive. Rather, this is so that your team only has a single active Finding to work on and remediate, with the implication being that once the original Finding is Mitigated, the Duplicates will also be Mitigated.

- ## Deduplication vs Reimport
+ ## Reimport Deduplication

- Deduplication and Reimport are similar processes but they have a key difference:
+ Deduplication and Reimport are similar processes, but they use different algorithms to identify Finding matches.

- * When you Reimport to a Test, the Reimport process looks at incoming Findings, **filters and** **discards any matches**. Those matches will never be created as Findings or Finding Duplicates.
- * Deduplication is applied 'passively' on Findings that have already been created. It will identify duplicates in scope and **label them**, but it will not delete or discard the Finding unless 'Delete Deduplicate Findings' is enabled.
- * The 'reimport' action of discarding a Finding always happens before deduplication; DefectDojo **cannot deduplicate Findings that are never created** as a result of Reimport's filtering.
+ * When you Reimport to a Test, the Reimport process looks at incoming Findings, **compares hash codes, and then discards any matches**. Those matches will never be created as Findings or Finding Duplicates.
+
+ However, any Findings that remain after Reimport Deduplication are still subject to Same-Tool Deduplication. So if you use a narrower scope for Same-Tool Deduplication, you can end up with Duplicates within a Reimport pipeline.
+
+ ### Example
+
+ Here's a tool with a Reimport Deduplication algorithm which is different from the Same-Tool Deduplication algorithm.
+
+ | Deduplication Algorithm | Hash Code Fields |
+ | ----- | ---- |
+ | Reimport | Title, CWE, Severity, Description, Line Number |
+ | Same-Tool | Title, CWE, Severity, Description |
+
+ Let's say you had a Finding in DefectDojo with a given line number. You re-scanned your environment and the line number of that vulnerability changed. You reimport to the same Test. Here's what will happen during reimport and deduplication:
+
+ * During Reimport, the Finding will not be matched to any Findings that already exist, because the line number is different. So a new Finding will be created in the Test.
+ * After Reimport is complete, the Same-Tool Deduplication algorithm will run. Same-Tool Deduplication does not consider line number in this configuration, so the new Finding will be labelled as a duplicate.
+
+ Reimport can completely discard Findings before they are recorded, so Reimport Deduplication settings should be adjusted with caution.
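The line-number example can be sketched numerically: compute each hash over its configured field set and compare. This is a simplified illustration of the idea, not DefectDojo's actual hashing code; the field names and sample Finding are hypothetical.

```python
import hashlib

def hash_code(finding: dict, fields: list) -> str:
    # Join the configured fields in order and hash them (simplified sketch).
    joined = "|".join(str(finding.get(f, "")) for f in fields)
    return hashlib.sha256(joined.encode()).hexdigest()

# Field sets from the example table above.
REIMPORT_FIELDS = ["title", "cwe", "severity", "description", "line"]
SAME_TOOL_FIELDS = ["title", "cwe", "severity", "description"]

old = {"title": "SQL Injection", "cwe": 89, "severity": "High",
       "description": "User input reaches a raw query.", "line": 42}
new = dict(old, line=57)  # same vulnerability, shifted line number

# Reimport hashes over the line number, so the incoming Finding is NOT a match:
assert hash_code(old, REIMPORT_FIELDS) != hash_code(new, REIMPORT_FIELDS)
# Same-Tool Deduplication ignores the line, so the new Finding IS a duplicate:
assert hash_code(old, SAME_TOOL_FIELDS) == hash_code(new, SAME_TOOL_FIELDS)
```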

## When are duplicates appropriate?

@@ -119,3 +135,14 @@ For example, let’s say that you had your Maximum Duplicates field set to ‘1
  ### Applying this setting

  Applying **Delete Deduplicate Findings** will begin a deletion process immediately. This setting can be applied on the **System Settings** page. See Enabling Deduplication for more information.
+
+ ## Troubleshooting Deduplication
+
+ Sometimes, Deduplication does not work as expected. Here are some examples of ways that Deduplication might not be working correctly, along with possible solutions.
+
+ | What you see | Most likely cause | What to tune |
+ | --- | --- | --- |
+ | Reimport closes an old Finding and creates a new one when only the line number changed | Reimport matching uses unstable fields (for example, line number) | **Reimport Deduplication** (prefer stable IDs or stable hash fields) |
+ | Multiple Findings are created in the same Test that you believe should be duplicates | Deduplication matching is not configured for that tool or scope | **Same Tool Deduplication** (and consider “Delete Deduplicate Findings” behavior) |
+ | Duplicates are created across different tools | Cross-tool matching is disabled or too strict | **Cross Tool Deduplication (Pro only)** (hash-based matching) |
+ | Excess duplicates of the same Finding are being created, across Tests | Asset Hierarchy is not set up correctly | [Consider Reimport for continual testing](/triage_findings/finding_deduplication/avoid_excess_duplicates/) |

docs/content/triage_findings/finding_deduplication/avoid_excess_duplicates.md

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@ weight: 4
  aliases:
  - /en/working_with_findings/finding_deduplication/avoiding_duplicates_via_reimport
  ---
- One of DefectDojo’s strengths is that the data model can accommodate many different use\-cases and applications. You’ll likely change your approach as you master the software and discover ways to optimize your workflow.
+ One of DefectDojo’s strengths is that the data model can accommodate many different use-cases and applications. You’ll likely change your approach as you master the software and discover ways to optimize your workflow.

  By default, DefectDojo does not delete any duplicate Findings that are created. Each Finding is considered to be a separate instance of a vulnerability. So in this case, **Duplicate Findings** can be an indicator that a process change is required to your workflow.

@@ -46,7 +46,7 @@ DefectDojo has two methods for importing test data to create Findings: **Import*

  Each time you import new vulnerability reports into DefectDojo, those reports will be stored in a Test object. A Test object can be created by a user ahead of time to hold a future **Import**. If a user wants to import data without specifying a Test destination, a new Test will be created to store the incoming report.

- Tests are flexible objects, and although they can only hold one *kind* of report, they can handle multiple instances of that same report through the **Reimport** method. To learn more about Reimport, see our **[article](/import_data/import_intro/import_vs_reimport/)** on this topic.
+ Tests are flexible objects, and although they can only hold one *kind* of report, they can handle multiple instances of that same report through the **Reimport** method. To learn more about Reimport, see our **[article](/import_data/import_intro/reimport/)** on this topic.

  ## Using Reimport for continual Tests
