Commit 5a2cfec

edit dedupe reimport docs
1 parent c6b4b8d commit 5a2cfec

5 files changed

Lines changed: 64 additions & 11 deletions


docs/content/import_data/import_intro/import_vs_reimport.md

Lines changed: 8 additions & 2 deletions
@@ -1,5 +1,5 @@
 ---
-title: "Import vs Reimport"
+title: "Reimport"
 description: "Learn how to import data manually, through the API, or via a connector"
 weight: 2
 aliases:
@@ -80,7 +80,13 @@ This header indicates the actions taken by an Import/Reimport.
 * **\# left untouched** shows the count of Open Findings which were unchanged by a Reimport (because they also existed in the incoming report).
 * **\# reactivated** shows any Closed Findings which were reopened by an incoming Reimport.
 
-## Reimport via API \- special note
+## Reimport Deduplication
+
+Reimport decides whether an incoming item matches an existing Finding using **[Reimport Deduplication](/triage_findings/finding_deduplication/about_deduplication/)** settings. This is separate from “Same Tool Deduplication” and “Cross Tool Deduplication,” which operate after Findings exist.
+
+If you are seeing Reimport close old Findings and create new Findings when only a minor attribute changes (for example, a line number shift), tune **Reimport Deduplication** for that tool to use stable identifiers that ignore those attributes (such as Unique ID From Tool).
+
+## Reimport via API - special note
 
 Note that the /reimport API endpoint can both **extend an existing Test** (apply the method in this article) **or create a new Test** with new data \- an initial call to `/import`, or setting up a Test in advance is not required.
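The `/reimport` behavior described in this hunk can be exercised over the API; a minimal sketch, assuming an API v2 token and the `/api/v2/reimport-scan/` endpoint — the hostname, token variable, product/engagement names, and report file below are illustrative placeholders, so check your instance's API documentation for the exact fields:

```shell
# Illustrative sketch: host, token, names, and report file are placeholders.
# With auto_create_context enabled, a matching Test is extended if one is
# found; otherwise the Product/Engagement/Test context is created.
curl -X POST "https://defectdojo.example.com/api/v2/reimport-scan/" \
  -H "Authorization: Token $DD_API_TOKEN" \
  -F "scan_type=ZAP Scan" \
  -F "product_name=Example Product" \
  -F "engagement_name=Weekly Scans" \
  -F "auto_create_context=true" \
  -F "file=@zap_report.xml"
```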

docs/content/triage_findings/finding_deduplication/OS__deduplication_tuning.md

Lines changed: 8 additions & 1 deletion
@@ -1,5 +1,5 @@
 ---
-title: "Deduplication Tuning"
+title: "Deduplication Tuning (Open Source)"
 description: "Configure deduplication in DefectDojo Open Source: algorithms, hash fields, endpoints, and service"
 weight: 5
 audience: opensource
@@ -106,6 +106,10 @@ Notes:
 
 ## After changing deduplication settings
 
+After changing algorithms or Hash computation, you will need to **recompute hashes** for the affected parser/test type before the new matching behavior will apply consistently across existing data.
+
+Note: Recomputing hashes can be slow on large instances. Plan maintenance windows accordingly.
+
 - Changes to dedupe configuration (e.g., `HASHCODE_FIELDS_PER_SCANNER`, `HASH_CODE_FIELDS_ALWAYS`, `DEDUPLICATION_ALGORITHM_PER_PARSER`) are not applied retroactively automatically. To re-evaluate existing findings you must run the management command below.
 
 Run inside the uwsgi container. Example (hash codes only, no dedupe):
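The command itself falls outside this hunk; for context, a sketch of the usual invocation — the container name and flag follow the open-source deployment convention, so verify against your version with `--help`:

```shell
# Recompute hash codes for existing findings without running deduplication.
# Run from the directory containing your docker compose file; the service
# name "uwsgi" and the --hash_code_only flag may differ between versions.
docker compose exec uwsgi ./manage.py dedupe --hash_code_only
```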
@@ -141,3 +145,6 @@ To help troubleshooting deduplication use the following tools:
 ![Unique ID from Tool and Hash Code on the View Finding page](images/hash_code_id_field.png)
 
 ![Unique ID from Tool and Hash Code on the Finding List Status Column](images/hash_code_status_column.png)
+
+In Open Source,
+
docs/content/triage_findings/finding_deduplication/PRO__deduplication_tuning.md

Lines changed: 15 additions & 2 deletions
@@ -1,11 +1,12 @@
 ---
-title: "Deduplication Tuning"
+title: "Deduplication Tuning (Pro)"
 description: "Configure how DefectDojo identifies and manages duplicate findings"
 weight: 4
 audience: pro
 aliases:
 - /en/working_with_findings/finding_deduplication/tune_deduplication
 ---
+
 Deduplication Tuning is a DefectDojo Pro feature that gives you fine-grained control over how findings are deduplicated, allowing you to optimize duplicate detection for your specific security testing workflow.
 
 ## Deduplication Settings
@@ -41,6 +42,8 @@ Uses a combination of selected fields to generate a unique hash. When selected,
 #### Unique ID From Tool
 Leverages the security tool's own internal identifier for findings, ensuring perfect deduplication when the scanner provides reliable unique IDs.
 
+This algorithm can be useful when working with SAST scanners, or situations where a Finding can "move around" in source code as development progresses.
+
 #### Unique ID From Tool or Hash Code
 Attempts to use the tool's unique ID first, then falls back to the hash code if no unique ID is available. This provides the most flexible deduplication option.
 
@@ -60,7 +63,11 @@ Unlike Same Tool Deduplication, Cross Tool Deduplication only supports the Hash
 
 ## Reimport Deduplication
 
-Reimport Deduplication Settings are specifically designed for reimporting data using Universal Parsers or the Generic Parser.
+**⚠️ Reimport processes can completely discard Findings before they are recorded. This can lead to data loss if set incorrectly, so Reimport Deduplication settings should be adjusted with caution.**
+
+Reimport Deduplication Settings can be used to set an algorithm for Universal Parsers, or for a Generic Findings Import Parser.
+
+Reimport Deduplication cannot be adjusted for other tools by default. Users who want to adjust the Reimport Deduplication algorithm for other tools in their instance should reach out to [DefectDojo Support](mailto:support@defectdojo.com) for assistance.
 
 ![image](images/reimport_deduplication.png)
 
@@ -74,6 +81,8 @@ The same three algorithm options are available for Reimport Deduplication as for
 - Unique ID From Tool
 - Unique ID From Tool or Hash Code
 
+Reimport can completely discard Findings before they are recorded, so Reimport Deduplication settings should be adjusted with caution.
+
 ## Deduplication Best Practices
 
 For optimal results with Deduplication Tuning:
@@ -85,3 +94,7 @@ For optimal results with Deduplication Tuning:
 - **Avoid overly broad deduplication**: Cross-tool deduplication with too few hash fields may result in false duplicates
 
 By tuning deduplication settings to your specific tools, you can significantly reduce duplicate noise.
+
+## Locked Findings
+
+Whenever Deduplication Settings are changed for a given tool, Deduplication hashes will need to be re-calculated for that tool across the entire DefectDojo instance. During this process, Findings of this tool will be "locked", and their Deduplication Algorithm cannot be changed again until the recalculation is complete.
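The "Unique ID From Tool or Hash Code" fallback described in this file can be sketched as follows — a hypothetical illustration only, not DefectDojo's internal code; the field list and key format are invented for the example:

```shell
# Hypothetical sketch of "Unique ID From Tool or Hash Code" matching:
# prefer the scanner's own ID when present; otherwise hash the
# configured hash-code fields.
match_key() {
  unique_id="$1"; shift
  if [ -n "$unique_id" ]; then
    echo "unique_id:$unique_id"
  else
    echo "hash_code:$(printf '%s' "$*" | sha256sum | awk '{print $1}')"
  fi
}

match_key "TOOL-123" "SQL Injection"            # -> unique_id:TOOL-123
match_key "" "XSS" "79" "High" "Reflected XSS"  # -> hash_code:<sha256 over fields>
```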

docs/content/triage_findings/finding_deduplication/about_deduplication.md

Lines changed: 32 additions & 5 deletions
@@ -26,13 +26,29 @@ By default, these Tests would need to be nested under the same Product for Dedup
 
 Duplicate Findings are set as Inactive by default. This does not mean the Duplicate Finding itself is Inactive. Rather, this is so that your team only has a single active Finding to work on and remediate, with the implication being that once the original Finding is Mitigated, the Duplicates will also be Mitigated.
 
-## Deduplication vs Reimport
+## Reimport Deduplication
 
-Deduplication and Reimport are similar processes but they have a key difference:
+Deduplication and Reimport are similar processes, but they use different algorithms to identify Finding matches.
 
-* When you Reimport to a Test, the Reimport process looks at incoming Findings, **filters and** **discards any matches**. Those matches will never be created as Findings or Finding Duplicates.
-* Deduplication is applied 'passively' on Findings that have already been created. It will identify duplicates in scope and **label them**, but it will not delete or discard the Finding unless 'Delete Deduplicate Findings' is enabled.
-* The 'reimport' action of discarding a Finding always happens before deduplication; DefectDojo **cannot deduplicate Findings that are never created** as a result of Reimport's filtering.
+* When you Reimport to a Test, the Reimport process looks at incoming Findings, **compares hash codes, and then discards any matches**. Those matches will never be created as Findings or Finding Duplicates.
+
+However, any Findings that remain after Reimport Deduplication are still subject to Same-Tool Deduplication. So if you use a narrower scope for Same-Tool Deduplication, you can end up with Duplicates within a Reimport pipeline.
+
+### Example
+
+Here's a tool with a Reimport Deduplication algorithm which is different from the Same-Tool Deduplication algorithm.
+
+| Deduplication Algorithm | Hash Code Fields |
+| ----- | ---- |
+| Reimport | Title, CWE, Severity, Description, Line Number |
+| Same-Tool | Title, CWE, Severity, Description |
+
+Let's say you had a Finding in DefectDojo with a given line number. You re-scanned your environment, and the line number of that vulnerability changed. You reimport to the same Test. Here's what will happen during reimport and deduplication:
+
+* During Reimport, the Finding will not be matched to any Findings that already exist, because the line number is different. So a new Finding will be created in the Test.
+* After Reimport is complete, the Same-Tool Deduplication algorithm will run. Same-Tool Deduplication does not consider line number in this configuration, so the new Finding will be labelled as a duplicate.
+
+Reimport can completely discard Findings before they are recorded, so Reimport Deduplication settings should be adjusted with caution.
 
 ## When are duplicates appropriate?
 
@@ -119,3 +135,14 @@ For example, let’s say that you had your Maximum Duplicates field set to ‘1
 ### Applying this setting
 
 Applying **Delete Deduplicate Findings** will begin a deletion process immediately. This setting can be applied on the **System Settings** page. See Enabling Deduplication for more information.
+
+## Troubleshooting Deduplication
+
+Sometimes, Deduplication does not work as expected. Here are some examples of ways that Deduplication might not be working correctly, along with possible solutions.
+
+| What you see | Most likely cause | What to tune |
+| --- | --- | --- |
+| Reimport closes an old Finding and creates a new one when only the line number changed | Reimport matching uses unstable fields (for example, line number) | **Reimport Deduplication** (prefer stable IDs or stable hash fields) |
+| Multiple Findings are created in the same Test that you believe should be duplicates | Deduplication matching is not configured for that tool or scope | **Same Tool Deduplication** (and consider “Delete Deduplicate Findings” behavior) |
+| Duplicates are created across different tools | Cross-tool matching is disabled or too strict | **Cross Tool Deduplication (Pro only)** (hash-based matching) |
+| Excess duplicates of the same Finding are being created, across Tests | Asset Hierarchy is not set up correctly | [Consider Reimport for continual testing](/triage_findings/finding_deduplication/avoid_excess_duplicates/) |
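The line-number scenario in `about_deduplication.md` above can be sketched numerically. This is a hypothetical helper for illustration — the real hash-code computation lives inside DefectDojo — but it shows why the two field lists give opposite answers for the same pair of findings:

```shell
# Hypothetical hash-code computation over a configurable field list.
hash_fields() { printf '%s' "$*" | sha256sum | awk '{print $1}'; }

TITLE="SQL Injection"; CWE="89"; SEV="High"; DESC="Raw query built from user input."

# Reimport hash includes the line number: a shifted line (42 -> 57)
# produces a different hash, so a new Finding is created.
r_old=$(hash_fields "$TITLE" "$CWE" "$SEV" "$DESC" "42")
r_new=$(hash_fields "$TITLE" "$CWE" "$SEV" "$DESC" "57")
[ "$r_old" != "$r_new" ] && echo "reimport: no match, new Finding created"

# Same-Tool hash ignores the line number: the hashes agree,
# so the new Finding is labelled a duplicate afterwards.
s_old=$(hash_fields "$TITLE" "$CWE" "$SEV" "$DESC")
s_new=$(hash_fields "$TITLE" "$CWE" "$SEV" "$DESC")
[ "$s_old" = "$s_new" ] && echo "same-tool: match, labelled duplicate"
```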

docs/content/triage_findings/finding_deduplication/avoid_excess_duplicates.md

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ weight: 4
 aliases:
 - /en/working_with_findings/finding_deduplication/avoiding_duplicates_via_reimport
 ---
-One of DefectDojo’s strengths is that the data model can accommodate many different use\-cases and applications. You’ll likely change your approach as you master the software and discover ways to optimize your workflow.
+One of DefectDojo’s strengths is that the data model can accommodate many different use-cases and applications. You’ll likely change your approach as you master the software and discover ways to optimize your workflow.
 
 By default, DefectDojo does not delete any duplicate Findings that are created. Each Finding is considered to be a separate instance of a vulnerability. So in this case, **Duplicate Findings** can be an indicator that a process change is required to your workflow.
 