FEAT: DatasetConfiguration Refactor by rlundeen2 · Pull Request #2071 · microsoft/PyRIT

rlundeen2 · 2026-06-22T22:07:11Z

DatasetConfiguration as-is worked fairly well for the first scenarios, but it ran into several issues as we added more. garak.encoding needed seedPrompts; jailbreak needed two types of datasets (both the harms and the datasets themselves); psychosocial had datasets tied to techniques; and other Garak scenarios need different, more flexible types. This PR refactors DatasetConfiguration so that it is a better fit for these diverse scenarios.

One big addition: if a dataset isn't in memory, the configuration will use the DatasetProvider to fetch the dataset and put it in memory. This means you don't need the load_default_datasets initializer on startup unless you want to preload things.

This refactors DatasetConfiguration from a single catch-all class into a generic base plus typed subclasses (DatasetObjectiveConfiguration, DatasetPromptConfiguration, and DatasetAttackConfiguration, the default most scenarios use). Constraints are now expressed through composable validators that run against the fully resolved dataset before max_dataset_size sampling, so they describe the dataset itself rather than a sampled subset. Each resolved dataset carries a DatasetSourceKind (inline vs. memory), which lets a scenario require or forbid inline seeds — useful for CLI flags such as --objectives. Non-emptiness is enforced as a default validator on every configuration, and typed subclasses layer a seed-type check on top.
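As a rough illustration of the validator idea described above, the sketch below runs composable checks against the full resolved dataset and only then samples down to max_dataset_size. It is not PyRIT's implementation: `SourceKind`, `Resolved`, `ConstraintError`, `require_non_empty`, `forbid_inline`, and `resolve_and_sample` are hypothetical stand-ins invented for this example.

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Sequence


class SourceKind(Enum):
    """Hypothetical stand-in for DatasetSourceKind."""
    INLINE = "inline"
    MEMORY = "memory"


@dataclass(frozen=True)
class Resolved:
    """Hypothetical stand-in for a fully resolved dataset."""
    seeds: tuple
    source_kind: SourceKind


class ConstraintError(ValueError):
    """Hypothetical stand-in for DatasetConstraintError."""


Validator = Callable[[Resolved], None]


def require_non_empty(resolved: Resolved) -> None:
    # Default check: a configuration must resolve to at least one seed.
    if not resolved.seeds:
        raise ConstraintError("dataset resolved to zero seeds")


def forbid_inline(resolved: Resolved) -> None:
    # Scenario-specific check keyed on where the seeds came from.
    if resolved.source_kind is SourceKind.INLINE:
        raise ConstraintError("inline seeds are not allowed for this scenario")


def resolve_and_sample(
    resolved: Resolved,
    validators: Sequence[Validator],
    max_dataset_size: Optional[int],
    rng: random.Random,
) -> list:
    # Validators see the whole dataset, so they describe the dataset
    # itself rather than whichever subset sampling happens to pick.
    for validator in validators:
        validator(resolved)
    seeds = list(resolved.seeds)
    if max_dataset_size is None or len(seeds) <= max_dataset_size:
        return seeds
    return rng.sample(seeds, max_dataset_size)
```

Keeping each check as a plain callable means a scenario can combine them as a list, and the order of validation and sampling stays in one place.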

Resolution is also more predictable: when a configured dataset name is not yet in memory and auto_fetch is set, it is fetched from the registered SeedDatasetProvider on demand, and any failure now raises loudly with the chained root cause instead of silently warning. The redundant pre-pass that eagerly pre-populated memory has been removed in favor of this on-demand path, and all scenario callers and tests have been migrated to the new methods (get_seeds_async, get_seed_attack_groups_async, get_attack_groups_by_dataset_async). Legacy getters remain but emit deprecation warnings. All scenario unit tests pass.
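The on-demand path described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's code: `get_or_fetch`, the `memory` dict, the `fetch` callable, and `ConstraintError` are hypothetical names standing in for the real memory store, the registered provider, and DatasetConstraintError.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


class ConstraintError(ValueError):
    """Hypothetical stand-in for DatasetConstraintError."""


async def get_or_fetch(
    name: str,
    memory: Dict[str, List[str]],
    fetch: Callable[[str], Awaitable[List[str]]],
    auto_fetch: bool = True,
) -> List[str]:
    # Fast path: the dataset is already in memory.
    if name in memory:
        return memory[name]
    if not auto_fetch:
        raise ConstraintError(f"dataset {name!r} is not in memory and auto_fetch is off")
    try:
        seeds = await fetch(name)
    except Exception as exc:
        # Fail loudly and keep the root cause attached via __cause__.
        raise ConstraintError(f"auto-fetch of dataset {name!r} failed") from exc
    # Store the result so later lookups skip the provider.
    memory[name] = seeds
    return seeds
```

With this shape, a caller that catches the error can still inspect `__cause__` to see whether the underlying problem was, for example, a network failure.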

Restructure DatasetConfiguration into a generic base plus typed subclasses (DatasetObjectiveConfiguration, DatasetPromptConfiguration, DatasetAttackConfiguration). Constraints are expressed through composable validators run against the fully resolved dataset (pre-sampling), and the resolved set carries a DatasetSourceKind (inline vs memory) so validators can require or forbid inline seeds. Auto-fetch missing datasets from the provider on demand and raise loudly (with chained root cause) instead of silently warning. Non-emptiness is enforced as a default validator on every config. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Scenarios now fetch their datasets from the registered provider on demand the first time they run, so the load_default_datasets initializer is no longer required for everyday runs or in the recommended default config. Remove it from the default config examples and per-run scanner command examples, and add a note explaining it is now an optional preload step (useful for warming memory or populating a database for offline use). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Add a dataset_names tuple to ResolvedDataset plus a restrict_dataset_names validator so scenarios that pair techniques with specific datasets (e.g. psychosocial, jailbreak) can constrain which datasets a configuration may draw from. Replace the EXPLICIT_SEED_GROUPS_KEY dict-overload with an explicit inline-vs-named split. Named resolution (_collect_named_seeds_async) now returns only real dataset names, and get_seeds_async resolves inline data through a dedicated branch, removing the reserved-key collision guards. Inline data keeps a single honest INLINE_DATASET_NAME label in the by-dataset views (used for atomic-attack naming), so user-facing labels read 'technique_inline' instead of leaking the old sentinel. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Remove DatasetObjectiveConfiguration, DatasetPromptConfiguration, and the generic DatasetConfiguration[SeedT] flat resolver (get_seeds_async). None had production callers; the objectives-only constraint that motivated the typed subclasses is enforced at runtime per-technique, not at the dataset level. DatasetConfiguration is now a plain (non-generic) base of resolution/fetch/validate/sample plumbing plus the deprecated legacy getters, and DatasetAttackConfiguration remains the one concrete resolver scenarios use. Tests that exercised the removed flat resolver now drive the same base resolution through DatasetAttackConfiguration. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

romanlutz · 2026-06-23T12:16:25Z

I'm a bit worried about datasets we pulled into the DB in the past that need updates (e.g. in the context of standardizing harm categories, or fixing metadata fields). Similarly, what happens to live datasets like threat feeds (think Odin or others)?

How do we make sure they get refreshed?

rlundeen2 · 2026-06-23T21:16:07Z

> I'm a bit worried about datasets we pulled into the DB in the past that need updates (e.g. in the context of standardizing harm categories, or fixing metadata fields). Similarly, what happens to live datasets like threat feeds (think Odin or others)?
>
> How do we make sure they get refreshed?

Good idea to include. I would handle it separately from this PR, but there are a couple of solutions I like:

1. Add an init parameter to this class that always refreshes.
2. Add an initializer that refreshes existing datasets that haven't been refreshed in X days (or refreshes all of them if X is 0).

I like #2 more.
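The second option could select datasets along these lines. This is only a sketch of the staleness rule as stated in the comment; `datasets_to_refresh`, the `last_refreshed` mapping, and `max_age_days` are hypothetical names, and nothing here exists in PyRIT today.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def datasets_to_refresh(
    last_refreshed: Dict[str, datetime],
    max_age_days: int,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return the names of datasets whose last refresh is older than max_age_days.

    A max_age_days of 0 means "refresh everything".
    """
    now = now or datetime.now(timezone.utc)
    if max_age_days == 0:
        return sorted(last_refreshed)
    cutoff = now - timedelta(days=max_age_days)
    return sorted(name for name, stamp in last_refreshed.items() if stamp < cutoff)
```

An initializer built on this would need a per-dataset refresh timestamp in memory, which is an assumption about the schema rather than something the PR provides.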

romanlutz · 2026-06-26T20:17:35Z

+        # into the scenario's friendly "dataset not available" message.
+        try:
+            seed_groups = await self._dataset_config.get_seed_attack_groups_async()
+        except DatasetConstraintError:


Catching DatasetConstraintError here and then falling back to _raise_dataset_exception() drops the original failure context. DatasetConstraintError can distinguish missing datasets, auto-fetch failures, and validation failures, but this path turns all of them into the same generic message. Since DatasetConstraintError is already a ValueError, I think we should let it propagate (or re-raise with from exc) so operators keep the actual cause.
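For reference, the `from exc` variant looks roughly like this. The sketch uses a local `ConstraintError` stand-in and a hypothetical `load_groups` function; it does not reproduce the scenario code or `_raise_dataset_exception()`.

```python
class ConstraintError(ValueError):
    """Local stand-in for DatasetConstraintError (a ValueError subclass)."""


def load_groups() -> list:
    # Hypothetical failure; a real one could be a missing dataset,
    # a failed auto-fetch, or a validator rejection.
    raise ConstraintError("auto-fetch failed for dataset 'example'")


def run_with_chaining() -> list:
    try:
        return load_groups()
    except ConstraintError as exc:
        # Keep the friendly prefix, but carry the specific message and
        # attach the original exception as __cause__.
        raise ValueError(f"dataset not available: {exc}") from exc
```

Because the stand-in subclasses ValueError, simply not catching it would also satisfy any caller that already handles ValueError.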

romanlutz · 2026-06-26T20:17:35Z

+        groups_by_dataset, resolved = await self._build_groups_by_dataset_async()
+        self.validate(resolved)
+        groups = [group for groups in groups_by_dataset.values() for group in groups]
+        groups = self._apply_max_dataset_size(groups)


This changes max_dataset_size from per-dataset sampling to global sampling for callers that use get_seed_attack_groups_async(). EncodingDatasetConfiguration now goes through this path, and its default config has two datasets plus max_dataset_size=3, so a run resolves 3 attack groups total instead of up to 3 from each dataset as before. That silently cuts the scenario's coverage. Can we either preserve per-dataset sampling here or make the new budget explicit in encoding defaults/tests?
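The difference in budget can be shown with a small sketch. The two helper functions and the dataset names are hypothetical and only illustrate the arithmetic; they are not the PR's `_apply_max_dataset_size`.

```python
import random
from typing import Dict, List


def sample_per_dataset(groups_by_dataset: Dict[str, List[str]], max_size: int, rng: random.Random) -> List[str]:
    # Old behavior as described: up to max_size groups from EACH dataset.
    sampled: List[str] = []
    for groups in groups_by_dataset.values():
        sampled.extend(rng.sample(groups, min(max_size, len(groups))))
    return sampled


def sample_global(groups_by_dataset: Dict[str, List[str]], max_size: int, rng: random.Random) -> List[str]:
    # New behavior as described: flatten first, then up to max_size in TOTAL.
    flat = [group for groups in groups_by_dataset.values() for group in groups]
    return rng.sample(flat, min(max_size, len(flat)))


# Two hypothetical datasets with five attack groups each.
datasets = {
    "dataset_a": [f"a{i}" for i in range(5)],
    "dataset_b": [f"b{i}" for i in range(5)],
}
per_dataset = sample_per_dataset(datasets, 3, random.Random(0))
global_budget = sample_global(datasets, 3, random.Random(0))
print(len(per_dataset), len(global_budget))  # prints "6 3"
```

With global sampling there is also no guarantee that every dataset contributes at least one group, which is a second way coverage can shrink.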

rlundeen2 changed the title ~~MAINT: DatasetConfiguration Refactor~~ FEAT: DatasetConfiguration Refactor Jun 22, 2026

rlundeen2 mentioned this pull request Jun 22, 2026

[BREAKING] FEAT: Standardize Jailbreak scenario defaults #2045

Open

rlundeen2 and others added 3 commits June 22, 2026 15:44

romanlutz reviewed Jun 26, 2026

rlundeen2 wants to merge 4 commits into microsoft:main from rlundeen2:rlundeen2-redesign-dataset-configuration


2 participants
