MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

Search-based red-team attack on diffusion LLMs (dLLMs). The attacker generates a <mask:N> template; the victim dLLM fills the masks in parallel and emits harmful content. Evaluated on LLaDA, LLaDA-1.5, LLaDA2.0-mini, Dream-Instruct, and MMaDA-CoT across HarmBench, JailbreakBench, StrongReject, and AdvBench.

The pipeline has three stages — bootstrap a pattern library (A), grow it via UCB-guided online attack (B), then transfer it by pure retrieval (C). The Quick start below runs Stage B against one victim using the library checked into logs/.

Setup

Hardware: ≥ 2 GPUs (one hosts the attacker / scorer LLMs via vLLM, the other runs the victim dLLM). Each vLLM server fits in ~45% of a 40 GB card with the default 4 k context.

# 1. Clone + enter
git clone <repo> Attack4dLLM && cd Attack4dLLM

# 2. Conda env (Python 3.10)
conda create -n dllm python=3.10 -y
conda activate dllm

# 3. Core dependencies
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install vllm transformers accelerate safetensors einops regex xxhash
pip install numpy pandas tqdm omegaconf pyyaml jinja2 termcolor matplotlib
pip install requests boto3      # local + Bedrock judges

# 4. (optional) Per-victim extras — install only what you plan to run
pip install decord                            # MMaDA visual tokens
pip install flash-attn --no-build-isolation   # speedup on Hopper / Ampere

Bedrock judge — set credentials in the environment before any --judge bedrock invocation:

export AWS_BEARER_TOKEN_BEDROCK=<your-token>
export AWS_REGION=<your-region>

Victim model weights are pulled from HuggingFace on first use; the path is read from configs/<victim>_attack.yaml. Set HF_HOME if you want the cache to live somewhere specific. The benchmark data files (data/{jailbreakbench,strongreject,harmbench,advbench}*.json) ship with the repo.

A copy of the pattern library learned in Stage A + B is checked into logs/shared_pattern_registry.json (≈ 1 300 patterns) and logs/shared_pattern_set.json. Both stages B and C read from these — you can reproduce Stage B and Stage C without re-running Stage A.

Static evaluation with `benchmark/`

For a quick static probe you do not need to run any of the stages above. benchmark/{jailbreak,harmbench,strongreject}.json ships 100 goals per benchmark, each with up to 3 high-transferability templates that successfully broke ≥ 4 of the 5 victims. Each record is

{
  "goal_id": "jbb_0",
  "goal": "...",
  "templates": ["<template with <mask:N> placeholders>", "...", "..."]
}

Expand <mask:N> to the victim's mask token (<|mask|> for Dream / LLaDA2, <|mdm_mask|> for LLaDA / LLaDA-1.5 / MMaDA), submit, and judge — no attacker LLM, no scorer, no UCB:

import json, re
from inference import load_victim, run_victim_online

MASK_TOKEN = "<|mask|>"   # pick per victim
victim_cfg = "configs/dream_attack.yaml"
_, model, tok = load_victim(victim_cfg)

def expand(tmpl):
    return re.sub(r"<mask:(\d+)>", lambda m: MASK_TOKEN * int(m.group(1)), tmpl)

records = json.load(open("benchmark/jailbreak.json"))
for r in records:
    for tmpl in r["templates"]:
        out = run_victim_online(
            victim_model=model, victim_tokenizer=tok,
            prompt=r["goal"], template=expand(tmpl),
            victim_config_path=victim_cfg,
        )
        # judge `out` with evaluation.py / your own scorer

This is the simplest reproduction path for any new victim and the cheapest way to compare different decoding settings (mask ratio, step count, block size, …) against a fixed set of known-good attacks.

Quick start

conda activate dllm

# 1. Start the attacker + weaker base LLMs (GPU 0)
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-4B-Instruct-2507 --port 8100 \
    --trust-remote-code --max-model-len 4096 --gpu-memory-utilization 0.45 &
CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-4B-Base --port 8101 \
    --trust-remote-code --max-model-len 4096 --gpu-memory-utilization 0.45 &

# 2. Attack: Dream on JailbreakBench, 100 goals (GPU 1)
CUDA_VISIBLE_DEVICES=1 python run_victim.py \
    --bench jbb --victim dream --max-goals 100

# 3. Aggregate the per-iteration records into one row per goal
python aggregate_records.py --bench jailbreak --out-dir results/main

# 4. Judge ASR + Harmscore (Bedrock Qwen3-235B)
python evaluation.py --files 'results/main/jailbreak_dream.json' \
    --out results/main/summary.json --metrics asr,hs --judge bedrock

Records land in logs/jbb_dream/expansion_records.jsonl; re-running resumes (goals already in the file are skipped). Swap --bench jbb for hb / sr, and --victim dream for one of {llada, llada1_5, llada2, mmada} to cover the other (benchmark, victim) cells.

Stage A — Bootstrap pattern library

exploration.py mines an initial library of mask-fill schemas from a small seed of harmful goals. Each goal goes through the attacker LLM, which emits a strategy + jailbreak prompt + <mask:N> template; templates that succeed against the victim are abstracted into schemas via framework/pattern.py and inserted into the registry.

python exploration.py config=configs/exploration.yaml

Output: logs/shared_pattern_registry.json (the canonical library) and logs/shared_pattern_set.json (per-goal-type buckets).

You usually do not need to re-run this — the persisted library in logs/ is what every later stage reads. Re-run only when changing the seed set or the schema-extraction prompt.

Stage B — Online expansion + attack

expansion.py is the core attack loop, invoked through run_victim.py. For each goal it runs up to max_iters=3 iterations of:

UCB pattern selection from the goal-type bucket of the registry.
attacker.instantiate — one LLM call that produces a strategy tag, a jailbreak prompt, and a <mask:N> template aligned to the chosen schema.
Victim inference — the dLLM fills the masks in parallel.
Scorer (Qwen3-4B-Instruct via vLLM) — HarmBench-style classifier, reward ∈ [0, 1].
Fallback (≤ 3 retries) when reward < 0.7: the weaker base model on port 8101 produces a raw response → Bedrock Qwen3-235B inserts [UNSAFE]…[/UNSAFE] tags → tagged spans become <mask:N> → re-attack.
Pattern evolution — when a later iteration improves on the previous, the summarizer distills a new schema + template and writes both to logs/shared_pattern_registry.json.

Stage B is what the Quick start drives. Suffix tags (--suffix _foo) isolate per-experiment registries; without a suffix everything updates the shared library.

Stage C — Transferability

baselines/transition.py reuses the persisted library as a frozen black box: no attacker LLM at attack time, the victim is the only LLM in the loop.

For each AdvBench goal:

Look up the goal-type bucket.
Sample k patterns from the bucket (or use embedding cosine, via baselines/transition_embed.py).
Swap line 1 of each representative_template for the new goal, expand <mask:N> to the victim's mask token, submit to the victim.
Best-of-k ASR — the goal counts as jailbroken if any of the k attempts is judged unsafe.

# Refresh goal-type buckets if the registry changed (needs vLLM 8100 + Bedrock)
python scripts/_expand_taxonomy_bedrock.py

# Per-GPU runners
./scripts/_run_transition_per_gpu.sh 0 "dream llada llada1_5 mmada llada2"
./scripts/_run_baseline_per_gpu.sh   0 "direct/dream pad/dream dija/dream ..."

# Judge
python scripts/_judge_transition.py \
    --files 'results/transition_v2/advbench_*.json' \
    --out results/transition_v2/bedrock_asr_summary.json --workers 16
python scripts/_judge_hs.py --transition \
    --files 'results/transition_v2/advbench_*.json' \
    --out results/transition_v2/bedrock_hs_summary.json --workers 16

Outputs land in results/transition_v2/advbench_{victim}.json. The baselines/ directory also contains transition_baselines.py (direct / pad / dija), transition_embed.py (BGE cosine retrieval), and transition_random_global.py (uniform-random over the whole registry).

Repository layout

exploration.py            Stage A — bootstrap pattern library
expansion.py              Stage B core loop (UCB + fallback + evolution)
run_victim.py             Stage B CLI entry — main + defense ablations
inference.py              Victim model load + run
aggregate_records.py      expansion_records.jsonl  ->  per-goal result JSON
evaluation.py             ASR-LLM + Harmscore judging (Bedrock / local)
utils.py                  Config / prompt / parsing helpers

baselines/                Stage C transferability
  transition.py             best-of-k retrieval from the registry
  transition_baselines.py   direct / pad / dija
  transition_embed.py       BGE cosine retrieval
  transition_random_global.py

framework/                Attacker, Scorer, PatternRegistry, Summarizer, Retriever
llm/                      HuggingFace, vLLM, Bedrock backends
sample/                   Per-victim dLLM mask-fill (LLaDA, LLaDA2, Dream, MMaDA)
configs/                  YAML configs (one per victim, plus expansion / exploration)
data/                     JBB, HB, StrongReject, AdvBench goal files
scripts/                  Per-GPU runners, judges, taxonomy / embedding helpers
benchmark/                Curated (goal, template) pairs for static evaluation

Citation

If you use this code or the released pattern library / benchmark in your research, please cite:

@misc{ma2026maskforgestructureawareadaptiveattacks,
      title={MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models}, 
      author={Yingzi Ma and Zhengyue Zhao and Xiaogeng Liu and Minhui Xue and Yue Zhao and Chaowei Xiao},
      year={2026},
      eprint={2606.04027},
      archivePrefix={arXiv},
      primaryClass={cs.CR},
      url={https://arxiv.org/abs/2606.04027}, 
}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

Setup

Static evaluation with `benchmark/`

Quick start

Stage A — Bootstrap pattern library

Stage B — Online expansion + attack

Stage C — Transferability

Repository layout

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
baselines		baselines
benchmark		benchmark
configs		configs
data		data
framework		framework
llm		llm
sample		sample
scripts		scripts
.gitignore		.gitignore
README.md		README.md
aggregate_records.py		aggregate_records.py
evaluation.py		evaluation.py
expansion.py		expansion.py
exploration.py		exploration.py
inference.py		inference.py
redteamprompt.txt		redteamprompt.txt
run_victim.py		run_victim.py
utils.py		utils.py

Folders and files

Latest commit

History

Repository files navigation

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

Setup

Static evaluation with benchmark/

Quick start

Stage A — Bootstrap pattern library

Stage B — Online expansion + attack

Stage C — Transferability

Repository layout

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Static evaluation with `benchmark/`

Packages