Skip to content

fix(addie): scrub model/provider self-disclosure from responses#5506

Open
andybevan-scope3 wants to merge 2 commits into
adcontextprotocol:mainfrom
andybevan-scope3:addie-identity-disclosure-backstop
Open

fix(addie): scrub model/provider self-disclosure from responses#5506
andybevan-scope3 wants to merge 2 commits into
adcontextprotocol:mainfrom
andybevan-scope3:addie-identity-disclosure-backstop

Conversation

@andybevan-scope3

Copy link
Copy Markdown

Adds a deterministic backstop so the community assistant cannot break character and disclose the underlying model or vendor under task-stress (a tool failure, an out-of-scope request) when the prompt-level identity rule alone is not enough.

Changes

  • Output backstoprewritePersonaCollapse() removes any sentence that discloses the underlying model or provider, wired into applyResponsePipeline at every response return site. Sentence-level (keeps the rest of the reply), preserves fenced code blocks, idempotent. Precision-tuned so legitimate client references ("Claude Code", "Claude Desktop") and ordinary prose ("serves as a model for…") pass through untouched.
  • Prompt rule — adds an "Identity continuity" rule to the assistant system prompt.
  • Input/output canaries (security.ts) — flags identity-substitution prompt injection on input ("you are actually …") and the same disclosure patterns on output as an audit signal.
  • Log-on-catch — warns when the backstop actually scrubs a disclosure, so a rising rate is observable (e.g. after a model change).

Tests

  • ~50 new unit tests across the post-processor and security patterns (TDD: red → green).
  • Full suite green locally; typecheck clean.

🤖 Generated with Claude Code

Add a deterministic backstop so Addie can't break character and disclose
the underlying model or vendor under task-stress (a tool failure, an
out-of-scope request) when the prompt-level identity rule alone isn't
enough.

- rewritePersonaCollapse() removes any sentence disclosing the model or
  provider, wired into applyResponsePipeline at every response return site
- identity.md gains an "Identity continuity" hard rule
- security.ts flags identity-substitution injection on input and the same
  disclosure patterns on output (audit canary)
- log-on-catch warns when the backstop actually scrubs a disclosure

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@aao-ipr-bot

aao-ipr-bot Bot commented Jun 13, 2026

Copy link
Copy Markdown

IPR Policy Agreement Required

@andybevan-scope3 — thanks for the contribution. Before this PR can be merged, the AgenticAdvertising.Org IPR Policy requires your agreement.

To agree, post a new comment on this PR with the exact phrase:

I have read the IPR Policy

Your signature is recorded once and covers all contributions to AAO repositories. See signatures/README.md for what gets recorded and why.

@andybevan-scope3

Copy link
Copy Markdown
Author

I have read the IPR Policy

@andybevan-scope3 andybevan-scope3 marked this pull request as ready for review June 13, 2026 02:59
@aao-ipr-bot

aao-ipr-bot Bot commented Jun 13, 2026

Copy link
Copy Markdown

IPR Policy — signed

Thanks, @andybevan-scope3. Your agreement to the IPR Policy is recorded at signatures/ipr-signatures.json and applies to all AAO repositories.

@aao-release-bot aao-release-bot Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Argus review could not complete

The automated review encountered an issue (possibly reached max turns, timed out, or failed to post the final gh pr review). A human reviewer should take this PR.

View workflow run

This is an automated message from the Argus AI review workflow.

@andybevan-scope3 andybevan-scope3 marked this pull request as draft June 15, 2026 18:20
@andybevan-scope3 andybevan-scope3 marked this pull request as ready for review June 15, 2026 18:20

@aao-release-bot aao-release-bot Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Argus review could not complete

The automated review encountered an issue (possibly reached max turns, timed out, or failed to post the final gh pr review). A human reviewer should take this PR.

View workflow run

This is an automated message from the Argus AI review workflow.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@aao-release-bot aao-release-bot Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Argus review could not complete

The automated review encountered an issue (possibly reached max turns, timed out, or failed to post the final gh pr review). A human reviewer should take this PR.

View workflow run

This is an automated message from the Argus AI review workflow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant