
feat: add bio-bait spam detection with profile bio scanning #10

Open
rezhajulio wants to merge 7 commits into main from feat/bio-bait-spam-detection

rezhajulio (Owner) commented May 1, 2026

Summary

Detects two related spam vectors that have been showing up in the Indonesian Telegram community:

  1. Bait phrases in messages — e.g. cek bio aku, liat byoh, open my bio. Spammers obfuscate bio with misspellings, separators, and Cyrillic look-alikes (b.i.o, b1o, bioohh, Ьіо). The handler normalizes the text (NFKC + lowercase + zero-width strip), canonicalizes obfuscated variants back to bio, then matches a small set of imperative + bio + possessive patterns.

  2. Promo/scam links inside the user's Telegram profile bio — e.g. private t.me/+<invite-hash> invite links combined with promo hint words (VIP, promo, open, ready, …) and/or non-whitelisted @mentions. Some spammers send innocuous group messages while their bio carries the actual links. The user's bio is fetched once per hour via bot.get_chat() and cached in bot_data.

On match the handler deletes the message, restricts the user, clears the cached bio, and posts a notification (separate templates for message-bait vs bio-link cases) to the warning topic, then raises ApplicationHandlerStop.

Detection logic

Message bait

  • NFKC normalize → lowercase → strip zero-width chars
  • Canonicalize bio/byo obfuscations to bio (handles Cyrillic look-alikes)
  • 80-char length cap on normalized text (real bait is short)
  • 4 narrow regex patterns gated on imperative cue + bio and/or first-person possessive
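The message-bait steps above can be sketched roughly as follows. This is a minimal illustration, not the code in `bio_bait.py`: the look-alike map, the separator set, the cue word lists, and the single pattern shown here are all assumptions (the PR uses 4 patterns whose exact form is not shown in this description). Only the name `BIO_OBFUSCATED_RE` and the 80-char cap come from the PR text.

```python
import re
import unicodedata

# Zero-width characters stripped before matching (illustrative set, assumed).
ZERO_WIDTH_RE = re.compile(r"[\u200b\u200c\u200d\u2060\ufeff]")

# Assumed look-alike map: Cyrillic soft sign, i and o mapped to Latin b, i, o.
# A global translate like this would also mangle genuine Cyrillic text.
LOOKALIKES = str.maketrans({"\u044c": "b", "\u0456": "i", "\u043e": "o"})

# Assumed obfuscation pattern: b, an i-like char, one or more o-like chars,
# optional separators in between, optional trailing "h" filler.
BIO_OBFUSCATED_RE = re.compile(r"\bb[._\-*]*[i1y][._\-*]*[o0]+h*\b")

MAX_BAIT_LENGTH = 80  # length cap on the normalized text, per the PR

# Hypothetical cue lists; the real handler's word lists are not shown here.
IMPERATIVE = r"(?:cek|check|liat|lihat|open|buka)"
POSSESSIVE = r"(?:aku|ku|saya|gue|gw|my)"

# One illustrative pattern: imperative cue, optional possessive, "bio",
# at most one trailing word, anchored at end of string.
BAIT_PATTERNS = [
    re.compile(rf"\b{IMPERATIVE}\s+(?:{POSSESSIVE}\s+)?bio(?:\s+\w+)?$"),
]


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, strip zero-width chars, canonicalize 'bio'."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = ZERO_WIDTH_RE.sub("", text)
    text = text.translate(LOOKALIKES)
    return BIO_OBFUSCATED_RE.sub("bio", text).strip()


def is_message_bait(text: str) -> bool:
    """Return True if the normalized text is short and matches a bait pattern."""
    normalized = normalize(text)
    if len(normalized) > MAX_BAIT_LENGTH:
        return False
    return any(p.search(normalized) for p in BAIT_PATTERNS)
```

Canonicalizing before matching keeps the pattern list small: each pattern only has to know the literal word `bio`, and new obfuscations are handled in one place.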

Profile bio scan

  • Always flag t.me/+... private invite links
  • Flag non-whitelisted t.me/{username} links (reuses is_url_whitelisted)
  • Flag 2+ non-whitelisted @username mentions, OR 1 mention combined with a promo hint (vip, bcl, asp, open, ready, …)
  • Single bare @mention alone is not enough (avoids false positives)
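A rough sketch of the profile-bio rules listed above, under stated assumptions: the regexes, the `WHITELIST` set, and the helper `is_whitelisted_username` are hypothetical stand-ins (the PR reuses `is_url_whitelisted`, whose signature is not shown here), and the hint list is taken from the examples in this description. Mentions are counted by occurrence, which is what a later commit in this PR describes.

```python
import re

# Private invite links look like t.me/+<invite-hash>.
PRIVATE_INVITE_RE = re.compile(r"t\.me/\+[\w-]+", re.IGNORECASE)
# Public links look like t.me/<username>; the length bounds are an assumption.
PUBLIC_LINK_RE = re.compile(r"t\.me/([a-z]\w{3,31})", re.IGNORECASE)
# @mentions, skipping e-mail addresses and path segments via the lookbehind.
MENTION_RE = re.compile(r"(?<![\w/])@([a-z]\w{3,31})", re.IGNORECASE)
# Promo hints matched on word boundaries (hint list assumed from the PR text).
PROMO_HINT_RE = re.compile(r"\b(?:vip|bcl|asp|promo)\b", re.IGNORECASE)

# Hypothetical whitelist; the real one lives elsewhere in the bot.
WHITELIST = frozenset({"examplegroup"})


def is_whitelisted_username(username: str) -> bool:
    """Stand-in for the bot's whitelist check (hypothetical helper)."""
    return username.lower() in WHITELIST


def bio_has_spam_links(bio: str) -> bool:
    """Apply the four bio-scan rules in order of strength."""
    # Rule 1: private invite links are always flagged.
    if PRIVATE_INVITE_RE.search(bio):
        return True
    # Rule 2: any non-whitelisted public t.me/<username> link.
    if any(not is_whitelisted_username(u) for u in PUBLIC_LINK_RE.findall(bio)):
        return True
    # Rules 3 and 4: count non-whitelisted mentions by occurrence.
    mentions = [m for m in MENTION_RE.findall(bio) if not is_whitelisted_username(m)]
    if len(mentions) >= 2:
        return True
    # A single mention is only flagged when a promo hint is also present.
    return len(mentions) == 1 and PROMO_HINT_RE.search(bio) is not None
```

For example, `bio_has_spam_links("Python dev, ping @somedev")` is not flagged, while the same mention next to `VIP` is.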

Changes

  • New handler: src/bot/handlers/bio_bait.py (registered at group=2; contact_spam/new_user_spam/duplicate_spam/message_handler shifted to 3/4/5/6).
  • New config flag: bio_bait_enabled (Settings + GroupConfig, default True).
  • New Indonesian templates: BIO_BAIT_SPAM_NOTIFICATION (+ _NO_RESTRICT) and BIO_LINK_SPAM_NOTIFICATION (+ _NO_RESTRICT).
  • New tests: tests/test_bio_bait.py — 79 tests covering normalization, true positives (cek bio kak, lihat bio dong, bio aku update, Cyrillic/obfuscated forms), false positives (biology, bioinformatics, bio aku ada di README, thank you my bro), bio-link detection, per-user TTL cache, all handler branches (admin/bot/disabled/no-text/delete-fail/restrict-fail/notify-fail).
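To illustrate why the group numbers matter, here is a toy model of group-ordered dispatch. It is not python-telegram-bot's implementation: `HandlerStop` is a stand-in for `ApplicationHandlerStop`, `dispatch` is a hypothetical helper, and each group holds exactly one handler for simplicity. The group numbers and handler names mirror the list above.

```python
from typing import Callable, Dict, List, Tuple


class HandlerStop(Exception):
    """Stand-in for ApplicationHandlerStop in this toy model."""


Handler = Tuple[str, Callable[[str], None]]


def dispatch(handlers_by_group: Dict[int, Handler], message: str) -> List[str]:
    """Run handlers in ascending group order; stop when one raises HandlerStop."""
    ran: List[str] = []
    for group in sorted(handlers_by_group):
        name, callback = handlers_by_group[group]
        ran.append(name)
        try:
            callback(message)
        except HandlerStop:
            break  # later groups never see the message
    return ran


def bio_bait(message: str) -> None:
    # Toy detection rule; the real handler uses the normalization pipeline.
    if "cek bio" in message.lower():
        raise HandlerStop


def passthrough(message: str) -> None:
    return None


HANDLERS: Dict[int, Handler] = {
    2: ("bio_bait", bio_bait),
    3: ("contact_spam", passthrough),
    4: ("new_user_spam", passthrough),
    5: ("duplicate_spam", passthrough),
    6: ("message_handler", passthrough),
}
```

In this model, `dispatch(HANDLERS, "cek bio aku")` runs only `bio_bait`, while a benign message falls through all five groups.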

Verification

  • uv run pytest: 626 passed (was 547 → +79)
  • bio_bait.py at 100% coverage
  • Overall coverage: 99%
  • ruff check clean

Notes

  • Off-by-default lewd-keyword filter was not included in this PR (oracle recommended observing real samples first).
  • Handler ordering ensures bio-bait runs before contact/new-user/duplicate/profile checks; ApplicationHandlerStop short-circuits downstream when a match fires.
  • Bio fetch errors are swallowed and not cached, so transient API issues don't permanently mask a spam bio.
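The hourly bio cache and the "errors are not cached" note can be sketched like this. It is a simplified, synchronous illustration: the real handler awaits `bot.get_chat()` and stores entries in `bot_data`, whereas here the fetch is an injected callable and `get_cached_bio`, `BIO_CACHE_TTL_SECONDS`, and the cache layout are hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Tuple

BIO_CACHE_TTL_SECONDS = 3600  # "fetched once per hour", per the PR text

# user_id -> (fetch timestamp, bio text or None if the user has no bio)
BioCache = Dict[int, Tuple[float, Optional[str]]]


def get_cached_bio(
    cache: BioCache,
    user_id: int,
    fetch: Callable[[int], Optional[str]],
    now: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Return the user's bio, refetching only when the cached entry expired."""
    current = now()
    entry = cache.get(user_id)
    if entry is not None and current - entry[0] < BIO_CACHE_TTL_SECONDS:
        return entry[1]
    try:
        bio = fetch(user_id)
    except Exception:
        # Swallow the error and do NOT cache it, so the next message retries.
        return None
    cache[user_id] = (current, bio)
    return bio
```

Caching a failure as "no bio" would hide a spam bio for a full hour after one transient API error, which is the situation the note above avoids.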

rezhajulio and others added 4 commits May 18, 2026 22:09
Detects two related spam vectors common in Indonesian Telegram groups:

1. Bait phrases in messages (e.g. "cek bio aku", "liat byoh",
   "open my bio"). Spammers obfuscate the word "bio" with
   misspellings, separators (b.i.o, b1o), and Cyrillic look-alikes
   (Ьіо). The handler normalizes (NFKC + lowercase + zero-width strip)
   and canonicalizes obfuscated variants back to "bio" before matching
   a small set of imperative + bio + possessive patterns.

2. Promo/scam links inside the user's Telegram profile bio. Some
   spammers send innocuous group messages while their bio carries
   t.me/+invite links, non-whitelisted t.me/{user} links, or multiple
   non-whitelisted @mentions (sometimes paired with promo hint words
   like VIP, BCL, ASP, open). The user's bio is fetched once per hour
   via bot.get_chat() and cached in bot_data.

On match the handler deletes the message, restricts the user, clears
the cached bio, and posts a notification (separate templates for
message-bait vs bio-link cases) to the warning topic.

- New handler: src/bot/handlers/bio_bait.py (registered at group=2,
  shifts contact/new_user/duplicate/message handlers to 3/4/5/6).
- New config: bio_bait_enabled (Settings + GroupConfig, default True).
- New templates: BIO_BAIT_SPAM_NOTIFICATION (+ NO_RESTRICT) and
  BIO_LINK_SPAM_NOTIFICATION (+ NO_RESTRICT) in constants.py.
- Tests: tests/test_bio_bait.py covers normalization, true positives
  (incl. Cyrillic / obfuscated forms), false positives (biology,
  bioinformatics, "bio aku ada di README"), bio-link detection,
  per-user TTL cache, all handler branches.

626 tests pass, bio_bait.py at 100% coverage, ruff clean.

Replace real-looking Telegram invite hashes and @username from spam
examples in code comments and tests with obvious placeholders so the
repository does not propagate (or appear to endorse) actual scam links.

…tion

- Remove generic promo hints (open, ready, available) that FP on developer bios
- Use word-boundary regex for promo hint matching (prevents 'vip' inside 'advancement')
- Fix Cyrillic 4-char obfuscation: allow ь as filler between b and i in BIO_OBFUSCATED_RE
- Add tests: word boundary, generic words removed, strong hints still work, Cyrillic filler
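The difference between substring and word-boundary hint matching can be shown with a small sketch. The hint tuple is an assumption based on the examples in the PR description, and both function names are hypothetical.

```python
import re

# Assumed hint list after the generic words (open, ready, available) were removed.
PROMO_HINTS = ("vip", "bcl", "asp")

# Word-boundary regex: a hint must appear as a whole word.
PROMO_HINT_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(h) for h in PROMO_HINTS) + r")\b",
    re.IGNORECASE,
)


def has_promo_hint_substring(bio: str) -> bool:
    """Naive check: fires when a hint appears anywhere, even inside a word."""
    lowered = bio.lower()
    return any(hint in lowered for hint in PROMO_HINTS)


def has_promo_hint(bio: str) -> bool:
    """Word-boundary check: fires only on standalone hint words."""
    return PROMO_HINT_RE.search(bio) is not None
```

With these definitions, a bio such as "Interested in every aspect of Rust" trips the substring check (via `asp`) but not the word-boundary check, while "VIP access here" trips both.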
rezhajulio force-pushed the feat/bio-bait-spam-detection branch from 6b56c8b to 811c4ce May 18, 2026 15:09
…warning-topic guard, narrowed filter

Changes:
- Use is_user_admin_or_trusted() for admin/trusted bypass (consistent with other spam handlers)
- Move owner alert block inside monitor_only check (enforcement mode no longer sends alerts)
- Add warning-topic guard: skip alert when alert_chat_id matches monitored group, log, increment owner_alert_skipped_warning_topic metric
- Narrow main.py handler filter to TEXT|CAPTION to reduce unnecessary handler invocations
- Add 5 tests covering: trusted bypass, enforcement no-alert, same-group skip, enforcement metrics (message_bait + bio_links)
Non-text messages (photo without caption, sticker, etc.) never reached
the bio-bait handler because its filter required TEXT or CAPTION. This
meant users posting media with no text could bypass bio-link detection.

Changes:
- Add BIO_BAIT_FILTER constant to bio_bait.py using only GROUPS & ~COMMAND
- Register handler with BIO_BAIT_FILTER in main.py instead of inline filter
- Add TestBioBaitRegistrationFilter regression tests:
  - Non-text group message passes filter
  - Text group message still passes filter
  - Command messages still excluded

Duplicate_spam and message handler filters left unchanged.
…list

- Move is_url_whitelisted to services/telegram_utils.py (eliminates cross-handler dependency)
- Add hard cache cap (2000) with LRU eviction for bio cache
- Tighten pattern 1 to require end-of-string anchor (prevents false positives like 'open source bio library')
- Count @mentions by occurrence instead of unique set (catches repeated same-mention spam)
- Convert all logging to f-strings in bio_bait.py
- Add tests for cache eviction, duplicate mentions, and benign phrases
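A hard cap with LRU eviction can be sketched with `collections.OrderedDict`. This is an illustration only: the class name `BioCacheLRU` and its methods are hypothetical, and the PR's cache also tracks a per-user TTL that is omitted here. The cap of 2000 comes from the commit message.

```python
from collections import OrderedDict
from typing import Optional

MAX_BIO_CACHE_ENTRIES = 2000  # hard cap from the commit message


class BioCacheLRU:
    """Bounded cache that evicts the least recently used user first."""

    def __init__(self, max_entries: int = MAX_BIO_CACHE_ENTRIES) -> None:
        self._entries: "OrderedDict[int, Optional[str]]" = OrderedDict()
        self._max_entries = max_entries

    def get(self, user_id: int) -> Optional[str]:
        if user_id not in self._entries:
            return None
        # Reading an entry marks it as most recently used.
        self._entries.move_to_end(user_id)
        return self._entries[user_id]

    def put(self, user_id: int, bio: Optional[str]) -> None:
        self._entries[user_id] = bio
        self._entries.move_to_end(user_id)
        # Evict from the oldest end until the cache is back under the cap.
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)

    def __contains__(self, user_id: int) -> bool:
        return user_id in self._entries

    def __len__(self) -> int:
        return len(self._entries)
```

The cap bounds memory in large groups regardless of how many distinct users post within one TTL window.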