feat: add bio-bait spam detection with profile bio scanning by rezhajulio · Pull Request #10 · rezhajulio/PythonID-bot

rezhajulio · 2026-05-01T15:59:36Z

Summary

Detects two related spam vectors that have been showing up in the Indonesian Telegram community:

Bait phrases in messages — e.g. cek bio aku, liat byoh, open my bio. Spammers obfuscate bio with misspellings, separators, and Cyrillic look-alikes (b.i.o, b1o, bioohh, Ьіо). The handler normalizes the text (NFKC + lowercase + zero-width strip), canonicalizes obfuscated variants back to bio, then matches a small set of imperative + bio + possessive patterns.
Promo/scam links inside the user's Telegram profile bio — e.g. private t.me/+<invite-hash> invite links combined with promo hint words (VIP, promo, open, ready, …) and/or non-whitelisted @mentions. Some spammers send innocuous group messages while their bio carries the actual links. The user's bio is fetched once per hour via bot.get_chat() and cached in bot_data.

On match the handler deletes the message, restricts the user, clears the cached bio, and posts a notification (separate templates for message-bait vs bio-link cases) to the warning topic, then raises ApplicationHandlerStop.

Detection logic

Message bait

NFKC normalize → lowercase → strip zero-width chars
Canonicalize bio/byo obfuscations to bio (handles Cyrillic look-alikes)
80-char length cap on normalized text (real bait is short)
4 narrow regex patterns gated on imperative cue + bio and/or first-person possessive
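
The message-bait steps above can be sketched roughly as follows. This is a minimal illustration, not the code in bio_bait.py: the look-alike table, the `BIO_VARIANT_RE` regex, and the two patterns are assumptions chosen to cover the examples quoted in this description (the PR itself uses four patterns that are not shown here).

```python
import re
import unicodedata

MAX_BAIT_LEN = 80  # length cap from the PR description

# Zero-width characters commonly used to split keywords.
ZERO_WIDTH_RE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")

# Assumed look-alike table (lowercase Cyrillic -> Latin); the real table is not shown.
# Note: this also rewrites these letters in genuine Cyrillic text.
LOOKALIKES = str.maketrans({"\u044c": "b", "\u0456": "i", "\u043e": "o"})

# Assumed stand-in for the PR's obfuscation regex: b + (i|1|y) + o...,
# allowing up to two separator characters between letters.
BIO_VARIANT_RE = re.compile(r"\bb[\W_]{0,2}[i1y][\W_]{0,2}o+h*\b")

# Illustrative patterns only (imperative + bio, bio + possessive + cue).
BAIT_PATTERNS = [
    re.compile(
        r"^(?:cek|check|liat|lihat|open|buka)\s+(?:my\s+)?bio"
        r"(?:\s+(?:aku|ku|saya|gue|kak|dong|ya))*$"
    ),
    re.compile(r"^bio\s+(?:aku|ku|saya|gue)\s+(?:update|baru)$"),
]


def normalize(text: str) -> str:
    """NFKC -> lowercase -> strip zero-width -> canonicalize variants to 'bio'."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = ZERO_WIDTH_RE.sub("", text)
    text = text.translate(LOOKALIKES)
    text = BIO_VARIANT_RE.sub("bio", text)
    return " ".join(text.split())  # collapse runs of whitespace


def is_bio_bait(text: str) -> bool:
    """Return True when the normalized text is short and matches a bait pattern."""
    norm = normalize(text)
    if not norm or len(norm) > MAX_BAIT_LEN:
        return False
    return any(p.search(norm) for p in BAIT_PATTERNS)
```

Anchoring the patterns to the whole (short) message is what keeps phrases such as "bio aku ada di README" from matching in this sketch; the real patterns may draw that line differently.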

Profile bio scan

Always flag t.me/+... private invite links
Flag non-whitelisted t.me/{username} links (reuses is_url_whitelisted)
Flag 2+ non-whitelisted @username mentions, OR 1 mention combined with a promo hint (vip, bcl, asp, open, ready, …)
Single bare @mention alone is not enough (avoids false positives)
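
A rough sketch of the bio-scan rules listed above, under stated assumptions: `WHITELIST` is a hypothetical stand-in for the is_url_whitelisted check, the regexes are illustrative, and the hint list contains only the words named in this description (the full list is elided in the PR).

```python
import re

# Private invite links such as t.me/+<invite-hash>.
INVITE_RE = re.compile(r"t\.me/\+[\w-]+", re.IGNORECASE)
# Public t.me/{username} links ("+" is not a word character, so invites do not match).
TME_USER_RE = re.compile(r"t\.me/(\w{4,32})", re.IGNORECASE)
# @mentions not preceded by a word character or slash (skips e-mail addresses).
MENTION_RE = re.compile(r"(?<![\w/])@(\w{4,32})")

PROMO_HINTS = frozenset({"vip", "bcl", "asp", "open", "ready"})
WHITELIST = frozenset({"pythonid"})  # hypothetical whitelist entry


def scan_bio(bio: str) -> list[str]:
    """Return the reasons a profile bio looks like spam (empty list if clean)."""
    reasons = []
    if INVITE_RE.search(bio):
        reasons.append("invite_link")
    if any(u.lower() not in WHITELIST for u in TME_USER_RE.findall(bio)):
        reasons.append("tme_username_link")
    mentions = [m for m in MENTION_RE.findall(bio) if m.lower() not in WHITELIST]
    has_hint = bool(set(re.findall(r"\w+", bio.lower())) & PROMO_HINTS)
    # Two or more mentions, or one mention plus a promo hint.
    if len(mentions) >= 2 or (len(mentions) == 1 and has_hint):
        reasons.append("mentions")
    return reasons
```

Returning a list of reasons rather than a boolean makes it easier to log why a bio was flagged; whether the handler does this is not stated in the PR.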

Changes

New handler: src/bot/handlers/bio_bait.py (registered at group=2; contact_spam/new_user_spam/duplicate_spam/message_handler shifted to 3/4/5/6).
New config flag: bio_bait_enabled (Settings + GroupConfig, default True).
New Indonesian templates: BIO_BAIT_SPAM_NOTIFICATION (+ _NO_RESTRICT) and BIO_LINK_SPAM_NOTIFICATION (+ _NO_RESTRICT).
New tests: tests/test_bio_bait.py — 79 tests covering normalization, true positives (cek bio kak, lihat bio dong, bio aku update, Cyrillic/obfuscated forms), false positives (biology, bioinformatics, bio aku ada di README, thank you my bro), bio-link detection, per-user TTL cache, all handler branches (admin/bot/disabled/no-text/delete-fail/restrict-fail/notify-fail).

Verification

uv run pytest → 626 passed (was 547 → +79)
bio_bait.py at 100% coverage
Overall coverage: 99%
ruff check clean

Notes

Off-by-default lewd-keyword filter was not included in this PR (oracle recommended observing real samples first).
Handler ordering ensures bio-bait runs before contact/new-user/duplicate/profile checks; ApplicationHandlerStop short-circuits downstream when a match fires.
Bio fetch errors are swallowed and not cached, so transient API issues don't permanently mask a spam bio.
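
The caching behaviour described in the last note (hourly fetch, errors swallowed and not cached) could look roughly like this. It is a sketch, not the handler's code: `get_cached_bio`, `clear_cached_bio`, and the injected `fetch` callable are hypothetical names, and in the bot the fetch would presumably wrap bot.get_chat() while the cache dict would live in bot_data.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, Optional, Tuple

BIO_TTL_SECONDS = 3600  # "once per hour" from the description

# user_id -> (time fetched, bio text or None when the profile has no bio)
BioCache = Dict[int, Tuple[float, Optional[str]]]


async def get_cached_bio(
    cache: BioCache,
    user_id: int,
    fetch: Callable[[int], Awaitable[Optional[str]]],
    now: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Return the user's bio, fetching at most once per TTL window."""
    entry = cache.get(user_id)
    if entry is not None and now() - entry[0] < BIO_TTL_SECONDS:
        return entry[1]
    try:
        bio = await fetch(user_id)
    except Exception:
        # Swallow the error and do NOT cache it, so the next message retries.
        return None
    cache[user_id] = (now(), bio)
    return bio


def clear_cached_bio(cache: BioCache, user_id: int) -> None:
    """Drop a user's entry, e.g. after the user has been restricted."""
    cache.pop(user_id, None)


if __name__ == "__main__":
    async def fake_fetch(user_id: int) -> Optional[str]:
        return "placeholder bio"  # stands in for a real API call

    print(asyncio.run(get_cached_bio({}, 42, fake_fetch)))
```

Because a failed fetch leaves no entry behind, a transient API error costs one extra call on the user's next message instead of hiding a spam bio for the full hour.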

Detects two related spam vectors common in Indonesian Telegram groups:

1. Bait phrases in messages (e.g. "cek bio aku", "liat byoh", "open my bio"). Spammers obfuscate the word "bio" with misspellings, separators (b.i.o, b1o), and Cyrillic look-alikes (Ьіо). The handler normalizes (NFKC + lowercase + zero-width strip) and canonicalizes obfuscated variants back to "bio" before matching a small set of imperative + bio + possessive patterns.
2. Promo/scam links inside the user's Telegram profile bio. Some spammers send innocuous group messages while their bio carries t.me/+invite links, non-whitelisted t.me/{user} links, or multiple non-whitelisted @mentions (sometimes paired with promo hint words like VIP, BCL, ASP, open). The user's bio is fetched once per hour via bot.get_chat() and cached in bot_data.

On match the handler deletes the message, restricts the user, clears the cached bio, and posts a notification (separate templates for message-bait vs bio-link cases) to the warning topic.

- New handler: src/bot/handlers/bio_bait.py (registered at group=2, shifts contact/new_user/duplicate/message handlers to 3/4/5/6).
- New config: bio_bait_enabled (Settings + GroupConfig, default True).
- New templates: BIO_BAIT_SPAM_NOTIFICATION (+ NO_RESTRICT) and BIO_LINK_SPAM_NOTIFICATION (+ NO_RESTRICT) in constants.py.
- Tests: tests/test_bio_bait.py covers normalization, true positives (incl. Cyrillic / obfuscated forms), false positives (biology, bioinformatics, "bio aku ada di README"), bio-link detection, per-user TTL cache, all handler branches.

626 tests pass, bio_bait.py at 100% coverage, ruff clean.

Replace real-looking Telegram invite hashes and @username from spam examples in code comments and tests with obvious placeholders so the repository does not propagate (or appear to endorse) actual scam links.

…tion

- Remove generic promo hints (open, ready, available) that FP on developer bios
- Use word-boundary regex for promo hint matching (prevents 'vip' inside 'advancement')
- Fix Cyrillic 4-char obfuscation: allow ь as filler between b and i in BIO_OBFUSCATED_RE
- Add tests: word boundary, generic words removed, strong hints still work, Cyrillic filler
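
To illustrate the word-boundary change in this commit, here is a small comparison of substring matching against a `\b`-anchored regex. The hint tuple is an assumption: it holds only the hints named in this PR after the generic ones were removed, and the real list may be longer.

```python
import re

# Assumed remaining "strong" hints; the full list is not shown in the PR.
STRONG_HINTS = ("vip", "bcl", "asp")

# One alternation wrapped in word boundaries, so hints only match as whole words.
HINT_RE = re.compile(
    r"\b(?:" + "|".join(map(re.escape, STRONG_HINTS)) + r")\b",
    re.IGNORECASE,
)


def has_promo_hint(bio: str) -> bool:
    """Word-boundary match: 'VIP member' hits, 'viper' does not."""
    return HINT_RE.search(bio) is not None


def has_promo_hint_substring(bio: str) -> bool:
    """Naive substring match, shown only for contrast."""
    lowered = bio.lower()
    return any(hint in lowered for hint in STRONG_HINTS)
```

With these definitions a bio such as "Raspberry Pi and viper fan" trips the substring check (it contains "asp" and "vip") but not the word-boundary check, which is the class of false positive the commit is aiming at.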

…warning-topic guard, narrowed filter

Changes:
- Use is_user_admin_or_trusted() for admin/trusted bypass (consistent with other spam handlers)
- Move owner alert block inside monitor_only check (enforcement mode no longer sends alerts)
- Add warning-topic guard: skip alert when alert_chat_id matches monitored group, log, increment owner_alert_skipped_warning_topic metric
- Narrow main.py handler filter to TEXT|CAPTION to reduce unnecessary handler invocations
- Add 5 tests covering: trusted bypass, enforcement no-alert, same-group skip, enforcement metrics (message_bait + bio_links)
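
The alert gating described in this commit can be read as a small decision function. The sketch below is one interpretation of the bullet points, not the handler's code: `should_send_owner_alert` is a hypothetical name, and only the metric key owner_alert_skipped_warning_topic comes from the commit message.

```python
from collections import Counter


def should_send_owner_alert(
    monitor_only: bool,
    alert_chat_id: int,
    group_id: int,
    metrics: Counter,
) -> bool:
    """Decide whether an owner alert should be sent for a detection."""
    if not monitor_only:
        # Enforcement mode acts on the message itself and sends no owner alert.
        return False
    if alert_chat_id == group_id:
        # Warning-topic guard: do not post the alert into the monitored group.
        metrics["owner_alert_skipped_warning_topic"] += 1
        return False
    return True
```

Keeping the decision in a pure function like this would make the three cases (enforcement, same-group skip, normal alert) straightforward to test in isolation.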

Non-text messages (photo without caption, sticker, etc.) were blocked by the bio-bait handler filter requiring TEXT or CAPTION. This meant users posting media with no text could bypass bio-link detection.

Changes:
- Add BIO_BAIT_FILTER constant to bio_bait.py using only GROUPS & ~COMMAND
- Register handler with BIO_BAIT_FILTER in main.py instead of inline filter
- Add TestBioBaitRegistrationFilter regression tests:
  - Non-text group message passes filter
  - Text group message still passes filter
  - Command messages still excluded

Duplicate_spam and message handler filters left unchanged.

…list

- Move is_url_whitelisted to services/telegram_utils.py (eliminates cross-handler dependency)
- Add hard cache cap (2000) with LRU eviction for bio cache
- Tighten pattern 1 to require end-of-string anchor (prevents false positives like 'open source bio library')
- Count @mentions by occurrence instead of unique set (catches repeated same-mention spam)
- Convert all logging to f-strings in bio_bait.py
- Add tests for cache eviction, duplicate mentions, and benign phrases
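
A hard cap with LRU eviction, as mentioned in this commit, is commonly built on `collections.OrderedDict`. The sketch below assumes that approach; the PR does not say how the cap is implemented, and `cache_put` / `cache_get` are hypothetical helper names.

```python
from collections import OrderedDict
from typing import Any, Optional

MAX_BIO_CACHE = 2000  # cap taken from the commit message


def cache_put(cache: OrderedDict, user_id: int, value: Any,
              max_size: int = MAX_BIO_CACHE) -> None:
    """Insert or refresh an entry, evicting least recently used entries over the cap."""
    cache[user_id] = value
    cache.move_to_end(user_id)  # mark as most recently used
    while len(cache) > max_size:
        cache.popitem(last=False)  # drop the least recently used entry


def cache_get(cache: OrderedDict, user_id: int) -> Optional[Any]:
    """Return a cached value and mark it as recently used, or None if absent."""
    if user_id not in cache:
        return None
    cache.move_to_end(user_id)
    return cache[user_id]
```

For example, with `max_size=2`, inserting users 1 and 2, reading user 1, and then inserting user 3 evicts user 2, because user 1 was touched more recently.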

rezhajulio mentioned this pull request May 1, 2026

chore: sanitize spam example references #11

Closed

rezhajulio and others added 4 commits May 18, 2026 22:09

chore: sanitize spam example references

b764bed

feat: add bio-bait monitor-only mode with owner alerts and metrics

811c4ce

rezhajulio force-pushed the feat/bio-bait-spam-detection branch from 6b56c8b to 811c4ce on May 18, 2026 15:09

rezhajulio added 3 commits May 18, 2026 22:56

rezhajulio wants to merge 7 commits into main from feat/bio-bait-spam-detection
