PRO100CHOK/website-email-phone-extractor-python

Website Email & Phone Extractor — Python Example

Crawl any list of websites and pull every email address, phone number, LinkedIn, Twitter, Facebook, Instagram, YouTube, TikTok, and Telegram handle in Python — for lead generation, recruiting, journalism, or technical SEO contact discovery.

Apify Actor | Python 3.10+ | License: MIT

This Python project drives the Website Contact Scraper Apify actor — a focused crawler that visits a website's homepage plus its contact / about / team / press / careers pages (configurable depth and breadth), then pulls every email, phone, and social link the site reveals. Up to 100 starting domains per run, $1 per 1,000 results, no rate limit on your end.

What this does

Building an SDR / outreach pipeline always hits the same wall: you have a list of prospect domains, but their contact pages bury emails behind images, JavaScript, or info[at]example.com obfuscation. Generic regex scrapers miss the obfuscated ones; commercial enrichment APIs (Apollo, Hunter, Lusha) start at $99+/month per seat and gate the data behind a credits system. This actor instead just does the boring crawling work — visits a few key pages per domain, runs robust extraction regexes against deobfuscated HTML, and returns clean lists per domain.

Use cases

  • B2B outbound prospecting — convert a list of company domains into emails + phones for SDR outreach.
  • Recruiting & headhunting — pull contact info from career pages of target hire companies.
  • Journalism & PR — find press@ / media@ contacts for outlets you want to pitch.
  • Technical SEO outreach — broken-link building or guest-post pitching requires a real contact, not a contact form.
  • Influencer outreach — extract emails from creator websites linked in social media bios.
  • Sales territory enrichment — feed your CRM with phone numbers for cold-call follow-up.
  • Compliance & due diligence — verify that a domain has working contact info before signing a partnership.

Requirements

  • Python 3.10+
  • A free Apify account
  • No third-party enrichment account needed

Quick start

git clone https://github.com/pro100chok/website-email-phone-extractor-python.git
cd website-email-phone-extractor-python
pip install -r requirements.txt
cp .env.example .env
# paste your APIFY_API_TOKEN
python main.py

main.py enriches ten developer-tools websites (Linear, Airtable, Retool, Hex, Mintlify, Vercel, Fly.io, Supabase, PlanetScale, Dagster), picks the best generic email per domain (hello@/contact@/sales@ first), and saves to JSON + CSV.

How it works

For each startUrls entry the crawler:

  1. Visits the URL and parses links matching contact / about / team / press / careers / support routes (multiple language variants).
  2. Crawls those pages depth-first, up to maxDepth levels and maxPagesPerDomain pages total per domain.
  3. Extracts emails from mailto: links, plain text (deobfuscating [at], (at), " at ", and HTML-entity-encoded @), and structured data (Person/Organization JSON-LD blocks).
  4. Extracts phone numbers from tel: links and free text using locale-aware regex (US, EU, intl formats).
  5. Detects social handles for LinkedIn, X / Twitter, Facebook, Instagram, YouTube, TikTok, Telegram.
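To make step 3 concrete, here is a minimal sketch of email deobfuscation using only the standard library. It is an illustration, not the actor's actual implementation: the regexes, the `extract_emails` name, and the `[dot]` handling are assumptions. A bare " at " is deliberately left out here because it produces false positives in ordinary sentences ("meet us at example.com").

```python
import html
import re

# Assumed obfuscation patterns; the actor's real patterns are not published here.
AT_PATTERN = re.compile(r"\s*(?:\[at\]|\(at\))\s*", re.IGNORECASE)
DOT_PATTERN = re.compile(r"\s*(?:\[dot\]|\(dot\))\s*", re.IGNORECASE)
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return unique, lower-cased emails found in text, in order of appearance."""
    # Decode HTML entities such as &#64; before matching.
    decoded = html.unescape(text)
    decoded = AT_PATTERN.sub("@", decoded)
    decoded = DOT_PATTERN.sub(".", decoded)
    unique: dict[str, None] = {}
    for match in EMAIL_PATTERN.findall(decoded):
        unique.setdefault(match.lower(), None)
    return list(unique)
```

A call such as `extract_emails("Write to info[at]example.com")` yields `["info@example.com"]`.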

The output is one dataset record per starting domain. Multiple emails / phones / socials are returned as arrays so you can pick the one that matches your outreach style.

Example: prospect list enrichment

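import os

from apify_client import ApifyClient

# Read the API token from the APIFY_API_TOKEN environment variable.
client = ApifyClient(os.environ["APIFY_API_TOKEN"])

# Start the actor and wait for the run to finish.
run = client.actor("pro100chok/extract-emails").call(run_input={
    "startUrls": [
        {"url": "https://www.linear.app"},
        {"url": "https://www.airtable.com"},
    ],
    "maxDepth": 2,
    "maxPagesPerDomain": 15,
})
if run is None:
    raise RuntimeError("Actor run did not return a result")

# One dataset item per starting domain; print up to 3 emails and 1 phone each.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    emails = item.get("emails", [])[:3]
    phones = item.get("phones", [])[:1]
    print(f"{item.get('domain')}: {emails}  {phones}")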

Example output

{
  "url": "https://www.linear.app",
  "domain": "linear.app",
  "emails": ["hello@linear.app", "press@linear.app", "careers@linear.app"],
  "phones": [],
  "socials": {
    "twitter": "https://twitter.com/linear",
    "linkedin": "https://www.linkedin.com/company/linear-app/",
    "youtube": "https://www.youtube.com/@linear"
  },
  "pagesCrawled": 12
}
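Because emails, phones, and socials come back as arrays and nested objects, a record like the one above needs flattening before it fits a CSV row. The sketch below shows one way to do that; the column choice, the "; " separator, and the function names are assumptions, not the layout main.py writes.

```python
import csv
import io

# Assumed column layout for illustration only.
FIELDNAMES = ["domain", "emails", "phones", "twitter", "linkedin", "pagesCrawled"]


def record_to_row(record: dict) -> dict:
    """Flatten one dataset record into a dict of scalar CSV cells."""
    socials = record.get("socials") or {}
    return {
        "domain": record.get("domain", ""),
        "emails": "; ".join(record.get("emails") or []),
        "phones": "; ".join(record.get("phones") or []),
        "twitter": socials.get("twitter", ""),
        "linkedin": socials.get("linkedin", ""),
        "pagesCrawled": record.get("pagesCrawled", 0),
    }


def records_to_csv(records: list[dict]) -> str:
    """Render records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES, lineterminator="\n")
    writer.writeheader()
    for record in records:
        writer.writerow(record_to_row(record))
    return buffer.getvalue()
```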

Input parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| startUrls | object[] | yes | Up to 100 starting URLs. Accepts full URLs, bare domains, with or without www/protocol. |
| maxDepth | integer | no | Crawl depth from each start page. 0 = start page only. Default 2. |
| maxPagesPerDomain | integer | no | Max pages crawled per domain. Default 10. |
| concurrency | integer | no | Number of domains processed in parallel. Default 10. |
| useProxy | boolean | no | Enable proxy. Off by default; most public sites don't need it. |
| proxyConfiguration | object | no | Used only when useProxy=true. Supports Apify Proxy or custom URLs. |
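Since startUrls is capped at 100 entries per run, a longer prospect list has to be split across several runs. The following sketch builds one run_input per batch using the defaults from the table; `build_run_inputs` is a hypothetical helper, not part of this repository or of apify_client.

```python
from collections.abc import Iterable, Iterator

MAX_START_URLS = 100  # per-run limit stated in the parameter table


def build_run_inputs(
    domains: Iterable[str],
    max_depth: int = 2,
    max_pages_per_domain: int = 10,
) -> Iterator[dict]:
    """Yield one run_input dict per batch of at most 100 unique domains."""
    # Drop blanks and duplicates while preserving the original order.
    unique = list(dict.fromkeys(d.strip() for d in domains if d.strip()))
    for start in range(0, len(unique), MAX_START_URLS):
        batch = unique[start:start + MAX_START_URLS]
        yield {
            "startUrls": [{"url": domain} for domain in batch],
            "maxDepth": max_depth,
            "maxPagesPerDomain": max_pages_per_domain,
        }
```

Each yielded dict can be passed as `run_input` to `client.actor(...).call(...)` in the same way as the enrichment example earlier in this README.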

More examples

| File | Demonstrates |
| --- | --- |
| examples/01_basic_usage.py | Single-website crawl. |
| examples/02_from_search_results.py | Multi-domain bulk crawl. |
| examples/03_role_email_filter.py | Filter to role inboxes (sales@, support@, etc.). |
| examples/04_export_to_csv.py | Pandas export + best-contact-per-domain logic. |
| examples/05_export_to_google_sheets.py | Append fresh enrichment to a shared Sheet. |

FAQ

How much does it cost? $1 per 1,000 domains crawled (each domain = one result item). Apify's free $5/month credit covers about 5,000 enrichments before you pay anything.
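The pricing arithmetic can be checked with a few lines of Python. This sketch assumes the full free credit goes to this actor and that no other platform charges apply; the function names are illustrative.

```python
PRICE_PER_1000_RESULTS = 1.00  # USD, from the pricing above
FREE_MONTHLY_CREDIT = 5.00     # USD, Apify free-plan credit quoted above


def estimate_cost(domains: int) -> float:
    """Gross cost in USD for crawling the given number of domains."""
    return domains * PRICE_PER_1000_RESULTS / 1000


def billable_after_free_credit(domains: int) -> float:
    """Cost in USD left over once the free monthly credit is used up."""
    return max(0.0, estimate_cost(domains) - FREE_MONTHLY_CREDIT)
```

Under these assumptions, 5,000 domains cost $5 gross and nothing after the credit, while 12,000 domains leave $7 to pay.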

Will it find every email on the site? It finds emails that appear in the HTML of the homepage and a small set of high-value subpages: plain addresses, obfuscated forms ([at], (at), HTML entities), mailto: links, and JSON-LD schemas. Emails buried in images, PDFs, or third-party form widgets won't be extracted; those usually require either OCR or an expensive headless-browser pass.

What's maxPagesPerDomain for? On most sites, all useful contact info is on the homepage or one click away (contact / about / team / press). Setting maxPagesPerDomain: 15 lets you reach those routes without blowing through credits crawling every blog post.

Is this GDPR-compliant? The actor extracts publicly-published contact information from public websites. Whether it's legal to use those contacts for cold outreach depends on your jurisdiction (B2B is generally fine under GDPR's legitimate-interest basis if you offer opt-out; consumer/personal emails are stricter). Consult your legal team before bulk outreach in EU markets.

Can I run this with proxies? Yes. Set useProxy: true and pass proxyConfiguration. Most public marketing sites don't need proxies, but Cloudflare-protected sites occasionally do.
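A proxied run could be configured along these lines. The `useProxy` and `proxyConfiguration` keys come from the parameter table; the inner `{"useApifyProxy": True}` object is an assumption based on Apify's standard proxy input shape, so check the actor's input schema before relying on it. `with_apify_proxy` is a hypothetical helper.

```python
def with_apify_proxy(run_input: dict) -> dict:
    """Return a copy of run_input with proxying switched on."""
    return {
        **run_input,
        "useProxy": True,
        # Assumed shape: Apify's standard proxy input object.
        "proxyConfiguration": {"useApifyProxy": True},
    }
```

The original dict is left unmodified, so the same base input can be reused for a non-proxied retry.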

Can I extract emails from a single page only? Yes. Set maxDepth: 0 — the actor will only crawl the URL you give it, no link discovery.

Does it work with subdomains? Yes. Pass https://subdomain.example.com and the crawler will crawl that subdomain (not the parent domain) up to maxPagesPerDomain.

What social platforms does it detect? LinkedIn, X / Twitter, Facebook, Instagram, YouTube, TikTok, Telegram, GitHub, Discord, Crunchbase, AngelList — anywhere a recognizable canonical URL appears on the page.
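Detection by "recognizable canonical URL" can be approximated with one regex per platform. The sketch below covers only three platforms, and its patterns and the `detect_socials` name are assumptions rather than the actor's real rules.

```python
import re

# Illustrative patterns only; share/intent links would also match the twitter rule.
SOCIAL_PATTERNS = {
    "linkedin": re.compile(
        r"https?://(?:[a-z]{2,3}\.)?linkedin\.com/(?:company|in)/[\w\-%.]+/?", re.I
    ),
    "twitter": re.compile(r"https?://(?:www\.)?(?:twitter|x)\.com/\w+", re.I),
    "youtube": re.compile(
        r"https?://(?:www\.)?youtube\.com/(?:@[\w.\-]+|channel/[\w\-]+|c/[\w\-]+)", re.I
    ),
}


def detect_socials(page_html: str) -> dict[str, str]:
    """Return the first matching profile URL per platform found in page_html."""
    found: dict[str, str] = {}
    for platform, pattern in SOCIAL_PATTERNS.items():
        match = pattern.search(page_html)
        if match:
            found[platform] = match.group(0)
    return found
```

Platforms with no match are simply absent from the result, which mirrors the `socials` object in the example output where only detected platforms appear.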

Can I verify the emails I get? Yes. Pipe the output through the Email Verifier actor to check deliverability and risk scores before you send your outreach.

Related actors

See all my actors at apify.com/pro100chok.

License

MIT — see LICENSE.


Built on top of the Website Contact Scraper Apify actor.