A Node.js and TypeScript web scraping pipeline for discovering companies, enriching company profiles, and extracting software engineering job opportunities from company websites.
Node Scraper combines Google Search via SERP API, AI-assisted enrichment, OpenAI-based normalization, and ZenRows-powered scraping to turn unstructured web data into clean JSON datasets for company intelligence and engineering job discovery.
- Automated company name generation
- LinkedIn company profile discovery through Google Search operators
- Company enrichment using SERP API AI-powered search
- OpenAI-powered data normalization and description generation
- Website crawling to discover careers and jobs pages
- Software engineering job extraction from company websites and ATS pages
- Job normalization into structured fields:
  - Description
  - Requirements
  - Benefits
  - Location
  - Employment type
  - Workplace type
  - Seniority
  - Salary range
- Job classification into:
  - Frontend
  - Backend
  - Full Stack
  - DevOps
- Scraping support through ZenRows:
  - Proxy rotation
  - CAPTCHA handling
  - Anti-bot protection
  - Header management
- Local JSON-based datasets for companies, jobs, errors, and scraping state
Company Generator
|
v
Google Search via SERP API
|
v
LinkedIn Company URL Discovery
|
v
SERP API AI Enrichment
|
v
OpenAI Normalization
|
v
Company Dataset
|
v
Website / Careers Page Scraping
|
v
Software Job Extraction
|
v
OpenAI Job Structuring
|
v
Job Classification
|
v
Structured JSON Output
The pipeline is split into two major stages:
- Company discovery and enrichment
- Careers page discovery, job scraping, and job normalization
Data is persisted in JSON files under src/companies/data and src/jobs/data.
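As an illustration of this JSON-file persistence, the sketch below reads a JSON array from disk and appends new records to it. The helper names `readJsonArray` and `appendJsonRecords` are hypothetical and not taken from the repository; the sketch also assumes each dataset file holds a single top-level array.

```typescript
import { existsSync, mkdirSync, readFileSync, writeFileSync } from "node:fs";
import { dirname } from "node:path";

// Hypothetical helper: returns the array stored in a JSON file, or [] if the file does not exist.
function readJsonArray<T>(filePath: string): T[] {
  if (!existsSync(filePath)) return [];
  const parsed: unknown = JSON.parse(readFileSync(filePath, "utf8"));
  if (!Array.isArray(parsed)) {
    throw new Error(`${filePath} does not contain a JSON array`);
  }
  return parsed as T[];
}

// Hypothetical helper: appends records to the file and returns the new total count.
function appendJsonRecords<T>(filePath: string, records: T[]): number {
  const all = [...readJsonArray<T>(filePath), ...records];
  // Create the parent directory on first write.
  mkdirSync(dirname(filePath), { recursive: true });
  writeFileSync(filePath, JSON.stringify(all, null, 2) + "\n", "utf8");
  return all.length;
}
```

Rewriting the whole file on each append is simple but not safe for concurrent writers, which is one reason a database may be a better fit for larger workloads.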
- Node.js
- TypeScript
- SERP API
- OpenAI API
- ZenRows
- Playwright
- Cheerio
- html-to-text
- dotenv
- tsx
The pipeline generates candidate company names. Generation is currently scoped by location, for example to companies based in the United States.
For each company, the scraper queries Google Search through SERP API using advanced search operators, such as restricting results to LinkedIn company pages.
Example search intent:
site:linkedin.com/company "Company Name"
The scraper parses search results and extracts matching LinkedIn company profile URLs.
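The query construction and URL extraction described above can be sketched as follows. Both function names are hypothetical, and the sketch assumes the search results have already been reduced to a plain list of link strings; it does not call SERP API itself.

```typescript
// Hypothetical helper: builds a Google query restricted to LinkedIn company pages.
function buildLinkedInQuery(companyName: string): string {
  // Strip double quotes so the name stays inside a single quoted phrase.
  const safeName = companyName.replace(/"/g, "").trim();
  return `site:linkedin.com/company "${safeName}"`;
}

// Hypothetical helper: keeps only LinkedIn company profile links,
// normalized to a canonical form and de-duplicated.
function extractLinkedInCompanyUrls(links: string[]): string[] {
  const found = new Set<string>();
  for (const link of links) {
    let url: URL;
    try {
      url = new URL(link);
    } catch {
      continue; // skip malformed links
    }
    const isLinkedIn =
      url.hostname === "linkedin.com" || url.hostname.endsWith(".linkedin.com");
    const match = url.pathname.match(/^\/company\/([^/]+)/);
    if (isLinkedIn && match) {
      found.add(`https://www.linkedin.com/company/${match[1].toLowerCase()}`);
    }
  }
  return Array.from(found);
}
```

Normalizing regional subdomains and trailing path segments to one canonical URL keeps the same company from appearing twice in the dataset.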
The pipeline uses SERP API AI-powered search to retrieve public company information such as:
- Company size
- Company type
- Founding year
- Official website
- Industry
- Followers
- Overview
Raw enrichment results are sent to the OpenAI API and converted into clean, typed JSON. OpenAI is also used to generate concise, high-quality company descriptions.
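Because model output is not guaranteed to match the expected shape, it can help to validate the returned JSON before it is stored. The sketch below is one minimal way to do that; the `NormalizedCompany` interface and `parseNormalizedCompany` function are hypothetical and cover only a subset of the enrichment fields listed above.

```typescript
// Hypothetical shape: a subset of the enrichment fields listed above.
interface NormalizedCompany {
  name: string;
  website: string;
  industry: string;
  companySize: string;
  foundedYear?: number;
}

const REQUIRED_STRING_FIELDS = ["name", "website", "industry", "companySize"] as const;

// Parses a JSON string returned by the model and checks the fields before use.
function parseNormalizedCompany(rawJson: string): NormalizedCompany {
  const data: unknown = JSON.parse(rawJson); // throws on malformed JSON
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    throw new Error("Expected a JSON object");
  }
  const record = data as Record<string, unknown>;

  for (const field of REQUIRED_STRING_FIELDS) {
    const value = record[field];
    if (typeof value !== "string" || value.trim() === "") {
      throw new Error(`Missing or empty field: ${field}`);
    }
  }

  new URL(record.website as string); // throws if the website is not an absolute URL

  const year = record.foundedYear;
  if (year !== undefined && (typeof year !== "number" || !Number.isInteger(year))) {
    throw new Error("foundedYear must be an integer when present");
  }

  return {
    name: record.name as string,
    website: record.website as string,
    industry: record.industry as string,
    companySize: record.companySize as string,
    foundedYear: year as number | undefined,
  };
}
```

Records that fail validation could be routed to an error dataset instead of the main company file.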
For each enriched company, the scraper visits the official website and attempts to locate the careers or jobs page.
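One simple way to locate a careers page is to scan the links found on the homepage for careers-like paths. The sketch below assumes the links have already been extracted as `href` strings; the hint list and the `findCareersLinks` name are hypothetical and may differ from what the project uses.

```typescript
// Hypothetical path hints; the real project may use a different list.
const CAREERS_HINTS = ["career", "jobs", "join-us", "work-with-us", "open-positions"];

// Returns absolute, de-duplicated URLs whose path looks like a careers or jobs page.
function findCareersLinks(siteUrl: string, hrefs: string[]): string[] {
  const results = new Set<string>();
  for (const href of hrefs) {
    let url: URL;
    try {
      url = new URL(href, siteUrl); // resolves relative links against the site URL
    } catch {
      continue; // skip links that cannot be parsed
    }
    // Ignore mailto:, tel:, javascript: and similar schemes.
    if (url.protocol !== "http:" && url.protocol !== "https:") continue;
    const path = url.pathname.toLowerCase();
    if (CAREERS_HINTS.some((hint) => path.includes(hint))) {
      url.hash = ""; // drop fragments so /careers and /careers#open are one entry
      results.add(url.toString());
    }
  }
  return Array.from(results);
}
```

This path-only check misses careers pages hosted on a separate ATS domain with an unrelated path, so it works best as a first pass.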
The scraper focuses on software engineering roles and filters out irrelevant postings using software-related keywords.
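A keyword filter of this kind can be sketched as a whole-phrase match on the job title. The keyword lists below are hypothetical placeholders rather than the project's actual lists, and `isSoftwareJob` is an illustrative name.

```typescript
// Hypothetical keyword lists used only for this illustration.
const INCLUDE_KEYWORDS = [
  "engineer", "developer", "devops", "sre",
  "frontend", "backend", "fullstack", "full stack",
];
const EXCLUDE_KEYWORDS = ["sales engineer", "mechanical", "electrical", "civil"];

function isSoftwareJob(title: string): boolean {
  // Lowercase and replace punctuation with spaces so "Full-Stack" becomes "full stack".
  const normalized = ` ${title.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim()} `;
  // Match whole words or phrases only, so "sre" does not match inside another word.
  const containsPhrase = (phrase: string): boolean => normalized.includes(` ${phrase} `);
  // Exclusions win over inclusions.
  if (EXCLUDE_KEYWORDS.some(containsPhrase)) return false;
  return INCLUDE_KEYWORDS.some(containsPhrase);
}
```

Checking exclusions first keeps titles such as "Sales Engineer" out even though they contain an included keyword.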
Job pages are scraped and reduced to compact plain text suitable for AI processing.
OpenAI transforms raw job content into structured fields including requirements, benefits, locations, seniority, workplace type, and employment type.
Each job is classified into one of the supported software engineering categories:
- frontend
- backend
- fullstack
- devops
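Since a classifier may return labels in varying forms, such as "Full Stack" or "Front-end", a small normalization step can map them onto the four supported values. The sketch below is an assumption about how that could look; `JOB_CATEGORIES` and `toJobCategory` are hypothetical names.

```typescript
// The four supported categories as a readonly tuple, so the union type is derived from it.
const JOB_CATEGORIES = ["frontend", "backend", "fullstack", "devops"] as const;
type JobCategory = (typeof JOB_CATEGORIES)[number];

// Maps a free-form label to a supported category, or null when it is not supported.
function toJobCategory(label: string): JobCategory | null {
  // Keep letters only, so "Full Stack", "full-stack" and "FullStack" all become "fullstack".
  const key = label.toLowerCase().replace(/[^a-z]/g, "");
  return (JOB_CATEGORIES as readonly string[]).includes(key) ? (key as JobCategory) : null;
}
```

Returning `null` for unknown labels lets the caller decide whether to drop the job or record it as an error.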
- Node.js 20 or newer recommended
- npm
- SERP API key
- OpenAI API key
- ZenRows API key
Clone the repository:
git clone https://github.com/your-org/talentgraph-scraper.git
cd talentgraph-scraper
Install dependencies:
npm install
Create a .env file:
touch .env
Add the required environment variables:
SERP_API_KEY=your_serp_api_key
OPENAI_API_KEY=your_openai_api_key
ZENROWS_API_KEY=your_zenrows_api_key
Build the project:
npm run build
Run the company pipeline:
npx tsx src/companies/companies-main-script.ts <companies-to-update> <companies-to-generate>
Example:
npx tsx src/companies/companies-main-script.ts 25 50
This generates 50 candidate company names and updates up to 25 company profiles.
Run the jobs pipeline:
npx tsx src/jobs/jobs-main-script.ts <companies-to-scrape>
Example:
npx tsx src/jobs/jobs-main-script.ts 20
This visits up to 20 eligible company websites, finds careers pages, extracts software jobs, normalizes the data, and appends results to the jobs dataset.
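Both scripts take positional numeric arguments. As a sketch of how such arguments might be validated before any paid API call is made, the hypothetical helpers below reject missing or non-numeric values; the actual scripts may parse their arguments differently.

```typescript
// Hypothetical helper: parses one positional argument as a non-negative integer.
function parseCountArg(value: string | undefined, name: string): number {
  if (value === undefined) {
    throw new Error(`Missing argument <${name}>`);
  }
  if (!/^\d+$/.test(value)) {
    throw new Error(`<${name}> must be a non-negative integer, got "${value}"`);
  }
  return Number(value);
}

// argv is expected to mirror process.argv.slice(2).
function parseJobsArgs(argv: string[]): { companiesToScrape: number } {
  return { companiesToScrape: parseCountArg(argv[0], "companies-to-scrape") };
}

function parseCompaniesArgs(argv: string[]): { toUpdate: number; toGenerate: number } {
  return {
    toUpdate: parseCountArg(argv[0], "companies-to-update"),
    toGenerate: parseCountArg(argv[1], "companies-to-generate"),
  };
}
```

Failing fast on bad input avoids starting a run that would spend API credits with an unintended limit.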
Run the test suite:
npm test
Company records are stored in:
src/companies/data/companies-info.json
Example:
{
  "name": "ExampleCloud",
  "website": "https://examplecloud.com",
  "industry": "Software",
  "customerTypes": ["B2B"],
  "technology": "Cloud Systems",
  "companySize": "201-500",
  "foundedYear": 2018,
  "description": ["ExampleCloud helps engineering teams deploy, monitor, and scale cloud-native applications."],
  "id": "examplecloud",
  "lastScrapedAt": "2026-05-19",
  "linkedinUrl": "https://www.linkedin.com/company/examplecloud",
  "overview": "ExampleCloud is a privately held software company focused on cloud infrastructure.",
  "followers": 12500,
  "followersInfo": "12,500 followers on LinkedIn",
  "listJobsUrl": "https://examplecloud.com/careers"
}
Job records are stored in:
src/jobs/data/jobs-info.json
Example:
{
  "title": "Senior Backend Engineer",
  "url": "https://examplecloud.com/careers/senior-backend-engineer",
  "locations": [
    {
      "scope": "country",
      "workplaceType": "remote",
      "businessRegion": "amer",
      "continent": "northamerica",
      "country": "unitedstates"
    }
  ],
  "remoteLocationTokens": ["United States"],
  "onsiteOrHybrifLocationTokens": [],
  "skills": ["Node.js", "TypeScript", "PostgreSQL", "AWS"],
  "workplaceTypes": ["remote"],
  "employmentType": "fulltime",
  "description": ["Build and operate backend services that power customer-facing cloud infrastructure products."],
  "requirements": [
    "5+ years of backend engineering experience",
    "Strong production experience with TypeScript and distributed systems"
  ],
  "benefits": ["Health insurance", "Remote work stipend", "Equity package"],
  "overview": "Senior backend role focused on scalable cloud infrastructure services.",
  "seniority": "senior",
  "category": "backend",
  "publishedAt": "2026-05-10",
  "companyId": "examplecloud",
  "salaryRange": {
    "min": 150000,
    "max": 210000,
    "interval": "year",
    "currency": "USD",
    "bonus": true,
    "equity": true
  }
}
Configuration is primarily handled through source maps, prompts, keyword lists, and JSON data files.
| Variable | Description |
|---|---|
| SERP_API_KEY | API key used for Google Search and AI-powered SERP API queries |
| OPENAI_API_KEY | API key used for normalization, enrichment, and classification |
| ZENROWS_API_KEY | API key used for proxy-backed scraping and anti-bot handling |
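A startup check can report every missing key in one message instead of failing later in the middle of a run. The sketch below is one possible approach; `readRequiredEnv` is a hypothetical helper, not a function from the repository.

```typescript
// The three variables from the table above.
const REQUIRED_ENV_VARS = ["SERP_API_KEY", "OPENAI_API_KEY", "ZENROWS_API_KEY"] as const;
type RequiredEnvVar = (typeof REQUIRED_ENV_VARS)[number];

// Hypothetical helper: returns the required keys or throws a single error naming all missing ones.
function readRequiredEnv(
  env: Record<string, string | undefined>,
): Record<RequiredEnvVar, string> {
  // Treat unset, empty, and whitespace-only values as missing.
  const missing = REQUIRED_ENV_VARS.filter((name) => !env[name]?.trim());
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    SERP_API_KEY: env.SERP_API_KEY as string,
    OPENAI_API_KEY: env.OPENAI_API_KEY as string,
    ZENROWS_API_KEY: env.ZENROWS_API_KEY as string,
  };
}
```

In a script, this could be called as `readRequiredEnv(process.env)` after the .env file has been loaded with dotenv.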
Company-related configuration lives in:
src/companies
Key files include:
src/companies/prompts
src/companies/data/companies-info.json
src/companies/data/error-companies.json
src/companies/data/forbidden-companies.json
src/model/companies-model.ts
You can adjust:
- Company generation prompts
- Enrichment prompts
- Allowed industries
- Forbidden industries
- Company size values
- Business type values
- Technology categories
Job-related configuration lives in:
src/jobs
Key files include:
src/jobs/jobs-keywords.ts
src/jobs/map/job-info-map.ts
src/jobs/data/jobs-info.json
src/jobs/data/jobs-errors.json
src/model/jobs-model.ts
You can adjust:
- Software job keywords
- Excluded job keywords
- Supported job categories
- Seniority levels
- Workplace types
- Employment types
- Supported locations
- Supported currencies
- ATS scraping behavior
- SERP API, OpenAI, and ZenRows usage may incur cost.
- Search results can vary by region, query wording, and time.
- Company and job data may be incomplete, stale, or inconsistent across public sources.
- AI-normalized output should be validated before being used in production workflows.
- Some websites may block scraping even with proxy and anti-bot tooling.
- CAPTCHA bypass and proxy usage should comply with provider terms and applicable laws.
- Respect robots.txt, website terms of service, and data privacy regulations.
- Rate limits should be handled carefully to avoid unnecessary API failures or account throttling.
- LinkedIn and company websites may change page structure, causing extraction logic to require maintenance.
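For the rate-limit caveat above, a common mitigation is to retry failed calls with exponential backoff. The sketch below shows the general technique only and does not describe existing project behavior; `backoffDelayMs` and `withRetry` are hypothetical names, and the default delays are arbitrary.

```typescript
// Delay doubles with each attempt (attempt 0 = baseMs) and is capped at capMs.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Runs task up to maxAttempts times, sleeping between failures; rethrows the last error.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = defaultSleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // No sleep after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

A production version would typically add random jitter and retry only on errors that are known to be transient, such as HTTP 429 responses.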
- Add database persistence with PostgreSQL or MongoDB
- Add a web dashboard for browsing companies and jobs
- Support more job categories such as Data Engineering, Security, Mobile, and ML Engineering
- Add queue-based scraping for larger workloads
- Add retry policies and backoff strategies
- Add richer observability and scraping metrics
- Improve ATS-specific scrapers
- Add deduplication for companies and job postings
- Add salary estimation when compensation is missing
- Add automated data quality checks
- Add CI workflows for tests and linting
Contributions are welcome.
Recommended workflow:
- Fork the repository.
- Create a feature branch.
- Make a focused change.
- Add or update tests when relevant.
- Run the test suite.
- Open a pull request with a clear description of the change.
Please keep contributions scoped, readable, and aligned with the existing TypeScript style.
MIT License. Update this section with the final license text before publishing.