Node Scraper

A Node.js and TypeScript web scraping pipeline for discovering companies, enriching company profiles, and extracting software engineering job opportunities from company websites.

Node Scraper combines Google Search via SERP API, AI-assisted enrichment, OpenAI-based normalization, and ZenRows-powered scraping to turn unstructured web data into clean JSON datasets for company intelligence and engineering job discovery.

Features

Automated company name generation
LinkedIn company profile discovery through Google Search operators
Company enrichment using SERP API AI-powered search
OpenAI-powered data normalization and description generation
Website crawling to discover careers and jobs pages
Software engineering job extraction from company websites and ATS pages
Job normalization into structured fields:
- Description
- Requirements
- Benefits
- Location
- Employment type
- Workplace type
- Seniority
- Salary range
Job classification into:
- Frontend
- Backend
- Full Stack
- DevOps
Scraping support through ZenRows:
- Proxy rotation
- CAPTCHA handling
- Anti-bot protection
- Header management
Local JSON-based datasets for companies, jobs, errors, and scraping state

Architecture Overview

Company Generator
       |
       v
Google Search via SERP API
       |
       v
LinkedIn Company URL Discovery
       |
       v
SERP API AI Enrichment
       |
       v
OpenAI Normalization
       |
       v
Company Dataset
       |
       v
Website / Careers Page Scraping
       |
       v
Software Job Extraction
       |
       v
OpenAI Job Structuring
       |
       v
Job Classification
       |
       v
Structured JSON Output

The pipeline is split into two major stages:

Company discovery and enrichment
Careers page discovery, job scraping, and job normalization

Data is persisted in JSON files under src/companies/data and src/jobs/data.

Tech Stack

Node.js
TypeScript
SERP API
OpenAI API
ZenRows
Playwright
Cheerio
html-to-text
dotenv
tsx

How It Works

1. Generate Company Names

The pipeline generates candidate company names. Generation is currently scoped by location (for example, companies based in the United States).

2. Discover LinkedIn Profiles

For each company, the scraper queries Google Search through SERP API using advanced search operators, such as the site: operator, which restricts results to LinkedIn company pages.

Example search intent:

site:linkedin.com/company "Company Name"
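A minimal sketch of how such a query string could be assembled. The function name `buildLinkedInQuery` is hypothetical and is not taken from the repository; it only illustrates the search operator shown above.

```typescript
// Hypothetical helper: builds a Google query restricted to LinkedIn company pages.
function buildLinkedInQuery(companyName: string): string {
  // Remove double quotes so the name cannot break out of the exact-match phrase.
  const safeName = companyName.replace(/"/g, "").trim();
  return `site:linkedin.com/company "${safeName}"`;
}

console.log(buildLinkedInQuery("ExampleCloud"));
```

The resulting string would then be passed as the search query to whichever SERP API client the project uses.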

3. Extract LinkedIn Company URLs

The scraper parses search results and extracts matching LinkedIn company profile URLs.
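The extraction step can be sketched as a filter over result links. This is an illustration under assumptions: the function name `extractLinkedInCompanyUrls` and the choice to normalize every match to a `www.linkedin.com` URL are hypothetical, and the real parser may behave differently.

```typescript
// Hypothetical helper: keeps only LinkedIn company profile URLs and normalizes them.
function extractLinkedInCompanyUrls(links: string[]): string[] {
  const found = new Set<string>();
  for (const link of links) {
    let url: URL;
    try {
      url = new URL(link);
    } catch {
      continue; // Skip malformed links.
    }
    const host = url.hostname.toLowerCase();
    // Accept linkedin.com and regional subdomains such as uk.linkedin.com.
    if (host !== "linkedin.com" && !host.endsWith(".linkedin.com")) continue;
    // Only company pages are relevant, not personal profiles or posts.
    const match = url.pathname.match(/^\/company\/([^/]+)/);
    if (!match) continue;
    found.add(`https://www.linkedin.com/company/${match[1].toLowerCase()}`);
  }
  return Array.from(found);
}
```

Using a Set removes duplicates when several search results point to the same company slug.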

4. Enrich Company Data

The pipeline uses SERP API AI-powered search to retrieve public company information such as:

Company size
Company type
Founding year
Official website
Industry
Followers
Overview

5. Normalize Company Profiles with OpenAI

Raw enrichment results are sent to the OpenAI API and converted into clean, typed JSON. OpenAI is also used to generate concise, high-quality company descriptions.

6. Visit Company Websites

For each enriched company, the scraper visits the official website and attempts to locate the careers or jobs page.
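One way to locate a careers page is to scan the links found on the homepage for careers-related hints. The sketch below assumes the links were already extracted (for example with Cheerio); the `PageLink` type, the `CAREERS_HINTS` list, and `findCareersUrl` are hypothetical names used for illustration only.

```typescript
// Hypothetical shape of a link collected from the company homepage.
interface PageLink {
  href: string;
  text: string;
}

// Illustrative hints; the project's real list may differ.
const CAREERS_HINTS = ["careers", "jobs", "join-us", "work-with-us"];

// Returns the first link whose href or text looks careers-related, as an absolute URL.
function findCareersUrl(baseUrl: string, links: PageLink[]): string | null {
  for (const link of links) {
    const haystack = `${link.href} ${link.text}`.toLowerCase();
    if (!CAREERS_HINTS.some((hint) => haystack.includes(hint))) continue;
    try {
      // Resolve relative hrefs such as "/careers" against the site URL.
      return new URL(link.href, baseUrl).toString();
    } catch {
      continue; // Ignore hrefs that cannot be resolved.
    }
  }
  return null;
}
```

A simple substring match like this can produce false positives, so a real implementation would likely verify the candidate page before scraping it.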

7. Scrape Job Listings

The scraper focuses on software engineering roles and filters out irrelevant postings using software-related keywords.
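Keyword filtering of this kind can be sketched as follows. The keyword lists here are illustrative assumptions, not the contents of src/jobs/jobs-keywords.ts, and `isSoftwareJobTitle` is a hypothetical name.

```typescript
// Illustrative keyword lists; the real ones live in the project's keyword file.
const SOFTWARE_KEYWORDS = ["software", "engineer", "developer", "frontend", "backend", "devops"];
const EXCLUDED_KEYWORDS = ["sales", "mechanical", "civil"];

// Returns true when a title matches a software keyword and no excluded keyword.
function isSoftwareJobTitle(title: string): boolean {
  const normalized = title.toLowerCase();
  if (EXCLUDED_KEYWORDS.some((keyword) => normalized.includes(keyword))) return false;
  return SOFTWARE_KEYWORDS.some((keyword) => normalized.includes(keyword));
}
```

Checking exclusions first means a title such as "Sales Engineer" is dropped even though it contains a software keyword.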

8. Extract Job Details

Job pages are scraped and their HTML is reduced to plain text suitable for AI processing.

9. Normalize Jobs with OpenAI

OpenAI transforms raw job content into structured fields including requirements, benefits, locations, seniority, workplace type, and employment type.

10. Classify Jobs

Each job is classified into one of the supported software engineering categories:

frontend
backend
fullstack
devops
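Because the category is expected to be one of four fixed values, a classifier's raw output can be checked against that set before it is stored. The following is a sketch under assumptions: `JOB_CATEGORIES`, `JobCategory`, and `parseJobCategory` are hypothetical names, and the project's own model in src/model/jobs-model.ts may define these differently.

```typescript
// The four category values listed above, as a readonly tuple.
const JOB_CATEGORIES = ["frontend", "backend", "fullstack", "devops"] as const;
type JobCategory = (typeof JOB_CATEGORIES)[number];

// Normalizes a raw label (for example "Full Stack") and returns null if it is unsupported.
function parseJobCategory(raw: string): JobCategory | null {
  const normalized = raw.toLowerCase().replace(/[\s_-]+/g, "");
  return (JOB_CATEGORIES as readonly string[]).includes(normalized)
    ? (normalized as JobCategory)
    : null;
}
```

Returning null for unknown labels lets the caller decide whether to skip the job or log it as an error.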

Installation

Prerequisites

Node.js 20 or newer (recommended)
npm
SERP API key
OpenAI API key
ZenRows API key

Setup

Clone the repository:

git clone https://github.com/your-org/talentgraph-scraper.git
cd talentgraph-scraper

Install dependencies:

npm install

Create a .env file:

touch .env

Add the required environment variables:

SERP_API_KEY=your_serp_api_key
OPENAI_API_KEY=your_openai_api_key
ZENROWS_API_KEY=your_zenrows_api_key
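A startup check for these variables can fail fast when one is missing. This is a hedged sketch: `REQUIRED_ENV` and `findMissingEnv` are hypothetical names, and the repository may validate its environment differently or not at all.

```typescript
// The three variables listed above.
const REQUIRED_ENV = ["SERP_API_KEY", "OPENAI_API_KEY", "ZENROWS_API_KEY"] as const;

// Returns the names of required variables that are unset or blank.
function findMissingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => (env[name] ?? "").trim() === "");
}
```

In a script, this could be called with `process.env` after dotenv has loaded the .env file, throwing an error if the returned list is not empty.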

Build the project:

npm run build

Usage

Run Company Discovery and Enrichment

npx tsx src/companies/companies-main-script.ts <companies-to-update> <companies-to-generate>

Example:

npx tsx src/companies/companies-main-script.ts 25 50

This generates 50 candidate company names and updates up to 25 company profiles.

Run Job Scraping and Normalization

npx tsx src/jobs/jobs-main-script.ts <companies-to-scrape>

Example:

npx tsx src/jobs/jobs-main-script.ts 20

This visits up to 20 eligible company websites, finds careers pages, extracts software jobs, normalizes the data, and appends results to the jobs dataset.

Run Tests

npm test

Output Format

Company Data

Company records are stored in:

src/companies/data/companies-info.json

Example:

{
  "name": "ExampleCloud",
  "website": "https://examplecloud.com",
  "industry": "Software",
  "customerTypes": ["B2B"],
  "technology": "Cloud Systems",
  "companySize": "201-500",
  "foundedYear": 2018,
  "description": ["ExampleCloud helps engineering teams deploy, monitor, and scale cloud-native applications."],
  "id": "examplecloud",
  "lastScrapedAt": "2026-05-19",
  "linkedinUrl": "https://www.linkedin.com/company/examplecloud",
  "overview": "ExampleCloud is a privately held software company focused on cloud infrastructure.",
  "followers": 12500,
  "followersInfo": "12,500 followers on LinkedIn",
  "listJobsUrl": "https://examplecloud.com/careers"
}

Job Data

Job records are stored in:

src/jobs/data/jobs-info.json

Example:

{
  "title": "Senior Backend Engineer",
  "url": "https://examplecloud.com/careers/senior-backend-engineer",
  "locations": [
    {
      "scope": "country",
      "workplaceType": "remote",
      "businessRegion": "amer",
      "continent": "northamerica",
      "country": "unitedstates"
    }
  ],
  "remoteLocationTokens": ["United States"],
  "onsiteOrHybrifLocationTokens": [],
  "skills": ["Node.js", "TypeScript", "PostgreSQL", "AWS"],
  "workplaceTypes": ["remote"],
  "employmentType": "fulltime",
  "description": ["Build and operate backend services that power customer-facing cloud infrastructure products."],
  "requirements": [
    "5+ years of backend engineering experience",
    "Strong production experience with TypeScript and distributed systems"
  ],
  "benefits": ["Health insurance", "Remote work stipend", "Equity package"],
  "overview": "Senior backend role focused on scalable cloud infrastructure services.",
  "seniority": "senior",
  "category": "backend",
  "publishedAt": "2026-05-10",
  "companyId": "examplecloud",
  "salaryRange": {
    "min": 150000,
    "max": 210000,
    "interval": "year",
    "currency": "USD",
    "bonus": true,
    "equity": true
  }
}
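Since the salary range comes from AI-normalized output, it may be worth checking its shape before use. The interface below is inferred from the example record only, so the field types are assumptions, and `isSalaryRange` is a hypothetical helper rather than part of the project's model.

```typescript
// Shape inferred from the example above; the real model may differ.
interface SalaryRange {
  min: number;
  max: number;
  interval: string;
  currency: string;
  bonus: boolean;
  equity: boolean;
}

// Type guard: checks field types and that min does not exceed max.
function isSalaryRange(value: unknown): value is SalaryRange {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.min === "number" &&
    typeof v.max === "number" &&
    v.min <= v.max &&
    typeof v.interval === "string" &&
    typeof v.currency === "string" &&
    typeof v.bonus === "boolean" &&
    typeof v.equity === "boolean"
  );
}
```

The same pattern could be extended to the other fields of a job record.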

Configuration

Configuration is primarily handled in the source code through mapping modules (such as src/jobs/map/job-info-map.ts), prompts, keyword lists, and JSON data files.

Environment Variables

Variable	Description
`SERP_API_KEY`	API key used for Google Search and AI-powered SERP API queries
`OPENAI_API_KEY`	API key used for normalization, enrichment, and classification
`ZENROWS_API_KEY`	API key used for proxy-backed scraping and anti-bot handling

Company Configuration

Company-related configuration lives in:

src/companies

Key files include:

src/companies/prompts
src/companies/data/companies-info.json
src/companies/data/error-companies.json
src/companies/data/forbidden-companies.json
src/model/companies-model.ts

You can adjust:

Company generation prompts
Enrichment prompts
Allowed industries
Forbidden industries
Company size values
Business type values
Technology categories

Job Configuration

Job-related configuration lives in:

src/jobs

Key files include:

src/jobs/jobs-keywords.ts
src/jobs/map/job-info-map.ts
src/jobs/data/jobs-info.json
src/jobs/data/jobs-errors.json
src/model/jobs-model.ts

You can adjust:

Software job keywords
Excluded job keywords
Supported job categories
Seniority levels
Workplace types
Employment types
Supported locations
Supported currencies
ATS scraping behavior

Limitations & Considerations

SERP API, OpenAI, and ZenRows usage may incur cost.
Search results can vary by region, query wording, and time.
Company and job data may be incomplete, stale, or inconsistent across public sources.
AI-normalized output should be validated before being used in production workflows.
Some websites may block scraping even with proxy and anti-bot tooling.
CAPTCHA bypass and proxy usage should comply with provider terms and applicable laws.
Respect robots.txt, website terms of service, and data privacy regulations.
Rate limits should be handled carefully to avoid unnecessary API failures or account throttling.
LinkedIn and company websites may change page structure, causing extraction logic to require maintenance.
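One common way to handle rate limits is to retry failed calls with exponential backoff. The sketch below is not part of the project (retry policies are listed on the roadmap); `backoffDelayMs` and `withRetry` are hypothetical names, and the default delays are arbitrary.

```typescript
// Delay doubles with each attempt (0-based) and is capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30_000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs the task up to maxAttempts times, waiting between failed attempts.
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts = 3,
  baseMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      // Do not wait after the final attempt.
      if (attempt < maxAttempts - 1) {
        await new Promise((resolve) => setTimeout(resolve, backoffDelayMs(attempt, baseMs)));
      }
    }
  }
  throw lastError;
}
```

A real policy would probably retry only on rate-limit or transient errors and add random jitter to the delay.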

Roadmap

Add database persistence with PostgreSQL or MongoDB
Add a web dashboard for browsing companies and jobs
Support more job categories such as Data Engineering, Security, Mobile, and ML Engineering
Add queue-based scraping for larger workloads
Add retry policies and backoff strategies
Add richer observability and scraping metrics
Improve ATS-specific scrapers
Add deduplication for companies and job postings
Add salary estimation when compensation is missing
Add automated data quality checks
Add CI workflows for tests and linting

Contributing

Contributions are welcome.

Recommended workflow:

Fork the repository.
Create a feature branch.
Make a focused change.
Add or update tests when relevant.
Run the test suite.
Open a pull request with a clear description of the change.

Please keep contributions scoped, readable, and aligned with the existing TypeScript style.

License

MIT License. Update this section with the final license text before publishing.
