GitHub Data Pipeline

A headless TypeScript pipeline that discovers GitHub developers, scores their open-source contributions, and stores ranked profiles in PostgreSQL — built for leaderboards and talent discovery.

How it works

The pipeline runs in four stages: Discover → Scrape → Score → Analyze.

Discover — Finds GitHub users by location and follower range via the Search API
Scrape — Fetches repos, languages, topics, and merged PRs via GraphQL
Score — Computes score = stars × (userPRs / totalPRs) per repo, then sums to a total
Analyze — Categorizes each user across five skill areas: AI, Backend, Frontend, DevOps, Data

All GraphQL responses are cached in PostgreSQL (SHA-256 keyed, 30-day TTL). Multiple GitHub tokens are rotated automatically based on remaining quota.
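To make the caching rule concrete, here is a minimal sketch of a SHA-256 cache key and a 30-day freshness check. The function names, the key layout (query plus sorted variables), and the `CACHE_TTL_MS` constant are assumptions for illustration; the actual key derivation used with `api_cache` may differ.

```typescript
import { createHash } from "node:crypto";

// Hypothetical constant mirroring the 30-day TTL described above.
const CACHE_TTL_MS = 30 * 24 * 60 * 60 * 1000;

// Build a deterministic SHA-256 key from a GraphQL query and its variables.
// Sorting the top-level variable names keeps the key stable regardless of
// the order in which the caller listed them (nested objects are not sorted).
function cacheKey(query: string, variables: Record<string, unknown>): string {
  const sorted = Object.keys(variables)
    .sort()
    .map((name) => [name, variables[name]]);
  return createHash("sha256")
    .update(JSON.stringify({ query, variables: sorted }))
    .digest("hex");
}

// A cached entry is fresh while it is younger than the TTL.
function isFresh(cachedAt: Date, now: Date = new Date()): boolean {
  return now.getTime() - cachedAt.getTime() < CACHE_TTL_MS;
}
```

A lookup would then compute `cacheKey(query, variables)`, read the matching row, and fall back to the GitHub API only when the row is missing or `isFresh` returns false.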

Setup

Prerequisites: Node.js 20+, npm, and PostgreSQL. Running the TypeScript scripts directly with npx tsx is optional.

git clone https://github.com/chemicoholic21/github-data-pipeline.git
cd github-data-pipeline
npm install         # or: npm ci
cp .env.example .env   # add DATABASE_URL and GitHub tokens
# If you prefer the tsx runtime: use `npx tsx` for direct TypeScript execution
npm run db:push

Run the pipeline:

# Use npm to run the package scripts
npm run bulk-discover "Chennai"
npm run bulk-discover -- "San Francisco" 0 1   # with start index and page

# Or run the TypeScript entry directly with npx/tsx
npx tsx src/scripts/bulk-discover.ts "Chennai"
npx tsx src/scripts/bulk-discover.ts "San Francisco" 0 1

Overview

This pipeline:

Discovers GitHub users by location and follower range using the GitHub Search API (@octokit/rest)
Stage 1 (SCRAPE): Fetches deep profile data (repos, languages, topics, merged PRs) via the GitHub GraphQL API → github_users, github_repos, github_pull_requests
Stage 2 (COMPUTE): Calculates per-repository scores using score = stars × (userPRs / totalPRs) → user_repo_scores
Stage 3 (AGGREGATE): Sums all repo scores → user_scores, syncs to leaderboard
Stage 4 (ANALYZE): Categorizes repos by topics/languages, computes skill scores across five categories (AI, Backend, Frontend, DevOps, Data) → user_skill_scores
Caches all GitHub API responses in api_cache (SHA-256 keyed) to avoid redundant requests
Manages a pool of multiple GitHub tokens, automatically rotating to the token with the highest remaining quota via token_rate_limit
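The token rotation described in the last item can be sketched as follows. The `TokenQuota` shape and both function names are hypothetical; the real `token_rate_limit` columns and selection logic may differ.

```typescript
// Hypothetical per-token record; the real token_rate_limit columns may differ.
interface TokenQuota {
  token: string;
  remaining: number; // requests left in the current rate-limit window
  resetAt: Date; // when the quota for this token resets
}

// Pick the token with the most remaining quota.
// Returns undefined when every token in the pool is exhausted.
function pickToken(pool: TokenQuota[]): TokenQuota | undefined {
  let best: TokenQuota | undefined;
  for (const entry of pool) {
    if (entry.remaining <= 0) continue;
    if (best === undefined || entry.remaining > best.remaining) best = entry;
  }
  return best;
}

// When all tokens are exhausted, compute how long to wait for the earliest reset.
function msUntilNextReset(pool: TokenQuota[], now: Date = new Date()): number {
  if (pool.length === 0) return 0;
  const earliest = Math.min(...pool.map((entry) => entry.resetAt.getTime()));
  return Math.max(0, earliest - now.getTime());
}
```

Choosing the token with the highest remaining quota spreads load across the pool and delays the point at which any single token hits its limit.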

Running the Pipeline

⚠️ Scraping GitHub data is slow and can take hours for large regions. Always run it inside a tmux session so it survives disconnects:

tmux new -s pipeline        # start a named session
# ... run your command ...
# Ctrl+B, D to detach       # safely disconnect
tmux attach -t pipeline     # reattach later

Scripts

1. Scrape a region from GitHub (makes API calls)

Discovers developers by location, fetches their repos and PRs, scores them, and writes everything to the database.

# Single region (npm)
npm run bulk-discover "Bengaluru"

# Multiple regions in one run
npm run bulk-discover -- "Bengaluru, San Francisco, London, Berlin, Mumbai, Beijing"

# Resume from a specific range index and page (useful after a crash or rate limit)
npm run bulk-discover -- "Bengaluru, San Francisco" 0 5   # start at range 0, page 5
npm run bulk-discover -- "Bengaluru, San Francisco" 2 1   # start at range 2, page 1

# Or run directly with npx/tsx
npx tsx src/scripts/bulk-discover.ts "Bengaluru"
npx tsx src/scripts/bulk-discover.ts "Bengaluru, San Francisco" 0 5
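The argument layout used above (a comma-separated region list, then an optional range index and page) could be parsed along these lines. This is only a sketch: `parseDiscoverArgs` and `toNonNegativeInt` are hypothetical names, and the defaults of range 0 and page 1 are assumptions inferred from the examples, not taken from `bulk-discover.ts`.

```typescript
interface DiscoverArgs {
  regions: string[];
  startRange: number; // index of the follower range to start from
  startPage: number; // search results page to start from
}

// Parse a non-negative integer argument, using the fallback when it is absent.
function toNonNegativeInt(value: string | undefined, fallback: number): number {
  if (value === undefined) return fallback;
  const parsed = Number.parseInt(value, 10);
  if (!Number.isInteger(parsed) || parsed < 0) {
    throw new Error(`expected a non-negative integer, got "${value}"`);
  }
  return parsed;
}

// Split "Bengaluru, San Francisco" into trimmed region names and read the
// optional resume position. Defaults (range 0, page 1) are assumed.
function parseDiscoverArgs(argv: string[]): DiscoverArgs {
  const [regionArg = "", rangeArg, pageArg] = argv;
  const regions = regionArg
    .split(",")
    .map((region) => region.trim())
    .filter((region) => region.length > 0);
  if (regions.length === 0) {
    throw new Error('expected at least one region, e.g. "Bengaluru"');
  }
  return {
    regions,
    startRange: toNonNegativeInt(rangeArg, 0),
    startPage: toNonNegativeInt(pageArg, 1),
  };
}
```

In a script this would typically be called as `parseDiscoverArgs(process.argv.slice(2))`.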

2. Refresh worker (continuous profile updates)

A daemon that refreshes stale GitHub profiles (older than 30 days by default). It runs indefinitely, picking the users with the oldest data and re-running the full pipeline on each.

# Start the refresh worker (npm)
npm run refresh-worker

# Or deploy via tmux for persistence
deploy/run-worker.sh

# Or run directly with npx/tsx
npx tsx src/scripts/refresh-worker.ts

Environment tunables:

Variable	Default	Description
`REFRESH_AFTER_DAYS`	30	Days before a profile is considered stale
`REFRESH_BATCH_SIZE`	200	Users to fetch per batch
`PER_USER_DELAY_MS`	1500	Delay between users (rate limit safety)
`IDLE_SLEEP_MS`	300000	Sleep when no stale users (5 min)
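As an illustration of how these tunables might be consumed, the sketch below reads each variable with the default from the table and applies the staleness rule. The helper names (`intFromEnv`, `loadRefreshConfig`, `isStale`) are hypothetical and not taken from `refresh-worker.ts`; falling back to the default on invalid input is also an assumption.

```typescript
type Env = Record<string, string | undefined>;

// Read a non-negative integer from the environment, falling back to the
// documented default when the variable is unset or not a valid number.
function intFromEnv(name: string, fallback: number, env: Env = process.env): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number.parseInt(raw, 10);
  return Number.isInteger(parsed) && parsed >= 0 ? parsed : fallback;
}

// Defaults match the table above.
function loadRefreshConfig(env: Env = process.env) {
  return {
    refreshAfterDays: intFromEnv("REFRESH_AFTER_DAYS", 30, env),
    refreshBatchSize: intFromEnv("REFRESH_BATCH_SIZE", 200, env),
    perUserDelayMs: intFromEnv("PER_USER_DELAY_MS", 1500, env),
    idleSleepMs: intFromEnv("IDLE_SLEEP_MS", 300_000, env),
  };
}

// A profile is stale once it is older than the configured number of days.
function isStale(lastFetchedAt: Date, refreshAfterDays: number, now: Date = new Date()): boolean {
  const ageMs = now.getTime() - lastFetchedAt.getTime();
  return ageMs > refreshAfterDays * 24 * 60 * 60 * 1000;
}
```

For example, running with `REFRESH_AFTER_DAYS=7` would make `loadRefreshConfig().refreshAfterDays` equal to 7 while the other values keep their defaults.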

3. Populate leaderboard from cached data (no API calls)

Use this when the data is already scraped and you only need to (re)populate the leaderboard. It reads entirely from api_cache and makes no GitHub calls.

# Populate everything from cache
npx tsx src/scripts/populate-leaderboard-from-cache.ts

# Only process users not yet in the leaderboard (safest for large caches)
npx tsx src/scripts/populate-leaderboard-from-cache.ts --only-missing

# Preview what would run without writing anything
npx tsx src/scripts/populate-leaderboard-from-cache.ts --dry-run --limit=10

# Process a single user
npx tsx src/scripts/populate-leaderboard-from-cache.ts --username=torvalds

# Resume from a specific offset
npx tsx src/scripts/populate-leaderboard-from-cache.ts --offset=1000 --limit=500

4. Bulk SQL scripts (fastest — runs inside PostgreSQL)

Use these to recompute scores or refresh the leaderboard after schema changes or bulk imports. Much faster than the TypeScript equivalents.

# Using npm
npm run sql:populate-analyses       # recompute skill scores from repos + PRs  (~2 min for 72K users)
npm run sql:populate-leaderboard    # sync scored users → leaderboard          (~30s for 72K users)

# Or use npx/tsx to run the TypeScript runner directly
npx tsx src/scripts/run-sql.ts populate-analyses
npx tsx src/scripts/run-sql.ts populate-leaderboard

Run populate-analyses before populate-leaderboard if recomputing from scratch.

Scoring

repo_score = stars × (user_merged_prs / total_merged_prs)

Repos with fewer than 10 stars are excluded
Score is capped at 10,000 per repo
Total score is the sum across all qualifying repos
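The three rules above can be sketched as a small pair of functions. The type and function names are illustrative, and returning 0 for a repo with no merged PRs is an assumption added to avoid dividing by zero; the pipeline's actual handling of that case is not stated here.

```typescript
const MIN_STARS = 10; // repos below this are excluded
const MAX_REPO_SCORE = 10_000; // per-repo cap

interface RepoContribution {
  stars: number;
  userMergedPrs: number;
  totalMergedPrs: number;
}

// repo_score = stars × (user_merged_prs / total_merged_prs), capped at 10,000.
function repoScore(repo: RepoContribution): number {
  // Assumption: a repo with no merged PRs contributes nothing.
  if (repo.stars < MIN_STARS || repo.totalMergedPrs <= 0) return 0;
  const raw = repo.stars * (repo.userMergedPrs / repo.totalMergedPrs);
  return Math.min(raw, MAX_REPO_SCORE);
}

// Total score is the sum over all qualifying repos.
function totalScore(repos: RepoContribution[]): number {
  return repos.reduce((sum, repo) => sum + repoScore(repo), 0);
}
```

For instance, a user who authored 5 of the 20 merged PRs in a 400-star repo gets 400 × 0.25 = 100 for that repo.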

Experience levels:

Score	Label
< 10	Newcomer
10–99	Contributor
100–499	Active Contributor
500–1,999	Core Contributor
≥ 2,000	Open Source Leader
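The thresholds in this table map onto a simple lookup. The function name is hypothetical, and treating each lower bound as inclusive for fractional scores (for example, 99.5 counts as Contributor) is an assumption, since the table only lists whole-number ranges.

```typescript
// Map a total score to its experience label, checking the highest tier first.
function experienceLabel(score: number): string {
  if (score >= 2000) return "Open Source Leader";
  if (score >= 500) return "Core Contributor";
  if (score >= 100) return "Active Contributor";
  if (score >= 10) return "Contributor";
  return "Newcomer";
}
```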

Database

Full schema in schema.sql. Tables:

Status	Tables	Purpose
Current	`github_users`, `github_repos`, `github_pull_requests`, `user_repo_scores`, `user_scores`, `skills`, `user_skill_scores`	Pipeline: scraped data, scores
Current	`leaderboard`, `api_cache`	Consolidated leaderboard, parsed cache
Current	`conversations`, `messages`	Chat system
Current	`token_rate_limit`	Infra: rate limit tracking
Deprecated	`users_old`, `analyses_old`, `leaderboard_old`, `api_cache_old`	Legacy tables - kept for backward compatibility

Note: The consolidated tables were previously named leaderboard_v2 / api_cache_v2. They are now just leaderboard / api_cache. To apply the rename on an existing database, run sql/rename-v2-to-primary.sql in your SQL editor, then verify with npm run verify:rename or npx tsx src/scripts/verify-rename.ts.
