A Go-based automated tender scraper designed to extract, process, and store tender information from multiple e-procurement platforms. Includes features for handling document downloads, captcha solving, retry mechanisms, and logging failed extractions for reliability.
- Concurrent tender scraping across multiple domains and states.
- Detailed data extraction including:
  - Basic details
  - Critical dates
  - Work/item information
  - Tender documents
  - Payment and EMD fee details
  - Corrigendum and authority information
- Retry and failure handling with logging for failed tenders.
- Document downloads with offline and online instrument support.
- CSV and JSONL exports for structured data persistence.
- Session management for stateful scraping.
- Captcha solving support for protected portals.
.
├── cli # Command-line entry points
├── docDownloads # Handles tender document downloads and AWS integration
├── http # HTTP server utilities (optional API or monitoring)
├── scraper # Core scraping logic
│ ├── captcha # Captcha solver
│ ├── extract # Tender data extraction and parsing
│ ├── nav # Navigation and link extraction
│ └── pastTenders # Historical tender processing
├── session # Session management and state persistence
├── TenderData # Output directories for logs, links, and tenders
└── utils # Shared utilities, types, and helpers
- Go 1.21+
- Docker (optional, for containerized execution)
- Internet access to target tender portals
Clone the repository:
git clone https://github.com/yourusername/tender-scraper.git
cd tender-scraper
go mod tidy
CLI mode:
go run cli/main.go
Docker mode:
docker build -t tender-scraper .
docker-compose up
- Domains & States: Specify in your CLI input or configuration files.
- Output Path: The TenderData/ directory contains:
  - Tenders/ → extracted tender data (JSONL)
  - Links/ → scraped tender links (CSV)
  - Failed/ → failed extractions
  - Logs/ → scraping and session logs
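As an illustration of the JSONL layout used for Tenders/, the sketch below writes one JSON object per line. The `Tender` struct, its field names, and the `toJSONL` helper are assumptions made for this example; the project's actual record type is `utils.Tender`, whose fields are not listed in this README.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"strings"
)

// Tender is a hypothetical, trimmed-down record used only for this example.
// The real type lives in utils and may have different fields.
type Tender struct {
	ID     string `json:"id"`
	Title  string `json:"title"`
	Domain string `json:"domain"`
	State  string `json:"state"`
}

// toJSONL (hypothetical helper) encodes each tender as one JSON object per line.
func toJSONL(tenders []Tender) (string, error) {
	var sb strings.Builder
	enc := json.NewEncoder(&sb) // Encode writes a trailing newline after each value
	for _, tender := range tenders {
		if err := enc.Encode(tender); err != nil {
			return "", err
		}
	}
	return sb.String(), nil
}

func main() {
	out, err := toJSONL([]Tender{
		{ID: "T-1", Title: "Road repair", Domain: "example.gov", State: "KA"},
		{ID: "T-2", Title: "Bridge survey", Domain: "example.gov", State: "MH"},
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(out) // in the scraper this would be appended to a file under TenderData/Tenders/
}
```

Because every line is a complete JSON value, a partially written file remains readable up to the last full line, which suits a long-running scraper.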
- DataScraper: Manages a session and extracts single tender data.
- TenderParser: Handles parsing of tender pages using Colly.
- Navigation: Crawls links to tender pages and corrigenda.
- Failed Logs: Automatically writes tenders or search links that fail extraction.
- Retry Mechanism: Up to 3 attempts with exponential backoff per tender.
- Supports downloading tender-related PDFs and files.
- Handles offline and online payment instrument details for document access.
- Conversion of extracted data to a consistent utils.Tender format.
- Date parsing, string cleanup, and session helpers.
- Collector logs: TenderData/Logs/collectors
- Session logs: TenderData/Logs/sessions
- Failed tenders: TenderData/Failed/
All logs include timestamps, domain/state info, and a serial number for traceability.
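The retry behaviour listed above (up to 3 attempts with exponential backoff per tender) and the fields carried by each log entry can be sketched roughly as follows. This is a minimal sketch, not the project's code: `withRetry`, `backoffDelay`, `failedLine`, `errTemporary`, the base delay, and the log line layout are all assumptions for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// maxAttempts mirrors the "up to 3 attempts" stated in this README.
const maxAttempts = 3

// errTemporary is a stand-in error for this example only.
var errTemporary = errors.New("temporary portal error")

// backoffDelay returns the wait after a failed attempt (1-based):
// base, 2*base, 4*base, ...
func backoffDelay(attempt int, base time.Duration) time.Duration {
	return base << uint(attempt-1)
}

// withRetry calls fn until it succeeds or maxAttempts is reached.
// It returns the number of attempts made and the last error, if any.
func withRetry(base time.Duration, fn func() error) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return attempt, nil
		}
		if attempt < maxAttempts {
			time.Sleep(backoffDelay(attempt, base))
		}
	}
	return maxAttempts, fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

// failedLine formats a failure entry with a timestamp, domain/state, and
// serial number. The layout is assumed; err must be non-nil.
func failedLine(now time.Time, domain, state string, serial int, err error) string {
	return fmt.Sprintf("%s domain=%s state=%s serial=%d error=%q",
		now.UTC().Format(time.RFC3339), domain, state, serial, err)
}

func main() {
	calls := 0
	attempts, err := withRetry(10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return errTemporary // fail twice, then succeed
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)

	// A tender that never succeeds ends up as a line destined for TenderData/Failed/.
	if _, err := withRetry(10*time.Millisecond, func() error { return errTemporary }); err != nil {
		fmt.Println(failedLine(time.Now(), "example.gov", "KA", 42, err))
	}
}
```

Doubling the delay after each failure gives a struggling portal progressively more time to recover, and capping the attempts keeps one bad tender from stalling the run.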