A versatile web scraping application built to efficiently gather data from a variety of websites.
This project is ideal for automating data extraction tasks and transforming raw HTML data into structured formats.
This service application allows users to specify target websites and define the elements to extract, enabling seamless and customizable public data scraping. The tool is designed to handle a range of web scraping scenarios, from simple data extraction to complex, multi-page crawling tasks.
- User-friendly configuration
  Easily set up scraping tasks through a simple UI or via supported API requests
- Customizable scraping rules
  Users can specify target elements using CSS selectors
- Multiple users support
  Run scraper tasks for several users at once
- Direct link to data monitor service
  Configure scraped data thresholds and notifiers in supplementary monitor
- REST API
  Output data can be retrieved remotely by sending a request to the appropriate REST API endpoint
- OpenAPI documentation
  Interactive Swagger UI is available at /api/v1/docs and raw OpenAPI JSON at /api/v1/docs/openapi.json
- Error handling
  Built-in mechanisms to manage failed requests and handle dynamic content
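As a rough illustration of the REST API and OpenAPI items above, the sketch below builds the two documentation URLs and lists the paths described in the raw OpenAPI JSON. It assumes Node.js 18+ (for the global `fetch`), plain HTTP, and a service reachable at the address and port from your configuration. The helper names `buildDocsUrls` and `listApiPaths` are hypothetical and not part of the project.

```javascript
// Hypothetical helper: builds the documentation URLs named in the feature list.
// ASSUMPTION: the service is served over plain HTTP.
function buildDocsUrls(address, port) {
  const base = `http://${address}:${port}`;
  return {
    swaggerUi: `${base}/api/v1/docs`,
    openApiJson: `${base}/api/v1/docs/openapi.json`,
  };
}

// Hypothetical helper: downloads the raw OpenAPI JSON and returns the documented paths.
// Uses the global fetch available in Node.js 18+.
async function listApiPaths(address, port) {
  const { openApiJson } = buildDocsUrls(address, port);
  const response = await fetch(openApiJson);
  if (!response.ok) {
    throw new Error(`OpenAPI request failed with HTTP ${response.status}`);
  }
  const spec = await response.json();
  // "paths" is the standard OpenAPI field holding the endpoint definitions
  return Object.keys(spec.paths ?? {});
}
```

With a running service, a call such as `listApiPaths("localhost", 5000).then(console.log)` would print the available endpoints; the host and port here are placeholder values.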
- Node.js 18+ (service - LINK)
- MongoDB (data storage - LINK)
- Docker (all-in-one approach - LINK)
- Bruno (API testing - LINK)
Running the web-scraper service in Docker containers is strongly recommended. If that is not an option, a local installation is also possible and requires the following steps:
- Clone the repository:
  git clone https://github.com/piopon/web-scraper.git
- Navigate to the project directory:
  cd web-scraper
- Install dependencies:
  npm install
Before running the application service, create a .env file with the following data:
# connection parameters
SERVER_ADDRESS=[STRING] # service address
SERVER_PORT=[INTEGER] # service port number
# database settings
DB_PORT=[INTEGER] # the port for database connection
DB_NAME=[STRING] # the name of the database
DB_ADDRESS=[STRING] # the IP address for database
DB_USER=[STRING] # database user (authentication)
DB_PASSWORD=[STRING] # database password (authentication)
# data monitor settings
MONITOR_ADDRESS=[STRING] # monitor service address
MONITOR_PORT=[INTEGER] # monitor service port number
# scraper settings
SCRAP_INACTIVE_DAYS=[INTEGER] # number of days from last login to treat user as inactive
SCRAP_INTERVAL_SEC=[INTEGER] # default seconds interval between each scrap operation
SCRAP_EXTRAS_TYPE=[ENUM] # the type of extra value in data output
# internal hash and secrets
ENCRYPT_SALT=[STRING|INTEGER] # randomize salt value
SESSION_SHA=[STRING] # hash for session cookie
JWT_SECRET=[STRING] # hash for JSON Web Token
CHALLENGE_PREFIX=[STRING] # challenge token prefix
CHALLENGE_JOIN=[STRING] # challenge token separator
CHALLENGE_EOL_MINS=[INTEGER] # challenge deadline minutes
CHALLENGE_EOL_SEPARATOR=[STRING] # challenge deadline separator
# external authentication
GOOGLE_CLIENT_ID=[STRING] # external Google login client ID
GOOGLE_CLIENT_SECRET=[STRING] # secret for external Google login client
# demo functionality
DEMO_MODE=[ENUM] # the demo session mode (duplicate OR overwrite)
DEMO_BASE=[STRING] # base demo email
DEMO_USER=[STRING] # user template email
DEMO_PASS=[STRING] # base demo user password
# CI functionality
CI_USER=[STRING] # CI user email
CI_PASS=[STRING] # CI user password
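Because several of the settings above are marked as [INTEGER], a quick sanity check before starting the service can catch typos early. The sketch below is a minimal, standalone parser for KEY=VALUE lines and is not necessarily how the service itself reads the file. The helper names `parseEnv` and `findInvalidIntegers` are hypothetical.

```javascript
// Keys documented above as [INTEGER]; extend this list to match your .env file.
const INTEGER_KEYS = [
  "SERVER_PORT", "DB_PORT", "MONITOR_PORT",
  "SCRAP_INACTIVE_DAYS", "SCRAP_INTERVAL_SEC", "CHALLENGE_EOL_MINS",
];

// Hypothetical helper: parses KEY=VALUE lines, skipping blanks and full-line comments.
// Trailing " # ..." comments are stripped, so values containing " #" are not supported here.
function parseEnv(text) {
  const values = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const separator = line.indexOf("=");
    if (separator === -1) continue;
    const key = line.slice(0, separator).trim();
    const value = line.slice(separator + 1).replace(/\s+#.*$/, "").trim();
    values[key] = value;
  }
  return values;
}

// Returns the names of integer settings that are present but not whole numbers.
function findInvalidIntegers(values) {
  return INTEGER_KEYS.filter(
    (key) => key in values && !/^\d+$/.test(values[key])
  );
}
```

Passing the contents of the .env file to `parseEnv` and the result to `findInvalidIntegers` yields an empty list when every integer setting that is present looks valid.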
Docker image healthcheck notes:
- No extra configuration is required by default.
- The container healthcheck calls /api/v1/status and validates component states.
- Default accepted states:
  - web-database=running
  - web-components=running
  - web-server=running
  - web-scraper=running|stopped
- Component names received from the status endpoint are trimmed before matching (to support padded logger names).
An optional advanced override is available by setting the HEALTHCHECK_COMPONENT_STATES environment variable to a JSON object where each key is a component name and each value is the list of accepted states:
HEALTHCHECK_COMPONENT_STATES='{"web-database":["running"],"web-scraper":["running"],"web-components":["running"],"web-server":["running"]}'
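The matching rules described in the healthcheck notes can be sketched roughly as follows. This is an illustration only, not the project's actual healthcheck script. It assumes that the status endpoint yields a list of objects with `name` and `status` fields and that the override replaces the defaults entirely rather than merging with them. The names `resolveAcceptedStates` and `isHealthy` are hypothetical.

```javascript
// Defaults copied from the healthcheck notes above.
const DEFAULT_ACCEPTED_STATES = {
  "web-database": ["running"],
  "web-components": ["running"],
  "web-server": ["running"],
  "web-scraper": ["running", "stopped"],
};

// Reads the optional JSON override and falls back to the defaults when it is unset.
// ASSUMPTION: an override replaces the defaults instead of being merged with them.
function resolveAcceptedStates(env) {
  const raw = env.HEALTHCHECK_COMPONENT_STATES;
  if (raw === undefined || raw.trim() === "") return DEFAULT_ACCEPTED_STATES;
  return JSON.parse(raw);
}

// ASSUMPTION: components is a list of { name, status } objects from the status endpoint.
function isHealthy(components, acceptedStates) {
  // Names are trimmed before matching to tolerate padded logger names.
  const reported = new Map(
    components.map((component) => [component.name.trim(), component.status])
  );
  // Every configured component must be reported in one of its accepted states.
  return Object.entries(acceptedStates).every(([name, states]) =>
    states.includes(reported.get(name))
  );
}
```

In this sketch a component that is missing from the status response counts as unhealthy, which is a design choice of the example rather than documented behavior.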
Additionally, keep a top-level VERSION file in the repository root.
This file should contain the full runtime version string in the format version+sha.
The default value in the repository is used as a fallback when git metadata is unavailable.
To update VERSION from the current package.json version and git commit SHA, use:
npm run version:sync
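To make the version+sha format concrete, here is a small sketch that composes and splits such a string. It does not show how `npm run version:sync` is implemented. The helper names and any sample values passed to them are hypothetical.

```javascript
// Hypothetical helper: joins a package version and a commit SHA as "version+sha".
function formatRuntimeVersion(version, sha) {
  if (!version || !sha) {
    throw new Error("Both version and commit SHA are required");
  }
  return `${version.trim()}+${sha.trim()}`;
}

// Hypothetical helper: splits a "version+sha" string back into its two parts.
// Surrounding whitespace (such as a trailing newline in the file) is ignored.
function parseRuntimeVersion(text) {
  const value = text.trim();
  const separator = value.lastIndexOf("+");
  if (separator <= 0 || separator === value.length - 1) {
    throw new Error(`Expected "version+sha", got "${value}"`);
  }
  return {
    version: value.slice(0, separator),
    sha: value.slice(separator + 1),
  };
}
```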
There are two supported ways to run the web-scraper service:
- LOCAL
  - Start the MongoDB instance
  - Go to web-scraper directory and use the command:
    npm run start
  This will invoke the web-scraper locally on your platform.
- DOCKER
  - Go to web-scraper directory and use the command:
    docker compose up -d
  This will invoke the web-scraper in the Docker container in the detached mode (argument -d).
  In order to display the logs of the service type the command:
    docker logs scraper
Docker log rotation is configured for both app and mongodb services through docker-compose.yml:
- Logging driver: json-file
- Max single log file size: 10m
- Max retained log files: 5
These Docker logging driver settings keep container logs from growing without limits. To adjust log rotation, update max-size and max-file values in docker-compose.yml under x-logging.
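The effect of these two values can be estimated with simple arithmetic: the upper bound per container is max-size multiplied by max-file. The sketch below works this out, assuming that size suffixes are decimal multiples (so 10m is treated as 10,000,000 bytes). The function names are hypothetical.

```javascript
// Converts a Docker-style size such as "10m" into bytes.
// ASSUMPTION: k/m/g are treated as decimal multiples (1000-based).
function parseSize(size) {
  const match = /^(\d+)([kmg]?)$/i.exec(size.trim());
  if (!match) throw new Error(`Unsupported size value: ${size}`);
  const multipliers = { "": 1, k: 1e3, m: 1e6, g: 1e9 };
  return Number(match[1]) * multipliers[match[2].toLowerCase()];
}

// Upper bound for one container: max-size multiplied by max-file.
function maxLogBytes(maxSize, maxFile) {
  return parseSize(maxSize) * maxFile;
}
```

Under that assumption, `maxLogBytes("10m", 5)` gives 50,000,000 bytes, so the two services together would retain at most about 100 MB of container logs.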
After the service is up and running, the next steps are as follows:
- Open the web browser and navigate to the configured IP:PORT address.
- Log in to your account, create a new one, or open a demo session
- Customize your scraping tasks by modifying configuration groups and observers, and fill in all component data

- Go to the data monitor service and define thresholds and notifiers to make use of scraped values
  This button is available only when the MONITOR_ADDRESS (and optionally MONITOR_PORT) environment variable is defined.
Once the first observer has been added correctly, your data is scraped from the specified location.
Check the created users directory for scraped data values, or for error details if the configuration is incorrect.
Keep in mind that this directory may initially contain folders for CI and demo users, depending on the configuration.
The current status of web-scraper's components can be quickly checked in the bottom right corner:

After a successful login, this panel also contains additional links to:
web-scraper/
├── .github/workflows/ # GitHub workflows for CI/CD
├── docs/ # Requests documentation and docs assets
├── public # Frontend UI source files
├── src/ # Backend UI source files
├── test/ # Unit tests logic
├── .dockerignore # List of files ignored by Docker
├── .gitignore # List of files ignored by GIT
├── CODEOWNERS # List of code owners
├── docker-compose.yml # Docker compose file for this service and MongoDB
├── Dockerfile # Docker container recipe
├── LICENSE # GPL-2.0 license description
├── package-lock.json # Node.js snapshot of the dependency tree
├── package.json # Node.js project metadata
└── README.md # Top-level project description (this document)
Contributions are welcome! To contribute:
- Fork the repository.
- Create a new branch for your feature or bugfix.
- Submit a pull request with a clear description of your changes.
This project is licensed under the GPL-2.0 license. See the LICENSE file for details.
For questions or suggestions, feel free to contact me through GitHub or via email.
Created by PNK with ❤ @ 2023-2025


