web-scraper

A versatile web scraping application built to efficiently gather data from a variety of websites.
This project is ideal for automating data extraction tasks and transforming raw HTML data into structured formats.

Overview 💡

This service application allows users to specify target websites and define the elements to extract, enabling seamless and customizable public data scraping. The tool is designed to handle a range of web scraping scenarios, from simple data extraction to complex, multi-page crawling tasks.

Features 🎨

- User-friendly configuration: easily set up scraping tasks through a simple UI or via supported API requests
- Customizable scraping rules: specify target elements using CSS selectors
- Multi-user support: run scraper tasks for several users at once
- Direct link to data monitor service: configure scraped data thresholds and notifiers in the supplementary monitor
- REST API: retrieve output data remotely by sending a request to the appropriate REST API endpoint
- OpenAPI documentation: interactive Swagger UI is available at /api/v1/docs and raw OpenAPI JSON at /api/v1/docs/openapi.json
- Error handling: built-in mechanisms to manage failed requests and handle dynamic content
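
The raw OpenAPI JSON mentioned in the feature list can be used to discover which REST endpoints a running instance exposes. The following is a minimal sketch, not part of the project: the helper names `listEndpoints` and `fetchOpenApi` and the address `http://localhost:5000` are assumptions made for illustration, and it relies only on the global `fetch` available in Node.js 18+.

```javascript
// Sketch: list the operations described by the service's OpenAPI document.
// The helper names and the example address are hypothetical.

const HTTP_METHODS = ["get", "post", "put", "patch", "delete", "head", "options"];

/** Turns an OpenAPI document into lines such as "GET /api/v1/status". */
function listEndpoints(openApiDoc) {
  const endpoints = [];
  for (const [path, item] of Object.entries(openApiDoc.paths ?? {})) {
    for (const method of HTTP_METHODS) {
      if (item && item[method]) {
        endpoints.push(`${method.toUpperCase()} ${path}`);
      }
    }
  }
  return endpoints;
}

/** Downloads the raw OpenAPI JSON from the documented /api/v1/docs/openapi.json path. */
async function fetchOpenApi(baseUrl) {
  const response = await fetch(new URL("/api/v1/docs/openapi.json", baseUrl));
  if (!response.ok) {
    throw new Error(`Unexpected HTTP status: ${response.status}`);
  }
  return response.json();
}

// Example (not run automatically); use your own SERVER_ADDRESS and SERVER_PORT:
// fetchOpenApi("http://localhost:5000")
//   .then((doc) => console.log(listEndpoints(doc).join("\n")));
```

Keeping `listEndpoints` separate from the network call makes it easy to try against a saved copy of the JSON file as well.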

Requirements 📝

- Node.js 18+ (service)
- MongoDB (data storage)
- Docker (all-in-one approach)
- Bruno (API testing)

Installation 📥

Running the web-scraper service in Docker containers is strongly recommended. If that is not an option, a local installation is also possible and requires the following steps:

Clone the repository:
```
git clone https://github.com/piopon/web-scraper.git
```
Navigate to the project directory:
```
cd web-scraper
```
Install dependencies:
```
npm install
```

Configuration 🔧

Before running the service, create a .env file with the following data:

```
# connection parameters
SERVER_ADDRESS=[STRING]          # service address
SERVER_PORT=[INTEGER]            # service port number

# database settings
DB_PORT=[INTEGER]                # the port for database connection
DB_NAME=[STRING]                 # the name of the database
DB_ADDRESS=[STRING]              # the IP address of the database
DB_USER=[STRING]                 # database user (authentication)
DB_PASSWORD=[STRING]             # database password (authentication)

# data monitor settings
MONITOR_ADDRESS=[STRING]         # monitor service address
MONITOR_PORT=[INTEGER]           # monitor service port number

# scraper settings
SCRAP_INACTIVE_DAYS=[INTEGER]    # number of days since last login after which a user is treated as inactive
SCRAP_INTERVAL_SEC=[INTEGER]     # default interval in seconds between scrap operations
SCRAP_EXTRAS_TYPE=[ENUM]         # the type of extra value in data output

# internal hash and secrets
ENCRYPT_SALT=[STRING|INTEGER]    # random salt value
SESSION_SHA=[STRING]             # hash for session cookie
JWT_SECRET=[STRING]              # hash for JSON Web Token
CHALLENGE_PREFIX=[STRING]        # challenge token prefix
CHALLENGE_JOIN=[STRING]          # challenge token separator
CHALLENGE_EOL_MINS=[INTEGER]     # challenge deadline minutes
CHALLENGE_EOL_SEPARATOR=[STRING] # challenge deadline separator

# external authentication
GOOGLE_CLIENT_ID=[STRING]        # external Google login client ID
GOOGLE_CLIENT_SECRET=[STRING]    # secret for external Google login client

# demo functionality
DEMO_MODE=[ENUM]                 # the demo session mode (duplicate OR overwrite)
DEMO_BASE=[STRING]               # base demo email
DEMO_USER=[STRING]               # user template email
DEMO_PASS=[STRING]               # base demo user password

# CI functionality
CI_USER=[STRING]                 # CI user email
CI_PASS=[STRING]                 # CI user password
```
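
Because several of the settings above are typed as [INTEGER], a quick sanity check before starting the service can catch typos early. The sketch below is illustrative only: `findEnvProblems` is a hypothetical helper, and the default list of required keys is an assumption, since the service itself may treat a different set of variables as mandatory.

```javascript
// Keys documented above with the [INTEGER] type.
const INTEGER_KEYS = [
  "SERVER_PORT",
  "DB_PORT",
  "MONITOR_PORT",
  "SCRAP_INACTIVE_DAYS",
  "SCRAP_INTERVAL_SEC",
  "CHALLENGE_EOL_MINS",
];

// Assumption of this sketch: these keys are treated as mandatory.
const ASSUMED_REQUIRED_KEYS = ["SERVER_ADDRESS", "SERVER_PORT", "DB_ADDRESS", "DB_PORT", "DB_NAME"];

/** Returns a list of human-readable problems found in an env-like object. */
function findEnvProblems(env, requiredKeys = ASSUMED_REQUIRED_KEYS) {
  const problems = [];
  // Report required keys that are absent or empty.
  for (const key of requiredKeys) {
    if (env[key] === undefined || env[key] === "") {
      problems.push(`${key} is missing`);
    }
  }
  // Report integer-typed keys whose value is set but not made of digits only.
  for (const key of INTEGER_KEYS) {
    const value = env[key];
    if (value !== undefined && value !== "" && !/^\d+$/.test(value)) {
      problems.push(`${key} must be an integer, got "${value}"`);
    }
  }
  return problems;
}

// Usage (not run automatically): check the current process environment.
// const problems = findEnvProblems(process.env);
// if (problems.length > 0) {
//   console.error(problems.join("\n"));
//   process.exit(1);
// }
```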

Docker image healthcheck notes:

No extra configuration is required by default.
The container healthcheck calls /api/v1/status and validates component states. Default accepted states:
- web-database=running
- web-components=running
- web-server=running
- web-scraper=running|stopped

Component names received from the status endpoint are trimmed before matching (to support padded logger names).

An optional advanced override is available: set the HEALTHCHECK_COMPONENT_STATES environment variable to a JSON object in which each key is a component name and each value is the list of accepted states:
```
HEALTHCHECK_COMPONENT_STATES='{"web-database":["running"],"web-scraper":["running"],"web-components":["running"],"web-server":["running"]}'
```
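
The matching rules described above (trimmed component names, a list of accepted states per component, an optional JSON override) can be sketched as follows. This is not the project's healthcheck script: the function names are hypothetical, the `{ name, status }` shape of the status payload is an assumption, and so is the choice that an override replaces the defaults entirely rather than merging with them.

```javascript
// Defaults as listed in the healthcheck notes above.
const DEFAULT_STATES = {
  "web-database": ["running"],
  "web-components": ["running"],
  "web-server": ["running"],
  "web-scraper": ["running", "stopped"],
};

/** Reads the optional JSON override; falls back to the defaults when it is unset. */
function resolveAcceptedStates(rawOverride) {
  if (!rawOverride) {
    return DEFAULT_STATES;
  }
  // Assumption: the override fully replaces the defaults.
  return JSON.parse(rawOverride);
}

/**
 * components: array of { name, status } objects (assumed payload shape).
 * Every component listed in acceptedStates must report one of its accepted states.
 */
function isHealthy(components, acceptedStates) {
  // Trim names so padded logger names such as "web-server   " still match.
  const reported = new Map(components.map((c) => [c.name.trim(), c.status]));
  return Object.entries(acceptedStates).every(([name, states]) =>
    states.includes(reported.get(name))
  );
}

// Usage (not run automatically):
// const accepted = resolveAcceptedStates(process.env.HEALTHCHECK_COMPONENT_STATES);
// process.exit(isHealthy(statusPayload, accepted) ? 0 : 1);
```

In this sketch a component that is missing from the status payload counts as unhealthy, which is the conservative choice for a container healthcheck.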

Additionally, keep a top-level VERSION file in the repository root. This file should contain the full runtime version string in the format version+sha. The default value in the repository is used as a fallback when git metadata is unavailable.

To update VERSION from the current package.json version and git commit SHA, run:
```
npm run version:sync
```
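
To make the version+sha format concrete, here is a small sketch that builds and parses such a string. Both helpers are hypothetical and the sample values are invented; the actual version:sync script may, for instance, use a different SHA length.

```javascript
/** Joins a package version and a commit SHA into the version+sha form. */
function buildVersion(packageVersion, commitSha) {
  return `${packageVersion}+${commitSha}`;
}

/** Splits the content of a VERSION file back into its two parts. */
function parseVersion(text) {
  const trimmed = text.trim();
  const plus = trimmed.indexOf("+");
  if (plus === -1) {
    // No SHA part present, e.g. when only the package version is known.
    return { version: trimmed, sha: null };
  }
  return { version: trimmed.slice(0, plus), sha: trimmed.slice(plus + 1) };
}

// Usage (sample values are invented):
// buildVersion("1.2.3", "abc1234");  // "1.2.3+abc1234"
// parseVersion("1.2.3+abc1234\n");   // { version: "1.2.3", sha: "abc1234" }
```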

Usage 💻

There are two supported ways to run the web-scraper service:

LOCAL
- Start the MongoDB instance
- Go to web-scraper directory and use the command:
```
npm run start
```
This starts web-scraper locally on your machine.

DOCKER
- Go to web-scraper directory and use the command:
```
docker compose up -d
```
This starts web-scraper in a Docker container in detached mode (the -d argument).
To display the service logs, run:
```
docker logs scraper
```

Docker log rotation is configured for both the app and mongodb services through docker-compose.yml:

- Logging driver: json-file
- Max single log file size: 10m
- Max retained log files: 5

These logging driver settings keep container logs from growing without limit. To adjust log rotation, update the max-size and max-file values under x-logging in docker-compose.yml.

After the service is up and running, the next steps are as follows:

- Open a web browser and navigate to the configured IP:PORT address.
- Log in to your account, create a new one, or open a demo session.
- Customize your scraping tasks by modifying configuration groups and observers and filling in all component data.
- Go to the data monitor service and define thresholds and notifiers to make use of scraped values. This button is available only when the MONITOR_ADDRESS (and optionally MONITOR_PORT) environment variable is defined.

Once the first observer has been added correctly, your data is scraped from the specified location.

Check the created users directory for scraped data values, or for error details if the configuration is incorrect. Keep in mind that this directory may initially contain folders for the CI and demo users, depending on the configuration.

Additional information ℹ️

The current status of web-scraper's components can be quickly checked in the bottom-right corner.

After a successful login, this panel also contains links to:

- detailed scraper running status with logs
- scraper service configuration

Project Structure 📊

```
web-scraper/
├── .github/workflows/     # GitHub workflows for CI/CD
├── docs/                  # Requests documentation and docs assets
├── public/                # Frontend UI source files
├── src/                   # Backend source files
├── test/                  # Unit tests
├── .dockerignore          # List of files ignored by Docker
├── .gitignore             # List of files ignored by Git
├── CODEOWNERS             # List of code owners
├── docker-compose.yml     # Docker Compose file for this service and MongoDB
├── Dockerfile             # Docker container recipe
├── LICENSE                # GPL-2.0 license text
├── package-lock.json      # Node.js snapshot of the dependency tree
├── package.json           # Node.js project metadata
└── README.md              # Top-level project description (this document)
```

Contributing 🤝

Contributions are welcome! To contribute:

- Fork the repository.
- Create a new branch for your feature or bugfix.
- Submit a pull request with a clear description of your changes.

License 📜

This project is licensed under the GPL-2.0 license. See the LICENSE file for details.

Contact 💬

For questions or suggestions, feel free to contact me through GitHub or via email.

Created by PNK with ❤ @ 2023-2025
