diff --git a/.github/workflows/build.yml b/.github/workflows/build.yml
new file mode 100644
index 0000000..8721cd3
--- /dev/null
+++ b/.github/workflows/build.yml
@@ -0,0 +1,30 @@
+name: Build and Test
+
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main ]
+
+jobs:
+  build:
+    name: Build and Test
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Setup .NET
+        uses: actions/setup-dotnet@v4
+        with:
+          dotnet-version: '10.0.x'
+
+      - name: Restore dependencies
+        run: dotnet restore source/WebsiteValidator/
+
+      - name: Build
+        run: dotnet build source/WebsiteValidator/ --no-restore --configuration Release
+
+      - name: Test
+        run: dotnet test source/WebsiteValidator/ --no-build --configuration Release --verbosity normal
diff --git a/Anforderungen/R00001-simple-cleanup-and-welcome.md b/Anforderungen/R00001-simple-cleanup-and-welcome.md
new file mode 100644
index 0000000..34cd167
--- /dev/null
+++ b/Anforderungen/R00001-simple-cleanup-and-welcome.md
@@ -0,0 +1,40 @@
+# R00001 — Simple Cleanup and Welcome
+
+## Source
+
+GitHub Issue: #1
+
+## Description
+
+Basic repository hygiene: update the README and set up a build workflow for continuous integration.
+
+## Acceptance Criteria
+
+1. **Update README.md**: The README should reflect the current state of the project (.NET 10.0, current features, build instructions).
+2. **Create build workflow**: A GitHub Actions workflow (`.github/workflows/build.yml`) that automatically builds and runs the tests on push and pull request to `main`.
+
+## Technical Details
+
+### README.md
+
+- Update the project description
+- Document the tech stack (.NET 10.0, HtmlAgilityPack, System.CommandLine, xUnit)
+- Document the build and test commands
+- Update the usage instructions (CLI options)
+
+### Build Workflow (.github/workflows/build.yml)
+
+- Trigger: push and pull_request on `main`
+- Steps: Checkout, .NET 10.0 Setup, Restore, Build, Test
+- Runner: ubuntu-latest
+
+## Affected Files
+
+- `README.md` (modify)
+- `.github/workflows/build.yml` (new)
+
+## Test Strategy
+
+- Build workflow: check that the YAML file is syntactically correct
+- README: content review (manual verification)
+- The full build must still compile without errors
diff --git a/README.md b/README.md
index a6b6260..9ce8d2e 100644
--- a/README.md
+++ b/README.md
@@ -1,36 +1,61 @@
-# website-validator
-A dotnet application that crawls a website checking for http 404s and maybe more stuff later
+# Website Validator
 
-Intended usage:
+A command-line tool that crawls a website and validates it -- checking for HTTP errors, broken links, and collecting structural information for further analysis.
+
+## Tech Stack
+
+- .NET 10.0 (C#)
+- [HtmlAgilityPack](https://html-agility-pack.net/) -- HTML parsing and link extraction
+- [System.CommandLine](https://github.com/dotnet/command-line-api) -- CLI argument parsing
+- [xUnit](https://xunit.net/) -- Unit testing
+
+## Build
+
+```bash
+dotnet build source/WebsiteValidator/
 ```
-websitevalidator -u https://www.yourdomain.whatever -c [--limit xxx] -o structure.json
+
+## Test
+
+```bash
+dotnet test source/WebsiteValidator/
 ```
 
-Output:
+## Usage
 
-A big json file with a lot of information.
-A part of it being the structure of the website.
-Useful for further analysis. Its a simple big JSON file. A good thing if you like to use powershell e.g.. Just read the thing and do whatever.
+```bash
+websitevalidator --url [options]
+```
+
+### Options
+
+| Option | Short | Description |
+|---|---|---|
+| `--url` | `-u` | **(required)** The URL of the website to crawl |
+| `--links` | `-l` | List all links found on the page |
+| `--crawl` | `-c` | Crawl the full site and list all links |
+| `--ignore-ssl` | | Ignore SSL certificate errors |
+| `--human` | `-h` | Human-readable output (default is JSON) |
+| `--output` | `-o` | Save results to a file |
+| `--limit` | | Maximum number of pages to crawl |
+| `--additionalEntrypoints` | `--ae` | Text file with additional entry point URLs (e.g. from a sitemap) |
 
-## Next tasks
+### Examples
 
-### basic functionality
+```bash
+# List all links on a page
+websitevalidator -u https://example.com -l
 
- - [X] convert relative urls to absolute ones
- - [X] return the output either as human readable or json (is there a generic approach?); maybe add a --human switch for the more readable output and default to json
- - [X] return only distinct results
- - [X] enable some basic crawling activity
- - [X] remember the result of each url, so every url is only crawled once
- - [ ] only check external urls, but do not feed links from them back into the system. It is important that they are basically reachable but we do not want to check their pages, too)
- - [ ] also crawl resource files like linked images, css and javascript
- - [ ] add an option for a final human readable report?
+# Crawl a website and save results as JSON
+websitevalidator -u https://example.com -c -o results.json
 
-### validations
+# Crawl with a page limit and human-readable output
+websitevalidator -u https://example.com -c --limit 100 -h
+
+# Crawl with additional entry points from a sitemap
+websitevalidator -u https://example.com -c --ae sitemap-urls.txt -o results.json
+```
 
- - [ ] validations should be configurable without the need for a recompilation
- - [ ] group results by http status code, create error messages for 404s and other problems
- - [ ] pages shall not contain "Error", "Warning", or anything else that looks like a php problem
- - [ ] can I have an overview of which pages are mentioned in the sitemap and which are not
- - [ ] can I have an overview of pages which are possibly disallowed by robots.txt
- - [ ] we need something that allows us to mute known validation messages that we want to ignore
+## Output
 
+Results are written as JSON by default -- a structured file suitable for further analysis with tools like PowerShell, jq, or custom scripts. Use `--human` for a more readable console output.
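
Note on the new "Output" section: the README says the JSON is suitable for analysis with custom scripts, but the diff does not show the JSON schema. The following is a minimal Python sketch of such a script under that constraint. It makes no assumption about field names and simply walks whatever structure it finds, collecting every string value that looks like an http(s) URL. The helper names `iter_urls` and `distinct_urls` and the `pages`/`url`/`links` keys in the sample are hypothetical and are not taken from the tool.

```python
import json
from typing import Any, Iterator, List


def iter_urls(node: Any) -> Iterator[str]:
    """Yield every string value in a parsed JSON tree that looks like an http(s) URL."""
    if isinstance(node, str):
        if node.startswith(("http://", "https://")):
            yield node
    elif isinstance(node, dict):
        # Only values are inspected; dictionary keys are ignored.
        for value in node.values():
            yield from iter_urls(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_urls(item)


def distinct_urls(json_text: str) -> List[str]:
    """Return the sorted, de-duplicated URLs found anywhere in the JSON text."""
    return sorted(set(iter_urls(json.loads(json_text))))


# Hypothetical sample; the real output of websitevalidator may be shaped differently.
sample = (
    '{"pages": [{"url": "https://example.com/", '
    '"links": ["https://example.com/a", "https://example.com/"]}]}'
)
print(distinct_urls(sample))
```

To run it against a real result file, read the file saved with `-o results.json` and pass its text to `distinct_urls`. Because the walk is schema-agnostic, it should keep working if the output format changes, at the cost of not knowing which URLs are pages and which are links.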