30 changes: 30 additions & 0 deletions .github/workflows/build.yml
@@ -0,0 +1,30 @@
name: Build and Test

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    name: Build and Test
    runs-on: ubuntu-latest

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Setup .NET
        uses: actions/setup-dotnet@v4
        with:
          dotnet-version: '10.0.x'

      - name: Restore dependencies
        run: dotnet restore source/WebsiteValidator/

      - name: Build
        run: dotnet build source/WebsiteValidator/ --no-restore --configuration Release

      - name: Test
        run: dotnet test source/WebsiteValidator/ --no-build --configuration Release --verbosity normal
40 changes: 40 additions & 0 deletions Anforderungen/R00001-simple-cleanup-and-welcome.md
@@ -0,0 +1,40 @@
# R00001 — Simple Cleanup and Welcome

## Source

GitHub Issue: #1

## Description

Basic repository hygiene: update the README and set up a build workflow for continuous integration.

## Acceptance Criteria

1. **Update README.md**: The README should reflect the current state of the project (.NET 10.0, current features, build instructions).
2. **Create build workflow**: A GitHub Actions workflow (`.github/workflows/build.yml`) that automatically builds and runs the tests on push and pull request to `main`.

## Technical Details

### README.md

- Update the project description
- Document the tech stack (.NET 10.0, HtmlAgilityPack, System.CommandLine, xUnit)
- Document the build and test commands
- Update the usage instructions (CLI options)

### Build Workflow (.github/workflows/build.yml)

- Trigger: push and pull_request on `main`
- Steps: Checkout, .NET 10.0 Setup, Restore, Build, Test
- Runner: ubuntu-latest

## Affected Files

- `README.md` (modify)
- `.github/workflows/build.yml` (new)

## Test Strategy

- Build workflow: check the YAML file for syntactic correctness
- README: content review (manual verification)
- The full build must continue to compile without errors
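
The last check in the test strategy can be approximated locally before pushing. The following is a minimal sketch that mirrors the workflow's restore, build, and test steps with the same flags. The names `DRY_RUN`, `PROJECT_DIR`, `run_step`, and `ci_local` are illustrative assumptions and do not exist in the repository.

```bash
# Local mirror of the workflow's restore, build, and test steps.
# DRY_RUN, PROJECT_DIR, run_step, and ci_local are illustrative names,
# not part of the repository.
set -eu

PROJECT_DIR="${PROJECT_DIR:-source/WebsiteValidator/}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

# Print a command, then run it unless DRY_RUN is 1.
run_step() {
  printf '+ %s\n' "$*"
  if [ "$DRY_RUN" != "1" ]; then
    "$@"
  fi
}

# Same order and flags as the steps in .github/workflows/build.yml.
ci_local() {
  run_step dotnet restore "$PROJECT_DIR"
  run_step dotnet build "$PROJECT_DIR" --no-restore --configuration Release
  run_step dotnet test "$PROJECT_DIR" --no-build --configuration Release --verbosity normal
}

ci_local
```

By default the script only prints the three commands. Running it with `DRY_RUN=0` from the repository root executes them, and `set -e` stops at the first failing step, which is roughly how the workflow job behaves.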
75 changes: 50 additions & 25 deletions README.md
@@ -1,36 +1,61 @@
-# website-validator
-A dotnet application that crawls a website checking for http 404s and maybe more stuff later
+# Website Validator

-Intended usage:
+A command-line tool that crawls a website and validates it -- checking for HTTP errors, broken links, and collecting structural information for further analysis.

+## Tech Stack
+
+- .NET 10.0 (C#)
+- [HtmlAgilityPack](https://html-agility-pack.net/) -- HTML parsing and link extraction
+- [System.CommandLine](https://github.com/dotnet/command-line-api) -- CLI argument parsing
+- [xUnit](https://xunit.net/) -- Unit testing
+
+## Build
+
+```bash
+dotnet build source/WebsiteValidator/
 ```
-websitevalidator -u https://www.yourdomain.whatever -c [--limit xxx] -o structure.json
+
+## Test
+
+```bash
+dotnet test source/WebsiteValidator/
 ```

-Output:
+## Usage

-A big json file with a lot of information.
-A part of it being the structure of the website.
-Useful for further analysis. Its a simple big JSON file. A good thing if you like to use powershell e.g.. Just read the thing and do whatever.
+```bash
+websitevalidator --url <URL> [options]
+```
+
+### Options
+
+| Option | Short | Description |
+|---|---|---|
+| `--url` | `-u` | **(required)** The URL of the website to crawl |
+| `--links` | `-l` | List all links found on the page |
+| `--crawl` | `-c` | Crawl the full site and list all links |
+| `--ignore-ssl` | | Ignore SSL certificate errors |
+| `--human` | `-h` | Human-readable output (default is JSON) |
+| `--output` | `-o` | Save results to a file |
+| `--limit` | | Maximum number of pages to crawl |
+| `--additionalEntrypoints` | `--ae` | Text file with additional entry point URLs (e.g. from a sitemap) |

-## Next tasks
+### Examples

-### basic functionality
+```bash
+# List all links on a page
+websitevalidator -u https://example.com -l

-- [X] convert relative urls to absolute ones
-- [X] return the output either as human readable or json (is there a generic approach?); maybe add a --human switch for the more readable output and default to json
-- [X] return only distinct results
-- [X] enable some basic crawling activity
-- [X] remember the result of each url, so every url is only crawled once
-- [ ] only check external urls, but do not feed links from them back into the system. It is important that they are basically reachable but we do not want to check their pages, too)
-- [ ] also crawl resource files like linked images, css and javascript
-- [ ] add an option for a final human readable report?
+# Crawl a website and save results as JSON
+websitevalidator -u https://example.com -c -o results.json

-### validations
+# Crawl with a page limit and human-readable output
+websitevalidator -u https://example.com -c --limit 100 -h
+
+# Crawl with additional entry points from a sitemap
+websitevalidator -u https://example.com -c --ae sitemap-urls.txt -o results.json
+```

-- [ ] validations should be configurable without the need for a recompilation
-- [ ] group results by http status code, create error messages for 404s and other problems
-- [ ] pages shall not contain "Error", "Warning", or anything else that looks like a php problem
-- [ ] can I have an overview of which pages are mentioned in the sitemap and which are not
-- [ ] can I have an overview of pages which are possibly disallowed by robots.txt
-- [ ] we need something that allows us to mute known validation messages that we want to ignore
+## Output
+
+Results are written as JSON by default -- a structured file suitable for further analysis with tools like PowerShell, jq, or custom scripts. Use `--human` for a more readable console output.