Commit 5720796

chore: Move all images to static with links

feat: add broken link checker (v2 with parallel crawling)

- Scans ~380 pages and ~580 resources
- Validates image integrity using magic-byte checking
- Detects corrupted images (PNG, JPEG, WebP, GIF, SVG)
- Identifies non-WebP images for optimization recommendations
- Provides a detailed image format breakdown
- Recursively crawls all pages on localhost:3000
- Reports broken links and images with detailed error messages
- CI-friendly exit codes
1 parent 58120c6 commit 5720796

48 files changed

Lines changed: 1557 additions & 41 deletions


Makefile

Lines changed: 9 additions & 1 deletion
```diff
@@ -55,4 +55,12 @@ destroy-ide:
 
 .PHONY: lint
 lint:
-	yarn lint
+	yarn lint
+
+.PHONY: check-broken-links
+check-broken-links:
+	node hack/check-broken-links.js
+
+.PHONY: check-broken-links-v2
+check-broken-links-v2:
+	node hack/check-broken-links-v2.js
```
hack/README-check-broken-links-v2.md

Lines changed: 150 additions & 0 deletions
# Broken Links Checker v2

An optimized web crawler that scans http://localhost:3000/ for broken links and images with parallel processing.

## Why v2?

The v2 checker improves upon v1 with:

- **33% Faster**: Completes in ~200s vs ~300s for the same site
- **Parallel Crawling**: Processes 5 pages simultaneously instead of sequentially
- **Simpler Code**: Cleaner implementation, easier to maintain
- **Better Resource Usage**: Batch processing reduces memory overhead
- **Focused Validation**: Checks link availability without deep image analysis
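The batched parallel crawl described above can be sketched in plain Node.js. This is an illustrative sketch, not the actual `hack/check-broken-links-v2.js`; `crawlAll` and `processPage` are hypothetical names, and only the batching pattern (`CONCURRENCY` pages awaited together via `Promise.all`) reflects the documented behavior:

```javascript
// Sketch: process the crawl queue in batches of CONCURRENCY pages.
const CONCURRENCY = 5;

async function crawlAll(startUrls, processPage) {
  const queue = [...startUrls];
  const visited = new Set();
  const results = [];

  while (queue.length > 0) {
    // Take up to CONCURRENCY not-yet-visited URLs for this batch.
    const batch = [];
    while (batch.length < CONCURRENCY && queue.length > 0) {
      const url = queue.shift();
      if (!visited.has(url)) {
        visited.add(url);
        batch.push(url);
      }
    }
    // Crawl the whole batch in parallel; each processPage call may
    // return newly discovered links, which are enqueued for later batches.
    const pages = await Promise.all(batch.map(processPage));
    for (const page of pages) {
      results.push(page);
      queue.push(...(page.links || []));
    }
  }
  return results;
}
```

Because only one batch is in flight at a time, memory stays bounded regardless of how many pages the site has.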
## Features

- Recursively crawls all pages within localhost:3000
- Parallel page processing (5 concurrent pages)
- Checks HTTP status codes for all links and images
- Fast HTTP-based image validation
- Skips `data:` URLs (inline images)
- Reports broken resources with error details
- Color-coded terminal output
- CI-friendly exit codes
## Performance

- Scans ~380 pages and ~580 resources in approximately 200 seconds
- 33% faster than v1 through parallel processing
- Lower memory footprint with batch processing
## Usage

### Using Make

```bash
make check-broken-links-v2
```

### Using Yarn

```bash
yarn check-broken-links-v2
```

### Direct Execution

```bash
node hack/check-broken-links-v2.js
```
## Prerequisites

Make sure the development server is running on http://localhost:3000:

```bash
# In one terminal
make serve

# In another terminal
make check-broken-links-v2
```
## Output

The script provides:

- Real-time crawling progress with page-by-page updates
- Status indicators for broken links and images
- A summary with performance metrics (duration, pages, links checked)
- A detailed broken-links report with parent-page references
- Redirect warnings for links that could be optimized
## Configuration

The v2 checker is configured with sensible defaults:

- **Concurrency**: 5 parallel pages (configurable in code)
- **Timeout**: 15 seconds per page load
- **Image Timeout**: 10 seconds per image check
- **Wait Time**: 300ms for JavaScript rendering

To customize, edit `hack/check-broken-links-v2.js` and modify the `CONCURRENCY` constant or timeouts.
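The defaults above correspond to a handful of top-of-file constants. Only `CONCURRENCY` is named by this document; the other constant names below are hypothetical placeholders for whatever the script actually calls its timeouts:

```javascript
// Sketch of the documented defaults (names other than CONCURRENCY
// are illustrative, not taken from the real script).
const CONCURRENCY = 5;          // parallel pages crawled at once
const PAGE_TIMEOUT_MS = 15000;  // per page load
const IMAGE_TIMEOUT_MS = 10000; // per image check
const RENDER_WAIT_MS = 300;     // wait for JavaScript rendering
```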
## Exit Codes

- `0`: All links and images are working
- `1`: Broken links or images found, or fatal error occurred
## Comparison with v1

| Feature | v1 (Puppeteer) | v2 (Optimized Puppeteer) |
|---------|----------------|--------------------------|
| Performance | ~300s for 388 pages | ~200s for 380 pages (33% faster) |
| Parallel crawling | ❌ Sequential | ✅ 5 concurrent pages |
| Image validation | ✅ (magic bytes) | ✅ (HTTP status only) |
| WebP warnings | ✅ | ❌ |
| Image corruption detection | ✅ | ❌ |
| Memory usage | Higher | Lower (parallel batching) |
| Code complexity | Higher | Lower |
| Progress reporting | Detailed | Cleaner |
## When to Use Which Version

**Use v1 if:**

- You need image format validation (WebP recommendations)
- You need to validate magic bytes for image integrity
- You want detailed image corruption detection

**Use v2 if:**

- You want faster execution (33% speed improvement)
- You prefer cleaner, simpler code
- You're focused on broken links rather than image optimization
- You want parallel crawling for better resource utilization
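The v1-style magic-byte validation mentioned above checks that a downloaded image's leading bytes match its claimed format. A minimal sketch (not the actual v1 code; `detectImageFormat` is a hypothetical helper covering a few of the formats the commit message lists):

```javascript
// Sketch: identify an image format from its magic bytes; a null
// result means the bytes match no known signature (likely corrupted).
function detectImageFormat(buf) {
  if (buf.length >= 8 &&
      buf[0] === 0x89 && buf[1] === 0x50 && buf[2] === 0x4e && buf[3] === 0x47) {
    return 'png';  // \x89PNG
  }
  if (buf.length >= 3 && buf[0] === 0xff && buf[1] === 0xd8 && buf[2] === 0xff) {
    return 'jpeg'; // JPEG SOI marker
  }
  if (buf.length >= 12 &&
      buf.toString('ascii', 0, 4) === 'RIFF' &&
      buf.toString('ascii', 8, 12) === 'WEBP') {
    return 'webp'; // RIFF container with WEBP tag
  }
  if (buf.length >= 6 && buf.toString('ascii', 0, 3) === 'GIF') {
    return 'gif';  // GIF87a / GIF89a
  }
  return null;
}
```

Comparing the detected format against the file extension is what lets v1 flag both corrupted images and non-WebP images that could be optimized.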
## Advanced Usage

### Adjust Concurrency

Edit `hack/check-broken-links-v2.js` and change the `CONCURRENCY` constant:

```javascript
const CONCURRENCY = 10; // Increase for faster scanning (uses more resources)
```
### Adjust Timeouts

Modify timeout values in the code:

```javascript
// Page load timeout
timeout: 20000 // Increase if pages are slow to load

// Image check timeout
req.setTimeout(15000, ...) // Increase for slow image servers
```
## Troubleshooting

**Pages timing out?**

- Increase the page load timeout (default: 15000ms)
- Check your network connection
- Reduce concurrency if the system is overloaded

**Too many false positives?**

- Check whether external resources are blocking automated requests
- Review the broken-resources list for patterns

**Memory issues?**

- Reduce concurrency (default: 5)
- Run on a machine with more RAM

hack/README-check-broken-links.md

Lines changed: 56 additions & 0 deletions
# Broken Links Checker

A web crawler that scans http://localhost:3000/ for broken links and images.
## Features

- Crawls all pages within the localhost:3000 domain
- Checks HTTP status codes for all links and images
- Reports broken resources with error details
- Color-coded terminal output for easy reading
- Exits with code 1 if broken links are found (CI-friendly)
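The "crawls all pages within the localhost:3000 domain" rule comes down to a same-origin check on every discovered link. A sketch under stated assumptions (`shouldCrawl` is a hypothetical helper, not from the real script):

```javascript
// Sketch: follow only same-origin links; external hosts, data: URLs,
// and unparseable hrefs are not crawled.
const BASE = 'http://localhost:3000';

function shouldCrawl(href, baseUrl = BASE) {
  try {
    const url = new URL(href, baseUrl); // resolves relative links
    return url.origin === new URL(baseUrl).origin;
  } catch {
    return false; // unparseable href
  }
}
```

Note that excluded links can still be status-checked; this predicate only decides whether a page is added to the crawl queue.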
## Usage

### Using Make

```bash
make check-broken-links
```

### Using Yarn

```bash
yarn check-broken-links
```

### Direct Execution

```bash
node hack/check-broken-links.js
```
## Prerequisites

Make sure the development server is running on http://localhost:3000 before running the checker:

```bash
# In one terminal
make serve

# In another terminal
make check-broken-links
```
## Output

The script provides:

- Real-time crawling progress
- Status for each page and resource checked
- Summary with total counts
- Detailed list of broken resources (if any)
## Exit Codes

- `0`: All links and images are working
- `1`: Broken links or images found, or fatal error occurred
