Commit cfb38f5: 2024.12.12 (#112)
* Update license date, gosh this is an old project
* Update all requirements
* Move stopwords to their own file and integrate trafilatura with BeautifulSoup, because reading HTML twice uses more memory...yay?
* Get rid of the custom stemmer; we're using trafilatura's now
* Update Dockerfile to latest
* Update pyproject.toml
* Update Dockerfile to correctly install the package to the /python-seo-analyzer directory
* Change talk() method to as_dict() and add new metadata fields to Page output
* Add .dockerignore file
* Ignore dotenv files
* Add LLM analyst
* Add langchain because I forgot it, eek!
* Add LLM analyst to analyzer
* Add LLM analyst to __main__
* Add LLM analyst to page.py
* Add LLM analyst to website.py
* Increase HTTP connect and read timeouts
* Add test_analyze_with_llm test
* Add more copy to the README.md file
1 parent 410dffe commit cfb38f5

18 files changed: 859 additions & 596 deletions

.dockerignore

Lines changed: 14 additions & 0 deletions
```diff
@@ -0,0 +1,14 @@
+.env
+.vscode
+.github
+.pytest_cache
+.git
+.dockerignore
+.gitignore
+*.pyc
+env/
+venv/
+*/__pycache__/*
+tests/
+Dockerfile
+*.pyc
```

.gitignore

Lines changed: 1 addition & 0 deletions
```diff
@@ -1,6 +1,7 @@
 # I don't want the python virtual env in github!
 venv
 env
+.env
 
 # nor visual
 .vscode
```

Dockerfile

Lines changed: 12 additions & 4 deletions
```diff
@@ -1,9 +1,17 @@
-FROM python:3-alpine
+FROM python:3.12-bullseye
 
-WORKDIR /app
+RUN apt-get update -y && apt-get upgrade -y
 
-COPY . /app
+RUN pip3 install --upgrade pip
+RUN pip3 install uv
 
-RUN python3 setup.py install
+COPY ./requirements.txt /python-seo-analyzer/
+
+RUN uv pip install --system --verbose --requirement /python-seo-analyzer/requirements.txt
+RUN uv cache clean --verbose
+
+COPY . /python-seo-analyzer
+
+RUN python3 -m pip install /python-seo-analyzer
 
 ENTRYPOINT ["/usr/local/bin/seoanalyze"]
```
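
The rebuilt image can be exercised locally; a minimal sketch, where the `python-seo-analyzer` tag is an example (the published image on Docker Hub is `sethblack/python-seo-analyzer`):

```shell
# Build the image from the repository root (tag name is an example).
docker build -t python-seo-analyzer .

# Arguments after the image name pass straight through to the
# seoanalyze entrypoint.
docker run --rm python-seo-analyzer https://www.sethserver.com/ -f html > results.html
```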

LICENSE

Lines changed: 1 addition & 1 deletion
```diff
@@ -1,4 +1,4 @@
-Copyright 2012-2021 Seth Black.
+Copyright 2012-2024 Seth Black.
 All rights reserved.
 
 Redistribution and use in source and binary forms, with or without modification,
```

README.md

Lines changed: 11 additions & 4 deletions
````diff
@@ -1,21 +1,23 @@
-Python SEO Analyzer
+Python SEO and GEO Analyzer
 ===================
 
-An SEO tool that analyzes the structure of a site, crawls the site, counts words in the body of the site and warns of any technical SEO issues.
+A modern SEO and GEO analysis tool that combines technical optimization and authentic human value. Beyond traditional site crawling and structure analysis, it uses AI to evaluate content's expertise signals, conversational engagement, and cross-platform presence. It helps you maintain strong technical foundations while ensuring your site demonstrates genuine authority and value to real users.
 
-Requires Python 3.6+, BeautifulSoup4 and urllib3.
+The AI features were heavily influenced by the clickbait-titled SEL article [A 13-point roadmap for thriving in the age of AI search](https://searchengineland.com/seo-roadmap-ai-search-449199).
 
 Installation
 ------------
 
 ### PIP
 
 ```
-pip3 install pyseoanalyzer
+pip install pyseoanalyzer
 ```
 
 ### Docker
 
+The docker image is available on [Docker Hub](https://hub.docker.com/r/sethblack/python-seo-analyzer) and can be run with the same command-line arguments as the script.
+
 ```
 docker run sethblack/python-seo-analyzer [ARGS ...]
 ```
@@ -79,6 +81,11 @@ Alternatively, you can run the analysis as a script from the seoanalyzer folder.
 python -m seoanalyzer https://www.sethserver.com/ -f html > results.html
 ```
 
+AI Optimization
+---------------
+
+The first pass of AI optimization features use Anthropic's `claude-3-sonnet-20240229` model to evaluate the content of the site. You will need to have an API key from [Anthropic](https://www.anthropic.com/) to use this feature. The API key needs to be set as the environment variable `ANTHROPIC_API_KEY`. I recommend using a `.env` file to set this variable. Once the API key is set, the AI optimization features can be enabled with the `--run-llm-analysis` flag.
+
 Notes
 -----
 
````
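
The README recommends keeping `ANTHROPIC_API_KEY` in a `.env` file. The project's actual loader isn't shown in this diff; as an illustration only, a toy `.env` parser (a hand-rolled stand-in for something like python-dotenv, with a hypothetical key value):

```python
import os

def load_env_text(text: str) -> None:
    """Toy .env parser: reads KEY=VALUE lines into os.environ."""
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Note: unlike python-dotenv's default, this overwrites existing vars.
        os.environ[key.strip()] = value.strip()

# Hypothetical .env contents; the key name is the one the README requires.
load_env_text("# local secrets\nANTHROPIC_API_KEY=sk-example-key\n")
print(os.environ["ANTHROPIC_API_KEY"])  # sk-example-key
```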

pyproject.toml

Lines changed: 7 additions & 6 deletions
```diff
@@ -4,17 +4,18 @@ build-backend = "hatchling.build"
 
 [project]
 name = "pyseoanalyzer"
-version = "2024.04.21"
+version = "2024.12.12"
 authors = [
   {name = "Seth Black", email = "sblack@sethserver.com"},
 ]
 dependencies = [
   "beautifulsoup4>=4.12.3",
-  "certifi>=2024.2.2",
-  "Jinja2>=3.1.3",
-  "lxml>=5.2.1",
-  "MarkupSafe>=2.1.5",
-  "urllib3>=2.2.1",
+  "certifi>=2024.8.30",
+  "Jinja2>=3.1.4",
+  "lxml>=5.3.0",
+  "MarkupSafe>=3.0.2",
+  "trafilatura>=2.0.0",
+  "urllib3>=2.2.3",
 ]
 requires-python = ">= 3.8"
 description = "An SEO tool that analyzes the structure of a site, crawls the site, count words in the body of the site and warns of any technical SEO issues."
```

pyseoanalyzer/__init__.py

Lines changed: 0 additions & 1 deletion
```diff
@@ -1,4 +1,3 @@
 #!/usr/bin/env python3
 
 from .analyzer import analyze
-from .stemmer import stem
```

pyseoanalyzer/__main__.py

Lines changed: 7 additions & 0 deletions
```diff
@@ -46,6 +46,12 @@ def main():
         action="store_false",
         help="Analyze all the existing inner links as well (might be time consuming).",
     )
+    arg_parser.add_argument(
+        "--run-llm-analysis",
+        default=False,
+        action="store_true",
+        help="Run LLM analysis on the content.",
+    )
 
     args = arg_parser.parse_args()
 
@@ -55,6 +61,7 @@ def main():
         analyze_headings=args.analyze_headings,
         analyze_extra_tags=args.analyze_extra_tags,
         follow_links=args.no_follow_links,
+        run_llm_analysis=args.run_llm_analysis,
     )
 
     if args.output_format == "html":
```
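
The new flag uses argparse's standard `store_true` pattern, so the feature is off unless the flag is passed. A self-contained sketch of that behavior:

```python
import argparse

parser = argparse.ArgumentParser(prog="seoanalyze")
parser.add_argument(
    "--run-llm-analysis",
    default=False,
    action="store_true",
    help="Run LLM analysis on the content.",
)

# Flag absent -> False; flag present -> True (no value needed).
print(parser.parse_args([]).run_llm_analysis)                      # False
print(parser.parse_args(["--run-llm-analysis"]).run_llm_analysis)  # True
```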

pyseoanalyzer/analyzer.py

Lines changed: 10 additions & 6 deletions
```diff
@@ -2,15 +2,18 @@
 from operator import itemgetter
 from .website import Website
 
+
 def calc_total_time(start_time):
     return time.time() - start_time
 
+
 def analyze(
     url,
     sitemap_url=None,
     analyze_headings=False,
     analyze_extra_tags=False,
     follow_links=True,
+    run_llm_analysis=False,
 ):
     start_time = time.time()
 
@@ -22,17 +25,18 @@ def analyze(
     }
 
     site = Website(
-        url,
-        sitemap_url,
-        analyze_headings,
-        analyze_extra_tags,
-        follow_links,
+        base_url=url,
+        sitemap=sitemap_url,
+        analyze_headings=analyze_headings,
+        analyze_extra_tags=analyze_extra_tags,
+        follow_links=follow_links,
+        run_llm_analysis=run_llm_analysis,
     )
 
     site.crawl()
 
     for p in site.crawled_pages:
-        output["pages"].append(p.talk())
+        output["pages"].append(p.as_dict())
 
     output["duplicate_pages"] = [
         list(site.content_hashes[p])
```
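
The `Website(...)` call switches from positional to keyword arguments, which keeps call sites unambiguous as new defaulted parameters like `run_llm_analysis` are added. A toy sketch of the idea (the `Site` class here is hypothetical, not the project's actual `Website`):

```python
class Site:
    """Toy stand-in illustrating why keyword arguments age better."""

    def __init__(self, base_url, sitemap=None, follow_links=True,
                 run_llm_analysis=False):
        self.base_url = base_url
        self.run_llm_analysis = run_llm_analysis

# Keyword calls stay correct even if more defaulted parameters are
# inserted into the signature later; positional calls would silently
# bind arguments to the wrong parameters.
site = Site(base_url="https://example.com/", run_llm_analysis=True)
print(site.base_url, site.run_llm_analysis)
```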

pyseoanalyzer/http.py

Lines changed: 1 addition & 1 deletion
```diff
@@ -8,7 +8,7 @@ def __init__(self):
         user_agent = {"User-Agent": "Mozilla/5.0"}
 
         self.http = PoolManager(
-            timeout=Timeout(connect=1.0, read=2.0),
+            timeout=Timeout(connect=2.0, read=7.0),
             cert_reqs="CERT_REQUIRED",
             ca_certs=certifi.where(),
             headers=user_agent,
```
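
The widened timeouts use urllib3's `Timeout` helper, which tracks the two network phases separately; a small sketch, assuming urllib3 is installed:

```python
from urllib3.util.timeout import Timeout

# Mirror the new per-phase timeouts from pyseoanalyzer/http.py.
timeout = Timeout(connect=2.0, read=7.0)

# connect_timeout caps the TCP handshake; read_timeout caps each
# socket read once the connection is established.
print(timeout.connect_timeout, timeout.read_timeout)
```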
