Comparative security analysis of AI-generated vs human-written Flask web applications using automated vulnerability scanning (OWASP ZAP).
This project examines whether AI-generated Flask web applications contain more or different exploitable vulnerabilities than equivalent human-written apps. Each application is scanned using OWASP ZAP, and findings are recorded in data/dataset.csv by severity and vulnerability type.
The data/analysis.ipynb notebook processes raw OWASP ZAP scan outputs and transforms them into a structured dataset for analysis. It handles vulnerability categorization, severity mapping, and exploitability labeling for both AI-generated and human-written applications. The processed data is then used to compute comparative metrics such as risk score and exploitability rate.
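The transformation step described above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes the scan output is ZAP's traditional JSON report (a `site` list whose entries carry `alerts` with `alert` and `riskcode` fields), and the helper name `alerts_to_rows`, the sample report, and the decision to leave `exploitable` blank for later labeling are all hypothetical.

```python
import csv
import io

# Assumed mapping from ZAP's numeric "riskcode" to the severity labels used
# in dataset.csv; informational alerts (riskcode 0) are skipped.
RISKCODE_TO_SEVERITY = {"3": "high", "2": "medium", "1": "low"}


def alerts_to_rows(report, app_id, app_type, feature):
    """Flatten a ZAP-style JSON report (already parsed into a dict) into dataset rows."""
    rows = []
    for site in report.get("site", []):
        for alert in site.get("alerts", []):
            severity = RISKCODE_TO_SEVERITY.get(str(alert.get("riskcode")))
            if severity is None:
                continue  # informational or unrecognised risk level
            rows.append({
                "app_id": app_id,
                "app_type": app_type,
                "feature": feature,
                "vulnerability": alert.get("alert", ""),
                "severity": severity,
                "exploitable": "",  # assumption: filled in by a later labeling step
            })
    return rows


# Hypothetical report fragment, for illustration only.
sample_report = {
    "site": [{
        "alerts": [
            {"alert": "Content Security Policy (CSP) Header Not Set", "riskcode": "2"},
            {"alert": "X-Content-Type-Options Header Missing", "riskcode": "1"},
            {"alert": "Some informational alert", "riskcode": "0"},
        ]
    }]
}

rows = alerts_to_rows(sample_report, "login_app1", "ai", "login")

# Write the rows as CSV text with the same column order as the dataset fields.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The field names in the report may differ between ZAP versions and report templates, so check them against an actual scan output before relying on this mapping.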
Prompt2Exploit/
├── data/
│ ├── ai_apps/
│ │ ├── login_app1/2/3.py
│ │ ├── form_app1/2/3.py
│ │ ├── notes_app2.py
│ │ └── api_app1/3.py
│ ├── human_apps/
│ │ ├── login_app1/2/3.py
│ │ ├── form_app1/2/3.py
│ │ ├── notes_app2.py
│ │ └── api_app1/3.py
│ ├── dataset.csv
│ └── analysis.ipynb
├── prompt.txt
└── Zap_Levels.txt
| Category | Apps | Description |
|---|---|---|
| Login | login_app1/2/3 | Auth flows with registration, admin-only login, dashboard |
| Forms | form_app1/2/3 | Contact, feedback, and inquiry submission forms |
| Notes/Todos | notes_app2 | CRUD note-taking and todo list apps |
| API/Services | api_app1/3 | REST API, URL shortener, file upload service |
The ai_apps were generated with the ChatGPT 5.3 mini model; the master prompt and sub-prompt are in prompt.txt.
The human_apps were sourced from: henry-richard7, patrickloeber, gilcierweb, pj8912, CoreyMSchafer, TechWithTim, pallets patterns.
pip install flask flask-sqlalchemy
cd data/ai_apps # or data/human_apps
python <app_name>.py
# Accessible at http://127.0.0.1:5000
data/dataset.csv records scan results for both ai and human app types:
| Field | Values |
|---|---|
| app_id | e.g. login_app1, notes_app2 |
| app_type | ai, human |
| feature | login, notes, form, api |
| vulnerability | ZAP category |
| severity | high, medium, low |
| exploitable | yes, no |
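A quick sanity check of the dataset against the field table above might look like the sketch below. It assumes the CSV column headers match the field names exactly, which the source does not confirm; `validate_rows` and the sample rows are hypothetical.

```python
import csv
import io

# Allowed values per field, taken from the field table above.
ALLOWED = {
    "app_type": {"ai", "human"},
    "feature": {"login", "notes", "form", "api"},
    "severity": {"high", "medium", "low"},
    "exploitable": {"yes", "no"},
}
REQUIRED = ["app_id", "app_type", "feature", "vulnerability", "severity", "exploitable"]


def validate_rows(rows):
    """Return a list of (row_number, message) for rows that break the schema."""
    problems = []
    for number, row in enumerate(rows, start=1):
        for field in REQUIRED:
            if not (row.get(field) or "").strip():
                problems.append((number, f"missing {field}"))
        for field, allowed in ALLOWED.items():
            value = (row.get(field) or "").strip().lower()
            if value and value not in allowed:
                problems.append((number, f"unexpected {field}: {value!r}"))
    return problems


# Made-up rows for illustration; read data/dataset.csv instead to check the real file.
sample_csv = """app_id,app_type,feature,vulnerability,severity,exploitable
login_app1,ai,login,Example alert,medium,no
notes_app2,human,notes,Example alert,critical,yes
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
problems = validate_rows(rows)
print(problems)
```

Row numbers count data rows only, so the header line is not included.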
You can find each vulnerability category, and how it was applied, in the dataset.
| App Type | Risk Score |
|---|---|
| AI | 1.212766 |
| Human | 1.440000 |
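Risk score formula:
As the example shows, each finding is weighted by severity (high = 3, medium = 2, low = 1) and multiplied by 1.5 if it is exploitable or 0.5 if it is not. The risk score is the sum of these values divided by the total number of vulnerabilities.
Example:
For an application with:
- 1 high (exploitable)
- 2 medium (not exploitable)
- 1 low (not exploitable)
Calculation:
- High: 3 × 1.5 = 4.5
- Medium: 2 × 0.5 = 1.0 each → 2.0 total
- Low: 1 × 0.5 = 0.5
Sum = 4.5 + 2.0 + 0.5 = 7.0
Total vulnerabilities = 4
Final Risk Score = 7.0 / 4 = 1.75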
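The worked example above can be reproduced with a short function. The severity weights (3, 2, 1) and exploitability factors (1.5, 0.5) are inferred from that example and are not confirmed against the notebook; `risk_score` is a hypothetical name.

```python
# Weights inferred from the worked example above, not confirmed against the notebook.
SEVERITY_WEIGHT = {"high": 3, "medium": 2, "low": 1}
EXPLOIT_FACTOR = {"yes": 1.5, "no": 0.5}


def risk_score(findings):
    """Mean of severity weight x exploitability factor over (severity, exploitable) pairs."""
    if not findings:
        return 0.0
    total = sum(SEVERITY_WEIGHT[sev] * EXPLOIT_FACTOR[exp] for sev, exp in findings)
    return total / len(findings)


# 1 high (exploitable), 2 medium (not exploitable), 1 low (not exploitable)
example = [("high", "yes"), ("medium", "no"), ("medium", "no"), ("low", "no")]
print(risk_score(example))  # 1.75
```

Because the score is an average per finding, an app with many low-severity, non-exploitable findings can score lower than an app with a few severe ones.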
| App Type | Exploitable | Proportion |
|---|---|---|
| AI | No | 0.978723 |
| AI | Yes | 0.021277 |
| Human | No | 0.853333 |
| Human | Yes | 0.146667 |
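Proportions like those in the table above can be computed by counting findings per app type. The sketch below uses made-up rows rather than the study's data, and `exploitable_proportions` is a hypothetical helper; it assumes each row carries `app_type` and `exploitable` fields as in dataset.csv.

```python
from collections import Counter


def exploitable_proportions(rows):
    """Map (app_type, exploitable) to its share of that app type's findings."""
    pair_counts = Counter((r["app_type"], r["exploitable"]) for r in rows)
    type_counts = Counter(r["app_type"] for r in rows)
    return {pair: n / type_counts[pair[0]] for pair, n in pair_counts.items()}


# Made-up rows, not the study's data.
rows = (
    [{"app_type": "ai", "exploitable": "no"}] * 3
    + [{"app_type": "ai", "exploitable": "yes"}] * 1
    + [{"app_type": "human", "exploitable": "no"}] * 1
    + [{"app_type": "human", "exploitable": "yes"}] * 1
)
proportions = exploitable_proportions(rows)
for (app_type, exploitable), share in sorted(proportions.items()):
    print(app_type, exploitable, round(share, 6))
```

The shares for each app type sum to 1, which matches how the Yes and No rows pair up in the table.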
This study shows that AI-generated Flask applications and human-written Flask applications differ not only in the number of vulnerabilities but also in their nature and severity distribution.
AI-generated applications tend to produce consistent security misconfigurations such as missing HTTP security headers and cookie attribute issues. These are systematic but generally low to medium severity, resulting in a lower exploitability rate.
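To make the two misconfiguration classes concrete, the sketch below checks a response for absent security headers and cookie attributes. It is a simplified illustration, not how ZAP implements its passive rules; the header and attribute lists are assumptions about what such checks commonly cover, and the sample response is hypothetical.

```python
# Header names that passive scanners commonly look for; the exact set
# recorded in dataset.csv is an assumption here.
EXPECTED_HEADERS = [
    "Content-Security-Policy",
    "X-Content-Type-Options",
    "X-Frame-Options",
]
EXPECTED_COOKIE_ATTRS = ["httponly", "samesite", "secure"]


def missing_headers(headers):
    """Return expected security headers that are absent (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [h for h in EXPECTED_HEADERS if h.lower() not in present]


def missing_cookie_attrs(set_cookie_value):
    """Return expected attributes absent from a single Set-Cookie value."""
    # Attributes follow the first "name=value" pair, separated by ";".
    parts = set_cookie_value.split(";")[1:]
    attrs = {p.strip().split("=", 1)[0].lower() for p in parts if p.strip()}
    return [a for a in EXPECTED_COOKIE_ATTRS if a not in attrs]


# Hypothetical response, for illustration only.
response_headers = {
    "Content-Type": "text/html; charset=utf-8",
    "X-Frame-Options": "DENY",
    "Set-Cookie": "session=abc123; Path=/; HttpOnly",
}
print(missing_headers(response_headers))
print(missing_cookie_attrs(response_headers["Set-Cookie"]))
```

Findings of this kind repeat across every page an app serves, which is consistent with the systematic pattern described above.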
Human-written applications demonstrate a broader and more diverse vulnerability landscape, including higher-severity issues such as persistent XSS, buffer overflow patterns, and authentication weaknesses. These vulnerabilities are less consistent but more likely to be exploitable and higher impact.
Overall findings:
- AI code → repetitive configuration-level issues
- Human code → diverse logic-level vulnerabilities
- Human apps show a higher risk score and a higher exploitability rate