33 changes: 33 additions & 0 deletions .github/workflows/prove-shared-ci.yml
@@ -0,0 +1,33 @@
name: prove-shared-ci

on:
  push:
    paths:
      - "prove-shared/**"
      - ".github/workflows/prove-shared-ci.yml"
  pull_request:
    paths:
      - "prove-shared/**"
      - ".github/workflows/prove-shared-ci.yml"

jobs:
  test-prove-shared:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout
        uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"

      - name: Install package and test dependencies
        run: |
          python -m pip install --upgrade pip
          python -m pip install -e ./prove-shared[test]

      - name: Run tests
        run: |
          python -m pytest prove-shared/tests -q
3 changes: 3 additions & 0 deletions .gitignore
@@ -167,6 +167,8 @@ entity_cache.p
 wikidata_claims_refs_parsed.db
 results.csv
 *.sh
+!prove-api/scripts/*.sh
+!prove-processing/scripts/*.sh
 API_key.txt
 *.out
 CodeArchive/
@@ -178,6 +180,7 @@ output.log2.gz
 tester.py
 api/secrets.py
 utils/secrets.py
+**/local_secrets.py
 db/
 tmp/
 .claude/settings.local.json
185 changes: 185 additions & 0 deletions README.new.md
@@ -0,0 +1,185 @@
# ProVe (Provenance Verification for Wikidata claims)


## Overview

ProVe is a system designed to automatically verify claims and references in Wikidata. It extracts claims from Wikidata entities, fetches the referenced URLs, processes the HTML content, and uses NLP models to determine whether the claims are supported by the referenced content.

It:
1. extracts claims and references from a Wikidata item,
2. fetches reference content from external URLs,
3. selects evidence sentences,
4. runs textual entailment,
5. stores and serves results through API and background services.

## Current Repository Structure

The codebase is now organized into three top-level folders inside this workspace:

- prove-api: HTTP/API layer, dashboard, templates, docs, queue endpoint
- prove-processing: background workers, pipeline orchestration, ML/NLP models
- prove-shared: pip-installable shared package (MongoDB models/handlers, auth, utilities)

Root-level files still include global project metadata such as pyproject.toml, README.md, LICENSE, and project planning docs.

## Architecture Summary
Comment on lines +1 to +25
Copilot AI Apr 14, 2026

The PR description says the README was replaced, but this change adds a new README.new.md while the existing README.md remains in the repo. If the intent is to update the main project documentation, consider renaming this to README.md (or updating tooling/config/docs to point at README.new.md) to avoid having two competing entrypoint READMEs.

### 1) Data Collection and Processing

- WikidataParser extracts claims and reference URLs from QIDs.
- HTMLFetcher downloads referenced pages (requests/selenium fallback).
- HTMLSentenceProcessor turns HTML into candidate evidence sentences.
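
As a rough illustration of the sentence step, the sketch below joins neighbouring sentences with a sliding window. This is only one plausible reading of the `sentence_slide` settings (`window_size: 2`, `join_char: ' '`) in config.yaml; the function name `slide_sentences` is hypothetical and is not taken from HTMLSentenceProcessor.

```python
from typing import List


def slide_sentences(
    sentences: List[str],
    window_size: int = 2,
    join_char: str = " ",
) -> List[str]:
    """Join each run of `window_size` consecutive sentences into one candidate.

    Hypothetical helper: the real processor may window sentences differently.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if not sentences:
        return []
    if len(sentences) <= window_size:
        # Too few sentences for more than one window: return a single candidate.
        return [join_char.join(sentences)]
    return [
        join_char.join(sentences[i:i + window_size])
        for i in range(len(sentences) - window_size + 1)
    ]


candidates = slide_sentences(
    ["Ada was born in London.", "London is a city.", "It is in England."]
)
```

With three input sentences and a window of two, `candidates` holds two overlapping two-sentence strings, so evidence that spans a sentence boundary is still available as one candidate.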

### 2) Evidence Selection and Verification

- EvidenceSelector ranks candidate evidence against claims.
- ClaimEntailmentChecker classifies SUPPORTS / REFUTES / NOT ENOUGH INFO.
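
The two steps above can be sketched as a top-n filter followed by a label choice. This is a minimal sketch under stated assumptions: `select_top_evidence` and `pick_label` are hypothetical names, the defaults mirror `n_top_sentences: 5` and `score_threshold: 0` from config.yaml, and treating the threshold as a strict lower bound is a guess rather than the documented behaviour of EvidenceSelector.

```python
from typing import List, Tuple

# Label order is an assumption made for this sketch only.
LABELS = ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO")


def select_top_evidence(
    scored: List[Tuple[str, float]],
    n_top_sentences: int = 5,
    score_threshold: float = 0.0,
) -> List[Tuple[str, float]]:
    """Keep sentences scoring above the threshold, best first, at most n of them."""
    kept = [(text, score) for text, score in scored if score > score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:n_top_sentences]


def pick_label(probabilities: List[float]) -> str:
    """Return the label with the highest probability (one probability per label)."""
    if len(probabilities) != len(LABELS):
        raise ValueError("expected one probability per label")
    best = max(range(len(LABELS)), key=lambda i: probabilities[i])
    return LABELS[best]


top = select_top_evidence(
    [("weak match", 0.2), ("strong match", 0.9), ("off topic", -0.1)],
    n_top_sentences=2,
)
label = pick_label([0.1, 0.2, 0.7])
```

Keeping selection and classification as separate functions mirrors the split between EvidenceSelector and ClaimEntailmentChecker, so either side can be swapped without touching the other.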

### 3) NLP Models

- TextualEntailmentModule (BERT-FEVER style entailment)
- SentenceRetrievalModule (sentence relevance scoring)
- VerbModule (graph statement verbalization)

### 4) Storage

- MongoDB: html content, entailment outputs, parser stats, queue/status
- SQLite: historical/aggregated data used by API logic in legacy paths

## Shared Package (prove-shared)

The shared package is installable and used by API and processing code.

### Local install

From root:

```bash
uv sync
# or
pip install .
```

The root pyproject.toml declares a local path dependency that installs prove_shared from the prove-shared folder.

### Direct shared install

```bash
cd prove-shared
pip install -e .
```

### Import style

```python
from prove_shared import MongoDBHandler, AsyncAuth, Status
from prove_shared.mongo_handler import requestItemProcessing
```

## Setup Instructions

### 1) Python environment

Use Python 3.10.16 as declared in project metadata.

### 2) Install dependencies

Install from root:

```bash
pip install .
```

### 3) Download model assets

The base model assets are still required for processing pipelines.

Download:

https://emckclac-my.sharepoint.com/:u:/g/personal/stty3154_kcl_ac_uk/IQDeSEYuxxRDSp-zJovVXvbRAVmhmXRw97g7D0eLmJIKyUs?e=Iq446V

Place the base folder at the expected location used by model paths in processing modules.

### 4) Runtime secrets

Environment-specific secrets files are required and should remain gitignored.

Key examples:

- prove-shared/src/prove_shared/local_secrets.py
- prove-api/api/local_secrets.py (if used by API modules)
- prove-processing/utils/local_secrets.py (legacy paths still referenced by some processing code)

### 5) Configuration

Shared runtime settings are in:

- prove-shared/config.yaml

It includes database paths, batch sizes, thresholds, and the algorithm version.

## How to Run

### Processing a single entity

```python
from ProVe_main_process import initialize_models, process_entity

models = initialize_models()
qid = "Q44"
html_df, entailment_results, parser_stats = process_entity(qid, models)
```

### Start processing service

```bash
cd prove-processing
python ProVe_main_service.py
```

### Start API service

```bash
cd prove-api
python api/app.py
```

### Background processing

The scheduler can process:

- top viewed Wikidata items,
- pagepile list items,
- heuristic/random QID queues.

## Data Flow

1. API or scheduler enqueues a QID.
2. Processing worker fetches queue task.
3. Parser extracts claims + reference URLs.
4. HTML collector fetches and stores page content metadata.
5. Evidence selector ranks candidate sentences.
6. Entailment model classifies claim-evidence relationship.
7. Results are written to MongoDB and served by API.
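
The flow above can be sketched end to end with stand-in callables. This is an illustrative sketch, not the project's worker: `run_worker` and its `parse`, `fetch`, and `verify` parameters are hypothetical, an in-memory `deque` and `dict` stand in for the MongoDB queue and result collections, and steps 5 and 6 are collapsed into a single `verify` call.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


def run_worker(
    queue: Deque[str],
    parse: Callable[[str], List[dict]],
    fetch: Callable[[str], str],
    verify: Callable[[str, str], str],
    results: Dict[str, List[dict]],
) -> None:
    """Drain the queue, running parse -> fetch -> verify for every QID."""
    while queue:
        qid = queue.popleft()                          # step 2: take the next task
        rows = []
        for claim in parse(qid):                       # step 3: claims + reference URLs
            page_text = fetch(claim["url"])            # step 4: fetch page content
            label = verify(claim["text"], page_text)   # steps 5-6, collapsed here
            rows.append({"claim": claim["text"], "url": claim["url"], "label": label})
        results[qid] = rows                            # step 7: store for the API


# Stand-in callables so the sketch runs without network access or models.
queue: Deque[str] = deque(["Q44"])                     # step 1: enqueue a QID
results: Dict[str, List[dict]] = {}
run_worker(
    queue,
    parse=lambda qid: [{"text": "beer is a drink", "url": "https://example.org/beer"}],
    fetch=lambda url: "Beer is an alcoholic drink.",
    verify=lambda claim, text: "SUPPORTS" if "drink" in text else "NOT ENOUGH INFO",
    results=results,
)
```

Passing the stages in as callables keeps the worker loop independent of the parser, fetcher, and models, which is also what makes it testable without the downloaded model assets.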

## Notes on Ongoing Split

This repository currently contains all three components in one workspace folder, but the structure and imports are being aligned so that each component can operate as an independent repository:

- prove-api
- prove-processing
- prove-shared

Project planning details are documented in project.md.

## Legacy Information Preserved from Previous README

The original README emphasized:

- parser/fetcher/evidence/entailment pipeline,
- MongoDB + SQLite storage model,
- service entry points,
- configuration in config.yaml,
- required external model folder.

All of these remain applicable, now mapped to the split folder layout above.
22 changes: 0 additions & 22 deletions api/db/website.py

This file was deleted.

File renamed without changes.
12 changes: 7 additions & 5 deletions api/app.py → prove-api/app.py
@@ -5,14 +5,16 @@
 from flask import Flask, jsonify, request, render_template_string
 from flask_cors import CORS
 from flasgger import Swagger, swag_from
-import sys
 import json

-from custom_decorators import log_request, api_required, AsyncAuth
-from local_secrets import CODE_PATH
-from queue_manager import QueueManager
+try:
+    from custom_decorators import log_request, api_required
+    from queue_manager import QueueManager
+except ImportError:
+    from api.custom_decorators import log_request, api_required
+    from api.queue_manager import QueueManager

-sys.path.append(CODE_PATH)
+from prove_shared.auth import AsyncAuth
 import functions


40 changes: 40 additions & 0 deletions prove-api/config.yaml
@@ -0,0 +1,40 @@
# TODO: This is a duplicate of the root config.yaml for the prove-api package.
# Once the split is finalised, each package should only contain the settings it needs.

database:
  name: 'wikidata_claims_refs_parsed.db'
  result_db_for_API: '/home/ubuntu/mntdisk/reference_checked.db'

queue:
  heuristic: 'random'

version:
  algo_version: '1.1.1'

parsing:
  reset_database: True  # Developer mode: cleans up the DB to test something

spacy:
  model: 'en_core_web_sm'

html_fetching:
  batch_size: 10
  delay: 1.0
  fetching_driver: 'chrome'  # available options: 'chrome' or 'requests'
  timeout: 15

logging:
  level: 'INFO'
  format: '%(asctime)s - %(levelname)s - %(message)s'

text_processing:
  sentence_slide:
    enabled: true
    window_size: 2  # sliding window for masking sentences
    join_char: ' '

evidence_selection:
  batch_size: 256
  n_top_sentences: 5
  score_threshold: 0
  token_size: 512
11 changes: 4 additions & 7 deletions api/custom_decorators.py → prove-api/custom_decorators.py
@@ -4,24 +4,21 @@
 from base64 import b64encode, b64decode
 from functools import wraps
 from flask import request
-import os
 import threading
 import time
 from typing import Any, Union
-import sys

 from pymongo import MongoClient

 try:
     from utils_api import get_ip_location, logger
-    from local_secrets import CODE_PATH, SOURCE, API_KEY, PRIVATE_KEY
+    from local_secrets import SOURCE, API_KEY, PRIVATE_KEY
 except ImportError:
     from api.utils_api import get_ip_location, logger
-    from api.local_secrets import CODE_PATH, SOURCE, API_KEY, PRIVATE_KEY
+    from api.local_secrets import SOURCE, API_KEY, PRIVATE_KEY

-sys.path.append(CODE_PATH)
-from utils.mongo_handler import MongoDBHandler
-from utils.auth import AsyncAuth
+from prove_shared.mongo_handler import MongoDBHandler
+from prove_shared.auth import AsyncAuth


 class StatsDBHandler(MongoDBHandler):