Refactor project structure and enhance documentation #59
Open
NathanGavenski wants to merge 6 commits into main from CodeSplit
Commits:
7bf4d18 Split code into project.md structure (NathanGavenski)
03402b5 prove-shared imports and project dependecies (NathanGavenski)
630b276 More complete readme.md (NathanGavenski)
03ee925 tests for prove-shared (NathanGavenski)
c2c7c52 Flatten prove-api structure and integrate prove-shared imports (#60) (thedeepaksengar)
b6f1197 Fix for using prove-shared and installing from prove-api (NathanGavenski)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,33 @@ | ||
| name: prove-shared-ci | ||
|
|
||
| on: | ||
| push: | ||
| paths: | ||
| - "prove-shared/**" | ||
| - ".github/workflows/prove-shared-ci.yml" | ||
| pull_request: | ||
| paths: | ||
| - "prove-shared/**" | ||
| - ".github/workflows/prove-shared-ci.yml" | ||
|
|
||
| jobs: | ||
| test-prove-shared: | ||
| runs-on: ubuntu-latest | ||
|
|
||
| steps: | ||
| - name: Checkout | ||
| uses: actions/checkout@v4 | ||
|
|
||
| - name: Setup Python | ||
| uses: actions/setup-python@v5 | ||
| with: | ||
| python-version: "3.10" | ||
|
|
||
| - name: Install package and test dependencies | ||
| run: | | ||
| python -m pip install --upgrade pip | ||
| python -m pip install -e ./prove-shared[test] | ||
|
|
||
| - name: Run tests | ||
| run: | | ||
| python -m pytest prove-shared/tests -q |
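The `paths` filters in the workflow above mean it should only run when files under `prove-shared/` or the workflow file itself change. A rough way to reason about which changes trigger it is sketched below. This is an approximation using Python's `fnmatch`, not GitHub's actual pattern matcher, and `would_trigger` is a hypothetical helper name; it only behaves sensibly for simple patterns like the two used here.

```python
from fnmatch import fnmatchcase
from typing import Iterable, Sequence

# The two path patterns copied from the workflow's push/pull_request filters.
WORKFLOW_PATHS = (
    "prove-shared/**",
    ".github/workflows/prove-shared-ci.yml",
)


def would_trigger(changed_files: Iterable[str],
                  patterns: Sequence[str] = WORKFLOW_PATHS) -> bool:
    """Return True if any changed file matches any path pattern.

    Assumption: fnmatch-style matching is close enough for these two
    patterns. GitHub's own glob rules differ in general.
    """
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )
```

Under this approximation, a change touching only `prove-api/api/app.py` would not start the job, while a change to `prove-shared/tests/test_a.py` would.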
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,185 @@ | ||
| # ProVe (Provenance Verification for Wikidata claims) | ||
|
|
||
|
|
||
| ## Overview | ||
|
|
||
| ProVe is a system designed to automatically verify claims and references in Wikidata. It extracts claims from Wikidata entities, fetches the referenced URLs, processes the HTML content, and uses NLP models to determine whether the claims are supported by the referenced content. | ||
|
|
||
| It: | ||
| 1. extracts claims and references from a Wikidata item, | ||
| 2. fetches reference content from external URLs, | ||
| 3. selects evidence sentences, | ||
| 4. runs textual entailment, | ||
| 5. stores and serves results through API and background services. | ||
|
|
||
| ## Current Repository Structure | ||
|
|
||
| The codebase is now organized into three top-level folders inside this workspace: | ||
|
|
||
| - prove-api: HTTP/API layer, dashboard, templates, docs, queue endpoint | ||
| - prove-processing: background workers, pipeline orchestration, ML/NLP models | ||
| - prove-shared: pip-installable shared package (MongoDB models/handlers, auth, utilities) | ||
|
|
||
| Root-level files still include global project metadata such as pyproject.toml, README.md, LICENSE, and project planning docs. | ||
|
|
||
| ## Architecture Summary | ||
|
|
||
| ### 1) Data Collection and Processing | ||
|
|
||
| - WikidataParser extracts claims and reference URLs from QIDs. | ||
| - HTMLFetcher downloads referenced pages (requests/selenium fallback). | ||
| - HTMLSentenceProcessor turns HTML into candidate evidence sentences. | ||
|
|
||
| ### 2) Evidence Selection and Verification | ||
|
|
||
| - EvidenceSelector ranks candidate evidence against claims. | ||
| - ClaimEntailmentChecker classifies SUPPORTS / REFUTES / NOT ENOUGH INFO. | ||
|
|
||
| ### 3) NLP Models | ||
|
|
||
| - TextualEntailmentModule (BERT-FEVER style entailment) | ||
| - SentenceRetrievalModule (sentence relevance scoring) | ||
| - VerbModule (graph statement verbalization) | ||
|
|
||
| ### 4) Storage | ||
|
|
||
| - MongoDB: html content, entailment outputs, parser stats, queue/status | ||
| - SQLite: historical/aggregated data used by API logic in legacy paths | ||
|
|
||
| ## Shared Package (prove-shared) | ||
|
|
||
| The shared package is installable and used by API and processing code. | ||
|
|
||
| ### Local install | ||
|
|
||
| From root: | ||
|
|
||
| ```bash | ||
| uv sync | ||
| # or | ||
| pip install . | ||
| ``` | ||
|
|
||
| Root pyproject.toml includes a local path dependency to install prove_shared from prove-shared. | ||
|
|
||
| ### Direct shared install | ||
|
|
||
| ```bash | ||
| cd prove-shared | ||
| pip install -e . | ||
| ``` | ||
|
|
||
| ### Import style | ||
|
|
||
| ```python | ||
| from prove_shared import MongoDBHandler, AsyncAuth, Status | ||
| from prove_shared.mongo_handler import requestItemProcessing | ||
| ``` | ||
|
|
||
| ## Setup Instructions | ||
|
|
||
| ### 1) Python environment | ||
|
|
||
| Use Python 3.10.16 as declared in project metadata. | ||
|
|
||
| ### 2) Install dependencies | ||
|
|
||
| Install from root: | ||
|
|
||
| ```bash | ||
| pip install . | ||
| ``` | ||
|
|
||
| ### 3) Download model assets | ||
|
|
||
| The base model assets are still required for processing pipelines. | ||
|
|
||
| Download: | ||
|
|
||
| https://emckclac-my.sharepoint.com/:u:/g/personal/stty3154_kcl_ac_uk/IQDeSEYuxxRDSp-zJovVXvbRAVmhmXRw97g7D0eLmJIKyUs?e=Iq446V | ||
|
|
||
| Place the base folder at the expected location used by model paths in processing modules. | ||
|
|
||
| ### 4) Runtime secrets | ||
|
|
||
| Environment-specific secrets files are required and should remain gitignored. | ||
|
|
||
| Key examples: | ||
|
|
||
| - prove-shared/src/prove_shared/local_secrets.py | ||
| - prove-api/api/local_secrets.py (if used by API modules) | ||
| - prove-processing/utils/local_secrets.py (legacy paths still referenced by some processing code) | ||
|
|
||
| ### 5) Configuration | ||
|
|
||
| Shared runtime settings are in: | ||
|
|
||
| - prove-shared/config.yaml | ||
|
|
||
| Includes DB paths, batch sizes, thresholds, and algorithm version. | ||
|
|
||
| ## How to Run | ||
|
|
||
| ### Processing a single entity | ||
|
|
||
| ```python | ||
| from ProVe_main_process import initialize_models, process_entity | ||
|
|
||
| models = initialize_models() | ||
| qid = "Q44" | ||
| html_df, entailment_results, parser_stats = process_entity(qid, models) | ||
| ``` | ||
|
|
||
| ### Start processing service | ||
|
|
||
| ```bash | ||
| cd prove-processing | ||
| python ProVe_main_service.py | ||
| ``` | ||
|
|
||
| ### Start API service | ||
|
|
||
| ```bash | ||
| cd prove-api | ||
| python api/app.py | ||
| ``` | ||
|
|
||
| ### Background processing | ||
|
|
||
| The scheduler can process: | ||
|
|
||
| - top viewed Wikidata items, | ||
| - pagepile list items, | ||
| - heuristic/random QID queues. | ||
|
|
||
| ## Data Flow | ||
|
|
||
| 1. API or scheduler enqueues a QID. | ||
| 2. Processing worker fetches queue task. | ||
| 3. Parser extracts claims + reference URLs. | ||
| 4. HTML collector fetches and stores page content metadata. | ||
| 5. Evidence selector ranks candidate sentences. | ||
| 6. Entailment model classifies claim-evidence relationship. | ||
| 7. Results are written to MongoDB and served by API. | ||
|
|
||
| ## Notes on Ongoing Split | ||
|
|
||
| This repository currently contains all three components in one workspace folder, but structure and imports are being aligned for independent repository operation: | ||
|
|
||
| - prove-api | ||
| - prove-processing | ||
| - prove-shared | ||
|
|
||
| Project planning details are documented in project.md. | ||
|
|
||
| ## Legacy Information Preserved from Previous README | ||
|
|
||
| The original README emphasized: | ||
|
|
||
| - parser/fetcher/evidence/entailment pipeline, | ||
| - MongoDB + SQLite storage model, | ||
| - service entry points, | ||
| - configuration in config.yaml, | ||
| - required external model folder. | ||
|
|
||
| All of these remain applicable, now mapped to the split folder layout above. | ||
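The seven-step Data Flow in the README above can be sketched as a single loop. This is a minimal illustration under stated assumptions, not the project's implementation: `run_queue`, `Verdict`, and the stage callables are hypothetical names, a `deque` stands in for the MongoDB-backed queue, and a plain dict stands in for result storage. Only the three labels come from the README.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Iterable, List, Tuple

# Labels as listed in the README's Architecture Summary.
LABELS = ("SUPPORTS", "REFUTES", "NOT ENOUGH INFO")


@dataclass
class Verdict:
    """One claim/evidence pair with its label (hypothetical record shape)."""
    claim: str
    url: str
    evidence: str
    label: str


def run_queue(
    queue: Deque[str],
    parse: Callable[[str], Iterable[Tuple[str, str]]],
    fetch: Callable[[str], List[str]],
    select: Callable[[str, List[str]], List[str]],
    classify: Callable[[str, str], str],
) -> Dict[str, List[Verdict]]:
    """Drain the queue and return verdicts per QID."""
    results: Dict[str, List[Verdict]] = {}
    while queue:
        qid = queue.popleft()                      # step 2: take a queued task
        verdicts: List[Verdict] = []
        for claim, url in parse(qid):              # step 3: claims + reference URLs
            sentences = fetch(url)                 # step 4: page content as sentences
            for evidence in select(claim, sentences):   # step 5: ranked evidence
                label = classify(claim, evidence)  # step 6: entailment label
                if label not in LABELS:
                    raise ValueError(f"unexpected label: {label!r}")
                verdicts.append(Verdict(claim, url, evidence, label))
        results[qid] = verdicts                    # step 7: stand-in for storage
    return results


# Stub stages so the sketch runs without network access or models.
demo = run_queue(
    deque(["Q44"]),                                # step 1: enqueue a QID
    parse=lambda qid: [(f"example claim about {qid}", "https://example.org/ref")],
    fetch=lambda url: ["First sentence.", "Second sentence."],
    select=lambda claim, sentences: sentences[:1],
    classify=lambda claim, evidence: "NOT ENOUGH INFO",
)
```

Passing the stages in as callables keeps the loop independent of any one package, which matches the stated goal of letting prove-api, prove-processing, and prove-shared operate separately.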
This file was deleted.
File renamed without changes.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,40 @@ | ||
| # TODO: This is a duplicate of the root config.yaml for the prove-api package. | ||
| # Once the split is finalised, each package should only contain the settings it needs. | ||
|
|
||
| database: | ||
| name: 'wikidata_claims_refs_parsed.db' | ||
| result_db_for_API: '/home/ubuntu/mntdisk/reference_checked.db' | ||
|
|
||
| queue: | ||
| heuristic: 'random' | ||
|
|
||
| version: | ||
| algo_version: '1.1.1' | ||
|
|
||
| parsing: | ||
| reset_database: True # This is a developer mode to clean up the DB to test something | ||
|
|
||
| spacy: | ||
| model: 'en_core_web_sm' | ||
|
|
||
| html_fetching: | ||
| batch_size: 10 | ||
| delay: 1.0 | ||
| fetching_driver: 'chrome' # available options: 'chrome' or 'requests' | ||
| timeout: 15 | ||
|
|
||
| logging: | ||
| level: 'INFO' | ||
| format: '%(asctime)s - %(levelname)s - %(message)s' | ||
|
|
||
| text_processing: | ||
| sentence_slide: | ||
| enabled: true | ||
| window_size: 2 # sliding window for masking sentences | ||
| join_char: ' ' | ||
|
|
||
| evidence_selection: | ||
| batch_size: 256 | ||
| n_top_sentences: 5 | ||
| score_threshold: 0 | ||
| token_size: 512 |
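To make the `text_processing.sentence_slide` and `evidence_selection` settings above more concrete, here is a small sketch of one plausible reading of them. The function names are hypothetical, and both interpretations are assumptions: the window is taken to join each run of `window_size` consecutive sentences with `join_char`, and `score_threshold` is taken to be a strict lower bound. The actual processing code may behave differently.

```python
from typing import List, Sequence, Tuple


def slide_sentences(sentences: Sequence[str],
                    window_size: int = 2,
                    join_char: str = " ") -> List[str]:
    """Join each run of `window_size` consecutive sentences (assumed semantics)."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if not sentences:
        return []
    if len(sentences) <= window_size:
        # Too few sentences for a full window: return them as one chunk.
        return [join_char.join(sentences)]
    return [
        join_char.join(sentences[i:i + window_size])
        for i in range(len(sentences) - window_size + 1)
    ]


def top_sentences(scored: Sequence[Tuple[str, float]],
                  n_top_sentences: int = 5,
                  score_threshold: float = 0.0) -> List[Tuple[str, float]]:
    """Keep candidates scoring above the threshold, best first, at most n."""
    kept = [(text, score) for text, score in scored if score > score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:n_top_sentences]
```

With the defaults from the config, three sentences `["a", "b", "c"]` would become the two windows `"a b"` and `"b c"`, and at most five candidates with a score above 0 would be kept.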
The PR description says the README was replaced, but this change adds a new `README.new.md` while the existing `README.md` remains in the repo. If the intent is to update the main project documentation, consider renaming this to `README.md` (or updating tooling/config/docs to point at `README.new.md`) to avoid having two competing entrypoint READMEs.