Merged
14 changes: 14 additions & 0 deletions .env.example
@@ -0,0 +1,14 @@
# ── Google Drive integration ────────────────────────────────────────────────
# Required by scripts/full_pipeline.py (--api mode) and scripts/ci_pipeline.py.
# Obtain from the project's Google Cloud service account (ask the project owner).

# The full JSON key for the Google service account, as a single-line string.
GOOGLE_SERVICE_ACCOUNT_JSON=

# The ID of the root Google Drive folder containing 'useful' and 'not-useful' subfolders.
# Found in the folder's URL: drive.google.com/drive/folders/<ID>
GOOGLE_DRIVE_ROOT_FOLDER_ID=

# Set to "true" if the Drive folder is a shared drive (Team Drive).
# Leave blank or omit for standard My Drive folders.
GOOGLE_DRIVE_USE_SHARED_DRIVE=
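
A minimal sketch of how a script might read these three variables. The function name `load_drive_settings` and the returned dictionary keys are hypothetical and are not taken from `scripts/full_pipeline.py` or `scripts/ci_pipeline.py`; only the variable names and the "true"/blank convention come from the file above. Treating "true" case-insensitively is an assumption.

```python
import json
import os


def load_drive_settings(env=None):
    """Read the Drive variables from a mapping (defaults to os.environ).

    Hypothetical helper: the name and return shape are illustrative only.
    """
    env = os.environ if env is None else env

    raw_key = env.get("GOOGLE_SERVICE_ACCOUNT_JSON", "").strip()
    folder_id = env.get("GOOGLE_DRIVE_ROOT_FOLDER_ID", "").strip()
    if not raw_key or not folder_id:
        raise ValueError(
            "GOOGLE_SERVICE_ACCOUNT_JSON and GOOGLE_DRIVE_ROOT_FOLDER_ID must be set"
        )

    # The key is stored as a single line of JSON, so it can be parsed directly.
    key_info = json.loads(raw_key)

    # Blank or missing means a standard My Drive folder.
    # Assumption: "true" is matched case-insensitively.
    flag = env.get("GOOGLE_DRIVE_USE_SHARED_DRIVE", "").strip().lower()

    return {
        "key_info": key_info,
        "root_folder_id": folder_id,
        "use_shared_drive": flag == "true",
    }
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to exercise in tests without touching real credentials.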
19 changes: 10 additions & 9 deletions .github/workflows/working_sw.yml
@@ -20,7 +20,7 @@ jobs:
sudo apt-get update
sudo apt-get install -y tesseract-ocr
python -m pip install --upgrade pip
-pip install -r requirements.txt
+pip install ".[dev]"

# CI pipeline (streams 20 PDFs per class from Drive and trains model)
- name: CI pipeline (Drive stream)
@@ -36,10 +36,10 @@ jobs:
uses: actions/cache@v3
with:
path: |
-src/model/models
+src/classifier/models
~/.joblib
~/.sklearn
-key: ${{ runner.os }}-sklearn-model-${{ hashFiles('src/model/**/*.py') }}
+key: ${{ runner.os }}-sklearn-model-${{ hashFiles('src/classifier/**/*.py') }}
restore-keys: |
${{ runner.os }}-sklearn-model-

@@ -60,17 +60,17 @@ jobs:
- name: Run PDF extraction script (repo test asset)
run: |
INPUT_PDF="tests/test.pdf"
-python src/preprocessing/pdf_text_extraction.py "$INPUT_PDF"
+python -m src.io.pdf_text_extraction "$INPUT_PDF"

# - name: Generate labels
# run: |
# echo "Generating labels for extracted text..."
-# python src/preprocessing/generate_labels.py
+# python src/io/generate_labels.py

# - name: Load and preprocess data
# run: |
# echo "Loading and preprocessing dataset..."
-# python src/preprocessing/data_loader.py
+# python src/io/data_loader.py
# No full pipeline here; use full_training_pipeline.py on main or a scheduled workflow

- name: Validate preprocessing pipeline
@@ -113,8 +113,9 @@ jobs:
with:
name: trained-model
path: |
-src/model/models/pdf_classifier_model.pkl
-src/model/models/tfidf_vectorizer.pkl
+src/classifier/models/pdf_classifier.json
+src/classifier/models/tfidf_vectorizer.pkl
+src/classifier/models/label_encoder.pkl

- name: Show pipeline summary
run: |
@@ -124,4 +125,4 @@ jobs:
echo "2. Label Generation Results:"
ls -l data/labels.json
echo "3. Model Training Results:"
-ls -l src/model/models/
+ls -l src/classifier/models/
3 changes: 2 additions & 1 deletion .gitignore
@@ -211,6 +211,7 @@ data/needs-check
data/not-useful
data/processed-text
data/useful
-src/model/models/*.pkl
+data/results/
+src/classifier/models/*.pkl


8 changes: 4 additions & 4 deletions README.md
@@ -125,7 +125,7 @@ git clone https://github.com/NovakLabOSU/FracFeedExtractor.git
cd FracFeedExtractor
python3 -m venv venv
source venv/bin/activate
-pip install -r requirements.txt
+pip install -e ".[dev]"
```

```bash
@@ -134,17 +134,17 @@ git clone https://github.com/NovakLabOSU/FracFeedExtractor.git
cd FracFeedExtractor
py -m venv venv
./venv/Scripts/activate
-pip install -r requirements.txt
+pip install -e ".[dev]"
```

### Quick Start

```bash
# Classify and extract from a folder of PDFs
-python classify_extract.py path/to/pdfs/
+python src/pipeline/classify_extract.py path/to/pdfs/

# Adjust the LLM model or confidence threshold
-python classify_extract.py path/to/pdfs/ --llm-model llama3.1:8b --confidence-threshold 0.70
+python src/pipeline/classify_extract.py path/to/pdfs/ --llm-model qwen2.5:7b --confidence-threshold 0.70
```

Results are written to `data/results/metrics/` (per-paper JSON) and `data/results/summaries/` (pipeline CSV).
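
As a rough illustration of consuming that output, the sketch below collects every per-paper JSON file from the metrics directory into one dictionary. The helper name `load_paper_metrics` is hypothetical, and nothing is assumed about the fields inside each file; it only relies on the files being valid JSON.

```python
import json
from pathlib import Path


def load_paper_metrics(metrics_dir="data/results/metrics"):
    """Return {file stem: parsed JSON} for every .json file in metrics_dir.

    Hypothetical helper: the per-paper schema is not assumed here.
    """
    results = {}
    for path in sorted(Path(metrics_dir).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.stem] = json.load(handle)
    return results


if __name__ == "__main__":
    # Print which papers have results, without inspecting their contents.
    for name in load_paper_metrics():
        print(name)
```

A file such as `Adams_1989_results.json` would appear under the key `Adams_1989_results`; non-JSON files in the directory are ignored.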
17 changes: 0 additions & 17 deletions data/results/Adams_1989_results.json

This file was deleted.

12 changes: 0 additions & 12 deletions data/results/Ferreira_1999_results.json

This file was deleted.

16 changes: 0 additions & 16 deletions data/results/Fisher_2008_results.json

This file was deleted.

12 changes: 0 additions & 12 deletions data/results/Sousa_2015_results.json

This file was deleted.

11 changes: 0 additions & 11 deletions data/results/classifications.csv

This file was deleted.

110 changes: 0 additions & 110 deletions data/results/classifications.json

This file was deleted.

14 changes: 0 additions & 14 deletions data/results/metrics/Adams_1989_results.json

This file was deleted.

This file was deleted.

12 changes: 0 additions & 12 deletions data/results/test_biomistral_results.json

This file was deleted.

12 changes: 0 additions & 12 deletions data/results/test_quick_results.json

This file was deleted.

12 changes: 0 additions & 12 deletions data/results/test_results.json

This file was deleted.
