A local, offline-first code auditing system with incremental indexing (MD5-based) and in-memory retrieval, designed for consumer GPUs.
- GPU: NVIDIA RTX 3060 Laptop (6GB VRAM)
- RAM: 32GB
- FastAPI backend
- POST /audit: audits a code snippet with optional repository context (incremental RAG)
- GET /health: runtime metrics (CPU/RAM/process + GPU/VRAM when NVML is available)
- Incremental indexing
- MD5 change detection
- only new/modified files are re-embedded
- manifest persisted at data/manifests/manifest.json
- Vector store
- ChromaDB PersistentClient persists to disk (data/chroma/)
- runtime uses an in-memory cache (VectorCache) loaded from Chroma for low-latency retrieval
- Embeddings
- sentence-transformers/all-MiniLM-L6-v2
- GPU preferred when torch.cuda.is_available()
- UI
- Streamlit frontend
- shows audit report + PCA(2D) Plotly scatter
- shows live CPU/RAM/GPU usage via /health and request latency/timings
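
The incremental indexing feature listed above (MD5 change detection plus a persisted manifest) can be sketched as follows. This is a minimal illustration, not the project's actual `incremental_indexer.py`: the function names, the `{relative_path: md5}` manifest layout, and the `.py`-only file filter are all assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def load_manifest(manifest_path: Path) -> dict:
    """Load the manifest, or return an empty one if it does not exist yet.

    Assumed layout (hypothetical): {"relative/path.py": "<md5 hex>"}.
    """
    if manifest_path.exists():
        return json.loads(manifest_path.read_text(encoding="utf-8"))
    return {}


def diff_repo(repo_root: Path, manifest: dict, suffixes=(".py",)):
    """Compare files on disk with the manifest.

    Returns (changed, removed, current), where `changed` lists new or
    modified files (the only ones that need re-embedding), `removed`
    lists files that disappeared, and `current` is the updated manifest.
    """
    current = {}
    for path in sorted(repo_root.rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            current[path.relative_to(repo_root).as_posix()] = md5_of_file(path)
    changed = [p for p, h in current.items() if manifest.get(p) != h]
    removed = [p for p in manifest if p not in current]
    return changed, removed, current


def save_manifest(manifest_path: Path, manifest: dict) -> None:
    """Persist the manifest as JSON, creating parent directories if needed."""
    manifest_path.parent.mkdir(parents=True, exist_ok=True)
    manifest_path.write_text(
        json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8"
    )
```

On each run, only the paths in `changed` would be passed to the embedder, and the manifest would be saved once embedding succeeds, so a failed run does not mark files as indexed.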
Local-Code-Guardian/
  backend/
    app/
      main.py
      api/
        routes_audit.py
        routes_health.py
      rag/
        embedder.py
        chroma_store.py
        retriever.py
      indexing/
        incremental_indexer.py
        manifest.py
        file_hashing.py
      analysis/
        pca.py
  frontend/
    streamlit_app.py
  data/
    chroma/
    manifests/
      manifest.json
  requirements.txt
- Conda environment (example name: local-code-guardian)
- Python 3.10
- CUDA 12.1
- PyTorch 2.5.1 (CUDA build) already installed in the environment
- Ollama installed (Windows supported)
In your conda env:

    python -m pip install -r requirements.txt

Note: torch is intentionally NOT pinned in requirements.txt to avoid overwriting your CUDA-enabled PyTorch install.

Pull the Ollama model:

    ollama pull llama3:8b-instruct-q4_K_M

You can override the model name using OLLAMA_MODEL.

Start the backend, then start the frontend in a second terminal:

    python -m uvicorn backend.app.main:app --host 127.0.0.1 --port 8000
    streamlit run frontend/streamlit_app.py

Open the Streamlit URL shown in the terminal.
- In the Streamlit sidebar:
- set Backend URL (default http://localhost:8000)
- optionally set Git repo path to enable incremental indexing + retrieval
- paste code and click Audit
- The UI will show:
- audit report
- timings (index/retrieve/llm/total + HTTP RTT)
- PCA scatter plot of embeddings (new/updated vectors are highlighted)
- live CPU/RAM/GPU metrics (auto-refresh)
POST /audit
- body: { "code": "...", "prompt": "...", "repo_path": "...", "top_k": 5 }
- returns: { report, retrieved, points, timings }

GET /health
- returns: { status, cpu, ram, process, gpu }
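
For scripted use without the Streamlit UI, the POST /audit endpoint described above could be called with a small standard-library client like the sketch below. The helper names (`build_audit_request`, `send`) are hypothetical, and the example assumes that `repo_path` may be omitted when no repository context is wanted; which fields the backend actually requires should be checked against `routes_audit.py`.

```python
import json
import urllib.request


def build_audit_request(code, prompt="", repo_path=None, top_k=5,
                        base_url="http://localhost:8000"):
    """Build (but do not send) a POST /audit request.

    Field names follow the documented request body; treating repo_path
    as optional is an assumption of this sketch.
    """
    payload = {"code": code, "prompt": prompt, "top_k": top_k}
    if repo_path:
        payload["repo_path"] = repo_path
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/audit",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout_s=600):
    """Send the request and decode the JSON response body.

    The generous timeout is an arbitrary choice for slow local LLM runs.
    """
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Example usage (requires the backend to be running):
#   req = build_audit_request("def f(x): return eval(x)",
#                             prompt="Find security issues")
#   result = send(req)
#   print(result["report"])
#   print(result["timings"])
```

Separating request construction from sending keeps the payload easy to inspect or log before any network call is made.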
Environment variables:
- OLLAMA_BASE_URL (default http://localhost:11434/api)
- OLLAMA_MODEL (default llama3:8b-instruct-q4_K_M)
- OLLAMA_TIMEOUT_S (default 600)
- OLLAMA_NUM_GPU (optional, passed to Ollama options when set)
- CHROMA_PERSIST_DIR (default data/chroma)
- CHROMA_COLLECTION (default code)
- MANIFEST_PATH (default data/manifests/manifest.json)
- EMBEDDING_MODEL_NAME (default sentence-transformers/all-MiniLM-L6-v2)
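
One way to read these variables with their documented defaults is sketched below. The `Settings` dataclass and `load_settings` function are hypothetical names introduced for illustration; the project may load its configuration differently. Parsing OLLAMA_TIMEOUT_S as a float and OLLAMA_NUM_GPU as an integer is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    ollama_base_url: str
    ollama_model: str
    ollama_timeout_s: float
    ollama_num_gpu: Optional[int]  # None means "do not pass the option"
    chroma_persist_dir: str
    chroma_collection: str
    manifest_path: str
    embedding_model_name: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, falling back to the documented defaults."""
    num_gpu = env.get("OLLAMA_NUM_GPU")
    return Settings(
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434/api"),
        ollama_model=env.get("OLLAMA_MODEL", "llama3:8b-instruct-q4_K_M"),
        ollama_timeout_s=float(env.get("OLLAMA_TIMEOUT_S", "600")),
        # Unset or empty is treated as "not configured".
        ollama_num_gpu=int(num_gpu) if num_gpu else None,
        chroma_persist_dir=env.get("CHROMA_PERSIST_DIR", "data/chroma"),
        chroma_collection=env.get("CHROMA_COLLECTION", "code"),
        manifest_path=env.get("MANIFEST_PATH", "data/manifests/manifest.json"),
        embedding_model_name=env.get(
            "EMBEDDING_MODEL_NAME", "sentence-transformers/all-MiniLM-L6-v2"
        ),
    )
```

Accepting any mapping rather than reading `os.environ` directly makes the defaults easy to check in isolation, for example with `load_settings({})`.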
- Chroma is used as persistent storage; retrieval is performed from an in-memory cache loaded from Chroma.
- Indexing currently embeds whole files as single vectors (no chunking yet).
- Ollama GPU layer control is version-dependent; OLLAMA_NUM_GPU is kept as an optional knob.
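
To illustrate the in-memory retrieval mentioned in the notes above, here is a minimal pure-Python sketch of a vector cache holding one embedding per file. It is not the project's `VectorCache`: the class name `InMemoryVectorCache`, its methods, and the use of cosine similarity as the ranking metric are assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def _normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


class InMemoryVectorCache:
    """Hypothetical cache: one unit-length embedding per document id."""

    def __init__(self) -> None:
        self._vectors: Dict[str, List[float]] = {}

    def upsert(self, doc_id: str, embedding: Sequence[float]) -> None:
        """Insert or overwrite the embedding for a document (e.g. a file path)."""
        self._vectors[doc_id] = _normalize(embedding)

    def delete(self, doc_id: str) -> None:
        """Remove a document if present; missing ids are ignored."""
        self._vectors.pop(doc_id, None)

    def query(self, embedding: Sequence[float], top_k: int = 5) -> List[Tuple[str, float]]:
        """Return up to top_k (doc_id, cosine similarity) pairs, best first."""
        q = _normalize(embedding)
        scored = [
            (doc_id, sum(a * b for a, b in zip(q, vec)))
            for doc_id, vec in self._vectors.items()
        ]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]
```

Because vectors are normalized on insert, each query reduces to a dot product per stored file. With whole-file embeddings, the cache holds one entry per indexed file, so a linear scan like this stays cheap for small and medium repositories.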