Skip to content

Bio2Byte/local_chat

Repository files navigation

RAGU

RAGU is a local-first Streamlit application for retrieval-augmented question answering over structural biology and computational biology PDFs. It lets users upload papers, index them with self-hosted models, ask evidence-grounded questions, optionally visualize structures in Mol*, and optionally listen to the generated answer through Piper when text-to-speech is available.

The project runs:

  • locally with Python and a local Ollama installation
  • with Docker Compose using a dedicated Chroma service plus monitoring and observability services

What RAGU does

  1. Upload one or more PDF documents.
  2. Parse and split them into chunks.
  3. Create embeddings through Ollama.
  4. Store vectors in Chroma.
  5. Retrieve candidate chunks for a question.
  6. Rerank them with a cross-encoder.
  7. Ask a local Ollama chat model to answer strictly from retrieved context.
  8. Detect a PDB code in the answer and render a Mol* structure viewer when possible.
  9. Optionally synthesize the answer with Piper and play it in the browser.

Architecture

flowchart TD
    User[User] --> App[Streamlit app]
    App --> Upload[PDF upload]
    Upload --> Parse[PyMuPDFLoader]
    Parse --> Split[RecursiveCharacterTextSplitter]
    Split --> Embed[Ollama embeddings]
    Embed --> Chroma[Chroma service]

    User --> Ask[Question]
    Ask --> App
    App --> Retrieve[Retrieve from Chroma]
    Retrieve --> Rerank[Cross-encoder reranking]
    Rerank --> Chat[Ollama chat]
    Chat --> Answer[Streamed answer]
    Answer --> PDB[PDB extraction]
    PDB --> MolStar[Mol* structure viewer]
    Answer --> Piper[Piper TTS if available]
Loading

Service topology

flowchart LR
    App[app]
    Ollama[ollama]
    Chroma[chroma]
    Redis[redis]
    OTel[otel-collector]
    Zipkin[zipkin]
    Exporter[ollama-exporter]
    Prom[prometheus]
    Grafana[grafana]
    CAdvisor[cadvisor]

    App --> Ollama
    App --> Chroma
    App -. future .-> Redis

    Chroma --> OTel
    OTel --> Zipkin
    OTel --> Prom

    Ollama --> Exporter
    Exporter --> Prom
    CAdvisor --> Prom
    Prom --> Grafana
Loading

Current feature set

  • Multi-PDF upload from the Streamlit UI
  • Dedicated Chroma service over HTTP
  • Local/self-hosted Ollama embedding and chat inference
  • Cross-encoder reranking for better retrieval quality
  • Streamed answer rendering in the UI
  • PDB extraction from the answer text
  • Mol* 3D structure viewer integration
  • Optional text-to-speech through Piper with graceful fallback
  • Chroma observability through OTEL Collector and Zipkin
  • Ollama monitoring through exporter, Prometheus, Grafana, and cAdvisor
  • Docker Compose stack with Redis provisioned for future use

Project structure

.
├── AGENT.md
├── ARCHITECTURE.md
├── CONTRIBUTING.md
├── DESIGN.md
├── Dockerfile
├── OPERATIONS.md
├── ROADMAP.md
├── README.md
├── SECURITY.md
├── USERS.md
├── app.py
├── docker-compose.yml
├── grafana/
├── otel-collector-config.yaml
├── prometheus/
├── public/
├── requirements.txt
├── scripts/
│   └── ollama-entrypoint.sh
└── examples/

Code layout

The application is still mostly implemented as a single Streamlit entrypoint in app.py.

Key runtime helpers in the current code:

  • process_document: PDF loading and chunking
  • get_chroma_client: Chroma client selection
  • get_vector_collection: collection bootstrap with Ollama embeddings
  • add_to_vector_collection: upsert chunks
  • query_collection: retrieval
  • re_rank_cross_encoders: reranking
  • call_llm: Ollama chat generation
  • search_pdb_code: PDB extraction
  • get_piper_status / synthesize_speech: optional text-to-speech

Technology stack

  • Python 3.12
  • Streamlit
  • Ollama
  • ChromaDB
  • LangChain document loading and text splitting utilities
  • Sentence Transformers cross-encoder reranking
  • PyMuPDF
  • Piper
  • Docker Compose
  • Redis
  • OpenTelemetry Collector
  • Zipkin
  • Prometheus
  • Grafana
  • cAdvisor

Requirements

For local development

  • Python 3.12 recommended
  • pip
  • Ollama installed and running locally

For Docker Compose

  • Docker
  • Docker Compose

Configuration

Project runtime configuration is stored in .env.

Current variables:

STREAMLIT_HOST_PORT=8501
OLLAMA_HOST_PORT=11434
REDIS_HOST_PORT=6379
CHROMA_HOST_PORT=8000
ZIPKIN_HOST_PORT=9411
PROMETHEUS_HOST_PORT=9090
GRAFANA_HOST_PORT=3000
OLLAMA_EXPORTER_HOST_PORT=8001
CADVISOR_HOST_PORT=8081

OLLAMA_BASE_URL=http://ollama:11434
HOST_OLLAMA_BASE_URL=http://host.docker.internal:11434
HOST_OLLAMA_EXPORTER_HOST=host.docker.internal:11434
OLLAMA_CHAT_MODEL=qwen2.5:3b
OLLAMA_EMBED_MODEL=nomic-embed-text
OLLAMA_HEALTHCHECK_EMBED_MODEL=nomic-embed-text
OLLAMA_MODELS=qwen2.5:3b nomic-embed-text

CHROMA_CLIENT_MODE=http
CHROMA_HOST=chroma
CHROMA_PORT=8000
CHROMA_SSL=false
CHROMA_OPEN_TELEMETRY__SERVICE_NAME=chroma

GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin

Notes:

  • OLLAMA_BASE_URL points to the Compose service hostname when running in containers.
  • HOST_OLLAMA_BASE_URL is used by docker-compose.host-ollama.yml when the app should call an Ollama service running on the Docker host.
  • HOST_OLLAMA_EXPORTER_HOST is the matching host Ollama address for the optional exporter.
  • OLLAMA_MODELS is used by the Ollama startup script to preload the required models.
  • CHROMA_CLIENT_MODE=http is the intended Docker Compose mode.
  • Redis is provisioned but not yet used by the Python code.
  • Piper is optional and degrades to text-only mode when unavailable.

Recommended model pairs

As of April 7, 2026, these Ollama model names are available in the official library and are reasonable options if you want more capability without jumping straight to very large local models.

1. Safest for low-end laptops

  • Chat: llama3.2:1b
  • Embeddings: all-minilm

Why:

  • llama3.2:1b is the smallest current Llama 3.2 text model in the Ollama library.
  • all-minilm is a very small embedding model at about 46MB.

Tradeoff:

  • Fastest and lightest option
  • Lowest answer quality of the recommended set

2. Best balanced default for weak hardware

  • Chat: llama3.2:3b
  • Embeddings: nomic-embed-text

Why:

  • llama3.2:3b is still relatively small at about 2.0GB and is a solid default chat model.
  • nomic-embed-text is compact at about 274MB and has a stronger retrieval profile than very small embedding models.

Tradeoff:

  • Good balance of quality and resource usage
  • This remains the safest default recommendation for this project

3. More capable while still realistic on modest laptops

  • Chat: qwen2.5:3b
  • Embeddings: nomic-embed-text

Why:

  • qwen2.5:3b is still in the small-model range and is generally a stronger reasoning/instruction-following step up than the most lightweight options.
  • nomic-embed-text remains a good retrieval fit for this app.

Tradeoff:

  • Better quality than the lighter pairings
  • Heavier than llama3.2:3b

4. Highest-quality option I would still consider for some low-end laptops

  • Chat: qwen2.5:7b
  • Embeddings: all-minilm or nomic-embed-text

Why:

  • qwen2.5:7b is a meaningful quality step up if your laptop can tolerate it.
  • Use all-minilm if memory pressure is tight.
  • Use nomic-embed-text if retrieval quality matters more than minimizing memory.

Tradeoff:

  • This is no longer a universally safe low-end choice
  • I would treat this as borderline for low-end hardware, especially without enough RAM

Optional embedding upgrade

If chat speed is acceptable but retrieval quality is still your bottleneck, a reasonable heavier embedding upgrade is:

  • Embeddings: mxbai-embed-large

Tradeoff:

  • better retrieval potential than the smallest embedding models
  • materially heavier than all-minilm and nomic-embed-text

Practical recommendation

If you are unsure, use:

OLLAMA_CHAT_MODEL=llama3.2:3b
OLLAMA_EMBED_MODEL=nomic-embed-text
OLLAMA_HEALTHCHECK_EMBED_MODEL=nomic-embed-text
OLLAMA_MODELS=llama3.2:3b nomic-embed-text

If your laptop struggles, drop to:

OLLAMA_CHAT_MODEL=llama3.2:1b
OLLAMA_EMBED_MODEL=all-minilm
OLLAMA_HEALTHCHECK_EMBED_MODEL=all-minilm
OLLAMA_MODELS=llama3.2:1b all-minilm

If you want a stronger model and your machine can handle it, try:

OLLAMA_CHAT_MODEL=qwen2.5:3b
OLLAMA_EMBED_MODEL=nomic-embed-text
OLLAMA_HEALTHCHECK_EMBED_MODEL=nomic-embed-text
OLLAMA_MODELS=qwen2.5:3b nomic-embed-text

Sources:

Run locally

1. Create and activate a virtual environment

python3 -m venv .venv
source .venv/bin/activate

2. Install Python dependencies

pip install --upgrade pip
pip install -r requirements.txt

3. Start Ollama and pull required models

ollama pull qwen2.5:3b
ollama pull nomic-embed-text

If you run locally without Docker, make sure Ollama is available at:

http://localhost:11434

If needed, export local runtime variables before starting Streamlit:

export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_CHAT_MODEL=qwen2.5:3b
export OLLAMA_EMBED_MODEL=nomic-embed-text

4. Start the application

streamlit run app.py

5. Open the UI

By default:

http://localhost:8501

Run with Docker Compose

1. Review .env

Adjust ports, models, Grafana credentials, or related service configuration if needed.

2. Start the stack

The Compose file uses profiles:

  • full: starts the complete stack with Streamlit, Chroma, Ollama, Redis, tracing, and monitoring.
  • minimal: starts only the Streamlit app, Ollama, and Redis.

For the complete stack, run:

docker compose --profile full up --build

For the minimal stack, run:

CHROMA_CLIENT_MODE=persistent docker compose --profile minimal up --build

To use an Ollama service already running on the host instead of the bundled Compose service, first make sure the required models are available on the host:

ollama pull qwen2.5:3b
ollama pull nomic-embed-text

Then start Compose with the host override:

docker compose -f docker-compose.yml -f docker-compose.host-ollama.yml --profile minimal up --build

The override makes the app call HOST_OLLAMA_BASE_URL, which defaults to http://host.docker.internal:11434, and keeps the bundled ollama container out of the active profile. On Linux hosts, Ollama may need to listen on the Docker host gateway instead of only loopback, for example by starting Ollama with OLLAMA_HOST=0.0.0.0:11434; only do this on a trusted local network.

3. Open the application

With the default .env values:

  • Streamlit app: http://localhost:8501
  • Chroma API: http://localhost:8000
  • Ollama API: http://localhost:11434
  • Redis: localhost:6379
  • Zipkin tracing UI: http://localhost:9411
  • Ollama Exporter metrics: http://localhost:8001/metrics
  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000
  • cAdvisor: http://localhost:8081

Important:

  • The first Ollama startup may take time because required models are downloaded automatically.
  • The app service waits for Ollama and Redis health checks before starting.
  • The minimal profile uses the app's embedded persistent Chroma store and does not start the dedicated Chroma, OpenTelemetry, Zipkin, Prometheus, Grafana, cAdvisor, or Ollama exporter containers.
  • Chroma traces are exported through OpenTelemetry Collector to Zipkin.
  • Piper installation happens inside the app image on supported architectures.

4. Stop the stack

docker compose down

To also remove named volumes:

docker compose down -v

Docker services

app

  • Builds from Dockerfile
  • Runs streamlit run app.py
  • Connects to Chroma over HTTP
  • Contains optional Piper text-to-speech support

chroma

  • Uses the official chromadb/chroma image
  • Stores Chroma data in the chroma_data volume
  • Exports OpenTelemetry traces to the collector

otel-collector

  • Uses otel/opentelemetry-collector-contrib
  • Receives OTLP traces from Chroma
  • Exports traces to Zipkin and logs them through the debug exporter

zipkin

  • Uses openzipkin/zipkin
  • Provides the tracing UI for Chroma observability

ollama-exporter

  • Uses the unofficial community exporter lucabecker42/ollama-exporter
  • Scrapes Ollama and exposes Prometheus metrics at /metrics
  • Provides Ollama-specific metrics such as version info, model inventory, running models, VRAM usage, and scrape stats

prometheus

  • Scrapes the Ollama exporter
  • Scrapes cAdvisor for container CPU, RAM, and accelerator metrics
  • Stores time-series metrics locally
  • Provides the query backend for dashboards and alerting

cadvisor

  • Uses the Docker Hub mirror litetex/ghcr.google.cadvisor
  • Exposes container-level CPU and memory metrics for Docker services
  • Can expose accelerator metrics when GPU support is available on the host

grafana

  • Connects to Prometheus as the default data source
  • Auto-provisions a starter Ollama Overview dashboard
  • Provides the UI for Ollama monitoring

ollama

  • Uses the official ollama/ollama image
  • Runs scripts/ollama-entrypoint.sh
  • Pulls the required chat and embedding models automatically
  • Persists model data in a named Docker volume

redis

  • Uses redis:7-alpine
  • Persists Redis data in a named Docker volume
  • Present for future caching and memory features

Persistence

Data persisted by the stack:

  • app_data: embedded Chroma data used by the minimal Docker Compose profile
  • chroma_data: Chroma database contents
  • ollama_data: downloaded Ollama models
  • redis_data: Redis state
  • prometheus_data: Prometheus time-series data
  • grafana_data: Grafana state

Chroma observability

The Docker stack includes the Chroma observability pattern described in the Chroma Docker guide:

  • Chroma emits OpenTelemetry traces
  • Chroma can also emit OTLP metrics through the collector
  • OpenTelemetry Collector receives them over OTLP
  • Zipkin stores and visualizes the resulting traces
  • Prometheus scrapes the collector's Prometheus metrics endpoint

Once the stack is running, open:

http://localhost:9411

Zipkin will start empty until requests hit Chroma. To generate a quick sample trace, call:

curl http://localhost:8000/api/v2/heartbeat

Then use the Zipkin UI and click Run Query.

If you see an error like:

unknown service opentelemetry.proto.collector.metrics.v1.MetricsService

it means Chroma is trying to export OTLP metrics but the OpenTelemetry Collector was configured only for traces. The included collector config in this repository now defines both:

  • a traces pipeline for Zipkin
  • a metrics pipeline with a Prometheus exporter on otel-collector:8889

Ollama monitoring stack

The Docker stack now includes a full Ollama monitoring path:

  • community Ollama exporter
  • Prometheus
  • Grafana

Endpoints:

  • Ollama exporter metrics: http://localhost:8001/metrics
  • Prometheus UI: http://localhost:9090
  • Grafana UI: http://localhost:3000

Grafana credentials come from .env:

GRAFANA_ADMIN_USER
GRAFANA_ADMIN_PASSWORD

The included Grafana provisioning automatically:

  • creates Prometheus as the default data source
  • loads a starter dashboard named Ollama Overview

The dashboard now includes gauge panels for:

  • Ollama CPU usage
  • Ollama RAM usage
  • Ollama GPU memory usage

GPU note:

  • the GPU gauge depends on accelerator metrics being available from cAdvisor
  • on systems without exposed GPU metrics, that gauge will stay at 0

Text-to-speech

RAGU supports optional text-to-speech with Piper.

Current behavior:

  • if Piper is installed and the configured voice model exists, the app renders a browser audio player after generating an answer
  • if Piper is unavailable, the app shows a visible informational message instead of failing silently
  • if Piper synthesis fails, the text answer still works and the user sees a warning

Docker behavior:

  • the app image attempts to install Piper for amd64
  • the app image also attempts an arm64 install path using the aarch64 Piper release asset
  • if Piper cannot be installed for the current architecture, the app remains text-only

Important:

  • the app no longer uses hardcoded local Windows paths for Piper
  • TTS playback happens in the browser via st.audio, not on the server/container speakers

The Ollama exporter used here is unofficial and community-maintained.

Examples

Example PDFs are available in examples/ for quick manual testing.

Suggested manual flow:

  1. Start the application.
  2. Upload one or more PDFs from examples/.
  3. Click Process.
  4. Ask a structural biology question.
  5. Inspect the retrieved chunks and reranked context in the UI expanders.

Current limitations

  • The application is still implemented as a single-file Streamlit app.
  • Redis is provisioned but not yet integrated into the Python runtime.
  • There is no automated test suite yet.
  • Error handling for dependency failures can still be improved.

Contributing

See CONTRIBUTING.md for architecture and technical decisions.

If you are an AI agent modifying the repository, read AGENT.md first.

Additional project documentation:

Origin

The project was originally inspired by:

About

No description, website, or topics provided.

Resources

Contributing

Security policy

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors