RAGU

RAGU is a local-first Streamlit application for retrieval-augmented question answering over structural biology and computational biology PDFs. It lets users upload papers, index them with self-hosted models, ask evidence-grounded questions, optionally visualize structures in Mol*, and optionally listen to the generated answer through Piper when text-to-speech is available.

The project can run in two ways:

locally, with Python and a local Ollama installation
with Docker Compose, using a dedicated Chroma service plus monitoring and observability services

What RAGU does

Upload one or more PDF documents.
Parse and split them into chunks.
Create embeddings through Ollama.
Store vectors in Chroma.
Retrieve candidate chunks for a question.
Rerank them with a cross-encoder.
Ask a local Ollama chat model to answer strictly from retrieved context.
Detect a PDB code in the answer and render a Mol* structure viewer when possible.
Optionally synthesize the answer with Piper and play it in the browser.
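
The retrieve-then-rerank steps above can be sketched in a few lines. This is only an illustration of the pattern, not RAGU's implementation: the names `token_overlap_score` and `rerank` are hypothetical, and the toy word-overlap scorer stands in for the cross-encoder so that the example runs with the standard library alone.

```python
from typing import Callable, List, Tuple


def token_overlap_score(question: str, chunk: str) -> float:
    """Toy relevance score: fraction of question tokens that appear in the chunk.

    Assumption: this is a stand-in for a cross-encoder, not what RAGU uses.
    """
    q_tokens = set(question.lower().split())
    c_tokens = set(chunk.lower().split())
    if not q_tokens:
        return 0.0
    return len(q_tokens & c_tokens) / len(q_tokens)


def rerank(
    question: str,
    candidates: List[str],
    score: Callable[[str, str], float] = token_overlap_score,
    top_k: int = 3,
) -> List[Tuple[float, str]]:
    """Score every retrieved candidate against the question and keep the best top_k."""
    scored = [(score(question, chunk), chunk) for chunk in candidates]
    # Sort on the score only, highest first; the sort is stable, so ties keep retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```

Passing the scorer as a parameter keeps the ranking logic independent of the model, so a real cross-encoder could be plugged in by wrapping its scoring call in a function with the same `(question, chunk) -> float` shape.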

Architecture

flowchart TD
    User[User] --> App[Streamlit app]
    App --> Upload[PDF upload]
    Upload --> Parse[PyMuPDFLoader]
    Parse --> Split[RecursiveCharacterTextSplitter]
    Split --> Embed[Ollama embeddings]
    Embed --> Chroma[Chroma service]

    User --> Ask[Question]
    Ask --> App
    App --> Retrieve[Retrieve from Chroma]
    Retrieve --> Rerank[Cross-encoder reranking]
    Rerank --> Chat[Ollama chat]
    Chat --> Answer[Streamed answer]
    Answer --> PDB[PDB extraction]
    PDB --> MolStar[Mol* structure viewer]
    Answer --> Piper[Piper TTS if available]

Service topology

flowchart LR
    App[app]
    Ollama[ollama]
    Chroma[chroma]
    Redis[redis]
    OTel[otel-collector]
    Zipkin[zipkin]
    Exporter[ollama-exporter]
    Prom[prometheus]
    Grafana[grafana]
    CAdvisor[cadvisor]

    App --> Ollama
    App --> Chroma
    App -. future .-> Redis

    Chroma --> OTel
    OTel --> Zipkin
    OTel --> Prom

    Ollama --> Exporter
    Exporter --> Prom
    CAdvisor --> Prom
    Prom --> Grafana

Current feature set

Multi-PDF upload from the Streamlit UI
Dedicated Chroma service over HTTP
Local/self-hosted Ollama embedding and chat inference
Cross-encoder reranking for better retrieval quality
Streamed answer rendering in the UI
PDB extraction from the answer text
Mol* 3D structure viewer integration
Optional text-to-speech through Piper with graceful fallback
Chroma observability through OTEL Collector and Zipkin
Ollama monitoring through exporter, Prometheus, Grafana, and cAdvisor
Docker Compose stack with Redis provisioned for future use
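
As a rough illustration of streamed answer rendering: Ollama's chat endpoint can stream its reply as newline-delimited JSON objects, each carrying a fragment in `message.content` and a final object with `done` set to true. The helper below, with the hypothetical name `iter_answer_tokens`, is a sketch of how such a stream could be turned into text fragments; it is not taken from app.py.

```python
import json
from typing import Iterable, Iterator


def iter_answer_tokens(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a newline-delimited JSON chat stream."""
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank keep-alive lines
        event = json.loads(raw)
        fragment = event.get("message", {}).get("content", "")
        if fragment:
            yield fragment
        if event.get("done"):
            break  # the server marks the last object with done=true
```

Because the function only needs an iterable of byte lines, it can be exercised with a plain list in tests and fed an HTTP response object in real use.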

Project structure

.
├── AGENT.md
├── ARCHITECTURE.md
├── CONTRIBUTING.md
├── DESIGN.md
├── Dockerfile
├── OPERATIONS.md
├── ROADMAP.md
├── README.md
├── SECURITY.md
├── USERS.md
├── app.py
├── docker-compose.yml
├── grafana/
├── otel-collector-config.yaml
├── prometheus/
├── public/
├── requirements.txt
├── scripts/
│   └── ollama-entrypoint.sh
└── examples/

Code layout

The application is still mostly implemented as a single Streamlit entrypoint in app.py.

Key runtime helpers in the current code:

process_document: PDF loading and chunking
get_chroma_client: Chroma client selection
get_vector_collection: collection bootstrap with Ollama embeddings
add_to_vector_collection: upsert chunks
query_collection: retrieval
re_rank_cross_encoders: reranking
call_llm: Ollama chat generation
search_pdb_code: PDB extraction
get_piper_status / synthesize_speech: optional text-to-speech
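
To make the PDB extraction step more concrete, here is a minimal sketch of pulling PDB-style identifiers out of answer text. It is not the body of search_pdb_code; the name `find_pdb_codes` and the regular expression are assumptions for illustration. Classic PDB IDs are four characters, a digit from 1 to 9 followed by three alphanumerics, and this sketch additionally requires at least one letter so that years are not picked up.

```python
import re
from typing import List

# A digit 1-9, then three alphanumerics, as a standalone word.
# Assumption for this sketch: the lookahead demands at least one letter among
# the last three characters, so "2024" is skipped (as would any all-digit ID).
_PDB_ID = re.compile(r"\b[1-9](?=[A-Za-z0-9]{0,2}[A-Za-z])[A-Za-z0-9]{3}\b")


def find_pdb_codes(text: str) -> List[str]:
    """Return unique PDB-like IDs in order of first appearance, upper-cased."""
    seen: List[str] = []
    for match in _PDB_ID.finditer(text):
        code = match.group(0).upper()
        if code not in seen:
            seen.append(code)
    return seen
```

A pattern like this can still produce false positives on tokens such as "5min", so a real implementation would likely want to confirm that the ID exists before rendering a viewer.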

Technology stack

Python 3.12
Streamlit
Ollama
ChromaDB
LangChain document loading and text splitting utilities
Sentence Transformers cross-encoder reranking
PyMuPDF
Piper
Docker Compose
Redis
OpenTelemetry Collector
Zipkin
Prometheus
Grafana
cAdvisor

Requirements

For local development

Python 3.12 recommended
pip
Ollama installed and running locally

For Docker Compose

Docker
Docker Compose

Configuration

Project runtime configuration is stored in .env.

Current variables:

STREAMLIT_HOST_PORT=8501
OLLAMA_HOST_PORT=11434
REDIS_HOST_PORT=6379
CHROMA_HOST_PORT=8000
ZIPKIN_HOST_PORT=9411
PROMETHEUS_HOST_PORT=9090
GRAFANA_HOST_PORT=3000
OLLAMA_EXPORTER_HOST_PORT=8001
CADVISOR_HOST_PORT=8081

OLLAMA_BASE_URL=http://ollama:11434
HOST_OLLAMA_BASE_URL=http://host.docker.internal:11434
HOST_OLLAMA_EXPORTER_HOST=host.docker.internal:11434
OLLAMA_CHAT_MODEL=qwen2.5:3b
OLLAMA_EMBED_MODEL=nomic-embed-text
OLLAMA_HEALTHCHECK_EMBED_MODEL=nomic-embed-text
OLLAMA_MODELS=qwen2.5:3b nomic-embed-text

CHROMA_CLIENT_MODE=http
CHROMA_HOST=chroma
CHROMA_PORT=8000
CHROMA_SSL=false
CHROMA_OPEN_TELEMETRY__SERVICE_NAME=chroma

GRAFANA_ADMIN_USER=admin
GRAFANA_ADMIN_PASSWORD=admin

Notes:

OLLAMA_BASE_URL points to the Compose service hostname when running in containers.
HOST_OLLAMA_BASE_URL is used by docker-compose.host-ollama.yml when the app should call an Ollama service running on the Docker host.
HOST_OLLAMA_EXPORTER_HOST is the matching host Ollama address for the optional exporter.
OLLAMA_MODELS is used by the Ollama startup script to preload the required models.
CHROMA_CLIENT_MODE=http is the intended Docker Compose mode.
Redis is provisioned but not yet used by the Python code.
Piper is optional and degrades to text-only mode when unavailable.
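
The variables above could be read in Python roughly as follows. This is a sketch under stated assumptions, not the loader used by app.py: the `Settings` class and `load_settings` function are hypothetical names, and the fallback values simply mirror the .env listing above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    ollama_base_url: str
    chat_model: str
    embed_model: str
    chroma_mode: str
    chroma_host: str
    chroma_port: int
    chroma_ssl: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read runtime variables, falling back to the values documented above."""
    mode = env.get("CHROMA_CLIENT_MODE", "http").lower()
    # Only the two modes mentioned in this README are accepted here.
    if mode not in {"http", "persistent"}:
        raise ValueError(f"Unsupported CHROMA_CLIENT_MODE: {mode!r}")
    return Settings(
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://ollama:11434").rstrip("/"),
        chat_model=env.get("OLLAMA_CHAT_MODEL", "qwen2.5:3b"),
        embed_model=env.get("OLLAMA_EMBED_MODEL", "nomic-embed-text"),
        chroma_mode=mode,
        chroma_host=env.get("CHROMA_HOST", "chroma"),
        chroma_port=int(env.get("CHROMA_PORT", "8000")),
        chroma_ssl=env.get("CHROMA_SSL", "false").strip().lower() in {"1", "true", "yes"},
    )
```

Taking the environment as a parameter makes the function easy to test with a plain dictionary, and failing early on an unknown client mode surfaces a typo in .env before any service call is attempted.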

Recommended model pairs

As of April 7, 2026, these Ollama model names are available in the official library and are reasonable options if you want more capability without jumping straight to very large local models.

1. Safest for low-end laptops

Chat: llama3.2:1b
Embeddings: all-minilm

Why:

llama3.2:1b is the smallest current Llama 3.2 text model in the Ollama library.
all-minilm is a very small embedding model at about 46MB.

Tradeoff:

Fastest and lightest option
Lowest answer quality of the recommended set

2. Best balanced default for weak hardware

Chat: llama3.2:3b
Embeddings: nomic-embed-text

Why:

llama3.2:3b is still relatively small at about 2.0GB and is a solid default chat model.
nomic-embed-text is compact at about 274MB and has a stronger retrieval profile than very small embedding models.

Tradeoff:

Good balance of quality and resource usage
This remains the safest default recommendation for this project

3. More capable while still realistic on modest laptops

Chat: qwen2.5:3b
Embeddings: nomic-embed-text

Why:

qwen2.5:3b is still in the small-model range and is generally a step up in reasoning and instruction following over the most lightweight options.
nomic-embed-text remains a good retrieval fit for this app.

Tradeoff:

Better quality than the lighter pairings
Heavier than llama3.2:3b

4. Highest-quality option I would still consider for some low-end laptops

Chat: qwen2.5:7b
Embeddings: all-minilm or nomic-embed-text

Why:

qwen2.5:7b is a meaningful quality step up if your laptop can tolerate it.
Use all-minilm if memory pressure is tight.
Use nomic-embed-text if retrieval quality matters more than minimizing memory.

Tradeoff:

This is no longer a universally safe low-end choice
I would treat this as borderline for low-end hardware, especially without enough RAM

Optional embedding upgrade

If chat speed is acceptable but retrieval quality is still your bottleneck, a reasonable heavier embedding upgrade is:

Embeddings: mxbai-embed-large

Tradeoff:

better retrieval potential than the smallest embedding models
materially heavier than all-minilm and nomic-embed-text

Practical recommendation

If you are unsure, use:

OLLAMA_CHAT_MODEL=llama3.2:3b
OLLAMA_EMBED_MODEL=nomic-embed-text
OLLAMA_HEALTHCHECK_EMBED_MODEL=nomic-embed-text
OLLAMA_MODELS=llama3.2:3b nomic-embed-text

If your laptop struggles, drop to:

OLLAMA_CHAT_MODEL=llama3.2:1b
OLLAMA_EMBED_MODEL=all-minilm
OLLAMA_HEALTHCHECK_EMBED_MODEL=all-minilm
OLLAMA_MODELS=llama3.2:1b all-minilm

If you want a stronger model and your machine can handle it, try:

OLLAMA_CHAT_MODEL=qwen2.5:3b
OLLAMA_EMBED_MODEL=nomic-embed-text
OLLAMA_HEALTHCHECK_EMBED_MODEL=nomic-embed-text
OLLAMA_MODELS=qwen2.5:3b nomic-embed-text
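
When switching between these presets it is easy to change the chat or embedding model and forget OLLAMA_MODELS, which the startup script uses for preloading. The check below is a small sketch of how that mismatch could be caught; `models_not_preloaded` is a hypothetical helper, not part of the repository.

```python
from typing import List, Mapping


def models_not_preloaded(env: Mapping[str, str]) -> List[str]:
    """Return configured model names that are absent from OLLAMA_MODELS."""
    # OLLAMA_MODELS is a space-separated list, as in the presets above.
    preloaded = set(env.get("OLLAMA_MODELS", "").split())
    wanted = [
        env.get("OLLAMA_CHAT_MODEL", ""),
        env.get("OLLAMA_EMBED_MODEL", ""),
        env.get("OLLAMA_HEALTHCHECK_EMBED_MODEL", ""),
    ]
    missing: List[str] = []
    for name in wanted:
        if name and name not in preloaded and name not in missing:
            missing.append(name)
    return missing
```

An empty result means every configured model is also listed for preloading; any name returned would have to be pulled some other way before the app can use it.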

Sources:

Run locally

1. Create and activate a virtual environment

python3 -m venv .venv
source .venv/bin/activate

2. Install Python dependencies

pip install --upgrade pip
pip install -r requirements.txt

3. Start Ollama and pull required models

ollama pull qwen2.5:3b
ollama pull nomic-embed-text

If you run locally without Docker, make sure Ollama is available at:

http://localhost:11434

If needed, export local runtime variables before starting Streamlit:

export OLLAMA_BASE_URL=http://localhost:11434
export OLLAMA_CHAT_MODEL=qwen2.5:3b
export OLLAMA_EMBED_MODEL=nomic-embed-text
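
Before starting Streamlit, it can help to confirm that Ollama is reachable and that the required models have been pulled. The sketch below queries Ollama's /api/tags endpoint, which lists installed models, and compares the result with the models you expect. The function names are hypothetical, and treating an untagged name as `:latest` is an assumption about how names are normalized.

```python
import json
import urllib.request
from typing import Iterable, List


def _with_tag(name: str) -> str:
    """Assume an untagged model name refers to the ':latest' tag."""
    return name if ":" in name else f"{name}:latest"


def missing_models(required: Iterable[str], tags_response: dict) -> List[str]:
    """Compare required model names against a parsed /api/tags response."""
    installed = {_with_tag(m["name"]) for m in tags_response.get("models", [])}
    return [name for name in required if _with_tag(name) not in installed]


def fetch_tags(base_url: str = "http://localhost:11434", timeout: float = 5.0) -> dict:
    """GET /api/tags from a running Ollama server and parse the JSON body."""
    url = f"{base_url.rstrip('/')}/api/tags"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

For example, `missing_models(["qwen2.5:3b", "nomic-embed-text"], fetch_tags())` should return an empty list once both pulls above have completed.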

4. Start the application

streamlit run app.py

5. Open the UI

By default:

http://localhost:8501

Run with Docker Compose

1. Review `.env`

Adjust ports, models, Grafana credentials, or related service configuration if needed.

2. Start the stack

The Compose file uses profiles:

full: starts the complete stack with Streamlit, Chroma, Ollama, Redis, tracing, and monitoring.
minimal: starts only the Streamlit app, Ollama, and Redis.

For the complete stack, run:

docker compose --profile full up --build

For the minimal stack, run:

CHROMA_CLIENT_MODE=persistent docker compose --profile minimal up --build

To use an Ollama service already running on the host instead of the bundled Compose service, first make sure the required models are available on the host:

ollama pull qwen2.5:3b
ollama pull nomic-embed-text

Then start Compose with the host override:

docker compose -f docker-compose.yml -f docker-compose.host-ollama.yml --profile minimal up --build

The override makes the app call HOST_OLLAMA_BASE_URL, which defaults to http://host.docker.internal:11434, and keeps the bundled ollama container out of the active profile.

On Linux hosts, Ollama may need to listen on an address reachable through the Docker host gateway rather than only on loopback, for example by starting Ollama with OLLAMA_HOST=0.0.0.0:11434. That setting binds Ollama to all network interfaces, so only use it on a trusted local network.

3. Open the application

With the default .env values:

Streamlit app: http://localhost:8501
Chroma API: http://localhost:8000
Ollama API: http://localhost:11434
Redis: localhost:6379
Zipkin tracing UI: http://localhost:9411
Ollama Exporter metrics: http://localhost:8001/metrics
Prometheus: http://localhost:9090
Grafana: http://localhost:3000
cAdvisor: http://localhost:8081

Important:

The first Ollama startup may take time because required models are downloaded automatically.
The app service waits for Ollama and Redis health checks before starting.
The minimal profile uses the app's embedded persistent Chroma store and does not start the dedicated Chroma, OpenTelemetry, Zipkin, Prometheus, Grafana, cAdvisor, or Ollama exporter containers.
Chroma traces are exported through OpenTelemetry Collector to Zipkin.
Piper installation happens inside the app image on supported architectures.

4. Stop the stack

docker compose down

To also remove named volumes:

docker compose down -v

Docker services

`app`

Builds from Dockerfile
Runs streamlit run app.py
Connects to Chroma over HTTP in the full profile; the minimal profile uses the embedded persistent store instead
Contains optional Piper text-to-speech support

`chroma`

Uses the official chromadb/chroma image
Stores Chroma data in the chroma_data volume
Exports OpenTelemetry traces to the collector

`otel-collector`

Uses otel/opentelemetry-collector-contrib
Receives OTLP traces from Chroma
Exports traces to Zipkin and logs them through the debug exporter

`zipkin`

Uses openzipkin/zipkin
Provides the tracing UI for Chroma observability

`ollama-exporter`

Uses the unofficial community exporter lucabecker42/ollama-exporter
Scrapes Ollama and exposes Prometheus metrics at /metrics
Provides Ollama-specific metrics such as version info, model inventory, running models, VRAM usage, and scrape stats

`prometheus`

Scrapes the Ollama exporter
Scrapes cAdvisor for container CPU, RAM, and accelerator metrics
Stores time-series metrics locally
Provides the query backend for dashboards and alerting

`cadvisor`

Uses the Docker Hub mirror litetex/ghcr.google.cadvisor
Exposes container-level CPU and memory metrics for Docker services
Can expose accelerator metrics when GPU support is available on the host

`grafana`

Connects to Prometheus as the default data source
Auto-provisions a starter Ollama Overview dashboard
Provides the UI for Ollama monitoring

`ollama`

Uses the official ollama/ollama image
Runs scripts/ollama-entrypoint.sh
Pulls the required chat and embedding models automatically
Persists model data in a named Docker volume

`redis`

Uses redis:7-alpine
Persists Redis data in a named Docker volume
Present for future caching and memory features

Persistence

Data persisted by the stack:

app_data: embedded Chroma data used by the minimal Docker Compose profile
chroma_data: Chroma database contents
ollama_data: downloaded Ollama models
redis_data: Redis state
prometheus_data: Prometheus time-series data
grafana_data: Grafana state

Chroma observability

The Docker stack includes the Chroma observability pattern described in the Chroma Docker guide:

Chroma emits OpenTelemetry traces
Chroma can also emit OTLP metrics through the collector
OpenTelemetry Collector receives them over OTLP
Zipkin stores and visualizes the resulting traces
Prometheus scrapes the collector's Prometheus metrics endpoint

Once the stack is running, open:

http://localhost:9411

Zipkin shows no traces until requests reach Chroma. To generate a quick sample trace, call:

curl http://localhost:8000/api/v2/heartbeat

Then use the Zipkin UI and click Run Query.
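
The same heartbeat request can be issued from Python, which is handy in a smoke-test script. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the default host and port assume the default .env values.

```python
import json
import urllib.request
from typing import Optional


def heartbeat_url(host: str = "localhost", port: int = 8000, ssl: bool = False) -> str:
    """Build the Chroma heartbeat URL used in the curl example above."""
    scheme = "https" if ssl else "http"
    return f"{scheme}://{host}:{port}/api/v2/heartbeat"


def chroma_heartbeat(url: str, timeout: float = 3.0) -> Optional[dict]:
    """Return the parsed heartbeat body, or None if Chroma cannot be reached."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection and HTTP errors; ValueError covers invalid JSON.
        return None
```

Calling `chroma_heartbeat(heartbeat_url())` a few times while the full profile is running should be enough to make traces appear in Zipkin.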

If you see an error like:

unknown service opentelemetry.proto.collector.metrics.v1.MetricsService

it means Chroma is trying to export OTLP metrics but the OpenTelemetry Collector was configured only for traces. The included collector config in this repository now defines both:

a traces pipeline for Zipkin
a metrics pipeline with a Prometheus exporter on otel-collector:8889

Ollama monitoring stack

The Docker stack now includes a full Ollama monitoring path:

community Ollama exporter
Prometheus
Grafana

Endpoints:

Ollama exporter metrics: http://localhost:8001/metrics
Prometheus UI: http://localhost:9090
Grafana UI: http://localhost:3000

Grafana credentials come from .env:

GRAFANA_ADMIN_USER
GRAFANA_ADMIN_PASSWORD

The included Grafana provisioning automatically:

creates Prometheus as the default data source
loads a starter dashboard named Ollama Overview

The dashboard now includes gauge panels for:

Ollama CPU usage
Ollama RAM usage
Ollama GPU memory usage

GPU note:

the GPU gauge depends on accelerator metrics being available from cAdvisor
on systems without exposed GPU metrics, that gauge will stay at 0

Text-to-speech

RAGU supports optional text-to-speech with Piper.

Current behavior:

if Piper is installed and the configured voice model exists, the app renders a browser audio player after generating an answer
if Piper is unavailable, the app shows a visible informational message instead of failing silently
if Piper synthesis fails, the text answer still works and the user sees a warning
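
The fallback behavior described above can be sketched as two small helpers. These are not the get_piper_status and synthesize_speech functions from app.py; the names, messages, and the 60-second timeout are assumptions. The command line follows Piper's documented usage of reading text on standard input with `--model` and `--output_file`.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Optional, Tuple


def piper_status(binary: str = "piper", voice_model: Optional[str] = None) -> Tuple[bool, str]:
    """Report whether text-to-speech can run, with a reason when it cannot."""
    exe = shutil.which(binary)
    if exe is None:
        return False, f"'{binary}' was not found on PATH; answers stay text-only."
    if voice_model is None or not Path(voice_model).is_file():
        return False, "No voice model file was found; answers stay text-only."
    return True, exe


def synthesize_wav(text: str, voice_model: str, out_path: str, binary: str = "piper") -> bool:
    """Pipe text to Piper; return False instead of raising if synthesis fails."""
    try:
        subprocess.run(
            [binary, "--model", voice_model, "--output_file", out_path],
            input=text.encode("utf-8"),
            check=True,
            capture_output=True,
            timeout=60,
        )
    except (OSError, subprocess.SubprocessError):
        # Missing binary, non-zero exit status, or timeout: fall back to text-only.
        return False
    return Path(out_path).is_file()
```

Returning a status and a reason, rather than raising, lets the UI show an informational message or a warning while the text answer is still displayed.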

Docker behavior:

the app image attempts to install Piper for amd64
the app image also attempts an arm64 install path using the aarch64 Piper release asset
if Piper cannot be installed for the current architecture, the app remains text-only

Important:

the app no longer uses hardcoded local Windows paths for Piper
TTS playback happens in the browser via st.audio, not on the server/container speakers

The Ollama exporter used here is unofficial and community-maintained.

Examples

Example PDFs are available in examples/ for quick manual testing.

Suggested manual flow:

Start the application.
Upload one or more PDFs from examples/.
Click Process.
Ask a structural biology question.
Inspect the retrieved chunks and reranked context in the UI expanders.

Current limitations

The application is still implemented as a single-file Streamlit app.
Redis is provisioned but not yet integrated into the Python runtime.
There is no automated test suite yet.
Error handling for dependency failures can still be improved.

Contributing

See CONTRIBUTING.md for architecture and technical decisions.

If you are an AI agent modifying the repository, read AGENT.md first.

Additional project documentation:

Origin

The project was originally inspired by:
