diff --git a/docs/open_ondemand/index.md b/docs/open_ondemand/index.md index b6fe5821..2384cba8 100644 --- a/docs/open_ondemand/index.md +++ b/docs/open_ondemand/index.md @@ -103,5 +103,6 @@ vs_code-server files_app jobs_app terminal_app +llm_chat_interface ``` diff --git a/docs/open_ondemand/llm_chat_interface.md b/docs/open_ondemand/llm_chat_interface.md new file mode 100644 index 00000000..89f6bc61 --- /dev/null +++ b/docs/open_ondemand/llm_chat_interface.md @@ -0,0 +1,164 @@ +# LLM Chat Interface + +The **LLM Chat Interface** is an interactive Open OnDemand application that provides a browser-based chat experience powered by [Chainlit](https://docs.chainlit.io) and [Ollama](https://ollama.com). When you launch a session, the application starts an Ollama server on a GPU-equipped compute node and connects it to CURC-hosted large language models (LLMs). You can ask questions, draft and debug code, summarize documents, and analyze images (with vision-capable models), all from your web browser. + +```{important} +The LLM Chat Interface is currently offered as a beta service. Functionality, available models, and resource allocations may change as we gather feedback and refine the service. Please report issues or suggestions through the [CURC support form](https://colorado.service-now.com/req_portal?id=ucb_sc_rc_form). +``` + +## Launching the LLM Chat Interface + +1. Log in to [Open OnDemand](https://curc.readthedocs.io/en/latest/open_ondemand/index.html) using your CURC credentials. +2. Navigate to either the **Interactive Apps** drop-down menu or the **My Interactive Sessions** tab and select **LLM Chat Interface**. +3. Review the launch form fields: + + - Ollama model path — Select which model library Ollama should load. The default **CURC LLM Models** uses CURC-hosted models. You may also provide the **absolute path** to your own Ollama model directory, if you have pulled or fine-tuned models there. See [Ollama documentation](../ai-ml/llms.md#ollama) for more details. + - Configuration type — Choose Preset configuration (recommended for most users) or Custom configuration for advanced resource control. For details on these options, see [Configuring Open OnDemand interactive applications](./configuring_apps.md). + +4. If you selected **Preset configuration**, choose **10 cores, 1 GPU, 1 hour**. This submits your job to Alpine with one GPU, which is required to run the LLM backend. + +```{important} +When you launch with **Preset configuration**, your job is submitted to Alpine **testing** hardware (`atesting_a100` partition, `testing` QoS). Testing resources are shared and **limited in capacity**, so sessions may queue during high demand, run for a maximum of **one hour**, and are not intended for sustained or large-scale work. For longer runtimes or heavier workloads, use **Custom configuration** and request appropriate resources. +``` + +5. If you selected **Custom configuration**, you **must** request at least one GPU in the **gres** field (for example, `gpu:1`). See the [Limitations](#limitations) section for guidance on GPU memory (VRAM) and model size. +6. Click **Launch** and wait for your session to start. When the job is ready, click **Connect to LLM Chat Interface** to open the chat in a new browser tab. + + +## Getting started + +Once you click **Connect to LLM Chat Interface**, you land on the welcome screen. + +### Chat history + +Chat history is saved per user under `/projects` directory, so you can resume past conversations from the sidebar. Conversation data is stored at: + +``` +/projects//.chainlit_data +``` + +### Model selection + +Available models are loaded dynamically from the Ollama server when your session starts. To switch models: + +1. Open the **model selector** in the application header (labeled with the current model name). +2. Browse the list of **Chat Profiles**. Each entry shows: + - The model name + - Capability tags such as **Chat** or **Vision** + - Parameter size and quantization level +3. Select a model. **New messages** in the current thread use the newly selected model. When you resume an older thread from the sidebar, the model saved with that thread is used. + +```{tip} +Choose a smaller, chat-focused model for quick questions and code snippets. Switch to a vision model only when you need to analyze images, since vision models typically require more GPU memory and may respond more slowly. +``` + +### Message actions + +After each assistant reply, action buttons may appear: + +| Action | Description | +| --- | --- | +| Regenerate | Ask the model to produce a new reply to your last message. | +| Copy code block | Copy the first fenced code block from the reply to your clipboard. | +| New chat | Clear the current conversation and return to the welcome screen. | +| Switch to vision model | Switch to a vision-capable model (shown when the current model does not support images). | + +### Attaching files, text, and images + +Browser-based file upload is **disabled** in this application. Files must already exist on CURC filesystems. To attach a file, you **must** use the `/file` command followed by the file's **absolute path**. This keeps data on cluster storage and avoids uploading large files through the web interface. + +#### How to attach files + +**Every attachment must start with `/file`.** When your prompt includes a file, structure your message as follows: + +1. **First line:** `/file` followed by one or more absolute file paths (space-separated for multiple files). +2. **Following lines:** Your question or instruction to the model. + +##### Example 1 + +``` +/file /projects/$USER/data/results.csv +What trends do you see in this file? +``` + +##### Example 2: Multiple files upload + +``` +/file /projects/$USER/my_report.pdf /projects/$USER/script.py +Summarize the report and check whether the script implements the methods described. +``` + +If you send `/file` with paths but no follow-up question, the assistant is asked to analyze the attached file(s) by default. + +```{important} +Do **not** paste a bare filesystem path without the `/file` prefix, as the assistant will not attach the file. When files are attached successfully, the chat displays a confirmation as such **File attached from Alpine filesystem**. +``` + +#### Allowed locations + +Attachments must be absolute paths under one of these: + +- `/home/$USER/document.txt` +- `/projects/$USER/myfile.pdf` +- `/scratch/alpine/$USER/output.log` +- `/pl/active//data.csv` + +#### Supported file types + +| Type | Extensions / formats | Notes | +| --- | --- | --- | +| **Text and code** | `.py`, `.js`, `.ts`, `.html`, `.css`, `.json`, `.yaml`, `.md`, `.sh`, `.r`, `.sql`, `.csv`, and many others | Full file contents are injected into the prompt. | +| **Plain text** | `.txt`, `.md`, `.rst` | Same as above. | +| **PDF** | `.pdf` | Text is extracted automatically. Scanned or image-only PDFs may not yield text (see [Limitations](#limitations)). | +| **Images** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tif`, `.tiff` | Requires a **Vision** model. | + +#### Attachment limits + +| Limit | Value | +| --- | --- | +| Maximum files per message | 20 | +| Maximum file size | 500 MB per file | +| Maximum PDF text extracted | 120,000 characters per PDF (longer documents are truncated) | + +## Limitations + +### GPU memory (VRAM) and model size + +Every LLM Chat Interface session runs on **GPU hardware**. The preset configuration requests **one GPU** (`gpu:1`). Custom configurations also **require** at least one GPU in **gres**. + +Large language models load into GPU video memory (VRAM). Important constraints: + +- **Larger models use more VRAM.** A model's parameter size (shown in the model selector) is a rough guide: multi-billion-parameter models need substantially more memory than smaller ones. +- **Only one GPU is allocated by default.** Very large models may fail to load, run slowly, or return errors if they exceed available VRAM on the assigned node. +- **Vision models and long contexts increase memory pressure.** Attaching images, long PDFs, or maintaining a long conversation history all consume context window space and can contribute to out-of-memory failures or empty responses. + +```{important} +This assistant was **not** trained on [CU Research Computing (CURC) documentation](https://curc.readthedocs.io). It does **not** have reliable, up-to-date knowledge of CURC-specific systems, policies, or procedures. The model may produce plausible-sounding but **incorrect** answers for system specific topics. **Always verify** CURC-specific information against official documentation. + +``` + +### Other limitations + +- **Context window.** The backend uses a large but finite context window (32,768 tokens). Extremely long files, many attachments, or very long threads may be truncated or cause degraded responses. +- **Not for sensitive or regulated data.** Do not paste export-controlled, HIPAA, or other restricted data into the chat. Treat prompts and attachments as you would any shared compute resource. +- **Resuming old threads** reloads conversation text but does not re-inject large file bodies from earlier turns; re-attach files if you need the model to see them again. + + +## Use cases + +* Ask general coding questions +``` +Write a short Python function that reads a CSV and computes column means. +``` + +* Summarizing and questioning documents +``` +/file /projects/$USER/papers/methods_supplement.pdf +List the experimental parameters in a table and note anything ambiguous. +``` +* Image and plot reviews +``` +/file /projects/$USER/figures/confusion_matrix.png +Is this figure publication-ready? Suggest axis labels and caption text. +``` +* Deploying custom-trained or fine-tuned models that can answer questions about your lab's protocols, instrumentation, research methods, and internal documentation. \ No newline at end of file