AI-powered code completion and editing for VS Code using local LLM backends
Collama is a VS Code extension that provides code completions, refactoring suggestions, and documentation generation. It supports multiple backends:
- Ollama - local
- OpenAI compatible — local / cloud
Status: Heavy active development — output may occasionally be unexpected.
Code Completion
- Inline, multiline, and multiblock suggestions
- Uses currently opened tabs as context
Code Edits
- Generate docstrings, extract functions, refactor code
- Simplify complex code, fix syntax errors
- Manual instructions for custom edits
Chat Interface
- Multiple chat sessions with custom titles
- Temporary chat sessions — create unlisted, non-persisted sessions for quick experiments
- Send selected code/files as context with file references
- Search and attach files/folders directly from the chat input
- Send to chat from file tree: right-click files/folders in the VS Code Explorer
- Real-time context usage bar with automatic trimming
- Edit messages, copy sessions, scroll navigation
- Export chat history to JSON for backup and sharing
- Auto-accept all toggle for edits and creates to streamline workflow
- Summarize individual turns or entire conversations to reduce context usage
- AGENTS.md support: define custom agent rules by placing an `AGENTS.md` file in your project root
AI Agent with Tool Calling
- Tools for filesystem, git, and code diagnostics (see Available Tools)
- Workspace-boundary and `.gitignore` protection; optional read-only mode
- Confirmation flow with Accept / Accept All / Cancel-with-reason, plus a duration counter
Commit Messages
- AI-generated conventional commits from staged changes
- Accessible via command palette or Source Control view
Current Context Management
- Smart pruning of editor tabs in autocomplete to fit context
- Chat history optimization (tool results removed)
- Token counter visualization for agent-loop and total usage
Prerequisites: VS Code 1.109.0+, an Ollama or OpenAI-compatible endpoint (local or remote), and a supported code model (see Models).
Install the extension from the marketplace or build the vsix yourself, then configure an endpoint in settings. For authentication, set your API key as a bearer token — see Bearer Tokens.
Ollama (local, remote)
See Ollama installation instructions or the Docker image. Point apiEndpointCompletion / apiEndpointInstruct to your Ollama host (default: http://127.0.0.1:11434).
OpenAI / OpenAI-compatible (local, remote, cloud)
Point the endpoint settings to your server (e.g. vLLM, LiteLLM) or to https://api.openai.com for the OpenAI cloud API.
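Before pointing Collama at a server, it can help to confirm that the server answers at all by asking it for its model list. The sketch below only builds that URL. It relies on Ollama's `GET /api/tags` route and the OpenAI-style `GET /v1/models` route; `modelListUrl` and `Backend` are hypothetical names for illustration and are not part of Collama. It also assumes the endpoint value does not already end in `/v1`.

```typescript
// Hypothetical helper (not part of Collama): builds the URL that lists
// the models available on a given backend.
type Backend = "ollama" | "openai";

function modelListUrl(endpoint: string, backend: Backend): string {
    // Drop trailing slashes so the joined URL has exactly one separator.
    const base = endpoint.replace(/\/+$/, "");
    // Ollama lists local models under /api/tags; OpenAI-compatible
    // servers (e.g. vLLM, LiteLLM) expose /v1/models.
    const path = backend === "ollama" ? "/api/tags" : "/v1/models";
    return base + path;
}

console.log(modelListUrl("http://127.0.0.1:11434", "ollama"));
console.log(modelListUrl("https://api.openai.com/", "openai"));
```

Requesting the printed URL with `curl` or a browser should return a JSON model list when the server is running; an authenticated endpoint will also expect the bearer token.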
Configure Collama via VS Code Settings (Preferences → Settings, search "collama"):
| Setting | Type | Default | Description |
|---|---|---|---|
| `collama.apiEndpointCompletion` | string | `http://127.0.0.1:11434` | Endpoint for code auto-completion |
| `collama.apiEndpointInstruct` | string | `http://127.0.0.1:11434` | Endpoint for code edits/chat |
| `collama.apiModelCompletion` | string | `qwen2.5-coder:3b` | Model for code completions |
| `collama.apiModelInstruct` | string | `gpt-oss:20b` | Model for code edits (use instruct/base variant) |
| `collama.apiTokenContextLenCompletion` | number | `4096` | Context window size (tokens) for the completion model |
| `collama.apiTokenContextLenInstruct` | number | `4096` | Context window size (tokens) for the instruct/chat model |
| `collama.apiTokenPredictCompletion` | number | `400` | Max tokens to generate per completion request |
| `collama.apiTokenPredictInstruct` | number | `4096` | Max tokens to generate per instruct/chat request |
| `collama.autoComplete` | boolean | `true` | Enable auto-suggestions |
| `collama.suggestMode` | string | `inline` | Suggestion style: inline, multiline, or multiblock |
| `collama.suggestDelay` | number | `1500` | Delay (ms) before requesting completion |
| `collama.verbosityMode` | string | `medium` | Chat response detail: compact, medium, or detailed |
| `collama.agenticMode` | string | `plain` | Agent mode: plain, lite, or default |
| `collama.enableEditTools` | boolean | `true` | Enable edit tools (read-only mode when disabled) |
| `collama.enableShellTool` | boolean | `false` | Enable shell tool usage |
| `collama.tlsRejectUnauthorized` | boolean | `true` | Verify TLS certificates for HTTPS endpoints |
Set token limits to match your model or server configuration. Values are tokens, not characters.
- `apiTokenContextLen*` - available context window
- `apiTokenPredict*` - maximum generated tokens per request
Context length is shared by input tokens and the LLM answer, so a higher predict limit leaves less room for input/context.
Note
Check your model's maximum context window in its documentation, and keep in mind that a larger context window reserves more memory.
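The trade-off between the two limits can be worked out with simple arithmetic. The sketch below assumes the input budget is the context length minus the predict limit; `inputBudget` is a hypothetical helper, and Collama's own accounting may differ.

```typescript
// Hypothetical helper: tokens left for input/context once the generation
// limit is reserved. Assumes a plain subtraction.
function inputBudget(contextLen: number, predict: number): number {
    if (predict > contextLen) {
        throw new RangeError("predict limit exceeds the context window");
    }
    return contextLen - predict;
}

// Completion defaults from the settings table: 4096-token window, 400 predicted.
const completionBudget = inputBudget(4096, 400);
// Illustrative instruct values (not the defaults): 32768-token window, 4096 predicted.
const instructBudget = inputBudget(32768, 4096);

console.log(completionBudget, instructBudget); // 3696 28672
```

Under this assumption, raising the predict limit without also raising the context length shrinks the space available for open tabs, attached files, and chat history.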
If an endpoint needs authentication, store the token via Command Palette:
- `collama: Set Bearer Token (Completion)`
- `collama: Set Bearer Token (Instruct)`
Tokens are sent as Authorization: Bearer <token> and stored in VS Code's encrypted credential storage. Run the same command with an empty value to clear one.
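For reference, this is roughly what the resulting request headers look like. The sketch is illustrative only: `buildHeaders` and the `Content-Type` default are assumptions, and only the `Authorization: Bearer <token>` format comes from the description above.

```typescript
// Hypothetical helper (not Collama's code): builds request headers for an
// endpoint, adding the Authorization header only when a token is set.
function buildHeaders(token?: string): Record<string, string> {
    const headers: Record<string, string> = { "Content-Type": "application/json" };
    const trimmed = token?.trim();
    if (trimmed) {
        // Format described in this README: Authorization: Bearer <token>
        headers["Authorization"] = `Bearer ${trimmed}`;
    }
    return headers;
}

console.log(buildHeaders("my-secret-token"));
console.log(buildHeaders("")); // an empty token adds no Authorization header
```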
Collama is tested primarily with Qwen Coder models for completion and with frontier models served through LiteLLM for instruct tasks. Small models may struggle with the verbosity settings, so test each model and keep the setting that works best for it.
- Any Qwen Coder model above 3B is recommended
- Any instruct model with thinking capabilities (e.g. gpt-oss:20b)
- Mid-level MoE models with dynamic quantization (e.g. unsloth's dynamic GGUF quants) gave noticeably better results than dense models of similar size
- Dense mid-size models (e.g. qwen3:14b) can struggle here — prefer a MoE/dynamic-quant variant if results are poor
- Prefer frontier models (e.g. gpt-oss:120b, glm-4.7, minimax2.7) for agentic mode.
- Mid-level MoE models with dynamic quantization also handled agentic editing well (e.g. gpt-oss:20b bf16, qwen3.6:35b, qwen3:30b — all dynamic quants)
- Any chat model can be used for regular chat
- Keep Agentic mode OFF for small models — small models lack the reasoning for reliable tool calling
| Model | Tested Sizes | FIM Support | Status | Notes |
|---|---|---|---|---|
| codeqwen | — | Untested | May work | |
| qwen2.5-coder | 1.5B, 3B, 7B | ✅ | Stable | Recommended |
| qwen3-coder | 30B | ✅ | Stable | Recommended |
| starcoder | — | Untested | May work | |
| starcoder2 | 3B | ✅ | Stable | Like qwen2.5-coder |
Note: tested primarily at q4 quantization (results may vary with others), and ChatML-format models are not supported — only true FIM models will work for autocomplete.
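To illustrate what "true FIM" means: a fill-in-the-middle model receives the code before and after the cursor wrapped in special sentinel tokens and generates the missing middle. The sketch below uses the sentinel tokens published for Qwen2.5-Coder and StarCoder; `buildFimPrompt` is a hypothetical helper, and how Collama actually assembles its requests (for example, whether the backend applies the template itself) is not specified here.

```typescript
// Sentinel tokens published for each model family's FIM format.
const FIM_TOKENS = {
    qwen: { prefix: "<|fim_prefix|>", suffix: "<|fim_suffix|>", middle: "<|fim_middle|>" },
    starcoder: { prefix: "<fim_prefix>", suffix: "<fim_suffix>", middle: "<fim_middle>" },
} as const;

type FimFamily = keyof typeof FIM_TOKENS;

// Hypothetical helper: asks the model to generate the text that belongs
// between `before` and `after` (the code on either side of the cursor).
function buildFimPrompt(family: FimFamily, before: string, after: string): string {
    const t = FIM_TOKENS[family];
    return `${t.prefix}${before}${t.suffix}${after}${t.middle}`;
}

console.log(buildFimPrompt("qwen", "def add(a, b):\n    return ", "\n"));
```

A chat-only model has no such tokens in its vocabulary, which is why it cannot be used for autocomplete.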
- Completions: trigger automatically after `suggestDelay`; press `Tab` to accept or `Esc` to dismiss.
- Code edits: select code and use collama (on selection) for docstrings, refactors, fixes, or manual edits.
- Chat: attach files/folders, manage sessions, summarize context, and toggle auto-accept for edits/creates.
- Commit messages: stage changes, then run collama: Generate Commit Message.
If a chat already contains tool calls, switch to a fresh chat after turning Agentic OFF. Edit Tools can be turned OFF for read-only exploration. Shell Tool is OFF by default and can be enabled from the status bar menu.
- Agentic mode has been tested on vLLM (NVIDIA H200) with:
  - gpt-oss:120b
  - glm-4.7-fp8
  - minimax2.5, minimax2.7
  - deepseek-v4-flash
Warning
Small models should use chat with Agentic OFF.
Available Tools:
- Explore
  - `read` - Read a workspace file, optionally by line range
  - `grep` - Search workspace files with a regex pattern
  - `glob` - Find files and folders by glob pattern
- Git
  - `gitLog` - List commits or branches with optional filters
  - `gitDiff` - Show working tree, staged, or commit/branch diffs
- Code Analysis
  - `diagnostics` - Language-server analysis. Returns errors/warnings/hints for a file
- Decision
  - `decision` - Ask the user to choose between options when the right next step is ambiguous
- Edit Tools
  - `edit` - Replace an exact string in a workspace file with preview and confirmation
  - `create` - Create a new file or folder with confirmation
  - `delete` - Delete a file or folder with confirmation
  - `notebook` - Edit Jupyter notebook cells with rich diff preview support
- Shell Tool
  - `shell` - Run shell commands with confirmation when `collama.enableShellTool` is enabled; large output is written to a temp file instead of truncating
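As an illustration of what an exact-string edit involves, here is a minimal sketch. It is not Collama's implementation: `replaceExact` is a hypothetical name, and rejecting strings that occur more than once is an assumption made here to keep the edit unambiguous.

```typescript
// Hypothetical sketch of an exact-string edit; not Collama's implementation.
function replaceExact(content: string, oldText: string, newText: string): string {
    if (oldText.length === 0) {
        throw new Error("oldText must not be empty");
    }
    const first = content.indexOf(oldText);
    if (first === -1) {
        throw new Error("oldText not found");
    }
    // Assumption: reject ambiguous edits where the string occurs more than once.
    if (content.indexOf(oldText, first + 1) !== -1) {
        throw new Error("oldText is ambiguous: it occurs more than once");
    }
    return content.slice(0, first) + newText + content.slice(first + oldText.length);
}

const source = "let a = 1;\nlet b = 2;\n";
console.log(replaceExact(source, "let b = 2;", "let b = 3;"));
```

Matching on exact text rather than line numbers means an edit fails loudly if the file changed since the model last read it, which is one reason this style pairs well with a preview-and-confirm step.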
Contributions are welcome! Here's how you can help:
- Report issues: open an issue
- Submit PRs: Fork, create a feature branch, and submit a pull request

