mlx-proxy

Dynamic Model Management Proxy for MLX LM Studio Models

mlx_lm.server only supports loading a single fixed model at startup. This proxy adds:

Dynamically load/unload models via API calls
Toggle thinking on/off per load with enable_thinking
Auto-discover all models under your LM_Studio_Models directory
OpenAI-compatible API — drop-in replacement for existing code

Requirements

Apple Silicon Mac (M1/M2/M3/M4)
macOS 14 Sonoma or later
Python 3.10+
mlx-lm

Setup

1. Install mlx-lm

uv venv ~/mlx-server
source ~/mlx-server/bin/activate
uv pip install mlx-lm fastapi uvicorn httpx

2. Clone the repository

git clone https://github.com/YOUR_USERNAME/mlx-proxy
cd mlx-proxy

3. Edit configuration

Open mlx_proxy.py and update the settings at the top:

MODELS_ROOT = Path("/path/to/your/LM_Studio_Models")  # ← Required: path to your models
MLX_SERVER_BIN = Path.home() / "mlx-server/bin/mlx_lm.server"  # ← Required: path to mlx_lm.server
MLX_BACKEND_PORT = 18080  # Internal port (no change needed)
PROXY_PORT = 8080         # Port exposed to clients
PROXY_HOST = "0.0.0.0"   # Use 0.0.0.0 for remote access, 127.0.0.1 for local only

MODELS_ROOT and MLX_SERVER_BIN must be set to match your environment.

Examples:

LM Studio default: ~/.lmstudio/models

External SSD: /Volumes/MySSD/LM_Studio_Models
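
Because both paths depend on the local machine, it can help to verify them before starting the proxy. The following is a minimal sketch, not part of mlx_proxy.py; the helper name `check_config` is hypothetical, and it only assumes that MODELS_ROOT should be a directory and MLX_SERVER_BIN an executable file.

```python
import os
from pathlib import Path


def check_config(models_root: Path, server_bin: Path) -> list[str]:
    """Return human-readable problems; an empty list means both paths look usable."""
    problems = []
    # MODELS_ROOT must exist and be a directory (an unmounted SSD fails here).
    if not models_root.is_dir():
        problems.append(f"MODELS_ROOT is not a directory: {models_root}")
    # MLX_SERVER_BIN must be a regular file with the execute bit set.
    if not (server_bin.is_file() and os.access(server_bin, os.X_OK)):
        problems.append(f"MLX_SERVER_BIN is not an executable file: {server_bin}")
    return problems
```

Calling `check_config(MODELS_ROOT, MLX_SERVER_BIN)` and printing any returned messages gives a quick sanity check before step 4.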

4. Start the proxy

source ~/mlx-server/bin/activate
python mlx_proxy.py

5. Auto-start with launchd (optional)

# Edit the plist to match your username and paths
cp com.yourname.mlx-proxy.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/com.yourname.mlx-proxy.plist
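
The contents of the plist are not shown in this README. As a rough illustration of what a minimal LaunchAgent for the proxy might contain, the sketch below builds one with Python's standard plistlib. The function name `build_launch_agent`, the label, and the log locations are assumptions; `Label`, `ProgramArguments`, `RunAtLoad`, `KeepAlive`, `StandardOutPath`, and `StandardErrorPath` are standard launchd keys.

```python
import plistlib
from pathlib import Path


def build_launch_agent(label: str, python_bin: Path, script: Path) -> bytes:
    """Serialize a minimal LaunchAgent definition as XML plist bytes."""
    plist = {
        "Label": label,
        # Run the proxy with the interpreter from the virtual environment.
        "ProgramArguments": [str(python_bin), str(script)],
        "RunAtLoad": True,   # start when the agent is loaded (e.g. at login)
        "KeepAlive": True,   # restart the proxy if it exits
        # Log locations are an assumption; pick any writable path.
        "StandardOutPath": f"/tmp/{label}.log",
        "StandardErrorPath": f"/tmp/{label}.err",
    }
    return plistlib.dumps(plist)
```

Writing the returned bytes to `~/Library/LaunchAgents/<label>.plist` produces a file that `launchctl load` can read.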

Usage

List available models

curl http://localhost:8080/v1/models

{
  "object": "list",
  "data": [
    {
      "id": "mlx-community/Qwen3.5-9B-MLX-4bit",
      "object": "model",
      "path": "/path/to/LM_Studio_Models/mlx-community/Qwen3.5-9B-MLX-4bit",
      "loaded": false
    }
  ]
}
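
The response above suggests that model IDs follow an `<org>/<model>` layout under MODELS_ROOT. The sketch below shows one way such a listing could be produced. It is an illustration, not the proxy's actual discovery code: the function name `discover_models` is hypothetical, and treating a directory as a model only when it contains a `config.json` is an assumption.

```python
from pathlib import Path


def discover_models(models_root: Path) -> list[dict]:
    """Return one entry per <org>/<model> directory that contains a config.json."""
    entries = []
    # Assumption: models live exactly two levels below the root.
    for config in sorted(models_root.glob("*/*/config.json")):
        model_dir = config.parent
        entries.append({
            "id": f"{model_dir.parent.name}/{model_dir.name}",
            "object": "model",
            "path": str(model_dir),
            "loaded": False,  # a real proxy would compare against the loaded model
        })
    return entries
```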

Load a model

# Load with thinking disabled (recommended default)
curl -X POST http://localhost:8080/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"model_id": "mlx-community/Qwen3.5-9B-MLX-4bit", "enable_thinking": false}'

# Load with thinking enabled
curl -X POST http://localhost:8080/v1/models/load \
  -H "Content-Type: application/json" \
  -d '{"model_id": "mlx-community/Qwen3.5-9B-MLX-4bit", "enable_thinking": true}'

Chat (OpenAI-compatible)

curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "any",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

The model field can be any value; the proxy automatically replaces it with the currently loaded model.
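
How the proxy performs this replacement is not shown here. Conceptually it amounts to overwriting one key in the request body before forwarding it, as in this sketch (the function name `rewrite_model_field` is hypothetical):

```python
def rewrite_model_field(payload: dict, loaded_model: str) -> dict:
    """Return a copy of the request body whose "model" points at the loaded model."""
    rewritten = dict(payload)          # shallow copy; the caller's dict is untouched
    rewritten["model"] = loaded_model  # whatever the client sent is discarded
    return rewritten
```

Copying instead of mutating keeps the original request available for logging.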

Using with the OpenAI Python SDK

from openai import OpenAI
import requests

client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")

# Load a model first
requests.post("http://localhost:8080/v1/models/load", json={
    "model_id": "mlx-community/Qwen3.5-9B-MLX-4bit",
    "enable_thinking": False
})

# Chat (non-streaming)
response = client.chat.completions.create(
    model="any",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Streaming directly from the backend

Due to a known limitation (see Notes), streaming through the proxy fails with a prematurely closed connection. To stream, load the model via the proxy, then connect directly to the backend port:

import json
import requests

# Step 1: load via proxy (thinking control requires the proxy)
requests.post("http://localhost:8080/v1/models/load", json={
    "model_id": "mlx-community/Qwen3.5-9B-MLX-4bit",
    "enable_thinking": False
})

# Step 2: get the full model path from the proxy
health = requests.get("http://localhost:8080/health").json()
model_path = health["model_id"]  # e.g. "/Volumes/.../Qwen3.5-9B-MLX-4bit"

# Step 3: stream directly from the backend (mlx_lm.server, HTTP/1.0)
url = "http://127.0.0.1:18080/v1/chat/completions"
payload = {
    "model": model_path,
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": True,
}
with requests.post(url, json=payload, stream=True) as resp:
    for line in resp.iter_lines():
        if not line:
            continue
        line_str = line.decode() if isinstance(line, bytes) else line
        if not line_str.startswith("data: "):
            continue
        data_str = line_str[6:]
        if data_str.strip() == "[DONE]":
            break
        chunk = json.loads(data_str)
        content = chunk["choices"][0]["delta"].get("content", "")
        if content:
            print(content, end="", flush=True)

Unload the model

curl -X POST http://localhost:8080/v1/models/unload

Health check

curl http://localhost:8080/health

{
  "proxy": "ok",
  "backend": "ok",
  "model_id": "/path/to/Qwen3.5-9B-MLX-4bit",
  "enable_thinking": false
}

API Reference

| Method | Path | Description |
| --- | --- | --- |
| GET | `/v1/models` | List all available models |
| GET | `/v1/models/loaded` | Get currently loaded model info |
| POST | `/v1/models/load` | Load a model |
| POST | `/v1/models/unload` | Unload the current model |
| POST | `/v1/chat/completions` | Chat (OpenAI-compatible) |
| GET | `/health` | Health check |

Architecture

Client / External App
        ↓ :8080 (PROXY_PORT)
  mlx_proxy.py (FastAPI)
        ↓ :18080 (MLX_BACKEND_PORT)
  mlx_lm.server (MLX backend)
        ↓
  /path/to/LM_Studio_Models/

Model switching is achieved by restarting the backend process with the new model and chat-template-args.
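
To make the restart step concrete, the sketch below assembles a command line of the kind the proxy might launch for each load. It is an assumption-laden illustration: `build_backend_command` is a hypothetical name, binding to 127.0.0.1 is a guess, and the flag names (`--model`, `--host`, `--port`, `--chat-template-args`) should be checked against `mlx_lm.server --help` for the installed version.

```python
import json
from pathlib import Path


def build_backend_command(server_bin: Path, model_path: Path,
                          port: int, enable_thinking: bool) -> list[str]:
    """Build the argv for one backend process serving a single model."""
    return [
        str(server_bin),
        "--model", str(model_path),
        "--host", "127.0.0.1",   # assumption: backend is only reached via the proxy
        "--port", str(port),
        # The thinking toggle is passed through to the chat template as JSON.
        "--chat-template-args", json.dumps({"enable_thinking": enable_thinking}),
    ]
```

A proxy following this approach would terminate the running backend process and start a new one with the resulting list, for example via `subprocess.Popen`.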

Notes

Model switching restarts the backend, so allow a few seconds for the new model to load.
Only one model can be loaded at a time.
If the models live on an external SSD, the drive may not be mounted yet when the proxy starts at boot. Once it is mounted, re-trigger the proxy with: launchctl kickstart gui/$(id -u)/com.yourname.mlx-proxy
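
Since loading takes a few seconds, a client may want to poll the health endpoint before sending its first chat request. The sketch below is one way to do that. It assumes the health response reports `"backend": "ok"` and a non-empty `model_id` once the model is ready, as in the Health check example; what the proxy returns while loading is not documented here. The name `wait_until_ready` and the injected `fetch_health` callable are hypothetical.

```python
import time
from typing import Callable


def wait_until_ready(fetch_health: Callable[[], dict], timeout: float = 60.0,
                     interval: float = 0.5, sleep=time.sleep) -> dict:
    """Poll fetch_health() until the backend reports a loaded model or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            health = fetch_health()
            if health.get("backend") == "ok" and health.get("model_id"):
                return health
        except OSError:
            # Connection errors (including those raised by requests) mean
            # the server is not reachable yet; keep polling.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError("backend did not become ready in time")
        sleep(interval)
```

With requests installed, it could be called as `wait_until_ready(lambda: requests.get("http://localhost:8080/health", timeout=5).json())`.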

Streaming limitation

mlx_lm.server responds with HTTP/1.0: it does not use chunked transfer encoding and signals end-of-stream by closing the connection. When the proxy forwards this as a chunked HTTP/1.1 stream to the client, libraries such as requests raise ChunkedEncodingError: Response ended prematurely.

Workaround options:

| Use case | Recommendation |
| --- | --- |
| Non-streaming (simple) | Use the proxy endpoint normally; works fine |
| Streaming | Load via proxy, then stream directly from `http://127.0.0.1:18080` (see "Streaming directly from the backend") |
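
When streaming directly from the backend, the response is a sequence of server-sent event lines. The parsing can be factored into a small generator that is easy to test without a running server. This is a sketch under the assumption that each event is a `data: <json>` line in the OpenAI chat-completions chunk format, terminated by `data: [DONE]`; the name `iter_content` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text deltas contained in a stream of SSE lines."""
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end-of-stream marker
        choices = json.loads(data).get("choices") or []
        if not choices:
            continue
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content
```

With requests, the generator can consume `resp.iter_lines()` from a `stream=True` response, for example `for text in iter_content(resp.iter_lines()): print(text, end="", flush=True)`.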

Impact of `enable_thinking`

For thinking models (e.g. Qwen3.5), enable_thinking has a dramatic effect on latency:

| Setting | Typical total response time (9B model) |
| --- | --- |
| `enable_thinking: true` | 2–5 minutes (reasoning phase dominates) |
| `enable_thinking: false` | < 1 second |

Use enable_thinking: false (the default) for interactive use. Enable thinking only when deep reasoning is needed and latency is not a concern.
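
One practical consequence is that client timeouts need to match the loaded mode, since a limit that suits sub-second replies will cut off a multi-minute reasoning run. The sketch below picks a timeout from the `/health` response. The function name `request_timeout` and both numeric values are illustrative assumptions, chosen only to sit comfortably above the typical times quoted above.

```python
def request_timeout(health: dict) -> float:
    """Choose a client timeout in seconds from the proxy's health response."""
    if health.get("enable_thinking"):
        return 600.0  # assumption: generous headroom over the 2-5 minute range
    return 60.0       # assumption: ample for sub-second non-thinking replies
```

The result can be passed to the OpenAI SDK when constructing the client, for example `OpenAI(base_url="http://localhost:8080/v1", api_key="dummy", timeout=request_timeout(health))`.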

License

MIT
