This file documents the public API provided by the modelito package.
The package exposes a small, stable set of helpers and a Client-first chat
surface. The primary exports (also visible via from modelito import *) are:
- __version__: package version string.
- count_tokens(text: str) -> int: estimate token count (uses tiktoken if available).
- estimate_remote_timeout(model_name: Optional[str], input_tokens: int = 2048, concurrency: int = 1, with_source: bool = False) -> int | Tuple[int, Dict[str, Any]]: conservative timeout estimator. When with_source=True the function returns a (timeout_seconds, details_dict) tuple with diagnostic metadata.
- estimate_remote_timeout_details(model_name: Optional[str], input_tokens: int = 2048, concurrency: int = 1) -> Tuple[int, Dict[str, Any]]: diagnostic timeout estimator returning both timeout and computation details.
- OllamaConnector: small conversation history manager and prompt builder. Connectors now prefer typed Message/Response dataclasses and provide both sync (complete) and async (acomplete) surfaces in addition to the legacy send_sync helper.
- Embedder: small embeddings-only runtime wrapper that mirrors the provider-selection behavior of Client for callers that only need vector embeddings.
- Client: primary application entry point with chat(), chat_json(), chat_parsed(), stream(), and provider auto-selection support.
- Provider, SyncProvider, AsyncProvider, StreamingProvider, EmbeddingProvider, ChatProvider, RawChatProvider: structural provider protocols for legacy, chat-first, and raw OpenAI-compatible code.
- Message, Response, MessageInput, OpenAIMessageDict: message and response dataclasses / type helpers.
- ProviderStatus, check_provider_ready(), format_provider_status(): readiness diagnostics helpers for local and hosted providers.
- OpenAICompatibleHTTPProvider: shared HTTP base class for local OpenAI-compatible runtimes.
- OllamaProvider: HTTP-aware provider that will call a local Ollama HTTP API when available (via the bundled ollama_service helpers). If the HTTP API is not reachable it will attempt the Ollama CLI as a best-effort fallback (using run_ollama_command) before exposing a safe deterministic summarize() fallback useful for tests.
- OpenAIProvider: SDK-backed hosted OpenAI provider; can also target hosted OpenAI-compatible APIs via base_url.
- OMLXProvider: thin preset for local oMLX runtimes, built on OpenAICompatibleHTTPProvider.
- GeminiProvider, GrokProvider, ClaudeProvider: minimal provider shims with the legacy list_models()/summarize() surface.
- EmbeddingProvider: structural protocol for provider implementations that expose embed(texts, **kwargs).
- embed_texts(texts, dim=8) -> List[List[float]] and StubEmbeddingProvider: deterministic test-friendly embedding helpers.
- normalize_models(raw) -> List[Dict[str, Any]]: normalize provider model listings into dictionaries with an id field.
- normalize_metadata(raw) -> Dict[str, Any]: normalize provider metadata into a plain dictionary, wrapping scalar values when needed.
- load_config(path: str) -> dict: JSON/YAML loader for small config files.
- load_config_data(*paths) -> dict: merge multiple config files with later paths taking precedence; performs a deep merge of nested dicts and supports JSON/YAML parsing.
- parse_host_port(host_url: str) -> Tuple[str, int]: parse host:port or URL into (host, port).
- LLMProviderError: base exception used by connector/provider helpers.
- Ollama helpers: server_is_up, endpoint_url, ensure_ollama_running, get_ollama_binary, install_ollama, start_ollama, stop_ollama, update_ollama, list_local_models, list_remote_models, download_model, delete_model, serve_model, change_ollama_config, run_ollama_command, etc.

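The deep-merge behavior described for load_config_data can be pictured with a small stand-alone sketch. The function name deep_merge and the sample dictionaries are illustrative assumptions, not the package's actual implementation; only the "later values win, nested dicts are merged" rule comes from the description above.

```python
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values from `override` win over `base`."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are dicts: recurse instead of replacing wholesale.
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars, lists, and type mismatches: the later value wins.
            merged[key] = value
    return merged


# Hypothetical config contents, standing in for two parsed files.
defaults = {"ollama": {"host": "127.0.0.1", "port": 11434}, "timeout": 30}
local = {"ollama": {"port": 11500}}
print(deep_merge(defaults, local))
```

Merging more than two files would amount to folding this function over the parsed files in order.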
Some functionality is grouped into namespaced submodules to keep the primary
modelito namespace focused and stable. These helpers are part of the public
API and are safe to import directly from their namespace:
- Recording and Replay: from modelito.recording import RecordingProvider, ReplayProvider, CassetteFormatError, ReplayMissError
  - RecordingProvider: wraps any modelito provider and persists request/response pairs to a JSONL cassette file for offline testing.
  - ReplayProvider: reads a cassette file and returns stored responses without touching the network.
  - Both are zero-dependency and work entirely with stdlib, making them suitable for tests and examples.

Example usage:
from modelito.mock_provider import MockProvider
from modelito.messages import Message
from modelito.recording import RecordingProvider, ReplayProvider
# Record calls to a cassette
provider = RecordingProvider(
    wrapped=MockProvider(),
    cassette="tests/cassettes/demo.jsonl",
)
response = provider.summarize(messages=[Message(role="user", content="Hello")])
# Replay from cassette
replay = ReplayProvider(cassette="tests/cassettes/demo.jsonl")
cached_response = replay.summarize(messages=[Message(role="user", content="Hello")])

count_tokens(text: str) -> int
: Returns an estimated token count. If tiktoken is installed it uses a real
encoding; otherwise a conservative heuristic is used.
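
As a rough illustration of the "tiktoken if installed, heuristic otherwise" pattern, the sketch below tries the optional import and falls back to a character-based estimate. The name count_tokens_sketch, the cl100k_base encoding choice, and the four-characters-per-token ratio are assumptions made for this example; the package's own heuristic may differ.

```python
def count_tokens_sketch(text: str) -> int:
    """Estimate tokens: tiktoken when usable, else a character heuristic."""
    if not text:
        return 0
    try:
        import tiktoken  # optional dependency

        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except Exception:
        # Assumed fallback: about 4 characters per token, rounded up.
        return max(1, -(-len(text) // 4))


print(count_tokens_sketch("Summarize: Hello world"))
```

Catching a broad exception keeps the helper usable when tiktoken is missing or cannot load its encoding data.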
estimate_remote_timeout(model_name: Optional[str], input_tokens: int = 2048, concurrency: int = 1, with_source: bool = False) -> int | Tuple[int, Dict[str, Any]]
: Returns an integer number of seconds to use as a conservative request timeout
for remote LLM calls. Reads a small catalog shipped in modelito/data and
applies family/keyword multipliers when present. For a diagnostic breakdown
use estimate_remote_timeout_details(...); a CLI wrapper is available at
modelito.timeout_cli, and a small calibration harness lives at
modelito.timeout_calibrate.
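
To make the catalog-plus-multipliers idea concrete, here is a minimal sketch under stated assumptions. The catalog layout, the band limits, the multiplier values, and the function name estimate_timeout_sketch are all hypothetical; the data shipped in modelito/data and the real computation may look quite different.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical catalog; the real data shipped in modelito/data may differ.
CATALOG: Dict[str, Any] = {
    "bands": [
        {"max_input_tokens": 1024, "base_seconds": 30},
        {"max_input_tokens": 8192, "base_seconds": 90},
        {"max_input_tokens": None, "base_seconds": 240},  # catch-all band
    ],
    "keyword_multipliers": {"70b": 3.0, "13b": 1.5},
}


def estimate_timeout_sketch(
    model_name: Optional[str], input_tokens: int = 2048, concurrency: int = 1
) -> Tuple[int, Dict[str, Any]]:
    # Pick the first band whose token ceiling covers the request.
    band = next(
        b for b in CATALOG["bands"]
        if b["max_input_tokens"] is None or input_tokens <= b["max_input_tokens"]
    )
    # Apply the first keyword multiplier found in the model name, if any.
    name = (model_name or "").lower()
    keyword, multiplier = None, 1.0
    for candidate, factor in CATALOG["keyword_multipliers"].items():
        if candidate in name:
            keyword, multiplier = candidate, factor
            break
    timeout = int(band["base_seconds"] * multiplier * max(1, concurrency))
    details = {"band": band, "keyword": keyword, "multiplier": multiplier}
    return timeout, details


timeout, info = estimate_timeout_sketch("llama-2-70b", input_tokens=1000)
print(timeout, info["keyword"])
```

Returning the matched band and multiplier alongside the number mirrors the diagnostic tuple described for the real function.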
You can request diagnostic details about how a timeout was computed by
calling the function with with_source=True. This returns a tuple
(timeout_seconds, source_dict) where source_dict contains the matched
catalog band, any model overrides, multipliers and other metadata useful
for debugging and calibrating timeouts. Example:
from modelito import estimate_remote_timeout
timeout, info = estimate_remote_timeout("llama-2-70b", input_tokens=1000, concurrency=1, with_source=True)
print(timeout)
print(info)

OllamaConnector(provider, shared_history: bool = False, system_message_file: Optional[str] = None, max_history_messages: int = 20, max_history_tokens: Optional[int] = None)
: Lightweight stateful connector that manages per-conversation histories and
prepares messages lists suitable for provider .summarize() calls.
Important OllamaConnector methods:
- clear_history(conv_id: Optional[str] = None) -> None
- set_system_message(text: Optional[str]) -> None
- add_to_history(conv_id: Optional[str], role: str, content: str) -> None
- get_history(conv_id: Optional[str]) -> List[Dict[str, str]]
- build_prompt(conv_id: Optional[str], new_messages: Optional[List[Dict[str, str]]] = None, include_history: bool = True, max_prompt_tokens: Optional[int] = None) -> List[Dict[str, str]]
- send_sync(conv_id: Optional[str], new_messages: List[Dict[str, str]], settings: Optional[dict] = None) -> str: convenience helper that builds the prompt, calls provider.summarize(messages, settings=settings) and updates local history (returns str).
- complete(conv_id: Optional[str], new_messages: Optional[Iterable] = None, settings: Optional[dict] = None) -> Response: typed convenience wrapper returning a Response dataclass.
- acomplete(conv_id: Optional[str], new_messages: Optional[Iterable] = None, settings: Optional[dict] = None) -> Response: asynchronous variant.

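The interplay between add_to_history, max_history_messages, and build_prompt can be sketched with a tiny stand-in class. HistorySketch is hypothetical and deliberately simpler than the connector: it ignores system messages and token budgets, and its trimming rule (drop the oldest entries) is an assumption.

```python
from typing import Dict, List, Optional


class HistorySketch:
    """Per-conversation history capped at `max_history_messages` entries."""

    def __init__(self, max_history_messages: int = 20) -> None:
        self.max_history_messages = max_history_messages
        self._histories: Dict[Optional[str], List[Dict[str, str]]] = {}

    def add_to_history(self, conv_id: Optional[str], role: str, content: str) -> None:
        history = self._histories.setdefault(conv_id, [])
        history.append({"role": role, "content": content})
        # Drop the oldest entries once the cap is exceeded.
        del history[: max(0, len(history) - self.max_history_messages)]

    def build_prompt(self, conv_id, new_messages=None, include_history=True):
        # Stored history first (if requested), then the new messages.
        past = self._histories.get(conv_id, []) if include_history else []
        return list(past) + list(new_messages or [])


h = HistorySketch(max_history_messages=2)
for text in ("a", "b", "c"):
    h.add_to_history("demo", "user", text)
prompt = h.build_prompt("demo", [{"role": "user", "content": "d"}])
print([m["content"] for m in prompt])
```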
Provider adapters implement the small provider surfaces used by the connectors. Implementations may choose to support the sync summarize() surface, the async acomplete() surface, streaming, chat(), raw OpenAI-compatible passthrough, and/or embeddings. The core convenience methods are:
- list_models() -> List[str]: best-effort model enumeration (may be an empty list in offline mode).
- summarize(messages, settings: Optional[dict] = None) -> str: synchronous completion surface.
- acomplete(messages, settings: Optional[dict] = None) -> str: asynchronous completion surface (optional).
- stream(messages, settings: Optional[dict] = None) -> Iterable[str]: streaming generator (optional).
- chat(messages, settings: Optional[dict] = None) -> Response: structured response surface with metadata when supported.
- raw_complete(payload: dict[str, Any]) -> dict[str, Any]: OpenAI-compatible completion passthrough that preserves tool calls and arbitrary fields when supported.
- raw_stream(payload: dict[str, Any]) -> Iterable[dict[str, Any]]: OpenAI-compatible streaming passthrough that yields parsed JSON chunks when supported.
- embed(texts: Iterable[str], **kwargs) -> List[List[float]]: embeddings surface (optional).

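Because the provider surfaces are structural, a class can satisfy one without inheriting from anything. The sketch below shows the general technique with typing.Protocol; SyncProviderSketch and EchoProvider are illustrative names, not the package's own protocol definitions.

```python
from typing import Iterable, List, Optional, Protocol, runtime_checkable


@runtime_checkable
class SyncProviderSketch(Protocol):
    def list_models(self) -> List[str]: ...
    def summarize(self, messages: Iterable, settings: Optional[dict] = None) -> str: ...


class EchoProvider:
    """Toy provider: no inheritance needed, only matching method names."""

    def list_models(self) -> List[str]:
        return ["echo-1"]

    def summarize(self, messages, settings=None) -> str:
        # Return the content of the last message unchanged.
        last = list(messages)[-1]
        return last["content"] if isinstance(last, dict) else str(last)


provider = EchoProvider()
print(isinstance(provider, SyncProviderSketch))
print(provider.summarize([{"role": "user", "content": "hi"}]))
```

Note that a runtime isinstance check against a protocol only verifies that the methods exist, not that their signatures match.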
Embedder(provider: str | EmbeddingProvider = "openai", model: Optional[str] = None, **kwargs)
: Small embeddings-only runtime selector. It resolves a named embedder from
modelito.provider_registry and exposes the narrow embed() surface.
Important Embedder methods and attributes:
- embed(texts: Iterable[str], **kwargs) -> List[List[float]]
- provider_name -> str
- available_embedders() -> List[str]

Registry helpers:
from modelito.provider_registry import get_embedder, list_embedders
Example:
from modelito import Embedder
embedder = Embedder(provider="mock")
vectors = embedder.embed(["one", "two"])
print(vectors)
print(Embedder.available_embedders())

Providers may stream outputs at different granularities; modelito normalizes
these into a simple incremental stream() generator that yields str pieces.
Typical provider streaming shapes:
- Token-level: SDKs may provide token deltas. Modelito yields these as short text fragments suitable for concatenation.
- Chunk-level: Providers that emit logical chunks or JSON events are parsed and the textual payload is yielded as chunks.
- Line-delimited / SSE: HTTP services (e.g., Ollama /api/generate) may send newline-delimited JSON/SSE frames; modelito reads and normalizes these to textual chunks.
The stream(messages, settings=None) generator returns an iterable of
str fragments which, when concatenated, form the final response. Offline
fallbacks emit a single full-text chunk.
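
A minimal sketch of the line-delimited case, assuming frames shaped like those of Ollama's /api/generate (a response text field and a done flag). The helper name iter_text_chunks and the sample frames are made up for illustration and do not show modelito's internal parser.

```python
import json
from typing import Iterable, Iterator


def iter_text_chunks(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from newline-delimited JSON frames."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip keep-alive blank lines
        frame = json.loads(line)
        text = frame.get("response", "")
        if text:
            yield text
        if frame.get("done"):
            break  # final frame reached


# Hand-written sample frames standing in for an HTTP response body.
frames = [
    '{"response": "Hello", "done": false}',
    '{"response": " world", "done": false}',
    '{"response": "", "done": true}',
]
print("".join(iter_text_chunks(frames)))
```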
Client.chat_json(messages, schema=None, settings=None, strict_schema=False) -> dict
: Request structured JSON output and return a parsed dict; optionally apply
key-presence schema checks and stricter runtime validation when
strict_schema=True.
Client.chat_parsed(messages, schema, settings=None, strict_schema=True) -> Any
: Request structured JSON output and return a parsed schema object when
supported (dataclass or Pydantic-style model hooks).
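
The idea behind key-presence checks and parsed schema objects can be illustrated without a model in the loop. In this sketch the Weather dataclass, the parse_into helper, and the raw JSON string are all hypothetical; it only shows one plausible way to validate keys before building a dataclass instance.

```python
import json
from dataclasses import dataclass, fields
from typing import Any, Type, TypeVar

T = TypeVar("T")


@dataclass
class Weather:  # hypothetical schema used only for this example
    city: str
    temperature_c: float


def parse_into(schema: Type[T], raw_text: str) -> T:
    data: Any = json.loads(raw_text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    required = [f.name for f in fields(schema)]
    missing = [name for name in required if name not in data]
    if missing:
        # Key-presence check: fail before constructing the object.
        raise ValueError(f"missing keys: {missing}")
    return schema(**{name: data[name] for name in required})


print(parse_into(Weather, '{"city": "Oslo", "temperature_c": 4.5}'))
```

This checks only that keys exist; stricter validation of value types would be a separate step.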
The raw_complete() and raw_stream() methods enable direct passthrough of
OpenAI-compatible request payloads to supported providers. These methods preserve
all request fields (including tools, tool_choice, response_format, etc.)
and return raw OpenAI-compatible response dicts or streams without transformation.
Availability: OpenAIProvider, OMLXProvider, OpenAICompatibleHTTPProvider,
and OllamaProvider support raw passthrough.
raw_complete(payload: dict[str, Any]) -> dict[str, Any]
: Send a raw OpenAI-compatible request payload and return the complete response
dict. All standard OpenAI request fields are preserved: model, messages,
tools, tool_choice, temperature, max_tokens, etc.
Raises ModelitoBadResponseError or ModelitoConnectionError in strict mode;
non-strict mode returns a deterministic fallback response dict.
raw_stream(payload: dict[str, Any]) -> Iterable[dict[str, Any]]
: Send a raw OpenAI-compatible request with stream=True and yield parsed JSON
chunks from the Server-Sent Events (SSE) stream. The generator yields dict
objects with keys like choices, delta, etc., stopping at the [DONE] marker.
Raises on malformed events in strict mode; non-strict mode yields fallback chunks.
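
The SSE handling described for raw_stream() can be sketched as a small parser over already-decoded lines. The function name iter_sse_chunks and the sample events are assumptions for illustration; a real client would also deal with multi-line data fields and network errors.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield parsed JSON payloads from `data:` lines until `[DONE]`."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # ignore blank separators, comments, other fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return
        yield json.loads(payload)


# Hand-written sample events; anything after [DONE] is never parsed.
events = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "!"}}]}',
    "",
    "data: [DONE]",
    'data: {"choices": [{"delta": {"content": "ignored"}}]}',
]
text = "".join(c["choices"][0]["delta"].get("content", "") for c in iter_sse_chunks(events))
print(text)
```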
Tool preservation: These methods are designed to preserve tool definitions and
function calling metadata. When using OllamaProvider with modelito-serve, you can
forward OpenAI-compatible tool-calling requests directly to the underlying Ollama
instance via /v1/chat/completions, enabling tool-calling workflows with local models.
Example with OllamaProvider:
from modelito import OllamaProvider
provider = OllamaProvider(model="llama3.2")
# Raw passthrough preserves tools and tool_choice
payload = {
    "model": "llama3.2",
    "messages": [
        {"role": "user", "content": "What is the weather?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Get the weather for a location",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "location": {"type": "string"}
                    }
                }
            }
        }
    ],
    "tool_choice": "auto"
}
response = provider.raw_complete(payload)
# response["choices"][0]["message"]["tool_calls"] contains function calls if generated

The ollama_service module contains a number of small helpers to interact with
the Ollama CLI and HTTP API. The most commonly used helpers are:
- endpoint_url(host: str, port: int, path: str = "/api/generate") -> str
- server_is_up(host: str, port: int) -> bool
- ensure_ollama_running(host: str = "http://127.0.0.1", port: int = 11434, auto_start: bool = False, start_args: Optional[list] = None, timeout: float = 10.0) -> bool
- get_ollama_binary() -> Optional[str]
- list_local_models() -> List[str], list_remote_models() -> List[str], and list_remote_model_catalog(query: Optional[str] = None) -> List[RemoteModelCatalogEntry]
- download_model(model_name: str) -> bool, download_model_progress(model_name: str) -> Iterable[ModelLifecycleState], and delete_model(model_name: str) -> bool
- serve_model(model_name: Optional[str] = None, start_args: Optional[list] = None, timeout: float = 10.0) -> bool
- ensure_model_available(model_name: str, allow_download: bool = False, timeout: float = 600.0) -> bool: convenience helper to ensure a model is present locally, optionally downloading it.
- ensure_model_ready(model_name: str, host: str = "http://127.0.0.1", port: int = 11434, auto_start: bool = False, allow_download: bool = False, timeout: float = 120.0) -> bool: ensure a specific model is downloaded, warmed, and responsive.
- ensure_model_ready_detailed(model_name: str, host: str = "http://127.0.0.1", port: int = 11434, auto_start: bool = False, allow_download: bool = False, timeout: float = 120.0) -> ReadinessResult: like ensure_model_ready() but returns a structured ReadinessResult object with success status, lifecycle phase, message, source, elapsed_seconds, and error details for cleaner UI integration.
- get_model_lifecycle_state(model_name: str) -> Optional[ModelLifecycleState], list_model_lifecycle_states() -> Dict[str, ModelLifecycleState], and clear_model_lifecycle_state(model_name: str) -> bool: inspect or reset the in-memory per-model lifecycle tracker.
- Async wrappers: async_preload_model, async_list_local_models, async_list_remote_models, async_download_model, async_delete_model, async_serve_model, async_ensure_model_available, async_ensure_model_ready, async_ensure_model_ready_detailed. These are simple asyncio-friendly wrappers that run the synchronous helpers in an executor.
- change_ollama_config(config: dict, config_path: Optional[str] = None) -> bool

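The "run the synchronous helper in an executor" pattern used by the async wrappers is sketched below. The stub list_local_models_stub and the generic run_in_executor helper are hypothetical stand-ins, so the example runs without an Ollama installation.

```python
import asyncio
import functools
import time
from typing import Any, Callable


def list_local_models_stub() -> list:
    """Stand-in for a blocking helper; the real one would call Ollama."""
    time.sleep(0.01)
    return ["llama3.2"]


async def run_in_executor(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, functools.partial(func, *args, **kwargs))


async def async_list_local_models_sketch() -> list:
    return await run_in_executor(list_local_models_stub)


print(asyncio.run(async_list_local_models_sketch()))
```

Wrapping the call in functools.partial lets keyword arguments pass through, since run_in_executor itself only forwards positional ones.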
The module exposes a few additional convenience helpers and CLI entrypoints useful for diagnostics and local workflows:
- detect_install_method(platform_name: Optional[str] = None) -> str: pick the preferred install backend (brew, apt, choco, or script-based fallback) for the current platform.
- pull_model(model_name: str, timeout: float = 600.0) -> bool: convenience wrapper for download_model.
- preload_model(url: str, port: int, model: str, timeout: float = 120.0) -> None: warm a model via the HTTP API.
- load_remote_timeout_catalog(path: Optional[Path] = None) -> dict: load the timeout catalog (falls back to the bundled catalog).
- common_model_timeout(model_name: str) -> Optional[float]: returns a conservative timeout in seconds for a given model.

The detect_install_method() and install_ollama() helpers implement a
platform-aware preference order to ensure consistent behavior across environments:
- macOS: brew (if available) → script-based fallback
- Linux: apt (if available) → script-based fallback
- Windows: choco (if available) → PowerShell-based fallback

This policy prefers the most commonly available package manager on each
platform. To override it, pass the method explicitly, for example
install_ollama(method="script"). Otherwise detect_install_method() returns
the best-effort choice for your platform.
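
The preference order can be expressed as a small lookup, sketched here under assumptions. The function name detect_install_method_sketch, the injectable has_command callback, and the single "script" label for both the script and PowerShell fallbacks are illustrative choices, not the package's implementation.

```python
import shutil
import sys
from typing import Callable, Optional

# Preferred package manager per platform prefix, as listed above.
PREFERRED = {"darwin": "brew", "linux": "apt", "win": "choco"}


def detect_install_method_sketch(
    platform_name: Optional[str] = None,
    has_command: Callable[[str], bool] = lambda name: shutil.which(name) is not None,
) -> str:
    platform_name = (platform_name or sys.platform).lower()
    for prefix, manager in PREFERRED.items():
        if platform_name.startswith(prefix) and has_command(manager):
            return manager
    # Assumed label for the script or PowerShell fallback.
    return "script"


# Simulate a Mac with Homebrew and a Linux box without apt.
print(detect_install_method_sketch("darwin", lambda name: name == "brew"))
print(detect_install_method_sketch("linux", lambda name: False))
```

Passing the command probe as a parameter keeps the decision logic testable without touching the host system.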
For higher-level tooling, ollama_service now exposes two small dataclasses:
- RemoteModelCatalogEntry: structured remote catalog item with name, family, tag, installed, and raw fields.
- ModelLifecycleState: in-memory state snapshot with phase, message, progress, error, and updated_at fields.

modelito exposes several small CLIs useful during development:
- python -m modelito doctor: diagnose provider readiness and report setup hints.
- modelito-serve: optional OpenAI-compatible server (/v1/models, /v1/chat/completions, /v1/embeddings) requiring pip install "modelito[serve]".
- python -m modelito.ollama_service: minimal Ollama lifecycle CLI (start, stop, install, inspect, pull, list-local, list-remote, version).
- python -m modelito.timeout_cli: print estimated timeouts and diagnostic details for a model.
- python -m modelito.timeout_calibrate: write calibration prompts and (optionally) exercise a local Ollama server to collect timing samples.

Use the OllamaConnector together with a provider shim for local tests:
from modelito import OllamaProvider, OllamaConnector
from modelito.messages import Message
provider = OllamaProvider()
conn = OllamaConnector(provider=provider)
resp = conn.send_sync(conv_id="example", new_messages=[Message(role="user", content="Summarize: Hello world")])
print(resp)

RecordingProvider(wrapped, cassette: str | Path)
: Wraps any modelito provider and persists each call to a JSONL cassette file.
Returns the wrapped provider's result unchanged, so it is a pure passthrough
with side-effect persistence only. Normalises str, dict, and Message
inputs — and exhausts generators exactly once — before delegating to the
wrapped provider.
ReplayProvider(cassette: str | Path, *, strict: bool = True, model: str | None = None, strict_cassette: bool = True)
: Reads a JSONL cassette written by RecordingProvider and returns stored
responses without touching the network or any model runtime. By default
replay is model-agnostic (model=None): records are matched by
kind + messages + settings, ignoring the recorded provider's model name.
Pass model="..." for exact model-aware lookup.
CassetteFormatError(path, line_number, line)
: Raised when a cassette file contains malformed JSON. Attributes: path,
line_number, line.
ReplayMissError(kind, request_hash)
: Raised by ReplayProvider (in strict mode) when no cassette record matches
the request. Attributes: kind, request_hash.
V1 scope: supports list_models(), summarize(), and chat() only.
stream() and embed() raise NotImplementedError. The implementation uses
only the standard library; no additional package is required.
Example::
from modelito import Message
from modelito.mock_provider import MockProvider
from modelito.recording import RecordingProvider, ReplayProvider
# Record
p = RecordingProvider(wrapped=MockProvider(), cassette="/tmp/demo.jsonl")
p.summarize([Message(role="user", content="hello")])
# Replay offline
r = ReplayProvider(cassette="/tmp/demo.jsonl")
print(r.summarize([Message(role="user", content="hello")]))
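
To show how a JSONL cassette with model-agnostic matching might work in principle, here is a stdlib-only sketch. The record layout, the SHA-256 request hash, and the function names request_hash, record, and replay are assumptions for illustration; the actual cassette format written by RecordingProvider is not documented here and may differ.

```python
import hashlib
import json
import os
import tempfile
from typing import Any, Dict, List, Optional


def request_hash(kind: str, messages: List[Dict[str, str]],
                 settings: Optional[dict] = None, model: Optional[str] = None) -> str:
    # Canonical JSON so equal requests hash equally; model=None keeps it model-agnostic.
    key = {"kind": kind, "messages": messages, "settings": settings or {}}
    if model is not None:
        key["model"] = model
    blob = json.dumps(key, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


def record(path: str, kind: str, messages, response: Any) -> None:
    entry = {"hash": request_hash(kind, messages), "kind": kind, "response": response}
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")  # one JSON object per line


def replay(path: str, kind: str, messages) -> Any:
    wanted = request_hash(kind, messages)
    with open(path, encoding="utf-8") as fh:
        for line_number, line in enumerate(fh, start=1):
            try:
                entry = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: malformed cassette line") from exc
            if entry["hash"] == wanted:
                return entry["response"]
    raise KeyError(f"no record for {kind} {wanted[:12]}")


cassette = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
msgs = [{"role": "user", "content": "hello"}]
record(cassette, "summarize", msgs, "hi there")
print(replay(cassette, "summarize", msgs))
```

The ValueError and KeyError here play the roles that CassetteFormatError and ReplayMissError play in the real module.
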
- The package intentionally keeps provider shims minimal; they are primarily intended for tests and simple local workflows.
- For production usage you should replace provider shims with real SDK-backed implementations that implement the same list_models()/summarize() surface.
- Static model metadata in modelito.model_metadata is best-effort fallback data. Prefer provider-reported model information when available.
- Unknown metadata fields are intentionally represented as None, and static metadata should not be treated as authoritative for safety-critical routing.

Unified Provider Abstraction:
- All providers (OpenAI, Anthropic, Google, Ollama, etc.) accessed via a consistent interface.
- Runtime provider/model switching: from modelito.provider_registry import get_provider, list_providers.
- Runtime embedder selection: from modelito.provider_registry import get_embedder, list_embedders.

Local Model Management:
- Auto-discovery and health checks for local models (Ollama, etc.): LocalModelManager.
- Dynamic model selection without restart.

API Key Management:
- Secure, user-friendly API key management: APIKeyManager.
- Supports environment variable overrides and config files.
- Validation and error reporting.

Streaming & Partial Results:
- All streaming-capable providers expose a stream() method for incremental results.
- See StreamingProvider protocol.

Error Handling & Diagnostics:
- Standardized error messages and diagnostics: see modelito.errors.
- Structured error objects for troubleshooting.

Model Capabilities Discovery:
- Expose model metadata (context window, function/tool support, etc.): get_model_metadata().

Testing & Mocking:
- Built-in mock mode for testing/CI/offline: MockProvider.

Performance & Caching:
- Optional in-memory response caching: ResponseCache.
- Batching utilities for embeddings and batchable operations: batch_iterable.

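As an illustration of what a batching utility for embeddings typically does, the sketch below splits any iterable into fixed-size lists. The name batch_iterable_sketch and its (items, batch_size) signature are assumptions; the signature of the package's batch_iterable is not stated in this document.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batch_iterable_sketch(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    # islice pulls at most batch_size items; an empty batch means exhaustion.
    while batch := list(islice(iterator, batch_size)):
        yield batch


print(list(batch_iterable_sketch(["a", "b", "c", "d", "e"], 2)))
```

Each yielded batch could then be passed to an embed() call, keeping request sizes bounded.
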
See the tests/ directory for usage examples and coverage for all features.