Merged

Commits (17)

- `7b6aa72` improvement(resolver): use context variables for block outputs in fun… (octo-patch, May 6, 2026)
- `1989a12` improvement(func-exec): normalize inputs to match schema (#4473) (icecrasher321, May 6, 2026)
- `a945399` feat(models): add grok-4.3 (#4472) (waleedlatif1, May 6, 2026)
- `5d38222` fix(function): validate custom tool param keys before code interpolat… (waleedlatif1, May 6, 2026)
- `ad88859` chore(skills): add /add-model and /validate-model commands (#4475) (waleedlatif1, May 6, 2026)
- `4833145` chore(deps): upgrade next.js to 16.2.4 (#4460) (waleedlatif1, May 6, 2026)
- `6d4ffff` fix(agiloft): correct response parsing, add EWGetChoiceLineId tool (#… (waleedlatif1, May 6, 2026)
- `bfd0f46` improvement(next): bundle and CI cache config (#4478) (waleedlatif1, May 6, 2026)
- `79ffccc` feat(emailbison): block, tools, sharepoint v2 block with cleaner code… (icecrasher321, May 6, 2026)
- `369f9b6` fix(office-excel): support Office.js add-in embed and surface Graph e… (waleedlatif1, May 6, 2026)
- `690b7ab` improvement(seo): restore explicit AI/search bot allow-list and add l… (waleedlatif1, May 6, 2026)
- `00c424b` improvement(executor): reserved keyword errors (#4482) (icecrasher321, May 6, 2026)
- `7953c56` fix(security): xlsx CVE bump and bundled security hardening (#4481) (waleedlatif1, May 6, 2026)
- `a251e45` feat(sap): add SAP Concur integration block and SAP S/4HANA validatio… (waleedlatif1, May 7, 2026)
- `28f1912` feat(files): zoom controls for inline mermaid and images in markdown … (waleedlatif1, May 7, 2026)
- `28e60bf` fix(docker): drop scripts/ from workspaces array (#4484) (waleedlatif1, May 7, 2026)
- `76d602f` fix(workday): correct SOAP service routing and reference types (#4485) (waleedlatif1, May 7, 2026)
159 changes: 159 additions & 0 deletions .claude/commands/add-model.md
@@ -0,0 +1,159 @@
---
description: Add a new LLM model to apps/sim/providers/models.ts with specs verified against the provider's live API docs (no hallucination)
argument-hint: <provider> <model-id> [docs-url]
---

# Add Model Skill

You add a new model entry to `apps/sim/providers/models.ts`. **Every numeric and capability claim MUST be derived from a live web fetch of the provider's official docs in this session.** Marketing emails, training data, and your prior knowledge are not sources of truth — they routinely hallucinate pricing, context windows, and capability lists.

## Hard rules (do not skip)

1. **Live-fetch or refuse.** Before writing the entry, you must successfully WebFetch the provider's official models/pricing page in this session. If you cannot reach an authoritative source for any field, **mark the field as UNVERIFIED in your report and ask the user before guessing**. Never fill in pricing or capabilities from memory.
2. **Two-source rule for pricing.** Cross-check input/output/cached pricing against at least one secondary source (OpenRouter, Artificial Analysis, CloudPrice, mem0, intuitionlabs). If sources disagree, the provider's own docs win — but flag the disagreement.
3. **Read the code before setting capability flags.** Capability flags are dead unless the provider's implementation under `apps/sim/providers/{provider}/` actually consumes them (see Consumption Matrix below). Setting a flag the provider ignores is a silent bug.
4. **Cite every fact.** Your final report must list the URL each value came from. No URL → not verified.

## Your Task

1. Identify provider and model id from user args
2. Live-fetch official docs + pricing page + capability/parameter pages + at least one secondary source
3. Apply the Consumption Matrix to know which capability flags are real
4. Read 2-3 sibling entries in `models.ts` and match their pattern exactly
5. Insert the entry, run `bun run lint`, print the verification report

## Step 1: Live source-of-truth lookup

In priority order — fetch all that exist for the provider:

| Provider | Models index | Pricing | Reasoning/parameter caveats |
|---|---|---|---|
| OpenAI | platform.openai.com/docs/models | openai.com/api/pricing | platform.openai.com/docs/guides/reasoning |
| Anthropic | docs.anthropic.com/en/docs/about-claude/models | anthropic.com/pricing | docs.anthropic.com/en/docs/build-with-claude/extended-thinking |
| Google (Gemini) | ai.google.dev/gemini-api/docs/models | ai.google.dev/pricing | ai.google.dev/gemini-api/docs/thinking |
| xAI | docs.x.ai/developers/models | docs.x.ai/developers/models (per-model detail page) | docs.x.ai/developers/model-capabilities/text/reasoning |
| Mistral | docs.mistral.ai/getting-started/models/models_overview | mistral.ai/pricing | n/a |
| DeepSeek | api-docs.deepseek.com/quick_start/pricing | same | api-docs.deepseek.com/guides/reasoning_model |
| Groq | console.groq.com/docs/models | groq.com/pricing | n/a |
| Cerebras | inference-docs.cerebras.ai/models | cerebras.ai/pricing | n/a |

Secondary verification (use at least one): `openrouter.ai/<provider>/<model>`, `artificialanalysis.ai/models/<model>`, `cloudprice.net/models/<provider>-<model>`.

Use a precise WebFetch prompt: *"Extract for {model_id}: exact model id string, context window in tokens, input price per 1M, cached input price per 1M, output price per 1M, max output tokens, supported reasoning effort levels, accepted parameters (temperature, top_p), release date. Do not fill in fields you cannot find."*

## Step 2: Consumption Matrix (which provider honors which capability)

| Capability | Honored by | Effect if set elsewhere |
|---|---|---|
| `temperature` | All providers (passed through if set) | Safe but inert on always-reasoning models that reject it |
| `toolUsageControl` | All providers (provider-level, not per-model) | n/a — set on `ProviderDefinition`, not models |
| `reasoningEffort` | `openai/core.ts`, `azure-openai`, `anthropic/core.ts` (mapped to thinking), `gemini/core.ts` | **Dead on xai, deepseek, mistral, groq, cerebras, openrouter, fireworks, bedrock, vertex** unless their core consumes it — re-grep before assuming |
| `verbosity` | `openai/core.ts`, `azure-openai/index.ts` only | Dead elsewhere |
| `thinking` | `anthropic/core.ts`, `gemini/core.ts` | Dead elsewhere |
| `nativeStructuredOutputs` | `anthropic/core.ts`, `fireworks/index.ts`, `openrouter/index.ts` | Dead on openai, xai, google, vertex, bedrock, azure-openai, deepseek, mistral, groq, cerebras |
| `maxOutputTokens` | Read by UI + executor for token estimation | Always meaningful — set if provider documents a cap |
| `computerUse` | `anthropic/core.ts` | Dead elsewhere |
| `deepResearch` | UI flag for routing to deep-research SKUs | Set only on actual deep-research model IDs |
| `memory: false` | Conversation persistence opt-out | Set only when model genuinely cannot maintain history (e.g., deep-research) |

**Always re-grep before relying on this table** — the codebase moves:

```bash
rg "reasoningEffort|reasoning_effort" apps/sim/providers/<provider>/
rg "verbosity" apps/sim/providers/<provider>/
rg "request\.thinking|thinking:" apps/sim/providers/<provider>/
rg "supportsNativeStructuredOutputs|nativeStructuredOutputs" apps/sim/providers/<provider>/
```

## Step 3: Match the provider's existing entry pattern

Open `apps/sim/providers/models.ts`, find `PROVIDER_DEFINITIONS[<provider>].models`, read 2-3 sibling entries. Match field order exactly:

```ts
{
  id: '<exact-api-id>',
  pricing: {
    input: <number>,
    cachedInput: <number>, // omit if provider doesn't offer caching
    output: <number>,
    updatedAt: '<today YYYY-MM-DD>',
  },
  capabilities: {
    // only flags the provider actually consumes — see matrix
  },
  contextWindow: <tokens>,
  releaseDate: '<YYYY-MM-DD>',
  recommended: true, // only if new flagship; ask user before swapping
  speedOptimized: true, // only on smallest/fastest tier
  deprecated: true, // only on retired models
}
```

### Reseller providers (azure-openai, azure-anthropic, vertex, bedrock, openrouter)

Model id MUST be prefixed: `azure/`, `azure-anthropic/`, `vertex/`, `bedrock/`, `openrouter/`. Pricing usually mirrors the upstream provider but verify on the reseller's own pricing page.
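
A minimal sketch of what a reseller entry looks like; the id, prices, and dates below are illustrative placeholders, not verified values:

```ts
// Hypothetical azure-openai entry. Note the reseller prefix on the id.
// Every number here is a placeholder; verify on the reseller's own pricing page.
{
  id: 'azure/gpt-5.2',
  pricing: {
    input: 1.25,
    output: 10, // often mirrors upstream OpenAI, but verify independently
    updatedAt: '2026-05-07',
  },
  contextWindow: 400000,
  releaseDate: '2025-12-01',
}
```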

### Insertion order

Within a family, newest first (matches existing convention: GPT-5.5 above GPT-5.4 above GPT-5.2). Across families, biggest/flagship at top of list.
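
As a sketch of the resulting order (family layout hypothetical, fields elided):

```ts
// Newest first within a family; flagship family at the top of the list.
models: [
  { id: 'gpt-5.5' /* ... */ },      // current flagship, newest
  { id: 'gpt-5.4' /* ... */ },
  { id: 'gpt-5.2' /* ... */ },
  { id: 'gpt-5.5-nano' /* ... */ }, // smaller/faster tier follows its family
]
```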

### `recommended` / `speedOptimized`

- At most one or two `recommended: true` per provider — the current flagship(s).
- If you're adding a new flagship, ask the user before removing `recommended` from the previous flagship. Never silently flip it (see the sketch below this list).
- `speedOptimized: true` only on the smallest/fastest tier (nano, flash-lite, haiku class).
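
A hypothetical flagship swap for illustration (`grok-4.2` is a made-up previous flagship); the second line is the change that needs explicit user sign-off:

```ts
{ id: 'grok-4.3', recommended: true /* ... */ }, // new flagship
{ id: 'grok-4.2' /* recommended removed, but only after the user confirms */ },
```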

## Step 4: Write, lint

```bash
bun run lint
```

Lint must pass before reporting done. **If lint fails:** read the error, fix the syntax/typing issue in the entry you just wrote (do not delete the entry — it's the work product), re-run lint, and note the fix in a "Lint adjustments" line in the verification report. Never report done with lint failing.

## Step 5: Verification report (mandatory format)

End with this exact structure:

```markdown
### Verification — <model-id>

| Field | Value | Source URL | Status |
|---|---|---|---|
| `id` | `grok-4.3` | https://docs.x.ai/... | ✓ verified |
| `contextWindow` | 1,000,000 | https://docs.x.ai/... + https://openrouter.ai/... | ✓ verified (2 sources agree) |
| `input` | $1.25/M | https://docs.x.ai/... | ✓ verified |
| `cachedInput` | $0.20/M | https://cloudprice.net/... | ⚠️ single source |
| `output` | $2.50/M | https://docs.x.ai/... + https://openrouter.ai/... | ✓ verified |
| `capabilities.temperature` | `{ min: 0, max: 1 }` | matches sibling entries | — pattern-match only |
| `capabilities.reasoningEffort` | NOT SET | provider docs say API rejects it for this model | ✓ correctly omitted |
| `releaseDate` | 2026-04-30 | https://docs.x.ai/... announcement | ✓ verified |

**Disagreements**
- _none_ OR _OpenRouter says X, provider docs say Y — used Y per provider rule_

**Unverified fields**
- _none_ OR _<field>: could not find authoritative source — left as <X> based on sibling pattern; please confirm_
```

If any row is ⚠️ single-source or "unverified," **state it plainly to the user and ask whether to proceed**. Do not silently merge.

## What to do if you cannot find a source

Omitting a field is **not the same as verifying it**. Any field you cannot confirm from a live fetch must be **both** omitted from the entry **and** listed as ❓ UNVERIFIED in the report's "Unverified fields" section, with the URLs you attempted. Then ask the user to confirm before merging.

- Pricing missing → do NOT guess. Omit `cachedInput`. Mark ❓ UNVERIFIED. Ask the user for the price or the docs URL.
- Context window missing → do NOT guess. Ask the user; mark ❓ UNVERIFIED.
- Release date missing → omit the field; mark ❓ UNVERIFIED in the report.
- Capability uncertain → omit the flag (safer than setting a dead/wrong one); mark ❓ UNVERIFIED so the user knows you didn't confirm it either way.

## Anti-patterns this skill exists to prevent

- ❌ Trusting a marketing email (xAI's grok-4.3 email claimed "3 reasoning efforts" but the API rejects `reasoning_effort` — verified by official docs only)
- ❌ Setting `nativeStructuredOutputs: true` on xai/openai/google (dead — only anthropic/fireworks/openrouter consume it)
- ❌ Setting `thinking` on non-Anthropic/non-Gemini providers
- ❌ Setting `verbosity` on anything other than OpenAI gpt-5.x
- ❌ Copying `pricing.updatedAt` from a sibling instead of using today's date
- ❌ Inventing a `cachedInput` price by dividing input by 4 (varies by provider — find an explicit number)
- ❌ Stamping `recommended: true` on the new model without removing it from the previous flagship
- ❌ Reporting "done" with any UNVERIFIED row in the table
166 changes: 166 additions & 0 deletions .claude/commands/validate-model.md
@@ -0,0 +1,166 @@
---
description: Validate a model entry (or every model in a provider) in apps/sim/providers/models.ts against the provider's live API docs (no hallucination — reports what cannot be verified)
argument-hint: <provider> [model-id]
---

# Validate Model Skill

You audit one or more model entries in `apps/sim/providers/models.ts` against the provider's official live API docs. **Hallucinated pricing and capabilities are the #1 failure mode in this file.** Every numeric and capability claim must be re-derived from a live web fetch in this session — not from memory, not from training data, not from the user's marketing email.

## Hard rules (do not skip)

1. **Live-fetch or report unverified.** Each field must be backed by a live WebFetch in this session. If you cannot reach an authoritative URL for a field, mark it **UNVERIFIED** in the report — do not silently confirm it from memory.
2. **Cite every fact.** Every value in the report must show the source URL it was checked against. No URL → mark UNVERIFIED.
3. **Two-source rule for pricing.** Cross-check input/output/cached against at least one secondary source (OpenRouter, Artificial Analysis, CloudPrice). If sources disagree, the provider's own docs win — flag the disagreement.
4. **Inspect provider implementation before flagging capability mismatches.** A capability flag in `models.ts` is dead unless the provider's code under `apps/sim/providers/{provider}/` consumes it (see Consumption Matrix below). Setting a flag the provider ignores is a warning, not a critical.
5. **Never auto-fix without printing the diff.** Show the user the proposed diff before applying. Get confirmation.

## Your Task

When invoked as `/validate-model <provider> [model-id]`:

1. Read the target entries from `models.ts`
2. Live-fetch the provider's official models, pricing, and capability/reasoning pages + at least one secondary source for pricing
3. Inspect the provider implementation to know which flags are actually consumed
4. Run the checklist below per model
5. Report findings (critical / warning / suggestion / unverified) with every cell linked to its source URL
6. Offer to fix; on confirm, edit `models.ts` in a single pass and re-lint

If `model-id` is omitted, validate every model in the provider.

## Step 1: Read entries from `models.ts`

Capture per model: `id`, full `pricing`, full `capabilities`, `contextWindow`, `releaseDate`, `recommended`, `speedOptimized`, `deprecated`.
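
As a sketch, the captured shape looks roughly like this (a local illustration for the audit, not a type exported by `models.ts`):

```ts
// Per-model snapshot to audit; fields mirror the entry shape listed above.
interface ModelAuditSnapshot {
  id: string
  pricing: { input: number; cachedInput?: number; output: number; updatedAt: string }
  capabilities: Record<string, unknown> // compared flag by flag in Step 4
  contextWindow: number
  releaseDate?: string
  recommended?: boolean
  speedOptimized?: boolean
  deprecated?: boolean
}
```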

## Step 2: Live-fetch authoritative sources

Use the canonical provider URL table in `add-model.md` (Step 1) as the single source of truth — fetch the models index, pricing, and reasoning/parameter caveats pages listed there for the target provider. If you update one table, update the other in the same change.

Secondary cross-check (use at least one): OpenRouter, Artificial Analysis, CloudPrice.

If a fetch fails (404, timeout, paywall), record the URL attempted and mark dependent fields UNVERIFIED.

## Step 3: Build the consumption map for this provider

Re-grep before trusting the snapshot below:

```bash
rg "reasoningEffort|reasoning_effort" apps/sim/providers/<provider>/
rg "verbosity" apps/sim/providers/<provider>/
rg "request\.thinking|thinking:" apps/sim/providers/<provider>/
rg "supportsNativeStructuredOutputs|nativeStructuredOutputs" apps/sim/providers/<provider>/
```

Snapshot (verify before relying):

| Capability | Consumed by |
|---|---|
| `reasoningEffort` | `openai/core.ts`, `azure-openai`, `anthropic/core.ts` (mapped via thinking), `gemini/core.ts` |
| `verbosity` | `openai/core.ts`, `azure-openai/index.ts` |
| `thinking` | `anthropic/core.ts`, `gemini/core.ts` |
| `nativeStructuredOutputs` | `anthropic/core.ts`, `fireworks/index.ts`, `openrouter/index.ts` |
| `computerUse` | `anthropic/core.ts` |
| `temperature` | All providers (passthrough) |

A flag set in `models.ts` but not in the consumption list for this provider = **warning: dead flag**.

## Step 4: Run the checklist

For each model, evaluate every row. Statuses: ✓ matches docs, ✗ disagrees, ⚠️ single-source, ❓ UNVERIFIED (could not fetch).

### Identity
- [ ] `id` exactly matches provider's API model identifier (case, dots, dashes, prefix for resellers)
- [ ] `releaseDate` matches launch announcement
- [ ] `deprecated: true` set if provider has announced retirement (or removed from active list)

### Pricing (per 1M tokens, USD)
- [ ] `pricing.input` matches provider pricing page
- [ ] `pricing.output` matches provider pricing page
- [ ] `pricing.cachedInput` matches provider's documented cached/prompt-cache rate (or is correctly omitted if no caching offered)
- [ ] `pricing.updatedAt` is recent — warn if older than 60 days

### Context & output limits
- [ ] `contextWindow` matches docs (in tokens)
- [ ] `capabilities.maxOutputTokens` matches documented output cap (or is correctly omitted if "no output limit")

### Capabilities (each must be DOCUMENTED-AS-SUPPORTED **and** CONSUMED-BY-PROVIDER-CODE)
- [ ] `temperature` — provider accepts it for this model (reasoning-always-on models often reject)
- [ ] `reasoningEffort.values` — list matches docs; **omitted** for always-reasoning models that reject the parameter (e.g., grok-4.3, where xAI docs explicitly state `reasoning_effort` is not supported). Verify per model — some always-reasoning models (e.g., OpenAI's o-series) DO accept `reasoning_effort` and should keep the flag.
- [ ] `verbosity.values` — only on OpenAI gpt-5.x family; values match docs
- [ ] `thinking.levels` + `thinking.default` — only on Anthropic/Gemini; values match docs
- [ ] `nativeStructuredOutputs` — only on anthropic/fireworks/openrouter; provider must document Structured Outputs / JSON-mode for this model
- [ ] `toolUsageControl` — provider supports `tool_choice` semantics
- [ ] `computerUse` — provider implements computer-use loop AND model is a computer-use SKU
- [ ] `deepResearch` — only on actual deep-research SKUs
- [ ] `memory: false` — only when the model genuinely cannot maintain conversation history

### Flags
- [ ] `recommended: true` — at most one or two per provider; should be current flagship
- [ ] `speedOptimized: true` — only on smallest/fastest tier (nano / flash-lite / haiku class)

## Step 5: Report (mandatory format)

For each model, emit a table with one row per checklist item. Every row that claims ✓ must have a URL.

```markdown
### Validation — <model-id>

| Field | Repo | Live docs | Source URL | Status |
|---|---|---|---|---|
| `input` | $1.25/M | $1.25/M | https://docs.x.ai/... | ✓ |
| `cachedInput` | $0.50/M | $0.20/M | https://cloudprice.net/... | ✗ stale (price cut not picked up) |
| `reasoningEffort` | low/medium/high | rejected by API | https://docs.x.ai/.../reasoning | ✗ inert — selecting silently no-ops |
| `contextWindow` | 1,000,000 | 1,000,000 | https://docs.x.ai/... + https://openrouter.ai/... | ✓ (2 sources) |
| `releaseDate` | 2026-04-30 | not found in scraped pages | _attempted: docs.x.ai, x.ai/news_ | ❓ UNVERIFIED |

**Findings**
- 🔴 critical — `cachedInput` is wrong: docs say $0.20/M, repo has $0.50/M
- 🟡 warning — `reasoningEffort` is set but provider rejects it for this model (xAI docs explicitly: "reasoning_effort is not supported by grok-4.3")
- 🔵 suggestion — `pricing.updatedAt` is 90 days old; refresh
- ❓ unverified — `releaseDate` could not be confirmed from any fetched page; ask user

**Disagreements between sources**
- _none_ OR _OpenRouter says $X, provider docs say $Y — went with provider docs_
```

End each multi-model run with a summary count: `N models checked · X critical · Y warnings · Z suggestions · W unverified`.

## Step 6: Offer to fix

After reporting, ask: *"Want me to fix the critical and warning items? I'll print the diff first."* On yes:

1. Print the proposed diff (do not apply yet); see the sketch after this list
2. Get user confirmation
3. Edit `models.ts` in a single pass
4. Run `bun run lint`
5. Re-run only the failed rows of the checklist on the new state
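
For example, a critical pricing fix might be presented like this before anything is applied (values hypothetical):

```ts
// Proposed change to the entry's pricing block, shown to the user first:
pricing: {
  input: 1.25,
  cachedInput: 0.2, // was 0.5; corrected per the provider pricing page
  output: 2.5,
  updatedAt: '2026-05-07', // refreshed to the date of this validation
},
```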

## Severity definitions

- 🔴 **critical** — wrong number or wrong identifier that misleads users about cost or breaks API calls. Examples: incorrect pricing, wrong model id, wrong context window, capability the API rejects.
- 🟡 **warning** — dead code or internal inconsistency. Examples: capability flag the provider ignores, multiple `recommended: true` per provider, `pricing.updatedAt` >60 days old, missing `deprecated: true` on retired model.
- 🔵 **suggestion** — style/consistency. Examples: field order, missing `speedOptimized` on a clearly smallest-tier model.
- ❓ **unverified** — could not fetch an authoritative source for this field. Surface it; never silently confirm.

## Common bugs this skill catches

- Pricing drift after a provider price cut (very common — providers cut quarterly)
- `reasoningEffort` set on always-reasoning models that reject the parameter (grok-4.3, o3-pro pattern)
- `nativeStructuredOutputs` set on providers that don't consume the flag (dead)
- `thinking` set on non-Anthropic/non-Gemini providers
- `verbosity` set on non-gpt-5.x models
- Wrong context window (e.g., 128k claimed vs 200k actual)
- Stale `pricing.updatedAt`
- Multiple `recommended: true` per provider after a flagship swap
- Missing `deprecated: true` on retired models (e.g., the xAI batch retiring May 15, 2026)

## What "I cannot verify this" looks like

If, after fetching the documented sources, a field cannot be confirmed:

- Mark the row ❓ UNVERIFIED with the URL(s) attempted
- Surface it in the **Findings** section with severity ❓
- Do NOT mark the validation as passed
- Ask the user for a docs URL or guidance before changing anything

The skill is allowed to say *"I could not verify the cached input price for grok-4.3 from the official xAI docs in this session — I attempted [URLs] without finding the value. Third-party sources [URL1, URL2] both report $0.20/M. Confirm before I update."* That is correct behavior. Hallucinating a number is not.