ci: run evals in CI with Gemini, add nightly schedule and failure notifications#70

Draft
WilliamBergamin wants to merge 10 commits into main from gemini-key

@WilliamBergamin WilliamBergamin commented Jul 2, 2026

Contributor

Summary

Migrates the LLM-judged eval suite from a local Ollama judge to Google's Gemini free-tier API, and makes CI actually run the evals.

The eval job was added but had no secrets wired up, so TestToolSelection — gated on SLACK_MCP_TOKEN, the only test in tests/eval/ — skipped entirely and the job went green having evaluated nothing. This PR:

  • Wires GEMINI_API_KEY + SLACK_MCP_TOKEN into the eval job's env.
  • Adds a nightly schedule so regressions surface independent of PR traffic.
  • Adds a notifications job that posts to Slack when lint/test/eval fail on main.
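
The silent skip described above is a general hazard of gating tests on secrets. Below is a minimal sketch of one way to guard against it. It is not code from this PR: `secret_gate` is a hypothetical helper, and the only outside fact it relies on is that GitHub Actions sets `CI=true` for workflow runs. The idea is to skip on a developer machine but fail loudly in CI when a required secret is missing.

```python
import os
from typing import Mapping, Sequence

# Secret names come from this PR; the helper itself is a hypothetical illustration.
REQUIRED_SECRETS = ("GEMINI_API_KEY", "SLACK_MCP_TOKEN")


def secret_gate(env: Mapping[str, str], required: Sequence[str] = REQUIRED_SECRETS) -> str:
    """Return "run", "skip", or "fail" for the eval suite.

    - "run":  every required secret is present and non-empty.
    - "skip": a secret is missing on a developer machine.
    - "fail": a secret is missing while running in CI.
    """
    missing = [name for name in required if not env.get(name, "").strip()]
    if not missing:
        return "run"
    # GitHub Actions sets CI=true for every workflow run.
    return "fail" if env.get("CI", "").lower() == "true" else "skip"


if __name__ == "__main__":
    print(secret_gate(os.environ))
```

A test fixture could map "skip" to a normal skip and "fail" to a hard failure, so a green eval job always means the scenarios were actually evaluated.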

Preview

N/A — CI/tooling change, no user-facing UI.

Testing

  • make lint
  • make test-unit
  • ruff format --check tests/support/mcp.py
  • make test-eval

NOTE: to run make test-eval locally, get your own Gemini API key by following https://ai.google.dev/gemini-api/docs/generate-content/get-started

Notes

  • No changeset: every change here is dev/test/CI infra, not user-facing plugin behavior.

Requirements

WilliamBergamin and others added 9 commits June 29, 2026 17:02
…ifications

Wire the eval job's GEMINI_API_KEY and SLACK_MCP_TOKEN so the LLM-judged
tool-selection suite actually runs in CI instead of skipping silently. Fix
the mislabeled step name, align the job's action SHAs with lint/test, and
raise its timeout to 10m for the rate-limited scenarios.

Add a nightly schedule and a regression-notifications job that posts to
Slack when lint/test/eval fail on main, mirroring the python-slack-sdk
workflow.

Also clean up leftovers from the Ollama->Gemini migration: drop the stale
"not run in CI" note in AGENTS.md, remove the dangling $(OLLAMA_DIR) from
the Makefile clean target, and revert an unrelated reformat in mcp.py.

Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
@WilliamBergamin WilliamBergamin self-assigned this Jul 2, 2026
@WilliamBergamin WilliamBergamin added the test Improve of update the tests of this project label Jul 2, 2026
  {
  "id": "list-members-platform-team",
- "prompt": "Who are the members of the #platform-team channel?",
+ "prompt": "Who are the members of the CA1B2C3F5 channel?",

Member

question: Why are we switching the prompt from using a human readable channel name to a channel ID?

If we want to test a real-world prompt, the majority of people (including myself) will ask the MCP to list channel members in #channel-name, not C0123.

@mwbrooks mwbrooks left a comment

Member

🧪 I'm having trouble getting the tests to run. Here are my steps:

# Clean things up
$ make clean

# Delete the old .env
$ rm .env

# Create a new .env
$ cp .env.example .env

# Add a Gemini API Key and Slack MCP Server API Key
$ vim .env

# Install the dependencies
$ make install

# Run the tests
$ make test-eval

I receive the following errors:

ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[send-message-hello-team] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[read-channel-engineering] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[search-deployment-incident] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[search-channels-mobile] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[read-profile-user] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[list-members-platform-team] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[send-message-release-shipped] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[search-api-migration] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[search-channels-design-system] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-cli-socket-mode] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-block-kit-modal] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-create-app-template] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-post-message-deploy] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-list-members-platform] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-pull-history-engineering] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-user-info-profile] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-add-reaction-releases] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-reply-in-thread] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-read-thread-replies] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-lookup-user-by-email] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-schedule-message-standup] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-scopes] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-which-method-topic] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-pagination] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-missing-scope] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-docs-url] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-rate-limit] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-call-with-curl] - urllib.error.HTTPError: HTTP Error 401: Unauthorized

For sanity, I also exported the API keys into my session in case the .env is not being loaded.
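
One way to narrow down a 401 like this is to confirm that both keys are present and non-empty in the file being edited, without printing their values. The sketch below is a standalone check under stated assumptions: `parse_dotenv` and `unset_keys` are hypothetical helpers, the parser handles only simple `KEY=VALUE` lines, and nothing here reflects how the repo's own tooling loads `.env`. It can rule out a missing or empty key, but not an invalid or expired one.

```python
from pathlib import Path

# Key names come from this PR; the helpers are hypothetical illustrations.
REQUIRED_KEYS = ("GEMINI_API_KEY", "SLACK_MCP_TOKEN")


def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks, comments, and `export `."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip().removeprefix("export ").strip()
        # Drop surrounding quotes so KEY="abc" and KEY=abc compare equal.
        values[key] = value.strip().strip("'\"")
    return values


def unset_keys(values: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty (never the values)."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    path = Path(".env")
    text = path.read_text() if path.exists() else ""
    print("unset:", unset_keys(parse_dotenv(text)))
```

If this reports no unset keys, the remaining suspects are the key values themselves or which of the two services is returning the 401.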

id: str
prompt: str
expected_tool: str
acceptable_tools: NotRequired[list[str]]

Member

question: I understand that acceptable_tools are not required, but is expected_tool still required for the test to pass?
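
For context on the typing side of this question: `NotRequired` only makes the `acceptable_tools` key optional, so `expected_tool` remains a required key of the dict. Whether the test also requires the selected tool to equal `expected_tool` depends on the matching logic, which is not shown here. The sketch below illustrates one plausible rule, assuming a selection passes if it is either the expected tool or one of the acceptable ones. The class name `Scenario`, the function `tool_matches`, and the tool names in the usage note are hypothetical.

```python
try:
    from typing import NotRequired, TypedDict
except ImportError:  # Python < 3.11
    from typing_extensions import NotRequired, TypedDict


class Scenario(TypedDict):
    # Field names mirror the snippet under review; the class name is assumed.
    id: str
    prompt: str
    expected_tool: str
    acceptable_tools: NotRequired[list[str]]


def tool_matches(scenario: Scenario, selected: str) -> bool:
    """Assumed rule: pass on expected_tool or on any listed acceptable tool."""
    allowed = {scenario["expected_tool"], *scenario.get("acceptable_tools", [])}
    return selected in allowed
```

Under this assumed rule, a scenario with `expected_tool` set to a hypothetical `tool_a` and `acceptable_tools` set to `["tool_b"]` would pass for either selection, and a scenario without `acceptable_tools` would pass only for `tool_a`.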

