ci: run evals in CI with Gemini, add nightly schedule and failure notifications#70
Draft
WilliamBergamin wants to merge 10 commits into
…ifications
Wire the eval job's GEMINI_API_KEY and SLACK_MCP_TOKEN so the LLM-judged tool-selection suite actually runs in CI instead of skipping silently. Fix the mislabeled step name, align the job's action SHAs with lint/test, and raise its timeout to 10m for the rate-limited scenarios.
Add a nightly schedule and a regression-notifications job that posts to Slack when lint/test/eval fail on main, mirroring the python-slack-sdk workflow.
Also clean up leftovers from the Ollama->Gemini migration: drop the stale "not run in CI" note in AGENTS.md, remove the dangling $(OLLAMA_DIR) from the Makefile clean target, and revert an unrelated reformat in mcp.py.
Co-Authored-By: Claude <svc-devxp-claude@slack-corp.com>
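The "skipping silently" failure mode described above can be guarded against with a small gate. The following is only a sketch under stated assumptions: `eval_gate` and `REQUIRED_SECRETS` are hypothetical names, not code from this PR, and the only external fact relied on is that GitHub Actions sets `CI=true` in every workflow run. It skips locally when secrets are absent but fails loudly in CI, so a green job cannot mean "evaluated nothing".

```python
from typing import Mapping

# Variable names taken from the PR description; the helper itself is hypothetical.
REQUIRED_SECRETS = ("GEMINI_API_KEY", "SLACK_MCP_TOKEN")


def eval_gate(env: Mapping[str, str]) -> str:
    """Return "run" or "skip"; raise if secrets are missing inside CI."""
    missing = [name for name in REQUIRED_SECRETS if not env.get(name)]
    if not missing:
        return "run"
    # GitHub Actions sets CI=true, so a missing secret there is a wiring bug,
    # not a reason to skip.
    if env.get("CI") == "true":
        raise RuntimeError(f"eval secrets missing in CI: {', '.join(missing)}")
    return "skip"


# Usage with explicit mappings (pass os.environ in a real conftest).
print(eval_gate({"GEMINI_API_KEY": "k", "SLACK_MCP_TOKEN": "t"}))  # run
print(eval_gate({}))  # skip
```

In a pytest suite, the "skip" result could feed a skip marker while the raised error would surface as a collection failure; how the real suite wires this is not shown in the PR excerpt.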
mwbrooks
reviewed
Jul 2, 2026
  {
    "id": "list-members-platform-team",
-   "prompt": "Who are the members of the #platform-team channel?",
+   "prompt": "Who are the members of the CA1B2C3F5 channel?",
Member
question: Why are we switching the prompt from using a human readable channel name to a channel ID?
If we want to test a real-world prompt, most people (including myself) will ask the MCP to list channel members in #channel-name, not C0123.
mwbrooks
reviewed
Jul 2, 2026
mwbrooks
left a comment
Member
🧪 I'm having trouble getting the tests to run. Here are my steps:
# Clean things up
$ make clean
# Delete the old .env
$ rm .env
# Create a new .env
$ cp .env.example .env
# Add a Gemini API Key and Slack MCP Server API Key
$ vim .env
# Install the dependencies
$ make install
# Run the tests
$ make test-eval
I receive the following errors:
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[send-message-hello-team] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[read-channel-engineering] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[search-deployment-incident] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[search-channels-mobile] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[read-profile-user] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[list-members-platform-team] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[send-message-release-shipped] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[search-api-migration] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[search-channels-design-system] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-cli-socket-mode] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-block-kit-modal] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-create-app-template] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-post-message-deploy] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-list-members-platform] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-pull-history-engineering] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-user-info-profile] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-add-reaction-releases] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-reply-in-thread] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-read-thread-replies] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-lookup-user-by-email] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[ambiguous-schedule-message-standup] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-scopes] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-which-method-topic] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-pagination] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-missing-scope] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-docs-url] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-rate-limit] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
ERROR tests/eval/test_tool_selection.py::TestToolSelection::test_tool_selection[skill-slack-api-call-with-curl] - urllib.error.HTTPError: HTTP Error 401: Unauthorized
For sanity, I also exported the API keys into my session in case the .env is not being loaded.
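Since every scenario fails with the same 401, one quick thing to rule out is a key that is missing, empty, or mangled in .env. The sketch below is an assumption-laden diagnostic, not part of the repository: `parse_dotenv` and `suspect_keys` are hypothetical helpers, the .env is assumed to use plain KEY=VALUE lines, and the two variable names are the ones named in this PR. It never prints the secret values themselves.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, ignoring blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of quotes.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def suspect_keys(
    values: dict[str, str],
    required: tuple[str, ...] = ("GEMINI_API_KEY", "SLACK_MCP_TOKEN"),
) -> dict[str, str]:
    """Map each required key to a problem description, if it has one."""
    problems: dict[str, str] = {}
    for name in required:
        value = values.get(name)
        if value is None:
            problems[name] = "missing"
        elif value == "":
            problems[name] = "empty"
        elif any(ch.isspace() for ch in value):
            problems[name] = "contains whitespace"
    return problems


# Usage: an .env copied from .env.example with only one key filled in.
sample = "# secrets\nGEMINI_API_KEY=abc123\nSLACK_MCP_TOKEN=\n"
print(suspect_keys(parse_dotenv(sample)))  # {'SLACK_MCP_TOKEN': 'empty'}
```

An empty result only shows the keys are present and well-formed; a 401 can still mean the key itself is invalid or lacks access, which this check cannot detect.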
    id: str
    prompt: str
    expected_tool: str
    acceptable_tools: NotRequired[list[str]]
Member
question: I understand that acceptable_tools are not required, but is expected_tool still required for the test to pass?
Summary
Migrates the LLM-judged eval suite from a local Ollama judge to Google's Gemini free-tier API, and makes CI run the evals.
The eval job was added but had no secrets wired up, so TestToolSelection (gated on SLACK_MCP_TOKEN, the only test in tests/eval/) skipped entirely and the job went green having evaluated nothing. This PR:
- Wires GEMINI_API_KEY + SLACK_MCP_TOKEN into the eval job's env:
- Adds a notifications job that posts to Slack when lint/test/eval fail on main.
Preview
N/A — CI/tooling change, no user-facing UI.
Testing
- make lint
- make test-unit
- ruff format --check tests/support/mcp.py
- make test-eval
NOTE: get your own key with https://ai.google.dev/gemini-api/docs/generate-content/get-started
Notes
Requirements
make test and the tests pass.