JaneBench

JaneBench is an Agentic LLM benchmark built using Jane Street Puzzles and mini-swe-agent.

This repo contains the evaluation code, skeletons of the task data, and code that uses DSPy to run a GEPA optimization of the task prompt given to mini-swe-agent.

JS (Jane Street) puzzles (and puzzles in general) serve as good Agent tasks due to their complexity, general ease of verification and heavy reliance on reasoning. I was interested in seeing how well an Agent could perform on this task, and whether we could boost its performance using prompt optimization.

Mini-swe-agent was chosen for ease of use, popularity and ability to be deployed into a Docker container.

Data Collection

All puzzles in the Jane Street archive from January 2020 to April 2026 were collected and processed into agent tasks. Each task is a directory under puzzles/ with the following structure:

yyyy-mm-example-name
    |__assets (if there's an image)
        |__img.jpg
    |__meta.yaml
    |__PUZZLE.md (transcribed puzzle)
    |__TASK.md (submission instructions)
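
The layout above can be checked mechanically before a run. The following is a minimal sketch, not code from this repo: the helper names `missing_task_files` and `check_puzzles` are hypothetical, and it assumes every task needs meta.yaml, PUZZLE.md and TASK.md while assets/ stays optional.

```python
from pathlib import Path

# Files every task directory is assumed to need; assets/ is optional.
REQUIRED_FILES = ("meta.yaml", "PUZZLE.md", "TASK.md")


def missing_task_files(task_dir: Path) -> list[str]:
    """Return the required files that are absent from one puzzle directory."""
    return [name for name in REQUIRED_FILES if not (task_dir / name).is_file()]


def check_puzzles(root: Path) -> dict[str, list[str]]:
    """Map each incomplete puzzle directory name to its missing files."""
    problems = {}
    # Skip loose files such as the suitability JSONs that also live in puzzles/.
    for task_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        missing = missing_task_files(task_dir)
        if missing:
            problems[task_dir.name] = missing
    return problems
```

Calling `check_puzzles(Path("puzzles"))` on a fresh clone would be expected to flag PUZZLE.md for every task, since those files are not published here.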

Here's an example truncated PUZZLE.md for 2025-12-robot-javelin. I avoided publishing these PUZZLE.mds to the repo as JS owns them. Additionally, any mention of other puzzles, submissions, or JS was removed from the puzzles during transcription to help avoid bias + memorization on the model's part.

# Robot Javelin

## Puzzle

It's head-to-head. Each of two robots makes their first throw...

... Give your answer in exact form or as a decimal to 10 decimal places.

I labeled each puzzle as suitable or unsuitable for the benchmark based on the following criteria:

ease of verification (puzzles that are open-ended or have multiple possible solutions were skipped)
reliance on images (image-heavy puzzles that depend on small details in the images or on heavy visual reasoning were skipped)

As my criteria for visual reasoning may seem arbitrary, I've included my annotations for every task I deemed unsuitable under puzzles/full_suitability.json.
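
To list the suitable puzzles from an annotation file, something like the sketch below could work. The "suitable" key is the one the run script reads from puzzles/llm_suitability.json. The rest is an assumption: that the file is a JSON object mapping puzzle name to an entry, and that full_suitability.json uses the same key. The function name `suitable_puzzles` is hypothetical.

```python
import json
from pathlib import Path


def suitable_puzzles(path: Path) -> list[str]:
    """Return the names of puzzles whose entry has "suitable": true.

    ASSUMPTION: the file is a JSON object of the form
    {"<puzzle-name>": {"suitable": true, ...}, ...}; adjust if the real
    schema is a list or nests the flag differently.
    """
    entries = json.loads(path.read_text(encoding="utf-8"))
    return sorted(
        name for name, info in entries.items() if info.get("suitable") is True
    )
```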

During evaluation (run_eval/run_all.py), all files from the above structure except meta.yaml are loaded into a Docker container with some relevant libraries (run_eval/Dockerfile), and mini-swe-agent is tasked with solving the puzzle (run_eval/prompt_mini_swe.py). The container has no network access to curb cheating.
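
A rough sketch of that isolation is shown below. It is not the code in run_eval/run_all.py, which may do this differently. The function names, the image name and the /workspace mount point are all hypothetical. Only standard Docker CLI flags are used: `--network none` disables networking, `--cpus` caps CPU use, and `-v` mounts a host directory.

```python
import shutil
from pathlib import Path


def stage_task(task_dir: Path, staging_dir: Path) -> None:
    """Copy a puzzle directory to staging_dir, leaving meta.yaml behind.

    staging_dir must not exist yet (shutil.copytree creates it).
    """
    shutil.copytree(task_dir, staging_dir, ignore=shutil.ignore_patterns("meta.yaml"))


def docker_run_command(image: str, staging_dir: Path, cpus: int = 2) -> list[str]:
    """Build a `docker run` argument list for an offline container."""
    return [
        "docker", "run", "--rm",
        "--network", "none",          # no network access inside the container
        "--cpus", str(cpus),          # CPU budget per container
        "-v", f"{staging_dir.resolve()}:/workspace",  # hypothetical mount point
        image,                        # hypothetical image name supplied by caller
    ]
```

The returned list can be passed to `subprocess.run` once the image has been built from run_eval/Dockerfile.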

All in all, this resulted in 74 transcribed puzzles, 54 of which were suitable. 21 of the suitable 54 had no related puzzle images.

Results

Benchmark

Due to cost constraints, I only ran DeepSeek V4 Flash, and only on the 21 image-less puzzles (the model has no vision encoder). Each task was run 5 times to account for output variance. The average time per task was 20 minutes and the average cost was 20 cents.

13 of the 21 tasks had at least 1 correct generation in 5 independent attempts.
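
The "at least 1 correct in 5" figure is a solved-at-least-once rate, and 13 of 21 works out to roughly 62%. A minimal sketch of the computation follows. The function names are hypothetical and the puzzle names and outcomes in the example are made up for illustration, not real results.

```python
def solved_at_least_once(results: dict[str, list[bool]]) -> list[str]:
    """Return the tasks with at least one correct attempt."""
    return sorted(name for name, attempts in results.items() if any(attempts))


def solve_rate(results: dict[str, list[bool]]) -> float:
    """Fraction of tasks solved at least once across their attempts."""
    return len(solved_at_least_once(results)) / len(results)


# Illustrative data only: three made-up tasks with five attempts each.
example = {
    "puzzle-a": [False, True, False, False, False],
    "puzzle-b": [False, False, False, False, False],
    "puzzle-c": [True, True, True, False, False],
}
print(solved_at_least_once(example))  # ['puzzle-a', 'puzzle-c']
```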

The most common failure mode was the Agent guessing at a premature solution without taking all parts of the problem into account. Output-format failures were very rare, although the one task with 3/5 successes would have been 5/5 had the Agent followed the output format correctly.

The Agent tends to do better on tasks that involve simple computations and is weak at word problems. It's also important to note that the Agent here consists of the framework, model, and task prompt. Modifying any of those three components may produce different results.

Prompt Optimization

I believe that the correct methodology to solve some of the puzzles can be described through a prompt, but generating such prompts has traditionally been a labor-intensive human process that mostly consists of trial and error.

As such, I added some code under dspy_exp that can run a prompt optimization on the task prompt given to the mini-swe-agent to evaluate how learnable the results are.

I made two runs, both using DeepSeek V4 Pro as the reflection model, with DeepSeek V4 Flash and Mercury 2 as the mini-swe-agent drivers.

Neither run was able to identify a prompt that could get a single pass on any of the 8 unsolved problems. I attribute this to a couple of factors: the problem being extremely difficult, the reflection model failing to identify enough useful feedback to update the task prompt and the reward for correctness being sparse (no points given for partial correctness because how do you define that here?).

I still think this technique is usable here, but more work needs to be put into surrogate metrics to help "push" the Agent onto the right track.

How To Run

Benchmark

Set up the environment by going to run_eval and running uv sync, which creates a .venv with all the necessary libraries.
Create an OpenRouter key, load some credits (add money) and export it to your env using:

export OPENROUTER_API_KEY=sk-...

Modify puzzles/llm_suitability.json to list all the puzzles you want to run. The run script runs every puzzle in the json that has "suitable": true. You can copy over the puzzles you want from puzzles/full_suitability.json or use the default 8 that I found DeepSeek V4 Flash to have failed on.
Transcribe each puzzle listed in llm_suitability.json into its puzzle dir in puzzles/ as PUZZLE.md. I didn't include these because of copyright concerns, so you'll have to bring them over yourself.
Run run_eval/run_all.py. If it's too slow, you can raise the --parallel flag, but I'd recommend leaving each container at least 2-4 threads for the more computationally intensive puzzles. The default timeout is 30 minutes.
Output is created under run_eval/runs/ with the following structure:

yyyymmdd-ts
    |__2020-02-single-cross (example puzzle)
        |__run-1
            |__grade
                |__meta.yaml (copied over)
                |__SOLUTION.md (made by the Agent)
            |__workspace (mnt from Docker, Agent's scratch space)
            |__grade.json
            |__trajectory.json (Agent convo)
        |__run-2
            ...
    |__summary.json (agg metrics + run meta)
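
The per-run grade.json files in this tree can be gathered for your own analysis with a short script. The sketch below is not part of the repo. It assumes each grade.json is a JSON object with a boolean field, here called "correct", which is a guess at the schema, so check a real file and pass the right key. The function name `collect_grades` is hypothetical.

```python
import json
from collections import defaultdict
from pathlib import Path


def collect_grades(run_root: Path, key: str = "correct") -> dict[str, list[bool]]:
    """Collect one boolean per run-N directory, grouped by puzzle name.

    ASSUMPTION: grade.json contains a boolean under `key`; the default
    name "correct" is hypothetical, not confirmed by the repo.
    """
    grades = defaultdict(list)
    # Matches <run_root>/<puzzle>/run-N/grade.json as in the tree above.
    for grade_file in sorted(run_root.glob("*/run-*/grade.json")):
        puzzle = grade_file.parent.parent.name
        data = json.loads(grade_file.read_text(encoding="utf-8"))
        grades[puzzle].append(bool(data.get(key, False)))
    return dict(grades)
```

Point `run_root` at one timestamped directory under run_eval/runs/. The summary.json written by the run script already holds aggregate metrics, so this is only useful for custom breakdowns.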

Prompt Optimization

Run the prompt optimization code using GEPA through DSPy by calling dspy_exp/gepa_optimize.py. See the docstring for examples.

For the tasks that you put in the train set (by default, all of the puzzles you provide in puzzles/llm_suitability.json), I highly recommend going to the JS website and transcribing the solution explanation into an explanation field in meta.yaml. I didn't include this for all the puzzles as it's time-consuming to create and validate.

You can see an example in puzzles/2026-03-planetary-parade-meta.yaml.

GEPA is a reflective prompt optimizer, meaning the prompt mutation step uses run traces provided as natural language to help shape the next prompt. Without the solution steps, the reflection LM will struggle to tell whether the Agent trajectory is on track and the quality of the proposed instruction will degrade.
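
To make the role of the explanation field concrete, here is a plain-Python sketch of how a score and a natural-language feedback string might be assembled for the reflection LM. It is not the code in dspy_exp and does not call DSPy. The function name `build_feedback`, the exact-match comparison and the returned dict shape are all assumptions for illustration. The real grading lives in the repo's verifier.

```python
from typing import Optional


def build_feedback(predicted: str, expected: str,
                   explanation: Optional[str] = None) -> dict:
    """Return a sparse 0/1 score plus text a reflection LM could read.

    ASSUMPTION: answers are compared by exact match after stripping
    whitespace, which stands in for the repo's actual verifier.
    """
    correct = predicted.strip() == expected.strip()
    if correct:
        feedback = "The submitted answer matches the expected answer."
    else:
        feedback = f"The submitted answer {predicted.strip()!r} is wrong."
        if explanation:
            # The solution outline lets the reflection LM judge the trajectory.
            feedback += " Reference solution outline: " + explanation
        else:
            feedback += " No reference explanation is available for this task."
    return {"score": 1.0 if correct else 0.0, "feedback": feedback}
```

Without an explanation, the only signal on a failed run is that the answer was wrong, which gives the reflection LM very little to work with.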

I wasn't able to get any benchmark improvement with the solutions added, but I believe this step is necessary to get the best performance.
