NickCheng0921/JaneBench


JaneBench

JaneBench is an Agentic LLM benchmark built using Jane Street Puzzles and mini-swe-agent.

This repo contains the evaluation code, skeletons of the task data, and code to run a GePA optimization, through DSPy, on the task prompt provided to mini-swe-agent.

JS (Jane Street) puzzles (and puzzles in general) serve as good Agent tasks due to their complexity, general ease of verification and heavy reliance on reasoning. I was interested in seeing how well an Agent could perform on this task, and whether we could boost its performance using prompt optimization.

Mini-swe-agent was chosen for ease of use, popularity and ability to be deployed into a Docker container.

Data Collection

All puzzles in the Jane Street archive from January 2020 to April 2026 were collected and processed into agent tasks. Each task is a directory under puzzles/ with the following structure:

yyyy-mm-example-name
    |__assets (if there's an image)
        |__img.jpg
    |__meta.yaml
    |__PUZZLE.md (transcribed puzzle)
    |__TASK.md (submission instructions)
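
Because the PUZZLE.md files are not published with the repo, it can help to check which task directories are still incomplete before starting a run. The following is a minimal sketch, not code from this repo: the function names are hypothetical, and it assumes every task needs exactly the three files shown above (assets is treated as optional).

```python
from pathlib import Path

# Files every task directory is assumed to need (assets/ is optional).
REQUIRED_FILES = ("meta.yaml", "PUZZLE.md", "TASK.md")


def missing_task_files(puzzle_dir: Path) -> list[str]:
    """Return the required files that are absent from one puzzle directory."""
    return [name for name in REQUIRED_FILES if not (puzzle_dir / name).is_file()]


def find_incomplete_puzzles(puzzles_root: Path) -> dict[str, list[str]]:
    """Map each incomplete puzzle directory name to its missing files."""
    report = {}
    for puzzle_dir in sorted(p for p in puzzles_root.iterdir() if p.is_dir()):
        missing = missing_task_files(puzzle_dir)
        if missing:
            report[puzzle_dir.name] = missing
    return report
```

Calling `find_incomplete_puzzles(Path("puzzles"))` from the repo root would then list the directories that still need a transcription.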

Here's an example truncated PUZZLE.md for 2025-12-robot-javelin. I avoided publishing these PUZZLE.mds to the repo as JS owns them. Additionally, any mention of other puzzles, submissions, or JS was removed from the puzzles during transcription to help avoid bias + memorization on the model's part.

# Robot Javelin

## Puzzle

It's head-to-head. Each of two robots makes their first throw...

... Give your answer in exact form or as a decimal to 10 decimal places.

I labeled the puzzles as suitable/non-suitable for use in the benchmark based on the following criteria:

  • ease of verification (puzzles that were open-ended or had multiple possible solutions were skipped)
  • limited reliance on images (image-heavy puzzles that depend on small details in the images or on heavy visual reasoning were skipped)

As my criteria for visual reasoning may seem arbitrary, I've included my annotations under puzzles/full_suitability.json for when I deemed a task unsuitable for the benchmark.

During evaluation (run_eval/run_all.py), all files from the above structure except meta.yaml are loaded into a Docker container with some relevant libraries (run_eval/Dockerfile), and mini-swe-agent is tasked with solving the puzzle (run_eval/prompt_mini_swe.py). The container has no network access to curb cheating.
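
The staging step described above can be sketched roughly as follows. This is an illustration under assumptions, not the logic in run_eval/run_all.py: the function names, the `/workspace` mount point, and the image name passed in are all hypothetical. The only Docker feature relied on is the standard `--network none` option of `docker run`, which starts a container without network access.

```python
import shutil
from pathlib import Path


def stage_workspace(puzzle_dir: Path, workspace: Path) -> list[str]:
    """Copy a puzzle directory into a workspace, leaving meta.yaml behind.

    Returns the staged files as sorted POSIX-style relative paths.
    """
    shutil.copytree(
        puzzle_dir,
        workspace,
        ignore=shutil.ignore_patterns("meta.yaml"),
        dirs_exist_ok=True,
    )
    return sorted(
        p.relative_to(workspace).as_posix()
        for p in workspace.rglob("*")
        if p.is_file()
    )


def docker_command(image: str, workspace: Path) -> list[str]:
    """Build a `docker run` argv that mounts the workspace with networking disabled."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                      # no network access inside the container
        "-v", f"{workspace.resolve()}:/workspace",  # hypothetical mount point
        image,
    ]
```

Building the command as a list keeps it ready for `subprocess.run` without shell quoting issues.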

All in all, this resulted in 74 transcribed puzzles, 54 of which were suitable. 21 of the suitable 54 had no related puzzle images.

Results

Benchmark

I only ran DeepSeek V4 Flash on the 21 image-less puzzles (no vision encoder) due to cost constraints. Each task was run 5 times to account for output variance. The average time per task was 20 minutes and the average cost was 20 cents.

13 of the 21 tasks had at least 1 correct generation in 5 independent attempts.
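
For reference, 13 of 21 is roughly 62% of tasks solved at least once. The sketch below shows how that "solved at least once" count and a per-task pass rate could be computed from per-attempt outcomes. The function names and the input shape (task name mapped to a list of booleans, one per attempt) are assumptions for illustration, and no real benchmark data is included.

```python
def solved_at_least_once(results: dict[str, list[bool]]) -> tuple[int, int]:
    """Return (tasks with at least one correct attempt, total tasks)."""
    solved = sum(1 for attempts in results.values() if any(attempts))
    return solved, len(results)


def per_task_pass_rate(results: dict[str, list[bool]]) -> dict[str, float]:
    """Return the fraction of correct attempts for each task that has attempts."""
    return {
        task: sum(attempts) / len(attempts)
        for task, attempts in results.items()
        if attempts
    }
```

Reporting both numbers is useful because a task solved 1/5 times and a task solved 5/5 times count the same in the first metric.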

The most common failure mode was the Agent committing to a premature solution without taking all parts of the problem into account. A much rarer failure mode was output formatting: the task that had 3/5 successes would have been 5/5 if the Agent had followed the output format correctly.

The Agent tends to do better on tasks that involve simple computations and is weak at word problems. It's also important to note that the Agent here consists of the framework, model, and task prompt. Changing any of those three components may produce different results.

Prompt Optimization

I believe that the correct methodology to solve some of the puzzles can be described through a prompt, but generating such prompts has traditionally been a labor-intensive manual process that mostly consists of trial and error.

As such, I added some code under dspy_exp that can run a prompt optimization on the task prompt given to the mini-swe-agent to evaluate how learnable the results are.

I made two runs, both using DeepSeek V4 Pro as the reflection model: one with DeepSeek V4 Flash and one with Mercury 2 as the mini-swe-agent driver.

Neither run was able to identify a prompt that could get a single pass on any of the 8 unsolved problems. I attribute this to a couple of factors: the problem being extremely difficult, the reflection model failing to identify enough useful feedback to update the task prompt and the reward for correctness being sparse (no points given for partial correctness because how do you define that here?).

I still think this technique is usable here, but more work needs to be put into surrogate metrics to help "push" the Agent onto the right track.

How To Run

Benchmark

  1. Set up the environment by going to run_eval and running uv sync to create a .venv with all the necessary libraries.

  2. Create an OpenRouter key, load some credits (add money) and export it to your env using:

export OPENROUTER_API_KEY=sk-...

  3. Modify puzzles/llm_suitability.json to list the puzzles you want to run. The run script runs every puzzle in the JSON that has "suitable": true. You can copy entries over from puzzles/full_suitability.json or use the default 8 that I found DeepSeek V4 Flash to fail on.

  4. Transcribe each puzzle listed in llm_suitability.json into its directory under puzzles/ as PUZZLE.md. I left these out of the repo for copyright reasons, so you'll have to add them yourself.

  5. Run run_eval/run_all.py. If it's too slow, you can raise the --parallel flag, but I'd recommend leaving each container at least 2-4 threads for the more computationally intensive puzzles. The default timeout is 30 minutes.

  6. An output directory will be created under run_eval/runs/ with the following structure:

yyyymmdd-ts
    |__2020-02-single-cross (example puzzle)
        |__run-1
            |__grade
                |__meta.yaml (copied over)
                |__SOLUTION.md (made by the Agent)
            |__workspace (mnt from Docker, Agent's scratch space)
            |__grade.json
            |__trajectory.json (Agent convo)
        |__run-2
            ...
    |__summary.json (agg metrics + run meta)
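
If you want to inspect results beyond summary.json, the layout above can be walked directly. The following is a rough sketch under stated assumptions: the function name is hypothetical, and the schema of grade.json is not documented here, so the boolean `correct` key is a guess that you should replace with whatever field the grader actually writes.

```python
import json
from pathlib import Path


def collect_grades(run_root: Path, key: str = "correct") -> dict[str, list[bool]]:
    """Collect one boolean per run-N directory for every puzzle in a run output.

    `key` is an ASSUMED field name inside grade.json; adjust it to the real schema.
    Runs without a grade.json (for example, timed-out runs) are skipped.
    """
    grades: dict[str, list[bool]] = {}
    for puzzle_dir in sorted(p for p in run_root.iterdir() if p.is_dir()):
        attempts = []
        for run_dir in sorted(puzzle_dir.glob("run-*")):
            grade_file = run_dir / "grade.json"
            if grade_file.is_file():
                data = json.loads(grade_file.read_text())
                attempts.append(bool(data.get(key, False)))
        grades[puzzle_dir.name] = attempts
    return grades
```

Note that the run directories are sorted lexicographically, so run-10 would sort before run-2; this does not matter for counting correct attempts.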

Prompt Optimization

Run the prompt optimization code using GePA through DSPy by calling dspy_exp/gepa_optimize.py. See the docstring for examples.

For the tasks that you put in the train set (by default, all of the puzzles listed in puzzles/llm_suitability.json), I highly recommend going to the JS website and transcribing the solution explanation into an explanation field in meta.yaml. I didn't include this for all the puzzles as it's time consuming to create and validate.

You can see an example in puzzles/2026-03-planetary-parade-meta.yaml.

GePA is a reflective prompt optimizer, meaning the prompt mutation step uses run traces provided as natural language to help shape the next prompt. Without the solution steps, the reflection LM will struggle to tell whether the Agent trajectory is on track and the quality of the proposed instruction will degrade.
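
To illustrate why the explanation field matters, here is a minimal sketch of a scoring function that returns both a sparse score and natural-language feedback. It is not the metric in dspy_exp/gepa_optimize.py: the function name, the returned dictionary shape, and the exact-string comparison are all simplifying assumptions made for this example.

```python
from typing import Optional


def score_with_feedback(
    submitted: str, expected: str, explanation: Optional[str] = None
) -> dict:
    """Return a binary score plus text that a reflection model could read.

    The score is sparse (1.0 or 0.0). The feedback only becomes informative on
    a failure when a reference explanation is available.
    """
    correct = submitted.strip() == expected.strip()
    if correct:
        feedback = "The submitted answer matches the expected answer."
    else:
        feedback = (
            f"The submitted answer {submitted.strip()!r} does not match "
            "the expected answer."
        )
        if explanation:
            feedback += " Reference solution outline: " + explanation.strip()
        else:
            feedback += " No reference explanation is available for this puzzle."
    return {"score": 1.0 if correct else 0.0, "feedback": feedback}
```

Without the explanation, every failed attempt produces nearly the same feedback, which gives the reflection model little to work with when proposing the next prompt.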

I wasn't able to get any benchmark improvement with the solutions added, but I believe this step is necessary to get the best performance.
