UT-Austin-RPL/metamon


Metamon enables plug-and-play reinforcement learning (RL) research on Pokémon Showdown by providing:

  1. Datasets of 5M+ trajectories "reconstructed" from real human battles and 20M+ generated by self-play between agents.
  2. Starting points for training (or finetuning) your own imitation learning (IL) and RL policies.
  3. Standardized sets of competitive teams for diverse training and evaluation.
  4. 40+ baseline policies ranging from beginner to expert-level human play.

Metamon began as a conference paper, “Human-Level Competitive Pokémon via Scalable Offline RL and Transformers”, at RLC 2025. It later served as both a starter kit and winning baseline for the NeurIPS 2025 PokéAgent Challenge, and now provides RL baselines for the PokéAgent Leaderboard. Although Metamon is primarily intended to make Pokémon an accessible, data-rich research domain, our agents have consistently been among the strongest Pokémon singles bots, with ratings against human players now reaching the 90-99th percentile depending on the ruleset, and may be useful to competitive players as a practice opponent or analysis tool.


Figure 1

Supported Rulesets

Pokémon Showdown hosts many different rulesets spanning nine generations of the video game franchise. Metamon initially focused on the most popular singles ruleset ("OverUsed") for Generations 1, 2, 3, and 4 but has recently expanded to include Generation 9 OverUsed (OU).


Table of Contents

  1. Installation
  2. Quick Start
  3. Pretrained Models
  4. Battle Datasets
  5. Team Sets
  6. Baselines
  7. Observation Spaces, Action Spaces, & Reward Functions
  8. Training and Evaluation
  9. Other Datasets
  10. Battle Backends
  11. Experimental Features
  12. Acknowledgements
  13. Citation



Installation

Metamon is written and tested for Linux and Python 3.10+. We recommend creating a fresh virtual environment or conda environment:

conda create -n metamon python==3.10
conda activate metamon

Then, install with:

git clone --recursive git@github.com:UT-Austin-RPL/metamon.git
cd metamon
pip install -e .

To install Pokémon Showdown, we'll need a modern version of npm / Node.js (instructions here). Showdown is updated constantly; breaking changes are rare, but they do happen. The version that downloads with this repo (metamon/server) is always supported.

cd server/pokemon-showdown
npm install

We will need to have the Showdown server running in the background while using Metamon:

# in the background (`screen`, etc.)
node pokemon-showdown start --no-security
# no-security removes battle speed throttling and password requirements on your local server

If necessary, we can customize the server settings (config/config.js) or the rules for each game mode.

Verify that installation has gone smoothly with:

# run a few test battles on the local server
python -m metamon.env

Metamon provides large datasets of Pokémon team files, human battles, and other statistics that will automatically download when requested. Specify a path with:

# add to ~/.bashrc
export METAMON_CACHE_DIR=/path/to/plenty/of/disk/space



Quick Start

Metamon makes it easy to turn Pokémon into an RL research problem. Pick a set of Pokémon teams to play with, an observation space, an action space, and a reward function:

from metamon.env import get_metamon_teams
from metamon.interface import DefaultObservationSpace, DefaultShapedReward, DefaultActionSpace

team_set = get_metamon_teams("gen1ou", "competitive")
obs_space = DefaultObservationSpace()
reward_fn = DefaultShapedReward()
action_space = DefaultActionSpace()

Then, battle against built-in baselines (or any poke_env.Player):

from metamon.env import BattleAgainstBaseline
from metamon.baselines import get_baseline

env = BattleAgainstBaseline(
    battle_format="gen1ou",
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    team_set=team_set,
    opponent_type=get_baseline("Gen1BossAI"),
)

# standard `gymnasium` environment
obs, info = env.reset()
next_obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
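
Because the environment follows the standard gymnasium interface, a full battle is just a reset followed by repeated steps. Here is a minimal sketch of that loop. `run_episode` and `CountdownEnv` are hypothetical names introduced for illustration (they are not part of metamon); the toy env only stands in for a battle so the snippet runs without a Showdown server.

```python
from typing import Any, Callable


def run_episode(env: Any, policy: Callable[[Any], Any], max_steps: int = 1000) -> float:
    """Play one episode on a gymnasium-style env and return the total reward."""
    obs, info = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs)
        obs, reward, terminated, truncated, info = env.step(action)
        total_reward += float(reward)
        if terminated or truncated:
            break
    return total_reward


class CountdownEnv:
    """Toy stand-in for a battle env (hypothetical): ends after `length` steps."""

    def __init__(self, length: int = 3):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t, {}

    def step(self, action):
        self.t += 1
        terminated = self.t >= self.length
        # +1 per ordinary step, +100 on the final step (made-up numbers)
        reward = 100.0 if terminated else 1.0
        return self.t, reward, terminated, False, {}


total = run_episode(CountdownEnv(length=3), policy=lambda obs: 0)
print(total)
```

With the metamon env above, the same helper could be called as `run_episode(env, lambda obs: env.action_space.sample())` to play one battle with random actions.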

The more flexible option is to request battles on our local Showdown server and battle anyone else who is online (humans, pretrained agents, or other Pokémon AI projects). If it plays Showdown, we can battle against it!

from metamon.env import QueueOnLocalLadder

env = QueueOnLocalLadder(
    battle_format="gen1ou",
    player_username="my_scary_username",
    num_battles=10,
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    player_team_set=team_set,
)

Metamon's main feature is that it creates a dataset of "reconstructed" human demonstrations for these environments:

from metamon.data import ParsedReplayDataset

human_dset = ParsedReplayDataset(
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    formats=["gen1ou"],
)
obs_seq, action_seq, reward_seq, done_seq = human_dset[0]
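
Each dataset item is one trajectory, so standard offline RL quantities can be computed directly from it. As a small sketch, the helper below turns a reward sequence into discounted returns. It assumes `reward_seq` can be treated as a flat sequence of per-timestep floats (the exact array types are not specified here), and `discounted_returns` is a hypothetical helper, not a metamon function.

```python
from typing import List, Sequence


def discounted_returns(rewards: Sequence[float], gamma: float = 0.99) -> List[float]:
    """Return G_t = r_t + gamma * G_{t+1} for every timestep t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy reward sequence standing in for `reward_seq` (made-up values).
example = discounted_returns([1.0, 0.0, 100.0], gamma=0.5)
print(example)
```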

We can also load a starting dataset of self-play trajectories generated by the metamon project:

from metamon.data import SelfPlayDataset

selfplay_dset = SelfPlayDataset(
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    formats=["gen1ou"],
    subset="pac-base",  # or "pac-exploratory" or "pac-tauros"
)

We can save our own agents' experience in the same format:

from metamon.data import MetamonDataset

env = QueueOnLocalLadder(
    ...,  # rest of args, as in the QueueOnLocalLadder example above
    save_trajectories_to="my_data_path",
)
online_dset = MetamonDataset(
    dset_root="my_data_path",
    formats=["gen9ou"],  # match your env format
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
)
# play one battle with random actions (gymnasium envs must be reset before stepping)
env.reset()
terminated = truncated = False
while not (terminated or truncated):
    *_, terminated, truncated, _ = env.step(env.action_space.sample())

# find completed battles before loading examples
online_dset.refresh_files()

You are free to use this data to train an agent however you'd like, but we provide starting points for smaller-scale IL (python -m metamon.il.train) and RL (python -m metamon.rl.train), and a large set of pretrained models from our paper.




Pretrained Models

We have made every checkpoint of 40+ models available on Hugging Face at jakegrigsby/metamon. Running them requires amago, an RL codebase by the same authors. Follow the instructions here.

Figure 1

Load and run pretrained models with metamon.rl.evaluate. See the full Evaluation README for all eval types (heuristic baselines, local ladder, head-to-head challenges, parameter sweeps, and more). Quick example:

python -m metamon.rl.evaluate --eval_type heuristic --agent Kakuna --gens 1 --formats ou --total_battles 100

Featured Policies

Most Metamon policies were stepping stones to later (better) versions, and are now mainly useful as baselines or extra opponents in self-play data collection. Some notable exceptions worth knowing about are:

Ladder ratings are GXE with sample teams.

| Model | Size | Date | Description | G1 | G2 | G3 | G4 | G9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SyntheticRLV2 | 200M | Sep 2024 | Original paper's best policy. Remains the basis of several successful third-party efforts to specialize in Gen1. Most previous models have complete human ratings (see Paper Policies below), but we have become a lot more cautious about laddering. | 77% | 68% | 64% | 66% | |
| Abra | 57M | Jul 2025 | The best gen9ou agent that was open-sourced during the PokéAgent Challenge, and therefore the basis of many of the best third-party metamon extensions. | | | | | 50% |
| Kadabra3 | 57M | Sep 2025 | The best policy trained in time to participate in the PokéAgent Challenge (as an organizer baseline). #1 in the Gen1OU qualifier and #2 in Gen9OU behind foul-play. | 80% | | | | 64% |
| Kakuna | 142M | Dec 2025 | The best public metamon model, leading by nearly every metric. Trained on diverse teams to serve as a strong foundation for further research in any gen. Appears on all 5 OU leaderboards and is consistently 1500+ Elo in Gen1OU. | 82% | 70% | 63% | 64% | 71% |
| TaurosV0 | 62M | May 2026 | Gen1OU specialist. TaurosEnsemble has held #1 on the human Showdown ladder (following KakunaEnsemble, the first Metamon agent to do so). | 83% | | | | |

Models can be loosely divided into three eras of active development:

  1. RLC Paper: Trained on Gen 1-4 with old versions of the replay dataset and team sets.
  2. NeurIPS PokéAgent Challenge: Developed new baselines by reducing model sizes, reward shaping, and the paper's emphasis on long-term memory while improving generalization over diverse team choices and prioritizing support for gen9ou. However, it took several iterations to recover the paper's Gen 1-4 performance.
  3. Post-PokéAgent: Metamon continues to improve, though as a hobby project for the original team with sporadic development and maintenance. Our focus has shifted away from serving as a "starter kit" and toward chasing expert-level human performance.

Paper Policies

Paper policies play Gens 1-4 and are discussed in detail in the RLC 2025 paper. Some model sizes have several variants testing different RL objectives. See metamon/rl/pretrained.py for a complete list.

| Model Name (--agent) | Description |
| --- | --- |
| SmallIL (2 variants) | 15M imitation learning model trained on 1M human battles |
| SmallRL (5 variants) | 15M actor-critic model trained on 1M human battles |
| MediumIL | 50M imitation learning model trained on 1M human battles |
| MediumRL (3 variants) | 50M actor-critic model trained on 1M human battles |
| LargeIL | 200M imitation learning model trained on 1M human battles |
| LargeRL | 200M actor-critic model trained on 1M human battles |
| SyntheticRLV0 | 200M actor-critic model trained on 1M human + 1M diverse self-play battles |
| SyntheticRLV1 | 200M actor-critic model trained on 1M human + 2M diverse self-play battles |
| SyntheticRLV1_SelfPlay | SyntheticRLV1 finetuned on 2M extra battles against itself |
| SyntheticRLV1_PlusPlus | SyntheticRLV1 finetuned on 2M extra battles against diverse opponents |
| SyntheticRLV2 | Final 200M actor-critic model with value classification trained on 1M human + 4M diverse self-play battles. |

Here is a reference of human evals for key models according to our paper:

[Figure: human evaluation results for key models, from the paper]

PokéAgent Challenge Policies

Policies trained during the PokéAgent Challenge play Gens 1-4 and 9, but have a clear bias towards Gen 1 OU and Gen 9 OU. Their docstrings in metamon/rl/pretrained.py have some extra discussion and eval metrics.

| Model Name (--agent) | Description |
| --- | --- |
| SmallRLGen9Beta | Prototype 15M actor-critic model trained after the dataset was expanded to include Gen9OU |
| Abra | 57M actor-critic trained on parsed-replays v3 and a small set of synthetic battles. First of a new series of Gen9OU-compatible policies trained in a similar style to the paper's "Synthetic" agents. |
| Kadabra, Kadabra2, Kadabra3, Kadabra4 | Further extensions of Abra to larger datasets of self-play battles (> 11M), trained and deployed as organizer baselines throughout the PokéAgent Challenge practice ladder. |
| Alakazam | Considered the final edition of the main PokéAgent Challenge effort. Patches a bug that impacted tera type visibility. Actually slightly worse than Kadabra3/4 with competitive teams, but is more robust to diverse team choices thanks to a larger dataset. |
| Minikazam | 4.7M RNN trained on parsed-replays v4 and a large dataset of self-play battles. Tries to compensate for low parameter count by training on Alakazam's dataset. Creates a decent starting point for finetuning on any GPU. Evals here. |
| Superkazam | An attempt to revisit Alakazam's (11M self-play + 4M human replay) dataset at a model size closer to the original paper (142M). Evals here. |
| Kakuna | The best public metamon agent. Superkazam finetuned on 7M additional self-play battles collected at higher sampling temperature for improved exploration and value estimation. Reduced sampling weight of human replays to prioritize high-Elo self-play data. Compensates for our inattention to Gens 2-4 during the PokéAgent Challenge. Evals here. |

Post-PokéAgent Policies

Policies trained after the conclusion of the PokéAgent Challenge. Metamon releases now chase expert-human performance; models may be specialists in a particular ruleset or trained on unreleased datasets.

| Model Name (--agent) | Description |
| --- | --- |
| V2A* | Many small-scale (~12–15M param) RL hyperparameter ablations (V2A, V2ASeed2, V2ABeta01, V2AGroupedV2ISFilter, etc.) on Gen1OU. Performance is broadly similar (between SyntheticRLV2 and Kakuna), and they are mainly useful for boosting self-play diversity rather than as standalone ladder agents. |
| TaurosV0 | Scales up the V2A findings on a fresh Gen1OU dataset at 50M params. The best standalone Gen1OU policy in metamon to date. TaurosEnsemble also reached #1 on the human Showdown ladder, building on the earlier KakunaEnsemble milestone. |

Internal Leaderboards

Human ratings above are the best way to anchor performance to an external metric, but we primarily rely on self comparisons across generations and team sets to guide new research. We can get a general sense of the relative strength of metamon over time by turning policies loose on a locally hosted Showdown ladder and sampling from the same TeamSet. The PokéAgent Server also serves as a live leaderboard, though team sets are inconsistent.

Gold = PokéAgent Challenge policy, Pink = Paper policy.

Tip

These GXE values are a measure of performance relative to the listed models and have no connection to ratings on the public ladder. TaurosV0 is the best Gen1OU policy, but does not play other rulesets.

Early Gen OU Local GXE
Model Competitive TeamSet Modern Replays TeamSet Avg Rank
G1 G2 G3 G4 G1 G2 G3 G4
Kakuna 75% 66% 63% 60% 68% 71% 67% 69% 1.0
Superkazam 67% 63% 59% 58% 64% 61% 62% 61% 3.0
Kadabra4 66% 60% 58% 58% 68% 60% 66% 63% 3.5
Kadabra3 68% 61% 57% 57% 67% 60% 60% 60% 4.0
Kadabra2 67% 60% 58% 57% 64% 62% 59% 60% 4.4
Alakazam 66% 59% 56% 57% 64% 58% 61% 58% 5.5
SynRLV2 50% 59% 55% 55% 54% 61% 55% 56% 6.9
Kadabra 56% 50% 47% 47% 55% 53% 50% 54% 7.9
SynRLV1++ 43% 47% 41% 45% 47% 49% 48% 48% 10.0
SynRLV1 43% 39% 42% 46% 46% 45% 44% 49% 10.2
SynRLV0 41% 38% 48% 40% 45% 41% 49% 45% 11.1
Abra 39% 44% 44% 45% 40% 45% 48% 48% 11.2
SmallRLGen9Beta 44% 42% 45% 48% 12.0
LargeRL 25% 35% 39% 39% 30% 39% 41% 44% 13.9
Minikazam 39% 34% 34% 34% 41% 36% 36% 39% 14.6
SmallILFA 24% 36% 39% 35% 28% 35% 38% 41% 14.8

Tip

Paper Policies are (predictably) weak in Gen9OU because they were never trained to play the format and use observation spaces that assume Team Preview is not available.

Gen9OU Local GXE
Model Competitive TeamSet Modern Replays TeamSet Avg Rank
Kakuna 76% 74% 1.0
Superkazam 75% 73% 2.5
Kadabra4 75% 73% 2.5
Kadabra3 73% 71% 4.5
Kadabra2 73% 69% 5.0
Alakazam 73% 71% 5.5
Abra 61% 57% 7.0
SmallRLGen9Beta 56% 57% 8.5
Kadabra 58% 55% 8.5
Minikazam 50% 50% 10.0
SynRLV0 32% 36% 11.5
SynRLV2 32% 38% 11.5
SynRLV1++ 32% 33% 13.5
LargeRL 29% 34% 14.0
SynRLV1 31% 32% 14.5
SmallILFA 23% 27% 16.0



Battle Datasets

Metamon provides two types of offline RL datasets in a flexible format that lets you customize observations, rewards, and actions on-the-fly.

Human Replay Datasets

Showdown creates "replays" of battles that players can choose to upload to the website before they expire. We gathered all surviving historical replays for Gen 1-4 OU/NU/UU/Ubers and Gen 9 OU, and continuously save new battles to grow the dataset.

Dataset Overview

Datasets are stored on Hugging Face in two formats:

| Name | Size | Description |
| --- | --- | --- |
| metamon-raw-replays | 2.7M Battles | Our curated set of Pokémon Showdown replay .json files, kept to save the Showdown API some download requests and to maintain an official reference of our training data. Will be regularly updated as new battles are played and collected. |
| metamon-parsed-replays | 5.3M Trajectories | The RL-compatible version of the dataset as reconstructed by the replay parser. This dataset has been significantly expanded and improved since the original paper. |

Parsed replays will download automatically when requested by the ParsedReplayDataset, but these datasets are large. Download in advance with:

python -m metamon.data.download parsed-replays

Then load the dataset as usual:

from metamon.data import ParsedReplayDataset

replay_dset = ParsedReplayDataset(
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    formats=["gen1ou", "gen9ou"],
)
obs_seq, action_seq, reward_seq, done_seq = replay_dset[0]

Server/Replay Sim2Sim Gap

In Showdown RL, we have to embrace a mismatch between the trajectories we observe in our own battles and those we gather from other players' replays. In short, replays are saved from the point of view of a spectator rather than a player. The server sends information to the players that it does not save to its replay, and we need to simulate that missing information. Metamon goes to great lengths to handle this and is always improving (more info here), but the reconstruction can never be perfect.

Therefore, replay data is perhaps best viewed as pretraining data for an offline-to-online finetuning problem. Self-collected data from the online env fixes inaccuracies and can help concentrate on teams we'll be using on the ladder. We have open-sourced large self-play sets (below).


Self-Play Datasets

Almost all improvement in metamon's performance is driven by large and diverse datasets of agent vs. agent battles. Public self-play datasets are stored on Hugging Face at jakegrigsby/metamon-parsed-pile. Trajectories were generated by the ladder self-play launcher with various team sets and model pools.

There are currently three subsets:

Name Size Description
pac-base 11M Trajectories Partially comprised of battles played by organizer baselines on the PokéAgent Challenge practice ladder, but the vast majority are battles collected locally for the purposes of training the Abra, Kadabra, and Alakazam line of policies. The version uploaded here trained Alakazam, and previous models were trained on subsets of this dataset.
pac-exploratory 7M Trajectories Self-play revisited after the NeurIPS challenge with higher sampling temperature (to improve value estimates of sub-optimal actions). Kakuna was trained on metamon-parsed-replays, pac-base, and pac-exploratory.
pac-tauros 4M Trajectories Self-play dataset specifically focused on high-ladder Gen1OU. Contains decisions of base policies in the ~70–83% GXE range on the human ladder, and was used to train policies that can top the public leaderboard.

Self-play data will download automatically when requested by the SelfPlayDataset, but these datasets are large. Download in advance with:

python -m metamon.data.download self-play

This downloads all subsets for their available formats (pac-base and pac-exploratory: gen1ou–gen4ou and gen9ou; pac-tauros: gen1ou only). You can also specify formats explicitly: --formats gen1ou gen9ou.

from metamon.data import SelfPlayDataset

self_play_dset = SelfPlayDataset(
    observation_space=obs_space,
    action_space=action_space,
    reward_function=reward_fn,
    subset="pac-base",  # or "pac-exploratory" or "pac-tauros"
    formats=["gen1ou", "gen9ou"],
)
obs_seq, action_seq, reward_seq, done_seq = self_play_dset[0]



Team Sets

Team sets are directories of Showdown team files that are randomly sampled between episodes. They are stored on Hugging Face at jakegrigsby/metamon-teams and can be downloaded in advance with python -m metamon.data.download teams.

metamon.env.get_metamon_teams(battle_format: str, set_name: str)

The Gen columns give the number of teams available in each generation.

| set_name | Gen1 | Gen2 | Gen3 | Gen4 | Gen9 | Description |
| --- | --- | --- | --- | --- | --- | --- |
| "competitive" | < 30 | < 30 | < 30 | < 30 | < 30 | Human-made teams scraped from forum threads, usually official “sample teams” designed by experts for beginners. This is the set used for human ladder evaluations in the paper. The name is now misleading: these are probably not the most “competitive” teams. However, because they were once the main evaluation and deployment set, Metamon has overfit to them, and they remain strong defaults. |
| "gl_05_26" | 29k | 10k | 107k | 48k | 139k | General Ladder May '26: A successor to the modern_replays idea; fills replays from recent months with teams predicted from time- and rating-appropriate usage stats. |
| "hl_05_26" | 9k | 2k | 22k | 6k | 43k | High Ladder May '26: Subset of gl_05_26 that restricts to higher-rated games and tournament-server replays, with a preference for replays that reveal more of the ground-truth team and a bias towards high-rating usage stats. |

Legacy team sets from the paper and PokéAgent Challenge (paper_variety, paper_replays, modern_replays, modern_replays_v2, etc.) remain available and are described in more detail in the metamon-teams README.

You can also use your own directory of team files with, for example:

from metamon.env import TeamSet

team_set = TeamSet("/path/to/your/team/dir", "gen3ou")  # second argument is the battle_format (str)

Files need the extension ".{battle_format}_team" (e.g., .gen3nu_team).
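
Before pointing a TeamSet at a custom directory, it can help to check which files actually match the naming rule. The sketch below uses only the standard library; `find_team_files` is a hypothetical helper written for this example and is not part of metamon.

```python
from pathlib import Path
from typing import List


def find_team_files(team_dir: str, battle_format: str) -> List[Path]:
    """List files in `team_dir` named like '<anything>.{battle_format}_team'."""
    suffix = f".{battle_format}_team"
    return sorted(
        p for p in Path(team_dir).iterdir()
        if p.is_file() and p.name.endswith(suffix)
    )
```

For example, `find_team_files("/path/to/your/team/dir", "gen3nu")` would return only the `.gen3nu_team` files and skip teams saved for other formats.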




Baselines

baselines/ contains baseline opponents that we can battle against via BattleAgainstBaseline. baselines/heuristics provides more than a dozen heuristic opponents and starter code for developing new ones (or mixing ground-truth Pokémon knowledge into ML agents). baselines/model_based ties the simple IL model checkpoints to poke-env (with CPU inference).

Here is an overview of the opponents mentioned in the paper:

from metamon.baselines import get_baseline, get_all_baseline_names

available = get_all_baseline_names()  # list all available baselines
opponent = get_baseline("Gen1BossAI")  # get a specific baseline by name

Name Description
BugCatcher An actively bad trainer that always picks the least damaging move. When forced to switch, it picks the Pokémon in its party with the worst type matchup vs. the player.
RandomBaseline Selects a legal move (or switch) uniformly at random and measures the most basic level of learning early in training runs.
Gen1BossAI Emulates opponents in the original Pokémon Generation 1 games. Usually chooses random moves. However, it prefers using stat-boosting moves on the second turn and “super effective” moves when available.
Grunt A maximally offensive player. It selects the move that will deal the greatest damage against the current opposing Pokémon, using Pokémon’s damage equation and a type chart, and selects the best matchup by type when forced to switch.
GymLeader Improves upon Grunt by additionally taking into account factors such as health. It prioritizes using stat boosts when the current Pokémon is very healthy, and heal moves when unhealthy.
PokeEnvHeuristic The SimpleHeuristicsPlayer baseline provided by poke-env with configurable difficulty (shortcuts like EasyPokeEnvHeuristic).
EmeraldKaizo An adaptation of the AI in a Pokémon Emerald ROM hack intended to be as difficult as possible. It selects actions by scoring the available options against a rule set that includes handwritten conditional statements for a large portion of the moves in the game.
BaseRNN A simple RNN IL policy trained on an early version of our parsed replay dataset. Runs inference on CPU.

Compare baselines with:

python -m metamon.baselines.compete --battle_format gen2ou --player GymLeader --opponent RandomBaseline --battles 10

Here is a reference for the relative strength of some heuristic baselines from the paper:

[Figure: relative strength of heuristic baselines, from the paper]




Observation Spaces, Action Spaces, & Reward Functions

Metamon tries to separate the RL from Pokémon. All we need to do is pick an ObservationSpace, ActionSpace, and RewardFunction:

  1. The environment outputs a UniversalState
  2. Our ObservationSpace maps the UniversalState to the input of our agent.
  3. Our agent outputs an action however we'd like.
  4. Our ActionSpace converts the agent's choice to a UniversalAction.
  5. The environment takes the current (UniversalState, UniversalAction) and outputs the next UniversalState. Our RewardFunction gives the agent a scalar reward.
  6. Repeat until victory.
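
The six steps above can be sketched with toy stand-ins for each component. Everything in this snippet is hypothetical and made up for illustration (the `Toy*` names, the two-feature state, the damage dynamics); it only shows how the roles fit together, not how metamon implements them.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyState:
    """Hypothetical stand-in for UniversalState with just two features."""
    my_hp: float
    opp_hp: float


def toy_observation(state: ToyState) -> List[float]:
    """ObservationSpace role: state -> policy input."""
    return [state.my_hp, state.opp_hp]


def toy_agent(obs: List[float]) -> str:
    """Agent role: any output format we like (here, a string)."""
    return "attack" if obs[0] >= obs[1] else "switch"


def toy_action_space(choice: str) -> int:
    """ActionSpace role: agent output -> universal action index."""
    return {"attack": 0, "switch": 4}[choice]


def toy_env_step(state: ToyState, action: int) -> ToyState:
    """Environment role: (state, action) -> next state (made-up dynamics)."""
    if action < 4:  # indices 0-3 are moves: damage the opponent
        return ToyState(state.my_hp, max(0.0, state.opp_hp - 0.5))
    return ToyState(state.my_hp, state.opp_hp)  # a switch changes nothing here


def toy_reward(prev: ToyState, nxt: ToyState) -> float:
    """RewardFunction role: scalar computed from consecutive states."""
    return 100.0 if nxt.opp_hp == 0.0 else prev.opp_hp - nxt.opp_hp


state = ToyState(my_hp=1.0, opp_hp=1.0)
rewards = []
while state.opp_hp > 0.0:  # "repeat until victory"
    obs = toy_observation(state)
    action = toy_action_space(toy_agent(obs))
    next_state = toy_env_step(state, action)
    rewards.append(toy_reward(state, next_state))
    state = next_state
print(rewards)
```

The point of the separation is that any one piece (observation, action mapping, or reward) can be swapped without touching the others.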

Observations

UniversalState defines all the features we have access to at each timestep.

The ObservationSpace packs those features into a policy input.
We could create a custom version with more/less features by inheriting from metamon.interface.ObservationSpace.

Observation Space Description
DefaultObservationSpace The text/numerical observation space used in our paper.
ExpandedObservationSpace A slight improvement based on lessons learned from the paper. It also adds tera types for Gen 9.
TeamPreviewObservationSpace Further extends ExpandedObservationSpace with a preview of the opponent's team (for Gen 9).
OpponentMoveObservationSpace Modifies TeamPreviewObservationSpace to include the opponent Pokémon's revealed moves. Continues our trend of deemphasizing long-term memory.
GroupedObservationSpace Restructures similar features into per-Pokémon groups (your active, each switch slot, opponent active) plus a misc block for format, weather, teampreview, and revealed opponent species. Designed for a shared Pokémon encoder rather than one flat text/numbers vector. Used by Post-PokéAgent policies such as TaurosV0.

Tokenization

Text features have inconsistent length, but we can translate to int IDs from a list of known vocab words. The built-in observation spaces are designed such that the "tokenized" version will have fixed length.

from metamon.interface import TokenizedObservationSpace, DefaultObservationSpace
from metamon.tokenizer import get_tokenizer

base_obs = DefaultObservationSpace()
tokenized_space = TokenizedObservationSpace(
    base_obs_space=base_obs,
    tokenizer=get_tokenizer("DefaultObservationSpace-v0"),
)

The vocabs are in metamon/tokenizer; they are generated by tracking unique words across the entire replay dataset, with an unknown token for rare cases we may have missed.

Tokenizer Name Description
allreplays-v3 Legacy version for pre-release models.
DefaultObservationSpace-v0 Updated post-release vocabulary as of metamon-parsed-replays dataset v2.
DefaultObservationSpace-v1 Updated vocabulary as of metamon-parsed-replays dataset v3-beta (adds ~1k words for Gen 9).
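
The idea behind these vocabularies can be sketched in a few lines: collect unique words from a corpus, assign each an integer ID, and map anything unseen to an unknown token. This is a simplified illustration under that assumption, not metamon's tokenizer; `build_vocab`, `tokenize`, and the `<unknown>` token name are hypothetical.

```python
from typing import Dict, Iterable, List

UNKNOWN = "<unknown>"  # hypothetical name for the unknown token


def build_vocab(corpus: Iterable[str]) -> Dict[str, int]:
    """Assign an int ID to every unique whitespace-separated word."""
    vocab = {UNKNOWN: 0}
    for text in corpus:
        for word in text.split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def tokenize(text: str, vocab: Dict[str, int]) -> List[int]:
    """Map words to IDs, falling back to the unknown token for unseen words."""
    return [vocab.get(word, vocab[UNKNOWN]) for word in text.split()]


vocab = build_vocab(["tauros bodyslam", "chansey softboiled"])
ids = tokenize("tauros softboiled missingno", vocab)
print(ids)
```

Because every word maps to exactly one ID, a text observation with a fixed number of words becomes a fixed-length integer vector.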

Actions

Metamon uses a fixed UniversalAction space of 13 discrete choices:

  • {0, 1, 2, 3} use the active Pokémon's moves in alphabetical order.
  • {4, 5, 6, 7, 8} switch to the other Pokémon in the party in alphabetical order.
  • {9, 10, 11, 12} are wildcards for generation-specific gimmicks. Currently, they only apply to Gen 9, where they pick moves (in alphabetical order) with terastallization.

That might not be how we want to set up our agent. The ActionSpace converts between whatever the output of the policy might be and the UniversalAction.

Action Space Description
DefaultActionSpace Standard discrete space of 13 and supports Gen 9.
MinimalActionSpace The original space of 9 choices (4 moves + 5 switches) --- which is all we need for Gen 1-4.
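
The index layout of the 13-way UniversalAction described above can be written down as a small decoder. `describe_universal_action` and its string labels are hypothetical names for this sketch; only the index ranges come from the description above.

```python
from typing import Tuple


def describe_universal_action(index: int) -> Tuple[str, int]:
    """Split a UniversalAction index into (kind, slot within that kind)."""
    if 0 <= index <= 3:
        return ("move", index)  # moves in alphabetical order
    if 4 <= index <= 8:
        return ("switch", index - 4)  # party members in alphabetical order
    if 9 <= index <= 12:
        return ("move_with_tera", index - 9)  # Gen 9 gimmick slots
    raise ValueError(f"action index must be in [0, 12], got {index}")


print(describe_universal_action(6))
```

Under this layout, a 9-choice space only ever produces indices 0 through 8.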

Any new action spaces would be added to metamon.interface.ALL_ACTION_SPACES. A text action space (for LLM-Agents) is on the short-term roadmap.

Rewards

Reward functions assign a scalar reward based on consecutive states (R(s, s')).

Reward Function Description
DefaultShapedReward Shaped reward used by the paper. +/- 100 for win/loss, light shaping for damage dealt, health recovered, status received/inflicted.
BinaryReward Removes the smaller shaping terms and simply provides +/- 100 for win/loss.
AggressiveShapedReward Edits DefaultShapedReward's sparse reward to +200 for winning and 0 for losing.

Any new reward functions would be added to metamon.interface.ALL_REWARD_FUNCTIONS, and we can implement a new one by inheriting from metamon.interface.RewardFunction.
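
To make the R(s, s') idea concrete, here is a toy shaped reward that combines a sparse win/loss term with a small damage-based term. The state fields, the `shaping_scale` parameter, and the shaping formula are all assumptions made for this example; they do not reproduce the actual terms or weights of DefaultShapedReward.

```python
from dataclasses import dataclass


@dataclass
class ToyBattleState:
    """Hypothetical state with only the fields this sketch needs."""
    my_hp_fraction: float
    opp_hp_fraction: float
    battle_won: bool = False
    battle_lost: bool = False


def toy_shaped_reward(prev: ToyBattleState, nxt: ToyBattleState,
                      shaping_scale: float = 1.0) -> float:
    """Reward from consecutive states: +/-100 for win/loss plus light shaping."""
    damage_dealt = prev.opp_hp_fraction - nxt.opp_hp_fraction
    damage_taken = prev.my_hp_fraction - nxt.my_hp_fraction
    reward = shaping_scale * (damage_dealt - damage_taken)
    if nxt.battle_won:
        reward += 100.0
    if nxt.battle_lost:
        reward -= 100.0
    return reward


mid_battle = toy_shaped_reward(ToyBattleState(1.0, 1.0), ToyBattleState(0.75, 0.5))
final_turn = toy_shaped_reward(
    ToyBattleState(0.75, 0.5), ToyBattleState(0.75, 0.0, battle_won=True)
)
print(mid_battle, final_turn)
```

Setting `shaping_scale=0.0` in this sketch leaves only the sparse win/loss term, which mirrors the relationship between a shaped and a binary reward.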



Training and Evaluation

Metamon & Amago Diagram

We trained all of our main RL & IL models with amago. Everything you need to train your own model on metamon data and evaluate against Pokémon baselines is provided in metamon/rl/.

Configure wandb logging (optional):

cd metamon/rl/
export METAMON_WANDB_PROJECT="my_wandb_project_name"
export METAMON_WANDB_ENTITY="my_wandb_username"

Train From Scratch

See python -m metamon.rl.train --help for options. The script trains offline RL agents from scratch: pick model and training gin configs, observation/action/reward interfaces, and a dataset mix (via --dataset_config YAML). Scan metamon/rl/pretrained.py to see the configs used by each public model.

We might retrain the "SmallIL" model like this:

python -m metamon.rl.train \
  --run_name AnyNameHere \
  --model_gin_config small_agent.gin \
  --train_gin_config il.gin \
  --dataset_config self_play_dset.yaml \
  --save_dir ~/my_checkpoint_path/ \
  --log

"SmallRL" would use --train_gin_config exp_rl.gin instead. Larger runs take days and can use multiple GPUs. A smaller-model example is metamon/rl/configs/models/minikazam.gin.


Finetune from HuggingFace

See python -m metamon.rl.finetune --help to start from a public checkpoint and adapt it. Finetuning inherits architecture, observation space, and tokenizer from --base_model, but you can change the training objective (--train_gin_config), reward function (--reward_function), eval setup, and dataset mix (--dataset_config).

First iteration from HuggingFace:

python -m metamon.rl.finetune \
  --run_name minikazam_custom \
  --save_dir ~/metamon_finetunes/ \
  --base_model Minikazam \
  --train_gin_config finetune.gin \
  --dataset_config self_play_dset.yaml \
  --epochs 10 \
  --log \
  --eval_gens 9

Continue from a local run with --prev_run_dir, --prev_run_name, and --prev_checkpoint. Use --base_checkpoint to pick a non-default HuggingFace epoch on the first iteration. For iterative self-play loops (new data piles, annealed mixing, chaining prev_dataset), see the walkthrough in metamon/rl/finetune.py and examples in metamon/rl/configs/datasets/.


Customize

Most changes go through gin configs and CLI flags rather than code edits:

  • Architecture: new metamon/rl/configs/models/*.gin files
  • Training objective / hparams: new metamon/rl/configs/training/*.gin files
  • Observations, actions, rewards: --obs_space, --action_space, and --reward_function (see sections above).
  • Data mix: dataset YAMLs in metamon/rl/configs/datasets/ when you need to change replay/self-play weighting or add custom datasets.

amago's configuration and customization docs cover more features.


Evaluate a Custom Model

See the Evaluation README for full details on all evaluation modes, including automated head-to-head, parameter sweeps, and self-play launchers.

To eval a custom agent trained from scratch (rl.train) we'd create a LocalPretrainedModel. LocalFinetunedModel provides quick setup for models finetuned with rl.finetune. examples/evaluate_custom_models.py shows an example for each.

Standalone Toy IL (Deprecated)

il/ is old toy code that does basic behavior cloning with RNNs. We used it to train early learning-based baselines (BaseRNN, WinsOnlyRNN, and MiniRNN) that you can play against with the BattleAgainstBaseline env. We may add more of these as the dataset grows/improves and more architectures are tried. Playing around with this code might be an easier way to get started, but note that the main rl/train script can also be configured to do RNN BC, only faster and on multiple GPUs.

Get started with something like:

cd metamon/il/
python train.py --run_name any_name_will_do --model_config configs/transformer_embedding.gin --gpu 0


Other Datasets

To support the main raw-replays, parsed-replays, and teams datasets, metamon creates a few resources that may be useful for other purposes:

Usage Stats

Showdown records the frequency of team choices (items, moves, abilities, etc.) brought to battles in a given month. The community mainly uses this data to consider rule changes, but we use it to help predict missing details of partially revealed teams. We load data for an arbitrary window of history around the date a battle was played, and fall back to all-time stats for rare Pokémon where data is limited:

from metamon.backend.team_prediction.usage_stats import get_usage_stats
from datetime import date
usage_stats = get_usage_stats("gen1ou",
    start_date=date(2017, 12, 1),
    end_date=date(2018, 3, 30),
    rank=1500,  # falls back to nearest lower (glicko) rank where data is available
)
alakazam_info: dict = usage_stats["Alakazam"] # non alphanum chars and case are flexible

Download usage stats in advance with:

python -m metamon.data.download usage-stats

The data is stored on huggingface at jakegrigsby/metamon-usage-stats.
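The windowed lookup with an all-time fallback described above can be sketched in plain Python. This is an illustration only and not metamon's implementation: `normalize_name`, `lookup_with_fallback`, and the `min_count` threshold are hypothetical, and the dictionaries stand in for real usage data.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase and drop non-alphanumeric characters (hypothetical helper)."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def lookup_with_fallback(windowed: dict, all_time: dict, name: str, min_count: int = 20) -> dict:
    """Prefer stats from the date window; fall back to all-time stats when sparse.

    `min_count` is an assumed threshold, not a value taken from metamon.
    """
    key = normalize_name(name)
    entry = windowed.get(key)
    if entry is not None and entry.get("count", 0) >= min_count:
        return entry
    return all_time[key]


# Toy data standing in for real usage stats.
windowed = {"alakazam": {"count": 500, "moves": {"Psychic": 0.99}}}
all_time = {
    "alakazam": {"count": 90000, "moves": {"Psychic": 0.98}},
    "farfetchd": {"count": 40, "moves": {"Slash": 0.8}},
}

common = lookup_with_fallback(windowed, all_time, "Alakazam")    # found in the window
rare = lookup_with_fallback(windowed, all_time, "Farfetch'd")    # falls back to all-time
```

Normalizing the key first is what makes lookups tolerant of case and punctuation in the sketch.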

Revealed Teams

One of the main problems the replay parser has to solve is predicting a player's full team based on the "partially revealed" team at the end of the battle. As part of this, we record the revealed team in the standard Showdown team builder format, but with some magic keywords for missing elements. For example:

Tyranitar @ Custap Berry
Ability: Sand Stream
EVs: $missing_ev$ HP / $missing_ev$ Atk / $missing_ev$ Def / $missing_ev$ SpA / $missing_ev$ SpD / $missing_ev$ Spe
$missing_nature$ Nature
IVs: 31 HP / 31 Atk / 31 Def / 31 SpA / 31 SpD / 31 Spe
- Stealth Rock
- Stone Edge
- Pursuit
- $missing_move$

Given the size of our replay dataset, this creates a massive set of real (but incomplete) human team choices. The files are stored alongside the parsed-replay dataset and downloaded with:

python -m metamon.data.download revealed-teams

See Team Prediction for a work-in-progress learned approach to filling in the missing details (heuristic predictors still power the current pipeline).
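As a rough illustration of working with these files, the sketch below counts the placeholder keywords in a revealed team. The regular expression and the `count_missing` helper are assumptions made for this example; only the three keywords shown above are known from the text, and the real files may contain others.

```python
import re
from collections import Counter

# Matches placeholders such as $missing_ev$ or $missing_move$.
MISSING = re.compile(r"\$missing_(\w+)\$")


def count_missing(team_text: str) -> Counter:
    """Count how many of each kind of detail is still unknown (hypothetical helper)."""
    return Counter(MISSING.findall(team_text))


revealed = """Tyranitar @ Custap Berry
Ability: Sand Stream
EVs: $missing_ev$ HP / $missing_ev$ Atk / $missing_ev$ Def / $missing_ev$ SpA / $missing_ev$ SpD / $missing_ev$ Spe
$missing_nature$ Nature
IVs: 31 HP / 31 Atk / 31 Def / 31 SpA / 31 SpD / 31 Spe
- Stealth Rock
- Stone Edge
- Pursuit
- $missing_move$
"""

missing = count_missing(revealed)
```

A count like this makes it easy to filter for teams that are nearly complete, or to measure how much a team predictor has to fill in.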



Battle Backends

Converting Showdown messages to RL observations is very hard, and there will always be bugs. Minor fixes to edge cases or rare Pokémon mechanics are fairly common and don't have a real impact on overall performance. However, a fix that directly impacts observation features (agent inputs) usually decreases performance of policies trained on older battles. We extend the lifespan of pretrained model weights by versioning the "battle backend" so that we can evaluate the agent in an environment that matches the dataset it was trained on.

battle_backend : str is an arg for all the RL environment wrappers (see Quick Start).

There are currently three versions:

| battle_backend | Description | Known Bugs | When To Use |
| --- | --- | --- | --- |
| "poke-env" | Original paper version. Uses poke-env to process online battles. | Creates a sim2sim gap with the replay parser that generates training data from replays. PP counting and tera types are broken. | When evaluating the original paper policies. |
| "pokeagent" | Replaces poke-env's message parsing with metamon's replay parser. Maintains the version used by all the new baselines and datasets created for the PokéAgent Challenge. | Gen9 was in Beta; tera types are reported as missing. | When evaluating a policy trained during the competition (see Pretrained Models). |
| "metamon" | Always the latest version. | | When collecting new self-play data and training new policies from scratch. |

A PretrainedAgent saves the backend it "should" be evaluated with (if you're using them as a baseline). If you are collecting lots of new self-play data and actively working on new training runs: use "metamon". Thanks to a few hacks, it is still reasonable to use any PretrainedAgent to collect new training data in the current metamon backend.
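The guidance above can be written down as a small lookup. This is only a sketch: the purpose labels and the `pick_backend` helper are hypothetical and not part of metamon; only the three backend strings come from the text.

```python
# Hypothetical purpose labels mapped to the backend names listed above.
BACKEND_FOR_PURPOSE = {
    "evaluate_paper_policy": "poke-env",
    "evaluate_competition_policy": "pokeagent",
    "collect_or_train_new": "metamon",
}


def pick_backend(purpose: str) -> str:
    """Return the battle_backend string suggested for a given purpose."""
    try:
        return BACKEND_FOR_PURPOSE[purpose]
    except KeyError:
        options = ", ".join(sorted(BACKEND_FOR_PURPOSE))
        raise ValueError(f"unknown purpose {purpose!r}; expected one of: {options}") from None


backend = pick_backend("collect_or_train_new")
```

The returned string is what would be passed as the battle_backend argument of an environment wrapper.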


Experimental Features

Work-in-progress ML features that are not well tested. Contributions welcome!

Team Prediction Models

Showdown replays only reveal part of each player's team. Metamon has to infer the rest before parsed trajectories are usable. Today, heuristic predictors in metamon/backend/team_prediction/predictor.py (NaiveUsagePredictor, ReplayPredictor) sample from Smogon usage stats and replay-derived candidates; these power the current replay parser, team sets, and related pipelines.

We have also been experimenting with a learned TeamPredictionModel: a transformer over a structured team sequence (Team2Seq) that fills masked species, moves, items, and abilities. At inference time it uses MaskGIT-style parallel decoding — over several iterations, it predicts all still-masked tokens in parallel and commits the highest-confidence fills, gradually unmasking the team (see IterativeTeamDecoder in iterative_decoder.py).
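The decoding loop described above can be sketched without any model code. This is a toy illustration, not the IterativeTeamDecoder: `iterative_decode` and `toy_predict` are hypothetical, the predictor is a fixed stand-in for a trained transformer, and the even per-round commit schedule is a simplification (MaskGIT itself uses a cosine schedule).

```python
import math
from typing import Callable, List, Optional, Tuple

Token = Optional[str]  # None marks a masked slot
Predictor = Callable[[List[Token]], List[Tuple[str, float]]]


def iterative_decode(tokens: List[Token], predict: Predictor, num_iters: int = 3) -> List[Token]:
    """Fill masked slots over several rounds, committing the most confident first."""
    tokens = list(tokens)
    for step in range(num_iters):
        masked = [i for i, t in enumerate(tokens) if t is None]
        if not masked:
            break
        guesses = predict(tokens)  # one (token, confidence) guess per position
        # Commit an even share of the remaining masks; all of them on the last round.
        n_commit = math.ceil(len(masked) / (num_iters - step))
        best = sorted(masked, key=lambda i: guesses[i][1], reverse=True)[:n_commit]
        for i in best:
            tokens[i] = guesses[i][0]
    return tokens


def toy_predict(tokens: List[Token]) -> List[Tuple[str, float]]:
    """Stand-in for a trained model: fixed guesses with fixed confidences."""
    fills = ["Tyranitar", "Stealth Rock", "Stone Edge", "Pursuit", "Crunch"]
    confidences = [0.9, 0.8, 0.7, 0.6, 0.3]
    return list(zip(fills, confidences))


partial = ["Tyranitar", "Stealth Rock", None, None, None]
decoded = iterative_decode(partial, toy_predict)
```

A real model would recompute its predictions each round conditioned on the tokens committed so far, which is the point of unmasking gradually rather than filling everything in one pass.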

Training: python -m metamon.backend.team_prediction.train_prediction_model (see that module for configs and eval).

Evaluation / use: the model API lives in prediction_model.py; wiring into the replay parser and team-set builders is still in progress. See also Revealed Teams and Usage Stats.

Team Preview Models

In Generation 9, battles begin with a "team preview" phase where both players see each other's full team and choose which Pokémon to lead with. Metamon includes a separate model for this decision.

Training: Team preview models are trained via metamon/backend/team_preview/ using supervised learning on human replay data.

Evaluation: Pass a checkpoint to the evaluation script. An example checkpoint for gen9ou is included:

python -m metamon.rl.evaluate --team_preview_checkpoint metamon/backend/team_preview/gen9ou_high_elo_v4/best_model.pt --team_preview_use_argmax ...

The --team_preview_use_argmax flag selects the highest-probability lead deterministically; without it, the model samples from its predicted distribution.
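The difference between the two modes can be shown with a few lines of standard-library Python. This is a sketch of the general idea, not the evaluation script's code: `choose_lead` and the example probabilities are hypothetical.

```python
import random
from typing import Optional, Sequence


def choose_lead(probs: Sequence[float], use_argmax: bool, rng: Optional[random.Random] = None) -> int:
    """Pick a lead slot from a predicted distribution over team members."""
    if use_argmax:
        # Deterministic: always the highest-probability slot.
        return max(range(len(probs)), key=lambda i: probs[i])
    # Stochastic: sample a slot in proportion to its predicted probability.
    rng = rng or random.Random()
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


lead_probs = [0.1, 0.6, 0.3]  # made-up model output for a three-member example
greedy_lead = choose_lead(lead_probs, use_argmax=True)
sampled_lead = choose_lead(lead_probs, use_argmax=False, rng=random.Random(0))
```

Argmax gives reproducible evaluations, while sampling makes the agent's leads harder for an opponent to anticipate.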

Test-Time Ensembling

We have also been experimenting with inference-time ensembles that combine multiple pretrained policies rather than training a single monolithic agent. The implementation lives in metamon/rl/experimental/ensemble/ and uses a heuristic proposer/judge router (HeuristicRouterEnsemblePolicy): member models propose candidate actions, a subset judges them, and the router picks a final move using GXE priors, per-turn uncertainty, and disagreement between experts. The router itself is not trained — it is a hand-tuned inference wrapper over fixed checkpoints.
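To make the routing idea concrete, here is a deliberately simplified sketch that uses prior-weighted voting. It is not HeuristicRouterEnsemblePolicy: the `route` function, the member names, and the default prior of 0.5 are hypothetical, and the judging, uncertainty, and disagreement signals are left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def route(proposals: List[Tuple[str, str, float]], gxe_prior: Dict[str, float]) -> str:
    """Pick an action from (member, action, confidence) proposals.

    Each proposal adds prior * confidence to its action's score, so actions
    backed by several members or by stronger members win.
    """
    scores: Dict[str, float] = defaultdict(float)
    for member, action, confidence in proposals:
        scores[action] += gxe_prior.get(member, 0.5) * confidence
    return max(scores, key=lambda action: scores[action])


# Made-up members, priors, and proposals for one turn.
priors = {"member_a": 0.8, "member_b": 0.7, "member_c": 0.7}
turn_proposals = [
    ("member_a", "move1", 0.9),
    ("member_b", "move2", 0.6),
    ("member_c", "move2", 0.5),
]
chosen = route(turn_proposals, priors)
```

In this example the two weaker members agree and outvote the single stronger one, which is the kind of trade-off a hand-tuned router has to balance.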

Public eval agents are PastaMittens and Exeggcute (nicknames in metamon/rl/experimental/ensemble/agents.yaml, registered via register.py). Each points at a member pool in ensemble_presets.json. PastaMittens uses the kakuna router family (first Metamon agent to reach #1 on the human Showdown ladder); Exeggcute uses the tauros family with a Gen1OU-focused pool built around TaurosV0. The underlying base classes KakunaEnsemble and TaurosEnsemble in pretrained.py are still available but use older hardcoded member specs. Use the nicknames like any other pretrained agent:

python -m metamon.rl.evaluate --eval_type challenge --agent PastaMittens --gens 1 --formats ou ...

Member pools can be overridden at runtime with METAMON_ENSEMBLE_MEMBER_SPECS or by name via METAMON_ENSEMBLE_PRESET (see presets in metamon/rl/experimental/ensemble/ensemble_presets.json). Ensembling is not part of the main offline RL training loop — it is an experimental deployment strategy for squeezing more ladder performance out of existing checkpoints.


Acknowledgements

This project owes a huge debt to the amazing poke-env, as well as Pokémon resources like Bulbapedia, Smogon, and of course Pokémon Showdown.



Citation

@misc{grigsby2025metamon,
      title={Human-Level Competitive Pok\'emon via Scalable Offline Reinforcement Learning with Transformers}, 
      author={Jake Grigsby and Yuqi Xie and Justin Sasek and Steven Zheng and Yuke Zhu},
      year={2025},
      eprint={2504.04395},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.04395}, 
}
