A linear probing study of residual vector quantization (RVQ) layers in two neural audio codecs — EnCodec and SpeechTokenizer — to quantify how phoneme identity, speaker identity, and pitch (F0) distribute across codec depth.
This project extends Sadok et al. 2025 (Bringing Interpretability to Neural Audio Codecs, Interspeech 2025), which used mutual information and t-SNE. We apply linear probing as a complementary and more directly quantifiable methodology, and introduce a comparative codec analysis — identical probing pipelines applied to both codecs side-by-side.
How are phoneme identity, speaker identity, and pitch distributed across the 8 RVQ layers of EnCodec vs SpeechTokenizer? Does explicitly designing RVQ-1 for semantic alignment (SpeechTokenizer) produce stronger linear decodability than emergent compression structure (EnCodec)?
Hypothesis: Phoneme information will be more strongly linearly decodable from earlier RVQ layers in SpeechTokenizer (by design — RVQ-1 is aligned to HuBERT semantic tokens), while speaker identity will concentrate in middle-to-deeper layers for both codecs, and pitch (F0) will be weakly decodable across all layers.
| | EnCodec | SpeechTokenizer |
|---|---|---|
| Paper | Défossez et al. 2022 | Zhang et al. 2023 |
| Sample rate | 24 kHz | 16 kHz |
| RVQ layers | 8 | 8 |
| Token rate | 75 tokens/sec | 50 tokens/sec |
| Codebook size | 1024 | 1024 |
| Embedding dim | 128 | varies |
| Design intent | General audio compression | RVQ-1 aligned to HuBERT semantic tokens |
| Task | Probe type | Labels | Primary metric |
|---|---|---|---|
| Phoneme identity | Logistic regression | MFA forced alignment TextGrids | Macro-F1 |
| Speaker identity | Logistic regression | LibriSpeech speaker metadata | Macro-F1 |
| Pitch (F0) | Linear regression | YIN algorithm (librosa) | R² |
48 total probes: 2 codecs × 8 layers × 3 tasks. Codec parameters are never updated — all probes train on frozen embeddings.
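The probe count can be checked by enumerating the grid directly. The following is a minimal sketch, not the project's dispatcher: the codec and task names are taken from the cache and figure paths listed in this README, while the 1-based layer numbering and the helper names `probe_grid` and `probe_filename` are assumptions made for illustration.

```python
from itertools import product

CODECS = ("encodec", "speechtokenizer")
LAYERS = range(1, 9)  # RVQ layers 1..8 (1-based numbering is an assumption)
TASKS = ("phoneme", "speaker", "pitch")


def probe_grid():
    """Return one (codec, layer, task) tuple per probe."""
    return list(product(CODECS, LAYERS, TASKS))


def probe_filename(codec, layer, task):
    """Hypothetical helper mirroring the probe_{codec}_layer{N}_{task}.pkl pattern."""
    return f"probe_{codec}_layer{layer}_{task}.pkl"


grid = probe_grid()
print(len(grid))                  # 48
print(probe_filename(*grid[0]))   # probe_encodec_layer1_phoneme.pkl
```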
.
├── main.py # Entry point — orchestrates the full pipeline
│
├── data/
│ ├── LibriSpeech/ # Audio data (gitignored — download separately)
│ │ ├── train-clean-100/ # 251 speakers, 28 539 utterances (~100h)
│ │ ├── train-clean-360/ # 921 speakers (~360h) — optional ablation
│ │ ├── dev-clean/ # Reserved
│ │ └── test-clean/ # Reserved
│ ├── alignments/ # TextGrid files (gitignored — download separately)
│ │ └── train-clean-100/ # speaker/chapter/utterance.TextGrid layout
│ ├── split.py # Utterance-level stratified train/eval split
│ ├── load_librispeech.py # FLAC loader → Utterance dataclass
│ ├── load_alignments.py # TextGrid parser → phoneme intervals
│ └── extract_pitch.py # YIN F0 extraction per token
│
├── encode/
│ ├── encode_encodec.py # EnCodec inference → 8 layer embeddings
│ ├── encode_speechtokenizer.py # SpeechTokenizer inference → 8 layer embeddings
│ └── collect.py # collect_bundle(): encode + align + cache
│
├── probe/
│ ├── train_probes.py # fit_label_encoders() + train_probes()
│ └── evaluate_probes.py # evaluate_probes() → layer-wise metrics
│
├── visualize/
│ └── plot_curves.py # plot_all() → 3 comparison figures
│
├── results/ # Generated outputs (gitignored)
│ ├── split.json # Saved train/eval utterance IDs
│ ├── cache/ # Per-utterance NPZ embedding cache
│ ├── probes/ # Saved probe .pkl files + label encoders
│ ├── figures/ # phoneme_probing.png, speaker_probing.png, pitch_probing.png
│ └── results.pkl # Raw metrics dict
│
├── proposal/ # Project proposal (LaTeX + PDF)
├── scripts/
│ ├── setup_env_a100.sh # A100 environment setup (uv + CUDA 12.4 PyTorch)
│ ├── run_a100_probes.sh # A100 pipeline launcher
│ └── run_a100_probes.slurm # Slurm wrapper for A100 launcher
├── pyproject.toml # Dependencies (uv-compatible)
└── .gitignore
git clone <repo-url>
cd Neural-Audio-Codec-Interpretability

curl -LsSf https://astral.sh/uv/install.sh | sh

uv venv --python 3.12
source .venv/bin/activate

PyTorch must be installed before the rest of the dependencies because the wheel differs by platform.
macOS (Apple Silicon, MPS):
uv pip install torch torchaudio

Linux / NVIDIA A100 (CUDA 12.4):
uv pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124

uv pip install -e .

For the project A100 environment, run:
bash scripts/setup_env_a100.sh

The script derives PROJECT_ROOT from its own location, creates .venv, installs CUDA 12.4 PyTorch, installs dependencies from pyproject.toml, and redirects uv/PyTorch/Hugging Face caches out of $HOME using the hardcoded project storage path.
If reproducing this work on a different A100 cluster/location, edit the STORAGE value in scripts/setup_env_a100.sh before running.
Download train-clean-100 from OpenSLR. Extract so the directory structure is:
data/LibriSpeech/train-clean-100/<speaker_id>/<chapter_id>/<utterance_id>.flac
dev-clean, test-clean, and train-clean-360 can optionally be placed under the same data/LibriSpeech/ root.
Download pre-aligned TextGrid files for train-clean-100 from the CorentinJ/librispeech-alignments GitHub Releases page. Extract so the structure is:
data/alignments/train-clean-100/<speaker_id>/<chapter_id>/<utterance_id>.TextGrid
Each TextGrid lists time-stamped phoneme intervals (e.g., /K/ from 0.05 to 0.12 s), which are used to assign a phoneme label to each codec token position.
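One way to turn interval alignments into per-token labels is to look up the phoneme that covers each token's centre time. From the sample and token rates listed earlier, both codecs advance 320 samples per token (24 000 / 75 and 16 000 / 50), so only the token rate is needed here. This is a sketch under assumptions: the centre-time rule, the `label_tokens` name, and the `"sil"` fallback label are illustrative and may differ from what `load_alignments.py` actually does.

```python
def label_tokens(intervals, n_tokens, token_rate, default="sil"):
    """Assign one phoneme label per codec token using the token's centre time.

    intervals:  list of (start_sec, end_sec, phoneme) tuples parsed from a TextGrid.
    token_rate: tokens per second (75 for EnCodec, 50 for SpeechTokenizer).
    default:    label used when no interval covers the token centre (assumption).
    """
    labels = []
    for i in range(n_tokens):
        centre = (i + 0.5) / token_rate
        label = default
        for start, end, phoneme in intervals:
            if start <= centre < end:
                label = phoneme
                break
        labels.append(label)
    return labels


# Toy alignment covering 0.20 s of audio, i.e. 15 EnCodec tokens at 75 tokens/sec.
intervals = [(0.00, 0.05, "sil"), (0.05, 0.12, "K"), (0.12, 0.20, "AE")]
labels = label_tokens(intervals, n_tokens=15, token_rate=75)
print(labels)
```

A linear scan over intervals is enough for a sketch; for long utterances a single pass with a moving interval pointer would avoid the nested loop.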
Download the pretrained SpeechTokenizer weights from the ZhangXInFD/SpeechTokenizer repository. You need two files:
- speechtokenizer.pt: model weights
- config.json: model configuration
python main.py \
--librispeech_root data/LibriSpeech \
--alignments_root data/alignments/train-clean-100 \
--st_ckpt /path/to/speechtokenizer.pt \
--st_config /path/to/config.json \
--output_dir results \
--max_utterances 50

python main.py \
--librispeech_root data/LibriSpeech \
--alignments_root data/alignments/train-clean-100 \
--st_ckpt /path/to/speechtokenizer.pt \
--st_config /path/to/config.json \
--output_dir results \
--max_utterances 0

python main.py \
--librispeech_root data/LibriSpeech \
--alignments_root data/alignments/train-clean-100 \
--st_ckpt /path/to/speechtokenizer.pt \
--st_config /path/to/config.json \
--output_dir results \
--max_utterances 0 \
--probe_exec_profile local \
--probe_workers 24

--probe_workers 0 uses profile defaults.
python main.py \
--librispeech_root data/LibriSpeech \
--alignments_root data/LibriSpeech-TextGrids/LibriSpeech/train-clean-100 \
--st_ckpt data/SpeechTokenizer/speechtokenizer_hubert_avg/SpeechTokenizer.pt \
--st_config data/SpeechTokenizer/speechtokenizer_hubert_avg/config.json \
--output_dir results \
--split train-clean-100 \
--max_utterances 200 \
--probe_exec_profile local-fast

local-fast preset behavior:
- Uses aggressive local probe worker defaults for high-core machines.
- Sets probe BLAS threads to 1 when not explicitly provided.
- Sets probe classification max iterations to 300 when left at default.
- Skips eval/plotting by default for faster cache and probe warm-up.
Add --run_eval to force full evaluation while keeping other local-fast presets.
The repository includes scripts/run_a100_probes.sh, which forwards to main.py and keeps the same CLI behavior.
export LIBRISPEECH_ROOT=data/LibriSpeech
export ALIGNMENTS_ROOT=data/alignments/train-clean-100
export ST_CKPT=/path/to/speechtokenizer.pt
export ST_CONFIG=/path/to/config.json
export OUTPUT_DIR=results_a100
export PROBE_EXEC_PROFILE=a100
export PROBE_WORKERS=0
bash scripts/run_a100_probes.sh

Set PROBE_WORKERS explicitly to pin concurrency for a specific node.
For Slurm clusters, scripts/run_a100_probes.slurm wraps the same launcher:
sbatch scripts/run_a100_probes.slurm

python main.py \
--split train-clean-360 \
--alignments_root data/alignments/train-clean-360 \
--librispeech_root data/LibriSpeech \
--st_ckpt /path/to/speechtokenizer.pt \
--st_config /path/to/config.json \
--output_dir results_360 \
--max_utterances 0

| Argument | Default | Description |
|---|---|---|
| --librispeech_root | (required) | Path to data/LibriSpeech/ |
| --alignments_root | (required) | Path to TextGrid alignment root for the chosen split |
| --st_ckpt | (required) | Path to speechtokenizer.pt |
| --st_config | (required) | Path to SpeechTokenizer config.json |
| --output_dir | results | Directory for all outputs |
| --split | train-clean-100 | LibriSpeech split to use |
| --eval_frac | 0.1 | Fraction of utterances held out for evaluation |
| --max_utterances | 500 | Cap on utterances (0 = no cap) |
| --device | auto | One of auto, cpu, cuda, mps |
| --force_resplit | off | Ignore cached split.json and recompute |
| --probe_exec_profile | local | Probe training profile: one of sequential, local, local-fast, a100 |
| --probe_workers | 0 | Max concurrent probe jobs (0 = profile default) |
| --probe_blas_threads | 0 | BLAS/OpenMP threads per probe worker (0 = auto) |
| --probe_max_iter | 1000 | Max iterations for phoneme/speaker logistic probes |
| --skip_eval | off | Stop after probe training (skip eval + plotting) |
| --run_eval | off | Force evaluation even when profile presets skip it |
Scan train-clean-100 utterance paths
│
▼
Utterance-level 90/10 split (stratified by speaker)
→ saved to results/split.json for reproducibility
│
▼
For each training utterance:
├─ Load forced alignment TextGrid → phoneme labels per token
├─ Encode with EnCodec → 8 × (N_enc_tokens, 128) embeddings [cached to .npz]
├─ Encode with ST → 8 × (N_st_tokens, D) embeddings [cached to .npz]
└─ Extract pitch (YIN) → F0 in Hz per token (NaN if unvoiced)
│
▼
Fit label encoders on training data (shared across both codecs)
→ label_encoder_phoneme.pkl, label_encoder_speaker.pkl
│
▼
Train 48 probes via a shared dispatcher (2 codecs × 8 layers × 3 tasks):
├─ Phoneme → LogisticRegression (class_weight="balanced")
├─ Speaker → LogisticRegression (class_weight="balanced")
└─ Pitch → LinearRegression (voiced frames only)
│
▼
Evaluate on held-out utterances → metrics per layer per codec
│
▼
Plot 3 figures with chance-level reference lines
→ results/figures/{phoneme,speaker,pitch}_probing.png
Embedding cache: After a codec encodes an utterance for the first time, the 8-layer embeddings are saved to results/cache/{codec}/{utterance_id}.npz. Subsequent runs load from disk, skipping the encoding step entirely. This makes re-running with different probe hyperparameters fast.
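The load-or-encode pattern behind such a cache can be sketched as follows. This is not the code in `encode/collect.py`: the function name `cached_encode`, the `layer{i}` array keys, and the `encode_fn` callback are assumptions chosen for illustration, and a stand-in encoder replaces the real codecs so the example runs without model weights.

```python
import tempfile
from pathlib import Path

import numpy as np


def cached_encode(utterance_id, codec, cache_root, encode_fn):
    """Return per-layer embeddings, loading from an .npz cache when one exists."""
    path = Path(cache_root) / codec / f"{utterance_id}.npz"
    if path.exists():
        with np.load(path) as data:
            return [data[f"layer{i}"] for i in range(len(data.files))]
    layers = encode_fn(utterance_id)  # hypothetical: one (N, D) array per RVQ layer
    path.parent.mkdir(parents=True, exist_ok=True)
    np.savez(path, **{f"layer{i}": layer for i, layer in enumerate(layers)})
    return layers


calls = []


def fake_encode(utterance_id):
    """Stand-in for a codec forward pass; records how often it is invoked."""
    calls.append(utterance_id)
    return [np.zeros((5, 128), dtype=np.float32) for _ in range(8)]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_encode("103-1240-0000", "encodec", tmp, fake_encode)
    second = cached_encode("103-1240-0000", "encodec", tmp, fake_encode)

print(len(second), second[0].shape, len(calls))  # 8 (5, 128) 1
```

The encoder runs once even though the utterance is requested twice, which is why probe hyperparameters can be changed without paying the encoding cost again.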
Reproducibility: The train/eval split is saved to results/split.json on first run and reloaded on subsequent runs. The random seed is fixed at 42.
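A speaker-stratified, utterance-level split with a persisted result might look like the sketch below. It assumes LibriSpeech-style IDs of the form speaker-chapter-utterance and a simple JSON layout with "train" and "eval" keys; the function names and the per-speaker rounding rule are illustrative and not taken from `data/split.py`.

```python
import json
import random
import tempfile
from collections import defaultdict
from pathlib import Path


def stratified_split(utterance_ids, eval_frac=0.1, seed=42):
    """Hold out roughly eval_frac of each speaker's utterances for evaluation."""
    by_speaker = defaultdict(list)
    for utt in utterance_ids:
        by_speaker[utt.split("-")[0]].append(utt)  # speaker ID is the first field
    rng = random.Random(seed)
    train, held_out = [], []
    for speaker in sorted(by_speaker):
        utts = sorted(by_speaker[speaker])
        rng.shuffle(utts)
        # Speakers with a single utterance stay in train so eval has no unseen label.
        n_eval = max(1, round(len(utts) * eval_frac)) if len(utts) > 1 else 0
        held_out.extend(utts[:n_eval])
        train.extend(utts[n_eval:])
    return {"train": sorted(train), "eval": sorted(held_out)}


def load_or_create_split(path, utterance_ids, force_resplit=False):
    """Reuse a saved split when present; otherwise compute and save one."""
    path = Path(path)
    if path.exists() and not force_resplit:
        return json.loads(path.read_text())
    split = stratified_split(utterance_ids)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(split, indent=2))
    return split


ids = [f"{spk}-1000-{i:04d}" for spk in ("19", "26") for i in range(20)]
split = stratified_split(ids)
print(len(split["train"]), len(split["eval"]))  # 36 4

with tempfile.TemporaryDirectory() as tmp:
    first = load_or_create_split(Path(tmp) / "split.json", ids)
    again = load_or_create_split(Path(tmp) / "split.json", ids[:10])  # cached file wins
    print(first == again)  # True
```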
| File | Description |
|---|---|
| results/split.json | Train/eval utterance IDs; guarantees reproducible splits |
| results/cache/encodec/*.npz | Cached EnCodec embeddings per utterance |
| results/cache/speechtokenizer/*.npz | Cached SpeechTokenizer embeddings per utterance |
| results/probes/probe_{codec}_layer{N}_{task}.pkl | 48 trained probe models |
| results/probes/label_encoder_{task}.pkl | Fitted label encoders |
| results/figures/phoneme_probing.png | Phoneme accuracy + macro-F1 by layer |
| results/figures/speaker_probing.png | Speaker accuracy + macro-F1 by layer |
| results/figures/pitch_probing.png | Pitch MAE (Hz) + R² by layer |
| results/results.pkl | Raw metrics dict for downstream analysis |
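Why utterance-level split? Splitting at the token level leaks information: tokens from the same utterance share a speaker, recording conditions, and phonetic context, so when they appear in both train and eval the probe is partly evaluated on data it has effectively seen. Utterance-level splitting keeps every utterance entirely in one set, which is standard practice in NLP probing studies.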
Why not dev-clean for evaluation? LibriSpeech is designed so speakers never overlap between splits. Dev-clean's 40 speakers are entirely distinct from train-clean-100's 251 speakers. A speaker probe trained to classify 251 speakers cannot evaluate on unseen speakers — LabelEncoder.transform() raises ValueError. Using a single utterance-level hold-out keeps the methodology consistent across all three tasks.
Why macro-F1? English phoneme and speaker distributions are skewed. Macro-F1 weights all classes equally regardless of frequency, making it the appropriate primary metric for an imbalanced multi-class problem. class_weight="balanced" in the probe training likewise compensates for class imbalance.
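A small worked example shows why accuracy alone can mislead on skewed labels. The sketch below implements macro-F1 in plain Python, averaging over the classes present in the reference labels; the toy labels are made up for illustration. Library implementations such as scikit-learn's may choose the set of averaged classes differently.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of positions where prediction and reference agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# A degenerate probe that always predicts the majority phoneme.
y_true = ["AH"] * 90 + ["ZH"] * 10
y_pred = ["AH"] * 100

print(accuracy(y_true, y_pred))            # 0.9
print(round(macro_f1(y_true, y_pred), 3))  # 0.474
```

The majority-class predictor reaches 90% accuracy while its macro-F1 stays below 0.5, because the rare class contributes an F1 of zero with the same weight as the frequent one.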
What is F0 / pitch? The YIN algorithm estimates the fundamental frequency (F0), the "base pitch" of a voiced speech frame in Hz (e.g., about 120 Hz for a typical low male voice, about 220 Hz for a typical higher female voice). Unvoiced frames (voiceless consonants, silence) have no fundamental frequency and are excluded from regression. A high R² means a linear function of the codec embedding explains most of the variance in per-token F0.
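The voiced-only R² computation can be sketched in a few lines. The example assumes unvoiced tokens are marked with NaN in the reference F0, as in the pipeline description; the function name `r2_voiced` and the toy values are illustrative.

```python
import math


def r2_voiced(f0_true, f0_pred):
    """Coefficient of determination over voiced tokens (NaN in f0_true = unvoiced)."""
    pairs = [(t, p) for t, p in zip(f0_true, f0_pred) if not math.isnan(t)]
    mean_t = sum(t for t, _ in pairs) / len(pairs)
    ss_res = sum((t - p) ** 2 for t, p in pairs)       # residual sum of squares
    ss_tot = sum((t - mean_t) ** 2 for t, _ in pairs)  # total sum of squares
    return 1.0 - ss_res / ss_tot


nan = float("nan")
f0_true = [120.0, nan, 130.0, 140.0, nan, 150.0]
perfect = [120.0, 0.0, 130.0, 140.0, 0.0, 150.0]  # predictions at unvoiced tokens are ignored
mean_only = [135.0] * 6                           # always predicts the voiced mean

print(r2_voiced(f0_true, perfect))    # 1.0
print(r2_voiced(f0_true, mean_only))  # 0.0
```

An R² of 0 corresponds to a probe that does no better than predicting the mean F0, which is the natural reference point for the pitch figure. The sketch does not guard against a constant reference F0, where the total sum of squares is zero.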
| Paper | Role |
|---|---|
| Sadok et al. 2025 — arXiv:2506.04492 | Primary reference; this project extends their methodology |
| Défossez et al. 2022 — arXiv:2210.13438 | EnCodec architecture |
| Zhang et al. 2023 — arXiv:2308.16692 | SpeechTokenizer architecture |
| Park et al. 2025 — arXiv:2509.01390 | Statistical analysis of codec token structure |
| Belinkov 2022 — arXiv:2102.12452 | Probing classifier methodology and best practices |
Based on codec design and the prior literature:
| Task | EnCodec (expected) | SpeechTokenizer (expected) |
|---|---|---|
| Phoneme | Gradual rise in early layers, then plateau | Sharp rise at layer 1 (HuBERT alignment by design) |
| Speaker | Weak early, stronger in mid/deep layers | Similar pattern but possibly different depth |
| Pitch | Low R² across all layers | Low R² across all layers (replicates Sadok et al.) |
If you use this code or build on this analysis:
Riley Denn and Akshay Aralikatti. Neural Audio Codec Interpretability:
A Linear Probing Study of RVQ Layers in EnCodec and SpeechTokenizer.
CSCI 682 Final Project, California State University, Chico, Spring 2026.