riley-1995/Neural-Audio-Codec-Interpretability

Neural Audio Codec Interpretability

A linear probing study of residual vector quantization (RVQ) layers in two neural audio codecs — EnCodec and SpeechTokenizer — to quantify how phoneme identity, speaker identity, and pitch (F0) distribute across codec depth.

This project extends Sadok et al. 2025 (Bringing Interpretability to Neural Audio Codecs, Interspeech 2025), which used mutual information and t-SNE. We apply linear probing as a complementary and more directly quantifiable methodology, and introduce a comparative codec analysis — identical probing pipelines applied to both codecs side-by-side.


Research Question

How are phoneme identity, speaker identity, and pitch distributed across the 8 RVQ layers of EnCodec vs SpeechTokenizer? Does explicitly designing RVQ-1 for semantic alignment (SpeechTokenizer) produce stronger linear decodability than emergent compression structure (EnCodec)?

Hypothesis: Phoneme information will be more strongly linearly decodable from earlier RVQ layers in SpeechTokenizer (by design — RVQ-1 is aligned to HuBERT semantic tokens), while speaker identity will concentrate in middle-to-deeper layers for both codecs, and pitch (F0) will be weakly decodable across all layers.


Codecs Compared

| | EnCodec | SpeechTokenizer |
|---|---|---|
| Paper | Défossez et al. 2022 | Zhang et al. 2023 |
| Sample rate | 24 kHz | 16 kHz |
| RVQ layers | 8 | 8 |
| Token rate | 75 tokens/sec | 50 tokens/sec |
| Codebook size | 1024 | 1024 |
| Embedding dim | 128 | varies |
| Design intent | General audio compression | RVQ-1 aligned to HuBERT semantic tokens |
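
The token rates in the table determine how many embedding vectors each utterance contributes per layer. A minimal sketch of that arithmetic follows; the helper names are hypothetical, and the exact count a real encoder produces may differ by a frame or so depending on padding.

```python
# Token rates as listed in the codec comparison (tokens per second).
TOKEN_RATE_HZ = {"encodec": 75, "speechtokenizer": 50}


def frame_duration_ms(codec: str) -> float:
    """Audio duration covered by one codec token, in milliseconds."""
    return 1000.0 / TOKEN_RATE_HZ[codec]


def approx_num_tokens(codec: str, duration_s: float) -> int:
    """Approximate tokens per RVQ layer for an utterance of the given length.

    Assumption: plain rounding; a real encoder may pad the final frame.
    """
    return round(duration_s * TOKEN_RATE_HZ[codec])


if __name__ == "__main__":
    for codec in TOKEN_RATE_HZ:
        print(codec, frame_duration_ms(codec), approx_num_tokens(codec, 2.0))
```

Because the two codecs run at different rates, the same utterance yields different numbers of probe training examples per codec, which is worth keeping in mind when comparing layer-wise curves.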

Probing Tasks

| Task | Probe type | Labels | Primary metric |
|---|---|---|---|
| Phoneme identity | Logistic regression | MFA forced alignment TextGrids | Macro-F1 |
| Speaker identity | Logistic regression | LibriSpeech speaker metadata | Macro-F1 |
| Pitch (F0) | Linear regression | YIN algorithm (librosa) | MAE (Hz) and R² |

48 total probes: 2 codecs × 8 layers × 3 tasks. Codec parameters are never updated — all probes train on frozen embeddings.
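
As a rough illustration of how the 48-probe grid can be enumerated, here is a small sketch. The filename pattern follows the one listed under Outputs; the 1-based layer numbering and the codec/task strings are assumptions, not confirmed details of train_probes.py.

```python
from itertools import product

CODECS = ("encodec", "speechtokenizer")
LAYERS = range(1, 9)  # assumption: RVQ layers numbered 1..8
TASKS = ("phoneme", "speaker", "pitch")


def probe_filename(codec: str, layer: int, task: str) -> str:
    """Filename following the probe_{codec}_layer{N}_{task}.pkl pattern."""
    return f"probe_{codec}_layer{layer}_{task}.pkl"


# One (codec, layer, task) triple per probe: 2 x 8 x 3 = 48.
probe_grid = list(product(CODECS, LAYERS, TASKS))

if __name__ == "__main__":
    print(len(probe_grid))
    print(probe_filename(*probe_grid[0]))
```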


Project Structure

.
├── main.py                        # Entry point — orchestrates the full pipeline
│
├── data/
│   ├── LibriSpeech/               # Audio data (gitignored — download separately)
│   │   ├── train-clean-100/       # 251 speakers, 28 539 utterances (~100h)
│   │   ├── train-clean-360/       # 921 speakers (~360h) — optional ablation
│   │   ├── dev-clean/             # Reserved
│   │   └── test-clean/            # Reserved
│   ├── alignments/                # TextGrid files (gitignored — download separately)
│   │   └── train-clean-100/       # speaker/chapter/utterance.TextGrid layout
│   ├── split.py                   # Utterance-level stratified train/eval split
│   ├── load_librispeech.py        # FLAC loader → Utterance dataclass
│   ├── load_alignments.py         # TextGrid parser → phoneme intervals
│   └── extract_pitch.py           # YIN F0 extraction per token
│
├── encode/
│   ├── encode_encodec.py          # EnCodec inference → 8 layer embeddings
│   ├── encode_speechtokenizer.py  # SpeechTokenizer inference → 8 layer embeddings
│   └── collect.py                 # collect_bundle(): encode + align + cache
│
├── probe/
│   ├── train_probes.py            # fit_label_encoders() + train_probes()
│   └── evaluate_probes.py         # evaluate_probes() → layer-wise metrics
│
├── visualize/
│   └── plot_curves.py             # plot_all() → 3 comparison figures
│
├── results/                       # Generated outputs (gitignored)
│   ├── split.json                 # Saved train/eval utterance IDs
│   ├── cache/                     # Per-utterance NPZ embedding cache
│   ├── probes/                    # Saved probe .pkl files + label encoders
│   ├── figures/                   # phoneme_probing.png, speaker_probing.png, pitch_probing.png
│   └── results.pkl                # Raw metrics dict
│
├── proposal/                      # Project proposal (LaTeX + PDF)
├── scripts/
│   ├── setup_env_a100.sh          # A100 environment setup (uv + CUDA 12.4 PyTorch)
│   ├── run_a100_probes.sh         # A100 pipeline launcher
│   └── run_a100_probes.slurm      # Slurm wrapper for A100 launcher
├── pyproject.toml                 # Dependencies (uv-compatible)
└── .gitignore

Setup

1. Clone the repository

git clone <repo-url>
cd Neural-Audio-Codec-Interpretability

2. Install uv

curl -LsSf https://astral.sh/uv/install.sh | sh

3. Create a virtual environment

uv venv --python 3.12
source .venv/bin/activate

4. Install PyTorch (platform-specific)

PyTorch must be installed before the rest of the dependencies because the wheel differs by platform.

macOS (Apple Silicon — MPS):

uv pip install torch torchaudio

Linux / NVIDIA A100 (CUDA 12.4):

uv pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124

5. Install project dependencies

uv pip install -e .

6. A100 shortcut setup script (optional)

For the project A100 environment, run:

bash scripts/setup_env_a100.sh

The script derives PROJECT_ROOT from its own location, creates .venv, installs CUDA 12.4 PyTorch, installs dependencies from pyproject.toml, and redirects uv/PyTorch/Hugging Face caches out of $HOME using the hardcoded project storage path.

If reproducing this work on a different A100 cluster/location, edit the STORAGE value in scripts/setup_env_a100.sh before running.


Data Setup

LibriSpeech audio

Download train-clean-100 from OpenSLR. Extract so the directory structure is:

data/LibriSpeech/train-clean-100/<speaker_id>/<chapter_id>/<utterance_id>.flac

dev-clean, test-clean, and train-clean-360 can optionally be placed under the same data/LibriSpeech/ root.
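
For orientation, the sketch below shows one way to derive speaker, chapter, and utterance IDs from a FLAC path, plus the TextGrid path expected under the alignment layout this README uses (speaker/chapter/utterance.TextGrid). The function and type names are hypothetical and are not taken from load_librispeech.py.

```python
from pathlib import Path
from typing import NamedTuple


class UtterancePaths(NamedTuple):
    speaker_id: str
    chapter_id: str
    utterance_id: str
    flac: Path
    textgrid: Path


def locate(flac, librispeech_root, alignments_root, split="train-clean-100"):
    """Parse IDs from <root>/<split>/<speaker>/<chapter>/<utterance>.flac.

    alignments_root is the split-specific TextGrid root,
    e.g. data/alignments/train-clean-100.
    """
    flac = Path(flac)
    rel = flac.relative_to(Path(librispeech_root) / split)
    speaker_id, chapter_id, filename = rel.parts
    utterance_id = Path(filename).stem
    textgrid = (
        Path(alignments_root) / speaker_id / chapter_id / f"{utterance_id}.TextGrid"
    )
    return UtterancePaths(speaker_id, chapter_id, utterance_id, flac, textgrid)


if __name__ == "__main__":
    example = locate(
        "data/LibriSpeech/train-clean-100/103/1240/103-1240-0000.flac",
        "data/LibriSpeech",
        "data/alignments/train-clean-100",
    )
    print(example)
```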

Forced alignment TextGrids

Download pre-aligned TextGrid files for train-clean-100 from the CorentinJ/librispeech-alignments GitHub Releases page. Extract so the structure is:

data/alignments/train-clean-100/<speaker_id>/<chapter_id>/<utterance_id>.TextGrid

These files give the start and end time of each phoneme interval (e.g., K spanning 0.05–0.12 s). The intervals are used to assign a phoneme label to each codec token position.
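
One plausible way to turn phoneme intervals into per-token labels is to look up the interval that contains each token's centre time. The sketch below assumes that rule; the project's actual rule in load_alignments.py may differ (for example, majority overlap), and the function name and `fill` parameter are hypothetical.

```python
from bisect import bisect_right


def labels_for_tokens(intervals, n_tokens, token_rate_hz, fill=None):
    """Assign a phoneme label to each codec token by its centre time.

    intervals: sorted, non-overlapping (start_s, end_s, phoneme) tuples.
    Tokens whose centre falls outside every interval receive `fill`.
    """
    starts = [start for start, _, _ in intervals]
    labels = []
    for i in range(n_tokens):
        t = (i + 0.5) / token_rate_hz          # centre of token i, in seconds
        j = bisect_right(starts, t) - 1        # last interval starting at or before t
        if j >= 0 and t < intervals[j][1]:
            labels.append(intervals[j][2])
        else:
            labels.append(fill)
    return labels


if __name__ == "__main__":
    grid = [(0.0, 0.05, "sil"), (0.05, 0.12, "K"), (0.12, 0.20, "AE")]
    # 50 tokens/sec matches the SpeechTokenizer rate in the codec table.
    print(labels_for_tokens(grid, n_tokens=10, token_rate_hz=50))
```

Since EnCodec and SpeechTokenizer emit tokens at different rates, the same TextGrid produces a different label sequence length for each codec.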

SpeechTokenizer checkpoint

Download the pretrained SpeechTokenizer weights from the ZhangXInFD/SpeechTokenizer repository. You need two files:

  • speechtokenizer.pt — model weights
  • config.json — model configuration

Running the Pipeline

Sanity check (50 utterances — runs in a few minutes)

python main.py \
    --librispeech_root data/LibriSpeech \
    --alignments_root  data/alignments/train-clean-100 \
    --st_ckpt          /path/to/speechtokenizer.pt \
    --st_config        /path/to/config.json \
    --output_dir       results \
    --max_utterances   50

Full run (embeddings cached after first pass)

python main.py \
    --librispeech_root data/LibriSpeech \
    --alignments_root  data/alignments/train-clean-100 \
    --st_ckpt          /path/to/speechtokenizer.pt \
    --st_config        /path/to/config.json \
    --output_dir       results \
    --max_utterances   0

Local parallel probe run (recommended default profile)

python main.py \
  --librispeech_root data/LibriSpeech \
  --alignments_root  data/alignments/train-clean-100 \
  --st_ckpt          /path/to/speechtokenizer.pt \
  --st_config        /path/to/config.json \
  --output_dir       results \
  --max_utterances   0 \
  --probe_exec_profile local \
  --probe_workers    24

--probe_workers 0 uses profile defaults.

Fast cache/probe warm-up on train-clean-100 (one flag)

python main.py \
  --librispeech_root data/LibriSpeech \
  --alignments_root  data/LibriSpeech-TextGrids/LibriSpeech/train-clean-100 \
  --st_ckpt          data/SpeechTokenizer/speechtokenizer_hubert_avg/SpeechTokenizer.pt \
  --st_config        data/SpeechTokenizer/speechtokenizer_hubert_avg/config.json \
  --output_dir       results \
  --split            train-clean-100 \
  --max_utterances   200 \
  --probe_exec_profile local-fast

local-fast preset behavior:

  • Uses aggressive local probe worker defaults for high-core machines.
  • Sets probe BLAS threads to 1 when not explicitly provided.
  • Sets probe classification max iterations to 300 when left at default.
  • Skips eval/plotting by default for faster cache and probe warm-up.

Add --run_eval to force full evaluation while keeping other local-fast presets.
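
The interaction between the profile preset and the two evaluation flags can be summarized as a small decision function. This is only a sketch of the documented behavior; in particular, the precedence when both --skip_eval and --run_eval are passed is an assumption, and `should_run_eval` is a hypothetical name.

```python
def should_run_eval(profile: str, skip_eval: bool = False, run_eval: bool = False) -> bool:
    """Decide whether evaluation and plotting run after probe training."""
    if run_eval:
        # --run_eval forces evaluation even under local-fast.
        # Assumption: it also wins if --skip_eval is passed alongside it.
        return True
    if skip_eval:
        return False
    # local-fast skips eval/plotting by default; other profiles evaluate.
    return profile != "local-fast"


if __name__ == "__main__":
    for profile in ("sequential", "local", "local-fast", "a100"):
        print(profile, should_run_eval(profile))
```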

A100 workflow (single-node launcher)

The repository includes scripts/run_a100_probes.sh, which forwards to main.py and keeps the same CLI behavior.

export LIBRISPEECH_ROOT=data/LibriSpeech
export ALIGNMENTS_ROOT=data/alignments/train-clean-100
export ST_CKPT=/path/to/speechtokenizer.pt
export ST_CONFIG=/path/to/config.json
export OUTPUT_DIR=results_a100
export PROBE_EXEC_PROFILE=a100
export PROBE_WORKERS=0

bash scripts/run_a100_probes.sh

Set PROBE_WORKERS explicitly to pin concurrency for a specific node.

For Slurm clusters, scripts/run_a100_probes.slurm wraps the same launcher:

sbatch scripts/run_a100_probes.slurm

Optional: train-clean-360 ablation

python main.py \
    --split            train-clean-360 \
    --alignments_root  data/alignments/train-clean-360 \
    --librispeech_root data/LibriSpeech \
    --st_ckpt          /path/to/speechtokenizer.pt \
    --st_config        /path/to/config.json \
    --output_dir       results_360 \
    --max_utterances   0

All CLI arguments

| Argument | Default | Description |
|---|---|---|
| --librispeech_root | (required) | Path to data/LibriSpeech/ |
| --alignments_root | (required) | Path to TextGrid alignment root for the chosen split |
| --st_ckpt | (required) | Path to speechtokenizer.pt |
| --st_config | (required) | Path to SpeechTokenizer config.json |
| --output_dir | results | Directory for all outputs |
| --split | train-clean-100 | LibriSpeech split to use |
| --eval_frac | 0.1 | Fraction of utterances held out for evaluation |
| --max_utterances | 500 | Cap on utterances (0 = no cap) |
| --device | auto | One of auto, cpu, cuda, mps |
| --force_resplit | off | Ignore cached split.json and recompute |
| --probe_exec_profile | local | Probe training profile: sequential, local, local-fast, or a100 |
| --probe_workers | 0 | Max concurrent probe jobs (0 = profile default) |
| --probe_blas_threads | 0 | BLAS/OpenMP threads per probe worker (0 = auto) |
| --probe_max_iter | 1000 | Max iterations for phoneme/speaker logistic probes |
| --skip_eval | off | Stop after probe training (skip eval + plotting) |
| --run_eval | off | Force evaluation even when profile presets skip it |
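
For readers who want to script around the CLI, the argument list above maps onto an argparse parser roughly as follows. This is a reconstruction from the documented names and defaults, not a copy of main.py; the types and the use of store_true for the on/off flags are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented arguments and defaults."""
    p = argparse.ArgumentParser(description="Codec probing pipeline (sketch)")
    p.add_argument("--librispeech_root", required=True)
    p.add_argument("--alignments_root", required=True)
    p.add_argument("--st_ckpt", required=True)
    p.add_argument("--st_config", required=True)
    p.add_argument("--output_dir", default="results")
    p.add_argument("--split", default="train-clean-100")
    p.add_argument("--eval_frac", type=float, default=0.1)
    p.add_argument("--max_utterances", type=int, default=500)  # 0 = no cap
    p.add_argument("--device", default="auto",
                   choices=["auto", "cpu", "cuda", "mps"])
    p.add_argument("--force_resplit", action="store_true")
    p.add_argument("--probe_exec_profile", default="local",
                   choices=["sequential", "local", "local-fast", "a100"])
    p.add_argument("--probe_workers", type=int, default=0)
    p.add_argument("--probe_blas_threads", type=int, default=0)
    p.add_argument("--probe_max_iter", type=int, default=1000)
    p.add_argument("--skip_eval", action="store_true")
    p.add_argument("--run_eval", action="store_true")
    return p


if __name__ == "__main__":
    demo = build_parser().parse_args([
        "--librispeech_root", "data/LibriSpeech",
        "--alignments_root", "data/alignments/train-clean-100",
        "--st_ckpt", "/path/to/speechtokenizer.pt",
        "--st_config", "/path/to/config.json",
    ])
    print(vars(demo))
```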

Pipeline Details

Scan train-clean-100 utterance paths
         │
         ▼
Utterance-level 90/10 split (stratified by speaker)
  → saved to results/split.json for reproducibility
         │
         ▼
For each training utterance:
  ├─ Load forced alignment TextGrid → phoneme labels per token
  ├─ Encode with EnCodec  → 8 × (N_enc_tokens, 128) embeddings   [cached to .npz]
  ├─ Encode with ST       → 8 × (N_st_tokens, D)   embeddings    [cached to .npz]
  └─ Extract pitch (YIN)  → F0 in Hz per token (NaN if unvoiced)
         │
         ▼
Fit label encoders on training data (shared across both codecs)
  → label_encoder_phoneme.pkl, label_encoder_speaker.pkl
         │
         ▼
Train 48 probes via a shared dispatcher (2 codecs × 8 layers × 3 tasks):
  ├─ Phoneme → LogisticRegression (class_weight="balanced")
  ├─ Speaker → LogisticRegression (class_weight="balanced")
  └─ Pitch   → LinearRegression  (voiced frames only)
         │
         ▼
Evaluate on held-out utterances → metrics per layer per codec
         │
         ▼
Plot 3 figures with chance-level reference lines
  → results/figures/{phoneme,speaker,pitch}_probing.png
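
The README does not spell out how the chance-level reference lines are computed. Two common choices for classification probes are uniform guessing (1/K for K classes) and always predicting the most frequent class; the sketch below shows both, with hypothetical function names.

```python
from collections import Counter


def uniform_chance_accuracy(n_classes: int) -> float:
    """Expected accuracy when guessing uniformly among n_classes labels."""
    return 1.0 / n_classes


def majority_baseline_accuracy(labels) -> float:
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


if __name__ == "__main__":
    # 251 speakers in train-clean-100, as listed in the project structure.
    print(uniform_chance_accuracy(251))
    print(majority_baseline_accuracy(["AH"] * 8 + ["ZH"] * 2))
```

With 251 speakers, uniform chance is about 0.4% accuracy, so even modest speaker-probe scores sit well above the reference line.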

Embedding cache: After a codec encodes an utterance for the first time, the 8-layer embeddings are saved to results/cache/{codec}/{utterance_id}.npz. Subsequent runs load from disk, skipping the encoding step entirely. This makes re-running with different probe hyperparameters fast.
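
The load-or-encode pattern behind the cache can be sketched as below. The path layout follows results/cache/{codec}/{utterance_id}.npz; the array key names (`layer_0` ... `layer_7`) and the `encode_fn` callback are assumptions for illustration and may not match collect.py.

```python
from pathlib import Path

import numpy as np


def cache_path(cache_root, codec: str, utterance_id: str) -> Path:
    """Location of the cached embeddings for one utterance and codec."""
    return Path(cache_root) / codec / f"{utterance_id}.npz"


def load_or_encode(cache_root, codec, utterance_id, encode_fn):
    """Return per-layer embeddings, encoding only on a cache miss.

    encode_fn: zero-argument callable returning a list of 2-D arrays,
    one per RVQ layer (hypothetical stand-in for the codec encoder).
    """
    path = cache_path(cache_root, codec, utterance_id)
    if path.exists():
        with np.load(path) as npz:
            return [npz[f"layer_{i}"] for i in range(len(npz.files))]
    layers = encode_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    np.savez(path, **{f"layer_{i}": arr for i, arr in enumerate(layers)})
    return layers


if __name__ == "__main__":
    import tempfile

    def fake_encode():
        # Placeholder embeddings: 8 layers of shape (tokens, dim).
        return [np.zeros((5, 128), dtype=np.float32) for _ in range(8)]

    with tempfile.TemporaryDirectory() as tmp:
        first = load_or_encode(tmp, "encodec", "103-1240-0000", fake_encode)
        second = load_or_encode(tmp, "encodec", "103-1240-0000", fake_encode)
        print(len(first), len(second))
```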

Reproducibility: The train/eval split is saved to results/split.json on first run and reloaded on subsequent runs. The random seed is fixed at 42.
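
A seeded, speaker-stratified utterance split that is written once and reloaded afterwards might look like the sketch below. The JSON schema (`train`/`eval` keys), the per-speaker rounding rule, and the function names are assumptions; only the 90/10 ratio, the seed of 42, and the split.json behavior come from this README.

```python
import json
import random
from collections import defaultdict
from pathlib import Path


def stratified_split(utterance_ids, eval_frac=0.1, seed=42):
    """Hold out a fraction of each speaker's utterances for evaluation.

    utterance_ids: LibriSpeech IDs such as '103-1240-0000', where the
    first dash-separated field is the speaker ID.
    """
    by_speaker = defaultdict(list)
    for uid in sorted(utterance_ids):
        by_speaker[uid.split("-")[0]].append(uid)
    rng = random.Random(seed)
    train, held_out = [], []
    for speaker in sorted(by_speaker):
        utts = by_speaker[speaker]
        rng.shuffle(utts)
        # Keep at least one training utterance per speaker (assumed rule).
        n_eval = max(1, round(len(utts) * eval_frac)) if len(utts) > 1 else 0
        held_out.extend(utts[:n_eval])
        train.extend(utts[n_eval:])
    return {"train": sorted(train), "eval": sorted(held_out)}


def load_or_create_split(path, utterance_ids, force_resplit=False):
    """Reuse a saved split unless force_resplit is set."""
    path = Path(path)
    if path.exists() and not force_resplit:
        return json.loads(path.read_text())
    split = stratified_split(utterance_ids)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(split, indent=2))
    return split


if __name__ == "__main__":
    ids = [f"{spk}-1-{i:04d}" for spk in ("103", "1040") for i in range(20)]
    result = stratified_split(ids)
    print(len(result["train"]), len(result["eval"]))
```

Stratifying by speaker keeps every speaker present in both sets, which the speaker probe needs in order to be evaluated at all.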


Outputs

| File | Description |
|---|---|
| results/split.json | Train/eval utterance IDs; guarantees reproducible splits |
| results/cache/encodec/*.npz | Cached EnCodec embeddings per utterance |
| results/cache/speechtokenizer/*.npz | Cached SpeechTokenizer embeddings per utterance |
| results/probes/probe_{codec}_layer{N}_{task}.pkl | 48 trained probe models |
| results/probes/label_encoder_{task}.pkl | Fitted label encoders |
| results/figures/phoneme_probing.png | Phoneme accuracy + macro-F1 by layer |
| results/figures/speaker_probing.png | Speaker accuracy + macro-F1 by layer |
| results/figures/pitch_probing.png | Pitch MAE (Hz) + R² by layer |
| results/results.pkl | Raw metrics dict for downstream analysis |

Methodology Notes

Why utterance-level split? Splitting at the token level leaks information: tokens from the same utterance, which share a speaker and recording conditions, would appear in both train and eval. Utterance-level splitting keeps every utterance entirely within one set, which is standard practice in NLP probing studies.

Why not dev-clean for evaluation? LibriSpeech is designed so that speakers never overlap between splits: dev-clean's 40 speakers are entirely distinct from train-clean-100's 251 speakers. A speaker probe trained to classify those 251 speakers cannot be evaluated on unseen speakers, because LabelEncoder.transform() raises ValueError for labels it was not fitted on. Using a single utterance-level hold-out keeps the methodology consistent across all three tasks.

Why macro-F1? English phoneme and speaker distributions are skewed. Macro-F1 weights all classes equally regardless of frequency, making it the appropriate primary metric for an imbalanced multi-class problem. class_weight="balanced" in the probe training likewise compensates for class imbalance.
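
To make the difference from accuracy concrete, here is a small pure-Python macro-F1 (the unweighted mean of per-class F1 scores) applied to a toy imbalanced example. It is an illustrative reimplementation with a hypothetical function name, not the project's evaluation code.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes F1 = 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    y_true = ["AH"] * 8 + ["ZH"] * 2
    y_pred = ["AH"] * 10  # a probe that ignores the rare class
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy, macro_f1(y_true, y_pred))
```

In this toy case accuracy is 0.8 while macro-F1 is about 0.44, because the rare class, which is never predicted, counts as much as the frequent one.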

What is F0 / pitch? The YIN algorithm estimates the fundamental frequency (F0), the "base pitch" of a voiced speech frame in Hz (e.g., roughly 120 Hz for a typical low male voice, 220 Hz for a typical higher female voice). Unvoiced frames (voiceless consonants, silence) have no fundamental frequency and are excluded from regression. A high R² means the codec embedding linearly predicts the frame-level F0 of the speech.
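
The two pitch metrics, computed over voiced frames only, can be sketched as follows. The sketch assumes unvoiced frames are marked with NaN in the reference F0 (as in the pipeline diagram) and uses the standard definitions of MAE and R²; the function name is hypothetical.

```python
import math


def voiced_pitch_metrics(f0_true, f0_pred):
    """Return (MAE in Hz, R^2) over frames whose reference F0 is not NaN."""
    pairs = [(t, p) for t, p in zip(f0_true, f0_pred) if not math.isnan(t)]
    if not pairs:
        raise ValueError("no voiced frames to evaluate")
    n = len(pairs)
    mae = sum(abs(t - p) for t, p in pairs) / n
    mean_true = sum(t for t, _ in pairs) / n
    ss_res = sum((t - p) ** 2 for t, p in pairs)     # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t, _ in pairs)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return mae, r2


if __name__ == "__main__":
    nan = float("nan")
    reference = [120.0, nan, 220.0, 180.0]   # second frame is unvoiced
    predicted = [130.0, 999.0, 210.0, 180.0]
    print(voiced_pitch_metrics(reference, predicted))
```

R² compares the probe against always predicting the mean F0, so a value near zero means the layer carries little linearly accessible pitch information even if the MAE looks moderate.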


Background Literature

| Paper | Role |
|---|---|
| Sadok et al. 2025 (arXiv:2506.04492) | Primary reference; this project extends their methodology |
| Défossez et al. 2022 (arXiv:2210.13438) | EnCodec architecture |
| Zhang et al. 2023 (arXiv:2308.16692) | SpeechTokenizer architecture |
| Park et al. 2025 (arXiv:2509.01390) | Statistical analysis of codec token structure |
| Belinkov 2022 (arXiv:2102.12452) | Probing classifier methodology and best practices |

Expected Results

Based on codec design and the prior literature:

| Task | EnCodec (expected) | SpeechTokenizer (expected) |
|---|---|---|
| Phoneme | Gradual rise in early layers, then plateau | Sharp rise at layer 1 (HuBERT alignment by design) |
| Speaker | Weak early, stronger in mid/deep layers | Similar pattern but possibly different depth |
| Pitch | Low R² across all layers | Low R² across all layers (replicates Sadok et al.) |

Citation

If you use this code or build on this analysis:

Riley Denn and Akshay Aralikatti. Neural Audio Codec Interpretability: 
A Linear Probing Study of RVQ Layers in EnCodec and SpeechTokenizer. 
CSCI 682 Final Project, California State University, Chico, Spring 2026.
