FALCON (Forced Alignment through Contrastive Optimization Networks) is an end-to-end, fully differentiable neural system for phoneme-level and word-level forced alignment, including multilingual use: given a speech waveform and a known phoneme or word transcript, it predicts precise phoneme boundary timestamps.
Rotem Rousso, Eyal Cohen, Joseph Keshet
"Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming"
Preprint, 2026 — arXiv:2606.25460
If you use FALCON in your research, please cite:
@article{rousso2026falcon,
title = {Fully Differentiable Neural Forced Alignment via Soft Dynamic Programming},
author = {Rousso, Rotem and Cohen, Eyal and Keshet, Joseph},
journal = {arXiv preprint arXiv:2606.25460},
year = {2026}
}

- 🎯 Phoneme-level precision: operates at ~10 ms frame resolution, the finest granularity among neural forced aligners.
- 🔁 Fully differentiable: a Soft Dynamic Programming decoder enables gradient flow through the entire alignment pipeline.
- 🌍 Zero-shot multilingual: trained on English, generalizes to unseen languages at inference (tested on Dutch, German, and Hebrew, among others) without any additional training.
- 📝 Word-level generalization: word boundaries are derived from phoneme predictions at test time, competitive with word-level neural aligners.
The system has three jointly trained components:
Raw waveform (16 kHz)
│
▼
┌──────────────────┐
│ CNN Encoder f_θ │ learns boundary-sensitive representations
│ (5 strided conv) │ via MNCE contrastive loss
└───────┬──────────┘
│ Z (frame-level latent features)
├──────────────────────────────────────┐
▼ │ Z
┌────────────────────┐ │
│ BiLSTM Context g_ψ │ frame-wise phoneme │
│ (5 layers, 512 dim)│ posterior probs U │
└───────┬────────────┘ │
│ U │
▼ ▼
┌───────────────────────────────────────────────────┐
│ Soft-DP Decoder h_W │
│ φ₁(Z): acoustic boundary sharpness │
│ φ₂(U,p): phoneme posterior consistency │
│ LogSumExp forward + soft-argmax backtrack │
└───────────────────────────────────────────────────┘
│
▼
Predicted boundary timestamps ŷ = (ŷ₁, …, ŷₙ)
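The "LogSumExp forward" step in the diagram can be illustrated with a small, self-contained sketch. This is not FALCON's decoder: the cost matrix, the function names `softmin` and `soft_dp_value`, and the temperature `gamma` are assumptions made for illustration, and the soft-argmax backtrack is left out. It only shows how replacing the hard minimum in a monotonic boundary-placement recursion with a smooth minimum yields a value that stays differentiable in the costs.

```python
import numpy as np

def softmin(x, gamma):
    """Smooth minimum: -gamma * log(sum(exp(-x / gamma))), computed stably."""
    x = np.asarray(x, dtype=float)
    m = x.min()
    return m - gamma * np.log(np.exp(-(x - m) / gamma).sum())

def soft_dp_value(cost, gamma=1.0):
    """Soft-minimum total cost of placing boundaries at strictly increasing frames.

    cost[k, t] is a hypothetical cost of putting boundary k at frame t.
    """
    n_bound, n_frames = cost.shape
    if n_bound > n_frames:
        raise ValueError("need at least one frame per boundary")
    D = np.full((n_bound, n_frames), np.inf)
    D[0] = cost[0]
    for k in range(1, n_bound):
        for t in range(k, n_frames):
            # Boundary k-1 may sit at any earlier frame s with k-1 <= s < t.
            D[k, t] = cost[k, t] + softmin(D[k - 1, k - 1:t], gamma)
    return softmin(D[-1, n_bound - 1:], gamma)

# Toy example: the best hard placement is frame 0 then frame 2 (0.1 + 0.2).
toy_cost = np.array([[0.1, 0.9, 0.8, 0.7],
                     [0.9, 0.8, 0.2, 0.6]])
print(soft_dp_value(toy_cost, gamma=0.01))  # approaches 0.3 as gamma shrinks
```

With a small `gamma` the result approaches the hard dynamic-programming minimum; larger values blend in contributions from competing placements, which is what lets gradients reach more than one path.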
FALCON aligns the example fasw0sa2.wav — a TIMIT sentence, "Don't ask me to carry an oily rag like that." — using the included English checkpoint. The same utterance is provided in every supported input format so you can try them in the web demo: .phn (phonemes), .wrd (words), and .txt (plain transcript).
Waveform with predicted phoneme boundaries:
Under the hood — every representation on one shared time axis. Waveform, log-mel spectrogram, the phoneme posteriors, the Soft-DP cost matrix with its backtracked alignment path, and the contrastive boundary score with its derivative. The same predicted boundaries (crimson dashed) and ground-truth boundaries (charcoal dotted) line up across all panels, with the input phonemes written under each axis, so every component's contribution is visible at a glance:
Output (first rows of the phones tier; the full alignment is written to a Praat .TextGrid):
| Phoneme | Start (s) | End (s) |
|---|---|---|
| h# | 0.000 | 0.121 |
| d | 0.121 | 0.141 |
| ow | 0.141 | 0.313 |
| n | 0.313 | 0.333 |
| q | 0.333 | 0.434 |
| ae | 0.434 | 0.575 |
| s | 0.575 | 0.645 |
| epi | 0.645 | 0.726 |
| m | 0.726 | 0.776 |
| iy | 0.776 | 0.897 |
| … |
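For readers unfamiliar with the Praat format, the sketch below renders a list of intervals as a long-format TextGrid string with a single interval tier. The function name `to_textgrid` is hypothetical and this is not the code used by generate_textgrids.py; the three intervals are simply the first rows of the table above.

```python
def to_textgrid(intervals, tier_name="phones"):
    """Render (label, start_s, end_s) intervals as a long-format Praat TextGrid."""
    xmin, xmax = intervals[0][1], intervals[-1][2]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (label, start, end) in enumerate(intervals, 1):
        escaped = label.replace('"', '""')  # Praat escapes quotes by doubling them
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {start}",
            f"            xmax = {end}",
            f'            text = "{escaped}"',
        ]
    return "\n".join(lines) + "\n"

rows = [("h#", 0.000, 0.121), ("d", 0.121, 0.141), ("ow", 0.141, 0.313)]
print(to_textgrid(rows))
```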
Reproduce:
# alignment + TextGrid
python generate_textgrids.py --wav assets/fasw0sa2.wav --mode phoneme --lang english --annotation phn
# the multi-panel "under the hood" figure
python falcon_viz.py

▶ Try it live, no install: huggingface.co/spaces/MLSpeech/FALCON (free CPU Space; the first alignment downloads a model).
FALCON includes an interactive Gradio web interface for zero-shot forced alignment. Upload audio of any sample rate along with a plain-text transcript (or standard annotation file), and instantly visualize predicted boundaries, the Soft-DP path, and download a TextGrid.
conda activate falcon
python app.py

This launches a local server (default http://localhost:7860). Inputs (audio, transcript, options) are on the left; the Alignment Data and Visualizations tabs are on the right.
How to use it — upload audio (any sample rate; resampled to 16 kHz internally) and a transcript, set the options, and click Run Alignment:
| Option | What it does |
|---|---|
| Mode | phoneme — align the phonemes; word — align words (boundaries derived from the predicted phonemes). |
| Language | english — transcript is already TIMIT-39 phonemes (no G2P); multilingual — any other language (runs the G2P path). Auto-selects the matching checkpoint. |
| Pretrained checkpoint | Read English (TIMIT), Spontaneous English (Buckeye), or Multilingual (joint). Follows Language by default; override it, or upload your own .pt. |
| Word G2P (word mode) | Auto (recommended — best backend for the language), espeak, MFA-like, or none (romanization — only for languages with no G2P model). If the chosen backend lacks the language it falls back automatically. |
| Input language (word mode) | Optional but recommended — lets Auto pick the right G2P and set the voice/dictionary behind the scenes. |
| Annotation | .phn (phoneme timestamps), .wrd (word timestamps), or .txt (plain token sequence, no timestamps). |
Outputs: the Alignment Data tab gives the boundary tables and a downloadable Praat .TextGrid; the Visualizations tab shows the time-aligned panels (waveform · spectrogram · phoneme posteriors · Soft-DP path · contrastive score).
Note: the official public release will be hosted at github.com/MLSpeech/FALCON (as referenced in the paper). This repository (github.com/RotemRousso/FDNFA) is the active development version.
git clone https://github.com/RotemRousso/FDNFA.git
cd FDNFA

- Python 3.8+
- CUDA-capable GPU (recommended)
conda create -n falcon python=3.8
conda activate falcon
pip install -r requirements.txt

Note: torch and torchaudio are pinned to 2.4.1. For a different CUDA version, install them separately first, following PyTorch's official instructions, before running pip install -r requirements.txt.
FALCON expects data in TIMIT-style format:
<dataset_root>/
├── train/
│ ├── speaker1_utt1.wav
│ ├── speaker1_utt1.phn ← required
│ └── ...
├── val/ ← only for Buckeye-style splits
└── test/
├── speaker2_utt1.wav
├── speaker2_utt1.phn
└── ...
Each .phn file contains one phoneme segment per line:
<start_sample> <end_sample> <phoneme_label>
0 3050 h#
3050 4559 sh
4559 5723 ix
...
Audio must be 16 kHz mono WAV. Phoneme labels should use the TIMIT 61-phoneme set (automatically mapped to Lee-Hon 39 internally).
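As a quick way to inspect such files, the following sketch reads a .phn file and converts sample indices to seconds at the 16 kHz rate stated above. The helper name `read_phn` is hypothetical and is not part of the repository; it skips blank or malformed lines rather than validating them.

```python
from pathlib import Path

def read_phn(path, sample_rate=16000):
    """Parse '<start_sample> <end_sample> <label>' lines into (label, start_s, end_s)."""
    segments = []
    for line in Path(path).read_text().splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, label = parts
        segments.append((label, int(start) / sample_rate, int(end) / sample_rate))
    return segments
```

For the three example lines above, the first segment would span samples 0 to 3050, that is 0 s to 3050 / 16000 s.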
Supported datasets:
- TIMIT: set data: timit in the config (standard train/test split, 10% of train used for validation)
- Buckeye: set data: buckeye in the config (explicit train/, val/, test/ dirs)
Edit conf/config.yaml to set your dataset paths and output directory:
# Data
data: timit # or 'buckeye'
timit_path: /path/to/your/timit/
buckeye_path: /path/to/your/buckeye/
# Output run directory (Hydra timestamped)
hydra:
  run:
    dir: /path/to/your/runs/my_run/${now:%Y-%m-%d_%H-%M-%S}-${exp_name}

Then run:
conda activate falcon
python main.py

Checkpoints are saved every epoch as {epoch}_best_model.pt in the Hydra run directory.
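When inspecting a run directory by hand, plain alphabetical sorting puts 10_best_model.pt before 2_best_model.pt. The sketch below lists checkpoints in numeric epoch order. It assumes the `{epoch}` part of the file name is a plain integer, and `list_checkpoints` is a hypothetical helper, not a function from the repository.

```python
import re
from pathlib import Path

def list_checkpoints(run_dir):
    """Return (epoch, path) pairs for '<epoch>_best_model.pt' files, sorted by epoch."""
    pattern = re.compile(r"^(\d+)_best_model\.pt$")
    found = []
    for path in Path(run_dir).iterdir():
        match = pattern.match(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found, key=lambda pair: pair[0])
```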
| Parameter | Default | Description |
|---|---|---|
| epochs | 200 | Total training epochs |
| batch_size | 8 | Per-GPU batch size |
| lr | 3e-4 | Learning rate (AdamW) |
| devices | [7] | GPU index(es) to train on |
| num_classes | 39 | Phoneme set size (Lee-Hon 39) |
| z_dim | 256 | CNN encoder output dimension |
| z_proj | 64 | Projection head output dimension |
Three checkpoints are used by FALCON (under pretrained_models/):
| File | Trained on | Best for |
|---|---|---|
| falcon_timit_english.pt | TIMIT (read English) | English phoneme alignment |
| falcon_buckeye_english.pt | Buckeye (spontaneous English) | Spontaneous / conversational English |
| falcon_joint_multilingual.pt | Joint TIMIT+Buckeye | Cross-lingual / multilingual zero-shot alignment (Dutch, German, Hebrew, ...) at both phoneme and word level; the joint model generalizes better to unseen languages than either single-corpus model. |
The CLI tools (predict.py, generate_textgrids.py, test_results.py) auto-pick a
checkpoint from the --lang flag — --lang english → TIMIT, --lang multilingual
→ joint — or pass --ckpt /path/to/your.pt to use any other (e.g. the Buckeye
model). The web demo (app.py) exposes all three as selectable options.
The .pt files are not committed to git; place them under pretrained_models/, or
set HF_MODEL_REPO (and HF_TOKEN if private) so app.py / HuggingFace Spaces
fetch them from that model repo on first use.
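The selection rule described above can be summarized in a few lines. This is only a sketch of the documented behavior, not the code in the CLI tools: `resolve_checkpoint` and `DEFAULT_CHECKPOINTS` are hypothetical names, and the Hugging Face download path is not modeled.

```python
from pathlib import Path

# Defaults as described in this README: english -> TIMIT, multilingual -> joint.
DEFAULT_CHECKPOINTS = {
    "english": "falcon_timit_english.pt",
    "multilingual": "falcon_joint_multilingual.pt",
}

def resolve_checkpoint(lang="english", ckpt=None, root="pretrained_models"):
    """Return the explicit checkpoint if given, else the default for the language."""
    if ckpt is not None:
        return Path(ckpt)  # an explicit --ckpt always wins
    if lang not in DEFAULT_CHECKPOINTS:
        raise ValueError(f"unknown language setting: {lang!r}")
    return Path(root) / DEFAULT_CHECKPOINTS[lang]

print(resolve_checkpoint("multilingual"))
print(resolve_checkpoint("english", ckpt="pretrained_models/falcon_buckeye_english.pt"))
```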
The main entry point for generating alignments is generate_textgrids.py. This script processes your audio and annotations, and outputs standard Praat .TextGrid files ready for linguistic analysis (similar to tools like Montreal Forced Aligner).
--ckpt is optional — if omitted, the bundled TIMIT checkpoint is used for
--lang english and the joint TIMIT+Buckeye checkpoint for --lang multilingual.
conda activate falcon
python generate_textgrids.py \
--wav /path/to/audio.wav \
--mode phoneme \
--lang english \
--annotation phn

conda activate falcon
python generate_textgrids.py \
--wav_dir /path/to/dataset/test/ \
--mode word \
--lang english \
--annotation wrd

Flags:
| Flag | Choices | Default | Description |
|---|---|---|---|
| --wav | file path | - | Path to a single input .wav file (16 kHz mono) |
| --wav_dir | directory path | - | Path to a directory containing .wav files to process |
| --ckpt | file path | bundled | Path to a trained checkpoint (.pt). Defaults to pretrained_models/falcon_timit_english.pt (or falcon_joint_multilingual.pt when --lang multilingual). |
| --mode | phoneme, word | phoneme | Alignment granularity. phoneme = phoneme-level (default). word = word-level alignment (zero-shot). |
| --lang | english, multilingual | english | Language setting. english = trained English phoneme alignment. multilingual = any non-English language (zero-shot cross-lingual). |
| --annotation | any string | phn | Annotation file extension to look for. Use txt for plain text transcripts, or standard extensions (e.g. phn, wrd). |
The .wav file must have a paired annotation file in the same directory, providing the phoneme or word sequence to align to.
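Before running a large batch, it can help to check which files actually have a paired annotation. The sketch below does that with pathlib only; `find_pairs` is a hypothetical helper, and matching on the same file stem with a different extension is an assumption based on the dataset layout shown earlier.

```python
from pathlib import Path

def find_pairs(wav_dir, annotation="phn"):
    """Split the .wav files in wav_dir into those with and without a paired annotation."""
    paired, missing = [], []
    for wav in sorted(Path(wav_dir).glob("*.wav")):
        ann = wav.with_suffix("." + annotation)  # same stem, annotation extension
        (paired if ann.exists() else missing).append(wav)
    return paired, missing
```

Calling `find_pairs("my_dataset/", "wrd")` would return two lists of paths, the second holding any .wav files that have no matching .wrd file.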
Examples:
# English phoneme-level (default — uses TIMIT checkpoint)
python generate_textgrids.py --wav_dir my_dataset/
# English word-level (zero-shot)
python generate_textgrids.py --wav_dir my_dataset/ --mode word --annotation wrd
# Multilingual phoneme-level (zero-shot — uses joint TIMIT+Buckeye checkpoint)
python generate_textgrids.py --wav_dir my_dataset/ --lang multilingual --annotation phn
# Plain-text transcript (no timestamps needed)
python generate_textgrids.py --wav_dir my_dataset/ --mode word --annotation txt

Output:
- A standard .TextGrid file saved next to each input .wav file.
- If --mode word is used, the .TextGrid will contain two tiers: words and phones.
- Visualizations saved next to the input WAV: <basename>_probs.png, <basename>_logits.png, <basename>_boundaries.png
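Since word boundaries are described in this README as derived from the predicted phonemes, the relationship between the two tiers can be sketched as a simple grouping of consecutive phone intervals. The function `words_from_phones` and its phones-per-word input are hypothetical, and treating "don't" as the three phones d, ow, n is an assumption made purely for illustration; how FALCON assigns phones to words is not shown here.

```python
from typing import List, Tuple

Interval = Tuple[str, float, float]  # (label, start_s, end_s)

def words_from_phones(phones: List[Interval],
                      word_sizes: List[Tuple[str, int]]) -> List[Interval]:
    """Group consecutive phone intervals into word intervals."""
    words, i = [], 0
    for word, n in word_sizes:
        group = phones[i:i + n]
        if n <= 0 or len(group) != n:
            raise ValueError(f"not enough phones for word {word!r}")
        # A word starts where its first phone starts and ends where its last one ends.
        words.append((word, group[0][1], group[-1][2]))
        i += n
    return words

phones = [("d", 0.121, 0.141), ("ow", 0.141, 0.313), ("n", 0.313, 0.333)]
print(words_from_phones(phones, [("don't", 3)]))
```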
Evaluate a checkpoint over a directory of .wav files and report precision at multiple time thresholds:
conda activate falcon
python test_results.py \
--wav /path/to/test/wavs/ \
--mode phoneme \
--lang english \
--annotation phn

Flags:
| Flag | Choices | Default | Description |
|---|---|---|---|
| --wav | directory path | - | Directory containing .wav files to evaluate |
| --ckpt | path | bundled | A single .pt file or a directory (sweeps {idx}_best_model.pt). Defaults to the bundled checkpoint matching --lang (TIMIT for english, joint TIMIT+Buckeye for multilingual). |
| --mode | phoneme, word | phoneme | Alignment granularity; same meaning as in predict.py |
| --lang | english, multilingual | english | Language setting; same meaning as in predict.py |
| --annotation | any string | phn | Annotation file extension; same meaning as in predict.py |
| --w-phi | float | 0.5 | Acoustic ↔ linguistic feature weight |
| --out-plot | file path | - | If set (and sweeping a directory), save a precision-vs-checkpoint plot |
| --no-plots | flag | off | Skip per-file diagnostic plots; much faster on large datasets |
Reports precision at 10, 15, 20, 25, 50, and 100 ms tolerances.
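How test_results.py computes these numbers is not shown here. One common definition for forced alignment, where predicted and reference boundaries correspond one to one, is the fraction of predicted boundaries that fall within the tolerance of their reference. The sketch below implements that definition; `precision_at` is a hypothetical name and the metric may differ in detail from the one reported by the script.

```python
def precision_at(pred, ref, tolerances_ms=(10, 15, 20, 25, 50, 100)):
    """Fraction of boundaries (in seconds) within each tolerance (in ms) of the reference."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("pred and ref must be non-empty and of equal length")
    errors_ms = [abs(p - r) * 1000.0 for p, r in zip(pred, ref)]
    return {tol: sum(e <= tol for e in errors_ms) / len(errors_ms)
            for tol in tolerances_ms}

# Made-up boundaries with errors of roughly 5 ms, 30 ms, and 0 ms.
scores = precision_at([0.100, 0.250, 0.400], [0.105, 0.280, 0.400])
print(scores)
```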
FALCON can align speech in unseen languages (Dutch, German, Hebrew, and others) without any additional training. Phonemes from the target language are automatically mapped to the Lee-Hon 39 set used during English training via articulatory feature distance (PanPhon).
Just pass --lang multilingual — this both selects the joint TIMIT+Buckeye
checkpoint (which generalizes better cross-lingually) and routes the input
through the G2P/articulatory-mapping pipeline:
python generate_textgrids.py \
--wav_dir /path/to/dutch_audio_dir/ \
--lang multilingual \
--annotation phn

.phn / .wrd files can use any standard phoneme/word notation; dutch_preprocess.aligner_pipeline() handles IFA → IPA → LH39 conversion.
Visualize how the CNN encoder learns phoneme-boundary-sensitive representations:
conda activate falcon
python visualize_latent_representation.py \
--wav /path/to/audio.wav \
--run-dir /path/to/training/run/directory/

Saves four heatmap plots to <run_dir>/latent_representations/:
- latent_epoch0_untrained.png: CNN features before training
- latent_best_trained.png: CNN features at the best epoch
- latent_comparison_epoch0_vs_best.png: side-by-side comparison
- latent_difference_best_minus_epoch0.png: difference map
FALCON/
├── app.py # Gradio web demo
├── falcon_viz.py # Time-aligned alignment visualizations (web demo + README)
├── generate_textgrids.py # MFA-style CLI for TextGrid output
├── main.py # Training entry point
├── predict.py # Low-level single-file inference
├── test_results.py # Batch evaluation
├── solver.py # Train/val/test loop
├── next_frame_classifier.py # Model + loss functions
├── dataloader.py # Dataset classes
├── utils.py # Soft-DP, metrics, phoneme maps
├── dutch_preprocess.py # Cross-lingual phoneme mapping (IPA → LH39)
├── word_g2p.py # Word-level G2P front-end (espeak / char)
├── mfa_g2p.py # Word-level G2P front-end (MFA-style)
├── visualize_latent_representation.py
├── pretrained_models/ # Checkpoints (gitignored; via HF Hub at runtime)
│ ├── falcon_timit_english.pt
│ ├── falcon_buckeye_english.pt
│ └── falcon_joint_multilingual.pt
├── conf/
│ └── config.yaml # All hyperparameters (Hydra)
├── scripts/ # Data preparation & utilities
├── assets/ # README screenshots & example figures
├── requirements.txt
Each panel below is the demo's own output — waveform · log-mel spectrogram ·
phoneme posteriors · Soft-DP cost matrix with the optimal path · contrastive
boundary score — on a real test utterance. The bundled inputs live in
assets/examples/ (English is assets/fasw0sa2.*), so you
can drop them straight into the web demo or the CLI.
English — TIMIT (read speech, phoneme-level)
"Don't ask me to carry an oily rag like that." — 91% of boundaries within 25 ms on this utterance.
Dutch — IFA corpus (zero-shot cross-lingual, phoneme-level)
The English-trained model aligns Dutch phonemes zero-shot via articulatory-feature mapping — 60% @25 ms, 93% @100 ms on this utterance.
German — PHONDAT (zero-shot cross-lingual, phoneme-level)
Zero-shot German alignment. Figure only — PHONDAT is licensed by BAS and is not redistributed here; bring your own copy to reproduce.
Hebrew (zero-shot, word-level, no G2P model)
A romanized Hebrew transcript mapped directly to phonemes by articulatory distance — no grapheme-to-phoneme model required.
Example audio is bundled for demonstration only and remains subject to each corpus's original license — see
assets/examples/NOTICE. Full-test-set benchmark numbers are in the paper.
This project is released under the MIT License.







