BSTabDiff is a block-subunit generative framework for High-Dimensional Low-Sample-Size (HDLSS) tabular data synthesis. Rather than learning dependence directly in the original m-dimensional feature space, it partitions the features into M latent blocks, where M ≪ m, models global structure through a compact diffusion/flow prior over block latents, and decodes back to the full table using copula-based dependence, flexible feature-wise marginals, and explicit missingness modeling. This design makes BSTabDiff especially well suited for omics-style and other HDLSS settings, where direct high-dimensional density learning is often unstable. Across multiple HDLSS benchmarks, BSTabDiff generates more realistic and stable synthetic data than several widely used tabular generators, while often approaching downstream performance obtained from real data.
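To make the block idea concrete, the sketch below splits m feature indices into M near-equal contiguous blocks. It is only an illustration: `make_contiguous_blocks` is a hypothetical helper that is not part of the bstabdiff API, and the contiguous, equal-size layout is an assumption, since this section does not describe how the package actually forms its blocks.

```python
def make_contiguous_blocks(m: int, M: int) -> list[list[int]]:
    """Split feature indices 0..m-1 into M contiguous, near-equal blocks.

    Hypothetical helper for illustration only; not part of the bstabdiff API.
    """
    if not 1 <= M <= m:
        raise ValueError("need 1 <= M <= m")
    base, extra = divmod(m, M)
    blocks, start = [], 0
    for b in range(M):
        # The first `extra` blocks take one additional feature each.
        size = base + (1 if b < extra else 0)
        blocks.append(list(range(start, start + size)))
        start += size
    return blocks


blocks = make_contiguous_blocks(m=2000, M=20)
print(len(blocks), len(blocks[0]))
```

With m = 2000 and M = 20, each block holds 100 features, so a prior defined over blocks works with 20 units rather than 2000 individual features.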
Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Kumar Gyawali, Gianfranco Doretto, and Donald A. Adjeroh.
“BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation.”
In ICLR 2026 2nd Workshop on Deep Generative Models in Machine Learning: Theory, Principle and Efficacy (DeLTa), 2026.
BibTeX:
@inproceedings{habib2026bstabdiff,
title = {BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation},
author = {Habib, Al Zadid Sultan Bin and Ahamed, Md Younus and Gyawali, Prashnna Kumar and Doretto, Gianfranco and Adjeroh, Donald A.},
booktitle = {ICLR 2026 2nd Workshop on Deep Generative Models in Machine Learning: Theory, Principle and Efficacy (DeLTa)},
year = {2026}
}
- ICLR Page: https://iclr.cc/virtual/2026/10017199
- OpenReview: https://openreview.net/forum?id=RKNDy0KhGT
This folder contains the core BSTabDiff implementation:
- __init__.py: Package initializer and high-level API exports.
- block_subunit_gen.py: Main BSTabDiff implementation, including feature schema, empirical marginals, block-subunit emissions, diffusion/flow priors, training, and synthetic sampling utilities.
Since May 30, 2026, all Jupyter notebook previews on GitHub have been failing with an "An error occurred" message. This affects both my own notebooks and other repositories. I am using nbformat v5.10.4 and nbconvert v7.17.1; the notebooks are valid and work locally, so this appears to be a GitHub-side rendering issue.
- Dummy Example Usage.ipynb: Contains simple toy examples showing how to install/import the bstabdiff package, fit BSTabDiff on a dummy HDLSS dataset, and sample synthetic data.
- BSTabDiff_Colon.ipynb: Contains the Colon dataset experiments from the paper. The downstream classifiers include Logistic Regression, TabPFN-2.5 (currently applicable only when the number of features is within its supported range, so Colon is eligible), TANDEM (NeurIPS 2025), and CatBoost. This notebook also includes the paper’s ablation studies and related fidelity analysis.
- BSTabDiff_GLI.ipynb: Contains the GLI-85 experiments using Logistic Regression as the downstream classifier, along with selected fidelity analysis.
- BSTabDiff_Lung.ipynb: Contains the Lung dataset experiments using Logistic Regression as the downstream classifier, along with selected fidelity analysis.
- BSTabDiff_PIP_Install_Check.ipynb: Demonstration of BSTabDiff in a Google Colab notebook using pip installation with some toy examples.
- requirements.txt: Python dependencies required to run the BSTabDiff package and notebooks.
- BSTabDiffArchi.png: High-level architecture diagram of the BSTabDiff framework.
- LICENSE: MIT license for this repository.
- README.md: Project overview, installation, usage instructions, and citation information.
- .gitignore: Standard Git ignore rules for Python and Jupyter projects.
- pyproject.toml: Build system and packaging metadata for installation.
- setup.cfg: Package configuration and installation metadata.
- Python 3.10.13
- torch 2.9.1+cu128
- numpy 2.2.6
- pandas 2.3.3
- scikit-learn 1.7.2
- catboost 1.2.8
- tabpfn 6.3.1
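As a quick way to compare a local environment against the versions listed above, the following sketch reads installed package versions with the standard-library `importlib.metadata`. The helper name `installed_versions` is hypothetical and is not provided by this repository.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# Distribution names taken from the list above.
PACKAGES = ["torch", "numpy", "pandas", "scikit-learn", "catboost", "tabpfn"]


def installed_versions(names: list[str]) -> dict[str, Optional[str]]:
    """Map each package name to its installed version, or None if it is missing."""
    found: dict[str, Optional[str]] = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name}: {ver or 'not installed'}")
```

Differences from the listed versions do not necessarily mean the package will fail; the list simply records one tested configuration.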
You can install BSTabDiff in several ways depending on your workflow.
git clone https://github.com/zadid6pretam/BSTabDiff.git
cd BSTabDiff
pip install -r requirements.txt
pip install -e .

pip install "git+https://github.com/zadid6pretam/BSTabDiff.git"

python -m venv bstabdiff-env
source bstabdiff-env/bin/activate  # On Windows: bstabdiff-env\Scripts\activate
git clone https://github.com/zadid6pretam/BSTabDiff.git
cd BSTabDiff
pip install -r requirements.txt
pip install -e .

git clone https://github.com/zadid6pretam/BSTabDiff.git
cd BSTabDiff
pip install -r requirements.txt
pip install .

pip install bstabdiff

Below is a minimal example showing how to fit BSTabDiff on a dummy HDLSS dataset and generate synthetic samples.
import numpy as np
from bstabdiff import FeatureSpec, fit_block_subunit_generator
# Dummy HDLSS data
np.random.seed(42)
n, m = 80, 2000
X = np.random.randn(n, m).astype(np.float32)
y = np.random.randint(0, 2, size=n)
X[np.random.rand(n, m) < 0.1] = np.nan
# Feature schema
feature_specs = [FeatureSpec(name=f"f{j}", kind="continuous") for j in range(m)]
# Fit BSTabDiff
gen, train_info = fit_block_subunit_generator(
X=X,
feature_specs=feature_specs,
y=y,
M=20,
blocks=None,
permute_features=False,
prior_type="diffusion",
device="cpu",
seed=42,
prior_epochs=300,
prior_batch=64,
prior_lr=1e-3,
verbose_every=100,
save_dir=None,
save_name="bstabdiff_demo",
save_best=True,
use_ema=True,
ema_decay=0.999,
return_train_info=True,
)
# Sample synthetic data
X_syn, R_syn, y_syn = gen.sample(n=50)
print("X_syn shape:", X_syn.shape)
print("R_syn shape:", R_syn.shape)
print("y_syn shape:", y_syn.shape if y_syn is not None else None)
print("Best training info:", train_info)

For fuller experiments, ablations, and fidelity studies, see:
- Dummy Example Usage.ipynb
- BSTabDiff_Colon.ipynb
- BSTabDiff_GLI.ipynb
- BSTabDiff_Lung.ipynb
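Before moving to the notebooks, a rough sanity check on sampled data can be useful. The sketch below compares per-feature missing rates, means, and standard deviations between a real and a synthetic table. It is not the fidelity analysis used in the paper: `column_summary_gap` is a hypothetical helper, and it assumes that missing entries are encoded as NaN in both arrays (if synthetic missingness is carried only by `R_syn`, it would need to be applied to `X_syn` first).

```python
import numpy as np


def column_summary_gap(X_real: np.ndarray, X_syn: np.ndarray) -> dict[str, float]:
    """Mean absolute per-feature gap in missing rate, mean, and std (NaNs ignored).

    Hypothetical helper; columns that are entirely NaN would yield NaN gaps.
    """
    if X_real.shape[1] != X_syn.shape[1]:
        raise ValueError("real and synthetic tables must have the same number of features")
    miss_gap = np.abs(np.isnan(X_real).mean(axis=0) - np.isnan(X_syn).mean(axis=0))
    mean_gap = np.abs(np.nanmean(X_real, axis=0) - np.nanmean(X_syn, axis=0))
    std_gap = np.abs(np.nanstd(X_real, axis=0) - np.nanstd(X_syn, axis=0))
    return {
        "missing_rate_gap": float(miss_gap.mean()),
        "mean_gap": float(mean_gap.mean()),
        "std_gap": float(std_gap.mean()),
    }


# Stand-in arrays so the snippet runs on its own; in practice, pass the real X
# and the X_syn returned by gen.sample(...).
rng = np.random.default_rng(0)
X_real = rng.standard_normal((80, 200))
X_real[rng.random((80, 200)) < 0.1] = np.nan
X_syn = rng.standard_normal((50, 200))
X_syn[rng.random((50, 200)) < 0.1] = np.nan
print(column_summary_gap(X_real, X_syn))
```

Small gaps only indicate that marginal summaries agree; they say nothing about cross-feature dependence or downstream utility, which the notebooks examine separately.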
BSTabDiff is part of our broader line of work on tabular deep learning and high-dimensional tabular modeling.
Our recent ICML 2026 regular main-conference paper on feature ordering and compression for tabular foundation models on high-dimensional low-sample-size tabular data:
- GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data
- GitHub: https://github.com/zadid6pretam/GOTabPFN
- ICML Page: https://icml.cc/virtual/2026/poster/62523
@inproceedings{habib2026gotabpfn,
title = {GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data},
author = {Habib, Al Zadid Sultan Bin and Ahamed, Md Younus and Gyawali, Prashnna Kumar and Doretto, Gianfranco and Adjeroh, Donald A.},
booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
year = {2026}
}
Our generative modeling framework for high-dimensional low-sample-size tabular data:
- BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation
- GitHub: https://github.com/zadid6pretam/BSTabDiff
- ICLR Page: https://iclr.cc/virtual/2026/10017199
@inproceedings{habib2026bstabdiff,
title = {BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation},
author = {Habib, Al Zadid Sultan Bin and Ahamed, Md Younus and Gyawali, Prashnna Kumar and Doretto, Gianfranco and Adjeroh, Donald A.},
booktitle = {ICLR 2026 2nd Workshop on Deep Generative Models in Machine Learning: Theory, Principle and Efficacy (DeLTa)},
year = {2026}
}
- If you are interested in high-dimensional tabular synthesis, block-subunit generation, and diffusion/flow priors for HDLSS tabular data, please also refer to the BSTabDiff repository and paper.
Our structured feature sequencing framework for multimodal learning with image and tabular data. This work is part of my PhD research on feature sequencing or ordering for multimodal image-tabular representation learning.
- iStructTab: Structured Feature Sequencing for Multimodal Learning of Image and Tabular Data
- GitHub: https://github.com/zadid6pretam/iStructTab
@inproceedings{habib2026istructtab,
title = {iStructTab: Structured Feature Sequencing for Multimodal Learning of Image and Tabular Data},
author = {Habib, Al Zadid Sultan Bin and Ahamed, Md Younus and Gyawali, Prashnna and Doretto, Gianfranco and Adjeroh, Donald A.},
booktitle = {Proceedings of the 28th International Conference on Pattern Recognition},
year = {2026},
address = {Lyon, France}
}
- If you are interested in structured feature sequencing, multimodal fusion of image and tabular data (the integration problem), and feature order-aware tabular representation learning, please also refer to the iStructTab repository and paper.
Our more recent work on learned feature ordering for high-dimensional tabular data:
- DynaTab: Dynamic Feature Ordering as Neural Rewiring for High-Dimensional Tabular Data
- GitHub: https://github.com/zadid6pretam/DynaTab
@InProceedings{dynatab,
title = {{DynaTab: Dynamic Feature Ordering as Neural Rewiring for High-Dimensional Tabular Data}},
author = {Habib, Al Zadid Sultan Bin and Doretto, Gianfranco and Adjeroh, Donald A.},
booktitle = {{Proceedings of the First Workshop on NeuroAI Multimodal Intelligence @ AAAI 2026}},
pages = {27--57},
year = {2026},
volume = {308},
series = {{Proceedings of Machine Learning Research}},
publisher = {PMLR},
url = {https://proceedings.mlr.press/v308/habib26a.html}
}
- If you are interested in learned feature ordering, neural rewiring for high-dimensional tabular data, and sequential backbone design for HDLSS settings, please also refer to the DynaTab repository and paper.
- Paper Link: https://proceedings.mlr.press/v308/habib26a.html
- arXiv: https://arxiv.org/abs/2605.03430
Our earlier work on sequential modeling for tabular data:
- TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering
- GitHub: https://github.com/zadid6pretam/TabSeq
- Springer ICPR 2024 proceedings: https://link.springer.com/chapter/10.1007/978-3-031-78128-5_27
- arXiv: https://arxiv.org/abs/2410.13203
@inproceedings{habib2024tabseq,
title={TabSeq: A Framework for Deep Learning on Tabular Data via Sequential Ordering},
author={Habib, Al Zadid Sultan Bin and Wang, Kesheng and Hartley, Mary-Anne and Doretto, Gianfranco and Adjeroh, Donald A.},
booktitle={International Conference on Pattern Recognition},
pages={418--434},
year={2024},
organization={Springer}
}
- If you are interested in sequential ordering for tabular data, deep sequential backbones, and early feature-ordering-based tabular modeling, please also refer to the TabSeq repository and paper.
This repository corresponds to our separate collaborative work on tabular remote sensing and environmental data:
- ZAYAN: Disentangled Contrastive Transformer for Tabular Remote Sensing Data
- GitHub: https://github.com/zadid6pretam/ZAYAN
- arXiv: https://arxiv.org/abs/2604.27606
@inproceedings{habib2026zayan,
title = {ZAYAN: Disentangled Contrastive Transformer for Tabular Remote Sensing Data},
author = {Habib, Al Zadid Sultan Bin and Tasnim, Tanpia and Islam, Md. Ekramul and Tabasum, Muntasir},
booktitle = {Proceedings of the 28th International Conference on Pattern Recognition},
year = {2026},
address = {Lyon, France}
}
- ZAYAN focuses on feature-level contrastive learning and Transformer-based classification for tabular remote sensing and environmental datasets.
- Unlike my PhD dissertation projects on high-dimensional tabular learning and HDLSS modeling, ZAYAN was developed as a separate collaboration.
For any questions, issues, or suggestions related to this repository, please feel free to contact us or open an issue on GitHub.
