A modular, production-ready template for building Hugging Face Transformers-compatible pre-trained models.
Tip: This is currently an early version, intended for internal use by the FlowVotrex team.
├── notebooks/
│   └── demo.ipynb            # Interactive demonstration notebook
├── scripts/
│   └── pre-training.sh       # Shell script for launching pre-training jobs
├── src/
│   ├── yourmodel/            # Main project package
│   │   ├── __init__.py       # Package initialization and versioning
│   │   ├── config.py         # Model and training configuration classes
│   │   ├── dataset.py        # Custom dataset implementations
│   │   ├── model.py          # Core model architecture
│   │   ├── output.py         # Typed model output classes
│   │   ├── pipeline.py       # High-level inference pipelines
│   │   ├── trainer.py        # Training logic and trainer classes
│   │   └── utils.py          # Reusable utility functions
│   └── main.py               # Entry point for training and fine-tuning
├── .gitignore
├── LICENSE
└── README.md

`__init__.py`
- Defines the package version (`__version__ = "0.0.1"`)
- Exposes the public API through the `__all__` list
- Imports key classes (config, model, pipeline, trainer) for easy access

`config.py`
- First step in implementing a Hugging Face model
- Contains two primary configuration classes:
  - `YourModelConfig`: Inherits from `PretrainedConfig`, defines model architecture parameters
  - `YourModelTrainingConfig`: Defines training-specific hyperparameters
- Fully compatible with the Hugging Face Transformers configuration system
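
The two configuration classes described above could look roughly like the following. This is a minimal sketch, not the template's actual code: the field names (`hidden_size`, `num_hidden_layers`, `learning_rate`, and so on) and their defaults are assumptions chosen for illustration, and the training config is written as a plain dataclass.

```python
from dataclasses import dataclass

from transformers import PretrainedConfig


class YourModelConfig(PretrainedConfig):
    """Architecture parameters. All field names here are illustrative assumptions."""

    # Identifier stored in the serialized config.
    model_type = "yourmodel"

    def __init__(
        self,
        hidden_size=256,
        num_hidden_layers=4,
        num_attention_heads=4,
        dropout=0.1,
        **kwargs,
    ):
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.dropout = dropout
        # Remaining keyword arguments are handled by the base class.
        super().__init__(**kwargs)


@dataclass
class YourModelTrainingConfig:
    """Training hyperparameters (hypothetical fields and defaults)."""

    learning_rate: float = 3e-4
    batch_size: int = 32
    num_epochs: int = 10
    weight_decay: float = 0.01


# Round-trip through a plain dict, as happens when a config is saved and reloaded.
config = YourModelConfig(hidden_size=128)
restored = YourModelConfig.from_dict(config.to_dict())
training_config = YourModelTrainingConfig(batch_size=64)
```

Keeping architecture and training settings in separate classes means the architecture config can be saved alongside the weights without carrying run-specific hyperparameters.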

`dataset.py`
- Implements `Dataset` and `IterableDataset` classes for training, fine-tuning, and evaluation
- Handles data parsing, tokenization, batching, and augmentation
- Follows Hugging Face Datasets API standards
- Compatible with the Hugging Face Trainer API and the standard PyTorch DataLoader
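
As a rough illustration of the two dataset styles, the sketch below defines one map-style and one iterable-style dataset on top of PyTorch and feeds both to a `DataLoader`. The class names, the `inputs`/`labels` keys, and the in-memory toy data are assumptions; tokenization and augmentation are left out to keep the example short.

```python
import torch
from torch.utils.data import DataLoader, Dataset, IterableDataset


def to_example(features, label):
    """Convert one raw (features, label) pair into a dict of tensors."""
    return {
        "inputs": torch.tensor(features, dtype=torch.float32),
        "labels": torch.tensor(label, dtype=torch.long),
    }


class YourModelDataset(Dataset):
    """Map-style dataset: random access over examples held in memory."""

    def __init__(self, pairs):
        self.pairs = list(pairs)

    def __len__(self):
        return len(self.pairs)

    def __getitem__(self, idx):
        features, label = self.pairs[idx]
        return to_example(features, label)


class YourModelIterableDataset(IterableDataset):
    """Iterable-style dataset: streams examples without needing a length."""

    def __init__(self, source):
        # `source` is any re-iterable of (features, label) pairs.
        self.source = source

    def __iter__(self):
        for features, label in self.source:
            yield to_example(features, label)


pairs = [([0.0, 1.0], 0), ([1.0, 0.0], 1), ([0.5, 0.5], 1)]

# Both styles plug into the standard DataLoader, which stacks the dicts into batches.
map_loader = DataLoader(YourModelDataset(pairs), batch_size=2)
stream_loader = DataLoader(YourModelIterableDataset(pairs), batch_size=2)

first_batch = next(iter(map_loader))
stream_batches = list(stream_loader)
```

A map-style dataset suits data that fits in memory or supports indexing; the iterable style is the usual choice for streamed or very large corpora.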

`model.py`
- Implements the core model architecture
- Inherits from `PreTrainedModel` for full Hugging Face compatibility
- Includes weight initialization (`_init_weights`)
- Provides input validation (`_validate_input`)
- Computes loss internally during the forward pass (`_compute_loss`)
- Returns typed outputs defined in `output.py`
- For complex models, layer implementations should be moved to a separate `layers.py` file
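
The pieces listed above fit together roughly as follows. This is a sketch under stated assumptions: the tiny two-layer classifier, the config fields (`input_size`, `hidden_size`, `num_classes`, `initializer_range`), and the cross-entropy loss are placeholders for the real architecture, and the config and output classes are inlined here only so the example is self-contained.

```python
from dataclasses import dataclass
from typing import Optional

import torch
from torch import nn
from transformers import PretrainedConfig, PreTrainedModel
from transformers.utils import ModelOutput


class YourModelConfig(PretrainedConfig):
    model_type = "yourmodel"

    def __init__(self, input_size=8, hidden_size=16, num_classes=2,
                 initializer_range=0.02, **kwargs):
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.num_classes = num_classes
        self.initializer_range = initializer_range
        super().__init__(**kwargs)


@dataclass
class YourModelOutput(ModelOutput):
    loss: Optional[torch.FloatTensor] = None
    logits: Optional[torch.FloatTensor] = None


class YourModel(PreTrainedModel):
    config_class = YourModelConfig

    def __init__(self, config):
        super().__init__(config)
        self.encoder = nn.Linear(config.input_size, config.hidden_size)
        self.head = nn.Linear(config.hidden_size, config.num_classes)
        # Triggers weight initialization through _init_weights.
        self.post_init()

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                nn.init.zeros_(module.bias)

    def _validate_input(self, inputs):
        if inputs.dim() != 2 or inputs.size(-1) != self.config.input_size:
            raise ValueError(
                f"expected shape (batch, {self.config.input_size}), got {tuple(inputs.shape)}"
            )

    def _compute_loss(self, logits, labels):
        return nn.functional.cross_entropy(logits, labels)

    def forward(self, inputs, labels=None):
        self._validate_input(inputs)
        hidden = torch.relu(self.encoder(inputs))
        logits = self.head(hidden)
        # The loss is only computed when labels are supplied.
        loss = self._compute_loss(logits, labels) if labels is not None else None
        return YourModelOutput(loss=loss, logits=logits)


model = YourModel(YourModelConfig())
output = model(torch.randn(4, 8), labels=torch.tensor([0, 1, 0, 1]))
```

Computing the loss inside `forward` when labels are present is what lets the same model object be used unchanged for both training and inference.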

`output.py`
- Defines strongly-typed output classes inheriting from `ModelOutput`
- Standardizes outputs across different model components:
  - `AttentionOutput`: Output of individual attention layers
  - `EncoderBlockOutput`: Output of individual encoder blocks
  - `EncoderOutput`: Output of the entire encoder backbone
  - `YourModelOutput`: Final model output including loss and hidden states
- Improves code readability and type safety
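
A typed output class of this kind is a dataclass that subclasses `ModelOutput`. The sketch below shows two of the four classes; the field names are assumptions modelled on common Transformers conventions rather than taken from the template.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import torch
from transformers.utils import ModelOutput


@dataclass
class EncoderOutput(ModelOutput):
    """Output of the entire encoder backbone (hypothetical fields)."""

    last_hidden_state: Optional[torch.FloatTensor] = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None


@dataclass
class YourModelOutput(ModelOutput):
    """Final model output including loss and hidden states (hypothetical fields)."""

    loss: Optional[torch.FloatTensor] = None
    logits: Optional[torch.FloatTensor] = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None


output = YourModelOutput(logits=torch.zeros(2, 3))

# The same field is reachable by attribute, by key, and by position.
by_attribute = output.logits
by_key = output["logits"]
by_index = output[0]

# Fields left as None are skipped when converting to a tuple.
as_tuple = output.to_tuple()
```

Because unset fields are dropped from the tuple and key views, positional access depends on which fields were populated; attribute access is the more robust choice in calling code.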

`pipeline.py`
- Provides high-level, user-friendly inference interfaces
- Encapsulates the complete model usage workflow:
  - Model loading from Hugging Face Hub or local files
  - Configuration setup
  - Input preprocessing
  - Inference execution
  - Result post-processing
- Serves as the primary entry point for end-users
- Maintains compatibility with Hugging Face pipeline conventions
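
The preprocess, inference, and post-processing stages can be sketched with a small standalone class. This is an illustration only: it does not subclass the Transformers `Pipeline` class, it skips model loading entirely, and `toy_model`, the toy featurization, and the label names are all hypothetical stand-ins. The method names mirror the `preprocess` / `_forward` / `postprocess` split used by Hugging Face pipelines.

```python
from typing import Callable, Dict, List, Sequence, Union


class YourModelPipeline:
    """Chains preprocess, forward, and postprocess around any callable model."""

    def __init__(self, model: Callable[[List[float]], List[float]], labels: Sequence[str]):
        self.model = model
        self.labels = list(labels)

    def preprocess(self, text: str) -> List[float]:
        # Toy featurization (an assumption): character count and word count.
        return [float(len(text)), float(len(text.split()))]

    def _forward(self, features: List[float]) -> List[float]:
        return self.model(features)

    def postprocess(self, scores: List[float]) -> Dict[str, Union[str, float]]:
        # Pick the highest-scoring label.
        best = max(range(len(scores)), key=scores.__getitem__)
        return {"label": self.labels[best], "score": scores[best]}

    def __call__(self, text: str) -> Dict[str, Union[str, float]]:
        return self.postprocess(self._forward(self.preprocess(text)))


def toy_model(features: List[float]) -> List[float]:
    # Stand-in for a trained model: favors "long" when there are more than 3 words.
    is_long = 1.0 if features[1] > 3 else 0.0
    return [1.0 - is_long, is_long]


pipe = YourModelPipeline(toy_model, labels=["short", "long"])
result = pipe("a fairly long example sentence")
```

Keeping each stage in its own method lets end-users call the pipeline with raw input while tests can exercise the stages individually.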

`trainer.py`
- Contains all training logic
- Supports multiple training paradigms:
  - Supervised training
  - Large-scale pre-training
  - Unsupervised/self-supervised training
- The base `BaseTrainer` class implements common functionality
- Supports three implementation approaches:
  - Accelerate-based (full transparency and control)
  - PyTorch Lightning-based (high-level interface)
  - Hugging Face Trainer-based (best ecosystem integration)
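
One way to split common and paradigm-specific logic is an abstract base class that owns the epoch loop and delegates the per-batch update to subclasses. The sketch below uses only the standard library; the `training_step` hook, the `history` attribute, and the toy one-parameter regression are assumptions made for illustration, not the template's `BaseTrainer` interface.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence, Tuple


class BaseTrainer(ABC):
    """Owns the epoch loop; subclasses supply the per-batch update."""

    def __init__(self, num_epochs: int = 1):
        self.num_epochs = num_epochs
        self.history: List[float] = []  # mean loss per epoch

    @abstractmethod
    def training_step(self, batch) -> float:
        """Update parameters on one batch and return its loss."""

    def train(self, batches: Sequence) -> List[float]:
        # `batches` must be non-empty and re-iterable (for example, a list).
        for _ in range(self.num_epochs):
            losses = [self.training_step(batch) for batch in batches]
            self.history.append(sum(losses) / len(losses))
        return self.history


class SupervisedTrainer(BaseTrainer):
    """Fits y = w * x by gradient descent, standing in for a real model."""

    def __init__(self, learning_rate: float = 0.1, **kwargs):
        super().__init__(**kwargs)
        self.learning_rate = learning_rate
        self.weight = 0.0

    def training_step(self, batch: Tuple[float, float]) -> float:
        x, y = batch
        error = self.weight * x - y
        # Gradient of error**2 with respect to the weight is 2 * error * x.
        self.weight -= self.learning_rate * 2 * error * x
        return error ** 2


trainer = SupervisedTrainer(learning_rate=0.1, num_epochs=20)
history = trainer.train([(1.0, 2.0), (2.0, 4.0)])
```

An Accelerate-, Lightning-, or Hugging Face Trainer-based variant would replace the body of `train`, while the split between shared loop and per-paradigm step stays the same.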

`utils.py`
- Reusable helper functions used across the codebase
- Includes:
  - Visualization tools
  - Data processing helpers
  - Logging utilities
  - Common mathematical operations
- Designed to be generic and architecture-agnostic

`main.py`
- Entry point script for model pre-training and fine-tuning
- Called by shell scripts in the `scripts/` directory
- Parses command-line arguments
- Initializes config, model, dataset, and trainer
- Executes the training loop
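
The flow of the entry point can be sketched with `argparse`. The flag names (`--mode`, `--data-path`, `--epochs`, `--learning-rate`) are assumptions, and the construction of the config, model, dataset, and trainer is reduced to a returned summary so the example runs without the rest of the package.

```python
import argparse
from typing import Dict, List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Pre-train or fine-tune YourModel")
    parser.add_argument("--mode", choices=["pretrain", "finetune"], default="pretrain")
    parser.add_argument("--data-path", required=True)
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--learning-rate", type=float, default=3e-4)
    return parser.parse_args(argv)


def main(argv: Optional[List[str]] = None) -> Dict[str, object]:
    args = parse_args(argv)
    # A real script would build the config, model, dataset, and trainer here
    # and then start training; this sketch only records what would be run.
    run_plan = {
        "mode": args.mode,
        "data_path": args.data_path,
        "epochs": args.epochs,
        "learning_rate": args.learning_rate,
    }
    print(f"Would run {args.mode} for {args.epochs} epoch(s) on {args.data_path}")
    return run_plan


# Passing an explicit list stands in for the arguments a shell script would supply.
plan = main(["--mode", "finetune", "--data-path", "data/train.txt", "--epochs", "3"])
```

Accepting `argv` as a parameter keeps the entry point callable from tests and notebooks as well as from the shell scripts.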

Adding a new model
- Create a new configuration class in `config.py`
- Implement the model architecture in `model.py` (or create a new file)
- Define appropriate output classes in `output.py`
- Create a corresponding pipeline in `pipeline.py`
- Update `__init__.py` to expose the new classes

Adding a new dataset
- Implement a new `Dataset` or `IterableDataset` class in `dataset.py`
- Add data loading and processing logic
- Update the training script in `main.py` to support the new dataset

Adding a new trainer
- Create a new trainer class in `trainer.py` inheriting from `BaseTrainer`
- Implement the training loop and any custom logic
- Add command-line arguments in `main.py` to select the new trainer
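
One possible way to wire a new trainer into the command line is a small name-to-class registry, so that adding a trainer automatically extends the accepted `--trainer` values. The registry, the decorator, the `ContrastiveTrainer` class, and the flag names below are all hypothetical; the template may select trainers differently.

```python
import argparse
from typing import Dict, List, Optional, Type

TRAINER_REGISTRY: Dict[str, Type["BaseTrainer"]] = {}


def register_trainer(name: str):
    """Class decorator that records a trainer under a command-line name."""

    def decorator(cls):
        TRAINER_REGISTRY[name] = cls
        return cls

    return decorator


class BaseTrainer:
    def __init__(self, learning_rate: float):
        self.learning_rate = learning_rate

    def train(self) -> str:
        raise NotImplementedError


@register_trainer("supervised")
class SupervisedTrainer(BaseTrainer):
    def train(self) -> str:
        return f"supervised training with lr={self.learning_rate}"


@register_trainer("contrastive")
class ContrastiveTrainer(BaseTrainer):
    def train(self) -> str:
        return f"contrastive training with lr={self.learning_rate}"


def build_trainer(argv: Optional[List[str]] = None) -> BaseTrainer:
    parser = argparse.ArgumentParser()
    # The accepted choices come straight from the registry.
    parser.add_argument("--trainer", choices=sorted(TRAINER_REGISTRY), default="supervised")
    parser.add_argument("--learning-rate", type=float, default=3e-4)
    args = parser.parse_args(argv)
    return TRAINER_REGISTRY[args.trainer](learning_rate=args.learning_rate)


trainer = build_trainer(["--trainer", "contrastive", "--learning-rate", "0.001"])
message = trainer.train()
```

With this arrangement, the only change needed in the argument parser when a trainer is added is the decorator on the new class.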

Note: Replace all instances of `YourModel`, `yourmodel`, and `Your` with your actual project name throughout the codebase and documentation.
