An NVIDIA-sponsored deep learning project focused on large-scale multi-class video classification using spatial and temporal learning techniques.
The system was developed end-to-end as part of an academic research project and classifies videos into four categories:
- Animation
- Gaming
- Natural Content
- Flat Content
Key features:
- Deep learning-based video classification pipeline
- Frame-level spatial feature extraction
- Temporal sequence understanding
- Attention-based video representation
- Real-time inference support
- Robust preprocessing and augmentation pipeline
- Out-of-Distribution (OOD) trust scoring
- GPU-accelerated training and inference
The pipeline combines multiple deep learning components for accurate video understanding:
- ResNet18 / EfficientNet for spatial feature extraction
- Bi-LSTM for temporal modeling
- Multi-Head Self Attention for informative frame selection
- Model Ensembling for improved robustness
- Test-Time Augmentation (TTA) for better generalization
The preprocessing and prediction pipeline includes:
- Adaptive frame extraction
- Black-frame filtering
- Blur detection using Laplacian variance
- Image sharpening
- GPU preprocessing
- ImageNet normalization
- Video-level prediction aggregation
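Several of the steps above can be illustrated with a short NumPy-only sketch. The thresholds (`max_mean=10.0`, `blur_threshold=100.0`), the function names, and the choice of mean averaging for video-level aggregation are assumptions for illustration, not values taken from the project. The project uses OpenCV, where the Laplacian variance is commonly computed as `cv2.Laplacian(gray, cv2.CV_64F).var()`; the version below applies the same 4-neighbor kernel by hand so it has no OpenCV dependency.

```python
import numpy as np

# Standard ImageNet channel statistics (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def to_gray(frame):
    """RGB uint8 frame (H, W, 3) -> float32 grayscale (BT.601 weights)."""
    return frame.astype(np.float32) @ np.array([0.299, 0.587, 0.114], dtype=np.float32)


def is_black(frame, max_mean=10.0):
    """Treat a frame as black when its mean intensity is very low (assumed threshold)."""
    return float(to_gray(frame).mean()) < max_mean


def laplacian_variance(frame):
    """Variance of the 4-neighbor Laplacian; low values suggest a blurry frame."""
    g = to_gray(frame)
    lap = g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:] - 4.0 * g[1:-1, 1:-1]
    return float(lap.var())


def normalize(frame):
    """Scale to [0, 1], apply ImageNet normalization, return channels-first (3, H, W)."""
    x = frame.astype(np.float32) / 255.0
    return ((x - IMAGENET_MEAN) / IMAGENET_STD).transpose(2, 0, 1)


def select_frames(frames, blur_threshold=100.0):
    """Drop black and blurry frames, then normalize the survivors."""
    return [normalize(f) for f in frames
            if not is_black(f) and laplacian_variance(f) >= blur_threshold]


def aggregate(clip_probs):
    """Video-level prediction: average per-clip class probabilities, then take the argmax."""
    mean = np.asarray(clip_probs, dtype=np.float64).mean(axis=0)
    return int(mean.argmax()), mean
```

For example, `select_frames` discards an all-zero frame (black) and a uniform gray frame (zero Laplacian variance) while keeping a textured one.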
Results:
- Achieved approximately 91–93% classification accuracy
- Evaluated on a diverse dataset of roughly 3,500 to 4,000 videos
- Dataset curated from YouTube-8M
- Training performed on NVIDIA GPU servers (A100 GPUs)
Technology stack:
- Python
- PyTorch
- OpenCV
- NumPy
- CUDA
- Flask
- NVIDIA GPU Infrastructure
Deployment features:
- Flask-based inference server
- Video upload + URL-based classification
- Attention visualization
- OOD confidence scoring
- Real-time prediction pipeline
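The OOD confidence scoring mentioned above is not specified in detail here, so the following is only one plausible baseline: the maximum softmax probability, with a normalized-entropy certainty alongside it. The function names, the `min_trust=0.6` cutoff, and the response fields are hypothetical and chosen for illustration.

```python
import numpy as np

LABELS = ["Animation", "Gaming", "Natural Content", "Flat Content"]


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def trust_score(logits):
    """Return (max softmax probability, 1 - normalized entropy) for one logit vector."""
    p = softmax(np.asarray(logits, dtype=np.float64))
    msp = float(p.max())
    entropy = -float((p * np.log(p + 1e-12)).sum())
    certainty = 1.0 - entropy / float(np.log(p.shape[-1]))
    return msp, certainty


def classify_with_trust(logits, labels=LABELS, min_trust=0.6):
    """Attach a trust score to the prediction and flag low-trust inputs (assumed cutoff)."""
    msp, certainty = trust_score(logits)
    return {
        "label": labels[int(np.argmax(logits))],
        "trust": msp,
        "certainty": certainty,
        "in_distribution": msp >= min_trust,
    }
```

With this baseline, a flat logit vector such as `[0, 0, 0, 0]` yields a trust of 0.25 and is flagged as likely out-of-distribution, while a sharply peaked one passes the cutoff. An inference server could return this dictionary as its JSON response.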
The complete system was developed from scratch as part of an NVIDIA-sponsored academic project, including:
- Model architecture
- Training pipeline
- Data preprocessing
- Experimentation
- Optimization
- Inference APIs
- Evaluation framework
Developed at:
- Vishwakarma Institute of Information Technology (VIIT Pune)
The objective was to build a scalable and robust deep learning system that understands complex video content using both spatial and temporal information, while maintaining high accuracy and remaining deployable in real-world settings.