- Long-range correlations: Many characterization signals (diffraction, micrograph stacks) have dependencies that exceed practical CNN receptive fields.
- Self-attention: Lets every token attend to every other token directly, capturing global structure in one layer.
- Scaled dot-product attention: The core operation, softmax(QKᵀ/√d)V, and its O(L²) time and memory cost in sequence length L.
- Vision Transformer (ViT): Patchify → embed → encode → classify; transformers applied to image-like data.
- Flash Attention: A fused kernel that makes long sequences tractable without materialising the L×L matrix.
- ViT on 4D-STEM: Diffraction patches become a token sequence for a ViT encoder.
- Cross-attention across LPBF layers: Long-stack micrograph context for additive-manufacturing monitoring.
- Mamba / structured state-space models (SSMs): O(L) compute in sequence length, with a fixed-size recurrent state at inference rather than memory that grows with L; competitive on long sequences. Cross-reference the Week 7 time-series deck.
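
To make the scaled dot-product attention item above concrete, here is a minimal NumPy sketch. It assumes a single head, no learned projections (queries, keys and values are all the raw tokens), and toy sizes chosen purely for illustration; the function name and variables are hypothetical, not taken from the course code. The point is that the intermediate `scores` array has shape (L, L), which is where the quadratic cost comes from.

```python
import numpy as np


def scaled_dot_product_attention(q, k, v):
    """Return softmax(q k^T / sqrt(d)) v and the attention weights.

    q, k have shape (L, d); v has shape (L, d_v).
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (L, L): the O(L^2) term
    scores -= scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
L, d = 6, 4                      # toy sequence length and feature size (assumed)
x = rng.normal(size=(L, d))      # stand-in tokens, not real characterization data

# Self-attention without projections: every token attends to every other token.
out, weights = scaled_dot_product_attention(x, x, x)
print(out.shape, weights.shape)
```

Each row of `weights` is a probability distribution over all L tokens, so a single layer can mix information from any position. Doubling L quadruples the size of `weights`.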
- Recap of Week 9 (characterization signals) and why we now need attention.
- Long-range correlations exceed CNN receptive fields.
- Scaled dot-product attention: the formula and the cost.
- ViT in five lines: patchify, embed, encode, classify.
- Flash Attention: the kernel that makes long sequences tractable.
- ViT on 4D-STEM diffraction.
- Cross-attention across LPBF layer stacks.
- nn.MultiheadAttention vs F.scaled_dot_product_attention.
- Scaling alternatives (Mamba / SSMs): mention only.
- Anti-patterns: what not to do.
- Exercise preview.
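
The patchify and embed steps listed in the agenda can be sketched in NumPy as follows. This is an illustration under stated assumptions: one 2D diffraction pattern is treated as a small square image, patches are non-overlapping, and the embedding matrix is random rather than learned. The `patchify` helper and all sizes are hypothetical; the encoder and classifier stages are omitted.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W) image into non-overlapping patch x patch tiles.

    Returns an array of shape (num_tokens, patch * patch), in row-major tile order.
    """
    h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tiles = image.reshape(h // patch, patch, w // patch, patch)
    tiles = tiles.transpose(0, 2, 1, 3)        # (tile_row, tile_col, patch, patch)
    return tiles.reshape(-1, patch * patch)    # one flattened tile per token


rng = np.random.default_rng(0)
pattern = rng.random((8, 8))           # stand-in for a single diffraction pattern
tokens = patchify(pattern, patch=4)    # 4 tokens, each of length 16

# Hypothetical linear embedding: random weights here, learned in a real model.
w_embed = rng.normal(size=(16, 32))
embedded = tokens @ w_embed            # (num_tokens, embed_dim), input to the encoder
print(tokens.shape, embedded.shape)
```

The token count grows as (H / patch) × (W / patch), so smaller patches give finer spatial detail at the price of a longer sequence and a larger attention matrix.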
Summary for ML-PC Week 10:
- Motivates self-attention for long-range structure in characterization data.
- Covers scaled dot-product attention, the Vision Transformer, and Flash Attention.
- Applies transformers to 4D-STEM diffraction and LPBF layer-stack context.
- Notes Mamba / state-space models as scaling alternatives, and when not to reach for a transformer.