Discriminative Region Problem

The tendency of a standard classifier to latch onto only the most distinguishing features in order to output a confident classification, omitting other important traits.

Dynamic Masking Augmentation Techniques

A limitation of the Park et al. formalization is that dynamic masking techniques, such as PuzzleMix [1] and AutoMix [2], cannot be formally explained by their framework, as their theorem relies on the mask being stochastic and independent of pixel values. Once pixel-level awareness is added, the loss function can no longer be approximated as a straightforward function of the input gradient.

From Jang-Hyun Kim et al (PuzzleMix). We can see how crops can be optimized for maximum saliency learning, whereas standard CutMix is meant to be stochastic and unaware.
From Zicheng Liu et al (AutoMix). Further comparisons of mixed-sample data augmentation techniques against dynamic masking techniques.

Empirically, though these methods achieve state-of-the-art results against MixUp and CutMix in selected setups, the overall impact on head-class prediction is small. They also take significantly more compute per batch than stochastic techniques.

[1]
J.-H. Kim, W. Choo, and H. O. Song, “Puzzle Mix: Exploiting Saliency and Local Statistics for Optimal Mixup.” 2020. [Online]. Available: https://arxiv.org/abs/2009.06962
[2]
Z. Liu et al., “AutoMix: Unveiling the Power of Mixup for Stronger Classifiers.” 2022. [Online]. Available: https://arxiv.org/abs/2103.13027

F1 score

Harmonic mean of Precision and Recall. We use it when we want a single metric that balances both, especially when we have an uneven class distribution.

F1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}
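A quick check that the two F1 forms agree, using the fishing example from the precision and recall entries below (the counts are illustrative):

```python
# Classification metrics from confusion-matrix counts.

def precision(tp: int, fp: int) -> float:
    """Fraction of positive predictions that are correct."""
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that were found."""
    return tp / (tp + fn)

def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall: 2TP / (2TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Wide-net fishing: 80 fish caught (TP), 80 rocks (FP), 20 fish missed (FN).
print(precision(80, 80))  # 0.5
print(recall(80, 20))     # 0.8
print(f1(80, 80, 20))     # ~0.615, same as 2*0.5*0.8/(0.5+0.8)
```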

False Positive Rate (FPR)

\text{FPR} = \frac{FP}{FP + TN}

Generative Image Augmentation

An example of using Nano Banana 2 from Google to combine two ImageNet subjects

Earlier work has shown that using diffusion models to synthesize and augment data improves classification results on ImageNet [1], [2]. Given the latest developments in image generation models, these approaches have moved to more sophisticated enhancements, such as background diversification and saliency-aware generation pipelines [3], [4], [5]. Generally, these techniques isolate the ground-truth subject via segmentation or feature extraction, then use the generative model as an augmenter to build the surrounding context and handle the blending.
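A structural sketch of the isolate-then-generate pipeline described above. The function names and their bodies are stand-ins of my own: a real pipeline would use a segmentation model in place of the thresholding and a diffusion inpainting model in place of the random background.

```python
import numpy as np

def segment_subject(image: np.ndarray) -> np.ndarray:
    """Stand-in segmenter: foreground = pixels brighter than the mean."""
    return (image > image.mean()).astype(image.dtype)

def generate_context(shape: tuple) -> np.ndarray:
    """Stand-in generator: random noise instead of a diffusion model."""
    return np.random.default_rng(0).random(shape)

def augment(image: np.ndarray) -> np.ndarray:
    """Keep the ground-truth subject, regenerate the surrounding context."""
    mask = segment_subject(image)
    background = generate_context(image.shape)
    return mask * image + (1 - mask) * background

img = np.linspace(0.0, 1.0, 64).reshape(8, 8)
aug = augment(img)  # subject pixels preserved, background replaced
```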

From Zhao, Tianchen et al. The results demonstrate how saliency-aware generative techniques outperformed other augmentation baselines.
From Fazle Rahat et al. The table demonstrates that on classes with an abundance of training images, data augmentation techniques make little difference. However, on the tail end, where classes have fewer samples, generated augmentation significantly improved results.
[1]
S. Azizi, S. Kornblith, C. Saharia, M. Norouzi, and D. J. Fleet, “Synthetic Data from Diffusion Models Improves ImageNet Classification.” 2023. [Online]. Available: https://arxiv.org/abs/2304.08466
[2]
B. Trabucco, K. Doherty, M. Gurinas, and R. Salakhutdinov, “Effective Data Augmentation With Diffusion Models.” 2025. [Online]. Available: https://arxiv.org/abs/2302.07944
[3]
F. Rahat, M. S. Hossain, M. R. Ahmed, S. K. Jha, and R. Ewetz, “Data Augmentation for Image Classification using Generative AI.” 2024. [Online]. Available: https://arxiv.org/abs/2409.00547
[4]
B. Chen et al., “XVerse: Consistent Multi-Subject Control of Identity and Semantic Attributes via DiT Modulation.” 2025. [Online]. Available: https://arxiv.org/abs/2506.21416
[5]
T. Zhao et al., “Salient Concept-Aware Generative Data Augmentation,” Advances in Neural Information Processing Systems 38 (NeurIPS 2025). 2025.

ImageNet

ImageNet [1] is a large-scale visual dataset designed for object recognition research. It became foundational to modern computer vision, especially after the breakthrough of University of Toronto’s team in the 2012 ImageNet Large Scale Visual Recognition Challenge using AlexNet [2]. The original dataset includes 14 million images and 21,841 synsets (classes), though teams often evaluate on the ILSVRC Subset benchmark [3] consisting of 1.2 million training images and 1,000 classes.

The original ImageNet images were not exhaustively labeled: each image carries a single label, so an image labeled “dog” might also contain trees, cars, or people. ILSVRC was essential for adding stricter quality control, as well as bounding boxes and object localization for more nuanced identification.

[1]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009, doi: 10.1109/CVPR.2009.5206848.
[2]
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 1, pp. 1097–1105, 2012.
[3]
O. Russakovsky et al., “ImageNet Large Scale Visual Recognition Challenge,” International Journal of Computer Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015, doi: 10.1007/s11263-015-0816-y.

Jetson AGX Orin

An edge-AI system-on-module from NVIDIA, designed for running deep learning workloads on embedded and robotics devices.

L1 Regularization (Lasso)

A method of penalizing weights proportional to their L1 norm, the sum of their absolute values (Manhattan Distance). Pushes weights to zero, encourages sparsity.

\text{Loss}_{\text{total}} = \text{Loss}_{\text{data}} + \lambda \sum_i |w_i|

Where \lambda is the regularization strength and w_i are the model weights.

L2 Regularization (Ridge)

A method of penalizing weights proportional to their squared L2 norm, the sum of their squared values (Euclidean distance). Shrinks weights smoothly toward zero without forcing exact sparsity.

\text{Loss}_{\text{total}} = \text{Loss}_{\text{data}} + \lambda \sum_i w_i^2

Where \lambda is the regularization strength and w_i are the model weights.
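The two penalties differ only in the norm applied to the weights. A minimal numpy sketch of both terms and their gradients (`lam` is the λ regularization strength):

```python
import numpy as np

def l1_penalty(w, lam):
    """L1 term: lam * sum of absolute values. Pushes weights to exactly 0."""
    return lam * np.sum(np.abs(w))

def l1_grad(w, lam):
    # Subgradient, since |w| is not differentiable at 0.
    return lam * np.sign(w)

def l2_penalty(w, lam):
    """L2 term: lam * sum of squares. Shrinks weights without zeroing them."""
    return lam * np.sum(w ** 2)

def l2_grad(w, lam):
    return 2 * lam * w

w = np.array([0.5, -2.0, 0.0])
print(l1_penalty(w, 0.1))  # 0.1 * (0.5 + 2.0 + 0.0) = 0.25
print(l2_penalty(w, 0.1))  # 0.1 * (0.25 + 4.0)     = 0.425
```

Note that the L2 gradient shrinks each weight in proportion to its size, while the L1 subgradient applies a constant pull toward zero, which is why L1 produces sparse solutions.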

Linear Attention

Standard transformers apply a softmax over the attention scores, which introduces quadratic complexity in sequence length; modern techniques approximate it to reduce complexity to linear time [1].

[1]
M. Xu, X. Lin, X. Guo, W. Xu, and W. Cui, “Softmax Linear Attention: Reclaiming Global Competition.” 2026. [Online]. Available: https://arxiv.org/abs/2602.01744
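A numpy sketch of the kernel feature-map trick common to linear attention variants (the positive feature map φ below is an illustrative choice of mine, not necessarily the formulation in [1]): instead of materializing the n×n matrix softmax(QKᵀ), compute φ(Q)(φ(K)ᵀV), which is linear in sequence length n.

```python
import numpy as np

def phi(x):
    # Positive feature map (illustrative); real variants use elu(x)+1,
    # random features, and other kernels.
    return 1.0 + np.maximum(x, 0.0)

def linear_attention(Q, K, V):
    """O(n * d^2) attention: phi(Q) @ (phi(K).T @ V), row-normalized."""
    Kf = phi(K)                       # (n, d)
    context = Kf.T @ V                # (d, d_v) -- no n x n matrix is formed
    norm = phi(Q) @ Kf.sum(axis=0)    # (n,) per-row normalizer
    return (phi(Q) @ context) / norm[:, None]

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = rng.standard_normal((3, n, d))
out = linear_attention(Q, K, V)       # (6, 4)
```

By associativity, this equals the quadratic form (φ(Q)φ(K)ᵀ)V with row normalization, but the n×n score matrix is never built.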

Mixed-Sample Data Augmentation

Sample implementations of CutMix

In 2022, Park et al. provided a unified formal treatment of mixed-sample data augmentation (MSDA) methods [1], demonstrating that such techniques, including CutMix [2] and MixUp [3], are equivalent to designing a spatial decay kernel for gradient regularization. Each technique possesses its own strengths and limitations. For example, CutMix may introduce multi-label noise: it can erase a class that only existed in the cutout area, or introduce a new class that isn’t actually present in the pasted patch. There are further static enhancements on this technique, such as Fourier Mix (FMix) [4], which produces smooth and continuous patches instead of sharp ones, and ResizeMix [5], which resizes the patching image instead of cropping it.

Comparison diagram from the original CutMix paper, comparing MixUp, Cutout, and CutMix. Notice how CutMix is shown to positively impact saliency in Class Activation Mapping (CAM), whereas regional dropout and MixUp cannot take advantage of it.

The Park et al. formalization demonstrates how MSDA reshapes the loss landscape, forcing the model to learn smoother functions by penalizing erratic changes in its input gradients, weighted by how close the data points are (spatial decay). The core of the theorem relies on the stochastic nature of the applicable augmentation techniques. They used their theorems to formalize Hybrid Mix (HMix) and Gaussian Mix (GMix), intermediate spatial regularization techniques that sit between the CutMix and MixUp distributions.
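A minimal numpy sketch of the stochastic CutMix operation the theorem covers: sample λ ~ Beta(α, α), cut a box covering a (1 − λ) fraction of the image at a random location, paste the same region from a partner image, and mix the (scalar) labels by the actual pasted-area ratio:

```python
import numpy as np

def cutmix(x1, y1, x2, y2, alpha=1.0, rng=np.random.default_rng(0)):
    """Paste a random box from x2 into x1; mix labels by area ratio."""
    h, w = x1.shape[:2]
    lam = rng.beta(alpha, alpha)
    # Box with area ~ (1 - lam) of the image, centered at a random point.
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(h), rng.integers(w)
    top, bottom = np.clip([cy - cut_h // 2, cy + cut_h // 2], 0, h)
    left, right = np.clip([cx - cut_w // 2, cx + cut_w // 2], 0, w)
    mixed = x1.copy()
    mixed[top:bottom, left:right] = x2[top:bottom, left:right]
    # Recompute lam from the clipped box. Note the multi-label noise caveat
    # above: lam reflects area only, not what the region actually contains.
    lam = 1 - (bottom - top) * (right - left) / (h * w)
    return mixed, lam * y1 + (1 - lam) * y2

a, b = np.zeros((32, 32)), np.ones((32, 32))
img, label = cutmix(a, 0.0, b, 1.0)  # label equals the pasted-area fraction
```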

[1]
C. Park, S. Yun, and S. Chun, “A Unified Analysis of Mixed Sample Data Augmentation: A Loss Function Perspective.” 2022. [Online]. Available: https://arxiv.org/abs/2208.09913
[2]
S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features.” 2019. [Online]. Available: https://arxiv.org/abs/1905.04899
[3]
H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz, “mixup: Beyond Empirical Risk Minimization,” International Conference on Learning Representations (ICLR), 2018.
[4]
E. Harris, A. Marcu, M. Painter, M. Niranjan, A. Prügel-Bennett, and J. Hare, “FMix: Enhancing Mixed Sample Data Augmentation.” 2021. [Online]. Available: https://arxiv.org/abs/2002.12047
[5]
J. Qin, J. Fang, Q. Zhang, W. Liu, X. Wang, and X. Wang, “ResizeMix: Mixing Data with Preserved Object Information and True Labels.” 2020. [Online]. Available: https://arxiv.org/abs/2012.11101

MSDA Regularization

MixUp blends images globally across all pixels, creating semi-transparent ghost images that hamper structural awareness. MixUp-CAM [1] is a technique that formalizes uncertainty regularization constraints:

\mathcal{L}_{all} = \mathcal{L}_{cls}(I^{\prime}, Y^{\prime}) + \lambda_{em}\mathcal{L}_{em}(I^{\prime}) + \lambda_{con}\mathcal{L}_{con}(M)

This formula combines a classification loss, class-wise entropy regularization, and a spatial concentration loss to optimize the MixUp procedure and prevent the response from becoming too divergent.
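For reference, the base MixUp operation being regularized, following the standard Zhang et al. formulation (λ ~ Beta(α, α), applied identically to inputs and one-hot labels); the global blend is what produces the ghost-image effect:

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Blend two examples and their one-hot labels with the same weight."""
    lam = rng.beta(alpha, alpha)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

x, y = mixup(np.zeros((4, 4)), np.array([1.0, 0.0]),
             np.ones((4, 4)), np.array([0.0, 1.0]))
# Every pixel of x equals 1 - lam, and y = [lam, 1 - lam].
```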

CutMix already enforces spatial regularization through its geometry, encouraging feature exploration while preserving spatial integrity. There is some research into techniques like semantic proportioning, making the augmentation differentiable, and using dynamic view-scale crops [2], [3], [4] to incrementally improve augmentations. These gains are incremental and setting-specific, whereas the original CutMix usually suffices and remains the common benchmark.

[1]
Y.-T. Chang, Q. Wang, W.-C. Hung, R. Piramuthu, Y.-H. Tsai, and M.-H. Yang, “Mixup-CAM: Weakly-supervised Semantic Segmentation via Uncertainty Regularization.” 2020. [Online]. Available: https://arxiv.org/abs/2008.01201
[2]
S. Huang, X. Wang, and D. Tao, “SnapMix: Semantically Proportional Mixing for Augmenting Fine-grained Data.” 2020. [Online]. Available: https://arxiv.org/abs/2012.04846
[3]
B. Li et al., “DDAug: Differentiable Data Augmentation for Weakly Supervised Semantic Segmentation,” Trans. Multi., vol. 26, pp. 4764–4775, Jan. 2024, doi: 10.1109/TMM.2023.3326300.
[4]
H. Kim, D. Kim, P. Ahn, S. Suh, H. Cho, and J. Kim, “ContextMix: A context-aware data augmentation method for industrial visual inspection systems,” Engineering Applications of Artificial Intelligence, vol. 131, p. 107842, May 2024, doi: 10.1016/j.engappai.2023.107842.

Poisson Distribution

A discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant mean rate and independently of the time since the last event.

The probability of observing exactly k events is given by the formula:

P(X=k) = \frac{e^{-\lambda} \lambda^k}{k!}

Where \lambda is the mean number of events per interval and k is the number of occurrences.
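A direct transcription of the formula, checking that the probabilities sum to 1 (the example rate λ = 3 is arbitrary):

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) = e^(-lam) * lam^k / k!"""
    return exp(-lam) * lam ** k / factorial(k)

print(poisson_pmf(0, 3.0))                           # e^-3, about 0.0498
print(sum(poisson_pmf(k, 3.0) for k in range(50)))   # about 1.0
```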

precision (metric)

You’re fishing with a net. You use a wide net, and catch 80 of 100 total fish in a lake. That’s 80% recall. But you also get 80 rocks in your net. That means 50% precision, half of the net’s contents is junk. You could use a smaller net and target one pocket of the lake where there are lots of fish and no rocks, but you might only get 20 of the fish in order to get 0 rocks. That is 20% recall and 100% precision.

\text{Precision} = \frac{TP}{TP + FP}

recall (metric)

The fishing analogy from the precision entry applies here as well: the wide net catches 80 of 100 fish (80% recall) along with 80 rocks (50% precision), while the small, targeted net catches 20 fish and no rocks (20% recall, 100% precision).

Also just the True Positive Rate (TPR).

\text{Recall} = \frac{TP}{TP + FN}

Regularization

A sample of structural problems with smaller datasets and gradient descent

300 samples per class is small, making it inherently difficult for Stochastic Gradient Descent (SGD) to land reliably: the sample complexity is too low for SGD to consistently find flat, generalizable minima. We need a regularization strategy that mitigates our limitations and prevents convergence to a sharp minimum (overfitting). This forces us into a situation in which we must squeeze out every state-of-the-art regularization technique we can to make the data as meaningful as possible.

SEAM

From Yude Wang et al. Comparisons across the original data, ground truth, baseline CAM, and SEAM-produced CAM. SEAM demonstrates a significant increase in saliency integrity.

It has been demonstrated that standard CAMs fail to cover entire objects because they are invariant to translation and other transformations [1]. The localization and segmentation we seek require equivariance, where masks must transform alongside the augmentation. Yude Wang et al. introduced the Self-supervised Equivariant Attention Mechanism (SEAM) with an Equivariant Cross Regularization (ECR) loss: take the original image, generate its CAM, and apply a spatial transformation T to that CAM; then apply T to the input, generate a second CAM, and enforce through the loss that the two pathways yield the exact same spatial tensor:

L_{ECR} = \| M(T(x)) - T(M(x)) \|_2^2 \tag{1}

By adding this to the standard cross-entropy loss, we force the network to produce stable object localizations rather than peaked activations that shift wildly when the image is augmented.

In the original implementation by Yude Wang et al., rotation and translation failed to produce sufficient supervision, while rescaling yielded major gains (47.43% to 55.41%).
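A numpy sketch of the ECR loss in Eq. (1), using a horizontal flip as the transformation T and stand-in CAM functions of my own (a real implementation would use the network's CAM head, and SEAM found rescaling more effective than flips or translations):

```python
import numpy as np

def T(x):
    """Spatial transformation: horizontal flip."""
    return x[:, ::-1]

def ecr_loss(M, x):
    """|| M(T(x)) - T(M(x)) ||_2^2 -- zero iff M is equivariant under T."""
    return np.sum((M(T(x)) - T(M(x))) ** 2)

x = np.random.default_rng(0).random((8, 8))

def equivariant_cam(img):
    # Pointwise op: commutes with any spatial permutation, so loss is 0.
    return np.maximum(img - 0.5, 0)

def biased_cam(img):
    # Left-to-right positional bias breaks flip equivariance.
    return img * np.linspace(0, 1, 8)

print(ecr_loss(equivariant_cam, x))  # 0.0
print(ecr_loss(biased_cam, x))       # > 0: penalized by the loss
```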

[1]
Y. Wang, J. Zhang, M. Kan, S. Shan, and X. Chen, “Self-supervised Equivariant Attention Mechanism for Weakly Supervised Semantic Segmentation.” 2020. [Online]. Available: https://arxiv.org/abs/2004.04581

Sparse mixture-of-expert models

Mixture-of-experts models in which a routing network activates only a small subset of expert sub-networks for each input, decoupling total parameter count from per-example compute.

WSOL

Weakly Supervised Object Localization (WSOL) is a task where we try to locate objects in an image using only image-level classification labels rather than detailed bounding boxes or masks.

WSSS

Weakly-Supervised Semantic Segmentation (WSSS) is a task where the data lacks pixel-level labels and masks, so the model must learn from weaker supervision instead. Both WSOL and WSSS apply to our problems, and there is significant literature available.