ConvNN-Attention
Unified Convolution-Attention operation for Vision Transformers
TL;DR
Convolutional Nearest Neighbor Attention (ConvNN-Attention) applies the Convolutional Nearest Neighbors framework to Vision Transformer self-attention mechanisms, demonstrating that attention can be effectively implemented as a sparse k-NN operation that selects neighbors based on feature similarity rather than spatial proximity. This approach reduces computational complexity from $O(n^2)$ to $O(nk \log(k))$ while maintaining or improving performance on vision tasks, providing a unified perspective on attention and convolution.
Paper
Attention Via Convolutional Nearest Neighbors
Mingi Kang, Jeova Farias Sales Rocha Neto
Bowdoin College
Available on arXiv: 2511.14137
Key Contributions
- Achieves 0.25-0.66% top-1 accuracy improvements over standard self-attention on ImageNet-1K image classification with the ViT-Base architecture
- Reduces computational cost by ~2% in GFLOPS while improving accuracy
- Reveals that attending to all positions may be unnecessary for learning global dependencies
- Shows that ConvNN is a unifying framework, with both convolution and attention recoverable as special cases
The Attention-as-k-NN Perspective
Standard self-attention computes attention over all $n$ tokens:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V\]
Issues:
- $O(n^2)$ complexity due to full pairwise similarity computation
- Assumes all positions should influence output
- May attend to uninformative or distant tokens
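For reference, the dense baseline above can be written in a few lines of NumPy; note the full $n \times n$ score matrix that drives the $O(n^2)$ cost:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Standard scaled dot-product attention: O(n^2) in sequence length."""
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)   # (n, n) full pairwise score matrix
    return softmax(S, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 6, 4
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = self_attention(Q, K, V)
print(out.shape)  # (6, 4)
```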
ConvNN-Attention Solution: Restrict attention to $k$ most similar keys, selected from candidate set of size $r$:
\[\text{ConvNN-Attention}(Q, K, V) = \text{softmax}(S_{\text{top-k}}) V\]
where $S_{\text{top-k}}$ contains the $k$ largest similarity scores per query.
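A minimal sketch of the top-k-restricted attention above (for illustration only: it still materializes the full score matrix, so the practical savings come from also restricting the candidate set to $r < n$ keys, as described later):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def topk_attention(Q, K, V, k):
    """Attention restricted to each query's k most similar keys."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    idx = np.argpartition(S, -k, axis=-1)[:, -k:]     # k best keys per query
    w = softmax(np.take_along_axis(S, idx, axis=-1))  # (n, k) sparse weights
    return np.einsum('nk,nkd->nd', w, V[idx])         # aggregate only k values

rng = np.random.default_rng(0)
n, d, k = 8, 4, 3
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = topk_attention(Q, K, V, k)
print(out.shape)  # (8, 4)
```

With `k = n` this reduces to dense attention, which is the unifying point of the framework.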
ConvNN Core Steps
ConvNN operates in three core steps:
1. Similarity Computation \(S = QK^{\top}, \quad Q = f_Q(X), \quad K = f_K(X)\) where $f_Q$ and $f_K$ are learnable projections and similarity is computed via cosine similarity after ℓ2 normalization.
2. Neighbor Selection and Modulation \(s_i = \text{k-max}_k(S)[i,:], \quad I_i = \text{k-argmax}_k(S)[i,:]\) \(X^{nn,i} = S_i \cdot V[I_i,:] \in \mathbb{R}^{k \times c}\) where $S_i = \text{diag}(\rho(s_i))$ applies a weighting scheme (identity or softmax).
3. Weighted Aggregation Apply Conv1D (standard or depthwise) to concatenated neighbor matrices with stride $k$.
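The three steps above can be sketched as follows. This is a hedged NumPy approximation, not the authors' implementation: values are taken as the raw features $X$ for simplicity, $\rho$ is fixed to softmax, and the stride-$k$ depthwise Conv1D is written as its equivalent per-channel weighted sum over each token's $k$ neighbors:

```python
import numpy as np

def convnn_attention(X, Wq, Wk, k, conv_w=None):
    """Sketch of ConvNN's three steps: cosine similarity, k-argmax
    selection with softmax modulation, depthwise Conv1D aggregation."""
    # 1. Similarity: cosine similarity via l2-normalized projections
    Q, K = X @ Wq, X @ Wk
    Qn = Q / np.linalg.norm(Q, axis=-1, keepdims=True)
    Kn = K / np.linalg.norm(K, axis=-1, keepdims=True)
    S = Qn @ Kn.T
    # 2. Neighbor selection and modulation (rho = softmax)
    idx = np.argpartition(S, -k, axis=-1)[:, -k:]     # k-argmax indices
    s = np.take_along_axis(S, idx, axis=-1)           # k-max scores
    rho = np.exp(s - s.max(-1, keepdims=True))
    rho /= rho.sum(-1, keepdims=True)
    Xnn = rho[..., None] * X[idx]                     # (n, k, c) modulated neighbors
    # 3. Aggregation: depthwise Conv1D with stride k, i.e. a per-channel
    # weighted sum over the k neighbors of each token
    if conv_w is None:
        conv_w = np.ones((k, X.shape[-1]))            # unit init == plain summation
    return (Xnn * conv_w).sum(axis=1)                 # (n, c)

rng = np.random.default_rng(0)
n, c, k = 10, 6, 4
X = rng.standard_normal((n, c))
Wq, Wk = rng.standard_normal((c, c)), rng.standard_normal((c, c))
out = convnn_attention(X, Wq, Wk, k)
print(out.shape)  # (10, 6)
```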
As Self-Attention:
- Set $k = n$ (all features as candidates)
- Set $\rho = \text{softmax}$
- Apply depthwise convolution with unit weights
- Result: recovers the exact self-attention computation
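This reduction can be checked numerically. The sketch below uses the same scaled dot-product scores on both paths so the equality is exact (an assumption for the demo; recovering standard attention from ConvNN's cosine-similarity scores likewise requires matching score functions):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
n, d = 5, 3
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
S = Q @ K.T / np.sqrt(d)

dense = softmax(S) @ V                    # standard self-attention

# ConvNN path with k = n: every key is selected, rho = softmax,
# and unit depthwise weights reduce aggregation to a plain sum
idx = np.argsort(S, axis=-1)              # all n key indices per query
w = softmax(np.take_along_axis(S, idx, axis=-1))
convnn = np.einsum('nk,nkd->nd', w, V[idx])

print(np.allclose(dense, convnn))  # True
```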
Neighbor Selection Strategies
1. All-Feature Selection (standard attention):
- Use all $n$ tokens as candidates
- Full similarity computation $O(n^2)$
2. Random Selection:
- Randomly sample $r < n$ tokens as candidates
- Reduces complexity to $O(nr \log(r))$
- Encourages learning of long-range, non-local relationships
3. Spatial Selection:
- Subsample spatially: select every $\sqrt{r}$-th patch
- Preserves some locality structure
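The two sparse candidate strategies are simple index-selection rules. A sketch, where the spatial rule is one plausible reading of "every $\sqrt{r}$-th patch" (keeping a regular $\sqrt{r} \times \sqrt{r}$ sub-grid is an assumption, not confirmed by the source):

```python
import numpy as np

def random_candidates(n, r, rng):
    """Random selection: sample r of n token indices without replacement."""
    return rng.choice(n, size=r, replace=False)

def spatial_candidates(grid, r):
    """Spatial selection: keep a regular sqrt(r) x sqrt(r) sub-grid of a
    (grid x grid) patch grid, preserving some locality structure."""
    side = int(np.sqrt(r))
    step = grid // side
    keep = np.arange(side) * step             # rows/cols kept in each dimension
    return (keep[:, None] * grid + keep[None, :]).ravel()

rng = np.random.default_rng(0)
rand_idx = random_candidates(196, 32, rng)    # ViT-Base: 14x14 = 196 patches
spat_idx = spatial_candidates(14, 49)
print(len(rand_idx), len(spat_idx))  # 32 49
```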
ViT-Base ImageNet-1K Results
| Model | Test Loss | Top-1 Accuracy | GFLOPS | Parameters |
|---|---|---|---|---|
| Standard Self-Attention | 0.899 | 80.94% | 35.131 | 86.379M |
| ConvNN-Attention All ($k=9$) | 0.855 | 81.19% | 34.432 | 86.386M |
| ConvNN-Attention Random ($k=9, r=32$) | 1.064 | 76.04% | 33.833 | 86.386M |
| ConvNN-Attention Spatial ($k=9, r=128$) | 0.931 | 79.31% | 34.181 | 86.391M |
| ConvNN-Attention All ($k=16$) | 0.838 | 81.60% | 34.445 | 86.391M |
Best model (ConvNN-Attention All with $k=16$) achieves 0.66% improvement over standard attention while reducing GFLOPS by ~2%.
Architecture Details
The Convolutional Nearest Neighbor framework is applied to the attention layers of ViT as follows:
- Replace standard self-attention with ConvNN
- Use depthwise 1D convolution for weighted aggregation
- Initialize depthwise weights to 1.0 (equivalent to summation)
- Same number of heads, embedding dimensions as baseline ViT
Citation
```bibtex
@article{kang2025attention,
  title={Attention Via Convolutional Nearest Neighbors},
  author={Kang, Mingi and Neto, Jeov{\'a} Farias Sales Rocha},
  journal={arXiv preprint arXiv:2511.14137},
  year={2025}
}
```