TL;DR

Convolutional Nearest Neighbors (ConvNN) is a unified framework that treats both convolution and attention as k-nearest-neighbor aggregation, dissolving the apparent distinction between them. Convolution selects neighbors by spatial proximity, while attention selects them by feature similarity; ConvNN formalizes the spectrum between the two. The framework enables systematic exploration of hybrid configurations that combine spatial and feature-based neighbor selection, with consistent accuracy improvements reported on CIFAR-10/100 for VGG and ViT architectures and a 0.56% gain on ResNet-50 ImageNet-1K classification.

Paper

Attention Via Convolutional Nearest Neighbors
Mingi Kang, Jeova Farias Sales Rocha Neto
Bowdoin College

Available on arXiv: 2511.14137

Key Contributions

Unifies convolution and attention within a single k-NN aggregation framework
Proves both operations are special cases of neighbor selection: convolution selects by spatial proximity, attention by feature similarity
Introduces a hybrid branching layer that balances local (spatial) and global (feature) processing
Demonstrates ConvNN can be exactly configured to recover standard convolution or attention
Shows consistent accuracy improvements on CIFAR-10/100 across CNN (VGG) and ViT architectures
Provides efficient sparse neighbor search strategies (random and spatial) that reduce complexity from $O(n^2)$ to $O(nr \log(r))$
Achieves a 0.56% improvement over standard convolution on ResNet-50 ImageNet-1K classification
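
To make the "recovers attention" claim concrete, here is a minimal NumPy sketch of the limiting case. It is an illustration under stated assumptions, not the paper's implementation: the function names (`dense_attention`, `knn_attention`), the scaled dot-product similarity, and the softmax weighting over selected neighbors are all choices made for this example. When every position is selected as a neighbor (number of neighbors equal to n), the gather-and-weight step reduces to ordinary softmax attention.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dense_attention(q, key, val):
    # Standard scaled dot-product attention over all n positions.
    scores = q @ key.T / np.sqrt(q.shape[1])
    return softmax(scores) @ val

def knn_attention(q, key, val, num_neighbors):
    # Hypothetical helper: keep only the most similar positions per query,
    # then softmax-weight and sum their values.
    scores = q @ key.T / np.sqrt(q.shape[1])                 # (n, n)
    idx = np.argsort(-scores, axis=1)[:, :num_neighbors]     # (n, num_neighbors)
    top = np.take_along_axis(scores, idx, axis=1)            # selected scores
    weights = softmax(top)                                   # renormalized over neighbors
    return np.einsum("nk,nkd->nd", weights, val[idx])        # weighted neighbor sum

rng = np.random.default_rng(0)
q, key, val = (rng.normal(size=(6, 4)) for _ in range(3))
full = knn_attention(q, key, val, num_neighbors=6)    # all positions selected
sparse = knn_attention(q, key, val, num_neighbors=2)  # only the 2 most similar
```

With `num_neighbors=6` the result matches `dense_attention(q, key, val)` up to floating-point error; smaller values keep the same aggregation step but restrict it to the most similar positions.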

Motivation: The Convolution-Attention Spectrum

Despite their apparent differences, convolution and self-attention share a fundamental principle: neighbor aggregation.

Convolution: Aggregates features from spatially adjacent neighbors

Fixed spatial neighborhoods (by kernel size)
Local feature extraction
Explicit spatial inductive bias

Self-Attention: Aggregates features from all positions based on learned similarity

Global receptive field
Feature-based selection
High computational cost: $O(n^2)$ in the number of positions $n$

Key Insight: These differences arise from neighbor selection strategy, not the aggregation principle itself.
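
The insight above can be sketched in a few lines of NumPy. This is a simplified illustration, assuming a 1D sequence of n feature vectors, cosine similarity as the feature metric, and plain averaging in place of learned kernel or attention weights; the function names are hypothetical and not taken from the paper. The two selection functions differ, but both feed the same aggregation step.

```python
import numpy as np

def spatial_neighbors(n, k):
    # Convolution-style selection: the k positions closest in index
    # (each position counts as its own nearest neighbor).
    pos = np.arange(n)
    dist = np.abs(pos[:, None] - pos[None, :])               # (n, n) index distances
    return np.argsort(dist, axis=1, kind="stable")[:, :k]

def feature_neighbors(x, k):
    # Attention-style selection: the k positions with the highest
    # cosine similarity, regardless of where they sit in the sequence.
    xn = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = xn @ xn.T                                          # (n, n) similarities
    return np.argsort(-sim, axis=1, kind="stable")[:, :k]

def aggregate(x, idx):
    # Shared step: average the features of the selected neighbors.
    # Uniform weights are an assumption made to keep the sketch short.
    return x[idx].mean(axis=1)                               # x[idx] is (n, k, d)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))                                  # 8 positions, 4 features

s_idx = spatial_neighbors(len(x), k=3)
f_idx = feature_neighbors(x, k=3)
local_out = aggregate(x, s_idx)                              # local, convolution-like
global_out = aggregate(x, f_idx)                             # global, attention-like
```

Only the index arrays `s_idx` and `f_idx` differ between the two paths; `aggregate` is identical, which is the sense in which the two operations share one aggregation principle.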