IGLU | Mingi Kang

TL;DR

IGLU (Integrated Gaussian Linear Unit) is a parametric activation function derived as a scale mixture of GELU gates under a half-normal distribution. This derivation yields a closed-form expression whose gating component is exactly the Cauchy CDF, providing a principled one-parameter family that continuously interpolates between identity-like and ReLU-like behavior via a single sharpness parameter $\sigma$. Unlike GELU’s Gaussian gate, IGLU’s heavy-tailed Cauchy gate decays polynomially in the negative tail, guaranteeing non-zero gradients for all finite inputs and offering greater robustness to vanishing gradients.

Paper

IGLU: The Integrated Gaussian Linear Unit Activation Function
Mingi Kang, Zai Yang, Jeova Farias Sales Rocha Neto
Bowdoin College

Proceedings of the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2026. arXiv:2603.06861

Key Contributions

Derives IGLU as a continuous scale mixture of GELU gates under a half-normal distribution
Gating component equals the Cauchy CDF, which guarantees non-zero gradients for all finite inputs
IGLU-Approx: efficient rational approximation using only ReLU operations (no transcendental functions)
Demonstrates strong performance on balanced and especially imbalanced datasets
Provides a theoretical unification of ReLU and GELU via a single parameter $\sigma$

Integrated Gaussian Linear Unit

First recall the definition of GELU:

\[\text{GELU}(x) = x \cdot \Phi(x), \Phi(x) = \int_{-\infty}^{x} \frac{1}{\sqrt{2\pi}} e^{-\frac{t^2}{2}} dt\]

Since GELU follows the gated-linear form $x \cdot g(x)$, we can introduce a configurable parameter $a$ that scales or sharpens the gate, producing parameterized variants:

\[\text{GELU}_a(x;a) = x \cdot \Phi(a x)\]

As $a \to \infty$, $\Phi(ax)$ approaches a unit step and the function converges to ReLU; as $a \to 0$, $\Phi(ax) \to \tfrac{1}{2}$ and the function converges to the scaled identity $x/2$.
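The two limits can be checked numerically. The following is a minimal sketch in plain Python; the helper names `std_normal_cdf`, `gelu_a`, and `relu` are illustrative and not taken from the paper, and $\Phi$ is computed through `math.erf`.

```python
import math

def std_normal_cdf(z: float) -> float:
    """Standard normal CDF Phi(z), expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_a(x: float, a: float) -> float:
    """Sharpness-parameterized GELU: x * Phi(a * x)."""
    return x * std_normal_cdf(a * x)

def relu(x: float) -> float:
    return max(0.0, x)

# Compare a very small a, a = 1 (plain GELU), and a very large a.
for x in (-2.0, -0.5, 0.5, 2.0):
    print(
        f"x={x:+.1f}  a->0: {gelu_a(x, 1e-9):+.4f} (x/2={x / 2:+.4f})  "
        f"a=1: {gelu_a(x, 1.0):+.4f}  "
        f"a->inf: {gelu_a(x, 1e6):+.4f} (relu={relu(x):+.4f})"
    )
```

With a very small `a` the printed values track $x/2$, and with a very large `a` they track ReLU, matching the limits described above.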

Rather than fixing a single value of $a$, we can treat it as a latent scale variable and average over a continuum of gating strengths:

\[\text{IGLU}(x; \sigma) = \int_{0}^{\infty} \text{GELU}_a(x; a) f(a;\sigma)da,\]

where $f(a;\sigma)$ is a non-negative weighting function parameterized by $\sigma > 0$. This formulation allows the activation function to adaptively integrate information across a range of gating strengths, potentially enhancing its expressiveness and performance in neural networks.

Now, when we choose $f(a;\sigma)$ to be the half-normal density with scale $\sigma$ (the distribution of $|A|$ for $A \sim \mathcal{N}(0, \sigma^2)$), the factor $x$ comes out of the integral and the remaining gate is:

\[Z(x; \sigma) = \int_{0}^{\infty} \Phi(a x) \cdot \frac{\sqrt{2}}{\sigma \sqrt{\pi}} e^{-\frac{a^2}{2\sigma^2}} da,\]

where $\text{IGLU}(x; \sigma) = x \cdot Z(x; \sigma)$.
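Before solving the integral analytically, it can be estimated numerically as a sanity check. The sketch below is an illustration under stated assumptions, not the authors' code: it substitutes $a = \sigma u$ (so the weight becomes $\sqrt{2/\pi}\, e^{-u^2/2}$), truncates the integral at a hypothetical upper limit `upper`, applies the trapezoid rule, and compares the result with the Cauchy-CDF gate $\tfrac{1}{2} + \arctan(\sigma x)/\pi$ stated in the TL;DR. The function names are made up for this example.

```python
import math

def std_normal_cdf(z: float) -> float:
    """Standard normal CDF Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mixture_gate(x: float, sigma: float, upper: float = 12.0, steps: int = 20000) -> float:
    """Trapezoid estimate of Z(x; sigma) after the substitution a = sigma * u.

    The tail beyond `upper` is dropped; exp(-upper**2 / 2) is negligible there.
    """
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        u = i * h
        weight = 0.5 if i in (0, steps) else 1.0  # trapezoid end-point weights
        total += weight * std_normal_cdf(sigma * u * x) * math.exp(-0.5 * u * u)
    return math.sqrt(2.0 / math.pi) * h * total

def cauchy_gate(x: float, sigma: float) -> float:
    """Cauchy CDF with location 0 and scale 1 / sigma."""
    return 0.5 + math.atan(sigma * x) / math.pi

for sigma in (0.5, 1.0, 2.0):
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(
            f"sigma={sigma:.1f} x={x:+.1f}  "
            f"quadrature={mixture_gate(x, sigma):.6f}  "
            f"closed_form={cauchy_gate(x, sigma):.6f}"
        )
```

The two columns should agree to several decimal places, which is what the analytic derivation that follows establishes exactly.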

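To solve this integral, we write $\Phi(ax)$ as an integral and substitute $t = as$, so that $\Phi(ax) = \int_{-\infty}^{x} \frac{a}{\sqrt{2\pi}}\, e^{-a^2 s^2/2}\, ds$ for $a > 0$:

\[Z(x;\sigma) = \frac{2}{\sqrt{2\pi}\,\sigma} \int_0^\infty e^{-a^2/(2\sigma^2)} \int_{-\infty}^{x} \frac{a}{\sqrt{2\pi}}\, e^{-a^2 s^2/2}\, ds\, da.\]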

We then swap the order of integration and evaluate the integral over $a$, which is now the inner one:

\[\int_0^\infty a\, e^{-a^2\left(\frac{1}{\sigma^{2}} + s^2\right)/2} da = \frac{1}{\sigma^{-2} + s^2} = \frac{\sigma^2}{1 + \sigma^2 s^2},\]

which leads to the closed-form solution for $Z(x; \sigma)$:

\[Z(x;\sigma) = \frac{\sigma}{\pi} \int_{-\infty}^{x} \frac{ds}{1 + \sigma^2 s^2} = \frac{1}{\pi}\Big[\arctan(\sigma s)\Big]_{-\infty}^{x} = \frac{1}{2} + \frac{\arctan(\sigma x)}{\pi},\]

and thus the final closed-form expression for IGLU is:

\[\text{IGLU}(x; \sigma) = x \cdot Z(x; \sigma) = x \left( \frac{1}{2} + \frac{\arctan(\sigma x)}{\pi} \right)\]
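As a minimal sketch (plain Python on scalars, with the hypothetical function names `iglu_gate`, `iglu`, and `iglu_grad`), the closed form and its derivative can be written directly. The derivative is obtained here by applying the product rule to $x \cdot Z(x;\sigma)$, giving $Z(x;\sigma) + \frac{\sigma x}{\pi(1 + \sigma^2 x^2)}$; it is not quoted from the paper.

```python
import math

def iglu_gate(x: float, sigma: float = 1.0) -> float:
    """Closed-form gate Z(x; sigma) = 1/2 + arctan(sigma * x) / pi."""
    return 0.5 + math.atan(sigma * x) / math.pi

def iglu(x: float, sigma: float = 1.0) -> float:
    """IGLU(x; sigma) = x * Z(x; sigma)."""
    return x * iglu_gate(x, sigma)

def iglu_grad(x: float, sigma: float = 1.0) -> float:
    """Derivative of IGLU with respect to x, from the product rule."""
    u = sigma * x
    return iglu_gate(x, sigma) + u / (math.pi * (1.0 + u * u))

for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"x={x:+.1f}  iglu={iglu(x):+.6f}  grad={iglu_grad(x):.6f}")
```

In this sketch the derivative at $x = -10$ is small but still positive, in line with the non-zero-gradient property described in the TL;DR. Passing a very large `sigma` makes `iglu` behave like ReLU, and a very small `sigma` makes it approach $x/2$.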

Comparison of ReLU, GELU, and IGLU activation functions

IGLU with varying $\sigma$ values: smaller $\sigma$ yields heavier-tailed gate

IGLU vs IGLU-Approx: negligible difference (<0.025 max error)

Integrated Gaussian Linear Unit Approximation

We approximate the arctangent with a simple rational function:

\[\arctan(\sigma x) \approx \frac{\pi}{2} \cdot \frac{\sigma x}{1 + |\sigma x|},\]

which is continuous, odd, and saturates correctly as $x \to \pm \infty$. Substituting into the original gating function yields:

\[Z_{\text{approx}}(x; \sigma) = \frac{1}{2} \frac{1 + 2\max(0, \sigma x)}{1 + |\sigma x|} = \frac{1}{2} \frac{1 + 2\text{ReLU}(\sigma x)}{1 + \text{ReLU}(\sigma x) + \text{ReLU}(-\sigma x)},\]

where we used the identities $x + \lvert x\rvert = 2\max(0, x)$ and $\lvert x\rvert = \max(0, x) + \max(0, -x)$.

This approximation keeps the same limits as the exact gate (0 as $x \to -\infty$ and 1 as $x \to +\infty$) and the same polynomial $O(1/|x|)$ decay in the negative tail, while being cheaper to evaluate.

Therefore, the final approximation for IGLU is:

\[\text{IGLU}_{\text{approx}}(x; \sigma) = x \cdot Z_{\text{approx}}(x; \sigma) = \frac{x}{2} \cdot \frac{1 + 2\text{ReLU}(\sigma x)}{1 + \text{ReLU}(\sigma x) + \text{ReLU}(-\sigma x)}\]
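The approximation can be sketched in a few lines of plain Python. The names `iglu_approx_gate` and `iglu_approx` are hypothetical, the code works on scalars rather than tensors, and the comparison grid ($x \in [-20, 20]$ with $\sigma = 1$) is an arbitrary choice made for this example.

```python
import math

def relu(x: float) -> float:
    return max(0.0, x)

def iglu_approx_gate(x: float, sigma: float = 1.0) -> float:
    """Rational gate built only from ReLU, addition, and division."""
    pos = relu(sigma * x)    # max(0,  sigma * x)
    neg = relu(-sigma * x)   # max(0, -sigma * x); pos + neg == |sigma * x|
    return 0.5 * (1.0 + 2.0 * pos) / (1.0 + pos + neg)

def iglu_approx(x: float, sigma: float = 1.0) -> float:
    """IGLU-Approx: x times the rational gate."""
    return x * iglu_approx_gate(x, sigma)

def iglu_gate(x: float, sigma: float = 1.0) -> float:
    """Exact arctangent gate, included only for comparison."""
    return 0.5 + math.atan(sigma * x) / math.pi

# Largest absolute gap between the exact and approximate gates on the grid.
grid = [i / 100.0 for i in range(-2000, 2001)]
max_gate_gap = max(abs(iglu_gate(x) - iglu_approx_gate(x)) for x in grid)
print(f"max |Z - Z_approx| on grid: {max_gate_gap:.4f}")
```

On this grid the gap between the two gates stays below 0.025. The denominator is always at least 1, so the division is safe for every finite input.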

Computational Efficiency

IGLU-Approx benchmarks (relative to Identity function):

CPU Forward: 10.17× (vs GELU: 16.40×)
GPU Forward: 15.05× (vs GELU: 15.36×)
Competitive with ReLU while preserving smooth gradient flow
No transcendental function evaluation required

Citation

@article{kang2026iglu,
    title={IGLU: The Integrated Gaussian Linear Unit Activation Function},
    author={Kang, Mingi and Yang, Zai and Neto, Jeova Farias Sales Rocha},
    journal={arXiv preprint arXiv:2511.XXXXX},
    year={2026}
}