Mean and Variance of a Multivariate Normal Distribution Truncated at 0, and Moments of the ReLU Output Distribution

Ahua AIGC Lab (阿华AIGC实验室)

2026-5-19

Mean and Covariance of Element-wise ReLU Applied to a Multivariate Normal Distribution

This is a core calculation for analyzing ReLU activations in deep learning, especially when the Central Limit Theorem (CLT) is used to approximate layer inputs as normally distributed. We treat the univariate case first and then extend to the multivariate setting: because ReLU acts element-wise, each entry of the mean depends only on that dimension's marginal distribution, whereas the covariance must account for correlations between dimensions.

1. Univariate Case (Single Dimension)

Let $X \sim \mathcal{N}(\mu, \sigma^2)$, and $X_+ = \max(0, X)$.

Mean $\mathbb{E}[X_+]$

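Strictly speaking, $X_+$ follows a rectified (censored) normal distribution rather than a truncated one: the probability mass below 0 is moved to the point 0 instead of being discarded and renormalized. Its mean has the closed form: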
$$
\mathbb{E}[X_+] = \sigma \phi\left(\frac{-\mu}{\sigma}\right) + \mu \Phi\left(\frac{\mu}{\sigma}\right)
$$
where:

$\phi(z)$ is the probability density function (PDF) of the standard normal distribution: $\phi(z) = \frac{1}{\sqrt{2\pi}}e^{-z^2/2}$
$\Phi(z)$ is the cumulative distribution function (CDF) of the standard normal distribution: $\Phi(z) = \int_{-\infty}^z \phi(t)dt$
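The mean formula above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `relu_mean` is hypothetical, and it assumes $\sigma > 0$. It uses `statistics.NormalDist` from the standard library for $\phi$ and $\Phi$.

```python
from statistics import NormalDist

# Standard normal distribution: supplies phi (pdf) and Phi (cdf).
_STD_NORMAL = NormalDist()


def relu_mean(mu: float, sigma: float) -> float:
    """E[max(0, X)] for X ~ N(mu, sigma^2); assumes sigma > 0."""
    z = mu / sigma
    # phi is symmetric, so phi(-mu/sigma) equals phi(mu/sigma).
    return sigma * _STD_NORMAL.pdf(z) + mu * _STD_NORMAL.cdf(z)


print(relu_mean(0.0, 1.0))   # equals 1 / sqrt(2 * pi)
print(relu_mean(5.0, 1.0))   # close to 5: ReLU almost never clips
print(relu_mean(-5.0, 1.0))  # close to 0: ReLU almost always clips
```

The two extreme calls illustrate the limiting behaviour: for $\mu \gg \sigma$ the ReLU is nearly the identity, and for $\mu \ll -\sigma$ the output is nearly always 0.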

Variance $\text{Var}(X_+)$

The variance is $\text{Var}(X_+) = \mathbb{E}[X_+^2] - (\mathbb{E}[X_+])^2$. To find $\mathbb{E}[X_+^2]$, integrate $x^2$ against the normal density over $x > 0$, which simplifies to:
$$
\mathbb{E}[X_+^2] = \mu \sigma \phi\left(\frac{\mu}{\sigma}\right) + (\mu^2 + \sigma^2)\Phi\left(\frac{\mu}{\sigma}\right)
$$
Substitute this into the variance formula to get the final result.
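As a sketch of that substitution, the snippet below returns the mean, the second moment, and the variance together. The name `relu_moments` is an assumption made for illustration, and $\sigma > 0$ is assumed.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def relu_moments(mu: float, sigma: float):
    """Return (mean, second moment, variance) of max(0, X), X ~ N(mu, sigma^2)."""
    z = mu / sigma
    pdf = _STD_NORMAL.pdf(z)
    cdf = _STD_NORMAL.cdf(z)
    mean = sigma * pdf + mu * cdf
    second = mu * sigma * pdf + (mu * mu + sigma * sigma) * cdf
    # Var(X_+) = E[X_+^2] - (E[X_+])^2
    return mean, second, second - mean * mean


mean, second, var = relu_moments(0.0, 2.0)
print(mean, second, var)
```

For strongly negative $\mu/\sigma$ the final subtraction involves two very small numbers, so a production version might need extra care with floating-point cancellation.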

Simplification for Zero-Mean Input

A common scenario in deep learning (e.g., after batch normalization) is $\mu = 0$. Here, the formulas simplify drastically:

$\mathbb{E}[X_+] = \sigma \phi(0) = \frac{\sigma}{\sqrt{2\pi}} \approx 0.3989\sigma$
$\mathbb{E}[X_+^2] = \sigma^2 \Phi(0) = \frac{\sigma^2}{2}$
$\text{Var}(X_+) = \frac{\sigma^2}{2} - \left(\frac{\sigma}{\sqrt{2\pi}}\right)^2 = \sigma^2\left(\frac{1}{2} - \frac{1}{2\pi}\right) \approx 0.3408\sigma^2$
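A quick Monte Carlo experiment is one way to sanity-check the zero-mean constants. The sketch below is illustrative only: the sample size, the seed, and $\sigma = 2$ are arbitrary choices, and the estimates agree with the closed forms only up to sampling noise.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
sigma = 2.0
n = 200_000

# Draw X ~ N(0, sigma^2) and apply ReLU.
samples = [max(0.0, rng.gauss(0.0, sigma)) for _ in range(n)]
mc_mean = sum(samples) / n
mc_var = sum((x - mc_mean) ** 2 for x in samples) / n

# Closed forms for the zero-mean case.
exact_mean = sigma / math.sqrt(2 * math.pi)
exact_var = sigma ** 2 * (0.5 - 1 / (2 * math.pi))

print(f"mean: Monte Carlo {mc_mean:.4f}, closed form {exact_mean:.4f}")
print(f"var:  Monte Carlo {mc_var:.4f}, closed form {exact_var:.4f}")
```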

2. Multivariate Case ($d$-Dimensional)

Let $Z \sim \mathcal{N}(\mu, \Sigma)$ where $Z \in \mathbb{R}^d$, and $Z_+ = (\max(0,Z_1), \max(0,Z_2), ..., \max(0,Z_d))^T$.

Mean Vector $\mathbb{E}[Z_+]$

Since ReLU acts element-wise, the mean vector is just the element-wise application of the univariate mean formula:
$$
\left(\mathbb{E}[Z_+]\right)_i = \sigma_{ii}^{1/2} \phi\left(\frac{-\mu_i}{\sigma_{ii}^{1/2}}\right) + \mu_i \Phi\left(\frac{\mu_i}{\sigma_{ii}^{1/2}}\right)
$$
where $\mu_i$ is the $i$-th element of $\mu$, and $\sigma_{ii}$ is the $i$-th diagonal entry of $\Sigma$ (the variance of $Z_i$).
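A minimal sketch of the mean vector follows, assuming $\mu$ is given as a list and $\Sigma$ as a nested list with positive diagonal entries. The function name `relu_mean_vector` is hypothetical. Only the diagonal of $\Sigma$ is read, which makes it explicit that correlations do not affect the mean.

```python
from statistics import NormalDist


def relu_mean_vector(mu, Sigma):
    """Element-wise E[max(0, Z_i)] for Z ~ N(mu, Sigma).

    mu is a sequence of length d, Sigma a d x d nested sequence.
    Only the diagonal entries Sigma[i][i] (the variances) are used.
    """
    std_normal = NormalDist()
    means = []
    for i, mu_i in enumerate(mu):
        sd_i = Sigma[i][i] ** 0.5
        z = mu_i / sd_i
        means.append(sd_i * std_normal.pdf(z) + mu_i * std_normal.cdf(z))
    return means


mu = [0.0, 0.0]
Sigma = [[1.0, 0.5],
         [0.5, 4.0]]
print(relu_mean_vector(mu, Sigma))
```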

Covariance Matrix $\text{Cov}(Z_+)$

The $(i,j)$-th entry of the covariance matrix is:
$$
\text{Cov}(Z_{+,i}, Z_{+,j}) = \mathbb{E}[Z_{+,i}Z_{+,j}] - \mathbb{E}[Z_{+,i}]\mathbb{E}[Z_{+,j}]
$$

Diagonal Entries ($i=j$)

This is just the univariate variance of $Z_{+,i}$, which we already covered in Section 1.

Off-Diagonal Entries ($i \neq j$)

We need to compute $\mathbb{E}[\max(0,Z_i)\max(0,Z_j)]$, which means integrating $z_i z_j$ against the bivariate normal density of $(Z_i, Z_j)$ over the positive quadrant $z_i > 0$, $z_j > 0$.

For general non-zero means $\mu_i, \mu_j$, there is no simple closed-form expression in elementary functions, so the expectation is typically evaluated by numerical integration or approximated. However, for the common zero-mean case ($\mu = 0$), we can derive a closed-form result:
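One possible way to carry out the numerical integration for non-zero means (a sketch chosen here for illustration, not a method prescribed by the original discussion) is to condition on $Z_i$. Given $Z_i = z$, $Z_j$ is normal with mean $\mu_j + \rho\,\frac{\sqrt{\sigma_{jj}}}{\sqrt{\sigma_{ii}}}(z - \mu_i)$ and standard deviation $\sqrt{\sigma_{jj}(1-\rho^2)}$, where $\rho = \sigma_{ij}/\sqrt{\sigma_{ii}\sigma_{jj}}$. The inner expectation is therefore the univariate ReLU mean, and only a one-dimensional integral over $z > 0$ remains. The names `rectified_mean` and `relu_cross_moment`, the $\pm 10$ standard deviation integration window, and the use of Simpson's rule are all assumptions of this sketch, which also requires $|\rho| < 1$.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def rectified_mean(m: float, s: float) -> float:
    """E[max(0, X)] for X ~ N(m, s^2), s > 0."""
    z = m / s
    return s * _STD_NORMAL.pdf(z) + m * _STD_NORMAL.cdf(z)


def relu_cross_moment(mu_i, mu_j, s_ii, s_jj, s_ij, n=2000):
    """Approximate E[max(0, Z_i) * max(0, Z_j)] with Simpson's rule.

    s_ii, s_jj are variances and s_ij the covariance; requires |rho| < 1.
    n is the number of Simpson intervals and must be even.
    """
    sd_i, sd_j = sqrt(s_ii), sqrt(s_jj)
    rho = s_ij / (sd_i * sd_j)
    cond_sd = sd_j * sqrt(1.0 - rho * rho)  # sd of Z_j given Z_i

    # Integrate over z > 0, restricted to where Z_i has non-negligible mass.
    lo = max(0.0, mu_i - 10.0 * sd_i)
    hi = mu_i + 10.0 * sd_i
    if hi <= lo:
        return 0.0  # Z_i is essentially never positive

    marginal = NormalDist(mu_i, sd_i)

    def integrand(z):
        cond_mean = mu_j + rho * sd_j / sd_i * (z - mu_i)
        return z * marginal.pdf(z) * rectified_mean(cond_mean, cond_sd)

    h = (hi - lo) / n
    total = integrand(lo) + integrand(hi)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    return total * h / 3.0


# With zero covariance the cross moment factorizes into a product of means.
print(relu_cross_moment(1.0, -0.5, 1.0, 4.0, 0.0))
print(rectified_mean(1.0, 1.0) * rectified_mean(-0.5, 2.0))
```

Subtracting the product of the two univariate means from the returned cross moment gives the off-diagonal covariance entry.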

Let $\rho = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$ (the correlation coefficient between $Z_i$ and $Z_j$). Then:
$$
\mathbb{E}[Z_{+,i}Z_{+,j}] = \frac{\sigma_{ij}}{4} + \frac{1}{2\pi}\left(\sigma_{ij}\arcsin\rho + \sqrt{\sigma_{ii}\sigma_{jj} - \sigma_{ij}^2}\right)
$$
Subtract the product of the means ($\mathbb{E}[Z_{+,i}]\mathbb{E}[Z_{+,j}] = \frac{\sqrt{\sigma_{ii}\sigma_{jj}}}{2\pi}$) to get the covariance:
$$
\text{Cov}(Z_{+,i}, Z_{+,j}) = \frac{\sigma_{ij}}{4} + \frac{1}{2\pi}\left(\sigma_{ij}\arcsin\rho + \sqrt{\sigma_{ii}\sigma_{jj} - \sigma_{ij}^2} - \sqrt{\sigma_{ii}\sigma_{jj}}\right)
$$
As a check, setting $i = j$ (so $\rho = 1$) recovers the univariate variance $\sigma_{ii}\left(\frac{1}{2} - \frac{1}{2\pi}\right)$, and $\rho = 0$ gives zero covariance, as expected for independent components.
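The zero-mean covariance can be sketched as follows. The code uses the expression $\text{Cov}(Z_{+,i}, Z_{+,j}) = \frac{\sigma_{ij}}{4} + \frac{1}{2\pi}\left(\sigma_{ij}\arcsin\rho + \sqrt{\sigma_{ii}\sigma_{jj} - \sigma_{ij}^2} - \sqrt{\sigma_{ii}\sigma_{jj}}\right)$, which reduces to the univariate variance $\sigma_{ii}\left(\frac{1}{2} - \frac{1}{2\pi}\right)$ on the diagonal. The function name `relu_covariance_zero_mean` is hypothetical, and $\Sigma$ is assumed to be a valid covariance matrix given as a nested list with positive diagonal entries.

```python
from math import asin, pi, sqrt


def relu_covariance_zero_mean(Sigma):
    """Covariance matrix of element-wise max(0, Z) for Z ~ N(0, Sigma)."""
    d = len(Sigma)
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(d):
            s_ii, s_jj, s_ij = Sigma[i][i], Sigma[j][j], Sigma[i][j]
            geo = sqrt(s_ii * s_jj)
            # Clamp to guard against rounding pushing rho outside [-1, 1].
            rho = max(-1.0, min(1.0, s_ij / geo))
            residual = sqrt(max(0.0, s_ii * s_jj - s_ij * s_ij))
            # Cross moment E[Z_{+,i} Z_{+,j}] for zero-mean inputs.
            cross = s_ij / 4.0 + (s_ij * asin(rho) + residual) / (2.0 * pi)
            # Subtract the product of means, sqrt(s_ii * s_jj) / (2 * pi).
            cov[i][j] = cross - geo / (2.0 * pi)
    return cov


Sigma = [[1.0, 0.5],
         [0.5, 1.0]]
for row in relu_covariance_zero_mean(Sigma):
    print(row)
```

Because the diagonal case is handled by the same expression with $\rho = 1$, no separate branch is needed for $i = j$.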

Key Context for Deep Learning

This calculation is central to analyzing how signals propagate through ReLU layers. When the CLT justifies treating layer inputs as approximately normal (e.g., in deep networks where each input sums many roughly independent terms), these formulas give the mean and covariance of the ReLU outputs, which helps with tasks such as analyzing gradient flow, designing normalization layers, or approximating the behavior of the network.

The question behind this post comes from Stack Exchange, where it was asked by Danica.