How do I initialize weights in PyTorch? How do I initialize a network's weights and biases with He or Xavier initialization?

阿华AIGC实验室

2026-5-20

Great question! Initializing weights properly is crucial for training deep neural networks efficiently: bad initialization can lead to vanishing or exploding gradients, or to slow convergence. Let's walk through the common methods in PyTorch, including He and Xavier initialization, with practical code examples you can use right away.

Weight Initialization in PyTorch

1. Basic Manual Initialization

If you want full control, you can directly modify the weight and bias tensors of your layers. PyTorch provides handy in-place functions for common distributions:

For a single linear layer:

import torch
import torch.nn as nn

# Create a linear layer (input dim 10, output dim 20)
linear_layer = nn.Linear(10, 20)

# Modify the parameters in place; no_grad() keeps autograd from tracking the update
with torch.no_grad():
    # Initialize weights from a normal distribution (mean=0, std=0.01)
    linear_layer.weight.normal_(mean=0.0, std=0.01)
    # Initialize biases to zero
    linear_layer.bias.zero_()

For an entire model (iterating over all layers):

class SimpleNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, 10)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)

model = SimpleNN()

# Loop through all parameters and initialize
for name, param in model.named_parameters():
    if 'weight' in name:
        # Normal distribution for weights
        nn.init.normal_(param, mean=0, std=0.01)
    elif 'bias' in name:
        # Zero out biases
        nn.init.constant_(param, 0)
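
A quick way to confirm that a loop like this did what you expected is to print per-parameter statistics afterwards. The snippet below is a minimal sketch: it uses a hypothetical nn.Sequential stand-in with the same layer sizes as SimpleNN, so it runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the check is repeatable

# Hypothetical stand-in with the same layer sizes as SimpleNN
model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 10))

for name, param in model.named_parameters():
    if name.endswith("weight"):
        nn.init.normal_(param, mean=0.0, std=0.01)
    elif name.endswith("bias"):
        nn.init.zeros_(param)

# Collect (mean, std) for every parameter tensor
stats = {
    name: (p.mean().item(), p.std().item())
    for name, p in model.named_parameters()
}
for name, (mean, std) in stats.items():
    print(f"{name}: mean={mean:+.4f}, std={std:.4f}")
```

The weight tensors should report a standard deviation close to 0.01, and the bias tensors should report exactly zero for both values.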

2. Xavier Initialization (Glorot Initialization)

Xavier initialization is designed to keep the variance of activations roughly the same across layers, which works well with sigmoid or tanh activations. PyTorch has two built-in functions for this: xavier_normal_ (normal distribution) and xavier_uniform_ (uniform distribution).

Key Details:

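It scales the distribution by the input and output feature counts together: xavier_normal_ uses std = gain * sqrt(2 / (fan_in + fan_out)), and xavier_uniform_ samples from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)).
Use nn.init.calculate_gain() to adjust for the activation function's gain (e.g., tanh has a gain of 5/3, while sigmoid has a gain of 1).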
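
To make the scaling concrete, here is a small sketch that computes the target standard deviation and uniform bound by hand for a 10-to-20 linear layer with the tanh gain, then checks that xavier_uniform_ stays inside that bound. The layer size and the seed are illustrative choices.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

fan_in, fan_out = 10, 20                 # matches nn.Linear(10, 20)
gain = nn.init.calculate_gain("tanh")    # 5/3

# Target standard deviation for xavier_normal_
expected_std = gain * math.sqrt(2.0 / (fan_in + fan_out))
# Target bound for xavier_uniform_: samples come from U(-bound, bound)
expected_bound = gain * math.sqrt(6.0 / (fan_in + fan_out))

layer = nn.Linear(fan_in, fan_out)
nn.init.xavier_uniform_(layer.weight, gain=gain)
max_abs = layer.weight.abs().max().item()

print(f"expected std:   {expected_std:.4f}")
print(f"expected bound: {expected_bound:.4f}")
print(f"largest |w| after xavier_uniform_: {max_abs:.4f}")
```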

Code Example:

# Initialize a single layer with Xavier normal
nn.init.xavier_normal_(linear_layer.weight, gain=nn.init.calculate_gain('tanh'))

# Or use uniform variant
nn.init.xavier_uniform_(linear_layer.weight, gain=nn.init.calculate_gain('sigmoid'))

Applying to a Model:

You can initialize layers directly in the model's __init__ method:

class XavierNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, 10)
        self.tanh = nn.Tanh()

        # Xavier init; fc1 feeds tanh, so it uses the tanh gain
        nn.init.xavier_normal_(self.fc1.weight, gain=nn.init.calculate_gain('tanh'))
        # fc2 is the output layer with no activation after it, so the default gain of 1 applies
        nn.init.xavier_normal_(self.fc2.weight)
        # Set biases to zero (nn.Linear does not zero them by default)
        nn.init.constant_(self.fc1.bias, 0)
        nn.init.constant_(self.fc2.bias, 0)

    def forward(self, x):
        x = self.tanh(self.fc1(x))
        return self.fc2(x)

3. He Initialization (Kaiming Initialization)

He initialization is optimized for ReLU and its variants (like LeakyReLU). It accounts for the fact that ReLU zeros out half the activations, so it scales the distribution using only the input feature count (by default).

Key Details:

Use mode='fan_in' (default) to preserve the variance of activations in the forward pass, or mode='fan_out' to preserve the variance of gradients in the backward pass.
Set the nonlinearity parameter to match your activation function ('relu' or 'leaky_relu'); for 'leaky_relu', also pass the negative slope as a.
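
As a rough illustration of how these options change the scale, kaiming_normal_ draws from a normal distribution with std = gain / sqrt(fan), where fan is fan_in or fan_out depending on mode. The sketch below computes both values for an illustrative 512 x 256 weight and compares the fan_in value with what PyTorch actually samples.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the empirical check is repeatable

relu_gain = nn.init.calculate_gain("relu")               # sqrt(2)
leaky_gain = nn.init.calculate_gain("leaky_relu", 0.01)  # sqrt(2 / (1 + 0.01 ** 2))

# A weight of shape (out_features, in_features) = (512, 256)
weight = torch.empty(512, 256)
fan_in, fan_out = weight.shape[1], weight.shape[0]

std_fan_in = relu_gain / math.sqrt(fan_in)    # target std when mode='fan_in'
std_fan_out = relu_gain / math.sqrt(fan_out)  # target std when mode='fan_out'

nn.init.kaiming_normal_(weight, mode="fan_in", nonlinearity="relu")
measured_std = weight.std().item()

print(f"relu gain={relu_gain:.4f}, leaky_relu(0.01) gain={leaky_gain:.4f}")
print(f"target std: fan_in={std_fan_in:.4f}, fan_out={std_fan_out:.4f}")
print(f"measured std with mode='fan_in': {measured_std:.4f}")
```

For a layer with more outputs than inputs, as here, mode='fan_out' gives the smaller standard deviation.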

Code Example:

# He normal init for ReLU
nn.init.kaiming_normal_(linear_layer.weight, mode='fan_in', nonlinearity='relu')

# For LeakyReLU, specify the negative slope (a=0.01 is common)
nn.init.kaiming_uniform_(linear_layer.weight, mode='fan_in', nonlinearity='leaky_relu', a=0.01)

Applying to a Model:

class HeNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(64, 32)
        self.fc2 = nn.Linear(32, 10)
        self.relu = nn.ReLU()

        # He init for ReLU layers
        nn.init.kaiming_normal_(self.fc1.weight, nonlinearity='relu')
        nn.init.kaiming_normal_(self.fc2.weight, nonlinearity='relu')
        # Biases initialized to zero
        nn.init.constant_(self.fc1.bias, 0)
        nn.init.constant_(self.fc2.bias, 0)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        return self.fc2(x)
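
To see why the scale matters for ReLU networks, the following sketch pushes random input through a stack of 20 Linear + ReLU layers and compares the spread of the final activations under a small fixed-std normal init and under He init. The depth, width, batch size, and helper name are arbitrary choices for illustration, not a benchmark.

```python
import torch
import torch.nn as nn

def final_activation_std(init_fn, depth=20, width=256, seed=0):
    """Push random input through `depth` Linear+ReLU layers; return the output std."""
    torch.manual_seed(seed)
    x = torch.randn(1024, width)
    with torch.no_grad():
        for _ in range(depth):
            layer = nn.Linear(width, width)
            init_fn(layer.weight)          # initialize the weight in place
            nn.init.zeros_(layer.bias)
            x = torch.relu(layer(x))
    return x.std().item()

small_std = final_activation_std(
    lambda w: nn.init.normal_(w, mean=0.0, std=0.01)
)
he_std = final_activation_std(
    lambda w: nn.init.kaiming_normal_(w, nonlinearity="relu")
)

print(f"std=0.01 normal init: final activation std = {small_std:.3e}")
print(f"He (Kaiming) init:    final activation std = {he_std:.3e}")
```

With the small fixed std, the activations shrink towards zero as depth grows, while He init keeps them at a usable scale.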

4. Custom Initialization Functions

For unique use cases, you can define a custom initialization function and apply it to your entire model with model.apply():

def custom_weight_init(m):
    # Apply only to linear layers
    if isinstance(m, nn.Linear):
        # Custom normal init scaled by input features
        # (torch.sqrt only accepts tensors, so take the root of the Python float with ** 0.5)
        nn.init.normal_(m.weight, mean=0.0, std=(1.0 / m.in_features) ** 0.5)
        # Small constant for biases (instead of zero)
        nn.init.constant_(m.bias, 0.01)

# Apply to model
model = SimpleNN()
model.apply(custom_weight_init)
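
model.apply() calls the function on every submodule recursively, so one function can dispatch on layer type. The sketch below assumes a hypothetical nested model with a convolution and a linear layer; it also guards against layers created with bias=False, which have no bias tensor to initialize.

```python
import torch
import torch.nn as nn

def init_by_layer_type(m):
    # Called once for every submodule, including nested ones
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
    elif isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
    else:
        return  # leave activations, containers, etc. untouched
    if m.bias is not None:  # layers built with bias=False have no bias
        nn.init.zeros_(m.bias)

# Hypothetical nested model, used only to show the recursion
net = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 8, kernel_size=3, bias=False), nn.ReLU()),
    nn.Flatten(),
    nn.Linear(8 * 6 * 6, 10),  # 8x8 input -> 6x6 after the 3x3 convolution
)
net.apply(init_by_layer_type)

out = net(torch.randn(2, 3, 8, 8))
print(out.shape)
```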

Quick Notes on Bias Initialization

In most cases, initializing biases to 0 is safe and effective: unlike weights, biases do not need random values to break the symmetry between units, and they do not scale the signal passed from layer to layer.
For some edge cases (e.g., output layers for imbalanced classification), you might initialize biases to a small non-zero value, but this is rare.
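
One way the imbalanced-classification case is sometimes handled is to set the output bias so that the initial predicted probability matches the class prior. The sketch below assumes a binary classifier with a sigmoid output and a made-up positive rate of 1%; both the layer and the rate are hypothetical.

```python
import math
import torch
import torch.nn as nn

positive_rate = 0.01  # assumed fraction of positive examples (illustrative)

head = nn.Linear(32, 1)  # hypothetical binary-classification output layer
# Choose the bias b so that sigmoid(b) equals the positive rate
nn.init.constant_(head.bias, math.log(positive_rate / (1.0 - positive_rate)))

# Probability the layer predicts from the bias alone (i.e. for an all-zero input)
with torch.no_grad():
    initial_prob = torch.sigmoid(head.bias).item()

print(f"bias = {head.bias.item():.4f}, initial probability = {initial_prob:.4f}")
```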

The question in this post originally comes from Stack Exchange, where it was asked by Fábio Perez.