How do I initialize weights in PyTorch? How do I initialize a network's weights and biases with He or Xavier initialization?
Great question! Initializing weights properly is crucial for training deep neural networks efficiently: bad initialization can lead to vanishing or exploding gradients, or to slow convergence. Let's walk through the common methods in PyTorch, including He and Xavier initialization, with practical code examples you can use right away.
1. Basic Manual Initialization
If you want full control, you can directly modify the weight and bias tensors of your layers. PyTorch provides handy in-place functions for common distributions:
- For a single linear layer:
    import torch
    import torch.nn as nn

    # Create a linear layer (input dim 10, output dim 20)
    linear_layer = nn.Linear(10, 20)

    # In-place tensor methods; no_grad() keeps autograd from tracking the update
    with torch.no_grad():
        # Initialize weights with a normal distribution (mean=0, std=0.01)
        linear_layer.weight.normal_(mean=0.0, std=0.01)
        # Initialize biases to zero
        linear_layer.bias.zero_()
- For an entire model (iterating over all layers):
    class SimpleNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(64, 32)
            self.fc2 = nn.Linear(32, 10)
            self.relu = nn.ReLU()

        def forward(self, x):
            x = self.relu(self.fc1(x))
            return self.fc2(x)

    model = SimpleNN()

    # Loop through all parameters and initialize
    for name, param in model.named_parameters():
        if 'weight' in name:
            # Normal distribution for weights
            nn.init.normal_(param, mean=0, std=0.01)
        elif 'bias' in name:
            # Zero out biases
            nn.init.constant_(param, 0)
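One caveat with matching on parameter names: normalization layers such as nn.BatchNorm1d also expose parameters called weight and bias, so a name-based loop would overwrite their scale as well. A minimal sketch of filtering by module type instead, assuming a hypothetical model that mixes linear and batch-norm layers (the model here is made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model that mixes linear and normalization layers
model = nn.Sequential(
    nn.Linear(64, 32),
    nn.BatchNorm1d(32),
    nn.ReLU(),
    nn.Linear(32, 10),
)

# Filter by module type so only nn.Linear parameters are touched
for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.01)
        nn.init.zeros_(module.bias)

# The BatchNorm scale is not modified by the loop above
bn = model[1]
print(bn.weight.detach().unique())
```

Whether you want to leave normalization layers alone depends on the model; the point is only that type checks make the choice explicit.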
2. Xavier Initialization (Glorot Initialization)
Xavier initialization is designed to keep the variance of activations roughly the same across layers, which works well with sigmoid or tanh activations. PyTorch has two built-in functions for this: xavier_normal_ (normal distribution) and xavier_uniform_ (uniform distribution).
Key Details:
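- The scale is set from the average of the input and output feature counts: xavier_normal_ draws from a normal distribution with std = gain * sqrt(2 / (fan_in + fan_out)), and xavier_uniform_ draws from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)).
- Use nn.init.calculate_gain() to adjust for the activation function's gain (e.g., tanh has a gain of 5/3, sigmoid a gain of 1).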
Code Example:
    # Initialize a single layer with Xavier normal
    nn.init.xavier_normal_(linear_layer.weight, gain=nn.init.calculate_gain('tanh'))

    # Or use the uniform variant (this overwrites the values set above)
    nn.init.xavier_uniform_(linear_layer.weight, gain=nn.init.calculate_gain('sigmoid'))
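If you want to sanity-check what these calls do, here is a small sketch that compares the sampled weights against the documented scales, std = gain * sqrt(2 / (fan_in + fan_out)) for the normal variant and bound = gain * sqrt(6 / (fan_in + fan_out)) for the uniform one. The layer sizes are arbitrary values chosen for illustration:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

fan_in, fan_out = 512, 256  # arbitrary sizes for illustration
layer = nn.Linear(fan_in, fan_out)
gain = nn.init.calculate_gain("tanh")

# Normal variant: compare the empirical std with the formula
nn.init.xavier_normal_(layer.weight, gain=gain)
expected_std = gain * math.sqrt(2.0 / (fan_in + fan_out))
measured_std = layer.weight.std().item()
print(f"expected std:   {expected_std:.4f}, measured std: {measured_std:.4f}")

# Uniform variant: every sample should lie within [-bound, bound]
nn.init.xavier_uniform_(layer.weight, gain=gain)
expected_bound = gain * math.sqrt(6.0 / (fan_in + fan_out))
max_abs = layer.weight.abs().max().item()
print(f"expected bound: {expected_bound:.4f}, max |w|:      {max_abs:.4f}")
```

The measured std will not match exactly because it is estimated from a finite sample, but it should be close for a layer of this size.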
Applying to a Model:
You can initialize layers directly in the model's __init__ method:
    class XavierNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(64, 32)
            self.fc2 = nn.Linear(32, 10)
            self.tanh = nn.Tanh()

            # Xavier init for tanh-activated layers
            nn.init.xavier_normal_(self.fc1.weight, gain=nn.init.calculate_gain('tanh'))
            nn.init.xavier_normal_(self.fc2.weight, gain=nn.init.calculate_gain('tanh'))

            # Biases stay at zero (standard practice)
            nn.init.constant_(self.fc1.bias, 0)
            nn.init.constant_(self.fc2.bias, 0)

        def forward(self, x):
            x = self.tanh(self.fc1(x))
            return self.fc2(x)
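To see why keeping the activation variance stable matters, here is a rough experiment (a sketch, not a benchmark) that pushes random data through a stack of tanh layers and reports the standard deviation of the final activations. The helper final_activation_std and the depth, width, and batch values are made up for illustration:

```python
import torch
import torch.nn as nn

def final_activation_std(init_fn, depth=10, width=256, batch=1024):
    """Push random data through `depth` tanh layers and return the output std."""
    torch.manual_seed(0)
    x = torch.randn(batch, width)
    with torch.no_grad():
        for _ in range(depth):
            layer = nn.Linear(width, width)
            init_fn(layer.weight)       # weight init under test
            nn.init.zeros_(layer.bias)  # zero biases in both cases
            x = torch.tanh(layer(x))
    return x.std().item()

tanh_gain = nn.init.calculate_gain("tanh")
small_std = final_activation_std(lambda w: nn.init.normal_(w, mean=0.0, std=0.01))
xavier_std = final_activation_std(lambda w: nn.init.xavier_normal_(w, gain=tanh_gain))

print(f"normal(std=0.01): {small_std:.2e}")
print(f"xavier_normal_:   {xavier_std:.2e}")
```

With the small fixed std the signal shrinks at every layer and is close to zero after ten layers, while the Xavier-initialized stack keeps activations at a usable scale.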
3. He Initialization (Kaiming Initialization)
He initialization is designed for ReLU and its variants (like LeakyReLU). ReLU zeros out roughly half of the pre-activations, which halves the mean squared activation; He initialization compensates with a gain of sqrt(2). By default it scales by the input feature count only: kaiming_normal_ uses std = gain / sqrt(fan_in).
Key Details:
- Use mode='fan_in' (default) to preserve the variance of activations in the forward pass, or mode='fan_out' to preserve the variance of gradients in the backward pass.
- Specify the nonlinearity parameter to match your activation function (e.g., 'leaky_relu' also uses the negative slope a, which defaults to 0).
Code Example:
    # He normal init for ReLU
    nn.init.kaiming_normal_(linear_layer.weight, mode='fan_in', nonlinearity='relu')

    # For LeakyReLU, specify the negative slope (a=0.01 is common)
    nn.init.kaiming_uniform_(linear_layer.weight, mode='fan_in', nonlinearity='leaky_relu', a=0.01)
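As a quick check of what mode changes, the sketch below initializes the same layer with both modes and compares the empirical std with gain / sqrt(fan), using the ReLU gain of sqrt(2). It also prints the gain PyTorch reports for LeakyReLU with a=0.01. The layer sizes are arbitrary:

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

# Weight shape is (out_features, in_features) = (128, 512)
layer = nn.Linear(512, 128)

nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
std_fan_in = layer.weight.std().item()

nn.init.kaiming_normal_(layer.weight, mode="fan_out", nonlinearity="relu")
std_fan_out = layer.weight.std().item()

print(f"fan_in : measured {std_fan_in:.4f}, formula {math.sqrt(2 / 512):.4f}")
print(f"fan_out: measured {std_fan_out:.4f}, formula {math.sqrt(2 / 128):.4f}")

# Gain for LeakyReLU with negative slope 0.01: sqrt(2 / (1 + a^2))
leaky_gain = nn.init.calculate_gain("leaky_relu", 0.01)
print(f"leaky_relu gain: {leaky_gain:.4f}")
```

For a layer that shrinks the feature dimension like this one, fan_out gives noticeably larger weights than fan_in, so the choice is not cosmetic.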
Applying to a Model:
    class HeNN(nn.Module):
        def __init__(self):
            super().__init__()
            self.fc1 = nn.Linear(64, 32)
            self.fc2 = nn.Linear(32, 10)
            self.relu = nn.ReLU()

            # He init for ReLU layers
            nn.init.kaiming_normal_(self.fc1.weight, nonlinearity='relu')
            nn.init.kaiming_normal_(self.fc2.weight, nonlinearity='relu')

            # Biases initialized to zero
            nn.init.constant_(self.fc1.bias, 0)
            nn.init.constant_(self.fc2.bias, 0)

        def forward(self, x):
            x = self.relu(self.fc1(x))
            return self.fc2(x)
4. Custom Initialization Functions
For unique use cases, you can define a custom initialization function and apply it to your entire model with model.apply():
    import math

    def custom_weight_init(m):
        # Apply only to linear layers
        if isinstance(m, nn.Linear):
            # Custom normal init scaled by input features
            # (math.sqrt, because torch.sqrt only accepts tensors)
            nn.init.normal_(m.weight, mean=0.0, std=math.sqrt(1 / m.in_features))
            # Small constant for biases (instead of zero)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0.01)

    # Apply to every submodule of the model
    model = SimpleNN()
    model.apply(custom_weight_init)
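Because model.apply() visits every submodule recursively, one function can cover several layer types, including nested containers. A minimal sketch, assuming a made-up conv-plus-linear model and a hypothetical helper named init_weights:

```python
import torch
import torch.nn as nn

def init_weights(m):
    # Hypothetical helper: He init for conv/linear weights, zero biases
    if isinstance(m, (nn.Linear, nn.Conv2d)):
        nn.init.kaiming_normal_(m.weight, mode="fan_in", nonlinearity="relu")
        if m.bias is not None:  # layers built with bias=False have no bias
            nn.init.zeros_(m.bias)

torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, bias=False),  # 8x8 input -> 6x6 feature map
    nn.ReLU(),
    nn.Flatten(),
    nn.Sequential(nn.Linear(8 * 6 * 6, 16), nn.ReLU(), nn.Linear(16, 4)),
)
model.apply(init_weights)  # also reaches the layers inside the nested Sequential

out = model(torch.randn(2, 3, 8, 8))
print(out.shape)
```

The bias check matters in practice: without it, the function would fail on any layer created with bias=False.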
Quick Notes on Bias Initialization
- In most cases, initializing biases to 0 is safe and effective: randomly initialized weights already break the symmetry between units, and biases have a smaller impact on gradient flow than weights.
- For some edge cases (e.g., output layers for imbalanced classification), you might initialize biases to a small non-zero value, but this is rare.
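
For the imbalanced-classification case, one commonly used choice (not part of the original answer, so treat it as an assumption) is to set the output bias to the log-odds of the positive-class frequency, so the untrained model starts out predicting the class prior. A minimal sketch with an illustrative frequency of 5% and a hypothetical binary output layer:

```python
import math
import torch
import torch.nn as nn

# Assumed positive-class frequency in the training set (illustrative value)
pos_fraction = 0.05

# Hypothetical binary-classification output layer
head = nn.Linear(32, 1)

# Zero weights are used here only to isolate the effect of the bias
nn.init.zeros_(head.weight)
nn.init.constant_(head.bias, math.log(pos_fraction / (1 - pos_fraction)))

with torch.no_grad():
    prob = torch.sigmoid(head(torch.randn(4, 32)))

# Each predicted probability equals sigmoid(bias), i.e. the class prior
print(prob.squeeze(1))
```

In a real model you would keep a normal weight initialization for the head and set only the bias this way.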
This question comes from Stack Exchange; it was asked by Fábio Perez.




