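Troubleshooting NaN loss at high learning rates when training a fixed-architecture MNIST model in PyTorch, and advice on carrying out the assignment
Question: with a fixed set of learning rates, how to deal with the training loss becoming NaN at the high rates (1 and 10)
I need to run an experiment: train a fixed-architecture PyTorch model for 10 epochs with each learning rate in {0.01, 0.1, 1, 10} and compare the loss curves. With a learning rate of 1, however, the loss turns into NaN during training, and I expect a learning rate of 10 to be even worse. I am not allowed to change the model architecture and I must use the specified learning rates. How can I solve this?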
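My experiment setup
- Model architecture: ModifiedNet (two linear layers, no activation function)
- Dataset: MNIST (input dimension 784, 10 output classes)
- Loss function: CrossEntropyLoss
- Optimizer: SGD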
Model code
    import torch
    import torch.nn as nn


    class ModifiedNet(nn.Module):
        def __init__(self, num_inputs, num_outputs):
            super(ModifiedNet, self).__init__()
            # Store the input size so forward() does not depend on a global variable
            self.num_inputs = num_inputs
            self.linear = nn.Linear(num_inputs, 1000)
            self.linear2 = nn.Linear(in_features=1000, out_features=num_outputs)

        def forward(self, input):
            input = input.view(-1, self.num_inputs)  # reshape input to batch x num_inputs
            output = self.linear(input)
            output = self.linear2(output)
            return output
Training and test functions
    import numpy as np
    import torch
    import torch.nn.functional as F
    from torch.autograd import Variable
    from torch.utils.data import DataLoader
    from torchvision.datasets import MNIST
    from torchvision.transforms import ToTensor

    # Assumes train_loader and test_loader are defined beforehand
    # train_loader = DataLoader(MNIST('./data', train=True, download=True, transform=ToTensor()), batch_size=64, shuffle=True)
    # test_loader = DataLoader(MNIST('./data', train=False, download=True, transform=ToTensor()), batch_size=1000, shuffle=False)


    def train(epoch, network, optimizer=None):
        losses = list()
        network.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            # Variable is a no-op wrapper since PyTorch 0.4; kept as in the original
            data, target = Variable(data), Variable(target)
            if optimizer is not None:
                optimizer.zero_grad()
            output = network(data)
            loss = F.cross_entropy(output, target).to(torch.float64)
            losses.append(loss.item())
            loss.backward()
            if optimizer is not None:
                optimizer.step()
            if batch_idx % 100 == 0:
                print('Train Epoch: {} [{}/{} ({:.0f}%)] Loss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.dataset),
                    100. * batch_idx / len(train_loader), loss.item()))
        return np.mean(np.array(losses))


    def test(network):
        network.eval()
        test_loss = 0
        correct = 0
        with torch.no_grad():  # gradients are not needed during evaluation
            for data, target in test_loader:
                output = network(data)
                # sum up batch loss
                test_loss += F.cross_entropy(output, target, reduction='sum').to(torch.double).item()
                pred = output.max(1, keepdim=True)[1]  # index of the largest logit
                correct += pred.eq(target.view_as(pred)).sum().item()
        test_loss /= len(test_loader.dataset)
        print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
            test_loss, correct, len(test_loader.dataset),
            100. * correct / len(test_loader.dataset)))
        return test_loss
Learning-rate loop
    import torch.optim as optim

    num_inputs = 784
    num_outputs = 10
    learning_rates = [0.01, 0.1, 1, 10]

    # Assumes plot_graph is already defined; it plots the loss curve
    # def plot_graph(x, y, x_label, y_label, title):
    #     import matplotlib.pyplot as plt
    #     plt.plot(x, y)
    #     plt.xlabel(x_label)
    #     plt.ylabel(y_label)
    #     plt.title(f"Training Loss - LR {title}")
    #     plt.show()

    for learning_rate in learning_rates:
        net = ModifiedNet(num_inputs, num_outputs)
        optimizer = optim.SGD(net.parameters(), lr=learning_rate)
        train_losses = dict()
        for epoch_idx in range(10):
            train_losses[epoch_idx] = train(epoch_idx, net, optimizer)
        plot_graph(list(train_losses.keys()), list(train_losses.values()),
                   "epoch", "train loss", str(learning_rate))
Root cause analysis
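The core cause of the NaN is that a high learning rate makes each parameter update too large, which leads to numerical instability:
- Your model has no activation function, so the two linear layers output unbounded logits directly, and their magnitude can grow very quickly;
- CrossEntropyLoss applies a log-softmax to the logits internally. PyTorch computes this in a numerically stable way, so large but finite logits give a large but finite loss. Once the weights or logits themselves overflow to inf, however, operations such as inf - inf produce NaN, and the loss becomes NaN;
- With a large learning rate, SGD pushes the parameters into numerically extreme regions, which makes the problem worse.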
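To make the overflow point concrete, here is a minimal pure-Python sketch. It assumes that PyTorch's cross-entropy behaves like a log-sum-exp with the maximum logit subtracted; `stable_cross_entropy` is a hypothetical helper written for this illustration, not a PyTorch function.

```python
import math


def stable_cross_entropy(logits, target):
    """Cross-entropy for one sample, using log-sum-exp with the max subtracted."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Large but finite logits: the loss is large but still finite.
print(stable_cross_entropy([1000.0, 0.0], target=1))        # 1000.0

# A logit that has already overflowed to inf: inf - inf gives nan.
print(stable_cross_entropy([float("inf"), 0.0], target=1))  # nan
```

If this matches what you see, the first NaN should appear only after some weight or logit has become inf, which is why the fixes below all try to keep the updates small.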
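Possible fixes (no change to the model architecture)
1. Use a smaller parameter initialization scale
Manually shrinking the initial scale of the parameters below the nn.Linear default makes the initial logits and gradients smaller, which can let the model survive a few more training steps at a large learning rate: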
    class ModifiedNet(nn.Module):
        def __init__(self, num_inputs, num_outputs):
            super(ModifiedNet, self).__init__()
            # Store the input size so forward() does not depend on a global variable
            self.num_inputs = num_inputs
            self.linear = nn.Linear(num_inputs, 1000)
            self.linear2 = nn.Linear(in_features=1000, out_features=num_outputs)
            # Initialize the parameters manually with a smaller standard deviation
            torch.nn.init.normal_(self.linear.weight, mean=0, std=1e-3)
            torch.nn.init.zeros_(self.linear.bias)
            torch.nn.init.normal_(self.linear2.weight, mean=0, std=1e-3)
            torch.nn.init.zeros_(self.linear2.bias)

        def forward(self, input):
            input = input.view(-1, self.num_inputs)
            output = self.linear(input)
            output = self.linear2(output)
            return output
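To put std=1e-3 in perspective, a rough calculation follows. It assumes that nn.Linear draws its default weights from U(-1/sqrt(fan_in), 1/sqrt(fan_in)), which I believe holds for recent PyTorch releases but is worth checking for your version. `default_linear_weight_std` is a hypothetical helper for this estimate only.

```python
import math


def default_linear_weight_std(fan_in):
    """Std of U(-b, b) with b = 1/sqrt(fan_in) (assumed nn.Linear default)."""
    bound = 1.0 / math.sqrt(fan_in)
    return bound / math.sqrt(3.0)


custom_std = 1e-3
for name, fan_in in [("linear", 784), ("linear2", 1000)]:
    default_std = default_linear_weight_std(fan_in)
    ratio = default_std / custom_std
    print(f"{name}: default std ~ {default_std:.4f}, about {ratio:.1f}x the custom std")
```

Under that assumption each layer starts with weights roughly 20 times smaller than the default. The logits depend on the product of both weight matrices, so ignoring biases they start out a few hundred times smaller.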
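2. Add gradient clipping
Limit the maximum gradient norm after backpropagation so that the parameters are not updated too aggressively.
Modify the train function to clip the gradients right after loss.backward(), and add a NaN check that skips invalid updates: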
    def train(epoch, network, optimizer=None):
        losses = list()
        network.train()
        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = Variable(data), Variable(target)
            if optimizer is not None:
                optimizer.zero_grad()
            output = network(data)
            loss = F.cross_entropy(output, target).to(torch.float64)
            # Detect NaN early to avoid an invalid update
            if torch.isnan(loss):
                print(f"NaN loss encountered at epoch {epoch}, batch {batch_idx}, skipping update")
                losses.append(np.nan)
                continue
            losses.append(loss.item())
            loss.backward()
            # Gradient clipping: cap the total gradient norm at 1.0
            torch.nn.utils.clip_grad_norm_(network.parameters(), max_norm=1.0)
            if optimizer is not None:
                optimizer.step()
            if batch_idx % 100 == 0:
                print('Train Epoch: {} [{}/{} ({:.0f}%)] Loss: {:.6f}'.format(
                    epoch, batch_idx * len(data), len(train_loader.dataset),
                    100. * batch_idx / len(train_loader), loss.item()))
        # Ignore NaN values when averaging the loss
        valid_losses = [l for l in losses if not np.isnan(l)]
        return np.mean(valid_losses) if valid_losses else np.nan
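For intuition about what norm clipping does, here is a small stdlib-only sketch on a flat list of gradient values. `clip_by_global_norm` is a hypothetical stand-in, not the PyTorch function, and the `eps` term is an assumption modeled on how I understand `clip_grad_norm_` to guard against division by zero.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale gradient values so that their overall L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


# A gradient with norm 5 is scaled down to norm ~1; its direction is unchanged.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)

# A gradient already inside the limit is left untouched.
print(clip_by_global_norm([0.3, 0.4], max_norm=1.0)[0])
```

With max_norm=1.0, plain SGD can move the whole parameter vector by at most lr * 1.0 per step, so a learning rate of 10 still allows steps of length 10. That is bounded but not small, so clipping may delay divergence rather than prevent it.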
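3. Learning-rate warm-up
For learning rates 1 and 10, train the first epoch with a much smaller learning rate so that the parameters first reach a relatively stable region, then switch to the target learning rate. This arguably does not violate the task requirements, since training still finishes with the specified learning rate. For simplicity, the loop below applies the same warm-up to every learning rate: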
    for learning_rate in learning_rates:
        net = ModifiedNet(num_inputs, num_outputs)
        optimizer = optim.SGD(net.parameters(), lr=learning_rate)
        train_losses = dict()
        for epoch_idx in range(10):
            if epoch_idx == 0:
                # Warm up the first epoch at 1% of the target learning rate
                for param_group in optimizer.param_groups:
                    param_group['lr'] = learning_rate * 0.01
            else:
                # Restore the target learning rate for the remaining epochs
                for param_group in optimizer.param_groups:
                    param_group['lr'] = learning_rate
            train_losses[epoch_idx] = train(epoch_idx, net, optimizer)
        plot_graph(list(train_losses.keys()), list(train_losses.values()),
                   "epoch", "train loss", str(learning_rate))
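The warm-up rule can also be pulled out into a small function, which makes the schedule easy to inspect before any training run. This is only a sketch; `lr_for_epoch` and its `warmup_epochs` and `warmup_factor` parameters are hypothetical names introduced here.

```python
def lr_for_epoch(target_lr, epoch_idx, warmup_epochs=1, warmup_factor=0.01):
    """Return the learning rate for a given epoch under step warm-up."""
    if epoch_idx < warmup_epochs:
        return target_lr * warmup_factor
    return target_lr


for target_lr in [0.01, 0.1, 1, 10]:
    schedule = [lr_for_epoch(target_lr, epoch_idx) for epoch_idx in range(10)]
    print(target_lr, schedule[:3])
```

When comparing the loss curves, keep in mind that the point for epoch 0 was produced at 1% of the nominal rate, so it is worth stating this on the plot or in the write-up.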
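Expected results
- With learning rate 1, training should avoid NaN. The loss curve will likely oscillate noticeably, because the learning rate is still on the large side, but the 10 epochs should complete;
- With learning rate 10, NaN may still appear, but training should at least run for more steps and give you some usable loss values for the comparison;
- A reasonable conclusion for the experiment: only the suitable learning rates (0.01 and 0.1) let the model converge effectively, while large learning rates make training unstable and can even cause numerical overflow.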
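The same pattern can be reproduced on a toy problem. The sketch below is not the MNIST model: it runs plain gradient descent on the one-dimensional quadratic f(w) = 0.5 * curvature * w**2, where the iteration is known to diverge once lr exceeds 2 / curvature. The function name and the default values are assumptions chosen for illustration.

```python
import math


def gradient_descent_1d(lr, curvature=1.0, w0=1.0, steps=400):
    """Run plain gradient descent on f(w) = 0.5 * curvature * w**2."""
    w = w0
    for _ in range(steps):
        grad = curvature * w
        w = w - lr * grad  # each step multiplies w by (1 - lr * curvature)
    return w


for lr in [0.01, 0.1, 1, 10]:
    print(lr, gradient_descent_1d(lr))
```

With these defaults the three smaller rates end at or near 0, while lr=10 multiplies w by -9 every step until it overflows to inf, after which inf - inf yields nan. In the real network the effective curvature is different, which is consistent with lr=1 already being too large there.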
The question comes from Stack Exchange; asked by woofwoof.




