
Troubleshooting NaN loss at high learning rates when training a fixed-architecture MNIST model in PyTorch

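Question: with a fixed set of learning rates, how to handle the NaN training loss caused by the high rates (1 and 10)

I need to run an experiment: train a fixed-architecture PyTorch model for 10 epochs at each learning rate in {0.01, 0.1, 1, 10} and compare the loss curves. With a learning rate of 1, the loss becomes NaN during training, and a learning rate of 10 will presumably be worse. The model architecture cannot be changed and the specified learning rates must be used. How can I work around this?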


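My experiment setup

  • Model architecture: ModifiedNet (two linear layers, no activation function)
  • Dataset: MNIST (input dimension 784, 10 output classes)
  • Loss function: CrossEntropyLoss
  • Optimizer: SGD

Model code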

import torch
import torch.nn as nn
class ModifiedNet(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(ModifiedNet, self).__init__()
        self.num_inputs = num_inputs  # stored so forward() does not rely on a global variable
        self.linear = nn.Linear(num_inputs, 1000)
        self.linear2 = nn.Linear(in_features=1000, out_features=num_outputs)
    def forward(self, input):
        input = input.view(-1, self.num_inputs) # reshape input to batch x num_inputs
        output = self.linear(input)
        output = self.linear2(output)
        return output

Training and test functions

import numpy as np
import torch.nn.functional as F
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torchvision.datasets import MNIST
from torchvision.transforms import ToTensor

# Assumes train_loader and test_loader have been defined beforehand, for example:
# train_loader = DataLoader(MNIST('./data', train=True, download=True, transform=ToTensor()), batch_size=64, shuffle=True)
# test_loader = DataLoader(MNIST('./data', train=False, download=True, transform=ToTensor()), batch_size=1000, shuffle=False)

def train(epoch, network, optimizer=None):
    losses = list()
    network.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = Variable(data), Variable(target)
        if optimizer is not None:
            optimizer.zero_grad()
        output = network(data)
        loss = F.cross_entropy(output, target).to(torch.float64)
        losses.append(loss.item())
        loss.backward()
        if optimizer is not None:
            optimizer.step()
        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]	Loss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))
    return np.mean(np.array(losses))

def test(network):
    network.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for data, target in test_loader:
            output = network(data)
            test_loss += F.cross_entropy(output, target, reduction='sum').item() # sum up batch loss
            pred = output.argmax(dim=1) # index of the largest logit
            correct += pred.eq(target).sum().item() # plain int, so it formats cleanly below
    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
    return test_loss

Learning rate loop

import torch.optim as optim

num_inputs = 784
num_outputs = 10
learning_rates = [0.01, 0.1, 1, 10]

# Assumes plot_graph is already defined; it draws the loss curve
# def plot_graph(x, y, x_label, y_label, title):
#     import matplotlib.pyplot as plt
#     plt.plot(x, y)
#     plt.xlabel(x_label)
#     plt.ylabel(y_label)
#     plt.title(f"Training Loss - LR {title}")
#     plt.show()

for learning_rate in learning_rates:
    net = ModifiedNet(num_inputs, num_outputs)
    optimizer = optim.SGD(net.parameters(), lr=learning_rate)
    train_losses = dict()
    for epoch_idx in range(10):
        train_losses[epoch_idx] = train(epoch_idx, net, optimizer)
    plot_graph(list(train_losses.keys()), list(train_losses.values()), "epoch", "train loss", str(learning_rate))

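Root cause analysis

The core cause of the NaN is that a high learning rate makes each parameter update too large, so training diverges and the values eventually leave the floating-point range:

  1. The model has no activation functions, so the two linear layers output unbounded logits that grow in proportion to the weights;
  2. CrossEntropyLoss applies a softmax (in PyTorch, a log-softmax) to the logits. A literal exp(very large value) would overflow to inf. PyTorch computes the log-softmax in a numerically stable way, so large but finite logits usually yield a large finite loss, and the NaN appears once the weights or logits themselves overflow to inf or become NaN (for example, inf - inf is NaN);
  3. With a large learning rate, each SGD step overshoots and pushes the parameters to ever more extreme values, which feeds back into the two points above.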
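
To make the overflow mechanism concrete, here is a minimal sketch in plain Python. It is not PyTorch's implementation; the function names and the example logits are made up for illustration. It compares a literal softmax cross-entropy with the log-sum-exp form that numerically stable implementations generally use, and shows that even the stable form returns NaN once a logit is already inf.

```python
import math

def naive_cross_entropy(logits, target):
    """Softmax cross-entropy computed literally; exp() overflows for large logits."""
    exps = [math.exp(z) for z in logits]  # math.exp raises OverflowError for very large z
    return -math.log(exps[target] / sum(exps))

def stable_cross_entropy(logits, target):
    """Same quantity via the log-sum-exp trick: subtract the max logit first."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

large = [1000.0, 0.0, -1000.0]  # hypothetical logits after a few oversized updates
try:
    naive_cross_entropy(large, 1)
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True

# Large but finite logits: the stable form still gives a finite loss.
stable_loss = stable_cross_entropy(large, 1)

# A logit that is already inf: inf - inf inside the computation yields NaN.
broken_loss = stable_cross_entropy([math.inf, 0.0], 1)

print(naive_overflowed, stable_loss, math.isnan(broken_loss))
```

The practical reading is that a stable loss function delays the failure but cannot prevent it: if the updates keep growing the weights, the logits eventually stop being finite.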

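Possible fixes (no change to the model architecture)

1. Shrink the parameter initialization scale

By default, nn.Linear draws its weights from U(-1/sqrt(in_features), 1/sqrt(in_features)). Initializing with a smaller scale makes the initial logits and weight gradients smaller, which can let the model survive a few more training steps at a large learning rate: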

class ModifiedNet(nn.Module):
    def __init__(self, num_inputs, num_outputs):
        super(ModifiedNet, self).__init__()
        self.num_inputs = num_inputs  # stored so forward() does not rely on a global variable
        self.linear = nn.Linear(num_inputs, 1000)
        self.linear2 = nn.Linear(in_features=1000, out_features=num_outputs)
        # Initialize the parameters manually with a smaller standard deviation
        torch.nn.init.normal_(self.linear.weight, mean=0, std=1e-3)
        torch.nn.init.zeros_(self.linear.bias)
        torch.nn.init.normal_(self.linear2.weight, mean=0, std=1e-3)
        torch.nn.init.zeros_(self.linear2.bias)
    def forward(self, input):
        input = input.view(-1, self.num_inputs)
        output = self.linear(input)
        output = self.linear2(output)
        return output
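
To put the std=1e-3 choice in perspective, the sketch below estimates the standard deviation of the default weights, assuming the documented nn.Linear default of U(-1/sqrt(in_features), 1/sqrt(in_features)). The helper name default_linear_weight_std is hypothetical, and the calculation uses only the standard library.

```python
import math

def default_linear_weight_std(in_features):
    """Std of U(-b, b) with b = 1/sqrt(in_features), the assumed nn.Linear default."""
    bound = 1.0 / math.sqrt(in_features)
    return bound / math.sqrt(3.0)  # std of a uniform distribution on [-b, b]

custom_std = 1e-3  # value passed to normal_ in the modified model above

for name, fan_in in [("linear", 784), ("linear2", 1000)]:
    default_std = default_linear_weight_std(fan_in)
    ratio = default_std / custom_std
    print(f"{name}: default std ~ {default_std:.4f}, custom init is ~{ratio:.0f}x smaller")
```

Under that assumption the custom initialization is roughly 20 times smaller than the default for both layers, which shrinks the initial logits but does not change how fast a large learning rate can grow the weights afterwards.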

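2. Add gradient clipping

Limit the maximum gradient norm after backpropagation so that no single update is extreme. Modify the train function to clip the gradients right after loss.backward(), and add a NaN check that skips invalid updates. Note that the check only records the failure and avoids a useless update: once the parameters themselves contain NaN, every later batch will also produce NaN.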

def train(epoch, network, optimizer=None):
    losses = list()
    network.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        if optimizer is not None:
            optimizer.zero_grad()
        output = network(data)
        loss = F.cross_entropy(output, target).to(torch.float64)

        # Detect NaN early and skip the update for this batch
        if torch.isnan(loss):
            print(f"NaN loss encountered at epoch {epoch}, batch {batch_idx}, skipping update")
            losses.append(np.nan)
            continue

        losses.append(loss.item())
        loss.backward()
        # Gradient clipping: cap the total gradient norm at 1.0
        torch.nn.utils.clip_grad_norm_(network.parameters(), max_norm=1.0)

        if optimizer is not None:
            optimizer.step()
        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item()))

    # Ignore NaN entries when averaging the loss
    valid_losses = [l for l in losses if not np.isnan(l)]
    return np.mean(valid_losses) if valid_losses else np.nan
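
The rescaling that norm-based clipping performs can be sketched in a few lines of plain Python. This illustrates the idea only and is not PyTorch's implementation; clip_by_global_norm and the sample gradient values are hypothetical.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# A gradient of norm 5 is scaled down to norm 1; its direction is unchanged.
clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

# With clipping, one SGD step moves the parameters by at most lr * max_norm.
max_step_lr_10 = 10 * 1.0

# Clipping does not repair a NaN gradient: the comparison with NaN is False.
nan_clipped, _ = clip_by_global_norm([math.nan, 1.0], max_norm=1.0)

print(norm_before, norm_after, max_step_lr_10, nan_clipped)
```

With max_norm=1.0 and a learning rate of 10, a single step can still move the parameters by a norm of up to 10, so clipping bounds the growth per step rather than guaranteeing convergence.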

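3. Learning rate warmup

For learning rates 1 and 10, train the first epoch with a much smaller learning rate so the parameters reach a relatively stable region, then switch to the target learning rate (arguably this does not violate the task requirements, since the rest of training still uses the specified learning rate). For simplicity, the loop below applies the warmup to every learning rate in the set: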

for learning_rate in learning_rates:
    net = ModifiedNet(num_inputs, num_outputs)
    optimizer = optim.SGD(net.parameters(), lr=learning_rate)
    train_losses = dict()
    for epoch_idx in range(10):
        # Warm up during the first epoch at 1% of the target learning rate
        if epoch_idx == 0:
            for param_group in optimizer.param_groups:
                param_group['lr'] = learning_rate * 0.01
        else:
            # Restore the target learning rate for the remaining epochs
            for param_group in optimizer.param_groups:
                param_group['lr'] = learning_rate
        train_losses[epoch_idx] = train(epoch_idx, net, optimizer)
    plot_graph(list(train_losses.keys()), list(train_losses.values()), "epoch", "train loss", str(learning_rate))
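
The loop above jumps from 1% to 100% of the target rate at the epoch boundary. A common variant, which is not part of the original answer, ramps the rate linearly per batch instead. The sketch below shows only the schedule; warmup_lr and warmup_steps are hypothetical names.

```python
def warmup_lr(step, target_lr, warmup_steps):
    """Linearly ramp from target_lr / warmup_steps up to target_lr, then hold it."""
    fraction = min(1.0, (step + 1) / warmup_steps)
    return target_lr * fraction

# Eight consecutive steps with a five-step ramp toward a target rate of 10.
schedule = [warmup_lr(s, target_lr=10.0, warmup_steps=5) for s in range(8)]
print(schedule)
```

To use it in the training loop, one would assign the returned value to param_group['lr'] for each group in optimizer.param_groups before calling optimizer.step(), using a step counter that keeps increasing across epochs.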

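Expected results

  • With learning rate 1, training should be able to avoid NaN. The loss curve will likely oscillate noticeably (the learning rate is still too large), but the 10 epochs can complete;
  • With learning rate 10, NaN may still occur, but training should at least run for more steps and yield partial loss data for the comparison;
  • The experiment can be summarized as follows: only suitable learning rates (0.01, 0.1) let the model converge effectively, while large learning rates make training unstable or even cause numerical overflow.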
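
The pattern in that summary can be reproduced on a toy problem. The sketch below runs plain gradient descent on the one-dimensional quadratic f(w) = 0.5 * curvature * w**2, where each step multiplies w by (1 - lr * curvature), so the iteration diverges once lr exceeds 2 / curvature. The curvature of 4.0 is a hypothetical value chosen so that this threshold (0.5) falls between 0.1 and 1; it is not derived from the MNIST model.

```python
import math

def run_gd(lr, curvature=4.0, w0=1.0, steps=400):
    """Gradient descent on f(w) = 0.5 * curvature * w**2; returns the final w."""
    w = w0
    for _ in range(steps):
        grad = curvature * w  # derivative of the quadratic at w
        w = w - lr * grad     # same update rule as plain SGD
    return w

for lr in [0.01, 0.1, 1, 10]:
    final_w = run_gd(lr)
    print(f"lr={lr}: final w = {final_w}, is NaN: {math.isnan(final_w)}")
```

In this toy setting the two small rates shrink w toward zero, lr=1 grows it to an enormous but finite value, and lr=10 overflows to inf and then turns into NaN, which matches the qualitative behavior described for the real experiment.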

The question comes from Stack Exchange and was asked by woofwoof.
