TensorFlow optimizer object not recognized: session.run call fails in a PPO implementation

阿华AIGC实验室

2026-5-13

Fixing the "optimizer is not defined" error and related issues in a PPO LSTM network

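First, the NameError: name 'optimizer' is not defined that you hit is a classic Python scoping problem. The optimizer you create inside __init__ is a local variable of that method, so it no longer exists once __init__ returns, and update, which has its own scope, cannot see it. To fix this, make optimizer (and every other graph component that update needs) an instance attribute by declaring it with the self. prefix.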
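The scoping rule can be reproduced without TensorFlow. The sketch below uses two toy classes, BrokenAgent and FixedAgent (both hypothetical names, with a plain string standing in for the real training op), to show that a local variable in __init__ is invisible to another method while an instance attribute is not.

```python
class BrokenAgent:
    def __init__(self):
        # Local variable: it is discarded when __init__ returns.
        optimizer = "train_op"

    def update(self):
        # 'optimizer' is not defined in this method's scope -> NameError.
        return optimizer


class FixedAgent:
    def __init__(self):
        # Instance attribute: stored on the object, visible to every method.
        self.optimizer = "train_op"

    def update(self):
        return self.optimizer


try:
    BrokenAgent().update()
    broken_error = None
except NameError as err:
    broken_error = str(err)

fixed_result = FixedAgent().update()
print(broken_error)
print(fixed_result)
```

The same reasoning applies to the placeholders and the loss: anything update must feed or run has to live on self.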

Next, I will walk through the fixes to your code step by step, and correct a few other latent logic errors along the way:

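1. Fix variable scope and the computation graph logic

In __init__, turn every TensorFlow component that update needs to access (the placeholders, the loss, and the optimizer) into an instance attribute. At the same time, build actor_loss and the entropy term inside the computation graph instead of feeding them in through external placeholders: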

import tensorflow as tf  # module-level import; this code uses the TensorFlow 1.x graph API

def __init__(self, input_size, output_size, session):
    """ input_size: dimension of the environment observation (4 for OpenAI CartPole)
         output_size: dimension of the action space (2 for OpenAI CartPole) """
    # The LSTM expects a 3D input tensor: [batch, time steps, features]
    self.env = tf.placeholder(dtype=tf.float32, shape=[1, None, input_size])

    # return_sequences=False keeps only the final time step's output, shape [1, 8]
    self.lstm1 = tf.keras.layers.LSTM(8, return_sequences=False)(self.env)
    # Softmax outputs the probability distribution over actions
    self.actor = tf.keras.layers.Dense(output_size, activation="softmax")(self.lstm1)
    self.critic = tf.keras.layers.Dense(1, activation=None)(self.lstm1)

    # Instance attributes, so that update can feed them
    self.return_ = tf.placeholder(dtype=tf.float32, shape=[None, 1])
    self.advantage = tf.placeholder(dtype=tf.float32, shape=[None, 1])
    self.old_log_prob = tf.placeholder(dtype=tf.float32, shape=[None, 1])
    # One-hot mask that selects the probability of the action actually taken
    self.action_mask = tf.placeholder(dtype=tf.float32, shape=[None, output_size])

    # Build the clipped PPO actor loss inside the graph
    # (1e-8 guards against log(0), matching the entropy term below)
    current_log_prob = tf.reduce_sum(tf.log(self.actor + 1e-8) * self.action_mask, axis=1, keepdims=True)
    ratio = tf.exp(current_log_prob - self.old_log_prob)
    clipped_ratio = tf.clip_by_value(ratio, 1 - 0.2, 1 + 0.2)
    actor_loss = -tf.reduce_mean(tf.minimum(ratio * self.advantage, clipped_ratio * self.advantage))

    # Compute the entropy of the action distribution inside the graph
    # (instead of computing it outside and feeding it in)
    entropy = tf.reduce_mean(-tf.reduce_sum(self.actor * tf.log(self.actor + 1e-8), axis=1))

    critic_loss = tf.reduce_mean(tf.square(self.return_ - self.critic))
    # Total loss
    self.loss = 0.5 * critic_loss + actor_loss - 0.001 * entropy

    # Instance attribute, so that update can run it.
    # minimize() returns the training op, not the optimizer object itself.
    self.optimizer = tf.train.AdamOptimizer(0.001).minimize(self.loss)

    init = tf.global_variables_initializer()
    initlocal = tf.local_variables_initializer()
    session.run([init, initlocal])
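The loss terms built in the graph can be sanity-checked outside TensorFlow. The following is a minimal NumPy sketch that mirrors the ratio, clipping, and entropy arithmetic; ppo_loss_terms is a hypothetical helper written for this check, and the input values are made up for illustration.

```python
import numpy as np


def ppo_loss_terms(probs, action_mask, old_log_prob, advantage, clip=0.2, eps=1e-8):
    """NumPy mirror of the actor loss and entropy terms (illustrative helper)."""
    # Log-probability of the action actually taken, shape [N, 1].
    log_prob = np.sum(np.log(probs + eps) * action_mask, axis=1, keepdims=True)
    # Probability ratio between the current and the old policy.
    ratio = np.exp(log_prob - old_log_prob)
    clipped = np.clip(ratio, 1.0 - clip, 1.0 + clip)
    # Clipped surrogate objective, negated so that it can be minimized.
    actor_loss = -np.mean(np.minimum(ratio * advantage, clipped * advantage))
    # Mean entropy of the action distributions.
    entropy = np.mean(-np.sum(probs * np.log(probs + eps), axis=1))
    return ratio, actor_loss, entropy


probs = np.array([[0.5, 0.5], [0.9, 0.1]])        # current policy output
action_mask = np.array([[1.0, 0.0], [1.0, 0.0]])  # action 0 taken in both steps
old_log_prob = np.log(np.array([[0.5], [0.6]]))   # log-probs under the old policy
advantage = np.array([[1.0], [1.0]])

ratio, actor_loss, entropy = ppo_loss_terms(probs, action_mask, old_log_prob, advantage)
print(ratio.ravel(), actor_loss, entropy)
```

In the second row the ratio is 0.9 / 0.6 = 1.5, which the clip caps at 1.2 because the advantage is positive, so the policy gets no extra credit for moving further than the clip range allows.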

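2. Fix the logic errors in the `update` method

The original update method used the wrong feed_dict keys and computed the entropy in the wrong place, among other problems. The corrected version is: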

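import numpy as np  # module-level import

def update(self, session, epochs, batch_size, states, actions, log_probs, returns, advantages):
    """ Update the neural network from the experience buffer
    ##############################################
    epochs: number of training epochs
    batch_size: size of each training batch
    states: array of environment states
    actions: array of actions taken
    log_probs: log-probabilities of those actions
    returns: array of estimated returns at each time step
    advantages: array of advantages, the difference between the predicted return and the estimated value """
    # Convert actions to a one-hot mask that matches action_mask in the graph
    actions = np.asarray(actions, dtype=np.int64)
    action_masks = np.zeros((len(actions), int(self.actor.shape[1])), dtype=np.float32)
    action_masks[np.arange(len(actions)), actions] = 1.0

    for e in range(epochs):
        # make_batches now receives the one-hot action_masks instead of the raw actions
        for state, action_mask, old_log_prob, return_, advantage in self.make_batches(states, action_masks, log_probs, returns, advantages):
            # Add a leading batch axis so state matches self.env: [1, None, input_size].
            # Note: the batch is fed as a single sequence, and with return_sequences=False
            # the network emits one output row, which is broadcast against the per-step targets.
            state = np.expand_dims(state, axis=0)
            # Build the feed_dict with the placeholders as keys
            feed_dict = {
                self.env: state,
                self.action_mask: action_mask,
                self.old_log_prob: np.reshape(old_log_prob, (-1, 1)),
                self.return_: np.reshape(return_, (-1, 1)),
                self.advantage: np.reshape(advantage, (-1, 1))
            }
            # Run one optimization step
            session.run(self.optimizer, feed_dict=feed_dict)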
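The update method relies on make_batches, which the original code does not show. As a rough sketch only, and assuming it simply cuts several parallel arrays into aligned, in-order slices, it might look like the standalone function below. The signature (including the explicit batch_size argument) is an assumption, not the original author's method.

```python
import numpy as np


def make_batches(batch_size, *arrays):
    """Yield aligned, in-order slices of several parallel arrays (hypothetical helper)."""
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same length")
    for start in range(0, n, batch_size):
        # Slice every array with the same window so the rows stay aligned.
        yield tuple(np.asarray(a)[start:start + batch_size] for a in arrays)


states = np.arange(10, dtype=np.float32).reshape(5, 2)  # 5 steps, 2 features
returns = np.arange(5, dtype=np.float32)
batches = list(make_batches(2, states, returns))
print(len(batches), batches[0][0].shape, batches[-1][1].shape)
```

The slices are kept in time order rather than shuffled, because each batch is fed to the LSTM as one sequence.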

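3. Key points explained

Scope: declaring the optimizer, the placeholders, and the loss as self.xxx makes them attributes of the class instance, so update can access them normally.
Computation graph design: the original code fed actor_loss and the entropy into placeholders as externally computed numbers, which goes against how TensorFlow's computation graph is meant to be used. A value fed through a placeholder has no dependence on the network weights, so the optimizer cannot differentiate it. Building these computations into the graph fixes that, is more efficient, and avoids mismatches between externally computed values and the graph.
Action mask: using a one-hot mask to select the probability of the action taken fits batched computation and avoids the dimension errors that manual indexing tends to cause.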
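To illustrate the action mask point, here is a small NumPy sketch with made-up probabilities. It builds the one-hot mask the same way update does and checks that masking and summing gives the same result as direct indexing.

```python
import numpy as np

actions = np.array([1, 0, 1])          # action index chosen at each step
probs = np.array([[0.2, 0.8],
                  [0.7, 0.3],
                  [0.4, 0.6]])         # policy output per step

# One-hot mask: a 1.0 in the column of the chosen action, 0.0 elsewhere.
mask = np.zeros_like(probs)
mask[np.arange(len(actions)), actions] = 1.0

# Multiplying by the mask and summing over actions keeps only the chosen probability.
selected = np.sum(probs * mask, axis=1, keepdims=True)

# Equivalent result via direct fancy indexing, for comparison.
indexed = probs[np.arange(len(actions)), actions].reshape(-1, 1)
print(selected.ravel())
```

The masked form uses only element-wise multiplication and a reduction, so the same expression works unchanged on graph tensors.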