Why does Karpathy's Pong policy gradient code sample actions with a uniform draw instead of a threshold test?
Great question! Let’s unpack this from both practical and mathematical angles—this is a key point in understanding how policy gradient methods work.
Balancing Exploration and Exploitation
The core challenge in reinforcement learning is making sure your agent doesn't get stuck in a suboptimal policy. If we used a fixed threshold like `action = 2 if 0.5 < aprob else 3`, the agent would act deterministically: every time `aprob` is above 0.5 it always picks action 2, and it never tries action 3 even if action 3 might lead to a better reward later.

By sampling from the probability distribution (`np.random.uniform() < aprob`), we let the agent exploit the action it thinks is better (a higher `aprob` means action 2 is chosen more often) while still exploring the other action occasionally. This exploration is critical for discovering better strategies over time; without it, the agent might never learn that a seemingly worse action could lead to a winning streak in certain scenarios. A minimal sketch contrasting the two strategies follows.
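For concreteness, here is a small sketch of both selection strategies. The `aprob` variable and the `np.random.uniform()` call mirror Karpathy's post; the helper names and the demo at the bottom are illustrative, not taken from his code.

```python
import numpy as np

def sample_action(aprob):
    # Stochastic policy: action 2 ("up") with probability aprob,
    # action 3 ("down") with probability 1 - aprob.
    return 2 if np.random.uniform() < aprob else 3

def threshold_action(aprob):
    # Deterministic variant: the same action every time for a given aprob.
    return 2 if 0.5 < aprob else 3

aprob = 0.6
samples = [sample_action(aprob) for _ in range(10_000)]
print(samples.count(2) / len(samples))  # ~0.6: action 2 roughly 60% of the time
print(threshold_action(aprob))          # always 2: action 3 is never explored
```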
Mathematical Correctness for Policy Gradients
Karpathy's code uses the REINFORCE algorithm, which relies on estimating the gradient of the expected reward with respect to the policy parameters. For this gradient estimate to be valid, actions must be sampled directly from the policy's probability distribution.

If we switched to a fixed threshold, we'd effectively turn the stochastic policy into a deterministic one. Deterministic policies require different gradient estimation techniques (like those used in DDPG), whereas REINFORCE is designed for stochastic policies. Sampling ensures that the gradient updates push the policy toward increasing the probability of actions that lead to higher rewards, something a fixed threshold can't properly capture, since it doesn't account for the full probability distribution over actions. The sketch below shows where the sampled action enters the gradient.
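To see why sampling matters for the gradient, recall the score-function identity: the gradient of the expected reward is E[R · ∇ log π(a|s)]. For a sigmoid (Bernoulli) policy with P(action 2) = `aprob`, the log-probability gradient with respect to the logit reduces to `y - aprob`, where `y = 1` if action 2 was taken. The sketch below assumes that setup; the function name is hypothetical.

```python
def grad_logp_sigmoid(aprob, action):
    # Bernoulli policy: P(action 2) = aprob = sigmoid(logit).
    # d log pi(action) / d logit = y - aprob, with y = 1 iff action 2 was taken.
    y = 1.0 if action == 2 else 0.0
    return y - aprob

# REINFORCE update direction for one step: reward-weighted score function.
aprob, action, reward = 0.6, 3, 1.0
print(grad_logp_sigmoid(aprob, action) * reward)
# -0.6: a positive reward after action 3 pushes the logit down,
# lowering aprob and making action 3 more likely in the future.
```

With a hard threshold, the chosen action has probability 1 under the effective policy, so log π is identically 0 along the trajectory and the score-function gradient carries no learning signal.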
Intuitive Example
Suppose the current `aprob` (the probability of choosing action 2) is 0.6. A fixed threshold would make the agent choose action 2 100% of the time. But if we sample, 60% of the time it picks action 2 and 40% of the time it picks action 3. If one of those 40% cases results in a huge reward (like winning a round), the agent can update its policy to adjust `aprob` appropriately: increasing it if action 2 still proves better overall, or decreasing it if action 3 turned out to be the better choice in that context. Without sampling, it would never get that crucial feedback. The toy loop below makes this concrete.
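As a hedged, self-contained illustration (a toy one-step bandit, not Karpathy's Pong setup), the loop below starts with `aprob` near 0.6 and lets REINFORCE discover that action 3 pays off more reliably. All environment details here are invented for the demo.

```python
import numpy as np

rng = np.random.default_rng(0)
logit, lr = 0.4, 0.1  # sigmoid(0.4) is about 0.6, matching the example above

def reward_for(action):
    # Hypothetical payoffs: action 3 always pays 1; action 2 pays 1 only 30% of the time.
    return 1.0 if action == 3 else float(rng.random() < 0.3)

for _ in range(2000):
    aprob = 1.0 / (1.0 + np.exp(-logit))       # P(action 2)
    action = 2 if rng.random() < aprob else 3  # sample, don't threshold
    y = 1.0 if action == 2 else 0.0
    logit += lr * (y - aprob) * reward_for(action)  # REINFORCE step

print(1.0 / (1.0 + np.exp(-logit)))  # aprob ends up well below 0.5
```

If the sampling line were replaced with a threshold, the agent would only ever take action 2, so it would never observe action 3's payoff, and applying the REINFORCE update would no longer have a valid gradient interpretation.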
The question comes from Stack Exchange, originally asked by nisarkhanatwork.




