AlphaGo Zero in "Mastering the game of Go without human knowledge": how are the rules of Go built in, and are its moves always legal?
Great question. Let’s break this down using the details from that landmark Nature paper. The short version: the CNN itself has no Go rules hard-coded into its layers, but the rules are built into the system through supporting components that shape what the network learns and how moves are chosen.
1. Weaving Go Rules into the CNN: Indirectly, via MCTS and Reinforcement Learning
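The CNN (in the paper, a deep residual convolutional network with a policy head and a value head) predicts move probabilities p and the expected game outcome v for a position. Rule enforcement happens through two other parts of the AlphaGo Zero pipeline: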
Pre-filtering illegal moves in Monte Carlo Tree Search (MCTS)
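Each time AlphaGo Zero runs MCTS to choose a move, the game engine supplies the set of legal moves for every position the search expands, and only those moves become branches of the tree. Illegal moves include playing on an occupied point, suicide (a move that leaves your own group with no liberties without capturing anything, which most rule sets forbid), and immediately retaking a ko. Filling your own eye, by contrast, is legal, if usually a poor move. Because the search never considers illegal options, every candidate it explores, and the search policy π it outputs as a training target for the CNN, puts zero weight on illegal moves. Over many self-play games the CNN learns to concentrate its probability on legal moves, because those are the only moves that carry weight in its policy targets.
Training targets reinforce legal play
During self-play, moves are not taken directly from the CNN’s output: each move is chosen according to the search policy π (proportional to the MCTS visit counts), so an illegal move is never played and every self-play game is legal from start to finish. When a game ends it is scored (the paper uses Tromp-Taylor scoring during self-play), and each position receives the outcome z = +1 if the player to move there eventually won, or -1 if they lost. The network is trained to minimise (z - v)^2 - π^T log p + c‖θ‖^2, so the value head v is pulled toward z and the policy head p toward π. Since π is zero on every illegal move, the cross-entropy term keeps pushing probability away from illegal moves and toward the legal, high-value moves the search preferred. Legality is never rewarded or penalised directly; the CNN picks it up implicitly from these targets.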
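A small sketch of how legality can reach the network purely through its training targets. Everything here is an illustrative assumption rather than the paper’s code: a toy 4-move action space, illegal moves masked out of the search priors by zeroing and renormalising, the search policy π taken as visit counts normalised at temperature 1, and the policy loss computed as the cross-entropy -π^T log p over the full move vector. The helper names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over all moves (legal or not)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_priors(logits, legal):
    """Priors for expanding a search node (assumed masking scheme):
    illegal moves get 0, legal moves are renormalised among themselves."""
    m = max(logits[a] for a in legal)
    exps = {a: math.exp(logits[a] - m) for a in legal}
    total = sum(exps.values())
    return [exps[a] / total if a in exps else 0.0 for a in range(len(logits))]


def search_policy(visit_counts):
    """pi(a) = N(s, a) / sum_b N(s, b), i.e. temperature 1.
    Illegal moves are never expanded, so their visit count is 0."""
    total = sum(visit_counts)
    return [n / total for n in visit_counts]


def policy_loss_and_grad(logits, pi):
    """Cross-entropy -sum_a pi(a) log p(a) with p = softmax(logits),
    plus its gradient with respect to the logits, which is p - pi."""
    p = softmax(logits)
    loss = -sum(t * math.log(q) for t, q in zip(pi, p) if t > 0)
    grad = [q - t for q, t in zip(p, pi)]
    return loss, grad


# Toy position with 4 moves; move 3 is illegal (hypothetical numbers).
logits = [1.0, 0.5, -0.2, 2.0]  # raw network output, which "likes" move 3
priors = masked_priors(logits, legal={0, 1, 2})  # priors[3] == 0.0
pi = search_policy([30, 15, 5, 0])               # [0.6, 0.3, 0.1, 0.0]
loss, grad = policy_loss_and_grad(logits, pi)
# grad[3] equals softmax(logits)[3] > 0, so a descent step lowers logit 3.
step = [x - 0.5 * g for x, g in zip(logits, grad)]
print(softmax(logits)[3], softmax(step)[3])  # the second value is smaller
```

Because π is zero for an illegal move, that move’s gradient component is just its current probability, so every update shrinks whatever mass the network still assigns to it. No explicit penalty term is needed.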
2. Does the CNN’s Generated Play Always Follow Go Rules?
This depends on whether you’re looking at the full system or the CNN in isolation:
Full AlphaGo Zero system (CNN + MCTS): Yes, always compliant
Because the search only ever expands legal moves, the move the system finally plays is guaranteed to follow the rules. The CNN guides the search, while the legal-move set supplied by the game engine acts as the safety guard that keeps illegal moves off the board.
CNN in isolation: Not necessarily
If you took the trained CNN and played the highest-probability move straight from its policy output, you could occasionally get an illegal move. The network learns to associate legal moves with winning outcomes, but nothing in its weights performs an explicit rule check. In normal operation this does not arise, because moves are chosen by the search over the legal-move set. When the network is used without search (the paper also reports the strength of such a "raw network" player that uses no lookahead), the natural safeguard is to restrict the choice to the legal moves supplied by the game engine.
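The gap between "CNN in isolation" and the full system comes down to one masking step. A minimal sketch, assuming the policy is a plain list of move probabilities and the legal-move set comes from a separate rule engine; both functions are hypothetical helpers, not code from the paper.

```python
def raw_argmax(probs):
    """'CNN in isolation' (hypothetical helper): highest-probability move, no rule check."""
    return max(range(len(probs)), key=lambda a: probs[a])


def masked_argmax(probs, legal):
    """Full-system behaviour (hypothetical helper): only legal moves can be chosen.
    In Go, passing is always legal, so `legal` is never empty in practice."""
    return max(sorted(legal), key=lambda a: probs[a])


# Toy example: point 2 is occupied, so it is illegal in this position.
probs = [0.05, 0.10, 0.70, 0.15]
legal = {0, 1, 3}
print(raw_argmax(probs))            # 2 (illegal)
print(masked_argmax(probs, legal))  # 3
```

Sorting the legal set only makes tie-breaking deterministic; it does not change which move wins when probabilities differ.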
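From the paper: in the Methods section, where the authors list the domain knowledge AlphaGo Zero uses, they state that it is given perfect knowledge of the game rules. These rules are used during MCTS to simulate the positions that result from a sequence of moves and to score simulations that reach a terminal state, and the player is provided with the set of legal moves in each position. Rule enforcement therefore lives in the game engine that the search calls, not in the neural network. This separation keeps the CNN focused on learning high-level strategy, while the rule engine handles the low-level legal constraints.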
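To make that division of labour concrete, here is a minimal sketch of the kind of rule engine a search could call: it generates legal moves and scores finished positions. It is not DeepMind’s code, and its rule choices are assumptions: a small square board stored as tuples of ".", "X", "O"; suicide treated as illegal; only simple ko checked (no immediate recreation of the previous position); area scoring following the Tromp-Taylor definition, without komi. All function names are hypothetical.

```python
EMPTY, BLACK, WHITE = ".", "X", "O"


def board_from_rows(rows):
    """Immutable (hashable) board built from row strings such as ".XO."."""
    return tuple(tuple(row) for row in rows)


def neighbors(p, n):
    """Orthogonally adjacent points of p on an n x n board."""
    r, c = p
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= r + dr < n and 0 <= c + dc < n:
            yield (r + dr, c + dc)


def group_and_liberties(board, p):
    """Flood-fill the group containing the stone at p; return (stones, liberties)."""
    n, color = len(board), board[p[0]][p[1]]
    stones, libs, stack = {p}, set(), [p]
    while stack:
        for q in neighbors(stack.pop(), n):
            v = board[q[0]][q[1]]
            if v == EMPTY:
                libs.add(q)
            elif v == color and q not in stones:
                stones.add(q)
                stack.append(q)
    return stones, libs


def play(board, p, color):
    """Board after `color` plays at p, or None if the point is occupied or the
    move is suicide (assumed illegal here). Captured enemy stones are removed."""
    if board[p[0]][p[1]] != EMPTY:
        return None
    new = [list(row) for row in board]
    new[p[0]][p[1]] = color
    other = WHITE if color == BLACK else BLACK
    for q in neighbors(p, len(board)):
        if new[q[0]][q[1]] == other:
            stones, libs = group_and_liberties(new, q)
            if not libs:
                for s in stones:
                    new[s[0]][s[1]] = EMPTY
    if not group_and_liberties(new, p)[1]:
        return None  # suicide: own group left with no liberties
    return board_from_rows(new)


def legal_moves(board, color, previous=None):
    """Pass plus every point whose result is legal and does not recreate
    `previous`, the position before the opponent's last move (simple ko)."""
    moves = ["pass"]
    n = len(board)
    for r in range(n):
        for c in range(n):
            after = play(board, (r, c), color)
            if after is not None and after != previous:
                moves.append((r, c))
    return moves


def tromp_taylor_score(board):
    """Area score: own stones plus empty points that reach only own colour."""
    n, score, seen = len(board), {BLACK: 0, WHITE: 0}, set()
    for r in range(n):
        for c in range(n):
            v = board[r][c]
            if v != EMPTY:
                score[v] += 1
                continue
            if (r, c) in seen:
                continue
            region, stack, borders = {(r, c)}, [(r, c)], set()
            while stack:
                for q in neighbors(stack.pop(), n):
                    w = board[q[0]][q[1]]
                    if w == EMPTY and q not in region:
                        region.add(q)
                        stack.append(q)
                    elif w != EMPTY:
                        borders.add(w)
            seen |= region
            if len(borders) == 1:
                score[borders.pop()] += len(region)
    return score


# Ko example: Black captures at (1, 2); White may not retake at (1, 1) at once.
start = board_from_rows([".XO.", "XO.O", ".XO.", "...."])
after = play(start, (1, 2), BLACK)
print((1, 1) in legal_moves(after, WHITE, previous=start))  # False
```

Checking legality by actually playing the move and comparing whole positions keeps the sketch short and easy to verify, at the cost of speed; an engine driving a large search would more likely track liberties and position hashes incrementally.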
This question was originally posted on Stack Exchange by Optimus.