The AlphaZero program developed by Google and DeepMind took four hours of playing against itself to create chess knowledge beyond any human or other computer program. It could beat any person and beat the best World Computer Champion Stockfish 28 wins to 0 in a 100-game match.
Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case.
Self-play games are generated by using the latest parameters for this neural network, omitting
the evaluation step and the selection of best player.
AlphaGo Zero tuned the hyper-parameter of its search by Bayesian optimization. In AlphaZero they reuse the same hyper-parameters for all games without game-specific tuning. The sole exception is the noise that is added to the prior policy to ensure exploration; this is scaled in proportion to the typical number of legal moves for that game type.
Like AlphaGo Zero, the board state is encoded by spatial planes based only on the basic
rules for each game. The actions are encoded by either spatial planes or a flat vector, again
based only on the basic rules for each game.
They applied the AlphaZero algorithm to chess, shogi, and also Go. Unless otherwise specified, the same algorithm settings, network architecture, and hyper-parameters were used for all three games. They trained a separate instance of AlphaZero for each game. Training proceeded for 700,000 steps (mini-batches of size 4,096) starting from randomly initialized parameters,
using 5,000 first-generation TPUs to generate self-play games and 64 second-generation
TPUs to train the neural networks.
In chess, AlphaZero outperformed Stockfish after just 4 hours (300k steps); in shogi, AlphaZero outperformed Elmo after less than 2 hours (110k steps); and in Go, AlphaZero outperformed AlphaGo Lee (29) after 8 hours (165k steps).
The game of chess represented the pinnacle of AI research over several decades. State-ofthe-art
programs are based on powerful engines that search many millions of positions, leveraging handcrafted domain expertise and sophisticated domain adaptations. AlphaZero is a generic
reinforcement learning algorithm – originally devised for the game of Go – that achieved superior results within a few hours, searching a thousand times fewer positions, given no domain