AlphaGo Zero becomes best at Go in 40 days by playing only itself, without any human input

AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play.

DeepMind created a new algorithm based solely on reinforcement learning, without human data, guidance, or domain knowledge beyond the game rules. The program becomes its own teacher: a neural network is trained to predict the program's own move selections and also the winner of its games. This network improves the strength of the tree search, resulting in higher-quality move selection and stronger self-play in the next iteration. Starting tabula rasa, the new program, AlphaGo Zero, achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
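
To make that loop concrete, here is a minimal, illustrative sketch of the self-play cycle on a toy game (tic-tac-toe standing in for Go): the network's policy head is trained toward the search's visit-count probabilities and its value head toward the eventual game winner, and the trained network then guides the next round of searches. Everything below (the linear "network", the small PUCT search, the hyperparameters, and names such as PolicyValueNet and mcts_policy) is an assumption for illustration, not the paper's actual architecture or scale.

```python
import math
import numpy as np

# Toy two-player game (tic-tac-toe) standing in for Go; players are +1 and -1.
LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def legal_moves(board):
    return [i for i, v in enumerate(board) if v == 0]

def apply_move(board, move, player):
    b = list(board); b[move] = player; return tuple(b)

def winner(board):
    for a, b, c in LINES:
        if board[a] != 0 and board[a] == board[b] == board[c]:
            return board[a]
    return 0

def terminal(board):
    return winner(board) != 0 or all(v != 0 for v in board)

class PolicyValueNet:
    """Linear stand-in for the deep network: one policy head, one value head."""
    def __init__(self, lr=0.05):
        rng = np.random.default_rng(0)
        self.Wp = rng.normal(0, 0.1, (9, 9)); self.bp = np.zeros(9)
        self.wv = rng.normal(0, 0.1, 9); self.bv = 0.0
        self.lr = lr

    def predict(self, board, player):
        x = np.array(board, dtype=float) * player        # current-player view
        logits = self.Wp @ x + self.bp
        p = np.exp(logits - logits.max()); p /= p.sum()  # move probabilities
        v = math.tanh(float(self.wv @ x + self.bv))      # expected outcome in [-1, 1]
        return p, v

    def train(self, examples):
        # Policy head: cross-entropy toward search probabilities pi; value head: squared error toward outcome z.
        for board, player, pi, z in examples:
            x = np.array(board, dtype=float) * player
            p, v = self.predict(board, player)
            self.Wp -= self.lr * np.outer(p - pi, x)     # d(cross-entropy)/d(logits) = p - pi
            self.bp -= self.lr * (p - pi)
            dv = 2 * (v - z) * (1 - v * v)               # backprop through the tanh value head
            self.wv -= self.lr * dv * x
            self.bv -= self.lr * dv

def mcts_policy(net, board, player, sims=50, c_puct=1.5):
    """Small PUCT search; returns the visit-count distribution used as the training target."""
    N, W, children = {}, {}, {}

    def simulate(b, pl):
        if terminal(b):
            w = winner(b)
            return 0.0 if w == 0 else (1.0 if w == pl else -1.0)
        if (b, pl) not in children:                      # expand a leaf using the network's priors
            p, v = net.predict(b, pl)
            moves = legal_moves(b)
            priors = np.array([p[m] for m in moves]); priors /= priors.sum()
            children[(b, pl)] = list(zip(moves, priors))
            for m in moves:
                N[(b, pl, m)] = 0; W[(b, pl, m)] = 0.0
            return v
        total = sum(N[(b, pl, m)] for m, _ in children[(b, pl)]) + 1
        best, best_score = None, -1e9
        for m, prior in children[(b, pl)]:               # pick the move maximising Q + U
            q = W[(b, pl, m)] / N[(b, pl, m)] if N[(b, pl, m)] else 0.0
            u = c_puct * prior * math.sqrt(total) / (1 + N[(b, pl, m)])
            if q + u > best_score:
                best, best_score = m, q + u
        v = -simulate(apply_move(b, best, pl), -pl)      # outcome flips between players
        N[(b, pl, best)] += 1; W[(b, pl, best)] += v
        return v

    for _ in range(sims):
        simulate(board, player)
    pi = np.zeros(9)
    for m, _ in children[(board, player)]:
        pi[m] = N[(board, player, m)]
    return pi / pi.sum()

def self_play_game(net):
    board, player, history = (0,) * 9, 1, []
    while not terminal(board):
        pi = mcts_policy(net, board, player)
        history.append((board, player, pi))
        board = apply_move(board, int(np.random.choice(9, p=pi)), player)
        player = -player
    w = winner(board)
    # Label each stored position with the final result from that player's perspective.
    return [(b, pl, pi, 0.0 if w == 0 else (1.0 if w == pl else -1.0))
            for b, pl, pi in history]

net = PolicyValueNet()
for _ in range(20):                                      # each iteration: self-play, then retrain
    examples = [ex for _ in range(10) for ex in self_play_game(net)]
    net.train(examples)
print("last batch:", len(examples), "training positions from self-play")
```

In the real system the network is a deep residual network and the search runs far more simulations per move; the toy version above only mirrors the data flow of the training loop.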

AlphaGo Zero surpassed its predecessor's abilities without referring to any human games. It started from random play and improved solely by repeatedly playing against itself. After three days and 4.9 million such games, it had surpassed the version of AlphaGo that beat the world champion; after 40 days it also surpassed AlphaGo Master, the improved version of that program, making it the world's best Go-playing AI.

Nature – Mastering the game of Go without human knowledge