AlphaGo Zero becomes best at Go in 40 days by only playing itself without any human input

AlphaGo became the first program to defeat a world champion in the game of Go. The tree search in AlphaGo evaluated positions and selected moves using deep neural networks. These neural networks were trained by supervised learning from human expert moves, and by reinforcement learning from self-play.

DeepMind created an algorithm based solely on reinforcement learning, without human data, guidance or domain knowledge beyond game rules. AlphaGo becomes its own teacher: a neural network is trained to predict AlphaGo’s own move selections and also the winner of AlphaGo’s games. This neural network improves the strength of the tree search, resulting in higher quality move selection and stronger self-play in the next iteration. Starting tabula rasa, the new program AlphaGo Zero achieved superhuman performance, winning 100–0 against the previously published, champion-defeating AlphaGo.
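
The training target described above can be written down compactly. In the Nature paper, a single network outputs move probabilities p and a value v, and is trained to match the search's move probabilities π and the game result z with the loss (z − v)² − π·log p + c‖θ‖². The following is a minimal sketch of that loss for one position in plain Python. The function name, argument names, and the example numbers are assumptions made for illustration; they are not DeepMind's code.

```python
import math

def self_play_loss(policy_probs, search_probs, value_pred, outcome,
                   weights=(), l2_coeff=1e-4):
    """Loss for one position: (z - v)^2 - sum(pi * log p) + c * ||theta||^2.

    All names here are hypothetical; l2_coeff is an illustrative default.
    """
    # Value term: squared error between the predicted winner v and the result z (+1 or -1).
    value_loss = (outcome - value_pred) ** 2
    # Policy term: cross-entropy between the tree search's move probabilities (pi)
    # and the network's own move probabilities (p). The clamp avoids log(0).
    policy_loss = -sum(pi * math.log(max(p, 1e-12))
                       for pi, p in zip(search_probs, policy_probs) if pi > 0)
    # Regularisation term: penalise large network weights.
    l2_penalty = l2_coeff * sum(w * w for w in weights)
    return value_loss + policy_loss + l2_penalty

# Toy position with four legal moves; the search strongly prefers the first one.
search = [0.7, 0.1, 0.1, 0.1]
# An untrained network: uniform move probabilities, no opinion on the winner.
untrained = self_play_loss([0.25] * 4, search, value_pred=0.0, outcome=1.0)
# A better network: matches the search and is fairly sure of the winner.
improved = self_play_loss(search, search, value_pred=0.9, outcome=1.0)
print(untrained, improved)
```

The uniform network scores 1 + ln 4 on this toy position, and the loss falls as the network's outputs move toward what the search and the game result say. That is the sense in which the program is its own teacher: the search produces better targets than the raw network, and the network is then trained toward them.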

AlphaGo Zero surpassed its predecessor’s abilities without referring to any human games. It started by playing at random and improved solely by repeatedly playing against itself. After three days and 4.9 million such games, it had overtaken the version of AlphaGo that defeated world champion Lee Sedol. After 40 days it had also overtaken AlphaGo Master, the stronger version that went on to beat the world’s top-ranked player.

Nature – Mastering the game of Go without human knowledge

10 thoughts on “AlphaGo Zero becomes best at Go in 40 days by only playing itself without any human input”

  1. btw – off topic, but what’s up with this commenting system? I really, really disliked SolidOpinion, mostly due to people abusing its points system, and I like the cleanliness of this system, but it still seems a bit lacking.

    Not being able to edit your posts and not being able to have nested conversations seem to be two big strikes against it. What was the issue with Disqus?

  2. I read this elsewhere… but I think it’s pretty impressive, part of that AI-begets-better-AI feedback loop. Now if only a good AI could be built to answer search-type questions, and we simply rated the results to train it.

  3. 100-0, that is obliteration. And this is a far faster advance than was achieved by chess engines, which were level with humans for 3 or 4 years. I think Magnus Carlsen (chess World Champion and holder of the record for the highest Elo rating) could get maybe 5 draws out of a hundred games vs Stockfish, Houdini, or Komodo (the three best engines by quite a bit, and fairly close to one another), which is in line with the 700 Elo or so difference in strength. Winning 100 out of 100 games would imply an advantage of about 1200 Elo or more. And that is understated, because the Go program was not playing the best human but a program that was already clearly better than the best humans. So maybe 1400 Elo past the humans.

  4. No human input apart from a human defining success for the AI.

    Not really sure how this is news. Anyone who has written an alpha-beta pruning algorithm for a game with a simple search space (e.g. tic-tac-toe or Connect Four) has made the game play itself. Back in the day I used to run heuristic variants of such algorithms against each other to see which would win, based on how each algorithm interpreted what the board looked like six or seven moves into a game.

    • you’re kidding, right?

      this is huge, because it:

      1. indicates that we mere humans didn’t even come up with the optimum after thousands of years of searching via our own heuristics

      2. shows that, for limited AI and with the right algorithm, you can go from zero to superhuman in a matter of days, from scratch – even with massively complicated search spaces like Go.

      If – and I admit this is a big if – general AI is tractable with the correct algorithm in the same way that Go is, we are likely to experience a ‘fast takeoff’ in general AI – and if this general AI doesn’t have the same goals as us, we are likely to be screwed.

      I’ve always thought that this would be somewhat gradual – where there were several AIs that were developed in parallel and AI’s evolution would be in steps – but I’m not so sure now. This is one datapoint that strongly indicates that it is likely to be sudden.

      And if the general AI takeoff happens in days, it is likely to be a very random event, depending on the starting conditions behind that AI, and AI safety ought to be at the top of our priority list.

      • On the plus side, one of the most dangerous skills an AI might have is manipulating human beings. In order to learn this skill by accelerated learning comparable to this Go example, the AI would need to already have a good emulation of human behavior and reasoning! And one capable of being run much, much faster than real time.

        So in order to take off, it really has to already be significantly superhuman.

        • brett,

          I disagree. Gigantic, tailor-made, curated training sets already exist for any potential AI to learn manipulation – YouTube and Wikipedia, for example, and chat networks for another.

          Facebook and Google could be mined for ‘likes’ analysis, determining algorithmically which posts and memes can manipulate the most people.

          and finally there is the ultimate manipulation agent for humans, namely money. Any superintelligent entity is likely to be able to make loads of this, and then use payment as a way to manipulate people – given the anonymity of the internet, the people being manipulated need not know they are being manipulated by an AI.

          so no, I don’t think these barriers provide us much protection.

    • @Combinatorics,

      If by “defining success” you mean specifying when the game ends and who wins based upon the rules, then sure, but this is pretty minimal information.

      Also, this is much more of an achievement than what a tree search does. For traditional game engines (e.g. chess), the source of most of the difference in playing strength among the top ranks is not the search algorithm but the evaluation function, that is, the thing that takes in a board state and spits out who it thinks is winning and by how much. Note that in all traditional applications this function is *hand-coded* even today (see the inexorable stream of patches for Stockfish, arguably the strongest chess engine).

      This new version of AlphaGo started with an evaluation function that is essentially random and gradually improved it to a superhuman level over many games. If you look at the published paper, they benchmarked a version of AlphaGo with *no tree search whatsoever* (that is, no look-ahead, just the current state of the board), which is obviously much weakened, but is still able to play at a Go professional’s level. I don’t know about you, but I think that’s pretty impressive.
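
A note on the Elo arithmetic in comment 3 above: it can be checked with the standard logistic Elo model, in which a rating gap d gives the stronger side an expected score of 1 / (1 + 10^(−d/400)). The sketch below is only illustrative. The function names are made up here, and treating 100–0 as 99.5–0.5 is an assumption, since a perfect score has no finite solution under this model.

```python
import math

def expected_score(elo_diff):
    """Expected score (win = 1, draw = 0.5) for the stronger side at a given rating gap."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def elo_diff_from_score(score):
    """Invert the model: the rating gap implied by an average score strictly between 0 and 1."""
    if not 0.0 < score < 1.0:
        raise ValueError("a 0% or 100% score implies an unbounded rating gap")
    return 400.0 * math.log10(score / (1.0 - score))

# Five draws in 100 games is 2.5 points for the human, so the engine scores 97.5%.
engine_vs_human_gap = elo_diff_from_score(0.975)

# 100-0 cannot be inverted directly; as an assumed workaround, count it as 99.5-0.5.
sweep_gap_lower_bound = elo_diff_from_score(0.995)

# Going the other way: the score the stronger side would expect at a 1200-point gap.
score_at_1200 = expected_score(1200)

print(round(engine_vs_human_gap), round(sweep_gap_lower_bound), round(score_at_1200, 4))
```

Under these assumptions, five draws per hundred games works out to a gap of a bit over 600 points, which is in the neighbourhood of the 700 quoted. A clean sweep only puts a floor under the gap: a 1200-point gap corresponds to dropping about one point per thousand games, so 100 games cannot tell 1200 apart from anything larger.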
