Google DeepMind Intros Generalist AI Which May Lead to AGI

Arxiv – DeepMind introduces Gato, a generalist AI agent, which could be a path to AGI (Artificial General Intelligence).

Inspired by progress in large-scale language modeling, DeepMind applies a similar approach toward building a single generalist agent beyond the realm of text outputs. The agent, which DeepMind refers to as Gato, works as a multi-modal, multi-task, multi-embodiment generalist policy. The same network with the same weights can play Atari, caption images, chat, stack blocks with a real robot arm and much more, deciding based on its context whether to output text, joint torques, button presses, or other tokens. In this report DeepMind describes the model and the data, and documents the current capabilities of Gato.

Gato is a generalist agent: it can sense and act with different embodiments across a wide range of environments using a single neural network with the same set of weights. Gato was trained on 604 distinct tasks with varying modalities, observations, and action specifications.
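The unifying trick is tokenization: every input and output, whether text, image patches, joint angles, or button presses, is serialized into one flat sequence of integer tokens that a single transformer models autoregressively. Below is a minimal Python sketch of how the paper describes encoding continuous values (mu-law companding, then uniform discretization into 1,024 bins placed after the 32,000-token text vocabulary); the function name and example values are mine, not DeepMind's code.

```python
import numpy as np

MU = 100.0          # mu-law parameter, per the Gato paper
M = 256.0           # scaling constant, per the Gato paper
NUM_BINS = 1024     # uniform bins for continuous values
TEXT_VOCAB = 32000  # continuous tokens are shifted past the text vocabulary

def tokenize_continuous(x: np.ndarray) -> np.ndarray:
    """Map continuous observations/actions to integer tokens (sketch)."""
    # mu-law companding squashes values toward [-1, 1]
    v = np.sign(x) * np.log(np.abs(x) * MU + 1.0) / np.log(M * MU + 1.0)
    v = np.clip(v, -1.0, 1.0)
    # uniform discretization into NUM_BINS bins, offset past the text vocab
    bins = np.round((v + 1.0) / 2.0 * (NUM_BINS - 1)).astype(int)
    return bins + TEXT_VOCAB

# Example: a robot observation's joint angles become ordinary sequence tokens
print(tokenize_continuous(np.array([0.12, -1.57, 0.8])))
```

Decoding an action works in reverse: the model samples tokens, which are mapped back through the inverse of this transform into joint velocities, button presses, or text.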

Transformer sequence models are effective as multi-task, multi-embodiment policies, including for real-world text, vision, and robotics tasks. They also show promise in few-shot, out-of-distribution task learning. In the future, such models could be used as a default starting point, via prompting or fine-tuning, to learn new behaviors rather than training from scratch.

Given scaling-law trends, performance across all tasks, including dialogue, should increase with scale in parameters, data, and compute. Better hardware and network architectures will allow training bigger models while maintaining real-time robot-control capability. By scaling up and iterating on this same basic approach, DeepMind can build a useful general-purpose agent.
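For intuition, scaling "laws" of the kind referenced here are typically fit as power laws in model size (and analogously in data and compute). The toy snippet below plugs model sizes into a power law with roughly the language-model constants reported by Kaplan et al. (2020); it illustrates the trend only and is not a prediction for Gato.

```python
# Illustrative only: the exponent alpha and constant n_c are roughly the
# language-model values from Kaplan et al. (2020), NOT measurements from
# the Gato report.

def power_law_loss(n_params: float, alpha: float = 0.076,
                   n_c: float = 8.8e13) -> float:
    """Hypothetical loss ~ (N_c / N)^alpha as parameter count N grows."""
    return (n_c / n_params) ** alpha

# The Gato report trained 79M-, 364M-, and 1.18B-parameter variants.
for n in (79e6, 364e6, 1.18e9):
    print(f"{n:9.2e} params -> loss {power_law_loss(n):.3f}")
```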

Gato Robotics – RGB Stacking Benchmark (Real and Sim)

As a testbed for taking physical actions in the real world, they chose the robotic block stacking environment introduced by Lee et al. (2021). The environment consists of a Sawyer robot arm with 3-DoF cartesian velocity control, an additional DoF for rotational velocity, and a discrete gripper action. The robot’s workspace contains three plastic blocks colored red, green, and blue with varying shapes. The available observations include two 128 × 128 camera images, robot arm and gripper joint angles, and the robot’s end-effector pose. Notably, ground-truth state information for the three objects in the basket is not observed by the agent. Episodes have a fixed length of 400 timesteps at 20 Hz, for a total of 20 seconds, and at the end of each episode the blocks are randomly repositioned within the workspace. There are two challenges in this benchmark:
Skill Mastery (where the agent is provided data from the 5 test object triplets it is later tested on) and
Skill Generalization (where data can only be obtained from a set of training objects that excludes the 5 test sets).
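For concreteness, the interface described above can be summarized as a specification like the following. This is a hypothetical gym/dm_env-style sketch; the field names and exact shapes are illustrative, not the benchmark’s actual API.

```python
# Hypothetical summary of the RGB-Stacking interface described above.
# Field names and exact shapes are illustrative, not the benchmark's API.

EPISODE_STEPS = 400  # fixed episode length
CONTROL_HZ = 20      # 400 steps at 20 Hz = 20-second episodes

observation_spec = {
    "front_camera": ("uint8", (128, 128, 3)),    # first 128x128 RGB image
    "side_camera": ("uint8", (128, 128, 3)),     # second 128x128 RGB image
    "joint_angles": ("float32", (7,)),           # Sawyer arm has 7 joints
    "gripper_angle": ("float32", (1,)),
    "end_effector_pose": ("float32", (7,)),      # assumed position + quaternion
    # ground-truth poses of the three blocks are deliberately unobserved
}

action_spec = {
    "cartesian_velocity": ("float32", (3,)),     # 3-DoF translation
    "rotational_velocity": ("float32", (1,)),    # the additional DoF
    "gripper": ("int64", ()),                    # discrete open/close
}

print(f"{EPISODE_STEPS} steps at {CONTROL_HZ} Hz = "
      f"{EPISODE_STEPS / CONTROL_HZ:.0f}-second episodes")
```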

They used several sources of training data for these tasks. For Skill Generalization, in both simulation and the real world, they use data collected by the best generalist sim2real agent from Lee et al. (2021). Data was collected only while interacting with the designated RGB-stacking training objects, amounting to a total of 387k successful trajectories in simulation and 15k trajectories on the real robot. For Skill Mastery, DeepMind used data from the best per-group experts from Lee et al. (2021) in simulation and from the best sim2real policy on the real robot (amounting to 219k trajectories in total). Note that this data is only included for specific Skill Mastery experiments.

AI Progress Via Deep Learning and Other Recent AI Developments

Geoff Hinton joins Pieter Abbeel in a two-part season finale for a wide-ranging discussion inspired by insights gleaned from Hinton’s journey from academia to Google Brain. The episode covers how existing neural networks and backpropagation models operate differently from how the brain actually works; the purpose of sleep; and why it is better to grow our computers than to manufacture them.

What’s in this episode:

00:00:00 – Introduction
00:02:48 – Understanding how the brain works
00:06:59 – Why we need unsupervised local objective functions
00:09:39 – Masked auto-encoders
00:10:55 – Current methods in end-to-end learning
00:18:36 – Spiking neural networks
00:23:00 – Leveraging spike times
00:29:55 – The story behind AlexNet
00:36:15 – Transition from pure academia to Google
00:40:23 – The secret auction of Hinton’s company at NeurIPS
00:44:18 – Hinton’s start in psychology and carpentry
00:54:34 – Why computers should be grown rather than manufactured
01:06:57 – The function of sleep and Boltzmann Machines
01:11:49 – Need for negative data
01:19:35 – Visualizing data using t-SNE

Complex Reasoning With Large Language Models

Arxiv – Least-to-Most Prompting Enables Complex Reasoning in Large Language Models

We propose a novel prompting strategy, least-to-most prompting, that enables large language models to better perform multi-step reasoning tasks. Least-to-most prompting first reduces a complex problem to a list of subproblems, and then solves the subproblems sequentially, whereby solving a given subproblem is facilitated by the model’s answers to previously solved subproblems. Experiments on symbolic manipulation, compositional generalization, and numerical reasoning demonstrate that least-to-most prompting can generalize to examples that are harder than those seen in the prompt context, outperforming other prompting-based approaches by a large margin. A notable empirical result is that the GPT-3 code-davinci-002 model with least-to-most prompting can solve the SCAN benchmark with an accuracy of 99.7% using just 14 examples. As a comparison, the neural-symbolic models in the literature specialized for solving SCAN are trained with the full training set of more than 15,000 examples.
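The two stages described in the abstract are easy to see in code form. Here is a minimal, hypothetical sketch of the loop: `call_llm` stands in for any completion API, and the exemplar text and comma-splitting of the decomposition are placeholders, not the paper’s exact prompts.

```python
def call_llm(prompt: str) -> str:
    """Stand-in for a large language model completion call."""
    raise NotImplementedError("plug in your model API here")

# Few-shot exemplars teaching the model to decompose (placeholder text).
DECOMPOSE_EXEMPLARS = (
    "Q: <example complex question>\n"
    "A: To solve this, we first need to solve: <subproblem 1>, <subproblem 2>\n\n"
)

def least_to_most(question: str) -> str:
    # Stage 1: reduce the complex problem to a list of subproblems.
    decomposition = call_llm(DECOMPOSE_EXEMPLARS + f"Q: {question}\nA:")
    subproblems = [s.strip() for s in decomposition.split(",") if s.strip()]

    # Stage 2: solve the subproblems in order; each solved Q/A pair is
    # appended to the context so later subproblems build on earlier answers.
    context = ""
    answer = ""
    for sub in subproblems + [question]:
        prompt = context + f"Q: {sub}\nA:"
        answer = call_llm(prompt)
        context = prompt + " " + answer + "\n\n"
    return answer
```

The key design choice is in stage 2: each solved subproblem’s question-answer pair stays in the context, so later subproblems, and finally the original question, can build on earlier answers.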

Meta AI’s LeCun Proposes Six-Module Common Sense AI

Meta’s LeCun proposed an architecture of six separate, differentiable modules, each of which can easily compute gradient estimates of the objective function with respect to its inputs and propagate the gradient information to upstream modules. This common-sense architecture could help AI systems achieve autonomous intelligence. The six modules are the configurator, perception, world model, short-term memory, actor, and cost.

The configurator module handles executive control, such as executing a given task. It is also responsible for pre-configuring the perception, world model, cost, and actor modules by modulating their parameters.

The perception module receives signals from sensors and estimates the current state of the world, but only a small subset of the perceived state of the world is relevant and valuable for a given task.

The world model module has two roles, and it is the most complex piece of the architecture. The first role is to estimate missing information about the state of the world that is not provided by perception. The second role is to predict plausible future states of the world, such as its natural evolution or the consequences of a proposed sequence of actions. The world model acts as a simulator for the task at hand, letting the agent represent multiple possible predictions.

The cost module predicts the agent’s level of discomfort and has two submodules: the intrinsic cost and the critic. The former is immutable and computes discomforts such as damage to the agent or violation of hard-coded behavioral constraints. The latter is a trainable module that predicts future values of the intrinsic cost.

The actor module computes proposals for action sequences. “The actor can find an optimal action sequence that minimizes the estimated future cost and output the first action in the optimal sequence, in a fashion similar to classical optimal control,” LeCun says.

The short-term memory module keeps track of the current and predicted world state and associated costs.
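To make the data flow between these modules concrete, here is a deliberately simplified, hypothetical sketch of one perception-plan-act cycle. The configurator is omitted, and every name and signature is illustrative rather than anything from a Meta codebase.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    perception: Callable      # sensors -> state estimate
    world_model: Callable     # (state, action) -> predicted next state
    intrinsic_cost: Callable  # state -> immutable "discomfort" score
    critic: Callable          # state -> predicted future intrinsic cost
    memory: list = field(default_factory=list)  # short-term memory

    def act(self, sensors, candidate_action_sequences):
        state = self.perception(sensors)
        best_seq, best_cost = None, float("inf")
        # Actor: score each candidate sequence by rolling it out in the
        # world model and summing intrinsic cost plus the critic's
        # estimate of remaining future cost (a crude stand-in for the
        # optimal-control search LeCun describes).
        for seq in candidate_action_sequences:
            s, cost = state, 0.0
            for a in seq:
                s = self.world_model(s, a)      # simulate one step
                cost += self.intrinsic_cost(s)  # immutable discomfort
            cost += self.critic(s)              # trainable future-cost estimate
            if cost < best_cost:
                best_seq, best_cost = seq, cost
        # Short-term memory tracks current/predicted states and costs.
        self.memory.append((state, best_seq, best_cost))
        return best_seq[0]  # execute only the first action, then replan
```

The loop mirrors model-predictive control: the actor proposes candidate action sequences, the world model rolls them forward, the cost module scores them, and only the first action of the best sequence is executed before replanning.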