While we’ve recently seen great strides in robotic capability, the gap between human and robot motor skills remains vast. Machines still have a very long way to go to match human proficiency even at basic sensorimotor skills like grasping. However, by linking learning with continuous feedback and control, we might begin to bridge that gap, and in so doing make it possible for robots to intelligently and reliably handle the complexities of the real world.
The video below shows the robot from Korea’s KAIST team that won last year’s (2015) DARPA Robotics Challenge. The remarkably precise and deliberate motions are deeply impressive. But they are also quite… robotic. Why is that? What makes robot behavior so distinctly robotic compared to human behavior? At a high level, current robots typically follow a sense-plan-act paradigm, where the robot observes the world around it, formulates an internal model, constructs a plan of action, and then executes this plan. This approach is modular and often effective, but tends to break down in the kinds of cluttered natural environments that are typical of the real world. Here, perception is imprecise, all models are wrong in some way, and no plan survives first contact with reality.
Humans and animals move quickly, reflexively, and often with remarkably little advance planning, by relying on highly developed and intelligent feedback mechanisms that use sensory cues to correct mistakes and compensate for perturbations. For example, when serving a tennis ball, the player continually observes the ball and the racket, adjusting the motion of his hand so that they meet in the air. This kind of feedback is fast, efficient, and, crucially, can correct for mistakes or unexpected perturbations. Can we train robots to reliably handle complex real-world situations by using similar feedback mechanisms to handle perturbations and correct mistakes?
While servoing and feedback control have been studied extensively in robotics, the question of how to define the right sensory cue remains exceptionally challenging, especially for rich modalities such as vision. So instead of choosing the cues by hand, we can program a robot to acquire them on its own from scratch, by learning from extensive experience in the real world. In our first experiments with real physical robots, we decided to tackle robotic grasping in clutter.
A human child is able to reliably grasp objects after one year, and takes around four years to acquire more sophisticated precision grasps. However, networked robots can instantaneously share their experience with one another, so if we dedicate 14 separate robots to the job of learning grasping in parallel, we can acquire the necessary experience much faster. Below is a video of Google robots practicing grasping a range of common office and household objects:
While initially the grasps are executed at random and succeed only rarely, each day the latest experiences are used to train a deep convolutional neural network (CNN) to learn to predict the outcome of a grasp, given a camera image and a potential motor command. This CNN is then deployed on the robots the following day, in the inner loop of a servoing mechanism that continually adjusts the robot’s motion to maximize the predicted chance of a successful grasp. In essence, the robot is constantly predicting, by observing the motion of its own hand, which kind of subsequent motion will maximize its chances of success. The result is continuous feedback: what we might call hand-eye coordination. Observing the behavior of the robot after over 800,000 grasp attempts, which is equivalent to about 3000 robot-hours of practice, we can see the beginnings of intelligent reactive behaviors. The robot observes its own gripper and corrects its motions in real time. It also exhibits interesting pre-grasp behaviors, like isolating a single object from a group. All of these behaviors emerged naturally from learning, rather than being programmed into the system.
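The servoing idea described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `choose_command`, `servo`, `ToyWorld` and `toy_predictor` are hypothetical, a hand-written distance-based score stands in for the trained CNN, and plain random sampling of candidate commands stands in for whatever optimizer the real system uses.

```python
import random
from typing import Callable, List, Optional, Tuple

# Hypothetical motor command: a small (dx, dy, dz) displacement of the gripper.
Command = Tuple[float, float, float]


def choose_command(predict_success: Callable[[object, Command], float],
                   observation: object,
                   num_candidates: int = 64,
                   max_step: float = 0.05,
                   rng: Optional[random.Random] = None) -> Tuple[Command, float]:
    """Sample candidate commands and keep the one with the highest predicted success."""
    rng = rng or random.Random()
    best_cmd: Command = (0.0, 0.0, 0.0)
    best_score = float("-inf")
    for _ in range(num_candidates):
        cmd = (rng.uniform(-max_step, max_step),
               rng.uniform(-max_step, max_step),
               rng.uniform(-max_step, max_step))
        score = predict_success(observation, cmd)
        if score > best_score:
            best_cmd, best_score = cmd, score
    return best_cmd, best_score


class ToyWorld:
    """Stand-in for the robot and camera: tracks gripper and object positions."""

    def __init__(self) -> None:
        self.gripper = [0.0, 0.0, 0.0]
        self.target = [0.2, -0.1, 0.15]

    def observe(self):
        # A real system would return a camera image here.
        return tuple(self.gripper), tuple(self.target)

    def move(self, cmd: Command) -> None:
        self.gripper = [g + c for g, c in zip(self.gripper, cmd)]

    def distance(self) -> float:
        return sum((g - t) ** 2 for g, t in zip(self.gripper, self.target)) ** 0.5


def toy_predictor(observation, cmd: Command) -> float:
    """Stand-in for the CNN: scores a command higher if it ends nearer the object."""
    gripper, target = observation
    dist = sum((g + c - t) ** 2 for g, c, t in zip(gripper, cmd, target)) ** 0.5
    return 1.0 / (1.0 + dist)  # lies in (0, 1]


def servo(world: ToyWorld, steps: int = 20,
          rng: Optional[random.Random] = None) -> List[float]:
    """Closed loop: re-observe and re-choose a command at every step."""
    scores = []
    for _ in range(steps):
        cmd, score = choose_command(toy_predictor, world.observe(), rng=rng)
        world.move(cmd)
        scores.append(score)
    return scores


def run_demo(seed: int = 0) -> Tuple[float, float]:
    """Return the gripper-to-object distance before and after servoing."""
    world = ToyWorld()
    initial = world.distance()
    servo(world, rng=random.Random(seed))
    return initial, world.distance()
```

Calling `run_demo()` returns the distance before and after servoing. The point of the sketch is the loop structure: because the command is re-selected from a fresh observation at every step, an error made in one step can be corrected in the next, which an open-loop plan cannot do.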
Incorporating continuous feedback into the system cuts the failure rate nearly in half, from 34% to 18%, and produces interesting corrections and adjustments.
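As a quick sanity check on the figures quoted in this post, the arithmetic can be worked through as below. The inputs are the rounded numbers from the text ("over 800,000" attempts, "about 3000" robot-hours), so the results are rough estimates only; the variable names are ours.

```python
# Figures quoted in the post (rounded, so results are approximate).
grasp_attempts = 800_000
robot_hours = 3_000
failure_without_feedback = 0.34
failure_with_feedback = 0.18

# Average robot time spent per grasp attempt, in seconds.
seconds_per_attempt = robot_hours * 3600 / grasp_attempts

# Relative reduction in failure rate from adding continuous feedback.
relative_reduction = (
    (failure_without_feedback - failure_with_feedback) / failure_without_feedback
)

print(f"{seconds_per_attempt:.1f} s per attempt")
print(f"{relative_reduction:.0%} fewer failures")
```

This works out to roughly 13.5 seconds per attempt and a relative reduction of about 47%, consistent with the "nearly half" in the text.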
One of the most exciting aspects of the proposed grasping method is the ability of the learning algorithm to discover unconventional and non-obvious grasping strategies. We observed, for example, that the system tended to adopt a different approach for grasping soft objects, as opposed to hard ones. For hard objects, the fingers must be placed on either side of the object for a successful grasp. However, soft objects can be grasped simply by pinching into the object, which is most easily accomplished by placing one finger into the middle, and the other to the side. We observed this strategy for objects such as paper tissues and sponges. In future work, we plan to further explore the relationship between our self-supervised continuous grasping approach and reinforcement learning, in order to allow the methods to learn a wider variety of grasp strategies from large datasets of robotic experience.
At a more general level, our work explores the implications of large-scale data collection across multiple robotic platforms, demonstrating the value of this type of automatic large dataset construction for real-world robotic tasks. Although all of the robots in our experiments were located in a controlled laboratory environment, in the long term, this class of methods is particularly compelling for robotic systems that are deployed in the real world, and therefore are naturally exposed to a wide variety of environments, objects, lighting conditions, and wear and tear. For self-supervised tasks such as grasping, data collected and shared by robots in the real world would be the most representative of test-time inputs, and would therefore be the best possible training data for improving the real-world performance of the system. So a particularly exciting avenue for future work is to explore how our method would need to change to apply it to large-scale data collection across a large number of deployed robots engaged in real-world tasks, including grasping and other manipulation skills.
Neural networks have made great strides in allowing us to build computer programs that can process images, speech, and text, and even draw pictures. However, introducing actions and control adds considerable new challenges, since every decision the network makes will affect what it sees next. Overcoming these challenges will bring us closer to building systems that understand the effects of their actions in the world. If we can bring the power of large-scale machine learning to robotic control, perhaps we will come one step closer to solving fundamental problems in robotics and automation.
The research on robotic hand-eye coordination and grasping was conducted by Sergey Levine, Peter Pastor, Alex Krizhevsky, and Deirdre Quillen, with special thanks to colleagues at Google Research and X who’ve contributed their expertise and time to this research.
We describe a learning-based approach to hand-eye coordination for robotic grasping from monocular images. To learn hand-eye coordination for grasping, we trained a large convolutional neural network to predict the probability that task-space motion of the gripper will result in successful grasps, using only monocular camera images and independently of camera calibration or the current robot pose. This requires the network to observe the spatial relationship between the gripper and objects in the scene, thus learning hand-eye coordination. We then use this network to servo the gripper in real time to achieve successful grasps. To train our network, we collected over 800,000 grasp attempts over the course of two months, using between 6 and 14 robotic manipulators at any given time, with differences in camera placement and hardware. Our experimental evaluation demonstrates that our method achieves effective real-time control, can successfully grasp novel objects, and corrects mistakes by continuous servoing.
SOURCES – YouTube, Google Research, arXiv