Graphcore AI chips are up to 100x faster

AI chip startup Graphcore's IPU systems are designed to lower the cost of accelerating AI applications in cloud and enterprise datacenters, increasing the performance of both training and inference by up to 100x compared with the fastest systems today.

Graphcore systems excel at both training and inference. The highly parallel computational resources, together with graph software tools and libraries, allow researchers to explore machine intelligence across a much broader front than current solutions. This technology lets recent successes in deep learning evolve rapidly towards useful, general artificial intelligence.

Graphcore’s IPU (Intelligence Processing Unit) is a new AI accelerator bringing an unprecedented level of performance to both current and future machine learning workloads. Its unique combination of massively parallel multi-tasking compute, synchronized execution within an IPU or across multiple IPUs, an innovative data exchange fabric, and large amounts of on-chip SRAM gives unheard-of capabilities for both training and inference across a wide range of machine learning algorithms.

Graphcore has been focused on bringing up a full software stack early to ensure that the IPU can be used for real applications from the outset.

Their Poplar® graph programming framework and application libraries provide these capabilities. They have developed a port of TensorFlow that targets the Poplar libraries, with support for other machine learning frameworks underway. With these software tools, they can run a wide variety of real applications, through both cycle-accurate chip simulations and real hardware.

Graphcore recently raised $50 million in funding from Sequoia Capital.

CNN model training (even at low batch sizes)

Convolutional neural networks (CNNs) are used widely in image processing. A CNN model will typically contain several layers performing multiple convolution operations. The convolution operations have parameters which must be learnt via a training algorithm. Training is usually performed by stochastic gradient descent which involves repeatedly running the model on image data, calculating the gradients of the model and then updating the parameters of the model.
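The training loop described above can be sketched in miniature. The following is a hedged, toy illustration of stochastic gradient descent, not Graphcore's implementation: a single 1-D convolution filter is fitted to synthetic data by repeatedly running the model on a small random batch, computing gradients, and updating the parameters. All names and data here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic task: recover a known 3-tap convolution filter from data.
# (Hypothetical stand-in for a CNN layer's learnable parameters.)
true_filter = np.array([0.5, -1.0, 0.25])
X = rng.normal(size=(200, 32))  # 200 toy 1-D "images"
Y = np.stack([np.convolve(x, true_filter, mode="valid") for x in X])

w = np.zeros(3)   # learnable filter taps, initialized to zero
lr = 0.01         # learning rate
batch_size = 8    # small batch, as in low-batch-size training

for step in range(500):
    # Stochastic part: sample a random mini-batch.
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], Y[idx]

    # Forward pass: run the model on the batch.
    pred = np.stack([np.convolve(x, w, mode="valid") for x in xb])
    err = pred - yb  # error under a mean-squared-error loss

    # Backward pass: gradient of the loss w.r.t. the filter taps.
    # For np.convolve (which flips the kernel), the gradient is the
    # cross-correlation of input and error, reversed.
    grad = np.zeros(3)
    for x, e in zip(xb, err):
        grad += np.correlate(x, e, mode="valid")[::-1]
    grad /= batch_size

    # SGD parameter update.
    w -= lr * grad
```

After training, `w` converges to `true_filter`; a real CNN repeats this same loop over many layers and millions of parameters, which is where accelerator throughput matters.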

The best performance reported on a 300W GPU accelerator (the same power budget as a C2 accelerator) is approximately 580 images per second, but the Graphcore C2 can train at 16,000 images per second.