IBM in memory computing with 1 million phase change memory devices is 200 times faster than regular computing

October 25, 2017 by Brian Wang

“In-memory computing” or “computational memory” is an emerging concept that uses the physical properties of memory devices for both storing and processing information. This is counter to current von Neumann systems and devices, such as standard desktop computers, laptops and even cellphones, which shuttle data back and forth between memory and the computing unit, thus making them slower and less energy efficient.

Today, IBM Research is announcing that its scientists have demonstrated that an unsupervised machine-learning algorithm, running on one million phase change memory (PCM) devices, successfully found temporal correlations in unknown data streams. When compared to state-of-the-art classical computers, this prototype technology is expected to yield 200x improvements in both speed and energy efficiency, making it highly suitable for enabling ultra-dense, low-power, and massively-parallel computing systems for applications in AI.


The researchers used PCM devices made from a germanium antimony telluride alloy, which is stacked and sandwiched between two electrodes. When the scientists apply a tiny electric current to the material, they heat it, which alters its state from amorphous (with a disordered atomic arrangement) to crystalline (with an ordered atomic configuration). The IBM researchers have used the crystallization dynamics to perform computation in place.

“This is an important step forward in our research of the physics of AI, which explores new hardware materials, devices and architectures,” says Dr. Evangelos Eleftheriou, an IBM Fellow and co-author of the paper. “As the CMOS scaling laws break down because of technological limits, a radical departure from the processor-memory dichotomy is needed to circumvent the limitations of today’s computers. Given the simplicity, high speed and low energy of our in-memory computing approach, it’s remarkable that our results are so similar to our benchmark classical approach run on a von Neumann computer.”

To demonstrate the technology, the authors chose two time-based examples and compared their results with traditional machine-learning methods such as k-means clustering:

* Simulated Data: one million binary (0 or 1) random processes organized on a 2D grid based on a 1000 x 1000 pixel, black and white, profile drawing of famed British mathematician Alan Turing. The IBM scientists then made the pixels blink on and off with the same rate, but the black pixels turned on and off in a weakly correlated manner. This means that when a black pixel blinks, there is a slightly higher probability that another black pixel will also blink. The random processes were assigned to a million PCM devices, and a simple learning algorithm was implemented. With each blink, the PCM array learned, and the PCM devices corresponding to the correlated processes went to a high conductance state. In this way, the conductance map of the PCM devices recreates the drawing of Alan Turing.

* Real-World Data: actual rainfall data, collected over a period of six months from 270 weather stations across the USA at one-hour intervals. If it rained within the hour, the interval was labelled “1”, and if it didn’t, “0”. Classical k-means clustering and the in-memory computing approach agreed on the classification of 245 out of the 270 weather stations. In-memory computing classified 12 stations as uncorrelated that had been marked correlated by the k-means clustering approach. Similarly, the in-memory computing approach classified 13 stations as correlated that had been marked uncorrelated by k-means clustering.
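The simulated-data experiment above can be mimicked in software with a toy model. The sketch below is not IBM's implementation and does not model PCM physics: it assumes each device is a plain signed accumulator, and that every process that is "on" at a time step receives an update proportional to the centered count of "on" processes (a stand-in for the momentum M(k) mentioned in the paper excerpt further down). The function name, the parameter values and the way the correlation is generated are all illustrative assumptions.

```python
import random


def detect_correlated(n=200, n_corr=40, rate=0.1, coupling=0.5,
                      steps=4000, seed=0):
    """Toy correlation detector; returns one accumulator value per process.

    Assumption: the first `n_corr` processes are weakly correlated because
    each one copies a shared hidden event with probability `coupling`.
    All processes blink with the same marginal probability `rate`.
    """
    rng = random.Random(seed)
    g = [0.0] * n  # hypothetical stand-in for device conductance
    for _ in range(steps):
        shared = rng.random() < rate  # hidden event behind the correlated group
        x = []
        for i in range(n):
            if i < n_corr and rng.random() < coupling:
                x.append(shared)               # follow the shared event
            else:
                x.append(rng.random() < rate)  # blink independently
        # Centered "momentum": how many processes are on, minus the expected count.
        momentum = sum(x) - n * rate
        for i, on in enumerate(x):
            if on:
                g[i] += momentum  # only processes that are on get updated
    return g


g = detect_correlated()
correlated, uncorrelated = g[:40], g[40:]
print("lowest correlated accumulator:  ", min(correlated))
print("highest uncorrelated accumulator:", max(uncorrelated))
```

In this toy setup the accumulators of the correlated group should drift well above those of the independent processes, which is the software analogue of the correlated PCM devices ending up in a high conductance state.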

“Memory has so far been viewed as a place where we merely store information. But in this work, we conclusively show how we can exploit the physics of these memory devices to also perform a rather high-level computational primitive. The result of the computation is also stored in the memory devices, and in this sense the concept is loosely inspired by how the brain computes,” said Dr. Abu Sebastian, exploratory memory and cognitive technologies scientist, IBM Research and lead author of the paper.

Nature Communications – Temporal correlation detection using computational phase-change memory

Abstract

Conventional computers based on the von Neumann architecture perform computation by repeatedly transferring data between their physically separated processing and memory units. As computation becomes increasingly data centric and the scalability limits in terms of performance and power are being reached, alternative computing paradigms with collocated computation and storage are actively being sought. A fascinating such approach is that of computational memory where the physics of nanoscale memory devices are used to perform certain computational tasks within the memory unit in a non-von Neumann manner. We present an experimental demonstration using one million phase change memory devices organized to perform a high-level computational primitive by exploiting the crystallization dynamics. Its result is imprinted in the conductance states of the memory devices. The results of using such a computational memory for processing real-world data sets show that this co-existence of computation and storage at the nanometer scale could enable ultra-dense, low-power, and massively-parallel computing systems.

This system has two POWER8 CPUs (each comprising 10 cores) and 4 Nvidia Tesla P100 graphical processing units (attached using the NVLink interface). A detailed profiling of the GPU implementation reveals two key insights. Firstly, we find that the fraction of time computing the momentum M(k) is around 2% of the total execution time. Secondly, we observe that the performance is ultimately limited by the memory bandwidth of the GPU device. We then proceed to estimate the time that would be needed to perform the same task using a computational memory module: we determine the time required to compute the momentum on the memory controller, as well as the additional time required to perform the in-memory part of the computation. We conclude that by using such a computational memory module, one could accelerate the task of correlation detection by a factor of 200 relative to an implementation that uses 4 state-of-the-art GPU devices. We have also performed power profiling of the GPU implementation, and conclude that the computational memory module would provide a significant improvement in energy consumption of two orders of magnitude.

An alternative approach to using PCM devices will be to design an application-specific chip where the accumulative behavior of PCM is emulated using complementary metal-oxide semiconductor (CMOS) technology using adders and registers.

However, even at a relatively large 90 nm technology node, the areal footprint of the computational phase change memory is much smaller than that of CMOS-only approaches, even though the dynamic power consumption is comparable. By scaling the devices to smaller dimensions and by using shorter write pulses, these gains are expected to increase several fold [35, 36]. The ultra-fast crystallization dynamics and non-volatility ensure a multi-time scale operating window ranging from a few tens of nanoseconds to years. These attributes are particularly attractive for slow processes, where the leakage of CMOS would dominate the dynamic power because of the low utilization rate.

They performed a large-scale experimental demonstration of this concept using a million PCM devices, and could successfully detect weakly correlated processes in artificially generated stochastic input data. This experiment demonstrates the efficacy of this concept even in the presence of device variability and other non-ideal behavior. They also successfully processed real-world data sets from weather stations in the United States and obtained classification results similar to the k-means clustering algorithm. A detailed comparative study with respect to state-of-the-art von Neumann computing systems showed that computational memory could lead to orders of magnitude improvements in time/energy-to-solution compared to conventional computing systems.


10 thoughts on “IBM in memory computing with 1 million phase change memory devices is 200 times faster than regular computing”

goatguy

October 25, 2017 at 7:50 pm

An additional comment-fragment that I accidentally deleted …
________

One thing that is NOT recognized tho’ about this parallelized-on-chip search is this: yes, the upper bound of search time (for this example) is about 2.6 milliseconds. However in that time, one can potentially find ALL ‘hits’ on a pattern, whether 1 of them matches or millions. The actual hit-bandwidth is huge. Moreover, the specialized chips would have abundant extra memory (“computational memory”) that would hold result-sets for boolean or even ‘fuzzy’ logic association to other result-sets. Again, keeping the results OFF the bus, but in-memory for further selection and rejection based on the AI criteria imposed.

As I consider this, and my life of professionally working on all nature of computational projects, I know that I now am quite a bit too old to invent a novel AI/pattern language and computational processor ‘primitives’ code set for this kind of invention. Oh, I could – with a team of really bright spuds – but no longer am I so cheeky to think that this would be a “one man project” over a few years. But that realization is itself liberating – it doesn’t mean that it’s so sophisticated a problem as to defy authorship. Just that it is multidisciplinary, multi-person, drawing from dozens of highly unusual realms of study. Therefore, with high likelihood: it can be done. It has that feel.

Hope someone takes it up! I’m VOLUNTEERING to join the team! Spent over 30 years thinking on this problem.

GoatGuy
- doctorpat
  
  October 26, 2017 at 1:59 am
  
  So…. Elon Musk, I assume you follow NBF?
  
  Call Goat.
goatguy

October 25, 2017 at 7:38 pm

Sigh… for the old-timers this’ll be a rehash.

Some time ago (about 25 years back now) I began to think that the solution to the “Von Neumann” bottleneck¹ was going to inescapably require “distributed computing”. This was, of course, somewhat before the now-ubiquitous development and roll-out of [ 2, 4, 6, 8, 12, 14 ] core [ 1, 2, 4? ] thread multicore, multi-threaded CPUs. And this was before the 100-to-4,000 ‘core’ GPU nanoprocessors that can all be enslaved simultaneously at computing significant – albeit small – code in parallel.

My vision was and remains: the cointegration of a very special (and necessarily rapidly evolving) type of memory chip that is itself a processor. Not any-ol’ processor, but one that is integrated into the planar (future: cubic) physical layout of the device.

To demonstrate – the most fundamental present-day memory chip architecture consists conceptually of a big array of cells, rimmed by a whole bunch of logic and analog stuff that can write, read, refresh and queue the bits contained in the array. Physically, this is very often implemented as “an array of arrays of arrays of rows-and-columns”. But no matter. It’s still an array.

The array in ‘planar space’ is 2 dimensional. Rows and columns of bits. The thing is that when a chip’s electronics “activates a row”, the entire span of columns is read out simultaneously. (Well, this for an array-of-arrays-of-arrays does make a difference from a power consumption point of view.)

Say our (primitive) memory is organized as 65,536 rows (16 bit address) and each row energizes 4,096 columns of data (12 bit address). 16 + 12 = 28 bits. 2²⁸ = 65536 × 4096 = 268,435,456 bits of addressable memory. One quarter gigabit. This isn’t even a very ancient (primitive!!!) chip. One that one could find today in 5 year old computers.

Anyway, every time the onboard electronics activates a row, all 4096 bits in that row potentially (if not subdivided into further arrays) would be present to the readout circuitry ‘around the edge’. Modern chips have no problem getting from so-called row-strobe to column-data-readout in 20 ns or faster. Memory is seriously fast stuff. Seriously fast.

In theory the chip might call for another completely different row in as little as 40 ns. Time for things to settle down. Well if you invert that ( 1 ÷ 40×10⁻⁹ sec = 25,000,000 per sec ), and multiply by 4,096 you get 102,400,000,000 bits per second of per chip bandwidth (on chip).

REMEMBER this is an example. Modern chips do WAY better than this, way faster. Point though is that 102,400 Mbit/s = 12.8 GByte/s. The point is, in merely 2.6 milliseconds, the entire chip’s worth of memory can be read. Edge-to-edge. All 32 megabytes of this chip. 12 GB/sec. Roughly ¹/₄₀₀ second. About 400 full-chip scans a second.

COMPUTE what this means. Say we have 1 Terabyte of memory, full of patterned stuff. Information. Data. Knowledge. We only have Goat’s ¼ Gbit chips. ¹/₃₂ GB/chip. 1 TB = 1024 GB = 1,048,576 MB = 1,073,741,824 kB = 1,099,511,627,776 bytes. Lots of bytes. We take 1,048,576 MB and divide by 32 MB/chip and get 32,768 chips. While we know that that is a lot of chips, consider modern FLASH packaging: very often as many as 16 chips per side of the PCBoard are stacked up. 32 of them in a dime-sized PCB. Ignoring the heat-problem for now, 32,000 chips becomes 1,000 packages. One can easily fit 24 of those per long DIMM socket (double sided). It’s still “a stick”. 1,000 ÷ 24 = 42 or so sticks. This is not a huge amount of space. You know? I have a small ‘workgroup’ server that has 32 sticks of DRAM. Just saying.

Anyway, all of that memory can be searched in about 2.6 milliseconds if the computational “searching” elements are on each memory chip, since every chip scans itself in parallel. Think about that… roughly 400 end-to-end, full-memory searches for arbitrary patterns every second.
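The back-of-the-envelope figures in this comment can be re-run as a short Python sketch. It only uses the numbers given above (65,536 rows, 4,096 columns, a 40 ns row cycle, 32 chips per package, 24 packages per stick); the variable names are illustrative, and binary units (1 TB = 2⁴⁰ bytes) are assumed, as in the comment.

```python
import math

ROWS = 65_536        # 16-bit row address
COLS = 4_096         # bits presented per row activation
ROW_CYCLE_NS = 40    # assumed time between activating two different rows

chip_bits = ROWS * COLS                  # capacity of one chip, in bits
chip_bytes = chip_bits // 8              # the same capacity in bytes (32 MiB)

# On-chip bandwidth: one full row of COLS bits every ROW_CYCLE_NS nanoseconds.
bandwidth_bits_per_s = COLS * 1_000_000_000 // ROW_CYCLE_NS

# Reading the whole chip means activating every row exactly once.
full_scan_ns = ROWS * ROW_CYCLE_NS
scans_per_second = 1_000_000_000 / full_scan_ns

# How much hardware holds 1 TB built from these chips?
TERABYTE = 2 ** 40
chips = TERABYTE // chip_bytes
packages = chips // 32                   # 32 stacked chips per package
sticks = math.ceil(packages / 24)        # 24 packages per DIMM-style stick

print(f"chip capacity : {chip_bits:,} bits ({chip_bytes:,} bytes)")
print(f"bandwidth     : {bandwidth_bits_per_s:,} bit/s")
print(f"full scan     : {full_scan_ns / 1e6:.2f} ms ({scans_per_second:.0f} scans/s)")
print(f"1 TB needs    : {chips:,} chips, {packages:,} packages, {sticks} sticks")
```

Because each chip scans only its own contents, the time for a full search of the terabyte equals the time for a single chip scan, about 2.6 ms, no matter how many chips are added. With the exact power-of-two figures, the 1,024 packages come to 43 sticks.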

In A.I. (somewhat) and Big Data Analysis (especially) there are a LOT of data lookup situations that cannot be indexed meaningfully.² Searching for REGEX patterns³ over huge amounts of search-space is particularly problematic. Genetic pattern searching is like this, too.

GoatGuy
____

¹ von neumann bottleneck – Concept: processor reads memory for instructions, data. Does stuff. Puts results back into memory. Repeats. Memory bus is bandwidth limiting BOTH for instructions and data (and results).

² indexed – The quite-refined computational Art of organizing “values” in an “ordered arrangement” that can later be “looked up” thousands-to-billions of times faster “by value” than by linearly searching the data store end-to-end. For some kinds of data (e.g. “customer number”, “first, last name”, “address” or “DNA snippet”), indexing is the core of ‘the magic’ that allows HUGE databases to be searched in milliseconds to find all nature of stuff. It is even good at “searching all indexed values” for arbitrary patterns (which can’t be organized beforehand) from the index. The index which often is mere ‘thousandths’ the size of the database. With A.I. relationships and inference patterns tho’, the “index” is often as large or substantially larger than the ‘database’ itself! Cheaper (in time) to just search the database.

³ REGEX patterns – e.g. /bob/ finds the string ‘bob’ in a string (like a record, a sentence, a paragraph, this article, today’s NYTimes), if it’s in there. It finds ‘bob’ in ‘bob nelson’, ‘thingamabob’, ‘bobbin’ and ‘discombobulation’. More sophisticated are such like /bob.*son/ which will find a ‘bob’ followed by anything and a ‘son’. So this matches only ‘bob nelson’ in this little example. You might imagine how that is more computationally complex – and non-indexable if chosen as a random search.
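The two patterns in this footnote can be tried directly with Python's standard `re` module. This is just a small illustration of the footnote's example; the list and variable names are made up for the demonstration.

```python
import re

samples = ["bob nelson", "thingamabob", "bobbin", "discombobulation"]

simple = re.compile(r"bob")          # the literal substring 'bob'
wildcard = re.compile(r"bob.*son")   # 'bob', then anything, then 'son'

# search() scans the whole string, so a match anywhere in it counts as a hit.
simple_hits = [s for s in samples if simple.search(s)]
wildcard_hits = [s for s in samples if wildcard.search(s)]

print("/bob/       matches:", simple_hits)
print("/bob.*son/  matches:", wildcard_hits)
```

Every sample contains ‘bob’, but only ‘bob nelson’ also has a later ‘son’. Note that each string has to be scanned to decide this, which is the linear-search cost the comment is describing.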
- Jerry Glenn
  
  October 26, 2017 at 9:12 am
  
  Seems a variant of “Plug and Chug”. To whatever extent AI permits a standard problem statement, the solution would be dramatically faster than a “Blank sheet” start with basic proofs and reformulations. Perhaps somewhat like your standard spreadsheet starting points.
mzso

October 25, 2017 at 5:12 pm

Just like HP’s memristor computing technology, it’s so wondrous that it only exists on paper.
- James Haswell
  
  October 25, 2017 at 7:55 pm
  
  Yeah I was so excited over a decade ago when they first were being talked about. The thing about memristors is that my guess would be, to really take advantage of them you’d have to redesign the whole idea of how a home computer worked. That’s expensive and time consuming. I’d love to see it happen. I’d be curious to see what might be possible if you started from scratch and included this and memristors and a gaggle of other new potential game changers.
- James Haswell
  
  October 25, 2017 at 7:56 pm
  
  Also, what’s up with the new commenty thing? Is my old account gone? I can’t comment on my own comment? Who moved my cheese?
  - mzso
    
    October 28, 2017 at 10:08 am
    
    Your old account is not gone, but you can’t use it here anymore…
    Even if you have a wordpress (new commenting system) account you might not be able to log in, as is the case for a few of us. 🙂
    Even so I have reply buttons to all comments, even my own. But I can’t edit, nor do I get notifications.
- James Haswell
  
  October 25, 2017 at 7:57 pm
  
  Ah, comment threads are only 1 level deep and you have to reply to the OP.
Jan Jansson

October 25, 2017 at 2:35 pm

The header is misleading. The current simulation was not 200 times faster than a conventional computer; the authors only calculate that it may be in the future. Let us withhold our cheers until the scientists actually demonstrate a high throughput…
