Roadmap to human cortex scale neuromorphic hardware systems using analog technology

It should be possible to build a silicon version of the human cerebral cortex with the transistor technology that was in production in 2013. The resulting machine would take up less than a cubic meter of space and consume less than 100 watts, not too far from the human brain. This article summarizes the work of Jennifer Hasler and Bo Marr, published in Frontiers in Neuroscience as “Finding a roadmap to achieve large neuromorphic hardware systems.”

Computational power efficiency for biological systems is 8–9 orders of magnitude higher (better) than the power efficiency wall for digital computation. Analog techniques at a 10 nm node can potentially reach this same level of biological computational efficiency. Figure 1 shows the huge potential for neuromorphic systems: the community has a lot of room left for improvement, and there are potential directions for achieving it with technology already being developed; new technologies only improve the probability of this potential being reached.

Figure 1. A spectrum showing the computational efficiency of various technologies, including digital technologies, analog Signal Processing (SP), and a best estimate of biological neuron computation. Three orders of magnitude have produced amazing improvements in digital technology, from speak-and-spell devices (Frantz and Wiggins, 1982) to current-day smart phones. Three orders of magnitude in analog SP approaches promise similar advancements as the technology becomes a stable capability. Biological neurons show a potential of five more orders of magnitude of improvement, opening further opportunity for efficient computational devices. Further, this observation suggests one definition of efficient neuromorphic systems: those physically implemented algorithms that improve power efficiency beyond the analog SP metrics.

Synapses and Soma: The floating-gate transistor [top left], which can store differing amounts of charge, can be used to build a “crossbar” array of artificial synapses [bottom left]. Electronic versions of other neuron components, such as the soma region [right], can be made from standard transistors and other circuit components.

Large-Scale Neuromorphic Systems
Although the eventual goal would be the complexity of the human brain, it remains beneficial to consider intermediate steps as well, such as a limited region of cortex, or smaller nervous systems like that of a mouse. Estimates of the number of neurons in the human brain range between 100 billion and one trillion (Williams and Herrup, 1988; Mead, 1990; Azevedo et al., 2009), although the most recent data lean toward 100 billion (Azevedo et al., 2009). The mouse brain is estimated at roughly 100 million neurons (Williams, 2000). The size of the cortex is roughly proportional to the sensor size of the incoming signals (Allman, 2000), and tends to be correlated with body size in mammals (Allman, 2000). Further, building a cortex in a handheld device imposes significant additional constraints on area and power consumed.

Dendritic computation results in computational efficiency improvements over analog SP techniques. The first approach was a compiled FPAA design, showing an order of magnitude increase, with the second, more optimized configurable approach potentially enabling three orders of magnitude over analog SP techniques (Ramakrishnan et al., 2012). The second approach was based on a local, configurable architecture (FPGA/FPAA) for routing neurons with a high percentage of local connectivity.

A digital system using 8-bit MAC arithmetic uses 30 million times more energy than the biological computation numbers. Analog signal processing techniques have been shown to improve computational efficiency by a factor of 1000, on average, for many algorithms. If we implement the biological approach as a sequence of VMM computations and similar approaches, efficiencies of roughly 10 MMAC/μW, or 10 TMAC/W, would be achieved; analog VMM and similar approaches are in the 1–10 TMAC/W range. Understanding neural computation offers opportunities for significant improvement in computational efficiency (500,000 times).

A typical value for a VMM compiled in an FPAA would be at the 10 MMAC/μW (= 10 TMAC/W) power level. By utilizing the computational efficiency of dendritic structures for wordspotting approaches, a basic compiled structure with large node capacitances (i.e., ≈ 1 pF) shows an improvement in power efficiency of a factor of 10; a more dedicated approach would show an improvement of a factor of 450 over the VMM structure. Decreasing the resulting power supply to biological levels (Vdd = 180 mV) shows another factor of 10 improvement in power efficiency (45 PMAC/W). All of these factors, with typical node capacitances, result in structures within two orders of magnitude of the power efficiency of biological systems; the Si internode capacitance could be further decreased as nodes scale down. These neuromorphic techniques show promise to approach the computational efficiency and raw computational power of mammalian nervous systems.
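The chain of improvement factors in this paragraph can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function and variable names are invented for this example, and the factors are simply the ones quoted above, multiplied in sequence.

```python
def mmac_per_uw_to_tmac_per_w(x: float) -> float:
    """1 MMAC/uW = 1e6 MAC/s per 1e-6 W = 1e12 MAC/s per W = 1 TMAC/W."""
    return x * 1e6 / 1e-6 / 1e12

# Baseline quoted in the text: a VMM compiled in an FPAA.
vmm = mmac_per_uw_to_tmac_per_w(10.0)        # TMAC/W

# Improvement factors quoted in the text, all relative to the VMM baseline.
compiled_dendrite = vmm * 10                 # compiled structure, ~1 pF nodes
dedicated_dendrite = vmm * 450               # more dedicated structure
low_vdd_dendrite = dedicated_dendrite * 10   # Vdd lowered from 2.5 V to 180 mV

# Strictly linear scaling in Vdd would give 2.5 / 0.18, slightly more than
# the rounded factor of 10 used in the text.
linear_vdd_factor = 2.5 / 0.18

for name, tmac_per_w in [
    ("analog VMM", vmm),
    ("compiled dendrite", compiled_dendrite),
    ("dedicated dendrite", dedicated_dendrite),
    ("dendrite at 180 mV", low_vdd_dendrite),
]:
    print(f"{name:>20}: {tmac_per_w / 1000:g} PMAC/W")
```

With the rounded factor of 10 for the supply reduction, the last line reproduces the 45 PMAC/W figure; the strictly linear factor (about 13.9) would put it somewhat higher.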

Plot of computational efficiency versus capacitance level for VMM (analog) and Dendrite computation (neuromorphic, wordspotting) physical algorithms for Vdd = 2.5 V. For both algorithms, the efficiency improves linearly with decrease in Vdd, since power scales linearly with Vdd here. We also show the computational efficiency for the dendrite computation for Vdd = 180 mV, typical of neurobiological systems (Siwy et al., 2003). We also include a table of effective SNR, computed from thermal noise at the node over signal size (≈UT), as a function of capacitance.
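As a rough illustration of how such an SNR table could be produced, the sketch below assumes the thermal noise at a node is the usual kT/C noise (RMS voltage sqrt(kT/C)), that the signal size is the thermal voltage UT = kT/q, and that the temperature is 300 K. These are assumptions made for this example; the paper's exact method and values may differ, and the function names are invented here.

```python
import math

K_B = 1.380649e-23     # Boltzmann constant, J/K
Q_E = 1.602176634e-19  # elementary charge, C

def thermal_voltage(temp_k: float = 300.0) -> float:
    """Thermal voltage UT = kT/q, in volts (assumed signal size)."""
    return K_B * temp_k / Q_E

def ktc_noise_rms(cap_f: float, temp_k: float = 300.0) -> float:
    """RMS thermal noise voltage on a capacitor, sqrt(kT/C), in volts."""
    return math.sqrt(K_B * temp_k / cap_f)

def effective_snr_db(cap_f: float, temp_k: float = 300.0) -> float:
    """Assumed effective SNR: signal (UT) over kT/C noise, in dB."""
    return 20 * math.log10(thermal_voltage(temp_k) / ktc_noise_rms(cap_f, temp_k))

# Example capacitances chosen for illustration only.
for cap in (1e-12, 100e-15, 10e-15, 1e-15):
    print(f"C = {cap * 1e15:7.0f} fF -> SNR = {effective_snr_db(cap):5.1f} dB")
```

Under these assumptions the SNR scales with the square root of the capacitance, so every tenfold reduction in node capacitance costs 10 dB of SNR while improving power efficiency.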

Commercial Considerations to Drive these Systems

Although one can discuss how to build a cortical computer at the scale of mammals and humans, the question is how the technology developed for these large systems will impact commercial development. The cost of the ICs alone for a cortex would be approximately $20 M at current prices, which, although possible for large users, would not be commonly found in individual households. Throughout the history of the digital processor, commercial market opportunities have driven progress in the field. Getting neuromorphic technology integrated into the commercial environment allows us to ride this powerful economic “engine” rather than pull it.

A range of different ICs and systems will be built, each aimed at a different target in the market. There are options for even larger networks, or for integrating these systems with other processing elements on a chip or board. When moving to larger systems, particularly ones with 10–300 chips (300 million to a billion neurons) or more, die stacking becomes attractive, decreasing both the communication capacitance and the board complexity. Stacking dies should increase the final chip cost roughly in proportion to the number of dies stacked.

Neuromorphic systems are gaining increasing importance in an era where CMOS digital computing techniques are reaching physical limits.

Potential of a Neuromorphic Processor IC

In another case, we will consider a large die of 400 mm², the size of an entire reticle, typical of microprocessor ICs, graphics ICs, and other higher end commercial ICs. We might expect a chip cost in the $100 range, resulting from a die cost under $50 per die, given current pricing models. These chips would probably exist in handheld or other electronic devices that sell above the $350 range, which enables a wide range of commercial applications. In a 400 mm² area, we could imagine a network of 30,000,000 cortical neurons, resulting in 500 TMAC equivalent computation in 50 mW of power. We assume roughly 10,000 neurons project outside of the IC per second, which with addressing bits would require roughly 256 kb/s, resulting in 8 mW of average output communication power.
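The figures in this paragraph can be cross-checked with a short calculation. This is a sketch under stated assumptions: each outgoing neuron event is taken to carry just enough address bits to identify one of the 30 million neurons, and "10,000 neurons per second" is read as 10,000 output events per second. The variable names are made up for this example.

```python
import math

NEURONS = 30_000_000          # cortical neurons on the die (from the text)
COMPUTE_MAC_PER_S = 500e12    # 500 TMAC equivalent computation (from the text)
COMPUTE_POWER_W = 50e-3       # 50 mW (from the text)
OUTPUT_EVENTS_PER_S = 10_000  # assumed reading of "10,000 neurons ... per second"

# Implied computational efficiency, in PMAC/W.
efficiency_pmac_per_w = COMPUTE_MAC_PER_S / COMPUTE_POWER_W / 1e15

# Assumption: one event carries the minimum address width for NEURONS sources.
address_bits = math.ceil(math.log2(NEURONS))
output_bits_per_s = OUTPUT_EVENTS_PER_S * address_bits

print(f"efficiency:   {efficiency_pmac_per_w:g} PMAC/W")
print(f"address bits: {address_bits}")
print(f"output rate:  {output_bits_per_s / 1e3:g} kb/s")
```

The implied efficiency of about 10 PMAC/W sits inside the range discussed for dendritic computation, and the estimated output rate of about 250 kb/s is consistent with the roughly 256 kb/s quoted above.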

By comparison, these numbers effectively describe a handheld device with computational power rivaling the largest of today’s supercomputers, while consuming less power than most handheld devices, and at a price point that would fit higher end commercial devices such as tablets or laptops. Potential applications would include the speech recognition examples for the smaller chip, as well as (or in addition to) image processing emulation, particularly on 1 M pixel images, including receptive field processing, image/scene classification, and pre-attention mechanisms.

Writing/Reading Synapse Values from a Cortical Model
If the synapse strengths/weights are learned, this alleviates the need for loading a large number of parameter values into a system. Assuming we are loading a cortex of 1000 trillion synapses, this requires significant communication time and overall system power. The computations use 10 bit accuracy for the device values, 300 pF system load capacitance, and Vdd at 2.5 V. We expect to have many parallel input data streams to load the entire array for a sustained rate of 11.3 Tbit/s, probably coming from multiple memory sources to hold the 1000 TByte golden memory target.

Loading a single IC with 10⁹ synapses (say 10⁶ neurons) in a second would require a 10 Gbit/s data link into the IC, requiring 1.6 W for communication for a 50 pF load (the minimum level for IC test with a zero-insertion force socket). The challenge of programming this number of synapses in parallel on chip is manageable, and the resulting power requirements are significantly less than those of the data communication. These numbers directly impact the final cost of such a system: IC testing can be a significant cost in manufacturing a final product, and loading values in 1 s removes one such product limitation. For the 1000 trillion synapse data loading, the power consumption and performance will be limited by the system communication, not the IC complexity.
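A back-of-the-envelope version of these loading numbers is sketched below. It reads the single-IC case as 10^9 synapses at 10 bits each, and it assumes the link power follows the standard dynamic switching estimate, activity × C × Vdd² × bit rate, with an activity factor of 0.5 (half the bits cause a transition). That activity factor is an assumption chosen for this example, not something stated in the source; the function name is likewise hypothetical.

```python
def link_power_w(bit_rate: float, load_cap_f: float,
                 vdd: float = 2.5, activity: float = 0.5) -> float:
    """Dynamic power to drive a capacitive load: activity * C * Vdd^2 * rate.

    The activity factor of 0.5 is an assumption for this sketch.
    """
    return activity * load_cap_f * vdd ** 2 * bit_rate

BITS_PER_SYNAPSE = 10  # 10 bit accuracy for the device values (from the text)

# Single IC: 10^9 synapses loaded in one second over a 50 pF load.
chip_bits = 10**9 * BITS_PER_SYNAPSE
chip_rate = chip_bits / 1.0                    # bit/s
chip_power = link_power_w(chip_rate, 50e-12)   # W

# Whole cortex: 1000 trillion synapses at the sustained rate from the text.
cortex_bits = 10**15 * BITS_PER_SYNAPSE
cortex_load_s = cortex_bits / 11.3e12          # seconds at 11.3 Tbit/s

print(f"chip link rate:   {chip_rate / 1e9:g} Gbit/s")
print(f"chip link power:  {chip_power:.2f} W")
print(f"cortex load time: {cortex_load_s / 60:.1f} min")
```

Under this assumption the single-IC link comes out at roughly 1.6 W, matching the figure in the text, and a full cortex load at 11.3 Tbit/s takes on the order of a quarter of an hour.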

For a 20 W system, loading the weights frequently is not possible; this point further illustrates the untenable case of storing synapse weights in one place and using them somewhere else, even in a multiplexed system. Once a memory is programmed, adapted, and/or learned, reloading the memory is costly; therefore, non-volatile memory is critical to minimize the cost of loading a system. On the other hand, occasionally loading an entire cortex of 1000 trillion synapses, say on the order of once a day, is a feasible proposition, as well as having programmed code at the initial condition or reset condition for a commercial machine.

Building a supercomputer-like structure to perform the computations of the human cortex is within our technical capability; it is more a question of funding (research and development) and manpower. Figure 25 shows a representative cortical system architecture of silicon neuron structures. The heavy emphasis on local interconnectivity dramatically reduces the communication complexity. We show these capabilities are possible in purely CMOS approaches, not necessarily relying on novel nanotechnology devices.

Figure 25. A potential view of how one could build a brain/cortical structure; the approaches follow constraints outlined throughout this discussion. The approach could be integrated as a set of boards with a large number of neural ICs, where at each level of complexity, local communication is emphasized for power efficient computation as well as low integration complexity. Most of the on-chip communication would be local, and most of the chip-to-chip communication would be between neighboring ICs in an extended FPGA-like fabric. The system would interface to typical biological sensors, like retina (vision), microphones for audition, and chemical sensors, as well as non-biological (i.e., communication spectrum) inputs. A particular neuron array could be integrated with additional FPAA structures enabling integration of analog SP for the front-end processing (i.e., acoustic front-end processing).

Figure 26 shows the potential computational energy efficiency for digital systems, analog signal processing, and potential neuromorphic hardware-based algorithms. Computational power efficiency for biological systems is 8–9 orders of magnitude higher than the power efficiency wall for digital computation; analog techniques at a 10 nm node can potentially reach the same level of computational efficiency. The resulting tradeoffs show that a purely digital circuit approach is less likely because of the differences in computational efficiency. These approaches show huge potential for neuromorphic systems: we have a lot of room left for improvement (Feynman, 1960), and there are potential directions for achieving it with technology already being developed; new technologies only improve the probability of this potential being reached.

Figure 26. A summary comparison of power efficient computational techniques, including digital, analog Signal Processing (SP) techniques, and the potential for neuromorphic physical algorithms. The potential of 8–9 orders of magnitude of achievable computational efficiency encourages a wide range of neuromorphic research going forward.

Probably the largest hurdle is not about what we can build, but identifying novel, efficient computation in neurobiology and employing these techniques in engineering applications.