7 Nanometer Chips and 64 Core AMD Threadripper

At CES, AMD announced the 64-core Threadripper line (3990X). It will have super-high GHz clocks and a package price of only $4,000.

Nextbigfuture reader and commenter Goatguy notes that combining the AMD super chip with 512 GB of RAM ($3,000, ECC, DDR4-3200), 4 TB of NVMe storage ($2,000), a pair of top-shelf Nvidia graphics-and-compute cards ($1,800 for the pair), a motherboard to support it all ($600), and a case and power supply ($600) would cost about $12,000. The system would surpass the world’s reigning supercomputer of 20 years ago: ASCI White, at roughly 12.3 TeraFLOPS of peak performance.

This would be ONE MONTH’S PAY for a competent software engineer in Silicon Valley.

…, that’s just jaw-dropping.
Absolutely bonkers.
The NEXT big future in computing.

If you were doing AI work, you could fit almost all of the literature of the world … ever written … in memory alone.

AMD Ryzen™ Threadripper™ 3990X
Specifications
# of CPU Cores: 64
# of Threads: 128
Base Clock: 2.9GHz
Max Boost Clock: Up to 4.3GHz
Total L1 Cache: 4MB
Total L2 Cache: 32MB
Total L3 Cache: 256MB
Process Technology: TSMC 7nm FinFET

AMD announced many high-performance 7nm CPUs across the laptop, desktop, high-end desktop, and server spaces. The flagship mobile part, the Ryzen 7 4800U, will have eight cores/16 threads in only a 15W TDP (thermal design power).

The UK Research and Innovation (UKRI) center announced it has contracted with Cray to build a supercomputer based on AMD’s EPYC “Rome” server processors. ARCHER2 will use 5,848 liquid-cooled Shasta Mountain compute nodes, each of which houses a pair of 64-core/128-thread EPYC processors clocked at 2.2GHz. In total, ARCHER2 will crunch through workloads using 748,544 cores and 1.57 petabytes of memory. UKRI estimates peak performance to be around 28 PFLOP/s.
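As a back-of-envelope check (assuming EPYC “Rome” can retire 16 double-precision FLOPs per core per cycle, an assumption on our part rather than a figure from the announcement): 748,544 cores × 2.2 GHz × 16 FLOP/cycle ≈ 26.3 PFLOP/s, in the same ballpark as UKRI’s 28 PFLOP/s peak estimate.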

20 thoughts on “7 Nanometer Chips and 64 Core AMD Threadripper”

  1. MasPar had a construct that let the expression be evaluated on each of the 64k processors simultaneously. Sadly, MasPar sleeps with the fishes.

    Try “goroutines” in Google’s GO.

  2. “reasonably common”

    Weasel words. Tell me why any reasonably common personal computer needs more than four CPU cores.

    The “canonical” examples of why people would want 32 or 64 core machines are Photoshop, Blender, compiling code, etc.

    There are a large number of users who can get away with substandard quad-core CPUs, and Intel caters well to that market.

  3. Having lots of fast memory together with massively parallel computing is essential. This is one thing GPU cores fundamentally lack.
    A flow simulation over a 2-bladed propeller with ~8 million cells would cost me 16 Ryzen cores and 32GB of RAM for a period of 72 hours. That, however, is a single-precision, minimum-resolution study on a relatively small domain.
    Now, parameter-based optimization would be nice (which means re-meshing the entire solution each time). 100 iterations × 72 hours + re-meshing (100 × 4 hours), all doubled for double precision (×2), = 633 days on my workstation,
    or 158 days on the alternative Threadripper setup,
    or 80 two-socket Threadripper blades burning 24kW for 24 hours.

    Progress, but not a game-changer. (Arithmetic spelled out below.)
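    Spelling out that arithmetic: (100 × 72 h + 100 × 4 h) × 2 = 15,200 hours ≈ 633 days on the 16-core workstation; treating a 64-core Threadripper as roughly 4× that machine gives 633 ÷ 4 ≈ 158 days.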

  4. Yes, Amdahl’s law sets the limit (see the formula at the end of this comment).
    http://www.imagebam.com/image/81195e1330418147
    For the vast majority of applications (for home usage) the useful number of cores is around 16-32! So practically we’ve maxed it out!!!

    In the next two years the next big thing will be CPUs that can turn single-threaded code into multithreaded code at the hardware level, so programs won’t need any special multi-threaded coding, and you won’t need ‘thread dissociation’ operators in languages.
    That will make it reasonable to buy 16-core CPUs for the home, and in this way we can almost double the speed of general applications, but it’s a one-time gain.
    In highly parallel applications you get quite negligible speed-up after 128 cores.

    So we have maxed out the number of cores. 
    The next thing is 3D stacking.
    https://youtu.be/XW_h4KFr9js?t=248
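    For reference, Amdahl’s law gives the speedup on N cores as S(N) = 1 / ((1 − p) + p/N), where p is the parallelizable fraction of the program. With an illustrative p = 0.95 (our assumption, not the commenter’s), S(128) ≈ 17.4 while the N → ∞ limit is 20, which is why cores beyond 128 buy almost nothing.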

  5. Yes, one size does not fit all.

    But can you name some reasonably common parallel processing task for a personal computer – not a server – for which 64 CPU cores/128 threads are very beneficial – let alone essential – AND which wouldn’t run better on a decent GPU?

  6. Some factual faults there…

    The 3990X is Threadripper, not EPYC.
    It is the same silicon but with some significant differences, like half the number of memory channels and no registered ECC memory support.
    “Super-high clock frequency” is not correct either. Clocks are about the same as they have been for years, and Intel is still in the lead when it comes to clock frequency and single-core performance.

    The EPYC line was released months ago, and the corresponding 64-core CPU is named the EPYC 7742.

  7. That’s more like OpenMP.

    C++17’s standard library has parallel for-each loops that operate on STL containers, which is basically 90% of what a competent developer needs to pick the low-hanging fruit of parallelization (see the sketch at the end of this comment).

    We are seeing a transition from “how do I write this code to be as efficient as possible” to “how do I write this code so that it can work in parallel”. What may be inefficient when running on a single CPU will become a clear winner when run on 4, 8, or 16 CPUs.

    When writing algorithmic pseudocode it is normal for me to state “this particular loop is parallelized” and to design all data structures around isolating the threads from each other.
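    A minimal sketch of the C++17 parallel algorithms being described, assuming an illustrative vector-of-doubles workload (on GCC the parallel execution policies typically also require linking TBB, e.g. -ltbb):

      // C++17 parallel for_each over an STL container.
      #include <algorithm>
      #include <cmath>
      #include <execution>
      #include <vector>

      int main() {
          std::vector<double> v(1'000'000, 2.0);
          // Each element is independent, so the runtime is free to spread
          // the work across all available hardware threads.
          std::for_each(std::execution::par, v.begin(), v.end(),
                        [](double& x) { x = std::sqrt(x); });
      }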

  8. Every desktop PC is a heterogeneous computing environment.

    You have a GPU that works with very fast but small per-thread memory, thousands of threads, and very simple logic.

    You have a CPU that works with somewhat slower but large per-thread memory, disk I/O, up to 128 threads, and complex logic.

    One size does not fit all.

  9. The GPU boards have had on-board memory for many years, I think precisely due to bandwidth limits. AMD has “APU” SoCs that combine GPU and CPU cores on the same chip. Though as I recall, without on-board memory, other than cache. But cache sizes have been growing too.

    Meanwhile, RAM and long-term storage seem to be on a converging path, and there was some (minor?) talk of in-memory computing. Things like ferroelectric gate transistors can do both computing and memory storage. The ferroelectric also happens to be non-volatile.

    Perhaps eventually, RAM, storage, and processors will be the same component.

  10. Back in 2000, the ‘ASCI White’ supercomputer was a grid of over two hundred networked cabinets, each about six feet high, that covered a floor area equal to about two basketball courts. It weighed 106 tons and ate 1.2 megawatts of power.

  11. I remember when having an AMD CPU and mobo full of Kingston RAM necessitated placing a fire extinguisher next to your computer.

  12. But how much highly-parallel algorithm processing is done on the CPU these days, beyond handling independent or pipelined operations of complex software systems, or servers handling lots of independent processing? I.e. where lots of threads are running lots of different code…

    The really big uniform-data/highly parallel processing tasks (e.g. deep learning stuff) get shoved over to GPUs.

    Also – in what world is a 4.3GHz ‘Max boost clock’ “super-high GHz”? I was hoping for at LEAST 10GHz, from that bit of hyperbole…

  13. It would be hugely helpful if there were ‘thread dissociation’ operators in high-level languages such as C, C++ and so on. Things such as

    | pragma multithread;

    | for( int i = 0; i < 1e6; i+=100 )
    | {
    |   for( int j = 0; j < 100; j++ )
    |   {
    |     int k = i + j;
    |     compute( k, func1(k, z[k], y[k] ), func2(k) ) &;
    |   }
    |   continue &;
    | }

    | endpragma multithread;

    The idea is that the ( &; ) operator spawns the related statement and doesn’t demand completion before the for() loop continues its cycle.

    Likewise, “continue &;”: the loop can move on while compute() threads are still running, even if they’re not complete, and allows as many to queue up as the compiler-runtime limit permits. (A rough OpenMP approximation is sketched after this comment.)
    ________________________________________

    That said, I invested a few years of my high-spot intellectual growth into fleshing out my version of a “D” language (no relation to the present D language) that did exactly the above. 

    Key thread-release constructs were 

    &; … statement spawning
    &} … block spawning

    and with quite special execution rules

    &, … func arg spawn

    Further, in the language development I did, it was key that a LOT of implicit code-analysis parallelism on the ‘fiber’ (smaller than thread) level could, would, and SHOULD be auto-magically confined to the compiler’s wit. Not whim. 

    A whole lot of code can be analyzed far above individual statement, operator and assignment protocols to infer time-and-resource-useful fiber execution setup calculus. 

    ⋅-⋅-⋅ Just saying, ⋅-⋅-⋅
    ⋅-=≡ GoatGuy ✓ ≡=-⋅
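    For comparison, a rough present-day approximation of the “&;” statement-spawning idea using OpenMP tasks in C/C++ (compute, func1, func2, z, and y are stand-ins carried over from the pseudocode above, not a real API; compile with -fopenmp):

      void   compute(int k, double a, double b);
      double func1(int k, double z, double y);
      double func2(int k);
      extern double z[], y[];

      void run() {
          #pragma omp parallel
          #pragma omp single      // one thread walks the loops and spawns tasks
          for (int i = 0; i < 1000000; i += 100) {
              for (int j = 0; j < 100; j++) {
                  int k = i + j;
                  // Like "&;": spawn the statement and keep looping
                  // without waiting for it to complete.
                  #pragma omp task firstprivate(k)
                  compute(k, func1(k, z[k], y[k]), func2(k));
              }
          }   // implicit barrier: all spawned tasks finish here
      }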

  14. The programming paradigm is moving toward many more threads than cores, where each thread does a modest amount of work. Push parallelism deeper into the code.

  15. LOL… your first post said ‘minutes’. The email-alerter never lies!

    Yes, Linux supports a modestly effective NUMA (non-uniform memory access) model; the upside is “when it works, it’s great”, and “when it doesn’t, well … it’s not great”. 

    When there are hundreds-to-thousands of threads, the NUMA support in the OS tries to reassign a revived thread to perhaps not the same core, but at least another of the cores-and-threads on the “same side” of the non-uniform memory-access scheduling barrier. That way, … in theory …, the revived thread can pick up where its paused-or-suspended prior incarnation left off, at exactly (hopefully) or almost the same execution speed and efficiency.  

    However, in the real world, the overhead of having to wait for the RIGHT particular handful of core-thread workers to “free up” to revive a suspended thread is very often greater … in total elapsed time … than if the suspended thread were revived on a non-optimal cross-barrier processor.  

    The problem with saying, “well, just do it”, is that ALL of the threads being serviced by non-optimal CPU core assignments end up markedly increasing demand on the NUMA cross-channel memory I/O queue, which in turn slows down everything INCLUDING the threads that are on the “right side” of the barrier! (A minimal thread-pinning sketch follows this comment.)

    So… NUMA is hard.
    What can I say.
    ⋅-=≡ GoatGuy ✓ ≡=-⋅
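    One common mitigation for the revival problem described above is to pin threads to cores yourself. A minimal Linux sketch (the choice of CPU 0 is arbitrary; compile with -pthread):

      #include <pthread.h>
      #include <sched.h>
      #include <cstdio>

      // Pin the calling thread to one CPU so the scheduler cannot revive
      // it on the wrong side of a NUMA boundary.
      static bool pin_to_cpu(int cpu) {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(cpu, &set);
          return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
      }

      int main() {
          if (pin_to_cpu(0))
              std::printf("pinned to CPU 0\n");
      }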

  16. They really are “killing it”. I kind of wish I was employed in deep-parallel computing somewhere … it seems like the perfect specialization as both AMD and Intel march forward with ever-more-cores-per-package, and the price of memory continues to inexorably decrease per gigabyte. 

    I’m kind of wondering where the logical endpoint is for many-cores-in-one-box, given the nowhere-near-as-quickly-evolving limitations of RAM memory bandwidth, I/O-to-specialized-processors bandwidth, and so on.  

    Even for general purpose computing, assuming that enough overlapping threads can be reasonably efficiently floated, I’m pretty sure Amdahl’s Limit is coming, above 32 cores and 64 threads. But who knows… perhaps all the way to 256 C, 512 T?  

    That seems to be the direction of the compute node cited above in the article. It is already 128 C, 256 T … and a factor of 2 more is totally conceivable. Eke a bit more out of memory, especially the successor(s) to NVMe or PCIe 4.0, and get some truly vast I/O bandwidth to feed-and-store all those threads and memory.  

    ⋅-=≡ Just saying, ≡=-⋅
    ⋅-=≡ GoatGuy ✓ ≡=-⋅

  17. GoyGuy is my Yiddish ⅜ brother. And the NextBig-U-Future was (I presume) a Mad Magazine reference. … LOL …

  18. AMD is killing it. Good Job!

    For my money’s worth, I am actually more impressed by the chip having 288MB of cache (the 32MB L2 plus 256MB L3 from the spec sheet). My first computer had something like 23KB of total RAM, so a quarter-GB of L3 cache is amazing (and is totally necessary to keep 64 cores/128 threads working).
