The distilled versions of DeepSeek R1 are not as good as the full model. They are vastly inferior, and other models outperform them handily. Running the full model with a 16K or greater context window is possible for about $2,000, at about 4 tokens per second, using the machine below.
Machine Specs
AMD EPYC 7702
512GB DDR4-2400 ECC (16× 32GB 2Rx4 DIMMs)
Gigabyte MZ32-AR0 Single Socket Mobo
Typical storage was a 4TB mirror of U.2 NVMe drives, but that is pulled right now for a storage redo, leaving the boot mirror: a pair of 512GB NVMe drives.
100GbE Mellanox ConnectX-4
Proxmox 8.3.3
4× MSI Ventus RTX 3090 GPUs (no SLI)
Corsair 1500W PSU
Rig Frame
Corsair H170i Elite XT 420mm AIO water cooler (works for SP3 with a bracket)
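A back-of-envelope calculation shows why a DDR4 EPYC box like this lands in the single-digit tokens-per-second range: token generation is largely memory-bandwidth bound. The numbers below (8 memory channels, ~37B active parameters per token for the MoE model, 1.58 bits per weight) are illustrative assumptions, not measurements from this build.

```python
# Theoretical memory bandwidth of an 8-channel DDR4-2400 EPYC system.
CHANNELS = 8            # EPYC 7702 supports 8 memory channels
MT_PER_S = 2400e6       # DDR4-2400 transfer rate
BYTES_PER_TRANSFER = 8  # 64-bit bus per channel

bandwidth = CHANNELS * MT_PER_S * BYTES_PER_TRANSFER  # bytes/s
print(f"Theoretical memory bandwidth: {bandwidth / 1e9:.1f} GB/s")

# DeepSeek R1 is a Mixture-of-Experts model: of 671B total parameters,
# roughly 37B are active per token, so only those weights must be read.
active_params = 37e9
bits_per_weight = 1.58  # aggressive quantization (assumed average)

bytes_per_token = active_params * bits_per_weight / 8
upper_bound_tps = bandwidth / bytes_per_token
print(f"Bandwidth-bound upper limit: {upper_bound_tps:.0f} tokens/s")
```

The upper bound comes out around 21 tokens/s; real throughput of ~4 tokens/s is plausible once compute, KV-cache traffic, and NUMA overheads are paid.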
An LXC container runs Docker, and Docker runs the Ollama/Open WebUI stack. Container settings:
120 CPUs (threaded cores; backing this off by 8 keeps peak temperatures about 4°C lower)
496GB RAM
unprivileged container
Upgrading GPUs from the RTX 3090 to RTX 50-series cards can improve speed:
RTX 5090: 21,760 CUDA cores, 32GB GDDR7 memory, 575W TGP ($1999)
RTX 5080: 10,752 CUDA cores, 16GB GDDR7 memory, 360W TGP ($999)
RTX 5070 Ti: 8,960 CUDA cores, 16GB GDDR7 memory, 300W TGP ($799)
RTX 5070: 6,144 CUDA cores, 12GB GDDR7 memory, 250W TGP ($549)
A smaller version of DeepSeek R1 uses 1.58-bit dynamic quantization, which reduces the memory needed for decent performance to about 130 gigabytes. The quantization is "dynamic" because some sections are shrunk more aggressively than others, preserving precision in the important parts of the model.
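The ~130GB figure can be sanity-checked with simple arithmetic, treating 1.58 bits as the blended average across the model's 671B parameters (an assumption; the dynamic scheme gives some layers more bits and others fewer).

```python
# Rough size check for a 1.58-bit quantization of a 671B-parameter model.
total_params = 671e9   # DeepSeek R1 total parameter count
avg_bits = 1.58        # assumed average bits per weight after quantization

size_gb = total_params * avg_bits / 8 / 1e9  # bits -> bytes -> GB
print(f"Approximate quantized model size: {size_gb:.1f} GB")
```

That lands at roughly 132GB, consistent with the "about 130 gigabytes" figure, before counting the context window's KV cache on top.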

Brian Wang is a Futurist Thought Leader and a popular Science blogger with 1 million readers per month. His blog Nextbigfuture.com is ranked #1 Science News Blog. It covers many disruptive technology and trends including Space, Robotics, Artificial Intelligence, Medicine, Anti-aging Biotechnology, and Nanotechnology.
Known for identifying cutting edge technologies, he is currently a Co-Founder of a startup and fundraiser for high potential early-stage companies. He is the Head of Research for Allocations for deep technology investments and an Angel Investor at Space Angels.
A frequent speaker at corporations, he has been a TEDx speaker, a Singularity University speaker and guest at numerous interviews for radio and podcasts. He is open to public speaking and advising engagements.
Memory manufacturers are going to be the biggest winners of the next few years.
The power of the Mixture of Experts approach is stark: it enables high performance from a collection of numerous smaller models trained to be good at specific things, reducing the hardware needed for high performance (so long as sufficient memory is available), all with much lower training demands.
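The mechanism behind that sparseness can be sketched in a few lines: a router scores the experts, only the top-k actually run, and their outputs are blended by the renormalized gate weights. The expert functions and counts below are toy values for illustration, not DeepSeek's actual architecture.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Eight toy "experts"; each is just a simple function of the input.
experts = [lambda x, k=k: (k + 1) * x for k in range(8)]

def moe_forward(x, router_scores, top_k=2):
    """Run only the top_k highest-scoring experts and mix their outputs,
    weighted by the renormalized router gates. The other experts never
    execute, which is why per-token cost scales with top_k, not with
    the total number of experts."""
    gates = softmax(router_scores)
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:top_k]
    norm = sum(gates[i] for i in top)
    return sum(gates[i] / norm * experts[i](x) for i in top)

scores = [0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.4, 0.1]
y = moe_forward(1.0, scores)  # only experts 1 and 3 actually run
print(y)
```

The per-token compute and weight traffic depend only on the experts selected, which is exactly why a 671B-parameter MoE model can generate tokens at the cost of a much smaller dense model.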
Maybe that will have application to FSD, with separate models eventually developed for greater expertise/reliability in day, night, wet, snow, urban, rural or highway driving, all loadable from disk in a few microseconds.
These specs are a bit beyond the normal generic user's, but aren't anything too fancy for hobbyists. Recall that people in the '70s or '80s spent sums like these (or way bigger, in '70s dollars) to have a much quainter microcomputer. Now we can have HAL 9000 at home.
I find it funny how client computing needs seemed to have plateaued for several years. A 10-year-old computer could still satisfy, Pareto-style, 80% of users' needs (browsing, office, some light gaming).
Not anymore. These applications need every spare clock cycle and byte of VRAM or RAM you can throw at them.
So the growth of core counts, VRAM/RAM sizes and bus speeds will keep pace, and be sure it's needed.