xAI Grok 5 and AGI

Musk announced that xAI’s Colossus 2 will be the world’s first gigawatt-plus AI training supercomputer, and that it will be used to begin training Grok 5 next month (September 2025).

Elon says Colossus 2 has a non-trivial chance of achieving AGI, and that xAI is close to having all the pieces in place for it. A “non-trivial chance” probably means roughly 1-5%.

AGI is loosely defined here as the point where debates rage: some argue it has been achieved, while others deny it.

What Are the Essential Pieces for Achieving AGI?

AI Models: xAI’s Grok series, particularly Grok 4, is highlighted as a frontrunner on the LLM leaderboards.
Compute: Colossus 2’s gigawatt-scale cluster. xAI plans to scale from 200,000 H100 equivalents now to a 250x increase within 5 years. There could be up to 550,000 Nvidia B200/B300 GPUs.
Power: Musk is aggressively securing power, including shipping entire power plants from overseas.
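The gigawatt-plus figure roughly checks out against the rumored GPU count. A back-of-envelope sketch, assuming ~1.2 kW per B200-class GPU including cooling and networking overhead (an assumed figure, not a published one):

```python
# Back-of-envelope power estimate for Colossus 2 (all per-GPU figures are assumptions).
GPUS = 550_000              # rumored Nvidia B200/B300 count
KW_PER_GPU_ALL_IN = 1.2     # assumed all-in draw incl. cooling/networking overhead

total_gw = GPUS * KW_PER_GPU_ALL_IN / 1_000_000  # kW -> GW
print(f"{total_gw:.2f} GW")  # 0.66 GW from GPUs alone, before other facility loads
```

With other facility loads on top, the total lands in the “gigawatt-plus” range claimed.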

Evaluation of Grok 4: Strengths, Benchmarks, and Comparisons

Grok 4 and Grok 4 Heavy excel in complex, long-form tasks such as coding at length and tackling difficult problems. They are described as approaching everything as if it were a hard problem, which makes them slower on simple tasks but ideal for intricate projects. Community feedback and the speaker’s testing support this.

Developer Denny Lamemensetta (possibly partnering with Max Herden) uses Grok 4 exclusively for game development, including UI work via Grok Imagine, despite neither being a coder. They employ “vibe coding” (an intuitive, AI-assisted process) and produce impressive results. The speaker plans an interview to explore why they prefer Grok over competitors like Gemini 2.5 Pro or GPT-5.

Benchmarks

– #1 on LiveCodeBench.
– Tops or near-tops: AIME 2025 (100% with Python, Heavy), SWE-bench (75%, edging out the 74.9% runner-up), GPQA Diamond, and Vending-Bench (successful long-term vending machine operation).

– Outperforms on ARC-AGI-2 (more complex than ARC-AGI-1), showing “nonzero fluid intelligence” per Greg Kamradt (president of the ARC Prize Foundation).

– Fluid vs. Crystallized Intelligence: LLMs traditionally rely on crystallized (experience-based) intelligence drawn from training data. Grok 4 demonstrates fluid intelligence, adapting to novel problems without prior examples, an ability that in humans peaks in young adulthood.

Comparisons to Competitors

– Vs. GPT-5: Developer Theo (who initially praised GPT-5 but later critiqued its inconsistencies) notes Grok 4 as a top model on OpenRouter. GPT-5 shines at its best but suffers from poor model routing; Grok 4 “200 IQs” everything, overcomplicating simple tasks, which makes it expensive and slow but superior on benchmarks.

– General: Grok 4 leads in reasoning, software engineering, and complex evals, though it is not everyone’s default because it lacks polish for everyday use.

– Community Debate: Some say Grok lags behind; others praise it. The speaker attributes the variability to task complexity.

Behind-the-Scenes: Training Paradigms and Scaling Laws

Evolution of Training (drawing from OpenAI’s Sequoia Capital talk and Andrej Karpathy’s analogies):

Pre-Training: Like reading a textbook—absorbing and compressing knowledge (e.g., GPT-4 era).

RLHF (Reinforcement Learning from Human Feedback): Like solved example problems at chapter ends—demonstrating step-by-step solutions.

RL (Reinforcement Learning): Like unsolved problems with answers in the back—trial-and-error to develop strategies. Scaling RL is seen as the “next big wave” of AI progress.
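The “unsolved problems with answers in the back” regime can be caricatured in a few lines. This is a toy sketch only: a lookup table stands in for model weights, whereas real RL training would update a neural network via policy gradients.

```python
import random

random.seed(0)

# Toy sketch of RL on verifiable problems: trial-and-error attempts,
# with a checkable reward reinforcing whatever worked. The lookup-table
# "policy" is a stand-in for model weights.

problems = [(2, 3, 5), (4, 1, 5), (7, 2, 9)]    # (a, b, correct answer)
policy = {}                                      # problem -> reinforced guess

def attempt(a, b):
    # Use a reinforced strategy if one exists, otherwise explore randomly.
    return policy.get((a, b), random.randint(0, 10))

for _ in range(1000):                            # trial-and-error episodes
    a, b, answer = random.choice(problems)
    guess = attempt(a, b)
    reward = 1 if guess == answer else 0         # the "answer in the back"
    if reward:
        policy[(a, b)] = guess                   # reinforce the correct approach
```

After enough episodes, the policy has locked in the correct answer for every problem it was rewarded on; scaling RL amounts to running this loop over billions of much harder, still-checkable problems.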

Grok’s Progression

Grok 2 to Grok 3: 10x pre-training compute.

Grok 3 to Grok 4: 10x RL compute on top, essentially “Grok 3 with more RL” (per a Reddit post). This involves solving billions of problems and rewarding correct approaches.

Scaling Shifts: Pre-training scaling is hitting a wall; each generation is exponentially more expensive (e.g., $1B to $10B to $100B).

Alternatives:

Test-Time Compute: Giving models more thinking time (e.g., the low/medium/high settings in ARC-AGI evals) yields gains, but with diminishing returns.

RL Scaling: The new S-curve. Grok 4 hints at this by topping charts with RL-heavy training on an “older” base (Grok 3).
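The diminishing-returns pattern of test-time compute is easy to reproduce with a toy majority-vote simulation. The 40% per-sample accuracy and three-answer space below are arbitrary assumptions, not measured numbers:

```python
import random
from collections import Counter

random.seed(0)

# Toy illustration of test-time compute: sampling more "reasoning attempts"
# and majority-voting improves accuracy, but the gains flatten out.

def noisy_solver(correct=0):
    # Returns the right answer 40% of the time, else one of two wrong ones.
    return correct if random.random() < 0.4 else random.choice([1, 2])

def vote_accuracy(k, trials=5000):
    # Accuracy when we sample k attempts and take the plurality answer.
    hits = 0
    for _ in range(trials):
        votes = Counter(noisy_solver() for _ in range(k))
        if votes.most_common(1)[0][0] == 0:
            hits += 1
    return hits / trials

for k in (1, 5, 25):   # low / medium / high thinking budgets
    print(k, round(vote_accuracy(k), 2))
```

The jump from 1 to 5 samples is large; the jump from 5 to 25 is much smaller despite costing five times more compute, which is the diminishing-returns curve described above.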

Future Implications: What if we 10x RL further, or pair it with a larger base model? References to Absolute Zero Reasoner and DeepSeek R1-Zero suggest RL can scale dramatically, inspired by AlphaGo/AlphaZero-style self-play (teacher-student clones generating and improving problems).
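The teacher-student self-play idea can be sketched as a curriculum ratchet. Everything below is a hypothetical caricature: both roles are hand-coded, whereas real systems such as Absolute Zero Reasoner use learned models on both sides.

```python
import random

random.seed(1)

# Toy caricature of AlphaZero-style teacher-student self-play:
# a "teacher" proposes problems one notch above the "student's" level,
# and each success ratchets the curriculum upward.

skill = 1                                   # digits the student handles reliably

def teacher_propose(level):
    # Propose an addition problem one digit beyond current skill.
    hi = 10 ** (level + 1) - 1
    return random.randint(0, hi), random.randint(0, hi)

def student_solve(a, b, level):
    # In this toy, the student solves anything within one notch of its skill,
    # so every proposed problem is learnable and the curriculum always advances.
    return a + b if max(a, b) < 10 ** (level + 1) else None

for _ in range(5):                          # curriculum loop
    a, b = teacher_propose(skill)
    if student_solve(a, b, skill) == a + b:
        skill += 1                          # success -> harder problems next round

print("final skill level:", skill)          # 6: ratcheted up every round
```

The interesting open question is what happens when both the proposer and the solver are large models improving together, which is the self-play dynamic the video points at.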

Compute Landscape: Visualized via epoch.ai charts. xAI’s Colossus Phase 1 matches Microsoft/OpenAI and Meta; Phase 2 dwarfs them. Tesla and Google (TPUs/GPUs) are separate but comparable.

Upcoming xAI Developments

Open Sourcing:
Grok 2.5 (last year’s best model) is now open source;
Grok 3 will follow in ~6 months. Sebastian Raschka praised these releases for being full production models, not lite versions.

Grok 4.2 is expected within weeks.
Sonic may be Grok 4 Coding (a fast coding model, rumored via an LM Arena leak).
Voice, image/video generation, and Grok Finance are also coming.

Elon’s Broader Vision for AI’s Future

Edge AI Inference: Devices (e.g., phones) become “edge nodes” for real-time AI generation due to bandwidth limits. Instead of pre-made apps and websites, AI generates custom software, games, and videos on demand.

Examples: Diffusion models like Google DeepMind’s Genie 3 (upload image, explore as a character); real-time video games without coding.

Use LLMs (e.g., Claude, Grok) to generate scripts and tools instead of downloading apps, avoiding trials and subscriptions.
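As a concrete (hypothetical) example of this pattern, here is the kind of throwaway utility a user might ask an LLM to generate on the spot rather than install an app for:

```python
# Illustrative LLM-generated throwaway tool: remove duplicate lines from
# text piped in on stdin, preserving the original order of first occurrence.
import sys

def dedupe_lines(lines):
    seen, out = set(), []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out

if __name__ == "__main__":
    print("\n".join(dedupe_lines(sys.stdin.read().splitlines())))
```

A one-off script like this replaces a downloaded utility with no installer, trial, or subscription, which is the on-demand software vision described above.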

Societal Impacts

Musk predicts AI will counterintuitively *increase* birth rates (and xAI will program it that way).

Human Limbic System – Elon says AI will “one-shot” it (overwhelm instincts/emotions), but positively, via a birth rate boost.

Elon’s Speed Advantage: Critics overlooked xAI’s rapid catch-up, from late entry to leading benchmarks.

Plateau vs. Progress: GPT-5 is not seen as evidence of a wall; RL scaling promises “wild” advances. If it fails, expect a temporary plateau until the next breakthrough.

4 thoughts on “xAI Grok 5 and AGI”

  1. I’ll do my standard Turing Test on Grok 5, which every LLM has failed spectacularly so far (including ones that claim to be “unleashed” or whatever):

    >Tell me a trans joke.

    I’m willing to bet that Grok 5 will fail.

  2. The only realistic way society will solve the fertility rate problem might be by separating the raising or even the creation of children from biological parents.

    However, before societies decide to go down that path, a post work society might lead to couples deciding to have more children. Humanoid robots will also make the raising of children easier. And of course, artificial wombs.

  3. I don’t get the prediction that AI will increase birth rates. It seems to be doing the opposite already, with AI and other compute distractions taking people away from baby-making (i.e., having sex in the real world). The greatest fertility is in countries and local communities with limited access to online activities and computers. Whether that is a desirable outcome is very much open to debate, as is whether a short-term population boost of ignorant, unproductive, and often violent people is good for humanity.

  4. At least my experience with ChatGPT-5 is consistent with the general criticism that LLMs are fundamentally flawed at reasoning about reality and try to make up for it with hallucinations. We need a new paradigm.
