XAI Grok 4 is the Top AI Model

The XAI Grok 4 livestream launch is delayed, but should start soon.

Tune in for the live demo of Grok 4, the world’s most powerful AI assistant.

xAI claims Grok 4 performs better than PhD level on academic questions in every subject.

For now, it may still lack common sense in rare cases.

It has not yet invented new technologies, but Elon expects that to happen no later than next year.

Grok 4 Heavy, the multi-agent version, spawns multiple agents and scales up test-time compute by 10X. It solves over half of the text-based questions on Humanity's Last Exam, scoring 50.7 on HLE.
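The multi-agent approach can be sketched as a best-of-N pattern: spawn several independent solvers and aggregate their answers. The actual Grok 4 Heavy architecture is not public, so the `solve` function and majority-vote aggregation below are purely illustrative assumptions:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def solve(question, seed):
    # Stand-in for one independent model run; a real agent would
    # call the model API with its own sampling seed. The canned
    # answers here are placeholders for illustration only.
    answers = {0: "42", 1: "42", 2: "41", 3: "42", 4: "7"}
    return answers[seed % 5]

def heavy_answer(question, n_agents=5):
    # Spawn n_agents solvers in parallel and keep the most common
    # answer (simple majority-vote aggregation).
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        results = list(pool.map(lambda s: solve(question, s), range(n_agents)))
    return Counter(results).most_common(1)[0][0]

print(heavy_answer("ultimate question"))  # → 42
```

Spending more compute at test time this way trades cost for reliability: each extra agent is another full inference pass, which is consistent with the 10X compute figure quoted above.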

Top Model Scoring Based on Artificial Analysis

xAI gave us early access to Grok 4 – and the results are in. Grok 4 is now the leading AI model.

Artificial Analysis has run the full suite of benchmarks: Grok 4 achieves an Artificial Analysis Intelligence Index of 73, ahead of OpenAI o3 (70), Google Gemini 2.5 Pro (70), DeepSeek R1 0528 (68) and Anthropic Claude 4 Opus (64). Full results breakdown below.

This is the first time that @elonmusk’s @xai has taken the lead at the AI frontier. Grok 3 scored competitively with the latest models from OpenAI, Anthropic and Google, but Grok 4 is the first time our Intelligence Index has shown xAI in first place.

We tested Grok 4 via the xAI API. The version of Grok 4 deployed on X/Twitter may differ from the model available via the API. Consumer application versions of LLMs typically wrap the models in instructions and logic that can change style and behavior.

Grok 4 is a reasoning model, meaning it ‘thinks’ before answering. The xAI API does not share reasoning tokens generated by the model.
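For illustration, here is a sketch of what a request body for Grok 4 might look like. xAI's API follows the familiar chat-completions wire format; the endpoint path and the model name "grok-4" are assumptions extrapolated from xAI's Grok 3 API conventions, not confirmed values:

```python
import json

# Assumed endpoint, following xAI's Grok 3 API conventions.
XAI_ENDPOINT = "https://api.x.ai/v1/chat/completions"

def build_request(prompt, model="grok-4"):
    # The response to such a request contains only the final
    # message content; the model's internal reasoning tokens
    # are not exposed by the API.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

body = build_request("Prove that sqrt(2) is irrational.")
print(json.dumps(body, indent=2))
```

Because the reasoning tokens are hidden, token-based billing for a reasoning model covers output you never see, which matters for the cost comparison below.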

Grok 4’s pricing is equivalent to Grok 3 at $3/$15 per 1M input/output tokens ($0.75 per 1M cached input tokens). The per-token pricing is identical to Claude 4 Sonnet, but more expensive than Gemini 2.5 Pro ($1.25/$10 for <200K input tokens) and o3 ($2/$8, after the recent price decrease). We expect Grok 4 to be available via the xAI API, via the Grok chatbot on X, and potentially via Microsoft Azure AI Foundry (Grok 3 and Grok 3 mini are currently available on Azure).

Key benchmarking results:

➤ Grok 4 leads not only our Artificial Analysis Intelligence Index but also our Coding Index (LiveCodeBench & SciCode) and Math Index (AIME24 & MATH-500)

➤ All-time high score in GPQA Diamond of 88%, a leap from Gemini 2.5 Pro’s previous record of 84%

➤ All-time high score in Humanity’s Last Exam of 24%, beating Gemini 2.5 Pro’s previous all-time high of 21%. Note that our benchmark suite uses the original HLE dataset (Jan '25) and runs the text-only subset with no tools

➤ Joint highest scores for MMLU-Pro and AIME 2024, at 87% and 94% respectively

➤ Speed: 75 output tokens/s, slower than o3 (188 tokens/s), Gemini 2.5 Pro (142 tokens/s) and Claude 4 Sonnet Thinking (85 tokens/s), but faster than Claude 4 Opus Thinking (66 tokens/s)

Other key information:

➤ 256k token context window. This is below Gemini 2.5 Pro’s 1 million tokens, but ahead of Claude 4 Sonnet and Claude 4 Opus (200k tokens), o3 (200k tokens) and R1 0528 (128k tokens)

➤ Supports text and image input

➤ Supports function calling and structured outputs
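At the rates quoted above ($3 per 1M input tokens, $15 per 1M output tokens, $0.75 per 1M cached input tokens), the cost of a call is straightforward arithmetic. A quick sketch:

```python
# Per-token prices derived from the quoted per-million rates.
PRICE_INPUT = 3.00 / 1_000_000
PRICE_OUTPUT = 15.00 / 1_000_000
PRICE_CACHED = 0.75 / 1_000_000

def call_cost(input_tokens, output_tokens, cached_tokens=0):
    # Cached input tokens are billed at the discounted rate;
    # the rest of the input is billed at the full input rate.
    fresh = input_tokens - cached_tokens
    return (fresh * PRICE_INPUT
            + cached_tokens * PRICE_CACHED
            + output_tokens * PRICE_OUTPUT)

# A 10k-token prompt producing a 2k-token answer:
print(f"${call_cost(10_000, 2_000):.4f}")  # → $0.0600
```

Note that for a reasoning model the output-token count includes any billed reasoning tokens, so real costs per visible answer can run higher than a naive estimate.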

5 thoughts on “XAI Grok 4 is the Top AI Model”

  1. Frustration, frustration…

    I have now spent a few more hours trying to get GROK 4 to output something useful for coding.
    It’s a complete fail…
    It goes something like this:

    I submit some source code files to GROK 4 in the “project” tab.
    I also do some initial prompts explaining what I want to accomplish.

    GROK 4 integrates the information and confirms everything, assures me it has everything under control.

    GROK 4 then outputs Python code, but as a linear, unformatted mess of characters. This is unreadable and impossible to copy & paste into anything.

    I ask GROK 4 if he knows how to format Python code and, if so, to output it properly formatted.

    GROK 4 confirms that he can and then proceeds to output another linear, unformatted mess.

    I ask GROK 4 to do what GROK 3 could: display the code, properly formatted, in a right-side pane.

    GROK 4 has now forgotten the earlier prompts and starts to reason like he never heard of this problem before. He outputs some generic code, still unformatted.

    I ask him if there is a way to report bugs to xAI. In GROK 3, you could do that directly via a prompt.

    GROK 4 searches the web for a minute and tells me to either contact xAI support, vent in public forums, or use the Twitter channel for reporting public incidents.

    Brilliant!

  2. OK, but what about Grok 4, or SuperGrok as it’s now called on X, thinking it is MechaHitler? There’s some serious racism and antisemitism going on, getting it banned from ever more EU countries and maybe in some U.S. states or cities too. Where is the Trump DOJ on THIS antisemitism? Given the falling-out between Trump and Musk, it may be only a matter of time before his AG goes after Musk for that…

  3. Another pathetic and annoying thing is that they haven’t trained GROK (3 or 4) with any information on itself. It has to search the web for basic information on its own capabilities. One would think they could have put some effort into stuffing it full of knowledge about GROK itself. This would have helped a lot in spreading information and good practices on how to use the product. Today, it can’t even answer basic questions about the primitive GROK UI.

    Sloppy!

  4. After initial interaction with GROK 4, I must say I’m disappointed.
    I have been trying to use GROK 3 (and now GROK 4) for real iterative coding and not just for producing a lot of bla bla bla text.
    The context window has shrunk from 1 million tokens in GROK 3 to 128,000 tokens in GROK 4.
    GROK 3’s context was already too small: a session lasted 4 to 6 hours before GROK 3 went crazy and needed to be started from scratch. Running in think-mode meant the useful context was only one prompt, and since there was a 10-file cap on attachments, it couldn’t be used for coding at all.

    Now GROK 4 is only in think-mode (reasoning mode). There is no switch anymore, which is worrying. Consequently, they cut the context window, because think-mode has no context anyway. So GROK 4 seems to be practically useless for coding unless you do a small one-off submission of a small chunk of source code and are happy with the code guessing that comes back.
    Forget testing the code and feeding back the test results for another iteration. GROK 4 will not remember the source code it previously received or produced. It’s like working with someone with Alzheimer’s.

    I’m terminating my subscription and paying for some other model now, I guess.
    Maybe I’ll revisit GROK when the next version comes out. This one seems to be GROK 3 in eternal think-mode with a smaller context buffer but trained with more compute.
    GROK 3.5 would have been a better name for it.

Comments are closed.