XAI Grok-1.5 With Vision

Grok-1.5V is XAI’s first-generation multimodal model. In addition to its strong text capabilities, Grok can now process a wide variety of visual information, including documents, diagrams, charts, screenshots, and photographs. Grok-1.5V will be available soon to our early testers and existing Grok users.

Capabilities
Grok-1.5V is competitive with existing frontier multimodal models in a number of domains, ranging from multi-disciplinary reasoning to understanding documents, science diagrams, charts, screenshots, and photographs. Grok is also capable of understanding the physical world: it outperforms its peers on our new RealWorldQA benchmark, which measures real-world spatial understanding. Across all datasets, we evaluate Grok in a zero-shot setting without chain-of-thought prompting.
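To make the evaluation setting concrete, here is a minimal sketch of zero-shot scoring without chain-of-thought prompting: the prompt contains only the question (no worked examples and no "think step by step" cue), and the reply is compared to the reference answer. This is an illustration under stated assumptions, not the harness used for Grok. The function names, the prompt template, the exact-match scoring rule, and the `model(image, prompt)` callable are all hypothetical.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical example type: (image reference, question, reference answer).
Example = Tuple[str, str, str]


def build_zero_shot_prompt(question: str) -> str:
    """Return the bare question: no exemplars and no reasoning cue."""
    return f"Question: {question.strip()}\nAnswer:"


def normalize(text: str) -> str:
    """Lowercase, drop a trailing period, and collapse whitespace."""
    return " ".join(text.lower().strip().rstrip(".").split())


def zero_shot_accuracy(
    model: Callable[[str, str], str],
    examples: Iterable[Example],
) -> float:
    """Fraction of examples whose normalized reply matches the reference.

    `model` is an assumed callable taking (image reference, prompt) and
    returning the model's text reply; it stands in for a real model API.
    """
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to evaluate")
    correct = sum(
        normalize(model(image, build_zero_shot_prompt(question)))
        == normalize(answer)
        for image, question, answer in examples
    )
    return correct / len(examples)
```

Exact match is the simplest scoring choice and is assumed here only for illustration. A real benchmark may use multiple-choice letters or a more tolerant answer matcher.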

Writing Code from a Handdrawn Flowchart

Calories from Food Labels

Real-World Understanding
To develop useful real-world AI assistants, it is crucial to advance a model’s understanding of the physical world. Toward this goal, we are introducing a new benchmark, RealWorldQA, designed to evaluate the basic real-world spatial understanding of multimodal models. While many of the examples in the current benchmark are relatively easy for humans, they often pose a challenge for frontier models.

Into the Future
Advancing both XAI’s multimodal understanding and generation capabilities is an important step in building beneficial AGI that can understand the universe. In the coming months, we anticipate making significant improvements in both capabilities across various modalities, such as images, audio, and video.

1 thought on “XAI Grok-1.5 With Vision”

  1. I see a lot of driving-relevant examples in the spatial reasoning section… might be related to X poaching Tesla AI engineers. Nothing wrong with Grok having good spatial understanding of course, but I’m not sure that it’s the most relevant skill for an AI chatbot.
