⬤ xAI has claimed the top spot in speech-to-speech AI with the release of Grok Voice Agent. According to benchmark data from Artificial Analysis, the new model scored 92.3% on Big Bench Audio, edging past Google's Gemini 2.5 Flash Native Audio Thinking. This is xAI's first public speech-to-speech API, and it signals that the company is ready to go head-to-head with established voice AI platforms.
⬤ Big Bench Audio tests how well speech models can reason through complex questions. It uses 1,000 audio questions adapted from Big Bench Hard, a well-known text benchmark for advanced reasoning. Grok Voice Agent sits at the top of the leaderboard, ahead of models such as Gemini 2.5 Flash Native Audio, Nova 2.0 Sonic, the GPT Realtime variants, and the Qwen Omni models. The results suggest it is strong at understanding tricky spoken prompts and answering them accurately.
⬤ Speed matters too. Grok Voice Agent posts an average time to first token of 0.78 seconds, which makes it the third fastest on the board, behind two Gemini 2.5 Flash variants. Pricing is straightforward: $0.05 per minute ($3/hour) of connected audio. The model comes with built-in tool calling, so you can plug it into web search, RAG workflows, and custom tools defined with JSON schemas.
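To make the pricing and the tool-calling idea concrete, here is a minimal Python sketch. The per-minute rate is the figure quoted above; everything else is an assumption for illustration. In particular, the `estimate_cost` helper and the `get_weather` tool are hypothetical, and the layout of the tool definition follows the general JSON Schema style used by many tool-calling APIs, not a confirmed xAI format.

```python
import json

# Rate quoted in the article: $0.05 per minute of connected audio.
PRICE_PER_MINUTE_USD = 0.05


def estimate_cost(connected_seconds: float) -> float:
    """Estimate the cost in USD for a given duration of connected audio."""
    return round(connected_seconds / 60 * PRICE_PER_MINUTE_USD, 4)


# Hypothetical custom tool described with a JSON Schema "parameters" object.
# The field names here are an assumption, not xAI's documented wire format.
weather_tool = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}

if __name__ == "__main__":
    # One hour of connected audio at the quoted rate.
    print(f"1 hour: ${estimate_cost(3600):.2f}")
    # A 90-second call.
    print(f"90 seconds: ${estimate_cost(90):.4f}")
    print(json.dumps(weather_tool, indent=2))
```

Because billing is based on connected time, a one-hour session works out to $3 at the quoted rate, matching the hourly figure above. The actual request format for registering tools should be checked against xAI's own documentation.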
⬤ Grok Voice Agent adds more heat to an already competitive speech AI space. It supports telephony through providers like Twilio and Vonage, handles 100+ languages, and offers multiple voice options, which makes it a fit for voice assistants, phone agents, and interactive voice apps. Its benchmark performance suggests that speech-based reasoning is becoming a key differentiator in next-gen AI systems, and that xAI is making a serious play in the voice AI market.
Usman Salis