⬤ Andon Labs released updated Vending-Bench results showing that Kimi K2 Thinking now leads all open-source models in the benchmark's net-worth simulation. The team reran the evaluation through the Kimi API. The "Net worth over time" chart compares model performance across roughly a year of simulated activity, with lines for Human, Kimi K2 Thinking, Qwen3, gpt-oss-120b and Grok 4.
⬤ The chart shows two curves for Kimi K2 Thinking: one run through a third-party API and one run through Moonshot's API. The Moonshot API run climbs above the human baseline and the other open-source models, then levels off above them. The earlier third-party API run tracks closer to the other open-source systems. The steep Grok 4 (SOTA) curve is included for comparison, though Grok 4 is not open source.
⬤ The benchmark was rerun with Moonshot's own API after it was suggested that this would improve tool-calling performance. The change did improve results: in the updated run, Kimi K2 achieves the highest average net worth among open-source agents on Vending-Bench. The metric compared is net worth accumulated over time by each AI or human agent.
⬤ The revised results show how API choice and tool-calling behavior can materially change model rankings in agent-based simulations. With Kimi K2 now leading the open-source category on Vending-Bench, the update adds context to ongoing comparisons of model capability, efficiency, and practical deployment characteristics in financial-style decision benchmarks.
Sergey Diakov