⬤ Recent testing shows Mistral Devstral-Small-2 delivering solid inference performance when run locally on Apple's M3 Ultra through MLX. The model ran via Mistral Vibe CLI with 6.5-bit quantization, reaching peak speeds of roughly 36 tokens per second. Video demonstrations captured the setup's responsiveness on Apple silicon, with parts of the footage shown at accelerated playback.
⬤ The test used inferencerlabs/Devstral-Small-2-24B-Instruct-2512-MLX-6.5bit with a temperature setting of 0.2. A 4-bit quantization was also tested, but it produced noticeable errors after several conversation turns that hurt output quality. The 6.5-bit option struck a better balance, maintaining high throughput while keeping outputs reliable during extended sessions.
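⬤ For readers who want to try a comparable setup outside of Mistral Vibe CLI, below is a minimal sketch using the mlx-lm Python package. It loads the same 6.5-bit model named above and samples at temperature 0.2; the prompt is illustrative, and the sampler keyword reflects recent mlx-lm releases, so argument names may differ in older versions.

```python
# Minimal sketch: load the 6.5-bit Devstral MLX build with mlx-lm and generate
# at temperature 0.2. Assumes `pip install mlx-lm` on Apple silicon; the sampler
# API may vary between mlx-lm releases.
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler

model, tokenizer = load(
    "inferencerlabs/Devstral-Small-2-24B-Instruct-2512-MLX-6.5bit"
)

# Illustrative coding prompt, formatted with the model's chat template.
messages = [{"role": "user", "content": "Write a Python function that parses a CSV file."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# verbose=True makes mlx-lm print generation speed (tokens/sec) with the output.
response = generate(
    model,
    tokenizer,
    prompt=prompt,
    max_tokens=512,
    sampler=make_sampler(temp=0.2),  # temperature 0.2, matching the test above
    verbose=True,
)
```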
⬤ In a separate run on an M3 Ultra with 512GB of memory using LM Studio, average speeds tracked around 27 tokens per second on the console. The video used mixed playback speeds, normal at first and 2x afterward, and the sustained figures point to consistent performance rather than brief spikes.
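⬤ LM Studio also exposes an OpenAI-compatible local server (port 1234 by default), so throughput can be spot-checked from a script. The sketch below assumes the server is running with the Devstral model already loaded; the model identifier string is a placeholder for whatever name LM Studio reports for the loaded model.

```python
# Rough throughput check against LM Studio's local OpenAI-compatible server.
# Assumes the server is enabled on the default port with the model loaded.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")  # key is unused locally

messages = [{"role": "user", "content": "Implement a binary search in Python."}]

start = time.perf_counter()
resp = client.chat.completions.create(
    model="devstral-small-2",  # placeholder: use the model name shown in LM Studio
    messages=messages,
    temperature=0.2,
    max_tokens=512,
)
elapsed = time.perf_counter() - start

generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s = {generated / elapsed:.1f} tok/s")
```

Note that this measures end-to-end wall-clock time including prompt processing, so it will read somewhat below the decode-only tokens-per-second figure shown on LM Studio's console.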
⬤ These results matter because they show how optimized MLX workflows can deliver high-throughput local inference for large language models on accessible hardware. Strong performance from Mistral Devstral-Small-2 on Apple M3 Ultra shows that advanced AI models can run locally without cloud dependency. As MLX tooling, quantization methods, and local inference setups continue improving, configurations like this could reshape how developers deploy and test open-weight models.
Alex Dudov