⬤ Vision-language model development accelerates as researchers trial Chain-of-Visual-Thought, a new method that sharpens the way models read detailed images. The method changes how a model handles visual data, blending ordinary text tokens with continuous visual tokens inside its reasoning steps. The model therefore thinks in small, precise visual pieces that capture appearance, structure, depth and edges.
⬤ The framework follows a single sequence: the model receives a prompt, produces mixed text and visual tokens while it reasons, then outputs a plain-text answer. Because continuous visual tokens sit inside this chain, models such as Qwen2.5-VL and LLaVA build a richer internal view of the image. Experiments show that the method raises visual reasoning scores by three to sixteen percent across more than ten benchmarks covering recognition, spatial interpretation and multi-step visual analysis.
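To make that flow concrete, the sketch below is a minimal illustration in plain Python, not the paper's implementation; every class, function and value in it is hypothetical. It shows how a reasoning chain might interleave discrete text tokens with continuous visual-token embeddings, with the final answer decoded from the text tokens alone.

```python
# Illustrative sketch only (not the authors' code): a reasoning chain that mixes
# discrete text tokens with continuous visual tokens. All names are hypothetical.
from dataclasses import dataclass
from typing import List, Union
import numpy as np

@dataclass
class TextToken:
    token_id: int          # ordinary discrete vocabulary token

@dataclass
class VisualToken:
    embedding: np.ndarray  # continuous vector standing in for appearance, depth, edge cues

ChainStep = Union[TextToken, VisualToken]

def run_chain(prompt_ids: List[int], visual_probe: np.ndarray) -> str:
    """Toy pipeline: prompt in, mixed text/visual reasoning chain, plain-text answer out."""
    chain: List[ChainStep] = [TextToken(t) for t in prompt_ids]

    # During reasoning the model would emit continuous visual tokens alongside text;
    # here we simply append a stand-in visual token built from the probe features.
    chain.append(VisualToken(embedding=visual_probe))
    chain.append(TextToken(token_id=42))  # placeholder "reasoning" text token

    # Only the discrete text tokens are decoded into the final answer.
    answer_ids = [step.token_id for step in chain if isinstance(step, TextToken)]
    return f"decoded answer from text tokens: {answer_ids}"

if __name__ == "__main__":
    print(run_chain(prompt_ids=[1, 2, 3], visual_probe=np.random.rand(256)))
```

The point of the structure is that the visual tokens never appear in the output: they act as intermediate scratch space that keeps fine-grained image evidence available during reasoning, while the answer stays plain text.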
⬤ Those gains signal a broader move in multimodal AI toward clearer, more structured reasoning. Earlier vision-language models often stumble over small visual details or scenes that contain many objects; Chain-of-Visual-Thought eases this problem by supplying a steadier flow of visual information. The result is a model that analyses shapes, counts objects, understands layout and detects fine details with greater accuracy and fewer errors.
⬤ The advance achieved with Chain-of-Visual-Thought suggests that stronger visual reasoning may power the next generation of multimodal AI systems. Improved internal processing opens opportunities in robotics, automation and real-world image analysis, domains that demand reliability. As visual reasoning continues to strengthen, this method could become a core component for building vision-language models that scale without losing accuracy.
Usman Salis