● NVIDIA researcher Daniel van Strien recently shared that the company has dropped Nemotron-VLM Dataset v2 on Hugging Face, packing over 8 million samples for image, video, and text reasoning tasks. Released under the CC-BY-4.0 license, it's free to use commercially and stands as one of the biggest public multimodal datasets any major lab has released this year.
● What makes this interesting is NVIDIA's timing. While competitors are locking down their training data, NVIDIA is going all-in on open source. The dataset includes ready-to-use tools for OCR, visual question answering, and video-to-text conversion—basically everything you need to build and test vision-language models at scale. The open approach is great for innovation, though some researchers worry about potential bias issues, especially in the multilingual and video sections.
● The business angle here is clever. Instead of charging for access, NVIDIA is giving it away free, which effectively subsidizes the entire AI developer community. Startups and universities can now train models without massive data costs—a stark contrast to the walled gardens other companies are building.
NVIDIA is one of the few major AI labs releasing datasets As Daniel van Strien pointed out
● This v2 release is three times bigger than v1, which came out just two months ago. It combines image-text alignment, OCR capabilities, and multilingual datasets, showing NVIDIA wants to be more than just the GPU company—they're positioning themselves as a data infrastructure player too.
Usman Salis
Usman Salis