In the latest example of a worrying industry pattern, NVIDIA appears to have collected vast amounts of copyrighted content for AI training. On Monday, 404 Media's Samantha Cole reported that the $2.4 trillion company asked workers to download videos from YouTube, Netflix and other datasets to develop commercial AI projects. The graphics card maker is among the tech companies that appear to have adopted a "move fast and break things" ethos in their race to establish dominance in this often-embarrassing AI gold rush.
The training was reportedly aimed at developing models for products such as its Omniverse 3D world generator, self-driving car systems and “digital humans” projects.
NVIDIA defended its practice in an email to Engadget. A company spokesperson said its research “fully complies with the letter and spirit of copyright law,” while asserting that intellectual property laws protect specific expressions “but not facts, ideas, data, or information.” The company equated the practice to a person's right to “know facts, ideas, data, or information from another source and use them to create their own expression.” Human, computer… what's the difference?
YouTube doesn't seem to agree. Spokesperson Jack Malon pointed us to a Bloomberg story from April in which YouTube CEO Neal Mohan said that using YouTube to train AI models would be a "clear violation" of the platform's terms. "Our previous comment still stands," YouTube's policy communications manager wrote to Engadget.
Mohan's April comment came in response to reports that OpenAI trained its text-to-video generator Sora on YouTube videos without permission. Last month, a report indicated that the startup Runway AI had followed suit.
NVIDIA managers reportedly told employees who raised ethical and legal concerns about the practice that it had already been green-lit at the highest levels of the company. "This is an executive decision," responded Ming-Yu Liu, NVIDIA's vice president of research. "We have blanket approval for all data." Others at the company reportedly described the scraping as an "open legal issue" to be addressed later.
This all sounds similar to Facebook's (now Meta) old motto of "move fast and break things," which proved admirably successful in breaking quite a few things, including the privacy of millions of people.
In addition to YouTube and Netflix videos, NVIDIA reportedly ordered workers to train on the MovieNet movie trailer database, internal libraries of video game footage, and the GitHub-hosted video datasets WebVid (now taken down following a cease-and-desist order) and InternVid-10M. The latter is a dataset containing 10 million YouTube video IDs.
Some of the data NVIDIA allegedly trained on was marked as suitable only for academic (non-commercial) use. HD-VG-130M, a library of 130 million YouTube videos, carries a usage license specifying that it is intended for academic research only. NVIDIA allegedly brushed off concerns about the academic-only terms and insisted the datasets could be used for its commercial AI products.
NVIDIA reportedly downloaded content using virtual machines (VMs) with rotating IP addresses to avoid detection and bans by YouTube. In response to one employee's suggestion to use a third-party IP address rotation tool, another NVIDIA employee reportedly wrote: "We are on [Amazon Web Services] and when restarting a [VM] instance we get a new public IP. So that's not an issue so far."
404 Media's full report on NVIDIA's scraping practices is worth a read.