Leaked documents obtained by 404 Media reveal that NVIDIA was allegedly scraping videos from across the internet, including movie and game footage, for its AI products. As a result, clients using those products and tools are at risk of unintentional copyright infringement.

Like other AI toolmakers, NVIDIA needs training data for its text, video, and audio generators to "learn" how to create assets. Data scraping generally refers to the practice of feeding existing video, text, and audio into training models without securing the permission of the people who made it.

The technique means YouTube and Netflix (and the companies with media on those platforms) have had copyrighted material taken without consent.

Regulators in the US and EU are still determining whether data scraping practices violate copyright rules. 404 Media's report underscores how fast and loose tech companies play with copyright law when it comes to generative AI, and how other industries like entertainment and games can be affected by these choices.

Employees at the company expressed concerns about this behavior in messages reviewed by the outlet. Despite these concerns, NVIDIA told 404 Media its scraping directives are "in full compliance with the letter and spirit of copyright law. […] Fair use protects the ability to use a work for a transformative purpose, such as model training."

Game developers and their parent companies are copyright holders, and YouTube is an important platform for the industry. Having their work taken without a say in the matter is a massive violation of trust by a company that often uses games from big studios to sell its services and products.
NVIDIA AI engineers wanted gameplay video to improve their training data
An employee speaking to the outlet claims they and others were told to grab full-length videos that could help train the tech company's AI model, and that game footage in particular was highly coveted by engineers. Acquiring said footage for data sets involved collaborating with NVIDIA's GeForceNow cloud service.

In one Slack conversation, senior research analyst Jim Fan noted the service's streaming capabilities for capturing and storing video. All of those "high-quality gameplay videos," he said, are "very useful" data to pull from.

"We'll work closely with [GeForceNow] and related engineering teams to set up live game data capture, scale up the pipeline, and process them for training," he explained.

However, employees who raised concerns were also allegedly told by project managers that scraping was an "executive decision" they need not worry about. The "open legal issue" (such as breaking YouTube's Terms of Service) would apparently be resolved in the future.

In 404's story, quotes from internal documents and Slack channels from several AI researchers show NVIDIA's active effort to avoid bad press. Its research VP Ming-Yu Liu stressed there couldn't be "negative sentiment" if the company didn't publish any research about its download data.

"What we are doing here will lead to zero publications," wrote Liu. He and other staff also built their own YouTube data scrapers and set up an API account to help with the process.

Until regulators define what does and does not violate copyright in the world of generative AI, NVIDIA and other companies are likely to operate in a legal gray zone. As MIT's Robert Mahari told 404, proving data scraping can be "really hard technically."

"The best [company] policy in terms of incentives, is to not tell people what you've trained on," he said. "So as long as you don't tell anybody, it's going to be really hard to prove."

404 Media's full, extensive report on NVIDIA's data scraping can be read here.