A few days ago, Google issued what amounted to a warning to OpenAI, saying the company is not allowed to train its models on YouTube data. But what does ‘publicly available’ training data actually mean to AI companies?
Recently, The New York Times reported that OpenAI, Meta, and Google have all cut corners, and ignored their own policies, in gathering data to train their AI models. This raises significant questions about what ‘publicly available’ data really means.
When AI companies are asked about their training data, they often provide vague responses about using “publicly available data.” For instance, in an interview with The Wall Street Journal, OpenAI’s chief technology officer Mira Murati expressed uncertainty about whether data from YouTube or other social platforms was used to train their models. She stated, “I’m just not gonna go into the details of the data that was used, but it was publicly available data or licensed data.”
If the use of publicly available data is so straightforward, why are AI companies so evasive about it? Back in November 2023, Ed Newton-Rex, who led Stability AI’s audio team, resigned, arguing that training generative AI models on copyrighted works under the “fair use” exemption is not justified. In his view, creators are harmed when AI models generate content that competes with, or duplicates, the very works they were trained on.
Publishers like The New York Times have terms of service that explicitly prohibit AI companies from using their content to train models. Enforcing those terms without federal AI legislation is difficult, however. The NYT took on that challenge by suing OpenAI in December 2023, joining the authors and comedians who have already sued the AI giant for copyright infringement.
OpenAI maintains that it has done nothing wrong, emphasizing its use of publicly available and licensed content. Newton-Rex explained to Axios that the term “publicly available” often confuses people: it doesn’t necessarily mean the creator gave permission, only that the content wasn’t obtained illegally.
According to a recent NYT article, major AI players like OpenAI, Google, and Meta are cutting corners in data collection. Documents unsealed in a class-action lawsuit revealed that, back in 2016, Meta’s team discussed intercepting app traffic from Snapchat users, and later from YouTube and Amazon users, gaining access to sensitive data such as usernames, passwords, and app activity. More recently, Meta employees considered using copyrighted data despite the risk of lawsuits, in order to avoid the lengthy process of licensing it.
With the internet’s supply of fresh training data dwindling, AI companies are scrambling to secure large data sets, whether through big licensing deals or less ethical means. Creators are fighting back, often through lawsuits, but it’s an uphill path: in February, a federal judge dismissed most of the copyright infringement claims brought by authors including Ta-Nehisi Coates and Sarah Silverman, an early signal of how difficult it may be for creatives to win legal protection for their works.
The coming months and years will likely see landmark cases and legislation that will shape how creators share their work and how AI companies collect their data.