Understanding ‘Publicly Available’ Training Data in AI

A few days ago, Google appeared to warn OpenAI that it is not allowed to train its models on YouTube data. But what does ‘publicly available’ training data mean to AI companies?

Recently, The New York Times reported that OpenAI, Meta, and Google have all ignored certain rules to train their AI models. This raises significant questions about the nature of ‘publicly available’ data.

When AI companies are asked about their training data, they often provide vague responses about using “publicly available data.” For instance, in an interview with The Wall Street Journal, OpenAI’s chief technology officer Mira Murati expressed uncertainty about whether data from YouTube or other social platforms was used to train their models. She stated, “I’m just not gonna go into the details of the data that was used, but it was publicly available data or licensed data.”

If the use of publicly available data is straightforward, why are AI companies so evasive? In November 2023, Ed Newton-Rex, who led Stability AI’s audio team, resigned, arguing that training generative AI models on copyrighted works is not justified under the “fair use” exemption. He believes creators’ works suffer from the duplicative content that AI models generate.

Publishers like The New York Times have terms of service explicitly prohibiting AI companies from using their content for training models. However, enforcing these terms without federal AI legislation is challenging. The NYT took on this challenge by filing a lawsuit against OpenAI in December 2023, joining authors and comedians who have sued the AI giant for copyright infringement.

OpenAI maintains that it has done nothing wrong, emphasizing its use of publicly available and licensed content. Ed Newton-Rex explained to Axios that the term “publicly available” often confuses people: it doesn’t necessarily imply the creator’s permission, only that the content wasn’t illegally obtained.

According to a recent NYT article, major AI players like OpenAI, Google, and Meta are cutting corners in data collection. Separately, documents from a class-action lawsuit revealed that in 2016 Meta’s team discussed intercepting app traffic from Snapchat users, and later from YouTube and Amazon users, to gain access to sensitive data such as usernames, passwords, and app activity. More recently, Meta employees considered using copyrighted data despite the risk of lawsuits, to avoid the lengthy process of procuring licenses.

With the supply of untapped data on the internet dwindling, AI companies are scrambling to secure large data sets, either through large licensing deals or through less ethical methods. Creators are fighting back, often through lawsuits, but it’s a challenging path. In February, a federal judge dismissed most copyright infringement claims from authors including Ta-Nehisi Coates and Sarah Silverman, setting a difficult precedent for creatives seeking legal protection for their works.

The coming months and years will likely see landmark cases and legislation that will shape how creators share their work and how AI companies collect their data.

GenSpace.ai is an autonomous AI workspace that integrates with chat platforms like Discord or Slack. It lets you control all your work and productivity apps and browse the web via simple chat commands. Our AI agents automate tasks, manage workflows, and act as your digital assistant, streamlining operations and reducing costs for entrepreneurs and startups.
