Understanding ‘Publicly Available’ Training Data in AI

A few days ago, Google appeared to warn OpenAI that it is not allowed to train its models on YouTube data. But what does ‘publicly available’ training data actually mean to AI companies?

Recently, The New York Times reported that OpenAI, Meta, and Google have all ignored certain rules to train their AI models. This raises significant questions about the nature of ‘publicly available’ data.

When AI companies are asked about their training data, they often provide vague responses about using “publicly available data.” For instance, in an interview with The Wall Street Journal, OpenAI’s chief technology officer Mira Murati expressed uncertainty about whether data from YouTube or other social platforms was used to train their models. She stated, “I’m just not gonna go into the details of the data that was used, but it was publicly available data or licensed data.”

If the use of publicly available data is straightforward, why are AI companies so evasive? Back in November 2023, Ed Newton-Rex, who led Stability AI’s audio team, resigned, arguing that training generative AI models on copyrighted works under the “fair use” exemption is not justified. He believes creators’ works suffer from the duplicative content generated by AI models.

Publishers like The New York Times have terms of service that explicitly prohibit AI companies from using their content to train models. However, enforcing these terms is challenging without federal AI legislation. The NYT took on this challenge by filing a lawsuit against OpenAI in December 2023, joining authors and comedians who have sued the AI giant for copyright infringement.

OpenAI maintains that it has done nothing wrong, emphasizing its use of publicly available and licensed content. Ed Newton-Rex explained to Axios that the term “publicly available” often confuses people: it doesn’t necessarily imply the creator’s permission, only that the content wasn’t illegally obtained.

According to a recent NYT article, major AI players like OpenAI, Google, and Meta are cutting corners in data collection. Documents from a class-action lawsuit revealed that in 2016 Meta’s team discussed intercepting app traffic from Snapchat users, and later from YouTube and Amazon users, gaining access to sensitive data such as usernames, passwords, and app activity. More recently, Meta employees considered using copyrighted data despite the risk of lawsuits, to avoid the lengthy process of procuring licenses.

With the internet’s supply of usable data dwindling, AI companies are scrambling to secure large data sets, either through major licensing deals or through less ethical methods. Creators are fighting back, often through lawsuits, but it’s a challenging path. In February, a federal judge dismissed most copyright infringement claims from authors including Ta-Nehisi Coates and Sarah Silverman, setting a difficult precedent for creatives seeking legal protection for their works.

The coming months and years will likely see landmark cases and legislation that will shape how creators share their work and how AI companies collect their data.

 

GenSpace.ai is an autonomous AI workspace that integrates with chat platforms like Discord or Slack. It lets you control all your work and productivity apps and browse the web via simple chat commands. Our AI agents automate tasks, manage workflows, and act as your digital assistant, streamlining operations and reducing costs for entrepreneurs and startups.
