In response to inquiries, OpenAI spokesperson Lindsay Held explained that the
company curates a distinct dataset for each model to enhance that model's
understanding of the world and to keep the company competitive in global
research. Held noted that OpenAI draws on a variety of sources, including
publicly available data and partnerships for non-public data, and is also
exploring the generation of synthetic data.
The article describes how OpenAI, having exhausted its existing data sources
in 2021, turned to transcribing YouTube videos, podcasts, and audiobooks.
Earlier sources had included computer code from GitHub, databases of chess
moves, and educational content from Quizlet.
Google, another major player in AI, has also reportedly used transcripts of
YouTube videos to train its models. A Google spokesperson emphasized adherence
to YouTube's terms of service and to policies prohibiting unauthorized
scraping or downloading of content.
Meta (formerly Facebook) encountered similar challenges in obtaining
sufficient training data. To address the shortfall, the company's AI team
explored options such as licensing books or acquiring a publisher outright.
Privacy changes introduced after the Cambridge Analytica scandal had also
restricted Meta's access to consumer data.
Both Google and OpenAI, alongside the broader AI community, face a looming
scarcity of high-quality training data. The effectiveness of potential
solutions, such as training models on synthetic data or employing curriculum
learning, remains uncertain. Meanwhile, concerns persist over the legality and
ethics of using data without explicit permission, as evidenced by recent
lawsuits in the field.