OpenAI transcribed more than a million hours of YouTube video to train GPT-4.



OpenAI used an unconventional strategy to obtain large volumes of training data for its GPT-4 language model, according to a recent report by The New York Times. Facing a shortage of high-quality data, OpenAI developed its Whisper audio transcription model and used it to transcribe over a million hours of YouTube videos. The company's president, Greg Brockman, was directly involved in gathering these videos, despite the legal ambiguities surrounding such practices.

In response to inquiries, OpenAI spokesperson Lindsay Held explained that the company customizes datasets for each model to enhance its understanding of the world and maintain competitiveness in global research. Held noted that OpenAI utilizes a variety of sources, including publicly available data and partnerships for non-public data, and is exploring the creation of synthetic data.


The article describes how OpenAI, after depleting existing data sources in 2021, turned to transcribing YouTube videos, podcasts, and audiobooks. Prior sources included computer code from GitHub, chess move databases, and educational content from Quizlet.

Google, another major player in AI, has also reportedly used YouTube transcripts for its models. A Google spokesperson emphasized adherence to YouTube's terms of service and policies prohibiting unauthorized scraping or downloading of content.


Meta (formerly Facebook) encountered similar challenges in obtaining sufficient training data. The company's AI team explored options such as licensing books or acquiring a publisher to address data limitations. Additionally, privacy-related changes made after the Cambridge Analytica scandal restricted Meta's access to consumer data.

Both Google and OpenAI, alongside the broader AI community, face a looming scarcity of quality training data. The effectiveness of potential solutions, such as training models on synthetic data or employing curriculum learning, remains uncertain. Meanwhile, concerns persist regarding the legality and ethical implications of utilizing data without explicit permission, as evidenced by recent lawsuits in the field.

 
