Investigation Shows Tech Companies Trained AI on YouTube Transcripts


Artificial intelligence isn’t magical – it’s in the name: “artificial.” We know the content is originating from somewhere. An investigation showed that some of the big names in tech, including Apple, trained their AI technology on transcripts from YouTube videos – all without permission.

Investigation Shows YouTube Transcripts Used

Proof News conducted an investigation that included a search tool for finding YouTube videos in the dataset. It determined that tech companies used the subtitles from nearly 175,000 YouTube videos from more than 48,000 channels.

The videos that were used included late-night TV episodes from The Late Show with Stephen Colbert and Jimmy Kimmel Live. Also showing up in the investigation were videos by MrBeast, PewDiePie, and Marques Brownlee.


The dataset is part of “the Pile,” a collection described in 2020 as a mix of 22 datasets from EleutherAI, a nonprofit.

A Google spokesperson said in an email to CNET that the company stands by what it has said previously, pointing to a comment from April. YouTube CEO Neal Mohan said at that time that he didn’t know whether OpenAI used YouTube videos but recognized that, if it did, it would be a violation of YouTube’s terms of service.

Where Else Does the AI Content Come From?

Nearly every tech company has announced recently that it is developing or has developed an AI system. As stated initially, we know it’s not magical and that the content comes from somewhere. It just wasn’t expected that some of that content was coming from YouTube transcripts.

OpenAI, the creator of ChatGPT, has said previously that it was getting more difficult to find datasets to train AI, which led it to make deals with Reddit and News Corp. for their content. Google has said it has an agreement with content creators that allows it to use YouTube content in its AI training. AI Overview was recently added to Google Search. Learn how to turn AI Overview off if it isn’t your cup of tea.


An Anthropic spokesperson, meanwhile, acknowledged to Proof News that the company used the Pile to train Claude, its AI assistant. The spokesperson also acknowledged that there are some YouTube subtitles in the Pile.

Whether you use Claude, ChatGPT, or another AI technology, it was trained on a dataset. The question is whether it was trained on content from willing providers, like Reddit, or whether the search for data expanded to content used without the creators’ knowledge. It’s something to consider the next time you use an AI chatbot.



Laura Tucker
