Big Tech Race To Acquire Generative AI Training Data

April 6, 2024

In an interview, Ted Leonard, CEO of Colorado-based Photobucket, revealed that he is in talks with several tech companies to license Photobucket’s 13 billion photos and videos. The aim is to train generative AI models that can produce new content in response to text prompts.

According to him, rates vary widely depending on the buyer and the type of imagery sought, ranging from 5 cents to $1 per photo and even $1 per video.

“We’ve spoken to companies that have said, ‘we need way more,’” Leonard continued, adding that one buyer had said it wanted more than one billion videos, far more than his platform currently holds. “You scratch your head and say, where do you get that?”

Photobucket was the most popular image hosting site in the world in the early 2000s. It was the media backbone for once-popular services like Friendster and Myspace, and it had 70 million users and roughly half of the US online photo market. Today, just 2 million people still use Photobucket, according to analytics tracker Similarweb. The generative AI revolution, however, could give it a new lease of life.

Citing commercial confidentiality, Photobucket declined to name its potential buyers. The ongoing negotiations have not been previously reported. They offer a glimpse of a thriving data market that is emerging from the race to dominate generative AI technology, and they raise the possibility that the company is sitting on content worth billions of dollars.
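The “billions of dollars” figure can be sanity-checked with simple arithmetic. The sketch below is only a rough bound: it assumes, hypothetically, that all 13 billion items are licensed once at the quoted per-photo rates of 5 cents to $1. The article does not state the mix of photos and videos or how much of the archive would actually sell.

```python
# Figures quoted in the article; pricing every item at the photo rate
# is an illustrative assumption, not a reported deal structure.
ITEM_COUNT = 13_000_000_000   # photos and videos in the archive
LOW_RATE_CENTS = 5            # 5 cents per photo
HIGH_RATE_CENTS = 100         # $1 per photo


def archive_value_usd(items: int, rate_cents: int) -> float:
    """Value of licensing every item once at a flat per-item rate."""
    return items * rate_cents / 100


low = archive_value_usd(ITEM_COUNT, LOW_RATE_CENTS)
high = archive_value_usd(ITEM_COUNT, HIGH_RATE_CENTS)
print(f"${low / 1e9:.2f} billion to ${high / 1e9:.2f} billion")
```

Under these assumptions, the range runs from several hundred million dollars at the low end to well into the billions at the high end.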

Tech giants like Google, Meta, and Microsoft-backed OpenAI initially trained generative AI models such as ChatGPT, which can mimic human creativity, on mountains of data scraped for free from the internet. Despite facing lawsuits from a number of copyright holders, they maintain that doing so is both legal and ethical.

At the same time, these big tech companies are quietly paying for content locked behind paywalls and login screens, giving rise to a hidden trade in everything from chat logs to personal photos from old social media apps that people no longer use.

According to Edward Klaris of the law firm Klaris Law, “there is a rush right now to go for copyright holders that have private collections of stuff that is not available to be scraped.” The firm says it is advising content owners on deals worth tens of millions of dollars apiece to license archives of photos, movies, and books for AI training.

Reuters, a news agency, spoke to more than 30 people with first-hand knowledge of AI data deals, including current and former executives, lawyers, and consultants. Its report is the first comprehensive analysis of this emerging market, covering the types of content being bought, the prices being paid, and growing concerns that AI models could gain unauthorized access to personal data.

Microsoft and Google have supplier codes of conduct that contain data-privacy provisions; OpenAI, Meta, Microsoft, Apple, and Amazon all declined to comment on specific data deals. Google added that in the event of a breach, it would “take immediate action, up to and including termination” of its supplier agreement.

Because many companies in the opaque AI data market do not publish their agreements, some well-known market research firms say they cannot even begin to estimate its size. Others, including Business Research Insights, put the market at about $2.5 billion today and predict it could reach up to $30 billion within the next decade.
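To put that forecast in perspective, the implied average annual growth rate can be worked out from the two figures. This is a minimal sketch that assumes “the next decade” means exactly 10 years and that growth compounds smoothly; neither assumption comes from the forecast itself.

```python
def implied_cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate needed to go from start to end."""
    return (end / start) ** (1 / years) - 1


# $2.5 billion today to $30 billion; the 10-year horizon is an assumption.
rate = implied_cagr(2.5, 30.0, 10)
print(f"Implied annual growth: {rate:.1%}")
```

A twelvefold increase over ten years works out to roughly 28% growth per year.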

Race Of AI Generative Data

The data land grab comes as makers of big generative AI “foundation” models face growing pressure to account for the vast volumes of data they feed into their systems, a process known as “training” that takes a lot of time and a lot of computing power.

Tech companies argue that the technology would be cost-prohibitive without access to vast, free archives of scraped web page data, such as those offered by the non-profit repository Common Crawl, which they describe as “publicly available.”

However, their approach has triggered a wave of copyright lawsuits and regulatory scrutiny, and publishers have responded by adding scraper-blocking code to their websites. In response, AI model developers have begun to hedge their risks and secure data supply chains, both through deals with content owners and through a growing number of data brokers that have sprung up to meet demand.

According to a source close to the matter, several tech companies, including Meta, Google, Amazon, and Apple, struck licensing agreements with Shutterstock in the months following ChatGPT’s debut in late 2022 to use its vast collection of images, videos, and music files for training.

Shutterstock’s CFO Jarrod Yahes said that the original deals with big tech companies were worth between $25 million and $50 million each but that most were later expanded. There has been a new “flurry of activity” in the last two months, he continued, as smaller tech players have followed suit. Yahes declined to comment on specific contracts. Neither the Apple agreement nor the size of the other deals had previously been made public.

Freepik, a Shutterstock rival, told Reuters that it had licensed the bulk of its 200 million photos to two major internet companies for 2 to 4 cents each. Without naming the buyers, CEO Joaquin Cuenca Abela said that five more deals of a similar nature are pending.

OpenAI, an early Shutterstock customer, has also signed licensing deals with at least four news organizations, including Axel Springer and The Associated Press. Separately, Thomson Reuters, which owns Reuters News, said it had struck deals to license news content for training AI large language models, but it did not give any further details.

“Content Sourced Ethically”

There is also a growing sector of AI data firms that are acquiring the “rights to real-world content like podcasts, short videos, and interactions with digital assistants.” These firms are also building networks of contract workers to produce custom visuals and voice samples, akin to an Uber-style gig economy for data rather than ridesharing.

In an interview, Defined.ai CEO Daniela Braga said the company licenses data to several corporations, including Meta, Google, Apple, Amazon, and Microsoft. Rates vary by customer and content category, Braga noted, but typically run from $1 to $2 per image, $2 to $4 per short-form video, and $100 to $300 per hour of longer films. She added that the market rate for text is $0.001 per word.

She said that images depicting nudity, which require extra care, are priced between $5 and $7. According to Braga, Defined.ai shares a portion of its earnings with content suppliers. It obtains consent from the people whose data it uses and removes personally identifiable information so that it can market its datasets as “ethically sourced.”

A Brazilian entrepreneur who supplies the company with images, podcasts, and medical data said he pays the content owners between 20% and 30% of the total deal amount. According to the supplier, who requested anonymity due to commercial concerns, the most expensive photographs in his collection are those used to train AI systems that filter out content internet companies have banned, such as graphic violence.

To meet that demand, he relies on police, freelance photojournalists, and medical students to supply him with photographs of crime scenes, war violence, and surgeries, respectively. He frequently finds these sources in South America and Africa, where the distribution of graphic images is more prevalent.

A Risky Task

While licensing could address certain ethical and legal concerns, several experts in the field warn that using data from aging websites like Photobucket to power cutting-edge AI models raises others, particularly around user privacy.

AI systems have been caught reproducing their training material verbatim, including the Getty Images watermark, complete sentences from New York Times stories, and photos of real people. As a result, someone’s private photographs or thoughts posted decades ago could surface in generative AI outputs without warning or permission.

According to Photobucket CEO Leonard, an October update to the terms of service gives the company the “unrestricted right to sell any uploaded content for the purpose of training AI systems.” Leonard says this puts him on firm legal footing. He sees data licensing as a better option than ad sales. “We need to pay our bills, and this could give us the ability to continue to support free accounts,” he said.

Braga says she stays away from “platform” companies like Photobucket when sourcing social media content. Instead, she prefers to work with influencers who own the photographs and have a stronger claim to the licensing rights.

When asked about platform material, Braga responded, “I would find it very risky. If there’s some AI that generates something that resembles a picture of someone who never approved that, that’s a problem.”

Photobucket isn’t the only platform turning to licensing. Just last month, Automattic, the parent company of Tumblr, announced that it would share data with “select AI companies.” As reported in February, Reddit and Google reached an agreement for Google to use Reddit content to train its artificial intelligence models. Before its March IPO, Reddit disclosed that the US Federal Trade Commission is investigating its data-licensing business and that the business could fall foul of evolving privacy and intellectual property rules.
