Whose Data is That? AI Doesn’t Care

AI companies need data. The usual datasets don’t suffice anymore; there’s a real need for human-generated content that can speak better to the users of AI tools. But the people trying to fight this, armed with human-oriented copyright laws, are naturally losing.

There have been many developments in the space of content scraping by AI companies to train their tools. Everything from news stories and blog posts to creative art and copyrighted music is being used to train language models. These models let users generate new content, which often directly competes with the very work that “inspired” it. It reduces advertising revenue, bypasses paywalls, produces comparable work far faster, and makes commercial use more cost-effective. We can’t apply our human-oriented copyright laws and “what qualifies as inspiration” arguments to AI tools.

Tumblr’s Data Being Sold

Automattic is the company behind WordPress.com and a major contributor to WordPress, the backend content management system powering 40%+ of websites on the internet today (including this one). The company bought Tumblr for $3 million, a price considered shamefully low for a once-popular social media platform, especially given that Yahoo bought the site for $1.1 billion back in 2013.

Now, Automattic plans to sell Tumblr’s data to Midjourney and OpenAI for training purposes. You can opt out using the “Prevent third-party sharing” option in your blog’s settings. Tumblr remains a loss-making, dying pursuit, even for Automattic. 404 Media reported that the data dump OpenAI and Midjourney already have access to includes private posts, deleted posts, private answers, and premium content, which shouldn’t be the case. Neither AI company has released a statement on this, even when asked by The Verge.

The open-source WordPress software powers websites, but Automattic’s WordPress.com is a hosted blog network, and the company is also reportedly trying to sell and monetize its data.

It can be argued that for a company already struggling to moderate the output of its AI chatbot, training on Tumblr and WordPress.com data might not be the best idea. Unless it wants to do a lot of fine-tuning to keep the model from becoming overly horny and depressed.

Please note that the WordPress software powering websites is different from Automattic’s WordPress.com blog publishing platform. WordPress.org, the organization facilitating the software, does not store any data from the sites that use it, so self-hosted WordPress websites are not going to be used for data scraping. WordPress.com, on the other hand, is a private enterprise more akin to a blogging social network, and not a particularly compelling choice compared to creating your own website or publishing on a more popular platform like Medium.

The Hunger of Language Models

Language models need to come up with answers to complex questions, follow reasoning, adopt casual and personal tones of writing, offer opinions, and so much more. For all this, just scraping public domain books (such as H. P. Lovecraft’s vast body of work), Wikipedia, research papers, news, and technical documentation isn’t sufficient. All of that gives AI tools a rather dull voice.

That’s why OpenAI’s original data set for training its earlier models relied on sources such as Twitter and Reddit.

Twitter

Elon Musk wanted to launch his own AI tool and resented how OpenAI had become not-so-open-anymore thanks to the cash injections from Microsoft, so Twitter updated its terms to prohibit any kind of scraping and crawling. This theoretically stopped AI models from getting access to juicy, controversial opinions from the microblogging site well-known for them. Soon afterward, Elon Musk’s xAI released its own LLM called Grok, pitched as:

Witty, sarcastic, and rebellious. I’d let users be the judge of that. Anyway, this underlines the models’ hunger for human-generated short-form content, which is typically casual, opinionated, and more “real” compared to neutral, academic writing. The entire USP of Grok was that it was trained on exactly this kind of juicy Twitter data.

Reddit

What companies seem to be even more interested in is Reddit’s data. Earlier, there was news of Reddit inking a deal to sell its user data to an unnamed AI company. The “front page of the internet” was already experiencing severe backlash after it stopped app developers from getting free access to its API (thus shutting down apps such as Apollo). More recently, there’s been another backlash surrounding its bid to go public with an IPO. Users are against the decision for a variety of reasons.

Reddit is a platform with 73 million daily active users in the US alone, making it one of the most popular websites. Its users generate all of the content, from tech troubleshooting guides to selfless psychological advice, and its moderators work day in and day out to keep the communities clean. Now, the company wishes to sell access to all this hard work for AI training, which would almost certainly give any LLM better reasoning and better, more personalized opinions on virtually every type of topic.

The unnamed company that Reddit was selling its data to, in a deal worth $60 million a year, was none other than Google. Google announced the “deeper partnership” by underlining how Reddit will get access to AI features and, in turn, Google will get access to Reddit’s data API, which is pretty much the user-generated content of Reddit.

Just like training an AI model extensively on Tumblr can make it depressed, training it on Reddit isn’t going to make it super cool either. Without heavy fine-tuning, the model might just become a proponent of conspiracy theories, bad facts, and worse advice. The portion of Reddit that’s actually good is very high quality, but it’s sadly only a fraction of the whole.

Cherry-picking a few leading subs with strict moderator oversight might work, but it just means that you’re relying on Reddit mods to do your fine-tuning for you.
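For illustration only, here’s a minimal sketch of what that kind of cherry-picking might look like, assuming a hypothetical JSONL dump where each comment carries subreddit, score, and body fields. None of this reflects how Google or Reddit actually process the data; the subreddit whitelist and score threshold are made-up examples:

```python
import json

# Hypothetical whitelist of heavily moderated subreddits (illustrative only).
ALLOWED_SUBREDDITS = {"AskHistorians", "askscience", "explainlikeimfive"}
MIN_SCORE = 50  # keep only comments the community strongly upvoted

def filter_comments(dump_path: str, out_path: str) -> None:
    """Keep well-scored comments from whitelisted subreddits as training text."""
    with open(dump_path) as src, open(out_path, "w") as dst:
        for line in src:
            comment = json.loads(line)
            body = comment.get("body")
            if (comment.get("subreddit") in ALLOWED_SUBREDDITS
                    and comment.get("score", 0) >= MIN_SCORE
                    and body  # skip empty or missing bodies
                    and body not in ("[deleted]", "[removed]")):
                dst.write(json.dumps({"text": body}) + "\n")

if __name__ == "__main__":
    filter_comments("reddit_comments.jsonl", "training_corpus.jsonl")
```

Even then, the filtering criteria are really just a proxy for moderator and voter judgment, which is the point above: the mods are doing the curation work.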

When Your Content is created with or submitted to the Services, you grant us a worldwide, royalty-free, perpetual, irrevocable, non-exclusive, transferable, and sublicensable license to use, copy, modify, adapt, prepare derivative works of, distribute, store, perform, and display Your Content and any name, username, voice, or likeness provided in connection with Your Content in all media formats and channels now known or later developed anywhere in the world. This license includes the right for us to make Your Content available for syndication, broadcast, distribution, or publication by other companies, organizations, or individuals who partner with Reddit. You also agree that we may remove metadata associated with Your Content, and you irrevocably waive any claims and assertions of moral rights or attribution with respect to Your Content.

User Agreement, Reddit

Other Platforms

There are a bunch of other platforms that also have a good deal of such user-generated content:

  • The Stack Exchange network is the best source of technical Q&A. It has blocked the OpenAI web crawler using the same robots.txt method that Twitter used (see the sketch after this list). It’s noteworthy that with the popularity of ChatGPT, a lot of answers on many Stack Exchange sites were themselves AI-generated, so it’s not clear how useful the data would’ve been anyway.
  • Quora’s questions and answers are already being used by its own LLM behind Poe, the chatbot it released a year ago. Poe’s answers then, in turn, get featured as answers on Quora, which is now loaded with AI-generated answers as well.
  • A lot of technical forums, such as TechPowerUp, Linux.org, and MDN Discourse, remain good sources for LLMs to get the latest human scoop, as well as a deeper understanding of the larger problems that have marred the worlds of programming, hardware, software, and operating systems for ages.
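For context, that blocking happens through a site’s robots.txt file, which declares rules that crawlers follow on a purely voluntary basis. A minimal example that turns away OpenAI’s documented GPTBot crawler, and Common Crawl’s CCBot for good measure, looks like this:

```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```

Nothing enforces this; a scraper that chooses to ignore robots.txt still gets the content.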

News

Anybody not living under a rock is familiar with The New York Times’ lawsuit against OpenAI and Microsoft over copyright infringement. In simple terms, the reporters and journalists of sites such as The New York Times write their news stories after much research, many challenges, and editorial oversight. This is real, hard work. Tools like ChatGPT are not only trained on all of this, but they can also reproduce news events covered exclusively by the Times. AI tools could even bypass paywalls, though that loophole was addressed earlier last year. When ChatGPT serves up the work of Times journalists, users turn to ChatGPT for their news, making it a competitor to the Times itself. What worsens the situation is that these tools can be given direct links and summarize entire articles, eating into the revenue generated by the Times.

In related news, three US outlets also demanded compensation in their lawsuit against OpenAI.

OpenAI defended its training practices against the Times by essentially saying that the internet is a free-for-all. Clearly, we need different copyright laws for AI regulation. The ones designed for human copyright infringement simply won’t work, because AI is orders of magnitude faster and operates at far greater scale than any human brain.

Seeing this problem coming, AI companies will now need to buy news. OpenAI already struck such a deal with The Associated Press back in July 2023. Apple, not willing to get into the quagmire to begin with, reportedly explored signing multiyear deals worth at least $50 million with news publishers, despite not having shipped a public LLM or AI tool so far.

Data is the backbone of AI, and all models rely on patterns and correlations established by vast amounts of training data. Generative AI tools need high-quality training data—like copyrighted content from The New York Times and other notable publishers—to provide high-quality responses. Enough quantity of training data also reduces hallucinations, actually making responses relevant.

Felix Van de Maele, CEO of Collibra, in an op-ed on Fast Company

Artistic Data

Data isn’t just text. OpenAI signed a deal with Shutterstock for its images, videos, music, and metadata to train its DALL-E AI image generator. It’s been established that training AI models on copyrighted works is fair use only to some limited capacity. So, AI companies are now striking deals with creative pools of photography and art; if they don’t, they might get sued (Getty Images sued Stability AI, for example).

The problem is that the companies hosting art, such as Shutterstock, DeviantArt, and even Getty Images, have more incentive to partner with AI companies than to fight for their users. DeviantArt, for example, experimented with AI art on its own platform even as its artists protested against AI using their work, or against AI now being able to replace them in some capacity. ArtStation has taken a pro-user approach: any project tagged with “NoAI” will tell AI scrapers that the art cannot be used for training. This is up to the scrapers and AI companies to honor, by the way.
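Technically, such opt-outs are typically expressed as an HTML meta directive, following the “noai”/“noimageai” convention DeviantArt proposed. A minimal sketch of what a page-level tag of this kind looks like (the exact markup a given platform emits may differ):

```html
<!-- Advisory request that AI crawlers not use this page's content
     for training. Compliance is entirely voluntary on the crawler's side. -->
<meta name="robots" content="noai, noimageai">
```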

A lawsuit was filed by artists against DeviantArt’s AI generator, Stability AI, and Midjourney. A US District Court judge ruled to dismiss it. The lawsuit had some flaws; for example, the artists hadn’t registered their works with the US Copyright Office to begin with. But this does set a dangerous precedent that AI companies can exploit. Not every artist publishing on online platforms is going to go through the trouble of registering all of their work, after all.


Your online accounts are not protected from web scraping, and your creative work certainly isn’t. Even your Dropbox files get shared with OpenAI when you use the platform’s AI tools. OpenAI has already clarified that it’s simply not possible to sustain ChatGPT without access to copyrighted work. Microsoft, Google, and other heavyweights such as Amazon, Meta, and Apple aren’t going to leave users alone and train their models only on research papers, documentation, public domain work, and Wikipedia articles.

Does this mean we should just close all of our accounts and take an early retirement? Is there any way to make your voice heard? Well, so far, we can only trust that the regulators, governments, and AI-focused bodies of tech giants will do right by the users of the internet. It begins with offering tools to opt out individually, followed by mechanisms to make sure those opt-outs are actually honored.

G7 leaders announced their agreement on international guiding principles for AI. The US Senate held a couple of hearings on the need for transparency in AI. The Biden administration issued an executive order on AI safety. The EU wrote an entire AI Act, the first comprehensive piece of AI regulation.

Still, we’ve yet to see any major news of offenders being punished, artists being compensated, or a copyright lawsuit actually being won. Users haven’t been given much more control than they had before, apart from the option to opt out here and there.

The unsuspecting denizen of the interwebs is not empowered. They likely don’t even know how to stop AI companies from getting their data. And that’s just bad. Is it time we took this as seriously as social media apps misusing private information or tuning their algorithms to surface content that violates ethics? It would sure be fun to watch Sam Altman, Sundar Pichai, and Satya Nadella at Senate hearings like Zuckerberg and the lot. Zuckerberg will probably still be there anyway. That man is everywhere (just like Meta’s tracking code).

By Abhimanyu

Unwrapping the fast-evolving AI popular culture.