When Zoom users recently realized that the company had updated its terms of service to allow it to use data collected from video calls to train its artificial intelligence systems, the backlash was swift. Celebrities, politicians and academics threatened to quit the service. Zoom quickly backtracked.
These are tense times. Many are worried, and quite rightfully so, that AI companies are threatening their livelihoods - that AI services like OpenAI's ChatGPT, Google's Bard and Midjourney have ingested work that artists, writers, photographers and content creators have put online, and can now emulate and produce it for cheap.
Other anxieties are more diffuse. We're not yet entirely certain what these AI companies are capable of, exactly, or to what ends their products will be used. We worry that AI can be used to mimic our digital profiles, our voices, our identities.
Which is why the outrage against Zoom's policy makes perfect sense - videoconferencing is one of the most intimate, personal and data-rich services we use.
Whenever we Zoom - or FaceTime or Google Meet - we are transmitting detailed information about our faces, homes and voices to our friends, family and colleagues; the notion that data would be mined to train an AI that could be used for any purpose a tech company saw fit is disconcerting, to say the least.
And it raises the question: what kind of info are we comfortable forking over to the AIs, if any?
A good rule of thumb, to begin with: if you are posting pictures or words to a public-facing platform or website, chances are that information is going to be scraped by a system crawling the internet gathering data for AI companies, and very likely used to train an AI model of one kind or another. If it hasn't already.
Websites
If you have a website for your business, a personal blog, or write for a company that publishes stories or copy online, that information is getting hoovered up and put to work training an AI, no doubt about it.
The text-generating AI that has made headlines this year - OpenAI's ChatGPT, Google's Bard, Meta's LLaMA - is more technically known as a large language model, or LLM. (Image generators like OpenAI's DALL-E and Midjourney work on similar principles.) Simply put, LLMs work by "training" on large data sets of text scraped from the web. Very large data sets: Google's Colossal Clean Crawled Corpus, or C4, spans some 15 million websites.
Earlier this year, investigative reporters at the Washington Post teamed up with the Allen Institute for AI to analyze the kinds of websites that were scraped to build that data set, which has played a major role in training many of the AI products you're most familiar with.
Everything from Wikipedia entries to Kickstarter projects to New York Times stories to personal blogs was scanned to amass the data set.
So let's say you don't want OpenAI building ChatGPT-7 with your data. What can you do? Just this week, OpenAI announced its latest web-crawling tool, GPTBot, along with instructions on how to block it: website owners and admins can add an entry to their site's robots.txt file that disallows the GPTBot user agent.
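Per OpenAI's published guidance, blocking the crawler comes down to two lines in the robots.txt file at your site's root (e.g. yoursite.com/robots.txt):

```
# Tell OpenAI's crawler to stay off the entire site
User-agent: GPTBot
Disallow: /
```

The "/" means the whole site is off-limits; a site owner who wanted to block only part of the site could replace it with a specific path instead.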
Furthermore, the web crawlers looking for data aren't supposed to penetrate paywalls or password-protected websites, so putting your site under lock and key should keep it out of AI training data.
What about apps? First off, the same principle that goes for the web goes for 99 percent of apps out there - if you are creating something to post publicly on a digital platform, chances are it will be scooped up by an AI crawler. Nothing is sacred here, unless the service offers end-to-end encryption or good privacy settings.
TikTok
Take TikTok, one of the most-downloaded apps in the world. It has run on AI and machine learning from the start. Its algorithm, which serves users the content it thinks they'll want most, is based on tested AI techniques such as computer vision and machine learning.
Every post submitted is scanned, stored and analyzed by AI, and used to train the algorithm to improve its ability to send you content it thinks you'll like.
Beyond that, we don't have much information about what ByteDance, which owns TikTok, might plan to do with all the data it's processed. But they've got a vast trove of it - and a lot is possible.
Instagram
With Instagram, we know that your posts have been fed into an AI training system operated by Meta, the company that owns Instagram and Facebook.
News broke in 2018 that the company had scraped billions of Instagram posts for AI data training purposes. The company said it was using those data to improve object recognition and its computer vision systems, but who knows?
Facebook
Technically, Facebook prohibits scraping, so the biggest crawlers probably haven't scooped up your posts for wider use in products like ChatGPT.
But Meta itself is very much in the AI game, just like all the major tech giants - it has trained its own proprietary system, LLaMA - and it's not clear what the company itself is doing with your posts. We do know that it has earmarked user posts for AI processing in the past. In 2019, Reuters reported that Facebook contractors were looking at posts, even those set as private, in order to label them for AI training.
Twitter/X
Like Facebook, X/Twitter has technically prohibited scraping of its posts, making it harder for bots to get at them. But owner Elon Musk has said that he's interested in charging the AI scrapers for access, and in using them to train X's own nascent AI efforts.
"We will use the public tweets - obviously not anything private - for training," Musk said in a Twitter Spaces chat in July, "just like everyone else has."
Reddit
The popular forum Reddit has been scraped for data plenty. But recently, its CEO, Steve Huffman, has said that he intends to start charging AI scrapers for access. So, yes, if you post on Reddit, you're feeding the bots.
We could keep going down the line - but this sampling should help make the gist of the matter clear: Almost everything is up for grabs if you're creating content online for public consumption.
So that leaves at least one big question: What about messages, posts and work you make with digital tools for private consumption?
This is where it gets more complicated. It's case by case, and if you really want to be sure about whether the products you're using are harvesting your words or work for AI training, you're going to have to dive into some terms of service yourself - or seek out products built with privacy in mind.
Google and Gmail
It's easy to forget that until a few years ago, Google's AI read your e-mail. In order to serve you better ads, the search giant's automated systems combed your Gmail for data. Google says it doesn't do that anymore, and claims that any of the Workspace products you might use, such as Docs or Sheets, won't be used to train AI without your consent. Nonetheless, authors are uneasy about the prospect that their drafts will wind up training an AI, and quite reasonably so.
Grammarly
The grammar and spell-checking tool explicitly states that any text you run through its system can be used to train AI systems in perpetuity.
Its terms of service states: "Customer hereby grants us the right to use, during and after the subscription term, aggregated and anonymized customer data to improve the services, including to train our algorithms internally through machine learning techniques."
In other words, you're handing Grammarly AI training material every time you check your spelling.
Apple Messages
Apple's in the AI game too, though it doesn't publicly flaunt it as much. And it insists that the kind of machine learning it's interested in is what's known as on-device AI - instead of taking your data and adding it to large data sets stored on the cloud, its automated systems live locally on the chips in your device.
Apple harnesses machine learning to do things like improve autocorrect in your text messages, recognize the shape of your face, pick out friends and family members in your camera roll, automatically adjust noise cancellation on your AirPods when it's loud, and ID that plant you just snapped on a hike.
So Apple's machine learning systems are reading your texts and scanning your photos, but only within the confines of your iPhone - it's not sending that information to the cloud, like most of its competitors.
Zoom
And finally, we return to Zoom, because I have one last point to add to the dust-up that got us started here: while Zoom may have added one little line to its terms of service indicating that it will not use your on-call data to train its AI without consent, it can still keep your data for just about everything else.
In other words, it can do just about anything it wants with our private recorded conversations, except train AI on them without our consent.
Therein lies the rub. So much of what the tech industry is doing with AI is not orders of magnitude more invasive or exploitative than what it has been doing all along - it's an incremental amplification.
But perhaps we should be grateful that this particular amplification is genuinely unnerving: it gives us the opportunity to renegotiate what we should consider socially - and economically - acceptable in how our data are taken and used.
The best solution right now, if you want to keep your words, images and likeness away from AI, is to use encrypted apps and services that take privacy seriously.
Los Angeles Times (TNS)