Privacy risks of generative AI

Generative AI has been in the spotlight since OpenAI launched its groundbreaking chatbot, ChatGPT, late last year. Using text prompts, users can direct ChatGPT to write essays, poetry, and code, and to answer questions about pretty much anything, while sibling generative models produce images and video.

While it can be fun to feed prompts to ChatGPT, bundled with this new technology is a host of privacy concerns. And that’s what we’re going to look at in this article. We’ll provide an overview of generative AI and an in-depth view of its privacy implications.

How does generative AI work?

Generative AI models like ChatGPT are built on large language models (LLMs), which are fed enormous amounts of text data. From that data, they learn to produce natural, authoritative responses to text-based prompts.

LLMs produce mathematical models of the statistical distribution of tokens within a massive repository of human-generated text. In this context, tokens can be words, parts of words, or individual characters – including punctuation. So, for example, if you query the model with, “The first American president was…,” and it responds with “George Washington,” that doesn’t mean it “knows” anything about American history or George Washington. What it means is that it interpreted your question as, “Given the statistical distribution of words within the immense public corpus of English text, which words are most likely to come after the sequence ‘The first American president was’?” As it turns out, those words are “George Washington.”
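To make “statistical distribution of tokens” concrete, here’s a toy sketch: a bigram model that counts which token most often follows another in a tiny hand-made corpus. Real LLMs use transformer networks over sub-word tokens and vastly more text – the corpus and word-level tokens below are simplifying assumptions – but the “most likely next token” principle is the same.

```python
from collections import Counter, defaultdict

# Toy stand-in for the "immense public corpus" of text. Invented for
# illustration; real models train on far more data and predict
# sub-word tokens rather than whole words.
corpus = (
    "the first american president was george washington . "
    "the first american president was george washington . "
    "the first french president was louis-napoleon bonaparte ."
).split()

# Count which token follows each token: the simplest possible
# "statistical distribution of tokens".
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def most_likely_next(token):
    """Return the most frequent continuation seen in the corpus."""
    return next_counts[token].most_common(1)[0][0]

print(most_likely_next("was"))  # prints "george" (seen twice vs. once)
```

The model “answers” with George Washington not because it knows any history, but because that continuation is statistically dominant in its corpus – exactly the point made above.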

LLMs are extremely powerful, and their responses are so natural-sounding and authoritative that you’d be hard-pressed to distinguish a chatbot’s responses from genuine human ones. In fact, Google’s own LLM, LaMDA (Language Model for Dialogue Applications), made headlines last year when a Google engineer claimed it had achieved sentience. This was a seasoned technologist making that claim, not a layman. Google ultimately dismissed the claim and, unfortunately, fired the engineer. But the episode highlights just how good LLMs are at mimicking human speech.

Generative AI models are thread-based, allowing them to refer back to and build on earlier statements and conversations. Users can refer to their previous exchanges with the bot, and, much like a human, it will immediately know what you’re talking about and pursue the discussion. So these models go well beyond simple question-and-answer games: they can answer follow-up questions, challenge incorrect premises, admit their mistakes, and reject inappropriate requests, making them even more convincing.

Your previous statements are continually informing generative AI models’ responses. As the conversation evolves, so does the chatbot – using it is training it. And that leads us to the tech’s privacy implications.

Search engine on steroids

In many ways, generative AI chatbots are like search engines: you type in a query that gets processed, and an output is produced. The output produced by a chatbot can be pretty different from your typical search engine URL list, but the process is similar. And the data collection is similar as well.

As you probably already know, search engines like Google and Bing save your queries, process them algorithmically to extract as much information about you as possible, and add that to the profile they build about you. That’s reason enough to use DuckDuckGo in my book, but the point is that AI chatbots record all of your queries and may build dossiers on their users just as a typical search engine does. Google’s Bard does this. In a blog post, OpenAI (ChatGPT’s maker) claimed that it doesn’t build user profiles from the collected data.

But due to the nature of generative AI chatbots and their capabilities, the data that users will feed to a chatbot can also be very different from what they would provide a search engine. For example, I could copy and paste this entire article into a chatbot’s UI and instruct it to rewrite it for me to shorten the word count. And it would do it in a few seconds, with varying degrees of success. This example highlights the differences in the sheer quantity of data that one would feed a chatbot versus a traditional search engine. And given that more data yields more insights, a chatbot has a good chance of ingesting more of your data and learning more about you than a search engine.

You may have unwittingly trained it

Even if you haven’t used generative AI at all, it may still have used you.

As I mentioned earlier, generative AI models are fed colossal amounts of text data during their training – think petabytes. But where does all this data come from? A significant portion of it, possibly even the majority, is scraped from the internet without the knowledge or consent of the third-party sites from which it was taken.

The pipelines that train generative AI models scrape data from public social media profiles, personal websites, public records, and even articles delisted from search results under the EU’s right to be forgotten. While this information is technically public, AI opens up new avenues for privacy violations by making such data far easier to access, whether intentionally or accidentally.

Scraping data and adding it to an AI’s corpus violates contextual integrity. In other words, personal information is exposed outside of the original context or purpose for which it was collected. It’s not hard to imagine someone asking ChatGPT or another generative AI, “who lives on Madison Street?” and getting a full list of names and addresses in response. Yes, the information might be public, but it was never meant to be part of an AI’s corpus.
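The “Madison Street” scenario is really about aggregation: records that were each public in isolation become a single queryable whole once scraped into one corpus. This hypothetical sketch (the records and names are made up) shows how trivial the lookup becomes after aggregation:

```python
# Hypothetical scraped records: each was public in its own context (a
# county register, a personal site, a social profile), but none was
# published as part of a queryable aggregate.
scraped = [
    {"name": "A. Jones", "street": "Madison Street", "source": "county register"},
    {"name": "B. Smith", "street": "Oak Avenue", "source": "personal site"},
    {"name": "C. Lee", "street": "Madison Street", "source": "social profile"},
]

def who_lives_on(street):
    """Answer the 'who lives on Madison Street?' query over the aggregate."""
    return [r["name"] for r in scraped if r["street"] == street]

print(who_lives_on("Madison Street"))  # ['A. Jones', 'C. Lee']
```

The privacy harm isn’t in any one record; it’s that the aggregate answers questions none of the original sources were ever meant to answer – the contextual-integrity violation described above.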


Now consider what happens when an employee decides a chatbot could help summarize yesterday’s meeting notes. The chatbot will probably do a fine job of it, but it may well have ingested sensitive trade secrets in the process. That’s another privacy risk.

Chatbots invite you to feed them large amounts of data – again, much larger than what you typically provide a search engine. So it will be much harder to keep track of what we gave away to the chatbot than what we fed the search engine. We’re going to have to rely on people’s ability to realize this and refrain from oversharing. And the odds of that working out are, well, pretty slim.

Why is that? I believe part of the reason, at least, is that we’re increasingly putting extremely complex, cutting-edge technology in the hands of laypeople who clearly do not understand how it works. This isn’t an insult to anybody’s intelligence – it couldn’t be any other way. Unlike the tech of yesteryear, it’s simply much too complicated for people to understand comprehensively.

That leads us to the question of what happens to your data once it’s been ingested – and that’s the black box problem.

Sucked into the black box

AI has a “black box” issue for a few different reasons. The problem arises because we don’t fully understand what happens “inside” the AI model. But it also occurs because while we know that these chatbots collect our data, we’re not clear on how the companies behind the tech are using that data. Their privacy policies tend to be written in legalese and are quite vague – using expressions such as “we may,” “sometimes,” and “from time to time.”

The first black box issue (what happens inside the AI?) arises because of how AI models are trained. Deep learning uses massive artificial neural networks with multiple hidden layers and nodes. Each node processes its input and transfers its output to the next layer of nodes. Through that process, the AI model ingests millions upon millions of data points and identifies correlations within those data points to produce an output.

That process (from input to output) happens inside the box and is predominantly self-directed – i.e., the machine pretty much trains itself. So it’s obviously going to be difficult for users to understand what’s going on. Again, even programmers and data scientists have trouble interpreting what’s happening inside the box. We all know something is happening, but we have no visibility into what is happening – hence the term black box.
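As an illustration of what “inside the box” amounts to, here’s a minimal sketch of a feed-forward pass in plain Python. The two-node network and hand-picked weights are invented for illustration; a production model chains billions of learned weights through the same kind of arithmetic, which is precisely why no one can eyeball the computation and explain the output.

```python
import math

# A tiny feed-forward network with hand-picked weights (invented for
# illustration; real models learn billions of weights from data).
W1 = [[0.5, -0.2], [0.1, 0.8]]  # input -> hidden weights
W2 = [0.7, -0.3]                # hidden -> output weights

def sigmoid(z):
    # Squashing function applied at each node.
    return 1.0 / (1.0 + math.exp(-z))

def forward(x):
    # Each hidden node sums its weighted inputs and squashes the result;
    # the output node then does the same with the hidden activations.
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    return sigmoid(sum(w * h for w, h in zip(W2, hidden)))

y = forward([1.0, 0.0])
print(y)  # a single number between 0 and 1; *why* it's that number is opaque
```

Even in this two-node toy, the output is just the residue of chained multiplications and squashings, with no human-readable reasoning attached. Scale that up by nine or ten orders of magnitude and you have the black box.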

The second black box issue is the data collection/privacy policy issue. To start, here’s a screenshot of OpenAI’s privacy policy.

[Screenshot: OpenAI privacy policy]

Reading through it makes it clear that OpenAI is collecting a whole lot of data. But beyond that, it leaves one with more questions than answers. For how long is the data kept? What does “conducting research” entail? What does “enhance your experience” mean?

Further down in the privacy policy, we find the following statement:

“As noted above, we may use Content you provide us to improve our Services, for example, to train the models that power ChatGPT. See here for instructions on how you can opt out of our use of your Content to train our models.”

Every time you use AI, you’re also training it. OpenAI does provide instructions to opt out of your data being used to train the model, but that won’t turn off the collection. So that’s essentially just OpenAI “promising” not to use your data for training. But it’s still collected, so how would you know whether or not your data was used for training? Short answer: you’ll never know. That’s life in the black box.

Building upon the above points, what happens when AI is used within a medical context, for instance? There’s no doubt AI has the potential to produce better medical outcomes for society. It can perform research on various molecules to produce new medicines; it can help in patient diagnoses – potentially detecting many conditions in their very early stages for easier treatment. And the list goes on.

But given the above black box issues, what happens to your medical data once AI has ingested it? Who owns it? Will it be shared? If so, with whom and why? Will it be shared with insurance companies, potentially leading to loss of coverage? Your guess is as good as mine.

Sentiment analysis

Remember, close to ten years ago, when many news stories were written about just how easy it was to uniquely identify individuals online with just a few data points (typically just four)? Here’s one such story from the New York Times. It was rightfully scary and creepy – especially for the privacy-minded individuals out there.

AI just upped the stakes with sentiment analysis.

Sentiment analysis, or opinion mining, refers to AI’s ability to interpret and classify human sentiments as positive, neutral, or negative. This is already prevalent in customer service call centers, where a generative AI chatbot analyzes the customer’s speech and classifies their statements as positive, neutral, or negative to inform the next steps in the conversation. The ostensible goal is to upsell products and foster brand loyalty. But the point is that corporations will now be building profiles of our sentiments based on AI-driven speech analysis.
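Under the hood, a sentiment classifier maps an utterance to one of those three labels. Here’s a deliberately naive, keyword-based sketch; production systems learn their signals from labelled data rather than hard-coding word lists, so treat the lists below as illustrative assumptions:

```python
# Hypothetical keyword lists standing in for learned sentiment signals.
POSITIVE = {"great", "love", "excellent", "happy", "thanks"}
NEGATIVE = {"terrible", "hate", "awful", "angry", "refund"}

def classify_sentiment(utterance):
    """Label an utterance as positive, negative, or neutral."""
    words = [w.strip(".,!?") for w in utterance.lower().split()]
    # Net score: positive hits minus negative hits.
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(classify_sentiment("I love the product, thanks!"))  # positive
```

A call center pipeline runs something like this on every sentence you speak, logging the labels over time – which is exactly how a sentiment profile gets built.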

Fun, right?

When AI goes phishing

Another privacy threat that comes with the advent of generative AI is its weaponization (it was, of course, going to happen – one can only feign surprise here). With its uncanny capacity for deepfakes, voice cloning, and natural speech emulation, phishing scams just got a massive shot in the arm.

Remember the good old trick of reading the suspicious email or text message and looking for spelling and grammar errors? That may not save you moving forward. A generative AI chatbot could craft fake messages that appear to come from one of your close friends. It may not achieve doppelgänger status each and every time. But phishing can be a numbers game, and AI could well tip the balance in the attacker’s favor.

We’ve already heard stories of unsuspecting folks being defrauded by a voice-cloned relative urgently requesting money. With the democratization of generative AI, these kinds of attacks will only become more prevalent and successful. Generative AI lowers the bar for would-be scammers and may allow more knowledgeable hackers to pull off attacks that would have otherwise been impossible—scary stuff.

Of course, this is a cat-and-mouse game, so defenses will eventually catch up. But that won’t end the game; it will just escalate it. The bad guys will figure out how to circumvent the defenses until the defense catches up again. Brave new world, indeed.

Asking the right questions

I’d like to have a section titled ‘How to safeguard your privacy from generative AI chatbots.’ But I’m afraid it would only contain a single line:

Don’t use it.

But, as stated above, even if you don’t use it, AI might still use you. And this time, I mean that even if you don’t directly interact with a generative AI chatbot, you could still be the victim of an AI-driven attack. And, let’s face it, most people are going to use AI. So instead, while I can’t really produce a list of tips under the header ‘How to protect your privacy when using AI,’ here’s a list of questions that should be answered by the tech firms pushing generative AI bots.

If you’re a privacy-minded individual, you should get answers to these questions before interacting with AI, so you understand what you may be getting into.

Does the AI model process and store user data (queries, prompts, refinement instructions, and generated output) for training (of the AI model) purposes?

Generative AI companies should clearly disclose whether they process and store user data and, if so, whether that data is used to further train the model.

Can you opt in or out of your data being used to train the AI model?

Be sure to opt out of your data being used to train the vendor’s AI model if you can. If that’s not possible, ensure that the training data you provide will only be used to fine-tune your model/output. If the above cannot be done, I’d recommend not interacting with the chatbot.

If the vendor stores your training and validation data, how long is it kept?

You want assurances that your data is stored securely (i.e., encrypted at rest and in transit) and isolated from your subscription, API credentials, and payment information.

If the vendor stores your training and validation data, can you delete it?

Make sure you have control over the data you share and delete it when it’s no longer needed.

Does the vendor share your data with third parties?

The more your data is shared, the less control you have over it. And while many providers “anonymize” the data prior to sharing, that’s probably not going to be enough (see the New York Times piece above). If your data is shared, I’d think twice before entering a prompt.

Which of the vendor’s employees can access your data?

Make sure only authorized employees can access your data and that they’re few and far between.

Does the provider allow you to opt out of data collection and storage?

While this is unlikely to be offered to individual users, generative AI vendors could allow some organizations to opt out of data collection altogether if their activities involve processing sensitive, confidential, or legally regulated data.

This one could make a difference.

If opting out is possible and approved by the vendor, make sure you get explicit confirmation that your data is not being collected. Beyond that, it’s just going to come down to trust, as you won’t have any visibility into what’s collected or not.

Wrap up

So there you have it. The privacy implications of using generative AI aren’t as clear-cut as using Google’s search engine, for example. For one, Google search has been around for much longer, and we have some insights into how Google works (not that it’s pretty…). Conversely, generative AI has only really been in our lives for a few months, so there are many more unknowns. But the privacy risks are real and perhaps even more severe than what we’ve experienced so far.

Time will tell, I guess, which harms are real and which ones are imagined. But until then, it’s probably wiser to err on the side of caution. I’d recommend staying away from the tech until more light shines on it. But if you must use it, try to be conservative in what you share.

Stay safe.