“We may share aggregated data with our partners.”
“We may share data that is aggregated or de-identified.”
“Our product collects anonymous data for analytics purposes.”
Many organizations argue that they protect privacy through the use of aggregate, de-identified or anonymous data. However, do their users understand what these terms mean? What is aggregate data? Is there a difference between de-identified and anonymous data? For researchers, which data sets have more value: aggregate or anonymous?
This guide explains the differences between the terms and will help you make informed decisions when companies request to use your personal data.
Aggregate data: to combine and summarize
So, what is aggregate data? Aggregation is a process, common in data mining and statistical analysis, that combines individual records into summaries. Information is viewable only in groups and as part of a summary, not at the level of the individual.
When data scientists rely on aggregate data, they cannot access the raw information. Instead, they see details that have been collected, combined and reported as totals or summaries. Many popular statistics and database languages provide aggregate functions, with tutorials available for R, SQL and Python.
Consider the following: a digital marketing company runs a survey to see whether people prefer the company’s brand or its competitors’. When the results are presented to management, they are in aggregate form, showing which brand is the most popular. They might include additional information on the groups surveyed, such as brand preference by age or location.
With aggregate information, we can get details on what brands are popular by age or in certain regions, but the exact details on how individuals voted are never revealed.
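As a rough illustration of the survey example, the sketch below counts brand preferences per age group and throws away the respondent identifiers. The sample rows, the field layout and the function name `aggregate_by_age` are all invented for this example.

```python
from collections import Counter

# Hypothetical raw survey responses: (respondent_id, age_group, preferred_brand).
responses = [
    (1, "18-34", "Ours"),
    (2, "18-34", "Competitor"),
    (3, "18-34", "Ours"),
    (4, "35-54", "Competitor"),
    (5, "35-54", "Competitor"),
    (6, "55+", "Ours"),
]


def aggregate_by_age(rows):
    """Count brand preferences per age group, discarding respondent ids."""
    counts = Counter()
    for _respondent_id, age_group, brand in rows:
        counts[(age_group, brand)] += 1
    return dict(counts)


summary = aggregate_by_age(responses)
for (age_group, brand), n in sorted(summary.items()):
    print(f"{age_group:>6}  {brand:<10}  {n}")
```

Management would see only `summary`: totals per group, with no way to look up how respondent 3 answered.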
Can aggregation protect privacy?
As data aggregation only displays information in groups, many consider it a safeguard to protect personal information. After all, you cannot compromise privacy if the data only shows the results for groups of individuals, right?
Sadly, it’s not so easy; with the right analysis, aggregate information can reveal very personal details. For example, what if you were to use the aggregate data for a blog to find how many visitors from Ireland view the blog on a smartphone? And what if you ask for the number of visitors from Ireland who use a smartphone, in one day? Or visitors from Ireland who use a smartphone and clicked on an Amazon ad for menswear on a single day?
By applying multiple, specific filters, it might be possible to single out an individual, intentionally or not. Indeed, the more statistics you produce from aggregate data, the more likely it is that the underlying data can be reconstructed.
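The narrowing effect of stacked filters can be sketched in a few lines. Everything here is hypothetical: the visit log, the field names, and the `safe_count` helper, which shows one common mitigation (refusing to report very small groups) rather than any particular analytics tool's behavior.

```python
# Hypothetical visit log sitting behind an analytics dashboard.
visits = [
    {"country": "IE", "device": "smartphone", "day": "2019-07-01", "ad": "menswear"},
    {"country": "IE", "device": "smartphone", "day": "2019-07-01", "ad": None},
    {"country": "IE", "device": "desktop", "day": "2019-07-01", "ad": None},
    {"country": "IE", "device": "smartphone", "day": "2019-07-02", "ad": None},
    {"country": "FR", "device": "smartphone", "day": "2019-07-01", "ad": "menswear"},
]


def count_visits(rows, **filters):
    """Return how many rows match every filter (an aggregate count)."""
    return sum(all(row[key] == value for key, value in filters.items()) for row in rows)


def safe_count(rows, min_group_size=3, **filters):
    """Return the count only if the group is large enough, otherwise None."""
    n = count_visits(rows, **filters)
    return n if n >= min_group_size else None


# Each extra filter shrinks the group until a single visitor is left.
print(count_visits(visits, country="IE"))
print(count_visits(visits, country="IE", device="smartphone"))
print(count_visits(visits, country="IE", device="smartphone", day="2019-07-01"))
print(count_visits(visits, country="IE", device="smartphone", day="2019-07-01", ad="menswear"))
```

Suppressing small counts with something like `safe_count` helps, but it is not a complete defense: subtracting two permitted counts from each other can still reveal a small group.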
So, while the data produced using an aggregation tool can protect privacy, there is no guarantee that it always does.
For organizations that use data aggregation, Ed Felten with the Federal Trade Commission (FTC) warns that:
“The simple argument that it’s aggregate data, therefore safe to release, is not by itself sufficient.”
De-identification: removing personal details
De-identification is a process that removes personal details from a data set. This approach aims to protect privacy while still providing comprehensive data for analytics.
Some data is better at identifying individuals than others. For example, we are easy to identify when the data includes our name, address, email, birth date or other unique factors. With de-identification, we remove those unique identifiers from the raw data.
A retail store that uses de-identification may track individual purchases, dates and store locations, but remove the names and addresses of customers.
So, a Susan Smith who lives at 75 Clark Drive in Great Falls, Montana, and shops for engineering books, will be recorded in the store’s database as a “user of the Montana location who buys engineering books.” De-identification takes out Susan’s name and other identifiers so that her purchase could have come from anyone.
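A minimal sketch of that step, assuming purchases are stored as simple key-value records. The field names, the purchase date and the `DIRECT_IDENTIFIERS` list are assumptions made for illustration; a real store would need to decide for itself which fields count as identifying.

```python
# Hypothetical list of fields treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "street_address", "email"}


def de_identify(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the direct identifier fields."""
    return {key: value for key, value in record.items() if key not in identifiers}


purchase = {
    "name": "Susan Smith",
    "street_address": "75 Clark Drive",
    "store_location": "Montana",
    "category": "engineering books",
    "date": "2019-03-14",  # invented date, for illustration only
}

print(de_identify(purchase))
```

Note that the remaining fields (location, category, date) are kept in full detail, which is exactly what makes de-identified data useful for analytics and, as discussed later, what can make it re-identifiable.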
De-identification is a particularly popular privacy safeguard with healthcare providers and other organizations that process health information. The Health Insurance Portability and Accountability Act (HIPAA) in the US addresses de-identification under section 164.514. According to HIPAA, information is de-identified when
“there is no reasonable basis to believe that the information can be used to identify an individual.”
HIPAA permits some allowances for de-identified data, such as disclosures for research or to public officials.
From de-identified to re-identified: it might not take much
Unfortunately for organizations that hope to use de-identification as a safeguard, many now see it as poor protection. Detailed data sets mean that people can be identified even without their names.
For example, if a data subject’s job is ‘Mayor’ and the raw data includes city, it doesn’t take much to figure out who’s who.
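One way to check for this kind of exposure is to count how many records share each combination of the remaining attributes; any combination that appears only once points at a single person. The sketch below assumes records are dictionaries, and the sample data and the function name `singled_out` are made up for the example.

```python
from collections import Counter


def singled_out(rows, quasi_identifiers):
    """Return the rows whose combination of quasi-identifiers is unique."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    group_sizes = Counter(key(row) for row in rows)
    return [row for row in rows if group_sizes[key(row)] == 1]


# Hypothetical de-identified records: no names, but job and city remain.
records = [
    {"job": "Teacher", "city": "Springfield"},
    {"job": "Teacher", "city": "Springfield"},
    {"job": "Nurse", "city": "Springfield"},
    {"job": "Nurse", "city": "Springfield"},
    {"job": "Mayor", "city": "Springfield"},
]

print(singled_out(records, ["job", "city"]))
```

Here only the mayor's record is returned, because every other job and city pair is shared by at least two people.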
A case highlighting the flaws in de-identification came in 2006, when Netflix announced a prize to researchers who could improve its movie-recommendation algorithm.
To aid competitors, Netflix released a data set representing the movies rated by over 480,000 Netflix customers and the date each rating was given. The data was de-identified by replacing customers’ names with unique numbers. However, two researchers from the University of Texas subsequently showed that it was possible to re-identify particular individuals using reviews subscribers had posted on IMDb.
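The general idea behind this kind of linkage can be shown with a toy example. This is not the researchers' actual method, which the text above does not describe; it is a deliberately simple exact-match join on movie and date, using invented data and a hypothetical `link_identities` function.

```python
from collections import Counter, defaultdict

# "De-identified" release: names replaced by numbers (invented data).
released = [
    (1001, "Movie A", "2005-03-01"),
    (1001, "Movie B", "2005-03-04"),
    (1001, "Movie C", "2005-04-10"),
    (1002, "Movie A", "2005-03-01"),
    (1002, "Movie D", "2005-05-02"),
]

# Public reviews posted under a visible username (invented data).
public_reviews = [
    ("alice", "Movie A", "2005-03-01"),
    ("alice", "Movie B", "2005-03-04"),
    ("alice", "Movie C", "2005-04-10"),
]


def link_identities(released_rows, public_rows, min_matches=2):
    """Pair usernames with customer numbers that share enough (movie, date) events."""
    ids_by_event = defaultdict(set)
    for customer_id, movie, day in released_rows:
        ids_by_event[(movie, day)].add(customer_id)

    overlap = Counter()
    for name, movie, day in public_rows:
        for customer_id in ids_by_event.get((movie, day), ()):
            overlap[(name, customer_id)] += 1

    return {pair: n for pair, n in overlap.items() if n >= min_matches}


print(link_identities(released, public_reviews))
```

Once a username is tied to a customer number, every other rating under that number, including ones the person never discussed publicly, is tied to them as well.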
When Netflix announced it was planning to release another data set for another competition – but this time with additional demographic customer details – the FTC stepped in and gently encouraged Netflix to rethink its plans.
De-identification is also flawed because there’s no universal agreement over what information is personally identifiable. Is the data de-identified if IP addresses remain? What about dates of birth? Standards exist, including HIPAA’s Safe Harbor method, but are they enough?
According to Privacy Analytics, part of the IQVIA group of companies, Safe Harbor “does not actually ensure that the risk of re-identification is low except in very limited circumstances.” That’s bad news for health organizations that rely on it, since under HIPAA § 164.514(b)(2)(ii), the Safe Harbor allowance applies only if the organization has no actual knowledge that the remaining data could be used to identify an individual.
Studies from the past ten years, including Risks to Patient Privacy: A Re-identification of Patients in Maine and Vermont Statewide Hospital Data, suggest that new standards are needed.
What about coded data? Tokenization?
Coded data and tokenization are solid ways to protect sensitive data. For coded data, all sensitive information is stripped out and replaced with code words, numbers, or unique identifiers.
The codes map to another database or document that works as a key. Information is re-identified by matching the code with its corresponding sensitive data.
In tokenization, we automate the process, replacing sensitive data with a reference variable. The token maps with a more secure database that holds the sensitive information.
When processing information, the system analyzes tokens against records in the secure database. If it finds the token’s corresponding match, processing continues using the sensitive data.
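The token-and-vault arrangement described above might look something like the following sketch. The `TokenVault` class, its method names and the `tok_` prefix are invented for illustration; in a real system the mapping would live in a separately secured store, not in an in-memory dictionary.

```python
import secrets


class TokenVault:
    """Toy stand-in for the secure database that maps tokens to sensitive values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        """Return a random token for the value, reusing it if already issued."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        """Look up the sensitive value for a token (raises KeyError if unknown)."""
        return self._token_to_value[token]


vault = TokenVault()
card_number = "4111 1111 1111 1111"  # a well-known test card number, not a real account
sale = {"amount": 25.00, "card": vault.tokenize(card_number)}

print(sale)                            # the working data set holds only the token
print(vault.detokenize(sale["card"]))  # authorized processing can recover the value
```

Because the token is random rather than derived from the card number, someone holding only the `sale` record has nothing to reverse; re-identification requires access to the vault.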
Coded data and tokens protect information security. They are efficient because they only hide sensitive data. If an analyst wishes to process the data without referencing personal details, they can.
Likewise, data sets that use code identifiers or tokens are safer against theft. If the data is compromised, sensitive data remains concealed. For example, an attacker who steals data on credit card sales cannot see the card numbers if tokens are in use.
Be aware, however, that while tokens, coded data and unique identifiers offer better security, they do not make data anonymous. Data that uses tokens or code identifiers is still subject to privacy regulations. Privacy laws are not solely concerned with data breaches and unauthorized access; they also work to minimize the potential misuse of personal data. So long as the data can, with authorization, be re-identified, privacy agreements must be in place.
Anonymous data: we can’t tell who you are… or can we?
Anonymous data is information from which it is impossible to identify individuals. Truly anonymous data sets are a privacy enthusiast’s dream. The ability to collect, store, and analyze data without being able to recognize individuals makes an ideal safeguard. For organizations that manage to keep their data anonymous, the benefits are huge. Anonymous data is easier to sell, process, analyze and retain, as it requires fewer safeguards for protection.
Fewer rules apply: anonymous data is often exempt from privacy legislation, including the EU’s General Data Protection Regulation (GDPR). According to the GDPR, information “which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable” is not subject to privacy requirements.
How do you make data anonymous? Most techniques fall into one of three categories: cryptographic, generalization (also known as recoding), and randomization.
Cryptographic methods encrypt the information in storage, making the data anonymous until decrypted for use. This protects the data but means re-identification can happen when the data is decrypted for processing.
Generalization techniques borrow from data aggregation and de-identification, to deliberately remove identifiers and reduce precise data. Under generalization, for example, an individual’s height or weight becomes a range, instead of the exact number.
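Turning an exact measurement into a range is simple to sketch. The helper below is hypothetical and assumes whole-number inputs; the bin width is a choice the data owner has to make, trading precision for privacy.

```python
def generalize(value, bin_width):
    """Replace a whole-number value with the inclusive range (bin) it falls in."""
    low = (value // bin_width) * bin_width
    return f"{low}-{low + bin_width - 1}"


heights_cm = [158, 171, 178, 183]
print([generalize(h, 10) for h in heights_cm])
```

With a width of 10, both 171 and 178 are reported as "170-179", so the two people can no longer be told apart by height alone.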
Randomization skews the results by adding data and moving elements around so that re-identification results are full of errors. The Finnish Social Science Data Archive’s Data Management Guidelines provide in-depth explanations on techniques for anonymizing qualitative and quantitative data.
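Two simple flavors of randomization, adding noise to values and shuffling a column between records, can be sketched as follows. Both functions and the sample data are invented for this example and are far cruder than the techniques covered in published guidelines; in particular, this is not a differential privacy mechanism.

```python
import random


def add_noise(values, scale, rng):
    """Shift each value by a random amount between -scale and +scale."""
    return [v + rng.uniform(-scale, scale) for v in values]


def shuffle_column(rows, column, rng):
    """Return copies of the rows with one column's values shuffled between them."""
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    return [{**row, column: value} for row, value in zip(rows, shuffled)]


rng = random.Random(42)  # fixed seed only so the example is repeatable

people = [
    {"age_group": "18-34", "weight_kg": 61},
    {"age_group": "35-54", "weight_kg": 82},
    {"age_group": "55+", "weight_kg": 74},
]

print(add_noise([p["weight_kg"] for p in people], scale=2.0, rng=rng))
print(shuffle_column(people, "weight_kg", rng))
```

Overall statistics such as the average weight stay roughly intact, while any single record can no longer be trusted to describe a real person.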
Why we may need to give up the idea of anonymous data altogether
Unfortunately, keeping personal data truly anonymous may no longer be possible. The ingenuity used to re-identify individuals is astounding.
Writing for The Guardian, Olivia Solon lists examples such as combining paparazzi shots with nameless taxi logs to work out which celebrities are bad tippers. Cory Doctorow writes for BoingBoing.net that journalist Svea Eckert and data scientist Andreas Dewes identified a German MP’s medication regimen through data collected by browser plug-ins.
In July 2019, New York Times journalist Gina Kolata published evidence that scientists can re-identify ‘anonymized’ U.S. Census data. Another New York Times article exposed Donald Trump’s 1985-1994 tax returns by re-identifying anonymized data.
Between advances in data science, machine learning, and an increasing trove of data to fill in the gaps, the concept of anonymous data may become meaningless.
So if none of these techniques fully protect privacy, what do we do?
First, recognize that while aggregate, de-identified and anonymized data sets don’t protect privacy completely, they do still offer some level of protection.
If your data is aggregated, de-identified or anonymized, there’s less chance of your personal details being read by the people who process it day to day. Thankfully, pulling personal information from this heavily processed data requires tools and skills that are not, currently, available to every individual.
If you operate a business that uses aggregation, de-identification or anonymization, recognize that these can’t be your only safeguards. You should still have other physical, technical and administrative protection measures in place. A data breach of de-identified data can still cost you, particularly if there’s evidence that personal details can be recovered. Use these techniques as tools, but not as the be-all and end-all of your privacy and security programs.