Understanding aggregate, de-identified and anonymous data

“We may share aggregated data with our partners.”

“We may share data that is aggregated or de-identified.”

“Our product collects anonymous data for analytics purposes.”

Many organizations argue they protect privacy through the use of aggregate, de-identified or anonymous data. However, do their users understand what the terms mean? What is aggregate data? Is there a difference between de-identified and anonymous data? For researchers, which data sets have more value: aggregate or anonymous? 

Users often agree to let their personal data be shared once it is de-identified, without grasping what that involves.

If you’ve ever wondered what’s going on, wonder no more. Here’s your guide to data de-identification, aggregation, and the different levels of anonymity.

Aggregate data: to combine and summarize

So, what is aggregate data? Aggregation refers to a data mining process popular in statistics. Information is only viewable in groups and as part of a summary, not per individual. When data scientists rely on aggregate data, they cannot access the raw information. Instead, aggregation collects, combines and communicates details as totals or summaries. Many popular statistics and database languages provide aggregate functions, with tutorials available for R, SQL and Python.

Consider the following: a marketing company runs a survey to see whether people prefer its client’s brand or a competitor’s. When the company presents the data to management, it is in aggregate form, showing which brand is the most popular. The report might include additional information on the groups surveyed, such as brand preference by age or location. With aggregate information, we can get details on which brands are popular by age or in certain regions, but the exact details of how individuals responded are never revealed.
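To make the survey example concrete, here is a minimal Python sketch. The response records, field order and the aggregate_by_age helper are all made up for illustration; the point is only that the summary keeps counts per group and drops the respondent IDs.

```python
from collections import Counter, defaultdict

# Hypothetical raw survey responses: (respondent_id, age_group, preferred_brand).
responses = [
    (1, "18-34", "Ours"),
    (2, "18-34", "Competitor"),
    (3, "18-34", "Ours"),
    (4, "35-54", "Competitor"),
    (5, "35-54", "Competitor"),
    (6, "55+", "Ours"),
]


def aggregate_by_age(rows):
    """Count brand preferences per age group, discarding respondent IDs."""
    totals = defaultdict(Counter)
    for _respondent_id, age_group, brand in rows:
        totals[age_group][brand] += 1
    # Convert to plain dicts so the summary is easy to print and compare.
    return {group: dict(counts) for group, counts in totals.items()}


summary = aggregate_by_age(responses)
print(summary)
```

Notice that the "55+" group in this toy data contains a single respondent, so even the summary says exactly how that one person answered.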

Can aggregation protect privacy?

As data aggregation only displays information in groups, many consider it a safeguard to protect personal information. After all, you cannot compromise privacy if the data only shows the results for groups of individuals, right?

Sadly, it’s not so easy; with the right analysis, aggregate information can reveal highly personal details. Suppose you run a blog and query its aggregate analytics: how many visitors from Ireland view the blog on a smartphone? What if you ask for the number of visitors from Ireland who used a smartphone on a single day? Or visitors from Ireland who used a smartphone and clicked on an Amazon ad for menswear on a single day? By applying multiple, specific filters, it might be possible to single out an individual, intentionally or not. Aggregation can protect privacy, but there is no guarantee that it always does.
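The effect of stacking filters is easy to see in a small sketch. The visit log, its field names and the count_visits helper below are hypothetical; a real analytics tool would expose something similar through its own query interface.

```python
# Hypothetical visit log sitting behind an analytics dashboard.
visits = [
    {"country": "Ireland", "device": "smartphone", "day": "2019-07-01", "ad": "menswear"},
    {"country": "Ireland", "device": "smartphone", "day": "2019-07-01", "ad": None},
    {"country": "Ireland", "device": "desktop", "day": "2019-07-01", "ad": None},
    {"country": "Ireland", "device": "smartphone", "day": "2019-07-02", "ad": None},
    {"country": "France", "device": "smartphone", "day": "2019-07-01", "ad": "menswear"},
]


def count_visits(rows, **filters):
    """Return only the aggregate count of rows that match every filter."""
    return sum(
        all(row[key] == value for key, value in filters.items())
        for row in rows
    )


# Each extra filter shrinks the group that the "aggregate" describes.
from_ireland = count_visits(visits, country="Ireland")
on_smartphone = count_visits(visits, country="Ireland", device="smartphone")
on_one_day = count_visits(
    visits, country="Ireland", device="smartphone", day="2019-07-01"
)
clicked_ad = count_visits(
    visits, country="Ireland", device="smartphone", day="2019-07-01", ad="menswear"
)
print(from_ireland, on_smartphone, on_one_day, clicked_ad)
```

The query only ever returns a count, yet the last count describes a group of one.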

For organizations that use data aggregation, Ed Felten of the FTC has a warning: aggregate data can be useful, but it doesn’t guarantee privacy.

“The simple argument that it’s aggregate data, therefore safe to release, is not by itself sufficient.”

De-identification: removing personal details

De-identification is a process that removes personal details from a data set. This approach aims to protect privacy while still providing comprehensive data for analytics. Some data points identify individuals more readily than others. We are easy to identify when the data includes our name, address, email, birth date or other unique factors. With de-identification, we remove those unique identifiers from the raw data.

A retail store that uses de-identification may track individual purchases, dates and store locations, but remove the names and addresses. While “Susan Smith from 75 Clark Drive in Great Falls, Montana shops for engineering books”, the store’s database records her as a “user of the Montana location who buys engineering books”. De-identification takes out Susan’s name and identifiers so that her purchase could come from anyone.
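In code, this kind of de-identification amounts to dropping a list of fields. The sketch below is an illustration only: the field names, the DIRECT_IDENTIFIERS set and the de_identify helper are assumptions, and deciding which fields count as identifiers is the hard part in practice.

```python
# Fields this sketch treats as direct identifiers (an assumption, not a standard).
DIRECT_IDENTIFIERS = {"name", "address", "email", "birth_date"}


def de_identify(record, identifiers=DIRECT_IDENTIFIERS):
    """Return a copy of the record without the direct identifier fields."""
    return {key: value for key, value in record.items() if key not in identifiers}


# Hypothetical raw purchase record for the example above.
purchase = {
    "name": "Susan Smith",
    "address": "75 Clark Drive, Great Falls, Montana",
    "store_location": "Montana",
    "category": "engineering books",
    "date": "2019-07-01",
}

de_identified_purchase = de_identify(purchase)
print(de_identified_purchase)
```

The remaining fields (store location, category and date) are what analysts get to work with.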

De-identification is a particularly popular privacy safeguard with clinics and organizations that process health information. The Health Insurance Portability and Accountability Act (HIPAA) addresses de-identification under section 164.514. According to HIPAA, information is de-identified when

“there is no reasonable basis to believe that the information can be used to identify an individual”.

HIPAA permits some allowances for de-identified data, such as disclosures for research or to public officials.

From de-identified to re-identified: it might not take much

Unfortunately for organizations that hope to use de-identification as a safeguard, many now see it as poor protection. Thanks to detailed data sets, people can be identified by more than names and numbers. If a data subject’s job is ‘Mayor’ and the raw data includes the city, it doesn’t take much to figure out who’s who.

A well-known case highlighting the flaws of de-identification came in 2006 with Netflix. Per Robert Lemos with SecurityFocus, in a contest to improve the company’s recommendation algorithm, Netflix released a data set of about 100 million movie ratings from roughly half a million subscribers. The company de-identified the data set by removing user names. Yet to its surprise, researchers from the University of Texas at Austin were able to identify users. They did so by using the data available and filling in the blanks from other sources: combining user ratings with a public database of movie scores. According to Epic.org, Netflix later cancelled a planned follow-up contest.
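The general idea of filling in the blanks from another source can be shown with a toy example. This is not the researchers’ actual method: the data, the names and the link function below are invented, and the matching rule (count identical movie and score pairs) is a deliberately simple stand-in.

```python
# "De-identified" release: user names replaced by opaque IDs.
# Each entry is a set of (movie, score) pairs. All data here is made up.
released = {
    "user_17": {("Movie A", 5), ("Movie B", 2), ("Movie C", 4)},
    "user_42": {("Movie A", 3), ("Movie D", 5), ("Movie E", 1)},
}

# Public reviews posted under recognizable names on another site.
public_reviews = {
    "alice_public": {("Movie A", 5), ("Movie C", 4)},
    "bob_public": {("Movie D", 5), ("Movie E", 1)},
}


def link(released_ratings, public_ratings, min_overlap=2):
    """Match each public profile to the released ID sharing the most ratings."""
    matches = {}
    for public_name, public_set in public_ratings.items():
        best_id, best_overlap = None, 0
        for user_id, ratings in released_ratings.items():
            overlap = len(ratings & public_set)
            if overlap > best_overlap:
                best_id, best_overlap = user_id, overlap
        # Only report a match when enough ratings line up.
        if best_overlap >= min_overlap:
            matches[public_name] = best_id
    return matches


matches = link(released, public_reviews)
print(matches)
```

Once a public name is tied to a released ID, every other rating under that ID (such as user_17’s score for "Movie B") is tied to the name as well.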

De-identification is also flawed because there’s no universal agreement on what information is personally identifiable. Is the data de-identified if IP addresses remain? What about dates of birth? Standards exist, including HIPAA’s Safe Harbor method, but are they enough? According to Privacy Analytics, part of the IQVIA group of companies, Safe Harbor “does not actually ensure that the risk of re-identification is low except in very limited circumstances.” That’s bad news for health organizations that rely on it, since under HIPAA § 164.514(b)(2)(ii), the allowances for de-identified data apply only if the organization has no actual knowledge that the data could be used to re-identify an individual. Studies from the past ten years, including Risks to Patient Privacy: A Re-identification of Patients in Maine and Vermont Statewide Hospital Data, suggest that new standards are needed.

What about coded data? Tokenization?

Coded data and tokenization are solid ways to protect sensitive data. For coded data, all sensitive information is stripped out and replaced with code words, numbers, or unique identifiers. The codes map to another database or document that works as a key. Information is re-identified by matching the code with its corresponding sensitive data.  

In tokenization, we automate the process, replacing sensitive data with a reference value called a token. The token maps to a more secure database that holds the sensitive information. When processing information, the system looks up tokens against records in the secure database. If it finds the token’s corresponding match, processing continues using the sensitive data.

Coded data and tokens protect information security. They are efficient because they hide only the sensitive data. If an analyst wishes to process the data without referencing personal details, they can. Likewise, data sets that use code identifiers or tokens are safer against theft. If the data is compromised, the sensitive data remains concealed. For example, an attacker who steals data on credit card sales cannot see the card numbers if tokens are in use.
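A minimal sketch of the idea, using the credit card example: the TokenVault class and its method names are hypothetical, and a real system would keep the vault in a separate, hardened service instead of an in-memory dictionary.

```python
import secrets


class TokenVault:
    """Toy stand-in for the secure database that maps tokens to real values."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value):
        """Return the existing token for a value, or issue a new random one."""
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random, so it reveals nothing
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token):
        """Look up the sensitive value; only the vault can do this."""
        return self._token_to_value[token]


vault = TokenVault()
sale = {"card_number": "4111111111111111", "amount": 25.00}

# The sales database stores the token in place of the card number.
stored_sale = {**sale, "card_number": vault.tokenize(sale["card_number"])}
print(stored_sale)
```

Someone who copies stored_sale sees only the token, while anyone with access to the vault can still call detokenize and recover the card number.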

Be aware, however, that while tokens, coded data and unique identifiers offer better security, they do not make data anonymous. Data that uses tokens or code identifiers is still subject to privacy regulations. Privacy laws are not solely concerned with data breaches and access; they also work to minimize the potential misuse of personal data. So long as the data can, with authorization, be re-identified, privacy agreements must be in place.

Anonymous data: we can’t tell who you are… or can we?

Anonymous data refers to information from which it is impossible to identify individuals. Truly anonymous data sets are a privacy enthusiast’s dream. The ability to collect, store, and analyze data without being able to recognize individuals makes an ideal safeguard. For organizations that manage to keep their data anonymous, the benefits are huge. Anonymous data is easier to sell, process, analyze and retain, as it requires fewer safeguards for protection.

Fewer rules apply: anonymous data is often exempt from privacy legislation, including the EU’s General Data Protection Regulation. According to Recital 26 of the GDPR, information “which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable” is not subject to privacy requirements.

How do you make data anonymous? Most techniques fall into one of three categories: cryptographic, generalization (also known as recoding), and randomization. 

Cryptographic methods encrypt the information in storage, making the data anonymous until decrypted for use. This protects the data but means re-identification can happen when the data is decrypted for processing. 

Generalization techniques borrow from data aggregation and de-identification, to deliberately remove identifiers and reduce precise data. Under generalization, for example, an individual’s height or weight becomes a range, instead of the exact number. 
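As a small illustration of generalization, the sketch below turns an exact whole-number measurement into a range. The generalize function and the bucket sizes are assumptions chosen for the example; wider buckets hide more but tell analysts less.

```python
def generalize(value, bucket_size):
    """Replace an exact whole-number value with the range it falls into."""
    lower = (value // bucket_size) * bucket_size
    upper = lower + bucket_size - 1
    return f"{lower}-{upper}"


# Hypothetical exact measurements.
height_cm = 172
weight_kg = 68

height_range = generalize(height_cm, bucket_size=10)
weight_range = generalize(weight_kg, bucket_size=5)
print(height_range, weight_range)
```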

Randomization distorts the data by adding noise and moving elements around, so that attempts at re-identification are full of errors. The Finnish Social Science Data Archive’s Data Management Guidelines provide in-depth explanations of techniques for anonymizing qualitative and quantitative data.
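One simple form of randomization is to add a small random offset to each value. The sketch below is an illustration under stated assumptions: the add_noise helper, the sample weights and the noise range are invented, and it covers only noise addition, not the swapping of elements.

```python
import random
from statistics import mean


def add_noise(values, max_noise, seed=None):
    """Shift each value by a random amount between -max_noise and +max_noise."""
    rng = random.Random(seed)  # seed is only there to make the demo repeatable
    return [value + rng.uniform(-max_noise, max_noise) for value in values]


# Hypothetical exact weights in kilograms.
weights_kg = [61.0, 72.5, 80.2, 55.4, 95.1, 68.3]
noisy_weights = add_noise(weights_kg, max_noise=5.0, seed=0)

# Individual values are no longer exact, but summaries stay in the same region.
print(round(mean(weights_kg), 1), round(mean(noisy_weights), 1))
```

The trade-off is visible even here: more noise makes individual records less reliable for an attacker, and also for an honest analyst.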

Why we may need to give up the idea of anonymous data altogether

Unfortunately, truly anonymous personal data may no longer be achievable. The ingenuity applied to re-identifying individuals is astounding. Writing for The Guardian, Olivia Solon lists examples such as combining paparazzi shots with nameless taxi logs to work out which celebrities are bad tippers. Cory Doctorow writes for BoingBoing.net that journalist Svea Eckert and data scientist Andreas Dewes identified a German MP’s medication regimen through data collected by browser plug-ins. In July 2019, New York Times journalist Gina Kolata reported on evidence that scientists can re-identify ‘anonymized’ U.S. Census data. Between advances in data science and an ever-growing trove of data to fill in the gaps, the concept of anonymous data may become meaningless.

So if none of these techniques fully protect privacy, what do we do?

First, recognize that while aggregate, de-identified and anonymized data sets don’t protect privacy completely, they do still offer some level of protection. If your data is aggregated, de-identified or anonymized, there’s less chance that the people who handle it day to day can read your personal details. Thankfully, pulling personal information from this heavily processed data requires tools and skills that are not available to everyone.

Second, if you see these phrases in privacy policies or terms of use, be aware that your personal information may still be accessible. A service that collects anonymous data can still be gathering personal information. Companies that share aggregate or de-identified information are still sharing personal details: what are your feelings on that?

If you operate a business that uses aggregation, de-identification or anonymization, recognize that these can’t be your only safeguards. You should still have other physical, technical and administrative protection measures in place. A breach of de-identified data can still cost you, particularly if there’s evidence that personal details can be recovered from it. Use these techniques as tools, but not as the be-all and end-all of your privacy and security programs.
