How a digital footprint is used to track you online and how to erase it
The ability to use the internet anonymously is desirable for many people. There’s a school of thought that only people with something to hide seek anonymity and privacy on the internet. However, for the same reasons those people don’t take their phone calls on speakerphone in a crowded bus, there are very valid reasons to desire anonymity while using the internet. That desire is stronger than ever now; spurred on by the almost daily revelations of government surveillance overreach in many countries around the world.
Anonymity footprints and their challenges
Anonymity, like security, has many layers. The raw connection to the internet is only one facet and probably gets too much focus. Certainly, your IP address can lead back to you, but there are so many ways to hide your IP address that observers have developed other methods of piercing anonymity. These techniques involve differing levels of technology, and some are not technology-related at all.
These footprints we leave behind include: temporary and cache files, IP address leakage, in-the-clear DNS and WebRTC queries, traffic correlation, browser fingerprinting, accidental clearnet leakage, and styleometry.
There’s not much that an operating system can do to defend against styleometry analysis. There are a few tools published by Drexel University to both analyze and obfuscate your writing here.
The other problems have technical solutions which are summarized in the chart below. There are many options available and I have limited the selection to TAILS, Qubes OS and Whonix.
Temporary and cache files on our computers
Web browsers are notorious for the amount of files they store. Static content such as images and sometimes entire pages are cached in order to make subsequent site visits faster. Other identifying data such as cookies are also stored in order to provide authentication and preference data to websites. All of that data can go a long way to identifying sites you’ve visited when it’s left behind on the machine you used.
There are two main types of memory in a computer. Temporary memory used by the operating system to process data and open applications is called RAM (Random Access Memory). RAM is volatile, meaning its content is destroyed when the computer is shut down. More permanent memory is needed to store things like user files and web browser temporary files. That type of memory is provided by things like hard disk drives and USB drives or memory cards. This type of memory is persistent meaning that its contents are more permanent and will survive a computer restart.
A Live CD is a good tool to ensure no files are left behind on that persistent memory. A Live CD is an operating system that you can boot into from a CD rather than installing it on a system. There’s a special tool that Live CDs use called a RAM Disk. As it sounds, a RAM Disk is a
disk in RAM. The Live CD steals some RAM from the computer and creates a virtual hard drive from it. The Live CD doesn’t know the difference and uses that disk as if it were a hard disk drive to store its persistent data. However, because it’s really just RAM, its contents are destroyed when the system is shut down.
It’s possible to configure some Live CDs to store their data on true persistent devices such as USB sticks. But, if you’re worried about anonymity, it would be unwise to do that.
Going one step further, the data in RAM isn’t usually deliberately erased when a Live CD system is shut down. Rather, the memory is just
released to allow the operating system to reallocate it. This means that whatever was last in that memory space is still there until it gets reused. It’s technically possible to read that data even though it is not addressed any more. A Live CD distribution like TAILS is aware of this, and deliberately overwrites the RAM space it used before releasing it.
IP address leakage
The internet uses TCP/IP. In the OSI model, that refers to the layer 3 (IP) and layer 4 (TCP) technologies required to make it work. It’s therefore necessary for every single internet request to have an IP address in it to tell the recipient server where to send its response. Those IPs can be, and usually are, logged and can be traced back to individual humans fairly easily.
The very basic function of any anonymity-focused system should be to obfuscate your IP address. TAILS, Qubes OS and Whonix can all use Tor which accomplishes this.
In-the-clear DNS and WebRTC queries
Your computer does a lot of things when using the Internet and requesting data from remote servers is just one part of it. Most internet communication is done using domain names instead of IP addresses. But because the internet relies on IP addresses to work, there has to be some process to reconcile a domain name with an IP address. That’s the role of DNS (the Domain Name System). When your computer issues a DNS query, an observer will then know what site you’re about to visit even if you’ve encrypted the actual communication itself using a VPN or some similar process.
WebRTC (Web Real Time Communication) is a set of protocols designed to allow real time communication over the web. Unfortunately, those protocols can also leak things such as your IP address even if you’re routing your DNS through a secure channel.
Hardening your browser to refuse WebRTC connections and routing your DNS through Tor are easy ways to avoid these pitfalls.
This DNS leak test will check for DNS and WebRTC leaks.
Traffic patterns that can track different activities to a single user
Traffic correlation is an advanced technique that usually requires significant resources to do well. Consider the Tor network; at least three Tor nodes are involved in any request. In order to correlate an encrypted request from the Tor entry node to the same request coming out of a Tor exit node an observer would have to be able to surveil a large number of all Tor nodes. But just because it’s hard doesn’t mean it isn’t possible.
Running different applications through the same Tor circuit can make this type of correlation easier. Each request would retain its individual anonymity but taken as a whole the large amount of disparate traffic can lead back to a single user.
Tor Stream Isolation creates different routes for each application which makes this type of analysis harder. Whonix performs stream isolation.
Interestingly enough, my naked Firefox browser with a script blocker installed is better protected against browser fingerprinting than the hardened Tor browser that comes with the TAILS Live CD. But, on the other hand, my naked browser is more unusual which could make me easier to fingerprint.
This is the Panopticlick results for the Tor browser in TAILS:
Accidental clearnet leakage
Humans have a tendency to use what is more convenient rather than what is more secure. Using a Live CD like TAILS or a well-contained distribution like Qubes OS does a pretty good job of making it hard to leak traffic. All the network traffic is routed through Tor or ip2 and it would be hard to accidentally circumvent that.
Whonix is a Linux distribution that comes in two parts. It is comprised of two VirtualBox images: a gateway and a workstation. The gateway connects to Tor and the workstation will only use the gateway for internet activities. While this system is novel and works well when used as intended, it also provides an opportunity to accidentally send traffic through your normal internet connection. This is possible because the host’s networking is not affected in any way. Accidentally using an application outside of the Whonix workstation is possible in a moment of inattention.
This image shows my IP address in both the Whonix workstation (left) and my host machine (right). I’m using a VPN on the right so that’s not my actual IP, but the point is that is not the Whonix gateway Tor IP. My host traffic is being sent through my normal ISP network connection. And, since I am using Firefox in both cases, it can be easy to mistakenly pick the wrong browser.
Styleometry is the attempt to identify a user through analysis of writing style and grammar. The general concept of accumulating meta data about users to identify individuals is well understood. But specifically using writing styles as an identifying factor has usually been the purview of historical researchers to identify previously unknown authors.
Journalist, dissidents, and whistleblowers are a class of internet users who generally benefit greatly from anonymity and also tend to write. If a writer has published works both anonymously and with attribution, it then becomes possible to try to correlate writing styles.
Researchers at Drexel University were able to identify authors of anonymous writings with 80 percent accuracy using stylometry.
Related: See our guide to anonymity online.