
Social Data Trading Limited, a company that sells data on social media influencers to marketers, exposed a database of nearly 235 million social media profiles on the web without a password or any other authentication required to access it, according to a new report from Comparitech researchers. The data included names, contact info, personal info, images, and statistics about followers.

The profiles were taken from publicly viewable social media pages on YouTube, TikTok, and Instagram. Security researcher Bob Diachenko, who leads Comparitech’s cybersecurity research team, uncovered three identical copies of the exposed data on August 1.

Evidence suggests a connection between Social Data and a now-defunct company: Deep Social. The names of the Instagram datasets (accounts-deepsocial-90 and accounts-deepsocial-91) hint at the data’s origin. Based on this, Diachenko first contacted Deep Social using the email address listed on its website to disclose the exposure. The administrators of Deep Social forwarded the disclosure to Social Data. The CTO of Social Data acknowledged the exposure, and the servers hosting the data were taken down about three hours later.

Update on May 13, 2021: The former CEO of Deep Social, Pavel Maurus, contacted Comparitech to clarify details of the incident. He said the data did not come from Deep Social, as we previously suspected. Instead, he asserts that Social Data used a modified version of the same software that Deep Social used, hence the database names. The original software pulled profile data from Facebook’s official API, access to which Facebook cut off in 2018. The modified version, which Maurus says might have been stolen by former Deep Social employees who now work for Social Data Trading Limited, could scrape data without API access and could reach a wider range of social networks, such as TikTok and YouTube. Timestamps on the exposed data corroborate Maurus’ account, suggesting the data was harvested after Deep Social and Social Data Ltd. shut down. Maurus stresses that Social Data Trading Limited, which operates socialdata.hk, is a completely separate entity from another company he is involved with, Social Data Ltd., which shut down in January 2020. The author has modified the text of this article to reflect the distinction.

Facebook and Instagram banned Deep Social from their marketing APIs in 2018 and threatened legal action against it if it continued to scrape data from their users’ profiles. Deep Social then announced it would wind down operations and has since shut down its original service.

Social Data Trading Limited denies any connection between itself and Deep Social.

Web scraping is the automated copying of data from web pages in bulk. Although Social Data insists it only scrapes what is publicly accessible, the practice violates the terms of use of Facebook, Instagram, TikTok, and YouTube. Automated scraping bots can be hard to distinguish from normal website visitors, so social media companies struggle to stop them from accessing user profiles until it’s too late.

An example showing what data can be scraped from an Instagram profile

A spokesperson from Social Data told Diachenko in an email, “Please, note that the negative connotation that the data has been hacked implies that the information was obtained surreptitiously. This is simply not true, all of the data is available freely to ANYONE with Internet access. I would appreciate it if you could ensure that this is made clear. Anyone could phish or contact any person that indicates telephone and email on his social network profile description in the same way even without the existence of the database. […] Social networks themselves expose the data to outsiders – that is their business – open public networks and profiles. Those users who do not wish to provide information, make their accounts private. [sic]”

Facebook company spokesperson Stephanie Otway told Comparitech in an email, “Scraping people’s information from Instagram is a clear violation of our policies. We revoked Deep Social’s access to our platform in June 2018 and sent a legal notice prohibiting any further data collection.”

Timeline of the exposure


We do not know how long the data was exposed before we discovered it on August 1. We also do not know whether any unauthorized parties accessed it during the exposure. Our honeypot experiments show that hackers can find and attack unsecured databases within hours of the databases being exposed.

The database was shut down about three hours after we sent our initial disclosure.

What data was exposed?


Three identical copies of the data were hosted at three separate IPv6 addresses. Each copy stored data on about 235 million social media profiles. Here is a breakdown of the largest datasets:

  • 96,714,241 records scraped from Instagram
  • 95,678,713 records scraped from Instagram
  • 42,129,799 records scraped from TikTok
  • 3,955,892 records scraped from YouTube
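
As a quick arithmetic check on the breakdown above, the following minimal Python sketch totals the four listed datasets. The labels are our own shorthand rather than names taken from the database, and the counts are copied from the list as published.

```python
# Record counts copied from the breakdown above.
# The labels are illustrative shorthand, not actual dataset names.
datasets = [
    ("Instagram (first dataset)", 96_714_241),
    ("Instagram (second dataset)", 95_678_713),
    ("TikTok", 42_129_799),
    ("YouTube", 3_955_892),
]

# Sum of all listed record counts.
total = sum(count for _, count in datasets)

# Print each dataset with its share of the listed total.
for name, count in datasets:
    share = count / total * 100
    print(f"{name}: {count:,} records ({share:.1f}% of listed total)")

print(f"Listed total: {total:,}")
```

The four listed datasets sum to 238,478,645 records, somewhat more than the figure of about 235 million. The two Instagram datasets and the TikTok dataset alone account for 234,522,753. This report does not state how the rounded figure was derived.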

Each record contains some or all of the following info:

  • Profile name
  • Full real name
  • Profile photo
  • Account description
  • Whether the profile belongs to a business or has advertisements
  • Statistics about follower engagement, including:
    • Number of followers
    • Engagement rate
    • Follower growth rate
    • Audience gender
    • Audience age
    • Audience location
    • Likes
  • Last post timestamp
  • Age
  • Gender

Based on samples we collected, about one in five records contained either a phone number or email address.
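
To give a rough sense of scale, the one-in-five sample rate can be applied to the approximate total. This is a back-of-the-envelope sketch in Python. It assumes the sampled rate holds across all of the roughly 235 million profiles, which the samples alone cannot confirm, and the variable names are illustrative.

```python
# Approximate number of exposed profiles reported in this article.
total_profiles = 235_000_000

# "About one in five" sampled records had a phone number or email address.
# Assumption: the sampled rate applies to the whole database.
records_per_contact = 5

estimated_with_contact = total_profiles // records_per_contact
print(f"Estimated records with contact info: about {estimated_with_contact:,}")
```

Under that assumption, the estimate is on the order of 47 million records.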

Dangers of exposed data

An example showing data that can be scraped from an Instagram profile

The information stored in this database could be used for spam marketing and phishing campaigns. Users of Instagram and TikTok should be on the lookout for scams and phishing messages either sent directly or posted in comments. Even though the information is publicly available, the size and scope of an aggregated database make it more vulnerable to mass attack than it would be in isolation.

The images and other profile data could be used by scammers to create imitation accounts. These accounts lure in followers and then promote scams or misinformation.

The images could also be used without the owners’ permission for face recognition purposes.

Facebook and other social networks have employed both legal and technological solutions to stem web scraping of their users’ profiles, but the practice hasn’t ceased. Scrapers are difficult for automated systems to distinguish from normal website users. The most prominent example of such scraping is Clearview.ai, which scraped profiles for images to be used in mass-marketed face recognition technology.

About Social Data and Deep Social


Deep Social described itself as “a freemium influencer ranking, discovery and AI-driven analytics platform […] providing its 44,817 customers with in-depth insights into demographic and psychographic data of influencers and their audience.”

According to its website, Deep Social was used by a range of big-name brands including Samsung, Heineken, L’Oreal, Unilever, Walmart, Amazon, Disney, and Booking.com. It claimed to be “GDPR compliant”.

The company’s privacy policy says it was based outside of the European Economic Area but had a nominated representative in the UK and was registered as a business in Delaware, USA.

Deep Social shut down in 2018 after Facebook reportedly banned it from its marketing API and threatened legal action.

Social Data Trading Limited launched in August 2019, according to Hong Kong business directories. Its website says it “helps your business to find Influencers and get in-depth insights into demographic and psychographic data of influencers and their audience throughout different types of social media on the web.”

Social Data is incorporated in Hong Kong, according to its terms of service, and its website uses the .hk top-level domain.

Why we reported this data incident

Comparitech researchers regularly scan the web for unprotected servers containing personal data. Upon discovering an unsecured database, we promptly begin an investigation to determine who is responsible for it, who is impacted, and what the potential ramifications could be if a malicious party obtains the data.

As soon as we determine who the owner is, we send a disclosure so it can be secured. We then publish an article like this one to raise awareness and curb potential harm to end users.

Previous data incident reports

Comparitech has published dozens of data incident reports like this one.

Gregory Boddin worked with Bob Diachenko and contributed to the research used in this report.