A recent investigation by Truffle Security uncovered nearly 12,000 valid secrets, including API keys and passwords, in Common Crawl, a free, open repository of petabytes of web data maintained since 2008.
Truffle Security's team analyzed Common Crawl's December 2024 archive, roughly 400 terabytes of data covering 2.67 billion web pages, and scanned it for sensitive information. The scan found 219 distinct types of secrets, the most common being MailChimp API keys: nearly 1,500 unique MailChimp keys were hardcoded into the HTML and JavaScript of front-end webpages.
About 63% of the exposed keys appeared on more than one page. One WalkScore API key alone appeared more than 57,000 times across 1,871 different subdomains.
Researchers also discovered 17 unique live Slack webhooks embedded within a single webpage. These webhooks, which allow apps to post messages into Slack, should be kept confidential, since anyone holding a webhook URL can use it to post unauthorized messages into the associated channel.
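To make the kind of scanning described above concrete, here is a minimal sketch in Python. It is not Truffle Security's tooling. The two regular expressions are approximations assumed for illustration, and the helper names find_candidates and count_reuse are hypothetical. A pattern match is only a candidate: the sketch does no verification, whereas the investigation counted only secrets confirmed to be valid.

```python
import re
from collections import Counter

# Approximate patterns, assumed for illustration only. Real scanners ship
# many more detectors and verify each candidate against the provider.
PATTERNS = {
    "mailchimp_api_key": re.compile(r"\b[0-9a-f]{32}-us[0-9]{1,2}\b"),
    "slack_webhook": re.compile(
        r"https://hooks\.slack\.com/services/T[A-Z0-9]+/B[A-Z0-9]+/[A-Za-z0-9]+"
    ),
}


def find_candidates(page_text):
    """Return (kind, value) pairs for every pattern match in one page."""
    found = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(page_text):
            found.append((kind, match.group(0)))
    return found


def count_reuse(pages):
    """Count how many pages each distinct candidate appears on."""
    counts = Counter()
    for text in pages:
        # set() ensures a secret repeated within one page counts once.
        counts.update(set(find_candidates(text)))
    return counts


if __name__ == "__main__":
    # Dummy placeholder values, not real credentials.
    pages = [
        '<script>var key = "0123456789abcdef0123456789abcdef-us12";</script>',
        '<div data-key="0123456789abcdef0123456789abcdef-us12"></div>',
    ]
    for (kind, value), n in count_reuse(pages).items():
        print(kind, value, "seen on", n, "page(s)")
```

Counting pages per distinct value is what surfaces reuse of the sort reported for the WalkScore key, where a single secret is copied across many pages and subdomains.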
Training datasets are typically cleaned and filtered, including the removal of sensitive content such as personally identifiable information (PII) and financial data. Experts acknowledge, however, that eliminating all sensitive data from massive datasets like Common Crawl is difficult: pre-processing does not guarantee that every piece of confidential information is stripped out, so the security risk persists.
Truffle Security said it contacted the affected service providers, including Amazon Web Services (AWS), MailChimp, and WalkScore, to alert them to the exposed keys. The companies have since revoked the compromised keys to prevent unauthorized use.