Malicious actors have multiple ways to share data they have stolen from websites or services. Some might post to popular forums to gain notoriety while others might post anonymously to paste sites like PasteBin. Combing through all the pastes being posted is beyond the ability of humans, so I’ve created a tool that helps users to find interesting data on PasteBin in real time.
How much data is there, and why is it hard to search?
On average, 100 files are uploaded to PasteBin every minute, which makes combing through all of that data very difficult. It gets even worse when a user is trying to find specific information. PasteBin doesn’t organize or index data by content, just by title. Most users on the site post anonymously and without a title, so many of the available pastes are just called “Untitled”.
A tool searching specifically for usernames and passwords would have to scrape every paste, apply some form of matching, and then output the paste ID for manual human review later. This creates a whole host of other issues. PasteBin periodically removes pastes that could potentially contain sensitive information or have been reported by the community. By the time a user gets around to manually reviewing the content of the paste, it may have been removed. In an ideal world, the content of the paste is saved and indexed for retrieval later.
Tools to search paste content and save pastes already exist, but many of them are projects that have been abandoned, are closed source, or just don’t work very well. The most popular one was Dumpmon, which went inactive in October of 2018. Dumpmon was a Twitter bot built by Jordan Wright that monitored three common paste sites (PasteBin, Pastie, and Slexy) for potential data leaks. Other tools include PasteBinLeaks, which is open source and was deactivated in 2011, and PasteBinDorks, which is closed source and inactive as of 2014.
What is pastebin_scraper?
After Dumpmon ceased operation, there didn’t appear to be any other public tools that monitored PasteBin in real time for data leakage. Dumpmon was created as an open source project, so I downloaded the source code from GitHub and tried to get it working again. It took a few weeks, but I was able to get my own version of the tool working, with some significant changes outlined here:
- Updated to use Python 3
- Removed Slexy/Pastie compatibility since those sites no longer exist
- Removed all Twitter components
- Made logging more verbose for debugging
- Now uses the PasteBin API so your IP won’t be blacklisted
- Removed redundant code blocks
- Changed the database from MongoDB to MySQL
How does pastebin_scraper work?
Every 60 seconds, the tool queries the PasteBin API for the latest 100 pastes. It keeps a reference paste ID so that it doesn’t scan the same pastes multiple times and waste processor cycles. Each new paste ID is then used to download the full paste text, which is searched using a list of regular expressions. These attempt to identify email addresses, hashed passwords, Cisco device configuration files, and other such data. If the paste contains information that is considered interesting, the paste information (author, paste ID, interesting content type) is saved into the database and a line is written to the log file that identifies the paste ID and what kind of data it contains.
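The loop described above can be sketched in a few lines of Python. This is a simplified illustration rather than the project's actual code: the three regular expressions are stand-ins for the real list, the `key` and `user` field names and the newest-first ordering of the listing are assumptions about the API response, and `fetch_text` is a hypothetical callable that downloads the text of one paste. Database writes and logging are left out.

```python
import re

# Illustrative patterns only; the project's real expressions may differ.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "md5_hash": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "cisco_config": re.compile(r"enable (?:secret|password) \S+"),
}


def classify(text):
    """Return the sorted names of every pattern that matches the paste text."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))


def new_pastes(listing, last_seen_key):
    """Return the entries that come before last_seen_key.

    Assumes the listing is ordered newest first, so everything from the
    reference paste ID onward was already handled in an earlier cycle.
    """
    fresh = []
    for entry in listing:
        if entry["key"] == last_seen_key:
            break
        fresh.append(entry)
    return fresh


def scan(listing, last_seen_key, fetch_text):
    """Classify unseen pastes and return (hits, new reference paste ID).

    fetch_text is a hypothetical callable: paste ID in, paste text out.
    """
    fresh = new_pastes(listing, last_seen_key)
    hits = []
    for entry in fresh:
        kinds = classify(fetch_text(entry["key"]))
        if kinds:
            hits.append({
                "key": entry["key"],
                "author": entry.get("user", ""),
                "kinds": kinds,
            })
    # The newest entry becomes the reference ID for the next cycle.
    new_reference = fresh[0]["key"] if fresh else last_seen_key
    return hits, new_reference
```

In a real run, `scan` would be called once every 60 seconds with a fresh listing, and each hit would be written to the database and the log file.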
Features to be added
As with any project, there is always more that can be done. I have compiled a list of features that I will work on adding in my free time. These features include:
- Splitting found emails and password combinations and saving them separately
- Creating a script to automate database and table creation
- Adding support for multiple hash types
- Adding some form of statistics tracking, or possibly a daily email
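As a starting point for the first item on that list, here is one possible way to split email and password combinations. It is only a sketch under a few assumptions: one combination per line, an email address on the left, and a `:`, `;`, or `|` separator. The `split_combos` name and the pattern are hypothetical and not part of the project.

```python
import re

# Assumed layout: "email<separator>password", one combination per line.
COMBO = re.compile(
    r"^\s*([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,})"  # email address
    r"\s*[:;|]\s*"                                            # separator
    r"(\S+)\s*$"                                              # password, no spaces
)


def split_combos(text):
    """Return (email, password) tuples for every line that looks like a combo."""
    pairs = []
    for line in text.splitlines():
        match = COMBO.match(line)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs
```

Lines that don't fit the pattern are skipped, so passwords containing spaces or dumps in other layouts would need extra handling.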
I have some experience in Python but I don’t consider myself a developer. I may not have used best practices or efficient code, but it works! I welcome any contribution to help me create new features or make the tool work more efficiently.
The project is available on GitHub here: https://github.com/Critical-Start/pastebin_scraper