Data & Tools

Summary

The datasets are summarized in the following table; more details are given in the Datasets part below.

Dataset Name | Short Description | Access Link
RESIP | Residential-proxy IPs | Data
ResiFlow | Residential-proxy traffic flows | Data
SpamHunter | SMS spam messages | Data
SearchIPT | Illicit promotional texts that poison search engines | Data
xPIP | Posts published on the X platform to promote illicit goods and services | Data
IFTTT | Services and applets on IFTTT, a trigger-action workflow platform | Data

The tools released as part of my research are summarized in the table below; more details are given in the dedicated Tools part.

Tool Name | Short Description | Access Link
The RESIP Infiltrator | Given a RESIP service, captures its residential-proxy IP addresses | Code
The SpamHunter | Continuously captures SMS spam messages as reported by victims on public social networks (especially the X platform) | Code
The IPT Toolchain | Captures illicit promotional texts distributed by poisoning search engines; also includes an analysis module for in-depth analysis of IPTs | Code
The PIP Hunter | Captures posts of illicit promotion distributed on online social networks | Code

Datasets

RESIP: Datasets of residential-proxy IPs. Each of these datasets consists of millions of distinct IPs that were observed to serve as residential proxies. For more details, please refer to the project website. Also, the characteristics of these datasets are presented along with the collection methodology in multiple papers: [SP'19-a] [NDSS'21] [CCS'22-a].

ResiFlow: a dataset of residential-proxy traffic flows. By deploying residential-proxy nodes, we have captured over 3 TB of residential-proxy traffic, comprising over 116 residential-proxy traffic flows, that is, TCP/UDP flows relaying network traffic between proxy users and diverse traffic destinations. Due to ethical considerations, this dataset is available to researchers only upon request and vetting. Refer to the project website to request access and learn more details.

SpamHunter: a dataset of SMS spam messages. Leveraging the SpamHunter tool, we have collected and made available tens of thousands of SMS spam messages, the largest-ever public SMS spam dataset by number of messages. This SpamHunter dataset is available on this project website, and the collection methodology is described in detail in [CCS'22-b].

SearchIPT: illicit promotional texts on search engines. This dataset contains over 11 million illicit promotional texts (IPTs) discovered on major search engines. These IPTs were used to promote 14 categories of illicit goods or services. Learn more details from the project website.

xPIP: posts of illicit promotion on the X platform. This dataset comprises 12 million distinct posts of illicit promotion (PIPs). These PIPs were published by 580K X accounts to promote illicit goods or services in 10 different categories. Learn more details from the project website.

IFTTT: services and applets on the IFTTT platform. IFTTT, a trigger-action workflow platform, allows a user to set up if-then-else workflows that connect web services and IoT devices. We crawled all the services and applets (workflows) on IFTTT between Nov 2016 and May 2017; the resulting dataset is available on this project website.

Tools

The RESIP Infiltrator. This tool can be used to capture the residential-proxy IPs of a residential-proxy service, as long as the service follows the back-connect proxy mode. The code base is available in this GitHub repo.
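To make the back-connect idea concrete: in this mode a client talks only to the service's gateway, and the destination server sees the IP of whichever residential node relayed the request. The following is a minimal sketch, not the released tool's code. It assumes that probe requests carrying unique tokens were sent through the gateway to a server under our control, and that the server's log yields (token, source IP) pairs. The function name `collect_exit_ips` and the record format are hypothetical.

```python
import ipaddress
from typing import Iterable, Set, Tuple


def collect_exit_ips(
    records: Iterable[Tuple[str, str]], issued_tokens: Set[str]
) -> Set[str]:
    """Return the distinct source IPs logged for probes that we issued.

    records: hypothetical (token, source_ip) pairs from our own server's log.
    issued_tokens: tokens we embedded in requests sent through the gateway.
    """
    exit_ips: Set[str] = set()
    for token, source_ip in records:
        if token not in issued_tokens:
            continue  # not one of our probes (e.g. unrelated scanner traffic)
        try:
            addr = ipaddress.ip_address(source_ip.strip())
        except ValueError:
            continue  # malformed log entry
        exit_ips.add(str(addr))  # normalized textual form; the set deduplicates
    return exit_ips
```

Matching on tokens that only our probes carry is one simple way to keep unrelated visitors to the logging server out of the result.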

The SpamHunter. The SpamHunter is designed to discover SMS spam messages as reported by victims on Twitter (now the X platform). It has led to the largest-ever publicly available SMS spam dataset, as described above. This tool is available at this GitHub repo.

The IPT Toolchain. As shown in the figure below, this toolchain is designed to capture and analyze illicit promotional texts (IPTs). It consists of three modules: the IPT hunter, which queries search engines for IPT candidates and classifies IPTs; the IPT analyzer, which performs multi-class IPT classification and IPT contact extraction; and the IPT infiltrator, which further infiltrates IPT contacts (especially Telegram accounts and websites). This toolchain is available at this GitHub repo.

[Figure: overview of the IPT Toolchain]
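As an illustration of the contact-extraction step mentioned for the IPT analyzer, here is a minimal sketch that pulls Telegram usernames and website URLs out of a text. The toolchain's actual extraction method is not described here; the regular expressions and the name `extract_contacts` are assumptions made for this example only.

```python
import re
from typing import Dict, List

# Assumed patterns (illustrative only): t.me links with or without a scheme,
# and plain http(s) URLs. Telegram usernames are 5-32 word characters.
TELEGRAM_RE = re.compile(
    r"(?<![\w.-])(?:https?://)?t\.me/([A-Za-z0-9_]{5,32})", re.IGNORECASE
)
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_contacts(text: str) -> Dict[str, List[str]]:
    """Return Telegram usernames and other website URLs found in text."""
    # Telegram usernames are case-insensitive, so normalize to lower case.
    telegram = sorted({name.lower() for name in TELEGRAM_RE.findall(text)})
    websites = sorted(
        {
            url.rstrip(".,;)")  # drop trailing punctuation picked up from prose
            for url in URL_RE.findall(text)
            if TELEGRAM_RE.match(url) is None  # t.me links are counted above
        }
    )
    return {"telegram": telegram, "websites": websites}
```

Keeping the two contact types in separate lists mirrors the distinction the text draws between Telegram accounts and websites.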

The PIP Hunter. This tool can be used to capture posts of illicit promotion (PIPs). It is well tested on the X platform and can be adapted to other social network platforms, provided you have API access to them. Leveraging this tool, we collected the xPIP dataset described above. This tool is available at this GitHub repo.