To improve and develop email analytics tools, you need a data set on which to test them. However, finding one can be difficult because of privacy concerns. In this blog we look at the top three email data sets available.
One of the most popular data sets for email analysis is the Enron data set. It contains data from around 150 users and roughly 500,000 messages in total. The Federal Energy Regulatory Commission made the data set public during its investigation of Enron. MIT later worked on the data set to fix a number of issues.
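As a minimal sketch of how a single message from a corpus like this might be read, the example below parses one plain-text email with Python's standard `email` package. The message text, the addresses and the `summarise_message` helper are all made up for illustration and are not taken from the Enron data; it also assumes each message is stored as ordinary header-plus-body text.

```python
from email import message_from_string
from email.utils import getaddresses

# Hypothetical message in plain header-plus-body form; not a real corpus record.
RAW_MESSAGE = """\
Message-ID: <12345.example@example.com>
Date: Mon, 14 May 2001 16:39:00 -0700
From: alice@example.com
To: bob@example.com, carol@example.com
Subject: Quarterly forecast

Please find the forecast attached.
"""


def summarise_message(raw: str) -> dict:
    """Pull out the fields most analytics tools start from."""
    msg = message_from_string(raw)
    # getaddresses splits comma-separated header values into (name, address) pairs.
    recipients = [addr for _, addr in getaddresses(msg.get_all("To", []))]
    return {
        "sender": msg["From"],
        "recipients": recipients,
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }


summary = summarise_message(RAW_MESSAGE)
print(summary["sender"], "->", summary["recipients"])
```

Running the same function over every file in a corpus would give a simple table of who wrote to whom, which is a common starting point for testing an analytics tool.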
The UC Irvine Machine Learning Repository maintains 560 data sets, which are available free of charge to the machine learning community. Whilst none is as large as the Enron corpus, some of them are helpful for email work. One example is the Spambase Data Set, which contains features extracted from both spam and non-spam emails rather than the raw messages themselves. It is frequently used to train and test spam models.
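To give a feel for the shape of the Spambase data, here is a small sketch that parses rows in its layout: 57 numeric attributes followed by a class label, where 1 means spam and 0 means non-spam. The `parse_rows` helper and the three sample rows are invented for illustration; the real values would come from the data file you download from the repository.

```python
import csv
import io

N_FEATURES = 57  # Spambase rows: 57 numeric attributes, then a 0/1 class label


def parse_rows(text):
    """Split comma-separated rows into feature vectors and labels."""
    features, labels = [], []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        if len(row) != N_FEATURES + 1:
            raise ValueError(f"expected {N_FEATURES + 1} columns, got {len(row)}")
        features.append([float(value) for value in row[:N_FEATURES]])
        labels.append(int(row[N_FEATURES]))
    return features, labels


# Synthetic rows in the same layout; these are NOT real Spambase values.
SAMPLE = "\n".join([
    ",".join(["0.5"] * N_FEATURES + ["1"]),
    ",".join(["0"] * N_FEATURES + ["0"]),
    ",".join(["0.1"] * N_FEATURES + ["0"]),
])

features, labels = parse_rows(SAMPLE)
spam_share = sum(labels) / len(labels)
print(f"{len(labels)} rows, {spam_share:.0%} labelled spam")
```

Checking the column count up front is a cheap guard against a truncated or mis-delimited download before any model is trained.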
Often your own personal or business email can provide an easily accessible data set to analyse. An inbox is a historical archive of your life, and for many users it is a simple way to keep a log without much effort. The data set may run into many thousands of messages and can cover a wide variety of subjects. Because the data is your own, the privacy concerns that make other data sets hard to obtain are much reduced.
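If your mail client can export to an mbox file, a first pass over your own inbox might look like the sketch below, which counts messages per sender using Python's standard `mailbox` module. The `count_senders` helper, the file name and the three demo messages are hypothetical; the demo builds a tiny temporary mailbox only so the example runs on its own.

```python
import mailbox
import os
import tempfile
from collections import Counter
from email.utils import parseaddr


def count_senders(mbox_path):
    """Count messages per sender address in an mbox file."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        counts = Counter()
        for msg in box:
            # parseaddr turns 'Alice <alice@example.com>' into the bare address.
            _, address = parseaddr(msg.get("From", ""))
            if address:
                counts[address.lower()] += 1
        return counts
    finally:
        box.close()


# Build a tiny throwaway mailbox so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")
    demo = mailbox.mbox(path)
    for sender in ("Alice <alice@example.com>", "bob@example.com", "alice@example.com"):
        demo.add(f"From: {sender}\nSubject: demo\n\nHello\n")
    demo.close()
    sender_counts = count_senders(path)

print(sender_counts.most_common())
```

Pointing `count_senders` at a real export would show which correspondents dominate the archive, a quick sanity check before any deeper analysis.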
Threads is a great tool for analysing your own data set. Threads aggregates your entire company’s email and telephone calls into a single database. It enables you to search across multiple email data sets in one go, providing deep insights into your own data. Try Threads free for 14 days today.