There are many harmless reasons for sending anonymous emails, such as confessing your undying love for someone, seeking anonymous advice, or simply playing a joke on a friend. There are also plenty of harmful ones, including making threats against someone, distributing child pornography or sending viruses, just to name a few. While police can often use the IP address to locate where an email originated, it may be harder to pin down exactly who sent it. A team of researchers claims to have developed an effective new technique for determining the authorship of anonymous emails, one that can provide evidence suitable for presentation in a court of law.
In an attempt to combat the increase in cybercrimes involving anonymous emails, Benjamin Fung, a professor of Information Systems Engineering at Quebec's Concordia University and an expert in data mining, and his colleagues set about developing a novel method of authorship attribution. The method is based on techniques used in speech recognition and in data mining, the practice of extracting useful, previously unknown knowledge from a large volume of raw data. Their approach relies on identifying frequent patterns and unique combinations of features that recur in a suspect's emails.
The technique works by first identifying the frequent patterns found in emails written by a suspect. Any of these patterns that are also found in the emails of other suspects are then filtered out, leaving patterns that are unique to the author of the emails being analyzed. These remaining frequent patterns constitute what the researchers call the suspect's 'write-print', a distinctive identifier akin to a fingerprint.
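The filtering idea described above can be illustrated with a short Python sketch. This is not the researchers' implementation: the function names, the feature labels, the 50 percent support threshold and the limit of two features per pattern are all assumptions chosen for illustration, and each email is assumed to have already been reduced to a set of style features.

```python
from collections import Counter
from itertools import combinations


def frequent_patterns(emails, min_support=0.5, max_size=2):
    """Return feature combinations that occur in at least `min_support`
    of the given emails. Each email is a set of feature labels.
    (Threshold and pattern size are illustrative assumptions.)"""
    counts = Counter()
    for features in emails:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(features), size):
                counts[frozenset(combo)] += 1
    threshold = min_support * len(emails)
    return {pattern for pattern, n in counts.items() if n >= threshold}


def write_print(suspect, emails_by_author, **kwargs):
    """Keep the suspect's frequent patterns that are not frequent
    for any other author in the collection."""
    own = frequent_patterns(emails_by_author[suspect], **kwargs)
    for author, emails in emails_by_author.items():
        if author != suspect:
            own -= frequent_patterns(emails, **kwargs)
    return own


# Hypothetical toy data: two authors, each email reduced to feature labels.
emails_by_author = {
    "alice": [
        {"all_lowercase", "no_greeting"},
        {"all_lowercase", "short_sentences"},
        {"all_lowercase", "no_greeting"},
    ],
    "bob": [
        {"no_greeting", "many_typos"},
        {"no_greeting", "many_typos"},
    ],
}

alice_print = write_print("alice", emails_by_author)
```

In this toy data, "no_greeting" is frequent for both authors and is therefore discarded, while the lowercase habit, alone and in combination with the missing greeting, remains in the first author's write-print.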
"Let's say the anonymous email contains typos or grammatical mistakes, or is written entirely in lowercase letters," says Fung. "We use those special characteristics to create a write-print. Using this method, we can even determine with a high degree of accuracy who wrote a given email, and infer the gender, nationality and education level of the author."
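As a rough illustration of how characteristics like those Fung mentions might be turned into features, the following Python sketch flags an email that is written entirely in lowercase or that contains many unrecognized words. The function name, the feature labels, the tiny word list and the 20 percent cutoff are hypothetical; the article does not describe how the researchers actually extract features.

```python
import re


def style_features(text, known_words, typo_ratio=0.2):
    """Return a set of simple style labels for one email.
    `known_words` stands in for a real dictionary; `typo_ratio`
    is an arbitrary cutoff chosen for illustration."""
    features = set()

    # Written entirely in lowercase letters?
    letters = [ch for ch in text if ch.isalpha()]
    if letters and all(ch.islower() for ch in letters):
        features.add("all_lowercase")

    # Crude typo signal: share of words missing from the word list.
    words = re.findall(r"[A-Za-z']+", text)
    unknown = [w for w in words if w.lower() not in known_words]
    if words and len(unknown) / len(words) > typo_ratio:
        features.add("many_typos")

    return features


# Hypothetical word list and messages.
known_words = {"please", "send", "the", "report"}
sloppy = style_features("pls send teh report asap", known_words)
tidy = style_features("Please send the report.", known_words)
```

Feature sets produced this way, one per email, are the kind of input a frequent-pattern search could then work on.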
Fung and his colleagues tested their technique by examining the Enron Email Dataset – a collection containing over 200,000 real-life emails from 158 employees of the Enron Corporation. Using a sample of 10 emails written by each of 10 subjects – 100 emails in all – they were able to identify authorship with an accuracy of 80 to 90 percent.
"Our technique was designed to provide credible evidence that can be presented in a court of law," says Fung. "For evidence to be admissible, investigators need to explain how they have reached their conclusions. Our method allows them to do this."