In addition to being a nuisance, spam (junk) emails waste users’ time, disk space, and network bandwidth. On my way back to KFUPM after the summer vacation, a simple idea about spam filtering hit me.
It all began with a simple question: “Why don’t you want to see spam emails?” The answer was straightforward: “Because I’m not interested in whatever subjects the email is talking about.” Then I started thinking:
We can’t rely on the subject header of the email because it can be totally unrelated to the body, so we have to look at the content itself. What if we extracted keywords that represent the main subject(s) of the email and then compared them with keywords that represent the subjects the user is interested in? From that comparison, we would come up with a predicted “level of interest”. If it is too low, the user will most probably not be interested in seeing the email (i.e., it’s spam).
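To make the idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration, not our actual design: the frequency-based `extract_keywords` stands in for a real keyword extraction service, the “level of interest” is taken to be the fraction of email keywords that also appear in the user’s interests, and the `threshold` of 0.2 is arbitrary.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "and", "for", "you", "with", "this", "that", "are", "our", "your", "now"}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword words as a set.

    A crude stand-in for a real keyword extraction service.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return {word for word, _ in counts.most_common(top_n)}


def interest_level(email_keywords, user_interests):
    """Fraction of the email's keywords that match the user's interests (0.0 to 1.0)."""
    if not email_keywords:
        return 0.0
    return len(email_keywords & user_interests) / len(email_keywords)


def is_spam(body, user_interests, threshold=0.2):
    """Flag the email as spam when the predicted level of interest is below threshold."""
    return interest_level(extract_keywords(body), user_interests) < threshold


user_interests = {"python", "machine", "learning"}
print(is_spam("Cheap pills cheap pills buy pills now", user_interests))
print(is_spam("The machine learning seminar covers machine learning with python", user_interests))
```

With these interests, the first email shares no keywords with the user and is flagged, while the second shares three of its five keywords and is kept. Exact set intersection is the simplest possible comparison; it misses synonyms and related terms, which is exactly the part a learned comparison would need to handle.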
This idea is now the core of my senior project. It will be a research-oriented, AI-related project. For the first phase (keyword extraction), my teammate and I will most probably use some of the available services. We will focus our efforts on the second phase (keyword comparison). We have to figure out exactly how to do it and how to incorporate machine learning into it. We might also improve it by using Bayesian Belief Networks and/or Functional Network classifiers.
Spam filtering is one of the hot topics in the application of data mining and AI techniques. By working on this project, we hope to contribute to the ongoing research and develop an approach that could serve as the basis for a new filtering technique or as an addition to existing ones.