Rough plan for Spam Detection and Junk Management
Currently these projects are planned for 0.6 (Spring/summer 2005), so this is a rough plan intended more to start identifying issues than to do detailed estimates or design.
We assume that we will integrate a third-party client-oriented spam detection engine. SpamBayes has the biggest mindshare, but there are other options. At some point we'll have to investigate these in detail and choose; for now it's enough to know that they exist. Integrating a spam detection engine will certainly require work, because these engines need to keep some historical data about past spam in order to identify future spam, and we have to decide whether that data is stored on disk or in our repository. Both options are work: if on disk, we need to manage a separate area of user data across installs and upgrades; if in the repository, we need to model the data and provide a layer to store it. There may also be operating system differences that I haven't thought of yet.
There are also server-side spam detection engines. Some of these require no client-side work at all, but others do -- these modify the email headers and provide spam flags and ratings. We hope that working with these is as simple as reading the "x-spam-status" header, and even better, that the spam detection library will do that work for us.
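If no library does this for us, reading the header ourselves might look roughly like the sketch below. It assumes a SpamAssassin-style value such as "Yes, score=7.2 required=5.0 ..."; other server-side filters may format the header differently, and the function name read_server_verdict is made up for illustration.

```python
import re
from email import message_from_string

# Assumption: SpamAssassin-style value, e.g. "Yes, score=7.2 required=5.0 tests=..."
_STATUS_RE = re.compile(
    r"^\s*(yes|no)\b(?:.*?\bscore=(-?\d+(?:\.\d+)?))?",
    re.IGNORECASE | re.DOTALL,
)


def read_server_verdict(raw_message):
    """Return (is_spam, score) from X-Spam-Status, or None if absent or unparseable.

    score is None when the header carries a yes/no verdict but no number.
    """
    msg = message_from_string(raw_message)
    value = msg.get("X-Spam-Status")  # header name lookup is case-insensitive
    if value is None:
        return None
    match = _STATUS_RE.match(value)
    if match is None:
        return None
    is_spam = match.group(1).lower() == "yes"
    score = float(match.group(2)) if match.group(2) else None
    return is_spam, score
```

A message with no such header yields None, so the caller can fall back to the client-side engine.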
When we do detect spam, we will mark it as such (requires ContentModel work). We envision only a yes/no flag and no ratings at this point. The presence of the junk flag means that the flagged Content Item does not appear in any normal view. One special out-of-the-box view will need to be created to show junk.
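The flag-and-views behaviour could be sketched as follows. The ContentItem class here is a hypothetical stand-in with only the fields the example needs, not the real content model; is_junk, normal_view and junk_view are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class ContentItem:
    # Hypothetical stand-in for a Content Item, not the real content model.
    title: str
    is_junk: bool = False  # yes/no flag only, no rating


def normal_view(items):
    """Items for any normal view: everything that is not flagged as junk."""
    return [item for item in items if not item.is_junk]


def junk_view(items):
    """Items for the one special out-of-the-box view that shows junk."""
    return [item for item in items if item.is_junk]
```

Every item lands in exactly one of the two views, so nothing is lost when an item is flagged.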
Junk management is tied into spam detection but now we're looking more at the user interface side of things.
- Users can mark a non-spam email as junk.
- Users can go to the junk folder and review junk and mark some of it as not junk.
Both of these actions will need to feed information back into the spam detection engine so it can learn.
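A minimal sketch of that feedback loop, assuming the engine exposes some kind of train(text, is_spam) call. That interface is hypothetical and is not the API of SpamBayes or any other library; RecordingEngine only records what it is told so the flow is visible.

```python
class RecordingEngine:
    """Hypothetical stand-in for a trainable spam engine; it only records calls."""

    def __init__(self):
        self.trained = []

    def train(self, text, is_spam):
        self.trained.append((text, is_spam))


class Message:
    """Hypothetical minimal message: its text plus the junk flag."""

    def __init__(self, text, is_junk=False):
        self.text = text
        self.is_junk = is_junk


def set_junk(message, is_junk, engine):
    """Apply the user's junk / not-junk decision and report it to the engine.

    Returns True if the flag changed, False if the user confirmed the current state.
    """
    if message.is_junk == is_junk:
        return False  # no correction was made, so nothing is reported
    message.is_junk = is_junk
    engine.train(message.text, is_spam=is_junk)
    return True
```

Whether an unchanged flag should also count as training data (the user reviewed the item and agreed) is a design choice this sketch leaves out.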
User interface issues:
- Some users collect junk and never review it. Others want junk mail to be automatically deleted. Others review junk mail and delete it manually -- meaning that any junk mail that is still around is something they haven't yet reviewed. Which of these modes do we support?
- What user configuration do we allow? The ability to mark some sender addresses as "always junk" or "never junk"? The ability to tune the spam detection algorithm to be stricter or more lenient?
- What happens when the user marks an Item as junk (or as not junk)? Does the Item immediately disappear from the view?
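If we did allow "always junk" and "never junk" sender lists, one possible shape is sketched below. Everything here is an assumption for discussion: the JunkSettings name, storing addresses in lower case, and letting "never junk" win when an address appears in both lists.

```python
from dataclasses import dataclass, field


@dataclass
class JunkSettings:
    # Hypothetical per-user settings; addresses are assumed to be stored in lower case.
    always_junk: set = field(default_factory=set)
    never_junk: set = field(default_factory=set)


def apply_sender_rules(sender, engine_says_spam, settings):
    """Return the final junk flag after applying the user's sender lists."""
    address = sender.strip().lower()
    if address in settings.never_junk:
        return False  # assumption: "never junk" takes precedence over everything
    if address in settings.always_junk:
        return True
    return engine_says_spam  # no rule applies, so keep the engine's verdict
```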
-- LisaDusseault - 09 Dec 2004
I would strongly recommend making the spam score a number and not a boolean. Off-the-shelf server-side spam filters tend to be black boxes: they don't have good ways of taking input from the client like who is in the user's address book, who the user frequently corresponds with, what the user thinks is spam, etc. That is really valuable and useful information.
Conversely, the client usually doesn't get information that the SMTP server has, like how many accounts on this server have gotten email from this IP address today, or what the envelope information is. The poor communication between the client and the server is a result of history, and IMHO not likely to change until a brand new email protocol shows up (MailDAV, anyone?).
I believe, then, that the best spam filtering will happen when you have some work done on the server and some work done on the client. In order to do a good job, the client needs as much information as possible from the server, like a number representing the server's opinion of how spammy the message is.
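To illustrate why a number is more useful to the client than a boolean, here is a rough sketch in which the client moves the cut-off for senders found in the user's address book. The function name, the base threshold of 5.0 and the bonus of 3.0 are arbitrary values chosen for the example, not recommendations.

```python
def is_junk(server_score, sender, address_book,
            base_threshold=5.0, known_sender_bonus=3.0):
    """Decide junk from the server's numeric score plus client-side knowledge.

    All names and default values are illustrative assumptions.
    address_book is assumed to hold lower-case addresses.
    """
    threshold = base_threshold
    if sender.strip().lower() in address_book:
        # The client knows this correspondent, so it demands stronger evidence.
        threshold += known_sender_bonus
    return server_score >= threshold
```

With only a yes/no flag from the server, a message scoring 6.0 would be junk for everyone; with the number, the client can let it through when the sender is a known correspondent.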