Spam Fighting Secrets Revealed
In many ways, this post has been over 2 years in the making. When I started blogging, I decided, as a long-time coder and control freak, to do the back-end programming myself. I quickly came to regret that decision, as blog spammers flooded the site long before I had any legitimate readers. Gradually, though, I began to see the fight against spam as a personal challenge, and decided to develop my own code to even the odds.
That code has developed quite a bit over the past 2 years, and I've decided to make it available to my readers. I realize that I'm probably taking a big risk by doing this, and may get hit with a lot of spam, but I've come to see spam as a serious threat to usability, and I think we have to be willing to talk openly about how to fight it.
The Code
I'll start out with the link to the anti-spam code (written in PHP), in case you want to follow along while reading the post. For the non-coders out there, don't worry; I'll be discussing all of the techniques in a general, cross-platform sense. A couple of notes to you coders: (1) the code assumes that the input variable "$comment" contains your text, (2) you'll have to adjust the thresholds for your own situation, and (3) if you want to see the details for any given bit of text, I've created an output variable called "$spam_report".
Spam Filters
Below are each of the techniques and filters used in the anti-spam code. The system operates by assigning weights to each filter and then comparing those weights against two thresholds: low spam probability and high spam probability. Low-probability spam is marked for approval, whereas high-probability spam is automatically rejected. I've found that this 3-tiered system (accept, approve, reject) is more forgiving and allows for greater flexibility.
Trigger Words
This is simply a list of suspicious words (Viagra, anyone?) that gets compared to the comment text and tallied up. Although this is probably the lowest-tech of the filters, it has two additional advantages: (1) you can use it to block obscene or harassing content, and (2) you can use it to quickly filter out persistent spammers while you look for a better solution.
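For readers who want to see the idea in code, here is a minimal sketch of a trigger-word tally. It is written in Python for illustration and is not the PHP from the post; the word list and the function name `count_trigger_words` are hypothetical.

```python
import re

# Hypothetical word list for illustration; the post does not publish its own list.
TRIGGER_WORDS = {"viagra", "casino", "pharmacy"}


def count_trigger_words(comment, trigger_words=TRIGGER_WORDS):
    """Return how many words in the comment appear in the trigger list."""
    # Lowercase the text and split it into simple word tokens.
    words = re.findall(r"[a-z0-9']+", comment.lower())
    return sum(1 for word in words if word in trigger_words)
```

The tally counts every occurrence, so a comment that repeats the same trigger word scores higher than one that uses it once.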
Link Counting
Links are the lifeblood of blog spammers, and link counting is the single most powerful filter I've found. The system has two separate thresholds, one for counts of "http:" and one for the "url=" format that some blogs and forums used to use. These days, I've found "url=" to almost always be indicative of spam (unless your system still uses it, of course), and in my personal experience, any blog comment with more than one "http:" link is suspicious.
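A rough Python sketch of the two link counts follows (again, not the post's PHP). The limits are placeholder values based on the "more than one link is suspicious" rule of thumb above, and matching "https:" as well as "http:" is an addition made in this sketch.

```python
import re

# Assumption: "https:" is counted alongside "http:"; the post only mentions "http:".
HTTP_PATTERN = re.compile(r"https?:", re.IGNORECASE)
URL_TAG_PATTERN = re.compile(r"url=", re.IGNORECASE)

# Hypothetical limits; adjust them for your own site.
MAX_HTTP_LINKS = 1
MAX_URL_TAGS = 0


def count_links(comment):
    """Count plain links and forum-style "url=" links separately."""
    return {
        "http": len(HTTP_PATTERN.findall(comment)),
        "url": len(URL_TAG_PATTERN.findall(comment)),
    }


def link_flags(comment):
    """Report which of the two link limits the comment exceeds."""
    counts = count_links(comment)
    return {
        "http": counts["http"] > MAX_HTTP_LINKS,
        "url": counts["url"] > MAX_URL_TAGS,
    }
```

Note that a "url=" link usually contains an "http:" as well, so such a comment adds to both counts.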
Text Density
This filter strips out HTML tags and compares the length of the remaining text to the overall length of the comment. The limit is the lowest percentage of text-to-content that you deem acceptable. I designed this filter specifically to handle spam comments that consist of only one link but have just a few words of text around it. Those comments have an unusually low text density.
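The text-density calculation might look something like this in Python. The regular expression is a deliberately crude stand-in for a real tag stripper, and scoring an empty comment as 0% is a choice made for this sketch, not something the post specifies.

```python
import re

# Crude tag matcher: anything between "<" and ">" is treated as markup.
TAG_PATTERN = re.compile(r"<[^>]*>")


def text_density(comment):
    """Return visible text as a percentage of the full comment length."""
    if not comment:
        return 0.0  # assumption: an empty comment gets the lowest score
    stripped = TAG_PATTERN.sub("", comment)
    return 100.0 * len(stripped) / len(comment)
```

A comment made up of one long link wrapped around a word or two ends up with a density of only a few percent, while a plain-text comment scores 100%.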
Vowel Density
Have you ever typed nonsense text that looked something like "sdfgsdfgsfdg"? Spammers do it all the time, and it creates a very unusual vowel-to-consonant ratio. Vowel density is the ratio of vowels to total text (after HTML tags are stripped out). The density of a normal post can range quite a bit, so you'll have to play around with this one. The vowel density filter has the added advantage of flagging non-Roman characters, for those of you who get Chinese and Russian spam. For reference, the average vowel density on my blog is currently 23.4%. The threshold setting represents the lowest acceptable limit as a percentage.
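Here is a hedged Python sketch of the vowel-density calculation, followed by one way the results of the filters in this section could be turned into the accept/approve/reject verdict described at the start of Spam Filters. Summing the weights of the filters that fired is an assumption; the post only says the weights are compared against two thresholds. The names `vowel_density` and `classify`, and the threshold values in the usage note, are hypothetical.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]*>")  # crude stand-in for a real tag stripper
VOWELS = set("aeiouAEIOU")


def vowel_density(comment):
    """Return vowels as a percentage of all characters left after stripping tags."""
    text = TAG_PATTERN.sub("", comment)
    if not text:
        return 0.0  # assumption: nothing to measure counts as the lowest density
    vowels = sum(1 for ch in text if ch in VOWELS)
    return 100.0 * vowels / len(text)


def classify(fired_weights, low_threshold, high_threshold):
    """Map the weights of the filters that fired to a three-tier verdict."""
    score = sum(fired_weights)  # assumption: weights are simply added together
    if score >= high_threshold:
        return "reject"   # high spam probability: rejected automatically
    if score >= low_threshold:
        return "approve"  # low spam probability: held for manual approval
    return "accept"       # below both thresholds: posted normally
```

For example, with a low threshold of 2 and a high threshold of 5, a comment that trips only one filter of weight 2 would be held for approval, while one that trips two filters of weight 3 would be rejected. Because only the Roman vowels are counted, text written entirely in Cyrillic or Chinese characters scores 0%.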
Spam Countermeasures
In addition to the spam-filtering code, I have a couple of extra features in place on my blog to deter spam that you won't find in the PHP:
Image-based Submit
If you get hit by a lot of automated spam, this can be shockingly effective. Most automated programs use a method of form posting that only works for standard submit buttons and not image-based (input type="image") buttons. Switch your comment submission button to an image, and you may see a dramatic drop in spam. Better yet, switch your button to an image that looks like a CSS-styled button. How? It's easy: just create the CSS button, take a screen capture of it, and save it as an image.
24-hour Nofollow
This is a bit of an experiment, but since nofollowed links are essentially worthless to spammers, I've adopted a 24-hour nofollow policy. The nofollow is automatically lifted by the code after 24 hours, giving me enough time to remove spam comments without harming my legitimate visitors.
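One way to implement the 24-hour rule is to decide the rel attribute when the comment is rendered, based on the comment's age. The Python sketch below assumes each comment stores its posting time; `link_rel` and `render_link` are hypothetical helpers, not part of the post's code.

```python
import html
from datetime import datetime, timedelta, timezone

NOFOLLOW_WINDOW = timedelta(hours=24)


def link_rel(posted_at, now):
    """Return "nofollow" while the comment is under 24 hours old, else ""."""
    return "nofollow" if now - posted_at < NOFOLLOW_WINDOW else ""


def render_link(url, label, posted_at, now):
    """Build the anchor tag, adding rel="nofollow" only inside the window."""
    rel = link_rel(posted_at, now)
    rel_attr = f' rel="{rel}"' if rel else ""
    href = html.escape(url, quote=True)
    return f'<a href="{href}"{rel_attr}>{html.escape(label)}</a>'
```

Computing the attribute at render time means nothing has to run in the background to lift the nofollow; the link simply starts rendering without it once the comment is a day old.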
Disclaimers & Licensing
Will these techniques make you immune to spam? Of course not. Any spammer who just wanted to prove a point could certainly get a nonsense comment posted on this blog. The goal is to put the low-hanging fruit out of the reach of opportunistic blog spammers, especially people who rely on automation or are using spam comments to generate cheap, easy links.
A bit of legalese: You are free to use the PHP code in this post as you like on your own blog, either in full or in part. I only ask that you not commercialize or mass-produce the code without consulting with me first. Additionally, if you use this code on an active blog and make changes or improvements, I'd love to hear about them.
Hamlet Batista
· Wednesday, June 25
Dr Pete - This is a very useful contribution to the comment spam fighting community :-) Thanks a lot.
It would be interesting to measure your scripts/techniques against Akismet. Have you thought about or researched incorporating Bayesian filtering?
Cheers
Dr. Pete
· Wednesday, June 25
I've heard good things about Akismet and have looked at Bayesian filtering a little bit (more on the email side), but one of my goals was to see how much I could do with relatively simple, algorithmic approaches. One nice aspect of this is that these filters (either in combination or separately) could easily be used as pre-filters for a third-party approach like Akismet.
Adult Ühler
· Thursday, June 26
These sound like some very advanced features. I would imagine your code would beat anything Akismet has produced, knowing Automattic's security track record and generally poor coding standards. May give this a whirl if I can get the blog back online.
Dr. Pete
· Thursday, June 26
@Adult: You're far too kind. I am a bit obsessive about my coding and am proud of the simplicity of it, but I'm not going to rush out to try to commercialize this just yet. It's more of a personal experiment.
Steven Bradley
· Sunday, June 29
Very nice of you to release this code, Pete. Just a thought, but I would think this could make for a very welcome plugin. Any thoughts on developing one?
Dr. Pete
· Sunday, June 29
@Steven: I've considered it, but since I don't use any of the major platforms, it gets a bit tricky. If there's anyone out there who likes this code and is WordPress savvy, I'd definitely be open to working with them on something.
Steve
· Thursday, July 3
I've used my own home-grown anti-spam lib, which has a lot of items similar to the ones you mentioned.
I haven't looked at the code yet, but in my lib, I have a list of keywords that set off alarms if found in comments. I've spaced them out here in case they trigger your anti-spam filters ;-)
"b u y", "s a l e", "c h e a p", "x % o f f", "o r d e r", etc.
It's been my experience that spam contains two things: one is a link to the spammer's lair, the other is "sales jive" to lure people in.
That all said, I like the 24-hour nofollow idea. I slapped nofollow on everything as a blanket measure, but I like your approach better! Very clever.
Dr. Pete
· Thursday, July 3
@Steve - Good point about call-to-action words like "buy" and "cheap"; those can be important spam flags in blog comments. Unfortunately, I'm starting to see more spam where the comment itself is innocuous, but the main link is spammy. I suspect that we're ultimately going to have to start following the links and evaluating the target site for spam content.
I've also toyed with more aggressive countermeasures, like auto-reporting spam links back to the source domain. A lot of times, they're being unknowingly hosted on social media sites, university sites, news sites, etc. Sometimes, I bounce particularly heinous spammers back to their own site or to another spammer's site, just to mess with their heads :)




