Do you blog? Have you noticed an increase in "referer spam"? I have. It seems like every day now about three sites come along and "link" to every page on my blog. It's become annoying, because I really enjoy looking through my referral logs; they're one of the best ways to find cool new blogs. Anyway, this morning I decided to do something about it: I wrote a little spam filter for my server logs.
What's all this about? Well, every time someone requests a page from my server, it causes the server to write a log entry into a file. As part of the request, there can be a "referer", which is the address of the page from which the request was made. If the request was a result of a link, the referring page will be the page which contained the link. If site X has a link to me, and someone clicks on that link, the log will have the URL of the page on site X.
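To make the idea concrete, here is a rough Python sketch of pulling the referer out of a single log entry. It assumes Apache's "combined" log layout, where the referer is the second-to-last quoted field and "-" means none was sent; my actual log format may differ. The function name `extract_referer` and the sample line (addresses, date, and all) are made up for illustration.

```python
import re

# Assumption: Apache "combined" format. The referer is the quoted field
# just before the user agent, and "-" stands for "no referer sent".
COMBINED = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def extract_referer(line):
    """Return the referer from a combined-format line, or None if absent."""
    match = COMBINED.match(line)
    if match is None:
        return None  # line is not in the assumed format
    referer = match.group("referer")
    return None if referer in ("", "-") else referer


# Hypothetical entry: someone on site X clicked a link to this site.
sample = (
    '203.0.113.7 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.1" 200 5120 '
    '"http://site-x.example/links.html" "Mozilla/4.0"'
)
print(extract_referer(sample))
```

A request that did not come from a link shows "-" in that field, and the function returns None for it.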
Unfortunately there are dirtbags out there who exploit this as a means of publicity; they make bogus requests to my site, giving the URL of their site as the referer. Of course their site doesn't actually link to me; they just want me to go check out their site. It is a very lame way of publicizing a site because 1) the only audience for the referers is a site's webmaster and 2) if the webmaster does visit the referred-to site, s/he will already have a really low opinion of the site operators.
So, what to do? Well, I'm already piping logs through a filter, the great little [free] program called cronolog. I just added another filter to get rid of referer spam. Here's my new ErrorLog entry (in Apache's httpd.conf):
ErrorLog "| /var/log/httpd/w-uh.com/reffilter.ksh | /usr/local/bin/cronolog /var/log/httpd/w-uh.com/%Y/%m/%d-error.log"
This is all one line. The webserver passes each log entry into the reffilter.ksh script (my new invention), which then passes each entry on to cronolog (which writes the entries into files arranged by year, month, and day). The reffilter.ksh script processes every log entry as follows:
- If the entry doesn't have a "referer", pass it through.
- If the entry's referer is my site, pass it through. This happens a lot; links within the site, and links from pages on my site to image files.
- If the entry's referer is not a well-formed URL, pass it through. A lot of search engine robots and RSS feed readers give a bogus URL; these are actually nice to have, so I leave them.
- If the referer is a well-formed URL which is not my site, I retrieve the page from the URL. If this fails, I pass the referer through. I don't mind having referers which I can't access (because they're password protected, or from an email system, or whatever). No referral spammer would give a bad URL.
- If I was able to retrieve the page, I scan it to find the reference. If there's a link to my site, great, it was a legitimate referral, and I pass it through.
- If there's no reference to my site - aha, I caught you. I piss on you from a great height, and silently remove the referral from the log entry before passing it through.
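For the curious, the rules above can be sketched roughly like this. This is a Python illustration, not my actual ksh script, and it makes some assumptions: the names `is_spam` and `fetch_page` are invented for the example, "well-formed" is taken to mean an http or https URL with a host name, and "scan it to find the reference" is approximated by a plain substring search for the host name rather than real link parsing.

```python
from http.client import HTTPException
from urllib.parse import urlparse
from urllib.request import urlopen

MY_HOST = "w-uh.com"  # taken from the log path above; change for your site


def fetch_page(url, timeout=10):
    """Return the page body as text, or None if it cannot be retrieved."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, HTTPException):
        return None  # unreachable, password protected, malformed, etc.


def is_spam(referer, fetch=fetch_page, my_host=MY_HOST):
    """True only when the referring page loads and never mentions my_host."""
    if not referer or referer == "-":
        return False  # no referer: pass it through
    try:
        parts = urlparse(referer)
        host = parts.hostname or ""
    except ValueError:
        return False  # unparseable: treat as not well-formed
    if host == my_host or host.endswith("." + my_host):
        return False  # referral from my own site
    if parts.scheme not in ("http", "https") or not host:
        return False  # not a well-formed URL (robots, feed readers)
    page = fetch(referer)
    if page is None:
        return False  # could not retrieve the page: benefit of the doubt
    # Crude stand-in for "is there a link to my site on this page?"
    return my_host.lower() not in page.lower()
```

Passing `fetch` in as a parameter makes it possible to try the rules without touching the network, for example `is_spam("http://spam.example/", fetch=lambda url: "<html>buy stuff</html>")`. When `is_spam` returns True, the filter would blank the referer in the entry before handing it on.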
So far this morning I have filtered 13 spams. Very satisfying. Yes, it is an odd way to spend Thanksgiving morning. But then, I am odd, so there you are.
BTW, yes, "referer" is misspelled. Someone at NCSA spelled it wrong at time zero, and now we're all stuck with it. You've got to love that.
P.S. If you would like my little script for your own use, please shoot me an email; I'm happy to share.