Critical Section

Archive: November 25, 2004



referral spam be gone

Thursday,  11/25/04  09:00 AM

Do you blog?  Have you noticed an increase in "referer spam"?  I have.  Seems like every day now I get about three sites which come along and "link" to every page on my blog.  It's become annoying, because I really enjoy looking through my referral logs; it's one of the best ways to find cool new blogs.  Anyway, this morning I decided to do something about it: I wrote a little spam filter for my server logs.

{
What's all this about?  Well, every time someone requests a page from my server, it causes the server to write a log entry into a file.  As part of the request, there can be a "referer", which is the address of the page from which the request was made.  If the request was a result of a link, the referring page will be the page which contained the link.  If site X has a link to me, and someone clicks on that link, the log will have the URL of the page on site X.

Unfortunately there are dirtbags out there who exploit this as a means of publicity; they make bogus requests to my site giving the URL of their site as the referer.  Of course their site doesn't actually link to me; they just want me to go check out their site.  It is a very lame way of publicizing a site because 1) the only audience for the referers is a site's webmaster and 2) if the webmaster does visit the referred-to site, they'll already have a really low opinion of its operators.
}
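To make the "referer" field concrete, here is a minimal sketch of pulling it out of one log entry.  It assumes lines in Apache's combined access-log format, which end with the quoted referer followed by the quoted user agent; the post doesn't show a sample entry, so the format, the sample line, and the name extract_referer are all illustrative assumptions.

```python
import re

# Assumption: combined-format lines end with  "referer" "user-agent"
COMBINED_TAIL = re.compile(r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"\s*$')

def extract_referer(line):
    """Return the referer field of a combined-format log line, or None."""
    match = COMBINED_TAIL.search(line)
    if match is None:
        return None
    referer = match.group("referer")
    # Apache writes "-" when the request carried no Referer header.
    return None if referer in ("", "-") else referer

# Made-up sample entry (documentation-only address and hosts).
sample = ('192.0.2.1 - - [25/Nov/2004:09:00:00 -0800] '
          '"GET /archive.html HTTP/1.1" 200 5120 '
          '"http://example.com/links.html" "Mozilla/5.0"')
print(extract_referer(sample))
```

If someone on example.com really clicked a link to get here, that URL is what shows up; a spammer just fills in the same field with whatever they like.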

So, what to do?  Well, I'm already piping logs through a filter, the great little [free] program called cronolog.  I just added another filter to get rid of referer spam.  Here's my new ErrorLog entry (in Apache's httpd.conf):

ErrorLog "| /var/log/httpd/w-uh.com/reffilter.ksh | /usr/local/bin/cronolog /var/log/httpd/w-uh.com/%Y/%m/%d-error.log"

This is all one line.  The webserver passes each log entry into the reffilter.ksh script (my new invention), which then passes each entry on to cronolog (which writes the entries into one file per day, under directories named for the current year and month).  The reffilter.ksh script processes every log entry as follows:

  • If the entry doesn't have a "referer", pass it through.
  • If the entry's referer is my site, pass it through.  This happens a lot: links within the site, and links from pages on my site to image files.
  • If the entry's referer is not a well-formed URL, pass it through.  A lot of search engine robots and RSS feed readers give a bogus URL; these are actually nice to have, so I leave them.
  • If the referer is a well-formed URL which is not my site, I retrieve the page from the URL.  If this fails, I pass the referer through.  I don't mind having referers which I can't access (because they're password protected, or from an email system, or whatever).  No referral spammer would give a bad URL.
  • If I was able to retrieve the page, I scan it to find the reference.  If there's a link to my site, great, it was a legitimate referral, and I pass it through.
  • If there's no reference to my site - aha, I caught you.  I piss on you from a great height, and silently remove the referral from the log entry before passing it through.
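The rules above can be sketched roughly as follows.  This is not the reffilter.ksh script itself (which isn't shown here); it's a Python approximation under stated assumptions: entries are in combined access-log format, "my site" is w-uh.com (inferred from the log path), and "has a link to my site" is simplified to "the page text mentions the host name".  All function names are hypothetical.

```python
import re
import sys
from http.client import HTTPException
from urllib.parse import urlparse
from urllib.request import urlopen

MY_HOST = "w-uh.com"  # assumption: inferred from the log path in the post
# Assumption: combined-format lines end with  "referer" "user-agent"
REFERER_FIELD = re.compile(r'"(?P<referer>[^"]*)"(?= "[^"]*"\s*$)')

def fetch_page(url, timeout=10):
    """Return the page body as text, or None if it cannot be retrieved."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError, HTTPException):
        return None

def is_referer_spam(referer, fetch=fetch_page, my_host=MY_HOST):
    """Apply the rules from the list; True means the referer should be removed."""
    if not referer or referer == "-":
        return False                    # no referer: pass through
    try:
        parts = urlparse(referer)
        host = (parts.hostname or "").lower()
    except ValueError:
        return False                    # unparseable: treat as not well-formed
    if parts.scheme not in ("http", "https") or not host:
        return False                    # not a well-formed URL: pass through
    if host == my_host or host.endswith("." + my_host):
        return False                    # referer is my own site: pass through
    page = fetch(referer)
    if page is None:
        return False                    # could not retrieve it: pass through
    # Simplification: any mention of the host counts as a link back.
    return my_host not in page.lower()

def filter_line(line, fetch=fetch_page):
    """Replace a spam referer with "-"; return every other line unchanged."""
    match = REFERER_FIELD.search(line)
    if match and is_referer_spam(match.group("referer"), fetch):
        return line[:match.start()] + '"-"' + line[match.end():]
    return line

def main():
    """Read entries on stdin and write filtered entries to stdout."""
    for line in sys.stdin:
        sys.stdout.write(filter_line(line))
        sys.stdout.flush()              # keep the next program in the pipe fed
```

To use something like this as a piped-log program, the script would end with a call to main().  Note that fetching a page for every outside referer slows logging down, so a real filter would probably want to remember verdicts for URLs it has already checked.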

So far this morning I have filtered 13 spams.  Very satisfying.  Yes, it is an odd way to spend Thanksgiving morning.  But then, I am odd, so there you are.

BTW, yes, "referer" is misspelled.  Someone at NCSA spelled it wrong at time zero, and now we're all stuck with it.  You've got to love that.

P.S. If you would like my little script for your own use, please shoot me an email; I'm happy to share.

 
 
