Critical Section


Google and Blogs

Sunday,  05/11/03  11:26 AM

There's been considerable discussion in the blogosphere about Google "dropping blogs" from search results.  Dave Winer linked Andrew Orlowski's article about Eric Schmidt's comments; more recently Dave links Evan Williams' reply that Orlowski is full of crap.  So what's the truth?  Unlike Evan, I have no inside knowledge (Evan is the founder of Pyra, makers of Blogger, which was recently purchased by Google), but here's some educated guesswork...

First, Google is all about delivering accurate search results.  If they thought dropping blogs would help, that's why they would do it.  (Not because they dislike blogs or have some philosophical axe to grind.)  So we need to think about whether blogs improve search results or not.  Second, Google has a history of separating search domains in their GUI (images, groups, directory, news).  Each of these domains has different characteristics, and when users search they generally know which domain they want to search within.  It is reasonable to assume that rather than dropping blogs altogether, Google would establish a new domain for them.  So we need to think about why they would do this and how it might work.  Finally, Google works great for most sites, but the way they index blogs could be improved.  So we can think about how blogs could best be indexed.

Dave asked "how will it [Google] tell the difference [between blogs and everything else]"?  It isn't obvious; the lines between news sites, personal home pages, company sites, e-commerce stores, blogs, etc. are blurry.  Still, there are technical ways to distinguish them (blogs ping weblogs.com, they have RSS feeds, etc.).  More on this below, but for now let's think about the differences a search engine would care about:

  1. Blogs' content changes frequently.
  2. Blogs are link-rich and content-poor.
  3. Blogs contain personal opinion.

If you think about it, these things all make blogs less useful to search engines.  Let's consider them in turn:

Blogs' content changes frequently.  Blogs are chronological diaries; many bloggers post at least once a day and some post multiple times a day.  Each post usually has a "permalink" (a URL which always links to the post), but the blog itself has a constant URL, and the content of that URL is always changing.  Consider my little blog; I post about once per day, and Google's spider visits me about once per day.  It takes Google some time before their spider's data are indexed and absorbed, so what Google "thinks" is on my blog's home page is accurate for only a few hours at a time.  This is shown vividly by looking at my referer logs; Google often directs people to my home page based on content which is no longer there!

Blogs are link-rich and content-poor.  Many posts on a blog simply link to other posts on other blogs, perhaps adding some commentary and/or associating multiple posts with similar content together.  Not all blogs are that way - this is the "thinkers" vs. "linkers" distinction I've mentioned before - but overall if Google directs a searcher to a blog, they're more likely to find links than the information itself.  There is value in having the links aggregated by the blogger, but that's what Google does anyway.  So most blog posts are not very good targets for a search, even if many other bloggers have linked to them.

Blogs contain personal opinion.  By their very nature, blogs are one or a small number of people's thoughts about their world.  Blogs which blandly report news are uncommon; most blogs are full of philosophy, politics, sociology, and general spin.  This is what makes them interesting and fun to read, but it isn't clear this is helpful for someone searching for information.  If you are searching for "George Bush landing on the U.S.S. Lincoln", that's what you want to find, not 1,000 bloggers' personal opinions about George Bush's landing.

So I can see why Google might want to exclude blogs from search results.  By the same token, blogs have information that can't be found anywhere else; they are an incredible source of information.  The information takes several forms:

  • Firsthand accounts of news events.  Frequently bloggers "are there", and contribute detail and insight (and photos) unavailable anywhere else.
  • Links connecting information together in virtual threads.  The interconnections between blog posts are amazingly informative.  Consider the brief thread I described above: Dave Winer -> Evan Williams -> Andrew Orlowski -> Eric Schmidt.  Each added information to the overall picture, but I never would have found these connections by simply searching Google.
  • Personal opinions.  I noted above that if you are searching for information about George Bush's landing, blogs would not be helpful.  { Except for a firsthand account, of course; what if a Navy seaman blogged about the event? }  But if you wanted to know what people thought about the landing, checking blogs is absolutely the thing to do (as opposed to, say, taking CNN's or Fox's word for it).
  • Discussions.  In addition to one person's opinion, you have the give-and-take between many people.  Frequently blogs have comment threads which host the discussion.  Or bloggers may link back and forth on their own blogs, perhaps connected by trackbacks.  The discussion is often more illuminating than the original information.

So I can see where Google would definitely want to continue presenting blogs' information, but segregated into a different search domain.  They would do this for another reason, too - to improve the presentation of results.  Google News results are different from Google Web results, and they are presented differently too, as a reflection of the underlying differences in the content.

There is no doubt Google's approach to indexing web sites made a qualitative improvement in web searching.  But there are ways blogs can be indexed which would be a big step forward:

  • Use weblogs.com for currency.  Most blogs "ping" weblogs.com whenever their content changes.  Google could use this to determine when a blog's content has changed and schedule their spiders accordingly.  By the same token any site which pings weblogs.com should be considered a blog.  If Google did this, everyone with a blog would want to ping.

That's the answer to Dave's question "how will it tell the difference?" - it will ask the bloggers!
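To make the ping idea concrete, here is a minimal sketch in Python of a crawl queue driven by "my blog changed" notifications.  Everything in it is an assumption for illustration: the class name, the method names, and the idea that a ping arrives as a URL plus a timestamp.  It is not how weblogs.com or Google actually work, just one way the scheduling could be modeled.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class CrawlScheduler:
    """Toy crawl queue driven by change pings (hypothetical design)."""
    known_blogs: set = field(default_factory=set)
    _queue: list = field(default_factory=list)    # heap of (ping_time, url)
    _pending: set = field(default_factory=set)    # urls already waiting

    def ping(self, url: str, ping_time: float) -> None:
        # Any site that pings is treated as a blog from then on.
        self.known_blogs.add(url)
        # Don't queue the same blog twice before it has been crawled.
        if url not in self._pending:
            self._pending.add(url)
            heapq.heappush(self._queue, (ping_time, url))

    def next_to_crawl(self):
        # Oldest outstanding ping first; None when nothing has changed.
        if not self._queue:
            return None
        _, url = heapq.heappop(self._queue)
        self._pending.discard(url)
        return url

    def is_blog(self, url: str) -> bool:
        return url in self.known_blogs
```

The point of the design is that the spider only visits a blog after the blog says it has changed, and the same ping doubles as the signal that the site is a blog at all.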

  • Use RSS feeds for content.  Most blogs have RSS feeds which abstract their content.  Google can use blogs' RSS feeds to determine what posts are at which URLs without laboriously spidering each blog every time it is updated.  If Google did this everyone with a blog would want to have an RSS feed.
  • Model the interconnections between posts.  The multithreaded world of links between blogs contains a mine of information - as shown by Technorati, Dave Sifry's terrific search engine.  If Google could provide a way to find and display these threads, it would be really cool.  Currently we have comments, trackbacks, links between sites, etc. - all valuable and all different - and it is tough to get the big picture without a lot of clicking around.
  • Aggregate opinions.  The magic of Google is that they use links to index pages, instead of the contents themselves.  ("You have what people say you have.")  This technique applied to blog posts could be very valuable: use links to categorize an expressed opinion, instead of the opinion itself.  ("You think what people say you think.")
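As a rough illustration of the RSS idea in the list above, the sketch below reads a feed and maps each post's permalink to its title and summary, with no spidering of the blog's pages.  It assumes a simple RSS 2.0 layout (channel, item, title, link, description); the sample feed, its URLs, and the function name are made up for the example, and real feeds vary a good deal more than this.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed, assuming a plain RSS 2.0 layout.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://blog.example/</link>
    <item>
      <title>Google and Blogs</title>
      <link>http://blog.example/2003/05/11.html</link>
      <description>Thoughts on how blogs could be indexed.</description>
    </item>
    <item>
      <title>Another Post</title>
      <link>http://blog.example/2003/05/10.html</link>
      <description>Something else entirely.</description>
    </item>
  </channel>
</rss>"""


def posts_from_feed(feed_xml: str) -> dict:
    """Map each post's permalink to its title and summary text."""
    root = ET.fromstring(feed_xml)
    posts = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        if not link:
            continue  # no permalink, nothing to index against
        posts[link.strip()] = {
            "title": (item.findtext("title") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
        }
    return posts


if __name__ == "__main__":
    for url, post in posts_from_feed(SAMPLE_FEED).items():
        print(url, "-", post["title"])
```

Indexing against the permalink rather than the blog's home page also sidesteps the stale-home-page problem described earlier.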

No doubt there are other ways, too.  By segregating blogs and treating them differently, Google could improve the blog searching experience.  Which in turn would make the information on blogs more valuable.
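Here is a toy version of the "you have what people say you have" idea from the list above: describe a post by the words other sites use when they link to it, rather than by the post's own text.  The function name and the (source, target, link text) tuple format are assumptions for the sketch, and Google's real ranking is certainly far more involved than counting words.

```python
from collections import Counter, defaultdict


def describe_by_inbound_links(links):
    """links: iterable of (source_url, target_url, link_text) tuples.

    Returns {target_url: Counter of words other pages used when linking to it}.
    """
    described = defaultdict(Counter)
    for source, target, link_text in links:
        if source == target:
            continue  # a page's own words about itself do not count
        described[target].update(link_text.lower().split())
    return dict(described)
```

Feeding in the links from a set of blogs would give, for each post, a tally of how everyone else characterized it, which is the raw material for "you think what people say you think."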

Wrapping up, here are my conclusions:

  • Google might want to exclude blogs from search results.
  • Google would definitely want to continue presenting blogs' information, segregated into a different search domain.
  • Google could improve the blog searching experience by leveraging attributes of the blogs themselves, such as weblogs.com, RSS feeds, comments, and trackbacks, and by applying their technique of using links to categorize content.

Those are my thoughts; I'm sure you'll have others.  I'll search for them :)

P.S. Click here for a Technorati search for blogs which link to Orlowski's article.  There are 195 listed, each of which has other inbound links, comment threads, trackbacks, etc.  Amazing!

 
