Getting an archive of blog posts

Where can I get an archive of past blog posts? With 26 million blogs, the archive would be enormous.

A query on Alexa shows they have 825k RSS pages (searching for "mime=XML"), but that's just a drop in the ocean. And their T&C prohibit copying the data, even though, unlike web pages, RSS feeds are *INTENDED* for syndication.

Aaargh. So I'm back to wondering how to get an archive. Does anyone have one for sale (at a reasonable price… I'm just a guy, not a big corp)?

Update: 23 March 2006

I had extended discussions with the Alexa Web Search project manager, Greger, and the upshot of it all is that we can't use them as a spider and get the content from them. Their terms and conditions specifically forbid copying the data.

What is Alexa Web Search for? We are only meant to use them as a search engine via their API. You can only collect the search results data, not the underlying data. So it wasn't a good outcome for me. 😦

Here's what I think is wrong with their strategy:

  • I can't get the content, so I can't innovate by using the content they have collected
  • They want to do the search themselves through their API. This means I can't take the content, index it myself (with SQL Server full-text indexing, etc.) and optimize my own searches according to what I perceive as my users' needs
  • Their content is out of date, but they may extend the API so that someone can request far more frequent updates/spidering of certain websites.

I think they would be far better off allowing me to buy SO MUCH data that I eventually have scalability problems and I begin to look for a hosted solution… which is what their API offers perfectly. But I'm not there yet, I'm here – I have small time needs, not big time needs.

So it seems the only way forward is either:

  • to begin collecting the data ourselves, or
  • to build a system which doesn't require an archive, which is how most meme services work: they act on current themes and don't keep archives, or
  • to buy a defunct website which has been going for a few years and has somewhat of an archive, or
  • to find someone who has an archive and is willing to share it (or sell it cheaply)
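
For the first option, collecting the data ourselves, here is a rough sketch of what the smallest possible collector might look like. It is not what reBlogger does, just an illustration: it assumes plain RSS 2.0 feeds, uses only the Python standard library, and the function names, the `posts` table and the SQLite storage are all made up for the example.

```python
import sqlite3
import urllib.request
import xml.etree.ElementTree as ET


def fetch_feed(url, timeout=30):
    """Download the raw bytes of a feed (hypothetical helper, no retries)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def parse_rss_items(rss_xml):
    """Return (guid, title, link, pub_date) tuples from RSS 2.0 text or bytes."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        # Assumption: when a feed omits <guid>, the link is a good enough key.
        guid = item.findtext("guid", default=link)
        title = item.findtext("title", default="")
        pub_date = item.findtext("pubDate", default="")
        items.append((guid, title, link, pub_date))
    return items


def open_archive(path=":memory:"):
    """Open (or create) the archive database; the schema is illustrative only."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "guid TEXT PRIMARY KEY, title TEXT, link TEXT, pub_date TEXT)"
    )
    return conn


def archive_items(conn, items):
    """Store items, skipping guids already archived; return how many were new."""
    before = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    conn.executemany("INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?)", items)
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
    return after - before
```

Calling `archive_items(conn, parse_rss_items(fetch_feed(url)))` on a schedule for each feed would slowly build an archive, since a feed only ever exposes its most recent posts. The catch, of course, is that this only archives from today onward and does nothing for the past.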

Onward and upward. 🙂

Update (Tues 4 April 2006):

TalkDigger (blog) also needs an archive, the same problem I wrote about in "Getting an archive of blog posts" and that we are struggling with on reBlogger. They may also consider building a crawler. What a shame that Alexa doesn't see this need and fill it for us. (hint hint)


