Where can I get an archive of past blog posts? With 26 million blogs, the archive would be enormous.
A query on Alexa shows they have 825k RSS pages (searching for "mime=XML"), but that's just a drop in the ocean. And their T&C prohibit copying the data - even though, unlike web pages, RSS feeds are *INTENDED* for syndication.
Aaargh. So I'm back to wondering how to get an archive. Does anyone have one for sale (at a reasonable price… I'm just a guy, not a big corp)?
Update: 23 March 2006
I had extended discussions with the Alexa Web Search project manager, Greger, and the upshot of it all is that we can't use them as a spider and get the content from them. Their terms and conditions specifically forbid copying the data.
What is Alexa Web Search for? We are only meant to use them as a search engine via their API. You can only collect the search results data, not the underlying data. So it wasn't a good outcome for me.
Here's what I think is wrong with their strategy:
- I can't get the content, so I can't innovate by using the content they have collected
- They want to do the search themselves through their API. This means I can't get the content, index it myself (SQL Server full text indexing etc.) and optimize my own searches according to what I perceive as my users' needs
- Their content is out of date, but they may extend the API so that someone can request far more frequent updates/spidering of certain websites.
I think they would be far better off allowing me to buy SO MUCH data that I eventually have scalability problems and I begin to look for a hosted solution… which is what their API offers perfectly. But I'm not there yet, I'm here - I have small time needs, not big time needs.
So it seems the only way forward is either:
- to begin collecting the data ourselves, or
- to build a system which doesn't require an archive, which is what most meme-services are like: they act on current themes and don't keep archives, or
- to buy a defunct website which has been going for a few years and has somewhat of an archive, or
- to find someone who has an archive and is willing to share it (or sell it cheaply)
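If we go with the first option and collect the data ourselves, the core loop is small. Here is a minimal sketch in Python, assuming plain RSS 2.0 feeds and a local SQLite file as the archive. The function names, the table layout and the choice of the item guid as the de-duplication key are my own illustration, not anything Alexa or reBlogger provides.

```python
import sqlite3
import urllib.request
import xml.etree.ElementTree as ET


def fetch_feed(url):
    """Download the raw feed bytes (hypothetical helper; no retries or caching)."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return resp.read()


def parse_rss_items(xml_text):
    """Return (guid, title, link, pub_date) tuples from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        # Fall back to the link when the feed supplies no guid.
        guid = (item.findtext("guid") or link).strip()
        if not guid:
            continue  # nothing stable to de-duplicate on, so skip it
        title = (item.findtext("title") or "").strip()
        pub_date = (item.findtext("pubDate") or "").strip()
        items.append((guid, title, link, pub_date))
    return items


def open_archive(path=":memory:"):
    """Open (or create) the archive database with a single posts table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "guid TEXT PRIMARY KEY, title TEXT, link TEXT, pub_date TEXT)"
    )
    return conn


def archive_items(conn, items):
    """Store items, skipping guids already archived; return how many were new."""
    count = "SELECT COUNT(*) FROM posts"
    before = conn.execute(count).fetchone()[0]
    conn.executemany("INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?)", items)
    conn.commit()
    return conn.execute(count).fetchone()[0] - before
```

Calling archive_items(conn, parse_rss_items(fetch_feed(url))) on a schedule for each feed would slowly build the archive. The catch is that a feed typically lists only its most recent posts, so an archive built this way starts on the day the collector starts. It does nothing for the years already gone.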
Onward and upward.
Update (Tues 4 April 2006):
TalkDigger (blog) is also struggling with the need for an archive, as I have written about before (Getting an archive of blog posts) - we have the same problem with reBlogger. They may also consider building a crawler. What a shame that Alexa doesn't see this need and fill it for us. (hint hint)
