Pheed Read #3

Pheed Read #3 - RSS Feeds Provide Untapped Advertising Audience

"Pheed Read" is released quarterly by Pheedo and details trends in RSS usage. This new one (only available in PDF - doh!) is not terribly good.

But don't miss the previous two Pheed Reads which are superb and useful:

Enjoy!

Blog search engines compared

This is a good post by Library Clips. Read more here: Bloglines blog search engine

The relevancy is based on subscriber numbers (so anything you write may be relevant to the search term by default, as you have lots of subscribers)…Technorati bases it on incoming links (again what about the long tail, and also this is just popularity)…whereas Sphere bases it on a number of things (incoming links, subscribers, content analysis, comments, etc…), I think these results will be more relevant and also reveal posts from blogs you don’t usually see.

As usual John does a fab job of covering the options, choices and competitors. Library Clips rocks! More details about the new Bloglines blog search can be found at TechCrunch (Finally! Bloglines Blog Search).

Google co-op

I'm still investigating the social stuff.

Google Co-op is a platform which enables you to use your expertise to help other users find information

I read some interesting comments about Co-op in Google Co-Op - Google Embracing Social Search?

There are obvious comparisons to Rollyo, Filangy, Prefound, Wink and other social search plays, but frankly Co-op just doesn’t cut it. It feels like another Google Base to me - ambitious in its scope, but utterly bamboozling to the user. Frankly, I’m not sure that Google will ever get social search right - community-building just isn’t in their DNA.

Scrolling down to the comments for that blog post… Google gets hammered for not making the application clearer and easier to use and this comment was particularly interesting:

One thing that bothers me at times about Google is how this massive corporation sometimes appears to act like an opensource, nonprofit project.

Froogle, Base, Co-op, and others all seem to depend on other people doing most of the work, and then Google ultimately owns the data.

I did some searches in the SEOData reBlogger to see what other posts on Google Co-op I could find:

The thing is… companies like Google keep on trying until they get it right. Windows 1.0 was nothing, 2.0 was nothing, 3.0 was good, 3.1 was big, 3.11 (with networking) was massive. reBlogger is the same, we'll just keep on going through the versions until we crack it. No one currently knows how to do "social" exactly right, so it's open slather for anyone. Google will eventually get this right and I suspect it will revolve around voting.

Here is the Wink collection for Google Co-op.

built_with_reblogger2006.gif

Tracking future Windows releases by using reBlogger

Have you seen Google Trends yet? It tracks search activity and maps each surge to the news event that preceded it.

I spoke about the inverse concept in the interview I did with Robin Good, reBlogger: Digg-In-A-Box. The key difference is that I would track blog posts and not searches. Why? Searches are consumer oriented, but posts come directly from the source! It's obvious that consumers don't know the actual release date, but the bloggers inside the company's software team do. Their blog activity can give hints of what is happening in the team. Even if the content they post doesn't specify the date, their activity could indicate something.

So Google tracks the number of searches and maps the news event that caused the surge in interest, but I'm suggesting mapping the blog activity and projecting it onto a future event. We sometimes see predictive activity in searching, but only for a very widely known upcoming event: for example Christmas.

You can use reBlogger to track the activity of a particular group within a competing company. The value is huge for a business which tracks its competition. Most company programmers blog (and have an OPML file) so I'm thinking that all we need is to find all their bloggers, group them and then count their post activity - and then generate a graph. Watch for any irregular change (a drop or a spike) and you know something is happening. It's so easy!
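Here is a minimal sketch of the count-and-watch step, assuming you already have the post dates for one group of bloggers. The function names, the 7-day window, the factor of 2 and the sample data are all hypothetical, and this is not reBlogger's actual code. It counts posts per day and flags any day that is far above or far below the trailing average.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(post_dates, start, end):
    """Count posts per day from start to end (inclusive); quiet days count as 0."""
    counts = Counter(post_dates)
    span = (end - start).days + 1
    return [(start + timedelta(days=d), counts[start + timedelta(days=d)]) for d in range(span)]


def flag_irregular_days(series, window=7, factor=2.0):
    """Flag days that are a spike or a drop relative to the trailing average."""
    flagged = []
    for i in range(window, len(series)):
        day, count = series[i]
        # Average of the `window` days immediately before this one.
        baseline = sum(c for _, c in series[i - window:i]) / window
        if count > factor * baseline:
            flagged.append((day, "spike"))
        elif count < baseline / factor:
            flagged.append((day, "drop"))
    return flagged


# Hypothetical data: a team that posts twice a day, goes loud on 8 May, then silent on 10 May.
start = date(2006, 5, 1)
post_dates = [start + timedelta(days=d) for d in range(9) for _ in range(2)]
post_dates += [date(2006, 5, 8)] * 8
series = daily_counts(post_dates, start, date(2006, 5, 10))
flagged = flag_irregular_days(series)
```

A trailing average is only one way to define "irregular". A real tracker would probably need a longer history and some allowance for weekends and holidays.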

You can get the reBlogger 30 day free demo and install it (requires Microsoft SQL Server or SQL Server Express).

25 Things I Learned on Google Trends (humor)

Steve Rubel does some fabulous investigations. I particularly like:

15) Blogs have caught up to newspapers

18) Digg caught up to Slashdot.

19) Interest in blogs and RSS is much higher than in podcasting and wikis

Enjoy!

Blog Search Engine Sphere Launches

I previously posted about Sphere (blog) in Sphere… of influence but their site wasn't live. It is now!

I do like the way they have different pages with different ways of looking at a query. They’ve got:

The coolest thing has to be the custom range slider. When viewing results by relevance, you can choose a date range (so they are relevant, but not ancient), and you can choose predetermined ranges, or use a slider to make your own on the fly.

UPDATE: They have managed to get quite a bit of buzzzzz on the blogs about their launch. Check out their BlogPulse.

(I found their announcement on reBlogger)

Next generation search algo

I was dreaming over the weekend (again) and I wondered… what if… what if datacenters were neural networks. Before you laugh and go somewhere else, let me explain why this idea would return the very best search results.

A neural network "learns" by being trained. An operator makes statements like: this is a human head. The neural network learns to recognise that image as a human head. The operator shows it thousands of different kinds of heads (and things that are not heads are shown as NOT a human head). Eventually the neural network begins to ask questions (is THIS a human head?) and the operator says yes or no. Over time the neural network (NN) can correctly tell a human head apart from a basketball or a piece of fruit.

What if Google trained their datacenters to recognise good pages? Right now they are using inbound links to value a page, but that idea is time-limited. Google could use the user's clicks as training. The NN puts up a variety of pages with varying amounts of information and watches what people click on. With a cookie it can figure out which page I stayed on and which page I didn't. With this information (which Google already has) it can correlate the query (the search statement) to high performing pages.
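To make the click-training idea concrete, here is a toy sketch. It is emphatically not how Google works. It uses a single logistic neuron (the smallest possible "network") and two made-up features per query/page pair, `query_overlap` and `link_score`. The "clicks" are simulated so that users stay only on pages that match the query well.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class ClickModel:
    """A single logistic neuron trained online from click feedback."""

    def __init__(self, n_features, learning_rate=0.2):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, features):
        """Estimated probability that the user stays on the page."""
        z = self.bias + sum(w * x for w, x in zip(self.weights, features))
        return sigmoid(z)

    def train(self, features, stayed):
        """One update from one observed click: stayed is 1 (stayed) or 0 (bounced)."""
        error = stayed - self.predict(features)
        self.weights = [w + self.learning_rate * error * x
                        for w, x in zip(self.weights, features)]
        self.bias += self.learning_rate * error


# Simulated click log (an assumption): features are (query_overlap, link_score),
# and users stay only when the page matches the query well, whatever its links.
samples = [((i / 10, j / 10), 1 if i >= 6 else 0)
           for i in range(11) for j in range(11)]

model = ClickModel(n_features=2)
rng = random.Random(0)
for _ in range(100):
    rng.shuffle(samples)
    for features, stayed in samples:
        model.train(features, stayed)

good_match = model.predict((0.9, 0.1))   # relevant page with few links
popular_miss = model.predict((0.1, 0.9))  # heavily linked page that misses the query
```

After training, the model should rate the relevant but poorly linked page above the popular but irrelevant one. That is the whole point of learning from clicks rather than from inbound links alone.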

Sure this would take time and money - but if the NN is able to correctly learn about what I am looking for and identify what page best meets my needs… then it's a killer search algo. The results page would be 100% accurate all of the time. It will always be learning how to serve up the best search results.

Sphere… of influence

Heads up! So what is Sphere? And what is so much better about it?

John Battelle's Searchblog

Sphere works better than other blog search I've seen, plain and simple…. when a Searchblog author goes off topic and rants about, say, Jet Blue, that that author's rant will probably not rank as high for "Jet Blue" as would a reputable blogger who regularly writes about travel, even if that Searchblog author has a lot of high-PageRank links into his site.

Om Malik's Broadband Blog

Think Blog Rank, Instead of Google’s Page Rank. The company has also taken a few steps to out-smart the spammers, and tend to push what seems like spam-blog way down the page. Not censuring but bringing up relevant content first. They have pronoun checker. Too many I’s could mean a personal blog, with less focused information.
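The "pronoun checker" is easy to picture. This is only a guess at the idea, not Sphere's implementation, and the word list and function name are my own assumptions: measure what fraction of a post's words are first-person pronouns.

```python
import re

# Hypothetical word list; a real checker would likely be more careful.
FIRST_PERSON = {"i", "me", "my", "mine", "myself"}


def first_person_ratio(text):
    """Fraction of words in the text that are first-person singular pronouns."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(1 for word in words if word in FIRST_PERSON) / len(words)


diary = first_person_ratio("I think I love my cat")
news = first_person_ratio("Sphere launched a new blog search engine today")
```

A high ratio hints at a personal diary, a low one at a more topic-focused post. Where to draw the line is anyone's guess.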

TechCrunch

Sphere is a new blog search engine that quite frankly blows everything, and I mean everything, I’ve seen out of the water in terms of relevance. Until now, no one has come up with a way to properly sort blog posts by relevance, and the general default way of showing results is “reverse-chrono”, which simply puts the newest stuff at the top. Sphere appears to have solved the problem, or at least taken big steps in the right direction. Their approach involves three key algorithms - an analysis of links into and out of a blog, an analysis of metadata around a post (links, post frequency, length of posts, etc.), and something Tony calls their “secret sauce”, which is content semantic analysis to filter out spam and to understand what a blog post is talking about. Result sets show only two posts per blog on the first page, so no one blog can dominate a category…
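To picture how several signals and a per-blog cap could fit together, here is a rough sketch. Sphere's real algorithms are not public. The three score fields, the weights and the function name are assumptions made up for illustration; only the "two posts per blog" cap comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class Post:
    blog: str
    title: str
    link_score: float      # hypothetical: links into and out of the blog
    metadata_score: float  # hypothetical: post frequency, post length, etc.
    content_score: float   # hypothetical: how well the text matches the query


def rank_posts(posts, weights=(0.4, 0.2, 0.4), per_blog_limit=2):
    """Sort by a weighted blend of the three scores, showing at most N posts per blog."""
    def score(post):
        return (weights[0] * post.link_score
                + weights[1] * post.metadata_score
                + weights[2] * post.content_score)

    shown_per_blog = {}
    results = []
    for post in sorted(posts, key=score, reverse=True):
        shown = shown_per_blog.get(post.blog, 0)
        if shown < per_blog_limit:
            results.append(post)
            shown_per_blog[post.blog] = shown + 1
    return results


posts = [
    Post("blog-a", "a1", 0.9, 0.9, 0.9),
    Post("blog-a", "a2", 0.8, 0.8, 0.8),
    Post("blog-a", "a3", 0.7, 0.7, 0.7),
    Post("blog-b", "b1", 0.5, 0.5, 0.5),
]
first_page = [post.title for post in rank_posts(posts)]
```

Here the third post from blog-a is held back even though it outscores blog-b's post, so no single blog can dominate the page.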

Jeremy Zawodny

Their technology seems far more splog (spam blog) resistant than many of the other engines. They don't actively filter it out, but the spam blogs end up being ranked so low that you rarely encounter them. That sounds like the right approach to me.

BusinessWeek's Stephen Baker interviews Tony Conrad and Mary Hodder (mp3 audio / podcast)

Looks interesting. :) FWIW: I found this through SEOData Blogosphere keyword.

UPDATE:

TypePad Sphere blog search widget (TechCrunch)

Google bowling (or… Eliminating The Competition!)

Finding it hard to beat your competition in rankings? Google bowling! (Link 1, Link 2, Link 3, Link 4). Here are some things they offer to "take down" the opposition (muhahahahaha):

  • Links from bad neighborhoods to ANY site you want
  • Java Scripts or “un”-sneaky Redirects to ANY site you want
  • Mass Automated Querying of their URL in Google

It reminds me of an assassination somehow. I'd cry if it wasn't so funny. :P

Getting an archive of blog posts

Where can I get an archive of past blog posts? With 26 million blogs, the archive would be enormous.

A query on Alexa shows they have 825k RSS pages (searching for "mime=XML"), but that's just a drop in the ocean. And their T&C prohibit copying the data - even though, unlike web pages, RSS feeds are *INTENDED* for syndication.

Aaargh. So I'm back to wondering how to get an archive. Does anyone have one for sale (at a reasonable price… I'm just a guy, not a big corp)?

Update: 23 March 2006

I had extended discussions with the Alexa Web Search project manager, Greger, and the upshot of it all is that we can't use them as a spider and get the content from them. Their terms and conditions specifically forbid copying the data.

What is Alexa Web Search for? We are only meant to use them as a search engine via their API. You can only collect the search results data, not the underlying data. So it wasn't a good outcome for me. :(

Here's what I think is wrong with their strategy:

  • I can't get the content, so I can't innovate by using the content they have collected
  • They want to do the search themselves through their API. This means I can't get the content and use SQL Server full text indexing etc. and optimize my own searches according to what I perceive as my users' needs
  • Their content is out of date, but they may extend the API so that someone can request far more frequent updates/spidering of certain websites.

I think they would be far better off allowing me to buy SO MUCH data that I eventually have scalability problems and I begin to look for a hosted solution… which is what their API offers perfectly. But I'm not there yet, I'm here - I have small time needs, not big time needs.

So it seems the only way forward is either:

  • to begin collecting the data ourselves, or
  • to build a system which doesn't require an archive - which is what most meme-services are like, they act on current themes and don't keep archives, or
  • to buy a defunct website which has been going for a few years and has somewhat of an archive, or
  • to find someone who has an archive and is willing to share it (or sell it cheaply)
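For the first option (collecting the data ourselves), the core of an archiver is small. The sketch below is an assumption-laden illustration, not reBlogger's code. It takes the text of an RSS 2.0 feed, stores each item in SQLite keyed by its link, and skips items it has already seen. The table layout and function names are made up, and fetching the feed over HTTP on a schedule is left out.

```python
import sqlite3
import xml.etree.ElementTree as ET


def open_archive(path=":memory:"):
    """Open (or create) the archive database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS posts ("
        "link TEXT PRIMARY KEY, title TEXT, published TEXT, body TEXT)"
    )
    return conn


def archive_feed(conn, feed_xml):
    """Store every <item> of an RSS 2.0 document; return how many were new."""
    count_sql = "SELECT COUNT(*) FROM posts"
    before = conn.execute(count_sql).fetchone()[0]
    for item in ET.fromstring(feed_xml).iter("item"):
        link = item.findtext("link", default="")
        if not link:
            continue  # without a link there is nothing to de-duplicate on
        conn.execute(
            "INSERT OR IGNORE INTO posts VALUES (?, ?, ?, ?)",
            (link,
             item.findtext("title", default=""),
             item.findtext("pubDate", default=""),
             item.findtext("description", default="")),
        )
    conn.commit()
    return conn.execute(count_sql).fetchone()[0] - before


# A made-up feed, standing in for one fetched from a real blog.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example blog</title>
<item><title>First</title><link>http://example.com/1</link>
<pubDate>Thu, 23 Mar 2006 10:00:00 GMT</pubDate><description>Hello</description></item>
<item><title>Second</title><link>http://example.com/2</link>
<pubDate>Fri, 24 Mar 2006 10:00:00 GMT</pubDate><description>World</description></item>
</channel></rss>"""

archive = open_archive()
first_run = archive_feed(archive, SAMPLE_FEED)
second_run = archive_feed(archive, SAMPLE_FEED)
```

Run it against the same feed every hour and the archive only grows by the new posts. The catch is that it starts empty, so this builds an archive going forward and does nothing for the years already gone.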

Onward and upward. :)

Update (Tues 4 April 2006):

TalkDigger (blog) is also struggling with the need for an archive, as I have written (Getting an archive of blog posts) - we face the same problem with reBlogger. They may also consider building a crawler. What a shame that Alexa doesn't see this need and fill it for us. (hint hint)

A better search

Dave kinda gets it

Want to make a million $? Dave gives out a free idea. He says:

Implement a search engine that accumulates all the stories pointed to by the top meme-engines over time.

I always wondered what would come next after the meme-engines, and now I think this may be it. It's one level more concentrated than the meme-engines.

I think that Dave thinks it's about storing the "thread" of related items that memeorandum collected together, because it's kinda hard to collect them back together later on. Tin Finger has already written along these lines!

My opinion is that reBlogger is going to be even better than that!

  • Why should memeorandum create threads of related items? Users should.
  • And why only include the most current discussion items in that thread? Why not include 1 year old items or 5 year old items as well?
  • And why fix the thread in time? Let it evolve and mature as more and more people edit the thread and republish their own thread.
  • And as threads intersect across posts which they have in common, show the various intersections and the different directions you can go in - all from this one post.
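The intersection idea in the last point can be sketched in a few lines. This is only an illustration with made-up thread names and post IDs, not how reBlogger stores anything: each user-made thread is a list of posts, and for any one post we look up every thread that contains it and where each of those threads leads.

```python
from collections import defaultdict


def threads_by_post(threads):
    """Invert {thread name: [posts]} into {post: set of thread names}."""
    index = defaultdict(set)
    for name, posts in threads.items():
        for post in posts:
            index[post].add(name)
    return index


def directions_from(post, threads):
    """For one post, list the other posts reachable through each thread containing it."""
    index = threads_by_post(threads)
    return {
        name: [other for other in threads[name] if other != post]
        for name in sorted(index.get(post, ()))
    }


# Hypothetical threads published by two users; "p2" appears in both.
threads = {
    "alice/google-coop": ["p1", "p2", "p3"],
    "bob/social-search": ["p2", "p4"],
}
directions = directions_from("p2", threads)
```

From the shared post you can head off along Alice's thread or along Bob's, which is exactly the "different directions from this one post" view.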

He also says:

Make it run off their RSS feeds. You'd have to build it quickly (get there first) and build it to scale, because it would be pretty popular and would grow fast.

He's not wrong. :)

Update: Monday 13 March 2006

Dave's getting warmer and warmer.

I don’t want to only see the stories that most people are interested in, I want interesting stories.

Here is a list of some of the features.

  1. Reverse-chronologic order. Every item gets a shot at being the top item.
  2. Not grouped by which site they came from or which type of site they came from.
  3. The relevance algorithm gets looser so that more items make the grade.
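Those three features amount to very little code. Here is a sketch under my own assumptions (the item fields, the 0.2 cut-off and the function name are invented): keep everything above a loose relevance cut-off, ignore which site it came from, and sort newest first.

```python
from datetime import datetime


def river(items, min_relevance=0.2):
    """Return items newest-first in one flat list, dropping only those below a loose cut-off."""
    kept = [item for item in items if item["relevance"] >= min_relevance]
    return sorted(kept, key=lambda item: item["published"], reverse=True)


# Hypothetical items from two sources.
items = [
    {"title": "Popular story", "source": "big-blog", "relevance": 0.9,
     "published": datetime(2006, 3, 13, 9, 0)},
    {"title": "Niche gem", "source": "small-blog", "relevance": 0.3,
     "published": datetime(2006, 3, 13, 11, 0)},
    {"title": "Off topic", "source": "big-blog", "relevance": 0.1,
     "published": datetime(2006, 3, 13, 12, 0)},
]
titles = [item["title"] for item in river(items)]
```

The newer niche item lands above the popular one, so every item gets its shot at the top spot.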

Dave Winer wants the new reBlogger, he just doesn't know it yet. We better hurry up and code this puppy before someone else does!

50,000 posts an hour and a new blog each second

This post about the astounding growth of the blogosphere really is worth the read.

We track about 1.2 Million posts each day, which means that there are about 50,000 posts each hour. At that rate, it is literally impossible to read everything that is relevant to an issue or subject, and a new challenge has presented itself - how to make sense out of this monstrous conversation, and how to find the most interesting and authoritative information out there.

reBlogger has pretty much nailed removing the noise, leaving only signal. Yay! But this is what keeps me awake and churning over with ideas: how to make sense out of so much useful information. With so much GOOD information out there, how to order it and present it in a useful way.

Let me digress for a moment: It's the classic competition between Yahoo and Google for search.

Google is taking the algorithm approach. They throw thousands of CPUs at the problem and make their algos stronger and better - in the attempt to find relevance. They invest in data crunching. More money. More smarts. More employees.

Yahoo is taking the community approach. They buy del.icio.us and Flickr and just about anything which has either a) content or b) tags and tagging by people. The content is valuable to Yahoo as it pushes AOL off its pedestal. The tagging is useful for their search engine to evaluate what REAL PEOPLE think is relevant.

Now let me return to the post topic… with so much GOOD information out there, how to order it and present it in a useful way. Technorati has the $ to index 27.2 million blogs, they are the Google. We will take the Yahoo approach and create innovative ways for our users to express themselves, to tag their world, to explore the mountain of information out there and to find the gem they need.

It's an exciting task. With reBlogger we have managed to exclude the noise, leaving only the signal - now we need to enhance the ability to find the gems hidden in this morass of information.