December 12, 2006
Mystic Statistics Heuristics
Just in time for the holidays, here's another post about our statistics, and this time we'll describe how we deal with metrics issues, how we think we can improve the kinds of statistics we provide, and admit that despite all this number crunching, we still don't know how many dribs are in a drab (but we know that the answer involves Planck's constant).
With over 500,000 feeds now managed, we deal with statistics anomalies like spiked/tanked subscriber counts, podcast counts, and click counts on a weekly, if not daily, basis. Some of these are larger issues than others, obviously. We're sure that the good people at ComScore, HitWise, and other CamelCase-named statistics companies would agree that there are always issues and anomalies popping up that have to be beaten back with gusto like so many zombies in Dawn (or Shawn) of the Dead.
The goal we always set for ourselves is to try to maintain apples-to-apples comparisons across all types of counting and aggregator/client treatment. In other words, we try to say that regardless of what bucket some metric goes in, it should always result in the ability to look at a couple different pieces of the data (feeds, aggregators, podcatchers, etc.) and say "these make sense relative to one another." You set up some heuristics and algorithms, then try to apply them as universally as possible and take your lumps. It's like the never-ending "uniques" debate that the web stats community has: you try to plant some stakes in the ground that get you to reasonable conclusions when you consider all the data, and then jump off the next bridge when you come to it.
Some of the metrics issues that we are continually addressing include:
- Automated aggregator clicks: There are some niche aggregators and feed-reading clients that will occasionally auto-click every link in a feed, presumably for offline caching or in order to perform some contextual analysis. So you have to come up with mechanisms to discount those clicks in publisher dashboards as not counting toward subscriber click totals.
- Bots as aggregators: Sometimes these are obvious attempts to cloak some bot, and sometimes a hard-to-categorize service simply emerges that polls a feed from loads of desktops. Now that there are many thousands of feed clients, we sometimes don't see these bots or stats until they appear on a threshold report we create internally. Publishers can see a set of "subscribers" from something that ends up falling off the end of the world a month later. What happened? Sometimes the bot just goes away, sometimes it's a combination of bot behavior within an otherwise valid client, and other times, after we've learned more about the bot or its behavior, we've concluded that it's not really polling a feed for the purpose of notifying a subscriber or delivering content to a subscriber.
- Default feeds: Some aggregators default subscribe users to a feed in some cases. For example, perhaps you create a new account with an aggregator and announce "I'm interested in technology feeds", and you're auto-subscribed to a list of feeds. What's the right thing to do here when those numbers are reported as subscribers? We've decided to count those as subscribers...the content is being updated for that subscriber and as long as the subscriber doesn't remove it, we generally say that it's not our place to say those aren't "subscribers". Now, we also provide Total Stats publishers with a metric called "Reach," and reach does a good job of helping pro stats customers understand "how much of my total subscriber base is actually opening my feed and looking at it on a day-to-day basis". This helps publishers with large aggregator counts to understand how many of those aggregator subscribers are "active" from day to day. There are a few aggregators that report subscriber counts purely based on "active" users, not cumulative over time, which obviously provides a more accurate running metric.
- Lack of visibility: There are a number of aggregators that provide no insight in their user-agents into the number of subscribers on behalf of whom they are requesting the feed. Obviously, these end up representing some undercounted number of subscribers for any publisher distributed to that aggregator, and this makes it harder for a publisher to understand their true distribution. We work with as many of these aggregators as we can in order to provide publishers with more information.
…to say nothing of the partial podcast downloads and podcast download bots and other fun with podcast stats.
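To make the automated-click bullet above a little more concrete, here is a minimal sketch of one way such a discount could work. It is an illustration only, not a description of any production logic: the `Click` record, the `client_id` notion, and the 60-second and 5-item thresholds are all assumptions chosen for the example. The idea is simply that a client which clicks every item in a feed inside a short window is probably not a person.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Click:
    client_id: str    # hypothetical key, e.g. user-agent plus IP address
    item_url: str     # the feed item link that was clicked
    timestamp: float  # seconds since the epoch


def _covers_all_items_in_window(client_clicks, items, window_seconds):
    """True if some window of window_seconds contains a click on every item."""
    relevant = sorted(
        (c for c in client_clicks if c.item_url in items),
        key=lambda c: c.timestamp,
    )
    counts = defaultdict(int)
    distinct = 0
    start = 0
    for click in relevant:
        counts[click.item_url] += 1
        if counts[click.item_url] == 1:
            distinct += 1
        # Drop clicks from the left edge until the window is narrow enough.
        while click.timestamp - relevant[start].timestamp > window_seconds:
            old = relevant[start]
            counts[old.item_url] -= 1
            if counts[old.item_url] == 0:
                distinct -= 1
            start += 1
        if distinct == len(items):
            return True
    return False


def discount_auto_clicks(clicks, feed_item_urls, window_seconds=60.0, min_items=5):
    """Return only the clicks that look human-initiated.

    Assumed heuristic: a client that clicks every item in the feed within
    window_seconds is treated as automated, and all of its clicks are dropped.
    Feeds with fewer than min_items items are never flagged, since a person
    could plausibly click through a very short feed.
    """
    items = set(feed_item_urls)
    by_client = defaultdict(list)
    for click in clicks:
        by_client[click.client_id].append(click)

    automated = set()
    if len(items) >= min_items:
        for client_id, client_clicks in by_client.items():
            if _covers_all_items_in_window(client_clicks, items, window_seconds):
                automated.add(client_id)

    return [c for c in clicks if c.client_id not in automated]
```

A real system would presumably also want to remember which clients were flagged, so that the same aggregator is discounted consistently across feeds and over time; that bookkeeping is left out here.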
Across the board, we're seeing more and more distinct kinds of user-agents requesting feeds. Here's a quick chart of the growth in unique user-agents we've seen polling feeds just in the last six months.

Caveat Emptor: These chart numbers don't include user-agents with spammy identifiers that are obviously just long random strings, and hundreds of agents like "Shmucky-bot/1.0" and "Shmucky-bot/2.0" are only counted as one distinct user-agent. All of this data excludes the millions of requests a day we capture from clients with completely blank identifiers. Still, you can see the current count is well over 8,000 different kinds of feed reading entities. Everything from aggregators and search crawlers to thousands of mobile feed readers, hundreds of podcatchers, loads of language-specific agents, specialty browser toolbars and more.
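The counting rules in that caveat can be sketched roughly as follows. This is a hypothetical illustration, not the code behind the chart: the function names, the "first token before the slash" rule, and the length and entropy thresholds for spotting random strings are all assumptions made for the example.

```python
import math
from collections import Counter


def product_name(user_agent):
    """Return the leading product token without its version, lowercased.

    "Shmucky-bot/1.0" and "Shmucky-bot/2.0" both become "shmucky-bot".
    Blank or whitespace-only identifiers return an empty string.
    """
    stripped = user_agent.strip()
    if not stripped:
        return ""
    first_token = stripped.split(None, 1)[0]  # e.g. "Shmucky-bot/1.0"
    return first_token.split("/", 1)[0].lower()


def looks_random(name, min_length=20, min_entropy=3.5):
    """Assumed spam heuristic: a long name with high per-character entropy."""
    if len(name) < min_length:
        return False
    counts = Counter(name)
    total = len(name)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return entropy >= min_entropy


def count_distinct_agents(user_agents):
    """Count distinct agent families, skipping blank and random-looking ones."""
    names = set()
    for user_agent in user_agents:
        name = product_name(user_agent)
        if not name or looks_random(name):
            continue  # excluded from the count, as described in the caveat
        names.add(name)
    return len(names)
```

One obvious limitation: every browser-style identifier that begins with "Mozilla/" collapses into a single family under this rule, so a real classifier would need to look further into the string.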
One of the questions we bounce around here is "what can we do to help people get more information about their statistics in order to better understand how their content is being distributed?" (although we don't speak to ourselves so eloquently). There are a few things we're always working on in this department:
- Provide more statistics. When publishers have more dots to connect, they can draw a more distinct picture of their content distribution traffic. +2 points for the well-executed "dots - picture" metaphor. We'll be rolling out blog stats very very soon (remember the BlogBeat acquisition a ways back?), so publishers can get a nicer picture of their "total readership" across feed and site. This will also help publishers better make sense of anomalies in one area; e.g., did I see a traffic spike to my site the day before I saw a big spike in subscribers? Was I seeing more search traffic to my site during these periods of sustained subscriber growth? etc. We've got more surprises in this stats department as well.
- Be transparent. We wrote up the peek inside TechCrunch's subscriber numbers a couple months ago as a way to help people understand what's behind the numbers. We can do more in this arena, such as describing the kinds of things we discuss with various publishers about traffic anomalies and other grey areas where the right metric isn't obvious.
- Be creative. When the market has trouble with some metric or approach or perceives a lack of information, there's an obvious opportunity to step in and provide that data.
Comments
Hitwise does not have a capital W.
Oh man, this is a timely post. I'm currently doing a podcasting metrics series on my blog and this answers some questions I had. I'll be using this info in my next post in the series. Thanks for sharing.
This statement you made is good advice for anyone doing analytics:
"...you try to plant some stakes in the ground that get you to reasonable conclusions when you consider all the data, and then jump off the next bridge when you come to it."
It's easy for podcasters and bloggers to take for granted what's involved in providing accurate statistics.
I can't wait to see what else you have in store.
Jason Van Orden
Podcasting Metrics Series
