TechBrew

Informative geekery on software and technology

RSS: The GUID Problem

September 7th, 2007 by

Andy Brudtkuhl laments a common problem for anyone building a feed aggregator: it isn’t always simple to eliminate duplicates when comparing RSS items. This is true even when the two items being compared are actually the same post, one of which has been republished.

He has discovered that you can’t rely on the link alone to decide whether two items are the same. Essentially: the link in an RSS item is not a globally unique identifier (GUID).

In Andy’s case, he’s seeing items that originated from the same site but carry different links: some point to the original site (http://www.converstations.com/… /discovery-along.html), while others use the FeedBurner link that tracks click-throughs (http://feeds.feedburner.com/… /discovery-along.html).

This is one of those tough problems with RSS that the optional GUID element was supposed to solve. If the GUID element is present in the item – and is properly formed – you should be able to rely on it regardless of link variation.
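
As a rough illustration of that idea, here is a minimal Python sketch that prefers the guid and only falls back to the link when no guid is present. The feed, its URLs, and the helper names `item_key` and `dedupe` are invented for this example; they are not taken from Andy's engine or from any particular aggregator.

```python
import xml.etree.ElementTree as ET

# Two hypothetical items for the same post: one links to the origin site,
# the other to a click-tracking redirect. All URLs here are made up.
FEED = """<rss version="2.0"><channel>
<item>
  <title>Discovery</title>
  <link>http://example.com/2007/09/discovery.html</link>
  <guid isPermaLink="false">tag:example.com,2007:post-42</guid>
</item>
<item>
  <title>Discovery</title>
  <link>http://feeds.example.net/~r/site/~3/42/discovery.html</link>
  <guid isPermaLink="false">tag:example.com,2007:post-42</guid>
</item>
</channel></rss>"""


def item_key(item):
    """Return the guid text when the item has one, else fall back to the link."""
    guid = (item.findtext("guid") or "").strip()
    if guid:
        return ("guid", guid)
    return ("link", (item.findtext("link") or "").strip())


def dedupe(items):
    """Keep the first item seen for each key, preserving feed order."""
    seen = set()
    unique = []
    for item in items:
        key = item_key(item)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


items = ET.fromstring(FEED).findall("./channel/item")
links = {item.findtext("link") for item in items}
print(len(links))          # 2: the links differ
print(len(dedupe(items)))  # 1: the guid identifies them as one item
```

This only helps when the publisher emits a stable guid, which is exactly the assumption that tends to fail in practice.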

Unfortunately, publishers rarely use RSS GUIDs correctly, which makes them fairly unreliable. Mark Pilgrim discussed this back in 2004:

  1. Older RSS versions don’t have it, and even in the latest version, it’s still optional. So very few feeds actually have it.
  2. The RSS spec doesn’t give clear guidance on how to make a unique identifier, or how unique it really needs to be, or why you would bother. So many publishers generate useless IDs.
  3. It’s difficult to compare them, because the data type of the <guid> element isn’t stable. If a certain attribute is present and contains a certain value, then the element must be treated as a string. But in other cases, the element must be treated as a URL. As we’ll see in a minute, these data types have different rules for equality, so comparing GUIDs is more difficult than it sounds.
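
To make the third point concrete, here is a hedged Python sketch. It assumes the attribute Pilgrim refers to is `isPermaLink` from the RSS 2.0 specification, where the guid is a permalink URL by default and an opaque string when the attribute is "false". The function name `guid_key` and the choice of URL normalization (lowercasing only the scheme and host) are assumptions made for illustration, not a complete URL-equivalence algorithm.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, urlunsplit


def guid_key(guid_elem):
    """Build a comparison key for a <guid> element.

    With isPermaLink="false" the guid is an opaque string; otherwise
    (attribute absent or "true") it is treated as a URL.
    """
    text = (guid_elem.text or "").strip()
    if guid_elem.get("isPermaLink", "true").strip().lower() == "false":
        return ("string", text)  # opaque: compare character for character
    parts = urlsplit(text)
    # Assumption: lowercasing scheme and host is the only normalization applied.
    normalized = urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(),
         parts.path, parts.query, parts.fragment)
    )
    return ("url", normalized)


a = ET.fromstring('<guid>HTTP://Example.COM/post/1</guid>')
b = ET.fromstring('<guid>http://example.com/post/1</guid>')
c = ET.fromstring('<guid isPermaLink="false">HTTP://Example.COM/post/1</guid>')
d = ET.fromstring('<guid isPermaLink="false">http://example.com/post/1</guid>')

print(guid_key(a) == guid_key(b))  # True: as URLs, scheme and host case is ignored
print(guid_key(c) == guid_key(d))  # False: as strings, case matters
```

The same pair of values compares equal under one data type and unequal under the other, which is the instability the quote describes.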

Pilgrim and the folks working on Atom came to the conclusion that every item should carry a unique id (atom:id), and that it must be formed according to a spec that ensures canonical uniqueness. If you have an Atom feed to work with, this is a reliable mechanism.

For better or worse, however, the adoption of Atom has been slow and the market still overwhelmingly uses RSS. Aggregators like Andy’s (or any feed reader) that handle a mixed bag of feeds are forced to kludge their way to uniqueness.
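
One possible kludge, sketched here in Python purely as an assumption (the post does not say what Andy's engine actually does), is to fingerprint an item's content and ignore the link entirely. The `fingerprint` helper and the sample items are hypothetical.

```python
import hashlib
import re


def _normalize(text):
    """Collapse runs of whitespace and lowercase, so trivial reformatting is ignored."""
    return re.sub(r"\s+", " ", text or "").strip().lower()


def fingerprint(title, description):
    """Hash an item's normalized title and description; the link is left out on purpose."""
    payload = _normalize(title) + "\n" + _normalize(description)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


# Hypothetical items: same post, different links, no guid.
original = {"link": "http://example.com/discovery.html",
            "title": "Discovery Along the Way",
            "description": "A short post."}
tracked = {"link": "http://feeds.example.net/~r/site/discovery.html",
           "title": "Discovery  along the way ",
           "description": "A short post.\n"}

same = (fingerprint(original["title"], original["description"])
        == fingerprint(tracked["title"], tracked["description"]))
print(same)  # True
```

The trade-off is the usual one for heuristics: an edited post gets a new fingerprint and shows up as a new item, and two genuinely different posts with identical title and body collapse into one.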

Update: James Holderness reminded me of his excellent discussion on this issue from last year: RSS Duplicate Detection.

Tags: News · Opinion

3 responses so far ↓

  • 1 Andy Brudtkuhl // Sep 7, 2007 at 11:52 am

    If only standards used standards!

    In analyzing the feeds I am aggregating I’d say 1% of them use the GUID feature – which sucks. When initially building my engine I started with the GUID only to end disappointed that it was never used. Then I thought the link should do the trick, which it does on 85% of the feeds I aggregate but this is not enough.

    And now, yes, I have to kludge stuff together which I still don’t believe is 100% reliable but it’s a lot better than duplicated content.

    I’m very interested to know how other RSS developers have gotten around this lovely non-standard gotcha in what is supposed to be standardized technology.

  • 2 James Holderness // Sep 7, 2007 at 7:42 pm

    For details on how other RSS aggregators deal with duplicates (they all handle the problem differently), check out this post:

    http://www.xn--8ws00zhy3a.com/blog/2006/08/rss-dup-detection

    (Encoding weirdness in above link fixed by Mark)

  • 3 Mark Woodman // Sep 7, 2007 at 7:45 pm

    James,

    *headsmack* I completely forgot about your post on this topic. Forgive me, it has been a year. :) Thanks for chiming in.

    -M
