Tony Hirst tipped me off in an email to the new “Fetch Data” module in Yahoo Pipes. This feature allows you to parse any XML or JSON, opening up a whole new world of possibility.
To demonstrate, I’ve created a pipe that will take all feeds listed in an OPML file, aggregate all of the items therein, and return the results.
The important part of making this work was ensuring I could get both RSS and Atom feed items, and then sort them by date. This proved to be a pain. The problem with the first part is that the XML structures for the two feed types are different, and the problem with the second part is that they use different date formats. Atom uses dc:date, and RSS 2.0 uses pubDate. Because Pipes does a string comparison when sorting, the different formats make a dc:date/pubDate comparison useless. This is the now-infamous “Yahoo Pipes dating problem”.
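To see why a plain string sort is no help here, consider this small Python sketch. Python is only standing in for Pipes, and the third timestamp is a made-up example; the first two are the sample values quoted later in this post.

```python
# Three timestamps listed in true chronological order. The first two are the
# sample values from this post; the third is a made-up example.
chronological = [
    "Mon, 19 Mar 2007 20:55:30 +0000",   # pubDate style
    "2007-03-28T00:34:43+00:00",         # dc:date style
    "Fri, 30 Mar 2007 08:00:00 +0000",   # pubDate style
]

# A string sort compares character by character, so digits sort before
# letters and "Fri" sorts before "Mon", whatever the actual dates are.
string_sorted = sorted(chronological)

for value in string_sorted:
    print(value)
```

The string-sorted list puts the 19 March item last, even though it is the oldest of the three.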
Overcoming these obstacles, I’ve used a bit of trickery and/or slop to get RSS and Atom items to play nicely together, and correctly intersort by date. The date sorting business isn’t elegant, but with heavy use of Rename and Regex I got it working: OPML Aggregator, sorting RSS/Atom Items by Canonical Date
Here’s how it works:
The Nitty Gritty
An OPML file has a number of outline elements within the body. (This particular pipe assumes that no outlines are nested.) When using the Fetch Data module, the structure brought in can be accessed using “parent.child” notation to get at a certain path. In this case, therefore, I used “body.outline” to denote the repeating element I want to use.
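For anyone who wants to try the same step outside Pipes, here is a minimal Python sketch of the “body.outline” lookup. The sample OPML, its URLs, and the function name feed_urls are all made up for illustration, and like the pipe it only looks at outlines directly under the body.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample OPML with flat (non-nested) outlines.
SAMPLE_OPML = """<?xml version="1.0"?>
<opml version="1.1">
  <head><title>Example subscriptions</title></head>
  <body>
    <outline text="Feed A" type="rss" xmlUrl="http://example.com/a.xml"/>
    <outline text="Feed B" type="rss" xmlUrl="http://example.com/b.xml"/>
    <outline text="A folder with no feed URL"/>
  </body>
</opml>"""


def feed_urls(opml_text):
    """Return the xmlUrl of every outline that sits directly under body."""
    root = ET.fromstring(opml_text)
    # "body/outline" plays the role of the pipe's "body.outline" path:
    # it matches direct children only, so nested outlines are ignored.
    return [
        outline.get("xmlUrl")
        for outline in root.findall("body/outline")
        if outline.get("xmlUrl")
    ]


print(feed_urls(SAMPLE_OPML))
```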
Now that I can get at the xmlUrl of each feed in the OPML, I’m faced with a problem. Some of them are going to be Atom feeds, where the items are one level into the document structure. Others are going to be RSS feeds, where the items are two levels down, inside the channel.
To solve this problem, I split the output of the Fetch Data into two branches. Using For Each:Replace, I look for Atom items on the left branch (“item”), and RSS items on the right branch (“channel.item”), again using Fetch Data. I can now be assured that all items on the left have dc:date, and all items on the right have pubDate.
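The two-branch lookup can be sketched roughly like this in Python. The sample documents are simplified stand-ins I made up (real feeds carry more namespaces and elements), and extract_items is a hypothetical name; the only parts taken from the pipe are the two paths, “item” and “channel.item”, and the two date elements.

```python
import xml.etree.ElementTree as ET

# dc:date lives in the Dublin Core namespace, so ElementTree needs the full name.
DC_DATE = "{http://purl.org/dc/elements/1.1/}date"

# Simplified, made-up sample for the left branch: items one level in, with dc:date.
LEFT_SAMPLE = """<feed xmlns:dc="http://purl.org/dc/elements/1.1/">
  <item><title>Left item</title><dc:date>2007-03-28T00:34:43+00:00</dc:date></item>
</feed>"""

# Simplified, made-up sample for the right branch: items inside channel, with pubDate.
RIGHT_SAMPLE = """<rss version="2.0">
  <channel>
    <item><title>Right item</title><pubDate>Mon, 19 Mar 2007 20:55:30 +0000</pubDate></item>
  </channel>
</rss>"""


def extract_items(feed_text):
    """Try both paths, as the two branches of the pipe do."""
    root = ET.fromstring(feed_text)
    items = []
    for element in root.findall("item"):            # left branch: "item"
        items.append({"title": element.findtext("title"),
                      "dc:date": element.findtext(DC_DATE)})
    for element in root.findall("channel/item"):    # right branch: "channel.item"
        items.append({"title": element.findtext("title"),
                      "pubDate": element.findtext("pubDate")})
    return items


print(extract_items(LEFT_SAMPLE))
print(extract_items(RIGHT_SAMPLE))
```

Each feed only matches one of the two paths, so every item comes back with exactly one kind of date attached.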
Now that I have Atom items on the left, and RSS items on the right, I use Rename to copy the respective date elements into several placeholder attributes: pubyear, pubmonth, pubday, and pubhour. At this point the entire dc:date or pubDate is copied into each, so I’ll need to use Regex on each attribute to get at the essential value. Here you can see the Rename Mappings for the custom date attributes on the Atom items, and the Regex to pare down each accordingly. The Regex for dc:date isn’t so bad, since all of the values are numeric, like so: “2007-03-28T00:34:43+00:00”.
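Here is a rough Python equivalent of the copy-then-pare-down step for dc:date values. The pattern is my own reconstruction, not the exact expression from the pipe, and add_date_fields is a hypothetical helper name.

```python
import re

# Captures year, month, day and hour from a value such as
# "2007-03-28T00:34:43+00:00". An illustrative pattern, not the pipe's own.
DC_DATE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})T(\d{2}).*$")

# Which capture group each placeholder attribute keeps.
GROUP_FOR = {"pubyear": 1, "pubmonth": 2, "pubday": 3, "pubhour": 4}


def add_date_fields(item):
    """Copy dc:date into each placeholder, then pare each one down."""
    result = dict(item)
    for name, group in GROUP_FOR.items():
        # Copy step: the whole date string goes into the placeholder.
        result[name] = item["dc:date"]
        # Regex step: replace the whole string with just one captured part.
        result[name] = DC_DATE.sub(rf"\g<{group}>", result[name])
    return result


print(add_date_fields({"title": "Left item", "dc:date": "2007-03-28T00:34:43+00:00"}))
```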
Now comes the painful part. RSS dates are in this RFC-822 format, like so: “Mon, 19 Mar 2007 20:55:30 +0000”. The month isn’t specified as a number, so I’ll have to have a Rule for each month of the year: “Jan” becomes “01”, and so on. Here is part of the Regex module to show what I mean:
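In Python terms, the month-by-month rules look something like the sketch below. The patterns are my own approximation rather than the pipe's exact rules, add_date_fields is a hypothetical helper name, and I am assuming four-digit years.

```python
import re

# One rule per month, applied in order, like the rows of a Regex module.
MONTH_RULES = [
    ("Jan", "01"), ("Feb", "02"), ("Mar", "03"), ("Apr", "04"),
    ("May", "05"), ("Jun", "06"), ("Jul", "07"), ("Aug", "08"),
    ("Sep", "09"), ("Oct", "10"), ("Nov", "11"), ("Dec", "12"),
]

# Captures day, month name, year and hour from a value such as
# "Mon, 19 Mar 2007 20:55:30 +0000". The leading weekday is optional.
PUB_DATE = re.compile(
    r"^(?:[A-Za-z]{3},\s*)?(\d{1,2})\s+([A-Za-z]{3})\s+(\d{4})\s+(\d{2}):.*$"
)


def add_date_fields(item):
    """Add pubyear, pubmonth, pubday and pubhour to an item with a pubDate."""
    match = PUB_DATE.match(item["pubDate"].strip())
    if match is None:
        raise ValueError("unrecognised pubDate: %r" % item["pubDate"])
    day, month, year, hour = match.groups()
    # Run every month rule; only the matching one changes the value.
    for name, number in MONTH_RULES:
        month = re.sub("^" + name + "$", number, month)
    result = dict(item)
    # Pad single-digit days so that "5" sorts correctly as "05".
    result.update(pubyear=year, pubmonth=month, pubday=day.zfill(2), pubhour=hour)
    return result


print(add_date_fields({"title": "Right item",
                       "pubDate": "Mon, 19 Mar 2007 20:55:30 +0000"}))
```

Outside Pipes you would not need any of this: Python's email.utils.parsedate_to_datetime parses this date format directly.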
Once that nastiness was done, the rest came together nicely. Time to combine my RSS and Atom items back into one feed, and then sort them on my canonical date attributes:
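A minimal Python sketch of the union-and-sort step follows. The items are made up, union_and_sort is a hypothetical name, and newest-first ordering is my assumption about what you would want from an aggregator.

```python
def sort_key(item):
    """Build a sortable string such as "2007032800" from the placeholders."""
    return item["pubyear"] + item["pubmonth"] + item["pubday"] + item["pubhour"]


def union_and_sort(left_items, right_items):
    """Combine both branches and sort newest first (an assumed ordering)."""
    return sorted(left_items + right_items, key=sort_key, reverse=True)


# Made-up items that already carry the canonical date attributes.
left = [
    {"title": "Left item", "pubyear": "2007", "pubmonth": "03",
     "pubday": "28", "pubhour": "00"},
]
right = [
    {"title": "Right item, older", "pubyear": "2007", "pubmonth": "03",
     "pubday": "19", "pubhour": "20"},
    {"title": "Right item, newer", "pubyear": "2007", "pubmonth": "03",
     "pubday": "30", "pubhour": "08"},
]

merged = union_and_sort(left, right)
for item in merged:
    print(sort_key(item), item["title"])
```

Because every part is zero-padded and ordered from year down to hour, a plain string sort now agrees with the calendar. Time zone offsets and minutes are ignored, so items published within the same hour, or stamped with different offsets, can still land slightly out of order.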
Not the most elegant pipe in the world, but I’ve been itching to solve the sortable date problem. It looks like Yahoo has a canonical y:published attribute in the works, but until that is turned on, we’re stuck doing trickery like this.
Again, if you have any improvements to the Regex or other bits, please share them here.

4 responses so far ↓
1 Rich Tatum // Apr 11, 2007 at 5:49 am
Interesting and potentially useful pipe!
Unfortunately, it choked on my opml file.
Pipes seems to have a lot of problems with Blogger atom feeds still.
:: sigh :;
Rich.
2 Bjorn // May 29, 2007 at 1:31 am
Even the y:published attribute is seeming to work differently from run to run for me, seemingly arbitrarily.
Quick dumb Pipes RegEx question you might be able to answer for me:
I’m just trying to effectively blank out an attribute (so it’s shown as
3 Marshall Kirkpatrick // Jul 29, 2008 at 9:46 am
Awesome!! thanks!
4 Paper Bits – links for 2008-07-30 // Aug 3, 2008 at 5:51 pm
[...] OPML in Yahoo Pipes with Canonical Date Sorting Yahoo Pipes just got a whole lot more interesting. (tags: @toread XML yahoo) [...]