TechBrew

Informative geekery on software and technology

Tagging Habits of the A-List

September 18th, 2007 by

At the beginning of the year I set up JetStream to monitor the top 100 CNET Blogs and see how the Technology A-Listers (according to CNET) were using tagging in their posts. I’ve been collecting data for nine months now, so I thought I’d share some of what I’ve learned.

To Tag or Not To Tag

The first thing I noticed is that many of the blogs don’t tag their posts at all. Those that do tend to fall into two camps: the Tags-As-Category camp and the Semantic Tagging camp (usually individuals). The former use a limited list of tags like “Gadget” or “News”; the latter use specific, descriptive tags like “iPod”, “numanuma”, or “youtube”. And, of course, there are some who do a little of both, as we do here on TechBrew.

Popular Tags of 2007

So far this year the 100 blogs have posted ~700,000 RSS and Atom items, and have used ~55,000 unique tags to label them. I should note at this point that uniqueness in JetStream is case-sensitive, so ‘iPod’, ‘ipod’, and ‘IPOD’ are in fact 3 different tags. I didn’t want to lose some interesting analysis opportunities by lumping all capitalization variants together.
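To make the case-sensitivity point concrete, here is a minimal sketch of case-sensitive tag counting that still lets you look at capitalization variants side by side. It is not JetStream's actual code; the sample `items` list and the `variants` structure are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: each inner list holds the tags on one feed item.
items = [
    ["iPod", "apple"],
    ["ipod", "Gadgets"],
    ["IPOD", "apple"],
    ["iPod", "Gadgets"],
]

# Case-sensitive counting: every capitalization variant is its own tag.
tag_counts = Counter(tag for tags in items for tag in tags)

# Group the variants under their lowercase form without merging their counts.
variants = defaultdict(Counter)
for tag, count in tag_counts.items():
    variants[tag.lower()][tag] = count

print(len(tag_counts))         # number of unique (case-sensitive) tags
print(dict(variants["ipod"]))  # per-variant counts for one tag
```

Keeping the raw counts and grouping only at query time means nothing is lost: you can still ask which blogs prefer ‘iPod’ over ‘ipod’.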

The most popular tags – those used on the most items – are skewed toward the use of fixed category tags by some of the prolific blogs like Engadget and Gizmodo. Still, the top 30 tags used thus far paint an interesting picture of what the Tech blogosphere is talking about:

     TAG NAME               FREQUENCY OF USE
 #1: Gadgets.......................3796
 #2: apple.........................2900
 #3: Top...........................2054
 #4: Microsoft.....................2046
 #5: iphone........................1914
 #6: Home Entertainment............1875
 #7: internet......................1657
 #8: Cellphones....................1536
 #9: General.......................1505
#10: sony..........................1485
#11: xbox 360......................1478
#12: Video.........................1295
#13: ps3...........................1202
#14: Blogging......................1134
#15: nintendo......................1108
#16: wii...........................1096
#17: Portable Media................1053
#18: Peripherals...................1016
#19: Clips..........................998
#20: Google.........................976
#21: ipod...........................935
#22: Company & Product Profiles.....882
#23: japan..........................871
#24: Software.......................858
#25: news...........................840
#26: media..........................833
#27: Smartphones....................786
#28: windows........................769
#29: vista..........................761
#30: Original.......................746

I can’t help but laugh that ‘Blogging’ is used so often as a tag on blog items. Recursive or redundant, you decide.

Let’s Get Semantic

From the above list it is easy to see who and what have been the newsmakers this year. Quite a bit can be inferred by the human brain, but there is quite a bit of data mining and computer-based inference to be done as well.

To illustrate, anybody looking at the above list would be able to create some arbitrary semantic groupings:

{Gadgets, apple, iphone, Cellphones}

{Microsoft, xbox 360, windows, vista}

{xbox 360, ps3, wii}

[Image: Tag Graph of TechBrew.net]

The next step for JetStream, in my mind, is to derive such semantic relationships automatically. The algorithm would essentially be based on how often certain tags appear together on the same blog item. The strength of each relationship, measured by frequency of co-usage, would be an indicator of relevance.
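The co-usage idea can be sketched in a few lines. This is only an illustration under simple assumptions, not an existing JetStream feature: the sample `items` and the function names `co_usage` and `related` are hypothetical, and relationship strength is taken to be the raw number of items two tags share.

```python
from collections import Counter
from itertools import combinations

# Hypothetical sample data: each inner list holds the tags on one blog item.
items = [
    ["apple", "iphone", "Cellphones"],
    ["apple", "iphone", "Gadgets"],
    ["Microsoft", "xbox 360"],
    ["Microsoft", "vista", "windows"],
    ["apple", "ipod"],
]

def co_usage(items):
    """Count how many items each unordered pair of tags appears on together."""
    pairs = Counter()
    for tags in items:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order.
        for a, b in combinations(sorted(set(tags)), 2):
            pairs[(a, b)] += 1
    return pairs

def related(tag, pairs, limit=5):
    """Return the tags most often used alongside `tag`, strongest first."""
    scores = Counter()
    for (a, b), n in pairs.items():
        if a == tag:
            scores[b] += n
        elif b == tag:
            scores[a] += n
    return scores.most_common(limit)

pairs = co_usage(items)
print(related("apple", pairs))
```

Raw counts favor tags that are simply used a lot, so a real implementation would probably normalize the score, for example by dividing by the number of items that carry either tag.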

This approach isn’t very complicated, but it works because people have been doing the work of semantic tagging on all the items. What I’m speaking of is simply a mechanism to reveal what people have already done. Sort of FOAF, but for ideas and things instead of people. (Danny Ayers, are your ears burning?) This kind of visualization hints at what can be discovered through tagging trends across like-minded populations.

(JetStream does have a semantic view that is sliced by a given day’s worth of tags. You can see the “Tag Constellation” view of the day’s most popular tag and anything associated with it. Click the bullseye icon above any tag to center it in the constellation. This is a fairly simplistic approach, but it gives a taste of what can be done with semantic inference and better visualization.)

So What’s Next?

I’m considering a way to express the semantic groupings in an RDF construct, and have been asking a couple of acquaintances their thoughts on the matter. It would be useful to have a standard way to express semantic groups of tags – probably represented as URIs – but this whole thing may be a little too obscure to standardize just yet.
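As one possible shape for such a construct, the sketch below writes a tag group as Turtle using the existing SKOS vocabulary (`skos:Collection` and `skos:member`). Choosing SKOS is my assumption, not a settled design, and the `http://example.org/` namespace and the function names are placeholders.

```python
from urllib.parse import quote

SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/"  # hypothetical namespace for tags and groups

def tag_uri(tag):
    """Build a URI reference for a tag; percent-encoding preserves its case."""
    return f"<{BASE}tags/{quote(tag, safe='')}>"

def group_to_turtle(name, tags):
    """Serialize one semantic group of tags as a SKOS collection in Turtle."""
    members = " ,\n        ".join(tag_uri(t) for t in tags)
    lines = [
        f"@prefix skos: <{SKOS}> .",
        "",
        f"<{BASE}groups/{quote(name, safe='')}> a skos:Collection ;",
        f"    skos:member {members} .",
    ]
    return "\n".join(lines)

turtle = group_to_turtle("consoles", ["xbox 360", "ps3", "wii"])
print(turtle)
```

Each tag becomes a URI and each group a collection of those URIs, so the same tag can belong to several groups, as ‘xbox 360’ does in the groupings above.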

I’ve talked about some of what I would like to do, but I’d like to hear from you. What would be interesting to you? Should I just release the data and let people play with it? What format would you like to see it in? If I gave it to you, what would you do with it?

Tags: Feeds · News

2 responses so far ↓

  • 1 Tim Finin // Sep 19, 2007 at 8:15 am

    Akshay Java of the UMBC ebiquity lab did something along these lines in creating Feeds that Matter (http://ftm.umbc.edu/). He analysed the public folders of Bloglines users (which could be seen as a kind of tagging system) and then merged the folders using the similarity of their feed vectors. For a detailed description, see http://ebiquity.umbc.edu/paper/html/id/314/.

  • 2 Mark Woodman // Sep 19, 2007 at 8:22 am

    At first glance, Akshay’s work looks a lot like what TechMeme is trying to do, but the merging folders technique does have a strong conceptual similarity to what I’m mulling over. Thanks for the pointer, Tim!
