Scott's Workblog


December 14, 2010

Parsing CC license information in different feed formats

I wrote this quite a long time ago as part of some work I was doing for the STEEPLE project and as advice to the UK OER programme. I'm sure other people have run into the same issue, so I'm posting it here. You can see it in use on the Ensemble prototype.

RSS and Atom are a natural way to share lists of educational materials. However, one of the main obstacles to standardising feeds is how the CC license is expressed. I've summarised the issues below.

1. Elements

Currently we have several different elements to choose from when placing our CC license:
  • license
  • license
  • license
Different feed formats seem to prefer different options here, but it's not uncommon to find them mixed up, and any combination is still valid XML. There is also possible confusion with copyright, for which there are two elements:
  • RSS 2.0 copyright
  • DC rights
Feeds may sometimes put licensing information in these, which is technically not correct as license != copyright. But it happens.

2. Placement

Licenses can apply to individual items or to the feed as a whole. For Ensemble the main interest is the feed license, as we're dealing with "albums" rather than arbitrary collections. However, we will have to deal with situations where items are of mixed licenses (see below).
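To make the feed-level versus item-level distinction concrete, here is a minimal Python sketch that collects candidate license text at both levels of an RSS 2.0 feed. It matches elements by local name only (license, falling back to copyright and rights), so it does not depend on which namespace prefix a feed happens to use. The helper names (local_name, license_texts) and the sample feed are made up for illustration; a license given only as an attribute (for example on a link element) is not handled here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed: a channel-level copyright and an item-level dc:rights.
SAMPLE = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Album</title>
    <copyright>See http://creativecommons.org/licenses/by/3.0/</copyright>
    <item>
      <title>Lecture 1</title>
      <dc:rights>http://creativecommons.org/licenses/by-sa/3.0/</dc:rights>
    </item>
  </channel>
</rss>"""


def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def license_texts(parent):
    """Text of direct children named 'license'; if there are none,
    fall back to children named 'copyright' or 'rights'."""
    primary = [(child.text or "").strip() for child in parent
               if local_name(child.tag) == "license"]
    if primary:
        return primary
    return [(child.text or "").strip() for child in parent
            if local_name(child.tag) in ("copyright", "rights")]


channel = ET.fromstring(SAMPLE).find("channel")
feed_level = license_texts(channel)
item_level = [license_texts(item) for item in channel.findall("item")]
```

Because only direct children are inspected, the item's dc:rights is not mistaken for a feed-level statement, which is the distinction that matters when a feed is treated as an album.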

3. Content

Then there is the issue of what content to use here. I personally don't mind as long as there is a valid CC URI that can be extracted from the text using regular expressions. Others prefer declaring the license conditions using RDF. There is also some debate as to whether these elements should contain attributes specifying the URI, or the URI should be placed within the text content such as "Licensed under a Creative Commons Attribution - NonCommercial-ShareAlike 2.0 Licence - see". I think ultimately we're going to have to agree on some common practice here - I suggest making sure the CC URI is somewhere in the text content of the element. Coping with both text and RDF content for CC is way too taxing; also, the RDF content I've seen is redundant as it basically restates what is already meant by the CC URI.
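As an illustration of pulling a CC URI out of free text, here is a minimal Python sketch. The pattern and the normalisation rules (lower-casing, dropping a www. prefix and trailing punctuation, treating http and https as the same license) are assumptions made for this example, not an agreed convention, and the function name extract_cc_urls is hypothetical.

```python
import re

# Assumed pattern: URLs under creativecommons.org/licenses/ or /publicdomain/.
CC_URL = re.compile(
    r"https?://(?:www\.)?creativecommons\.org/(?:licenses|publicdomain)/[A-Za-z0-9\-+./]+",
    re.IGNORECASE,
)


def extract_cc_urls(text):
    """Return the distinct CC URLs found in text, normalised for comparison."""
    found = []
    for match in CC_URL.finditer(text or ""):
        # Drop sentence punctuation and a trailing slash picked up by the match.
        url = match.group(0).rstrip(".,;)").rstrip("/").lower()
        # Treat http/https and the optional www. prefix as the same license.
        url = re.sub(r"^https?://(?:www\.)?", "http://", url)
        if url not in found:
            found.append(url)
    return found


text = ("Licensed under a Creative Commons Attribution - NonCommercial-ShareAlike "
        "2.0 Licence - see http://creativecommons.org/licenses/by-nc-sa/2.0/.")
urls = extract_cc_urls(text)
```

Normalising before comparing means that two spellings of the same license URI in one element count as a single consistent license rather than as a conflict.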

Aggregation Algorithm

Ultimately any aggregator has to handle a lot of variation here, even with some best practice evening things out. Here's my first stab at the algorithm I'll code up for Ensemble:

1. Find a channel-level element that matches any of:
  • license
  • : license
  • : license
2. If none of the above found, try:
  • rss2: copyright
  • dc: rights
3. Parse the text content of the elements, and extract any URLs using regular expressions
  1. If there is a single consistent URL and it matches a known CC type, mark against CC dimension for browsing/filtering purposes
  2. If there is a single consistent URL but it does not match a known CC type, mark as "unknown license"
  3. If there are multiple inconsistent URLs, mark the channel as "mixed licensing"
4. Next, repeat steps 1 & 2 for all the items and extract URLs using regular expressions
  1. If the items have licenses, and they are not all consistent, mark the channel as "mixed licensing"
  2. If the items have licenses, and they are all consistent, and the channel has no license, set the channel license to the value of the item licenses and process as in Step 3
  3. If the items have licenses, and the channel has a license, and these are not all equal, mark the channel as "mixed licensing"
5. If neither channel nor items have any license information - even plaintext with no URLs - mark the channel as "unknown license"
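The steps above can be sketched roughly as follows in Python. This is one possible reading of the algorithm, not the Ensemble implementation. It assumes the license text has already been collected (steps 1 and 2) into a list of strings for the channel and a list of such lists for the items, that "known CC type" means one of the six standard license codes, and that items carrying no license of their own are simply ignored. All function names are hypothetical.

```python
import re

URL = re.compile(r"https?://[^\s<>]+")
CC_LICENSE = re.compile(r"^http://creativecommons\.org/licenses/([a-z\-]+)/")
KNOWN_CC = {"by", "by-sa", "by-nd", "by-nc", "by-nc-sa", "by-nc-nd"}


def normalise(url):
    """Make equivalent spellings of a URL compare equal (assumed rules)."""
    url = url.rstrip(".,;)").lower()
    url = re.sub(r"^https?://(?:www\.)?", "http://", url)
    return url.rstrip("/") + "/"


def urls_in(texts):
    """Step 3: the set of distinct URLs found in a list of text values."""
    return {normalise(u) for t in texts for u in URL.findall(t or "")}


def classify(urls):
    """Steps 3.1 to 3.3 and 5: map a set of URLs to a license label."""
    if not urls:
        return "unknown license"      # step 5: no URL at all
    if len(urls) > 1:
        return "mixed licensing"      # step 3.3
    (url,) = urls
    match = CC_LICENSE.match(url)
    if match and match.group(1) in KNOWN_CC:
        return match.group(1)         # step 3.1: known CC type
    return "unknown license"          # step 3.2


def channel_license(channel_texts, item_texts):
    channel_urls = urls_in(channel_texts)
    item_urls = set()
    for texts in item_texts:          # step 4
        item_urls |= urls_in(texts)
    if len(item_urls) > 1:
        return "mixed licensing"      # step 4.1
    if item_urls and not channel_urls:
        channel_urls = item_urls      # step 4.2: promote the item license
    elif item_urls and item_urls != channel_urls:
        return "mixed licensing"      # step 4.3
    return classify(channel_urls)
```

For example, channel_license([], [["http://creativecommons.org/licenses/by/3.0/"]]) promotes the item license to the channel and yields "by", while a channel whose only rights statement is plain text with no URL ends up as "unknown license".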
