December 14, 2010
Parsing CC license information in different feed formats
This is something I wrote quite a long time ago as part of some help I was giving the STEEPLE project and as advice to the UK OER programme. I'm sure other people have run into this same issue, so I'm posting it here. You can see it in use on the Ensemble prototype.
1. Elements
Currently we have several different elements to choose from when placing our CC license:
- http://creativecommons.org/ns#license license
- http://web.resource.org/cc/ license
- http://backend.userland.com/creativeCommonsRssModule license
- RSS 2.0 copyright
- DC rights
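As a rough illustration of looking for these elements, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes an RSS 2.0 feed; the Dublin Core namespace URI, the `license_elements` helper, and the sample feed are my own additions for illustration and are not taken from any Ensemble code.

```python
import xml.etree.ElementTree as ET

# Assumption: "DC rights" means the Dublin Core elements namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"

# Candidate elements in ElementTree's "{namespace}name" notation.
LICENSE_TAGS = [
    "{http://creativecommons.org/ns#}license",
    "{http://web.resource.org/cc/}license",
    "{http://backend.userland.com/creativeCommonsRssModule}license",
    "copyright",  # RSS 2.0 core element, no namespace
    "{%s}rights" % DC_NS,
]

def license_elements(parent):
    """Return the direct children of `parent` that may carry a license."""
    return [child for child in parent if child.tag in LICENSE_TAGS]

# Hypothetical sample feed, for illustration only.
feed = ET.fromstring("""<rss version="2.0"
    xmlns:creativeCommons="http://backend.userland.com/creativeCommonsRssModule">
  <channel>
    <title>Example album</title>
    <copyright>Licensed under http://creativecommons.org/licenses/by/2.0/uk/</copyright>
    <creativeCommons:license>http://creativecommons.org/licenses/by/2.0/uk/</creativeCommons:license>
  </channel>
</rss>""")

channel = feed.find("channel")
found = [el.tag for el in license_elements(channel)]
```

The same helper can be called on an `item` element, since it only inspects direct children of whatever it is given.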
2. Placement
There is licensing of items and of feeds; for Ensemble the main interest is the feed license, as we're dealing with "albums" rather than arbitrary collections. However, we will have to deal with situations where items are of mixed licenses (see below).
3. Content
Then there is the issue of what content to use here. I personally don't mind as long as there is a valid CC URI that can be extracted from the text using regular expressions. Others prefer declaring the license conditions using RDF. There is also some debate as to whether these elements should contain attributes specifying the URI, or whether the URI should be placed within the text content, such as "Licensed under a Creative Commons Attribution - NonCommercial-ShareAlike 2.0 Licence - see http://creativecommons.org/licenses/by-nc-sa/2.0/uk/". I think ultimately we're going to have to agree some common practice here - I suggest making sure the CC URI is somewhere in the text content of the element. Coping with both text and RDF content for CC is way too taxing; also, the RDF content I've seen is redundant as it basically sets out what is already meant by the CC URI.
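To make the "URI somewhere in the text content" idea concrete, here is a minimal sketch of pulling CC URIs out of free text with a regular expression. The pattern and the `extract_cc_uris` helper are assumptions of mine, not an agreed convention; the pattern only covers URIs under `creativecommons.org/licenses/` and `creativecommons.org/publicdomain/`.

```python
import re

# Assumed pattern: license and public-domain URIs on creativecommons.org.
CC_URI = re.compile(
    r"https?://creativecommons\.org/(?:licenses|publicdomain)/[A-Za-z0-9./-]+"
)

def extract_cc_uris(text):
    """Return the distinct CC URIs found in `text`, in order of appearance."""
    seen = []
    for match in CC_URI.findall(text or ""):
        uri = match.rstrip(".")      # drop a sentence-ending full stop
        if not uri.endswith("/"):
            uri += "/"               # normalise so equal licenses compare equal
        if uri not in seen:
            seen.append(uri)
    return seen

text = ("Licensed under a Creative Commons Attribution - "
        "NonCommercial-ShareAlike 2.0 Licence - see "
        "http://creativecommons.org/licenses/by-nc-sa/2.0/uk/")
uris = extract_cc_uris(text)
```

An element whose text yields an empty list has no recognisable CC URI, and one that yields more than one URI is internally inconsistent.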
Aggregation Algorithm
Ultimately any aggregator has to handle a lot of variation here, even with some best practice evening things out. Here's my first stab at the algorithm I'll code up for Ensemble:
1. Find a channel-level element that matches any of:
- http://creativecommons.org/ns#license: license
- http://web.resource.org/cc/ : license
- http://backend.userland.com/creativeCommonsRssModule : license
- rss2: copyright
- dc: rights
- If there is a single consistent URL and it matches a known CC type, mark against CC dimension for browsing/filtering purposes
- If there is a single consistent URL but it does not match a known CC type, mark as "unknown license"
- If there are multiple inconsistent URLs, mark the channel as "mixed licensing"
- If the items have licenses, and they are not all consistent, mark the channel as "mixed licensing"
- If the items have licenses, and they are all consistent, and the channel has no license, set the channel license to the value of the item licenses and process as in Step 3
- If the items have licenses, and the channel has a license, and these are not all equal, mark the channel as "mixed licensing"
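The rules above could be sketched roughly as follows. This is one possible reading, not the Ensemble implementation: it assumes the license URIs have already been extracted from the channel and from each item, it compares URIs literally (so different versions or jurisdictions of the same license count as inconsistent), and it returns `None` when no license is found anywhere, a case the rules do not cover. The names `classify`, `channel_license`, and the `KNOWN` pattern are hypothetical.

```python
import re

MIXED = "mixed licensing"
UNKNOWN = "unknown license"

# Assumed definition of a "known CC type": a license code in the URI path.
KNOWN = re.compile(
    r"^https?://creativecommons\.org/licenses/"
    r"(by|by-sa|by-nd|by-nc|by-nc-sa|by-nc-nd)/"
)

def classify(uris):
    """Map a set of URIs to a CC type, UNKNOWN, MIXED, or None if empty."""
    if not uris:
        return None                # assumption: no license information at all
    if len(uris) > 1:
        return MIXED               # multiple inconsistent URLs
    match = KNOWN.match(next(iter(uris)))
    return match.group(1) if match else UNKNOWN

def channel_license(channel_uris, item_uris):
    """channel_uris: URIs found at channel level.
    item_uris: one list of URIs per item (empty list if the item has none)."""
    channel = set(channel_uris)
    items = set(uri for uris in item_uris for uri in uris)
    if len(items) > 1:
        return MIXED               # items are not all consistent
    if items and not channel:
        return classify(items)     # channel inherits the item license
    if items and channel != items:
        return MIXED               # items and channel are not all equal
    return classify(channel)

BY = "http://creativecommons.org/licenses/by/2.0/uk/"
BY_NC_SA = "http://creativecommons.org/licenses/by-nc-sa/2.0/uk/"

results = [
    channel_license([BY], []),
    channel_license([], [[BY_NC_SA], [BY_NC_SA]]),
    channel_license([BY], [[BY], [BY_NC_SA]]),
    channel_license(["http://example.org/terms"], []),
]
```

Items that declare no license are ignored when checking consistency here; treating them as a mismatch instead would be an equally defensible choice.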