Look ma, no metadata forms

It has long been recognised that adding meaningful descriptions is one of the more unpopular and error-prone aspects of creating learning objects. One solution that has been touted for a long time is to leave the chore to the machines, but practical implementations have been relatively rare. There is now an open source solution from one of the founders of the Learning Object Metadata (LOM) standard: Ariadne.

Always a subject guaranteed to generate heated debate, authoring decent descriptions of learning materials is a problem with roughly three kinds of solution: leave it to the learning object's author, hire a professional (i.e. a librarian), or get a program to do it. Some wags will say there's a fourth solution (bung it on the web and leave it to Google) but many others will counter that this isn't enough to manage and find resources efficiently.

Solution one has the advantage that the author of the learning material knows it best, and is therefore in a good position to describe it. The trouble is that good learning material authors are not necessarily experts in information science, and a good many just aren't that interested either. Librarians are, but they're a precious commodity with limited time, and aren't necessarily familiar with the subject of the material. So if as much as possible of both kinds of expertise could be captured in software, the problem would become a lot more tractable.

The Automatic Metadata Generation project

The European Ariadne project was instrumental in designing what later, via IMS, became the IEEE LOM standard. One of the Ariadne partners in the original LOM project, the computer science department at K.U. Leuven led by Erik Duval, has now come up with a web-service-based system that aims to automate the generation of these LOM records as much as possible.

As outlined in a presentation at the Learning Technologies Workshop (WS-LT) meeting of the CEN/ISSS (the European information technology specification body) in Oslo yesterday, the system is one of the main weapons in Erik's ongoing war against metadata tagging forms.

The principle of the Automatic Metadata Generation (AMG) system is simple: one part sucks as much information out of a piece of content as possible, the other looks at the object's context for any further clues. The results are then spat out either as a web page with nicely tabulated information, or as a record in a number of machine-readable formats, including IEEE LOM XML.

The bit that interrogates the content relies on the fact that a number of clues are inherent in any computer file such as a PowerPoint presentation or a PDF document. There's the file format itself, for starters, and also the language of any text in it. As many a politician or marketing person has found out the hard way, there's also a fair amount of information about the author embedded in many ordinary office formats.
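One of the simplest content-based clues, the file format, can often be read straight from a file's leading "magic" bytes. The following is a minimal illustrative sketch of that idea, not AMG's actual code; the signature table covers just a few common formats.

```python
# Guess a file's MIME type from its leading "magic" bytes.
# A tiny, illustrative signature table -- real tools know hundreds.
MAGIC_NUMBERS = {
    b"%PDF": "application/pdf",
    b"\xd0\xcf\x11\xe0": "application/vnd.ms-office",  # legacy PowerPoint/Word
    b"PK\x03\x04": "application/zip",  # also OOXML and ODF containers
}

def sniff_format(data: bytes) -> str:
    """Return the first matching MIME type, or a generic fallback."""
    for magic, mime in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return mime
    return "application/octet-stream"

sniff_format(b"%PDF-1.4\n...")  # -> "application/pdf"
```

Language detection and author fields embedded in office documents work along similar lines: the information is already in the file, it just has to be dug out.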

The cleverer and more crucial bit, though, is the interrogation of the learning object's context. This relies partially on the fact that most people will not store their material in some arbitrary place in a repository, a Virtual Learning Environment (VLE) or their own file system. As long as the AMG system knows the structure of the place, and what the structure means, it can make an educated guess at the kind of subjects, courses or learners for which a piece of content is intended.
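To make the idea concrete, here is a hypothetical sketch of structure-based guessing. It assumes a repository laid out as `/faculty/course/week/file`; the layout and field names are invented for illustration and are not AMG's actual mapping.

```python
# Derive contextual metadata from a resource's storage path,
# assuming a known layout of /faculty/course/week/file.
def metadata_from_path(path: str) -> dict:
    parts = [p for p in path.strip("/").split("/") if p]
    fields = ["faculty", "course", "week"]  # invented layout
    # zip() stops at the shorter sequence, so the filename is ignored
    return dict(zip(fields, parts))

metadata_from_path("/medicine/anatomy101/week3/slides.ppt")
# -> {'faculty': 'medicine', 'course': 'anatomy101', 'week': 'week3'}
```

Nothing here is inferred from the content itself; the storage convention alone yields a plausible subject, course and place in the curriculum.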

Other contextual clues are provided by the fact that people don't normally work on any random topic in a void. Educators or learning material specialists are part of relatively constant communities of practice and institutions, and once that kind of information is thrown into the pot, the results can be made much more accurate.

The AMG system in practice

Whichever way it is used (see below), the system starts its job either by being fed a file, or by being pointed at some resource on the network. This choice immediately shows that what AMG does is not magic: it can't know information that isn't there, much less stick it into a metadata record.

The version on the KU Leuven web site, for example, accepts any kind of file, and the system knows about the metadata embedded in JPEG pictures. Pick a typical picture from your PC, though, and it will have almost nothing to work on; little from the object, and nothing from the context. Use a link into one of the known web-accessible repositories (e.g. MIT's OpenCourseWare initiative), and you get a pretty full record, mostly because someone at MIT has already done a fair amount of categorisation.

That means the real strength of the system lies in its ability to, first, make whatever information is available go a long way and, second, consolidate that information in a correctly formatted, standard metadata record.

For best results, then, some initial assembly may be required. Fortunately, most of the assembly can be done once, and can be shared. Also, any effort put in will likely pay off, in time saved repeating the same information over and over again and in pinpointing that one relevant object much more easily when needed.

Behind the scenes, in the software

In order to make any kind of information go a long way, the KU Leuven team had to break the problem down into smaller, isolated parts. For that reason AMG is factored into a number of different components.

Every file type, for instance, has its own text extractor that feeds into a general text extractor class. If a new file type comes along, a new extractor can be written and shared across implementations.
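The pattern described here can be sketched as a base class plus a registry of per-type subclasses; the class and method names below are illustrative, not AMG's actual (Java) API.

```python
# One extractor subclass per file type, registered against its
# file extension so new types can be plugged in without touching
# existing code.
class TextExtractor:
    """Base class; subclasses each handle one file type."""
    registry = {}

    def __init_subclass__(cls, extension=None, **kwargs):
        super().__init_subclass__(**kwargs)
        if extension:
            TextExtractor.registry[extension] = cls

    def extract(self, data: bytes) -> str:
        raise NotImplementedError

class PlainTextExtractor(TextExtractor, extension=".txt"):
    def extract(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")

def extract_text(filename: str, data: bytes) -> str:
    """Dispatch to the registered extractor; empty string if unknown."""
    ext = "." + filename.rsplit(".", 1)[-1].lower()
    cls = TextExtractor.registry.get(ext)
    return cls().extract(data) if cls else ""
```

Adding support for, say, PDF then means writing one new subclass rather than modifying the dispatcher.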

Much the same goes for the fairly crucial context-based indexers that extract as much information as possible from a file system, VLE or repository. These three are the main classes, but every instance needs its own indexer class or subclasses. Each of these needs its own mapping from the information available in the context to the desired metadata output; if your metadata profile requires a particular element, some way has to be found to extract it from the storage system.
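Such a context-to-metadata mapping can be as simple as a lookup table from the fields a given storage system exposes to the elements the metadata profile requires. The field and element names below are invented examples, loosely modelled on LOM element paths.

```python
# Map fields exposed by a (hypothetical) VLE onto LOM-style
# metadata elements; each storage system would need its own table.
VLE_TO_LOM = {
    "course_title": "general.title",
    "instructor": "lifecycle.contribute.entity",
    "course_language": "general.language",
}

def map_context(context: dict) -> dict:
    """Keep only the fields the profile knows how to use."""
    return {VLE_TO_LOM[k]: v for k, v in context.items() if k in VLE_TO_LOM}

map_context({"course_title": "Anatomy 101", "room": "B1"})
# -> {'general.title': 'Anatomy 101'}
```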

That looks like a potentially fairly major coding job, but nowhere near as much as writing a whole system from scratch. And, as before, once written, it is deployable in any installation of AMG, give or take some fiddling with the context-to-metadata-element mapping. There is, for example, already a Blackboard building block.

What's more, the whole indexing system has been written into a single API, wrapped in a web service. That way, indexers can be written for platforms other than AMG's Java. Erik indicated, for example, that the Python/Zope-based, open source FLE3 VLE has already been integrated successfully with AMG. The API also means that it can be implemented any number of ways: as a hosted solution (ASP), as an institutional service or even on the desktop.

At the processing and output end of things, the secret seems to lie mostly in a 'confidence value' that is pre-assigned to each source of information before it goes into the pot. This is then used to resolve conflicts between different sources. Some tweaking and constraining of known possible values for particular elements also helps. A nice utility that would centralise this kind of tuning across the system would be a good project for someone, though.
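A minimal sketch of confidence-based conflict resolution, assuming (the source doesn't spell this out) that the highest-confidence candidate simply wins per element; the weights below are invented.

```python
# Resolve conflicts between sources: each candidate value carries a
# pre-assigned confidence, and the highest confidence wins per element.
def resolve(candidates):
    """candidates: list of (element, value, confidence) triples."""
    best = {}
    for element, value, conf in candidates:
        if element not in best or conf > best[element][1]:
            best[element] = (value, conf)
    return {element: value for element, (value, _) in best.items()}

resolve([
    ("language", "en", 0.9),   # e.g. from text analysis
    ("language", "nl", 0.4),   # e.g. from an institutional default
    ("author", "J. Smith", 0.7),
])
# -> {'language': 'en', 'author': 'J. Smith'}
```

Constraining elements to known vocabularies would then be a filtering step on top of this, discarding candidate values that fall outside the allowed set.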

Once the known bits of information have been gathered and conflicts resolved, filtering or transforming the output into the desired metadata record format looks fairly straightforward.

In short

Because of the way it is designed, the different ways in which it can be implemented, and because it is open source, AMG looks extensible enough to be adaptable to a pretty wide variety of situations. Any effort put into optimising the system for a particular situation pays off in reusable code and better results. For best results, that effort would emphatically include input from learning material authors or librarians, to augment what has been harvested from elsewhere. And yes, they'll have to use forms.

Resources

Documentation, publications, a development blog and a 'try it yourself' service are available from the Automatic Metadata Generation (AMG) site.

Source code for the AMG system and the Blackboard building block is available from SourceForge.

copyright cetis.ac.uk
Creative Commons License This work is licensed under a Creative Commons License.
