Standards bodies face call for more language codes
Scott Wilson, CETIS staff
September 06, 2001

The standard language taxonomies used widely today are coming under increasing criticism from a number of industries. Existing schemes support only a fraction of the languages and dialects in use around the world, which poses a problem for many applications, including learning technology.

The identification of languages is critical in the area of learning technology standards. We need to be able to identify, using metadata, the language used by a resource. There is also the language of the intended end-user to consider (for example, a resource may be designed to teach an English-speaking person French). We also need to be able to create metadata in various languages, and to be able to label that metadata so that repositories, search engines and readers can identify the language used.

Language definitions also form an integral part of markup languages such as XML, used to define the majority of learning technology standards.
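
As a rough illustration of language labelling in XML, the sketch below uses the standard xml:lang attribute to tag a title in two languages and reads the labels back with Python's standard library. The element names (resource, title), the sample record and the titles_by_language helper are invented for this example and do not come from any particular learning technology specification.

```python
import xml.etree.ElementTree as ET

# Name under which ElementTree exposes the predeclared xml:lang attribute.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical metadata record; the element names are illustrative only.
RECORD = """
<resource>
  <title xml:lang="en">Beginning French</title>
  <title xml:lang="fr">Le français pour débutants</title>
</resource>
"""


def titles_by_language(xml_text):
    """Return a dict mapping each xml:lang value to the title text it labels."""
    root = ET.fromstring(xml_text)
    return {title.get(XML_LANG): title.text for title in root.iter("title")}


if __name__ == "__main__":
    for lang, text in titles_by_language(RECORD).items():
        print(lang, text)
```

A repository or search engine could use labels of this kind to choose which title to show a given reader.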

The most common methods of denoting languages are the two- and three-letter codes defined in the ISO 639 standard, which provides "codes for the representation of names of languages" (ISO 639, ISO/FDIS 639-1, ISO 639-2). However, the standard provides codes for only between 200 and 400 languages (depending on which version of the standard you are using), while draft proposals now call for the adoption of schemes that identify 7,000 or even 70,000 languages and dialects.
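
To make the two- and three-letter codes concrete, here is a minimal sketch of a lookup table holding a handful of ISO 639-1 and ISO 639-2 codes. The Iso639Entry type and the lookup helper are hypothetical names chosen for this example, and the table is a tiny hand-picked sample, not the full standard.

```python
from typing import NamedTuple, Optional


class Iso639Entry(NamedTuple):
    alpha2: str  # ISO 639-1 two-letter code
    alpha3: str  # ISO 639-2 three-letter code (terminology form)
    name: str


# Small illustrative sample. Some languages also have a second,
# bibliographic three-letter code (e.g. "fre", "ger") not listed here.
_SAMPLE = [
    Iso639Entry("en", "eng", "English"),
    Iso639Entry("fr", "fra", "French"),
    Iso639Entry("de", "deu", "German"),
    Iso639Entry("ga", "gle", "Irish"),
]

# Index every entry under both its two-letter and three-letter code.
_INDEX = {}
for _entry in _SAMPLE:
    _INDEX[_entry.alpha2] = _entry
    _INDEX[_entry.alpha3] = _entry


def lookup(code: str) -> Optional[Iso639Entry]:
    """Return the entry for a two- or three-letter code, or None if unknown."""
    return _INDEX.get(code.strip().lower())


if __name__ == "__main__":
    print(lookup("ga"))
    print(lookup("FRA"))
    print(lookup("xx"))
```

Any language or dialect without an entry in the table returns None, which is the coverage gap at issue here.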

One proposal calls for codes supporting representation of the language along at least five axes: "geog (geographical specification), script (writing system), temp (temporal specification), socli (sociolinguistic specification), and style (stylistic specification)."

With a more complex scheme of this kind you could, for example, identify a text as being written in "20th Century Colloquial Irish English". This flexibility could be very useful for archives of historical documents, where there is considerable demand for better language classification, but may be seen as overly complex for many applications.
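
One way to picture an identifier built on the five proposed axes is as a simple record with one optional field per axis. The sketch below is illustrative only: the LanguageDescriptor class, the base field, the field values and the label format are all assumptions, and the proposal's actual syntax is not reproduced here. Placing "colloquial" on the style axis is likewise a guess.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageDescriptor:
    base: str                     # base language, here an ISO 639 code
    geog: Optional[str] = None    # geographical specification
    script: Optional[str] = None  # writing system
    temp: Optional[str] = None    # temporal specification
    socli: Optional[str] = None   # sociolinguistic specification
    style: Optional[str] = None   # stylistic specification

    def label(self) -> str:
        """Join the base code and any axes that are set into one string."""
        axes = (
            ("geog", self.geog),
            ("script", self.script),
            ("temp", self.temp),
            ("socli", self.socli),
            ("style", self.style),
        )
        parts = [f"{name}={value}" for name, value in axes if value is not None]
        return ";".join([self.base] + parts)


# Hypothetical encoding of "20th Century Colloquial Irish English".
example = LanguageDescriptor(
    base="en", geog="Ireland", temp="20th century", style="colloquial"
)

if __name__ == "__main__":
    print(example.label())
```

Leaving every axis optional lets a simple application record just the base code, while an archive can fill in as many axes as it needs.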

For a detailed examination of the language identification issue, take a look at "Language Identifiers in the Markup Context" at XML Cover Pages.