
Multilingual Provision by Subject Gateways

Marianne Peereboom describes approaches to providing multilingual support in subject gateways.

Introduction

With support from the EU's Telematics Application Programme, phase 1 of the DESIRE I project (1996-1998) aimed to enable and enhance large-scale information networks for the research community. The 10 partners are continuing this work in DESIRE II (which runs until June 2000), focussing on distributed Web indexing, subject-based Web cataloguing, directory services and caching. One of the results of DESIRE I was a set of tools and guidelines to support Subject Gateways: services based on selection, description and classification of high quality networked resources, which emphasise the importance of skilled human involvement in the assessment and 'quality control' of their collections.

One of the objectives of DESIRE II is to enhance existing services and promote the development of new gateways using the DESIRE tools and guidelines. To support this an Information Gateways Handbook [1] has been published, for libraries wishing to set up their own information gateway on the Internet. The handbook covers strategic and information management issues as well as technical requirements. One of the things gateway managers have to decide is which multilingual facilities they want to include in their service. This article addresses some of those issues. It is based in part on the chapter about multilinguality in the Information Gateways Handbook.

Issues for information managers

Subject Gateways need to address the language needs of their audiences. Users may want to search a multilingual collection by using queries in one language or to retrieve documents in a number of specific languages, preferably also via an interface in the language of their choice. In some cases they may require some translation or summary in another language than that of the document. Ideally Subject Gateways should provide their users with the language support they need. In reality this will very likely be restricted, depending on the available technologies, the language skills of staff involved in selection and cataloguing, and last but not least, cost considerations.

Gateway managers will be confronted with various choices relating to the language support of the service they want to provide. Those choices for mono- or multilingual support present themselves at many different levels:

Scope and selection policy

The scope policy of a gateway outlines the subject areas and the types of resources covered by the gateway. This includes language and geographical parameters. To set language parameters for a gateway the following questions will have to be asked:

The choices made in this area directly determine the skills required of the staff responsible for selecting and/or cataloguing the resources as well as the choice of relevant authoring and access tools and software. For example: creating an information gateway that includes resources in all European languages would require input from a team mastering all those languages between them. If the cataloguing is done by a separate team, this would also have to consist of people with extensive language skills. Not many gateways will be able to manage such a broad coverage with an in-house team. A distributed model - as opposed to a centralised model, where the gateway is the responsibility of one organisation - could offer a solution, by getting input from a multinational team, located in various countries, providing their input via the WWW. In this case a multilingual development framework needs to be implemented, based on standards in information retrieval and exchange. SOSIG provides an interesting case study of such a model. As the core team of SOSIG consists of English native speakers with no other language skills, SOSIG created a network of European correspondents, who suggest resources in a number of other languages to SOSIG staff. Problems with this approach are that the service is dependent on the good will of unpaid staff, and that communication takes place almost exclusively in a virtual environment.

Data representation and resource description formats

A multilingual gateway would require the WWW software lying behind the gateway to cope with multilingual data handling, search, retrieval and display. Existing standards and recommendations provide a framework for multilingual support in data communications and in description formats and metadata [2].

The HTTP protocol, on which the Web is based, includes information about the type of the transferred information and the character encoding for text-based information. Based on the exchange of information between client (browser) and server (HTTP server) it is possible to provide character encoding and language negotiation between the information provider and the requester with regard to the accepted and preferred formats of the resources:

Content-Type: text/html; charset=euc-jp

The Content-Language entity header field describes the natural language(s) of the intended audience for the enclosed document:

Content-Language: se

If no Content-Language is specified, the default is that the content is intended for all language audiences.
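The language negotiation described above can be illustrated with a minimal sketch in Python. It assumes the browser sends an Accept-Language request header listing its preferred languages with optional q-values, and that the gateway knows which interface languages it can serve. The function names `parse_accept_language` and `choose_language` are hypothetical and are not part of any DESIRE tool.

```python
def parse_accept_language(header):
    """Parse an Accept-Language value into (tag, q) pairs, best first."""
    prefs = []
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        tag, _, params = item.partition(";")
        q = 1.0  # a missing q parameter means full preference
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs.append((tag.strip().lower(), q))
    # sorted() is stable, so equal q-values keep their header order
    return sorted(prefs, key=lambda pref: pref[1], reverse=True)


def choose_language(header, available, default="en"):
    """Pick the most preferred language that the gateway can serve."""
    available = [lang.lower() for lang in available]
    for tag, q in parse_accept_language(header):
        if q <= 0:
            continue
        if tag in available:
            return tag
        primary = tag.split("-")[0]  # fall back from "en-gb" to "en"
        if primary in available:
            return primary
    return default


# A gateway with a Dutch and an English interface (assumed for illustration)
print(choose_language("de;q=0.9, en-GB;q=0.8", ["en", "nl"]))
```

A production server would normally leave this to the negotiation facilities of the HTTP server itself; the sketch only shows the principle.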

It is also recommended to include information about the character encoding being used in the META information of the HTML document:

<META http-equiv="Content-Type" Content="text/html; charset=euc-jp">

Recent developments in XML provide facilities for defining/labelling the language of the whole document, entity or item by including language attributes in the corresponding tag. For example:

<p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
<p xml:lang="en-GB">What colour is it?</p>
<p xml:lang="en-US">What color is it?</p>
<sp who="Faust" desc='leise' xml:lang="de">
<l>Habe nun, ach! Philosophie,</l>
<l>Juristerei, und Medizin</l>
<l>und leider auch Theologie</l>
<l>durchaus studiert mit heißem Bemüh'n.</l>
</sp>
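As a rough illustration of how software might read these labels, the following Python sketch uses the standard library's ElementTree parser to report the language that applies to each element. An element without its own xml:lang attribute takes the value of its nearest labelled ancestor. The wrapping `<text>` element and the function name `languages` are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def languages(element, inherited=None):
    """Yield (tag, language) for an element and all its descendants."""
    lang = element.get(XML_LANG, inherited)
    yield element.tag, lang
    for child in element:
        yield from languages(child, lang)


doc = ET.fromstring(
    '<text xml:lang="en">'
    "<p>The quick brown fox jumps over the lazy dog.</p>"
    '<sp who="Faust" xml:lang="de"><l>Habe nun, ach! Philosophie,</l></sp>'
    "</text>"
)

for tag, lang in languages(doc):
    print(tag, lang)
```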

Although the default XML Character Set Encodings are UTF-8 and UTF-16 (which are encodings for ISO 10646 or UNICODE), specific encodings for XML documents can be defined in the initial XML declaration for the whole document or entity (which can be regarded as a separately stored part of the whole document), for example:

<?xml version="1.0" encoding="UTF-8"?>
<?xml version="1.0" encoding="ISO-8859-1"?>
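The effect of the encoding declaration can be sketched with Python's standard ElementTree module: the same element is serialised in two encodings, and the parser uses the declaration to decode each byte string back to identical text. This is only an illustration; the `xml_declaration` argument of `tostring` assumes Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

line = ET.Element("l")
line.text = "durchaus studiert mit heißem Bemüh'n."

# Serialise the same element twice, each time with an explicit declaration.
latin1_bytes = ET.tostring(line, encoding="iso-8859-1", xml_declaration=True)
utf8_bytes = ET.tostring(line, encoding="utf-8", xml_declaration=True)

# The bytes differ, but the parser reads the declaration and decodes
# each document accordingly, so both yield the same text.
same_text = (
    ET.fromstring(latin1_bytes).text
    == ET.fromstring(utf8_bytes).text
    == line.text
)
print(same_text)
```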

The Dublin Core metadata element set provides possibilities for defining the language of the intellectual content of the resource, the record and the labelling language of particular fields by means of assigning attributes to the relevant Dublin Core field.

Examples

DC.Language Format

An English resource

<meta name = "DC.Language"
content = "en">
<meta name = "DC.Language"
scheme = "rfc1766"
content = "en">

An American resource

<meta name = "DC.Language"
scheme = "rfc1766"
content = "en-US">

A Japanese resource

<meta name = "DC.Language"
content = "ja">

A German resource, catalogued in French

<meta name = "DC.Language"
lang = "fr"
content = "allemand">

Field content language labeling/attributing.

A work in Spanish may be assigned the following metadata:

<meta name = "DC.Language"
scheme = "rfc1766"
content = "es">
<meta name = "DC.Title"
lang = "es"
content = "La Mesa Verde y la Silla Roja">
<meta name = "DC.Title"
lang = "en"
content = "The Green Table and the Red Chair">
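A gateway that stores its records in a database might generate such tags automatically. The Python sketch below shows one possible way to do so for the Spanish example above; the helper name `dc_meta` is hypothetical, and attribute values are escaped so that quotation marks in a title cannot break the tag.

```python
from html import escape


def dc_meta(name, content, scheme=None, lang=None):
    """Render one Dublin Core element as an HTML meta tag."""
    attrs = [("name", "DC." + name)]
    if scheme:
        attrs.append(("scheme", scheme))
    if lang:
        attrs.append(("lang", lang))
    attrs.append(("content", content))
    rendered = " ".join(
        '%s="%s"' % (key, escape(value, quote=True)) for key, value in attrs
    )
    return "<meta " + rendered + ">"


record = [
    dc_meta("Language", "es", scheme="rfc1766"),
    dc_meta("Title", "La Mesa Verde y la Silla Roja", lang="es"),
    dc_meta("Title", "The Green Table and the Red Chair", lang="en"),
]
print("\n".join(record))
```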

 

Metadata and cataloguing rules

The metadata record will largely determine the search support a service will be able to provide. The more sophisticated the metadata format, and the more consistent the cataloguing practice, the more advanced the retrieval options you will be able to support. On the other hand: 'garbage in = garbage out'. Some investment in multilingual development software/authoring tools as well as effort on the cataloguing side is necessary.

Traditional library practice is to create one record for one resource. On the Internet the question is what exactly constitutes a resource - the granularity issue. This is also relevant to language issues. Do you include only complete versions of the document, or do you also register parts of a site that are available in another language? If so, how substantial does the translated section have to be? A related issue is the problem of whether to create a separate record for each language version. For books this has been traditional practice; the translation of a book will get its own cataloguing record. For the Internet environment, it may be worthwhile to store information about different language versions in one record, as long as the fields relating to one version are linked in some way. It will be less labour-intensive to keep one record up to date, and there is no need to maintain a system of cross-references between language versions in order to keep track of different versions of one document.
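The idea of one record with linked fields per language version can be sketched as a simple data structure. The Python example below is an assumption about how such a record might look, not a description of any existing gateway's format; the class names, field names and example.org addresses are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Variant:
    """The fields that belong to one language version of a resource."""
    language: str  # language code of this version
    title: str
    uri: str


@dataclass
class Record:
    """One record for one resource, holding all its language versions."""
    description: str
    variants: list = field(default_factory=list)

    def languages(self):
        return [variant.language for variant in self.variants]

    def variant_for(self, preferred):
        """Return the first version matching the user's language preferences."""
        for lang in preferred:
            for variant in self.variants:
                if variant.language == lang:
                    return variant
        # fall back to the first ('original') version, if there is one
        return self.variants[0] if self.variants else None


record = Record(
    description="Hypothetical bilingual site",
    variants=[
        Variant("nl", "Voorbeeld", "http://www.example.org/nl/"),
        Variant("en", "Example", "http://www.example.org/en/"),
    ],
)
print(record.variant_for(["fr", "en"]).uri)
```

Because title, language and address are kept together per version, adding or removing a translation touches only one entry in the record.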

Cataloguing rules relating to language may include:

Cross-language Information Retrieval (CLIR)

Cross-language information retrieval (CLIR) is the ability to formulate a query in one natural language and retrieve documents in languages other than that of the query [3]. The main approaches are defined by Peters and Picchi [4] as:

The first two approaches are the most relevant for Subject Gateways.

1. Text-translation via machine translation techniques

For cross-language information retrieval, machine translation of the documents does not seem the most realistic option, because of the costs (and the fact that some aspects are redundant for CLIR, like treatment of word order). More feasible is the translation of the query into the language(s) of the documents. Retrieved documents may then be translated for the user, if required, a service that Alta Vista currently provides. It would be possible to add this service to an information gateway. Although results of machine translation are often far from perfect, readers may prefer a flawed translation of a document they cannot read to none at all.

2. Knowledge-based techniques

First attempts involved matching the query to the document using machine-readable dictionaries. The best results have been reached with thesaurus-based approaches. The drawback is that thesaurus construction and maintenance is expensive, and training is required for optimum usage. In the case of thesaurus-based controlled vocabulary indexing and searching a set of monolingual thesauri is used which all map to a common system of concepts. Instead of the labour intensive manual assignment of thesaurus terms by indexers, research is being carried out in the area of (semi-)automatic assignment of terms. Thesauri may also form the basis for the more complex cross-language free text searching, where the query must be mapped to possible terms in the language(s) of the documents. ISO 5964 recognizes three approaches to the construction of multilingual thesauri:

Although some gateways use thesauri for subject access (OMNI) [5] or to provide the user with additional assistance in the choice of search terms (SOSIG), little or no use has been made by gateways of the potential of using the thesaurus for multilingual retrieval.
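The thesaurus-based approach, in which monolingual thesauri all map to a common system of concepts, can be sketched in a few lines of Python. A query term is looked up in the thesaurus of its own language, mapped to a concept, and then expressed in the languages of the documents. The concept identifiers, the terms and the function names below are invented for illustration and do not come from any real thesaurus.

```python
# concept identifier -> preferred term per language (all hypothetical)
CONCEPTS = {
    "C001": {"en": "libraries", "nl": "bibliotheken", "de": "Bibliotheken"},
    "C002": {"en": "law", "nl": "recht", "de": "Recht"},
}


def build_index(concepts):
    """Map (language, lower-cased term) to its concept identifier."""
    index = {}
    for concept_id, terms in concepts.items():
        for lang, term in terms.items():
            index[(lang, term.lower())] = concept_id
    return index


def translate_term(term, source_lang, target_langs, concepts=CONCEPTS):
    """Translate a query term via the shared concept, if it is known."""
    concept_id = build_index(concepts).get((source_lang, term.lower()))
    if concept_id is None:
        return {}  # term is not in the controlled vocabulary
    terms = concepts[concept_id]
    return {lang: terms[lang] for lang in target_langs if lang in terms}


print(translate_term("Bibliotheken", "nl", ["en", "de"]))
```

A real thesaurus would also need synonyms, broader and narrower terms, and a policy for terms that have no exact equivalent in another language, which is where much of the construction and maintenance cost lies.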

Classification schemes and keywords

If resources are classified using the numerical code from a classification scheme which is available in more than one language, this enables language-independent search as well as the possibility to create a browsing structure in more than one language. It will also be relatively easy to add a new language to the browsing structure later, without having to update the individual records. The languages in which a classification scheme is available, and whether it is feasible to translate the scheme into new languages when the need arises, may influence the decision for a particular scheme.
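The following Python sketch illustrates why a numerical classification code makes the browsing structure language-independent. The records carry only the code; the captions are kept in a separate table per language, so a new language can be added without touching the records. The codes, captions and titles are hypothetical and are not taken from any particular scheme.

```python
# Records store only the classification code (hypothetical data)
RECORDS = [
    {"title": "Example law portal", "class": "34"},
    {"title": "Example mathematics archive", "class": "51"},
    {"title": "Example legal database", "class": "34"},
]

# Captions of the scheme, one table per language
LABELS = {
    "en": {"34": "Law", "51": "Mathematics"},
    "nl": {"34": "Recht", "51": "Wiskunde"},
}


def browse(records, labels, lang):
    """Group record titles under the class captions of one language."""
    tree = {}
    for record in records:
        code = record["class"]
        caption = labels[lang].get(code, code)  # show the bare code if untranslated
        tree.setdefault(caption, []).append(record["title"])
    return tree


# Adding a language later only requires a new caption table
LABELS["de"] = {"34": "Recht", "51": "Mathematik"}
print(browse(RECORDS, LABELS, "de"))
```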

Keywords may be added to the resource description in any language. Also in this case a consistent policy will enhance retrieval possibilities. Keywords can be added:

Keywords may be uncontrolled (for instance derived from the document itself) or chosen from a controlled vocabulary. When available in more than one language this will provide opportunities for searching documents in various languages with a query in one language.

In general users should be made aware of the consequences of the way they formulate their queries. This is easier said than done, if you want to avoid extensive help files or cluttered interfaces. For example: a simple query (all fields) in French may retrieve a document with this word in the title, but it won't result in any hits in the description field, if the descriptions are in English. As is well known, users are not very keen on reading help pages, so the search interface design should aim to present the language options in a clear and intuitive way.

User interface

To provide a bilingual interface seems to be the easiest part of providing multilingual support. Still some questions should be considered in relation to the language(s) of the interface and the choice for a mono- or multilingual interface.

The expected language skills of the target audience will be of major importance. This will be easier to determine if the gateway wants to serve a well defined language community rather than a broad heterogeneous audience. Staff will need to have the necessary skills to provide and maintain pages in more than one language. If not, there will be extra costs for third party assistance, for instance a translation service. The gateway manager will have to balance the extra cost of creating and maintaining a multilingual interface against the benefit for the users. It should also be considered which multilingual browsing or search support can be offered in addition to the multilingual interface. For instance: is the classification scheme available in all languages of the interface, so the browsing structure can also be generated in those languages?

Current practice

Existing gateways in general don't have much to offer yet in terms of multilingual support. Quite a few gateways - at least if they are not based in either the UK or the US - do have a bilingual interface: mostly English and the language of the gateway's 'home' country. More sophisticated facilities, like multilingual search and/or browse support are not often available.

Subject gateways hardly ever describe the extent of their provisions in a detailed way, so it is difficult to assess what exactly they have to offer. In a report produced as part of the DESIRE I project in 1997 [6], an assessment is given of a number of services: gateways, but also directories like Yahoo and robot based services. The gateways included in this survey are ADAM (art), OMNI (medical information), EEVL (engineering), ARGUS Clearinghouse (multidisciplinary) and DutchESS (multidisciplinary). Of these only DutchESS is based in a non-English speaking country (The Netherlands).

The main conclusion from this review was that there was considerable inconsistency in the way existing services deal with language issues. Not only did different gateways vary in their policies, there was also a lot of inconsistency within individual gateways. For example, titles are sometimes displayed in the language of the resource, and sometimes only in English, and when resources are available in more than one language this is only sometimes mentioned. Some Internet search engines also offer a form of multilingual support, such as interfaces in various languages, localised search by country usually based on domain name, or automatic translation (such as Alta Vista's Babelfish, based on the Systran translation system).

The table below gives the multilingual provisions of four services based in four European countries: DutchESS, SOSIG, the Finnish Virtual Library and the SSG-FI. It appears that the conclusions from the DESIRE report are still valid. Although all these services have some multilingual provisions, there seem to be no gateways with a sophisticated and consistent multilingual policy, including possibilities for multilingual retrieval.

DutchESS
http://www.kb.nl/dutchess/

home country: The Netherlands
scope policy: Resources in all languages are accepted provided the DutchESS subject specialist has the necessary language skill to evaluate the resource
language(s) of interface: Dutch and English
language(s) of browsing structure: Dutch and English
cataloguing rules:
  • title: titles of all language variants in one field, separated by "="
  • description in English only; language information in description
  • classification in Dutch and English; no keywords
  • URIs of all language variants are given

Finnish Virtual Library
http://www.jyu.fi/library/virtuaalikirjasto/

home country: Finland
scope policy: The FVL is a distributed service, different organisations are responsible for creating the Virtual Library for a certain subject field. Scope is determined by the needs of the frame organization and users - so this does allow for the selection of resources in any language, but most resources are in English, Finnish and Swedish
language(s) of interface: Finnish, with parts in English
language(s) of browsing structure: Finnish and English
cataloguing rules:
  • title in language of document
  • description in language of document, sometimes Finnish translation added (or Swedish as Finland's second language)
  • encoding of document language in separate field according to ISO639
  • indexing: use of thesauri and vocabularies used in the specific subject fields - in Finnish, English or both

SOSIG (Social Science Information Gateway)
http://sosig.esrc.bris.ac.uk/

home country: UK
scope policy: SOSIG accepts resources in any language providing it has been evaluated by a member of the SOSIG team (consisting of core staff and correspondents) who is fluent in that language. The quality may be determined from a translation as opposed to the original language version
language(s) of interface: English
language(s) of browsing structure: English
cataloguing rules:
  • All language versions are combined in one record, with numbered 'variant' fields for metadata about the various language versions
  • Title in first language of resource; titles of other language versions in "alternative title" field
  • Availability in more than one language is mentioned in the description
  • Transliterations of foreign-language titles (without accents or umlauts) are repeated in the keywords
  • URIs of all language versions are given, the first one that of the 'first' language of the resource
  • Language fields contain the language of the variants, coded on the basis of an authority file
  • A destination field contains the country in which the server is located

SSG-FI Special Subject Guides / Fachinformation
http://www.sub.uni-goettingen.de/ssgfi/index.html

home country: Germany
scope policy: no language restrictions
language(s) of interface: English (part also or only in German, especially help pages)
language(s) of browsing structure: English
cataloguing rules:
  • Language versions combined in one resource description
  • Title in language of resource
  • All languages mentioned in language field, encoded according to ISO 639
  • Keywords in English
  • Description in English
search support: possibility to specify language of documents in advanced search option
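One of the provisions listed above, repeating titles without accents or umlauts in the keywords, is easy to automate. The Python sketch below shows one possible approach using Unicode normalisation; it is an assumption about how this could be done, not a description of how SOSIG does it, and the function name `fold_accents` is hypothetical.

```python
import unicodedata


def fold_accents(text):
    """Return text with accents and umlauts removed from its letters."""
    # NFKD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


title = "Bibliothèque für Bemühen"
keywords = fold_accents(title)
print(keywords)
```

Note that this simply drops the diacritic, so "ü" becomes "u" rather than the "ue" a German reader might type, and that letters without a decomposition, such as "ß", are left unchanged. A gateway would have to decide which convention its users are most likely to expect.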

 

User needs

For the EULER project, which is building an integrated interface to mathematical resources, a user survey was carried out in 1998 to specify user needs [7]. One series of questions addressed users' expectations of multilingual features. Multilingual provisions in the user interface were rated very low. On the other hand, user surveys of the Finnish Virtual Library project showed that its users highly valued resource descriptions and help pages in their own Finnish language. One reason for this could be that mathematicians are used to communicating and publishing in English, while the audience of the FVL is broader and more heterogeneous in terms of language skills. In any case, it seems advisable to try to determine the language skills and needs of the target audience before deciding which language provisions to include in a gateway service.

Future work: the Reynard project

The Reynard project is currently being negotiated with the EU within the Information Society Technologies programme (Fifth Framework Programme). It is to start in January 2000 and will run for 2.5 years. The main objective of this project is to develop a European broker service which will give access to various European subject services. Partners are national libraries, research libraries which have acquired expertise in different areas of subject gateway development, library related technology centres and university computer centres.

In this context multilinguality issues will also have to be addressed. Existing tools for indexing and searching in a multilingual environment will be examined, taking into special consideration metadata, classification systems and controlled vocabularies and thesauri in different languages. The possibilities for efficient translations and mapping between various controlled language systems will be researched. This will result in a state of the art report, including recommendations for the Reynard service. This will be the input for some testbed activity in the areas of multilingual retrieval and multilingual data flows.

Conclusion

Multilinguality is a complex issue. Although a lot of technology has become available in recent years, many problems have yet to be solved. For the time being gateways will not be able to provide more than very basic facilities if they need to keep costs within acceptable limits. However, putting some effort into making consistent choices - based on user needs - concerning such issues as scope and selection policy, metadata and cataloguing, classification and subject indexing, as well as regarding the use of the appropriate technologies, may greatly enhance the language support a gateway will be able to provide. Any extra facilities will have their costs, though, in terms of extra initial effort, maintenance, required skills of staff and so on. Institutions providing subject gateways - as well as other services on the Internet - will have to decide in each case whether the benefits for their users outweigh the necessary efforts to provide them. Ongoing research, like that in the Reynard project, may open up new ways to deal with those issues in the future. Multilinguality remains one of the challenges that have to be addressed to be able to serve a multitude of language communities without creating a virtual tower of Babel.

Reader Response

If you have any comments on this article, please contact the editors (exploit-editor@ukoln.ac.uk).

References

  1. Information Gateways Handbook: A guide to creating high quality portals on the Internet,
    URL: <http://www.desire.org/html/subjectgateways/handbook/>
  2. i18n Multilingual Support in Internet/IT Applications [Overview], Yuri Demchenko,
    URL: <http://www.terena.nl/projects/multiling/>
  3. Cross-Language Information Retrieval Resources, Douglas Oard,
    URL: <http://www.clis.umd.edu/dlrg/clir/>
  4. Across Languages, Across Cultures: Issues in Multilinguality and Digital Libraries, Carol Peters and Eugenio Picchi, D-Lib Magazine, May 1997
    URL: <http://www.dlib.org/dlib/may97/peters/05peters.html>
    European mirror available at URL: <http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/may97/peters/05peters.html>
  5. OMNI (Organising Medical Networked Information),
    URL: <http://omni.ac.uk/>
  6. Developing Multilingual Subject Gateways (DESIRE I report), Emma Worsfold et al.,
    URL: <http://www.sosig.ac.uk/desire/lang/language.html>
  7. What are the Expectations and Needs of Users for the EULER System? Results of the EULER User Questionnaire (1998),
    URL: <http://www.emis.de/projects/EULER/Reports/pD11/>

Author Details

Marianne Peereboom
Library Research Department
Koninklijke Bibliotheek (National Library of the Netherlands)
PO Box 90407
NL-2509 LK The Hague
The Netherlands

URL: <http://www.kb.nl/>
Email: marianne@python.konbib.nl

Marianne Peereboom is employed as project co-ordinator at the Library Research Department of the Koninklijke Bibliotheek (the National Library of The Netherlands). She is responsible for future development of the Dutch national subject gateway DutchESS, is involved in the DESIRE projects and is project co-ordinator of the Reynard project.

For citation purposes:
Marianne Peereboom, "Multilingual Provision by Subject Gateways", Exploit Interactive, issue 3, October 1999
URL: <http://www.exploit-lib.org/issue3/multilingual-gateways/>