
DESIRE: Making the Most of the Web

Phil Cross, Nicky Ferguson, Tracey Hooper and Emma Place of the Institute for Learning and Research Technology, University of Bristol, provide an overview of the DESIRE II project.

Introduction

How can we make the World Wide Web a better tool for supporting the work of the research community in Europe? This is the question that the DESIRE project [1] has been tackling since 1996, by developing new Internet technologies and using them to create high quality online information services.

DESIRE is a large project funded by the European Union under the Telematics Applications Programme [2]. Now in its second phase, it involves a team of researchers [3] from four countries (the Netherlands, Norway, Sweden and the UK), working at ten institutions including national libraries, university research centres and providers of national research networks.

The project recognises that researchers were one of the first communities to make use of the Web in their work. The Web is used as an information medium - for publishing and disseminating research data and for locating information to support further research. It is also used as a communication medium - for the discussion of research issues, both within projects and in the wider arena.

We know that users of the Web, particularly researchers, still have concerns about the Web as a vehicle for finding and delivering high quality research information:

Users would like to see technology solve these problems for them. They would like some sort of guarantee of quality - and some friendly places to start from - "subject-based community centres". These are some of the issues that DESIRE is dealing with.

Subject Gateways

The DESIRE project has demonstrated that Internet gateways built by qualified subject experts using standard Web technologies can offer Internet users unparalleled levels of quality when searching the Internet. Distributed teams of academics, university librarians and professionals are now involved in the development of large-scale "subject gateways" across Europe. These gateways offer catalogues of Internet sites and resources, which can be searched or browsed by subject. They guide people quickly and effectively to the kinds of sites that can support academic and professional work and the kinds of sites that people know they can trust. These are the highest quality portals on the Internet.

These gateways use standard metadata formats and formal classification schemes, and apply strict selection criteria. The editors are all subject or information experts who can make an informed judgement about the quality of an Internet site, based on its semantic content. The gateways use standardised formats and technologies, which means they are all compatible and can interoperate. Examples of such gateways are SOSIG [4], EELS [5] and DutchESS [6].

DESIRE is looking at issues such as scalability and interoperability. Interoperability allows us to combine distributed elements to form an integrated infrastructure; that is, it allows cross-searching across different databases or catalogues of information, which may cover different subjects or use different protocols. This includes the use of forward knowledge, a mechanism for discovering what might be found in remote collections before searching them. This speeds up searching and avoids the network overload caused by unnecessary querying of remote catalogues. It also allows cross-browsing through different catalogues, creating virtual catalogues from constantly changing collections of information.
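As a rough illustration of forward knowledge in cross-searching, the sketch below keeps a small summary of the subject terms each remote catalogue is known to cover and queries only the catalogues whose summary mentions the search term. The gateway names, the summary contents and the `search_remote` callback are hypothetical; this is not the DESIRE implementation, only a minimal sketch of the idea.

```python
from typing import Callable, Dict, List, Set

# Hypothetical forward knowledge: the subject terms each remote
# catalogue is believed to cover (not real gateway data).
FORWARD_KNOWLEDGE: Dict[str, Set[str]] = {
    "social-science-gateway": {"economics", "sociology", "politics"},
    "engineering-gateway": {"mechanics", "electronics", "materials"},
    "general-gateway": {"economics", "electronics", "history"},
}


def select_catalogues(term: str, knowledge: Dict[str, Set[str]]) -> List[str]:
    """Return, in name order, the catalogues whose summary mentions the term."""
    wanted = term.lower()
    return sorted(name for name, terms in knowledge.items() if wanted in terms)


def cross_search(
    term: str,
    knowledge: Dict[str, Set[str]],
    search_remote: Callable[[str, str], List[str]],
) -> Dict[str, List[str]]:
    """Query only the catalogues that forward knowledge says are relevant."""
    results: Dict[str, List[str]] = {}
    for name in select_catalogues(term, knowledge):
        # search_remote stands in for whatever protocol the catalogue speaks.
        results[name] = search_remote(name, term)
    return results
```

In this sketch a search for "mechanics" contacts only one of the three catalogues, so the other two are spared an unnecessary remote query.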

DESIRE has also looked at mechanisms for quality assurance. In addition to the quality assurance that comes from the selection of resources by subject experts for the subject gateways, we have taken part in work on the use of metadata for quality labelling systems, in particular using RDF, the Resource Description Framework [7]. Quality ratings on a particular resource being viewed by the user could be made available from a third party ratings bureau, or quality metadata could be used in a ranking algorithm within a subject gateway search engine.

DESIRE is also developing a Web site recommendation system. Gateways such as SOSIG hope to attract communities of users within particular disciplines and we hope to be able to use the expertise of our users to recommend quality resources as an addition to the catalogues we produce. Authorised users can make recommendations including comments about particular sites.

Harvesting and Indexing Web Resources

DESIRE also looks at mechanisms for automatically trawling and indexing the Internet - helping to ensure greater relevance and higher quality than the Internet search engines usually provide.

The Combine [8] harvesting and indexing software was developed under DESIRE I. It gathers, parses and collects resources according to rules that specify which URLs or servers should be collected. For instance, the Harvester was used in the Nordic Web Index to create a distributed regional index containing all Web pages in the Nordic countries. It can also be used with the URLs that have been catalogued within a quality subject gateway, to index each site listed to a specified depth. This mechanism is used by the Social Science Information Gateway to create an additional database [9] of relevant, high quality pages.
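Indexing a catalogued site "to a specified depth" can be pictured as a breadth-first walk from the catalogued URL that stays on the same server and stops following links at a fixed depth. The sketch below is an assumption-laden illustration, not Combine's code: the `harvest_site` function is hypothetical, and the link structure is supplied as an in-memory dictionary instead of being fetched over HTTP.

```python
from collections import deque
from typing import Dict, List
from urllib.parse import urlsplit


def harvest_site(seed: str, links: Dict[str, List[str]], max_depth: int) -> List[str]:
    """Collect pages reachable from seed on the same host, up to max_depth links away.

    `links` maps each page URL to the URLs it links to; a real harvester
    would obtain these by fetching and parsing each page.
    """
    host = urlsplit(seed).netloc
    seen = {seed}
    queue = deque([(seed, 0)])
    pages: List[str] = []
    while queue:
        url, depth = queue.popleft()
        pages.append(url)
        if depth >= max_depth:
            continue  # depth limit reached: index the page but do not follow its links
        for target in links.get(url, []):
            # Stay within the catalogued site and avoid revisiting pages.
            if target not in seen and urlsplit(target).netloc == host:
                seen.add(target)
                queue.append((target, depth + 1))
    return pages


# Hypothetical link graph for a catalogued site.
LINKS = {
    "http://example.org/": ["http://example.org/a", "http://other.example/x"],
    "http://example.org/a": ["http://example.org/b"],
    "http://example.org/b": ["http://example.org/c"],
}
```

With a depth of 2, the walk over this graph collects the seed page plus the pages one and two links away, ignores the link to the other server, and never reaches the page three links deep.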

During DESIRE II, the software is being extended. In particular, its ability to index metadata is being improved to cover standards such as Dublin Core [10] and to take advantage of new formats such as RDF. Users should, for example, be able to search within a group of harvested data containing particular metadata fields and values, or the results of their searches could be divided into categories defined by the metadata they contain.
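The two uses of indexed metadata mentioned above (searching on a field and value, and dividing results into categories) might look something like the following sketch. The records, the field names (which loosely follow Dublin Core element names) and both functions are hypothetical illustrations and do not reflect how Combine stores or queries its data.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical harvested records with a few metadata fields each.
RECORDS = [
    {"url": "http://example.org/report1", "metadata": {"type": "report", "language": "en"}},
    {"url": "http://example.org/paper1", "metadata": {"type": "article", "language": "sv"}},
    {"url": "http://example.org/report2", "metadata": {"type": "report", "language": "nl"}},
    {"url": "http://example.org/home", "metadata": {}},
]


def search_by_metadata(records: List[dict], field: str, value: str) -> List[str]:
    """Return the URLs of records whose metadata field has exactly this value."""
    return [r["url"] for r in records if r["metadata"].get(field) == value]


def group_by_metadata(records: List[dict], field: str) -> Dict[str, List[str]]:
    """Divide records into categories according to the value of one metadata field."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for r in records:
        # Records lacking the field are kept together under a placeholder label.
        groups[r["metadata"].get(field, "(no value)")].append(r["url"])
    return dict(groups)
```

Calling `group_by_metadata(RECORDS, "type")` splits the four sample records into reports, articles and pages with no type at all, which is the kind of categorised result list described above.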

The aim of the design and architecture of Combine has been to provide a harvesting system which can be used for building fairly large indexes, but no attempt has been made to compete with the worldwide commercial search engines. Rather, it is intended for building an index covering, for example, a small country or all the universities in a region.

Harvesting policies can be formulated flexibly using allow and exclude rules. This permits distributed data collection, in which a number of servers each take responsibility for one or more regions or domains in a broad sense. These areas of responsibility can be assigned based upon actual network domains, organisations or geographical domains just as easily as they could be domains of human knowledge.

An important part of the architecture is an easy way to filter the sets of URLs to be indexed according to some subject or domain. Before an arbitrary set of URLs is loaded into the scheduler for processing, it is passed through an external policy filter. This filter, which is localised for each installation, determines which URLs are to be harvested given the policy adopted by the installation. It thus defines the region or domain a particular installation will cover.
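A policy filter built from allow and exclude rules could be sketched as below. The rule syntax (regular expressions), the example patterns and the `make_policy_filter` helper are assumptions made for illustration; they are not Combine's actual configuration format. Exclude rules are checked first here, so a URL is harvested only if no exclude rule matches it and at least one allow rule does.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical policy for an installation covering Swedish and Norwegian servers.
ALLOW = [r"^https?://[^/]*\.se(/|$)", r"^https?://[^/]*\.no(/|$)"]
EXCLUDE = [r"/cgi-bin/", r"\.(gif|jpg|zip)$"]


def make_policy_filter(allow: Iterable[str], exclude: Iterable[str]) -> Callable[[str], bool]:
    """Build a function that decides whether a URL falls within the policy."""
    allow_re = [re.compile(p) for p in allow]
    exclude_re = [re.compile(p) for p in exclude]

    def accepts(url: str) -> bool:
        if any(p.search(url) for p in exclude_re):
            return False  # an exclude rule always wins
        return any(p.search(url) for p in allow_re)

    return accepts


def filter_urls(urls: Iterable[str], accepts: Callable[[str], bool]) -> List[str]:
    """Keep only the URLs the policy accepts, ready to hand to a scheduler."""
    return [u for u in urls if accepts(u)]
```

For example, `filter_urls` applied with this policy keeps `http://www.lub.lu.se/combine/` but drops a `.com` address, anything under `/cgi-bin/` and image files. Swapping in a different pair of rule lists is all that is needed to give another installation responsibility for a different region or domain.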

The project has aimed at building a system which:

The range of document types that can be indexed is being extended, perhaps to include PostScript or Word documents, as is the range of protocols that can be harvested, to include NNTP and FTP in addition to HTTP.

Automatic Classification of Resources

Another strand related to harvested databases is automatic classification. Although our emphasis is on collections of Web resources catalogued and classified by information professionals, there are occasions, such as with a harvested database, where automatic classification is very useful. It could also be used with the collections of recommended pages.

We have produced a report [11] on the current state of projects, methods and problems associated with automatic classification. This work took place at Lund University [12] in Sweden, where researchers have created a collection of engineering resources for experimentation based on a harvested database. They have tested and evaluated different methods for creating the collection, and have applied different classification methods to the data to test their effectiveness. A pilot service demonstrating some of the above methods is now available [13].

Directory Indexing

The final strand of resource discovery research within the DESIRE project is the use of LDAP white pages directories. The use of directories as distributed information services is becoming widespread. Although white pages services are the major field of application, other uses are developing in this field. However, one issue which still remains open is the lack of an infrastructure that makes such distributed information more accessible to the end user.

One method the project has looked at is to use crawlers to gather all the data from LDAP directories in the Netherlands onto a single server. A second model is to use forward knowledge of the material contained in distributed directories. This system uses a central server that collects data based on the content of the individual directories using the Common Indexing Protocol, essentially producing simplified indexes of the records held by each directory. A search is made initially on this central server, which returns a list of the directories that hold the requested data. Remote searches can then be made on just those directories.
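The second model can be sketched as a central index that records, for each word found in a directory's entries, which directories contain it. The sketch below is only an analogy for the "simplified indexes" described above: it does not implement the Common Indexing Protocol or talk LDAP, and the directory names, records and function names are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical directory contents, keyed by directory server name.
DIRECTORIES: Dict[str, List[Dict[str, str]]] = {
    "ldap.uni-a.example": [{"cn": "Anna Jansen", "o": "University A"}],
    "ldap.uni-b.example": [{"cn": "Piet Jansen", "o": "University B"}],
}


def build_index(directories: Dict[str, List[Dict[str, str]]]) -> Dict[str, Set[str]]:
    """Map each lower-cased word to the set of directories whose records contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for name, records in directories.items():
        for record in records:
            for value in record.values():
                for token in value.lower().split():
                    index[token].add(name)
    return index


def directories_for(index: Dict[str, Set[str]], query: str) -> List[str]:
    """Return the directories that contain every word of the query."""
    tokens = query.lower().split()
    if not tokens:
        return []
    candidates = [index.get(t, set()) for t in tokens]
    return sorted(set.intersection(*candidates))
```

Because the index only records that a word occurs somewhere in a directory, the words of a query may match in different records; the follow-up remote search on the listed directories is what settles whether a matching entry really exists.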

Such an approach would be viable for providing Europe-wide access to directory information - the goal is to have one distributed index for all directory protocols.

The DESIRE project is holding an Indexing workshop [14] in Delft, The Netherlands on 13-14 May 2000. The workshop is aimed at implementers and managers of National Research and Education Networks (NRENs), subject gateways, directory, indexing and searching innovators, information retrieval and automatic classification specialists, networked information developers and digital libraries specialists. Delegate registration costs are being met by the DESIRE project.

Information Gateways Handbook

The DESIRE team has produced the definitive guide to setting up and maintaining a large-scale information gateway. The DESIRE Information Gateways Handbook [15] was launched in October 1999 and is freely available over the web. It promotes the development of national gateway initiatives among the academic and library communities. It also promotes the adoption of standard procedures in setting up gateways, to ensure compatibility and the potential for integrating services.

The Information Gateways Handbook has three main sections:

The Handbook is an excellent example of collaborative working, which draws together the expertise and experience of the leading gateway practitioners in Europe (in fact, the world). The Handbook received very favourable reviews from all peer reviewers and since its launch has received a great deal of positive feedback.

Internet Detective Online Tutorial

Internet Detective [16] is an interactive, Web-based tutorial that can be accessed for free from any Web browser. It is designed to teach people to question the quality of information that they find on the Internet, warning them that this information is not always of the quality they would expect from an academic library and so, if used carelessly, can degrade academic research, teaching or learning.

The tutorial encourages people to think like a detective when looking at an Internet site:

The tutorial has proven very popular, with over 44,000 registrations to date. We have had extensive feedback revealing that it has been incorporated into many university and school curricula and Internet training programmes. As a result of this feedback we created a second edition which included support materials for lecturers and trainers (a PowerPoint presentation, handouts and ideas for classroom exercises and assessments) as well as a downloadable version for offline use.

The tutorial has been recommended by national media, notably the BBC WebWise campaign, The Independent newspaper and USA Today. Following the success of the original, a Dutch version of the tutorial written by DESIRE staff at Koninklijke Bibliotheek [17] (National Library of the Netherlands) will be made available in May 2000. A series of 10 subject-based tutorials (The Virtual Training Suite) is also being created through the UK's Resource Discovery Network (RDN) [18].

Web Caching

Once a user has located some quality information of interest, the next problem comes from the time it can take to download the resource due to the enormous growth of traffic demand on network backbones.

One remedy is to install a Web caching service. Local caching services are already in widespread use. DESIRE is taking the idea one step further and building a network of interconnected caches which serve local, regional, national and international users, with the ultimate aim of being able to provide a coordinated service across the research networks of Europe. Already, both UNINETT (the national research network of Norway) [19] and SURFnet (the national research network of the Netherlands) [20] have set up national caching hierarchies which are interconnected with each other and with those of other countries. The DESIRE project is attempting to spread such cache meshes across Europe.

Statistics show that Web traffic is reduced by 30-50% by sending it through a web cache system (comprising one or more servers). The analysis shows that on every level of the mesh (institutional cache, top level cache, whole mesh) the benefits of caching exceed the costs involved.
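To make the idea of a cache hierarchy concrete, the toy model below chains institutional caches to a shared top-level cache and counts how many requests still have to travel to the origin server. The `Cache` class, the cache names and the request log are hypothetical, and real caches also deal with expiry, validation and inter-cache protocols, all of which are left out here.

```python
from typing import Callable, Dict, Optional


class Cache:
    """A very small model of one cache in a hierarchy."""

    def __init__(self, name: str, parent: Optional["Cache"] = None) -> None:
        self.name = name
        self.parent = parent
        self.store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    def fetch(self, url: str, origin: Callable[[str], str]) -> str:
        """Serve from the local store, else ask the parent, else the origin server."""
        if url in self.store:
            self.hits += 1
            return self.store[url]
        self.misses += 1
        body = self.parent.fetch(url, origin) if self.parent else origin(url)
        self.store[url] = body  # keep a copy for later requests
        return body


origin_requests = []


def origin(url: str) -> str:
    """Stand-in for the remote Web server; records every request that reaches it."""
    origin_requests.append(url)
    return "content of " + url


# Two institutional caches sharing one top-level cache.
top_level = Cache("top-level")
inst_a = Cache("institution-a", parent=top_level)
inst_b = Cache("institution-b", parent=top_level)

# A hypothetical request log: (cache used by the client, URL requested).
for cache, url in [(inst_a, "u1"), (inst_a, "u1"), (inst_b, "u1"), (inst_a, "u2")]:
    cache.fetch(url, origin)

saved = 1 - len(origin_requests) / 4  # fraction of the 4 requests kept off the backbone
```

In this toy log one repeat request is absorbed by the institutional cache and another by the top-level cache, so only two of the four requests reach the origin server. The figures depend entirely on the invented log and say nothing about the measured savings quoted above.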

The DESIRE project has now run two workshops [21] aimed at Web cache managers, looking at cost-benefit analyses, cache architectures and intercache communication protocols.

Conclusion

The DESIRE project covers a wide range of services with the common goal of making the Internet a more useful resource for academic researchers across Europe. Other countries in Europe are now considering a national strategy for gateway and Web cache development, which involves the academic and library communities.

A new project called Renardus [22] will take the DESIRE work on subject gateways forward by developing a broker service offering improved subject-based routes to Internet-accessible collections of cultural and scientific information across Europe. Renardus will be working with - and building on - existing subject gateway initiatives.

References

  1. The DESIRE Project
    URL: <http://www.desire.org>
  2. Telematics for Research
    URL: <http://www.echo.lu/telematics/>
  3. DESIRE Project Partners
    URL: <http://www.desire.org/html/aboutus/projectpartners/>
  4. The Social Science Information Gateway (SOSIG)
    URL: <http://www.sosig.ac.uk/>
  5. Engineering Electronic Library, Sweden (EELS)
    URL: <http://eels.lub.lu.se/>
  6. DutchESS
    URL: <http://www.konbib.nl/dutchess/>
  7. Resource Description Framework (RDF)
    URL: <http://www.w3.org/RDF/>
  8. Combine
    URL: <http://www.lub.lu.se/combine/>
  9. Social Science Search Engine
    URL: <http://www.sosig.ac.uk/harvester.html>
  10. Dublin Core Metadata Initiative
    URL: <http://purl.org/dc/>
  11. Automatic Classification of Engineering Resources
    URL: <http://www.desire.org/html/research/deliverables/D3.6/>
  12. Netlab - Lund University Library
    URL: <http://www.lub.lu.se/netlab/>
  13. Auto-classifier demonstrator
    URL: <http://www.lub.lu.se/desire/demonstration.html>
  14. DESIRE Indexing workshop
    URL: <http://www.desire.org/html/subjectgateways/workshops/indexing2.html>
  15. DESIRE Information Gateways Handbook
    URL: <http://www.desire.org/handbook/>
  16. Internet Detective
    URL: <http://www.sosig.ac.uk/desire/internet-detective.html>
  17. Koninklijke Bibliotheek
    URL: <http://www.konbib.nl/>
  18. The Resource Discovery Network
    URL: <http://www.rdn.ac.uk/>
  19. UNINETT
    URL: <http://www.uninett.no/>
  20. SURFnet
    URL: <http://www.surfnet.nl/>
  21. DESIRE Workshops
    URL: <http://www.desire.org/html/subjectgateways/workshops/>
  22. Renardus
    URL: <http://www.renardus.org/>

Author Details

Phil Cross, Nicky Ferguson, Tracey Hooper and Emma Place
Institute for Learning and Research Technology
University of Bristol
Bristol
BS8 1HH
United Kingdom

Tel: +44 (0)117 928 7197
Email: t.a.hooper@bristol.ac.uk
URL: <http://www.ilrt.bristol.ac.uk/>

Phil Cross, Nicky Ferguson, Tracey Hooper and Emma Place are all part of the DESIRE Team based at ILRT (The Institute for Learning and Research Technology), University of Bristol. Phil Cross is a Senior Technical Researcher who also works on the SOSIG and Renardus projects. Nicky Ferguson is the Institute's Research Director and directs both the DESIRE and SOSIG projects. Tracey Hooper is DESIRE Project Manager and coordinates the project. Emma Place is a Senior Researcher working on SOSIG and the RDN Virtual Training Suite - a successor to the Internet Detective.

For citation purposes:
Phil Cross, Nicky Ferguson, Tracey Hooper and Emma Place, "DESIRE: Making the Most of the Web", Exploit Interactive, issue 5, April 2000
URL: <http://www.exploit-lib.org/issue5/desire/>

