

In what will be a regular Web Technologies column, Brian Kelly and Ian Peacock discuss the results of a recent analysis of the Telematics for Libraries Projects' URLs.
The WebWatch project [1], based at UKOLN, University of Bath, has developed robot software to analyse web technologies in use within a range of communities, and to advise communities on the implications of the findings. The WebWatch project has recently analysed the URLs for Telematics for Libraries projects. This article summarises the findings.
As well as providing a hyperlink to a resource, the URL of a Web page also provides useful information. The fully qualified hostname is normally of the form www.project-site.domain (e.g. www.desire.org or www.pride.ac.uk). This can often provide information on the country the server physically resides in (the UK in the second example, although the country is not known in the first). The hostname can also indicate the nature of the organisation hosting the page (e.g. .com or .co refers to a commercial company).
The construction of the directory hierarchy can indicate the relation of the page to the site. It may be possible to develop heuristic techniques to ascertain the ease with which the URL may be used, cited or remembered.
A list of Telematics for Libraries projects is maintained at the ECHO web site [2]. An HTML document providing summary information exists for each project and, where the project has its own web site, gives the address. Using this information we extracted 50 URLs corresponding to project web pages from the list of 107 projects.
A Perl script was used to obtain information from the project URLs. The script provided the following information for each URL:
URL, Scheme, Hostname, Path, Port, Fragment, Length of hostname, Length of path, Top domain, Secondary domain
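The original Perl script is not reproduced in this article. As a rough illustration only, the fields listed above could be derived along the following lines in Python. The function name `describe_url`, the dictionary keys, and the choice to report port 80 when an http URL gives no explicit port are assumptions made for this sketch, not details of the WebWatch software.

```python
from urllib.parse import urlsplit

# Assumed default: an http URL with no explicit port is reported as port 80.
DEFAULT_PORTS = {"http": 80}


def describe_url(url):
    """Return the per-URL fields listed in the article (hypothetical helper)."""
    parts = urlsplit(url)
    hostname = parts.hostname or ""
    labels = hostname.split(".")
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "url": url,
        "scheme": parts.scheme,
        "hostname": hostname,
        "path": parts.path,
        "port": port,
        "fragment": parts.fragment,
        "hostname_length": len(hostname),
        "path_length": len(parts.path),
        # Top domain is the last label; secondary domain is the last two labels.
        "top_domain": labels[-1],
        "secondary_domain": ".".join(labels[-2:]) if len(labels) >= 2 else "",
    }


if __name__ == "__main__":
    for field, value in describe_url("http://www.casa.issn.org:1999/").items():
        print(f"{field}: {value}")
```

In this sketch the secondary domain is simply the last two labels of the hostname, so www.mmu.ac.uk yields ac.uk and www.casa.issn.org yields issn.org.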
Not surprisingly, all URLs used the http scheme. All but one URL (which used port 1999) referred to the (standard) TCP port 80 for data transfer over HTTP.
Five URLs contained the tilde character (~). This is a Web server mechanism that allows users to have Web space under the hierarchy /~user/.
All URLs ended with the suffix .htm or .html (both indicating an HTML document) or with a slash (/).
Figure 1 shows the length of each URL without the scheme, i.e. the length of the hostname and path (which includes the first /). The two components are shown in different colours.
Figure 1 - Number of characters in each URL
Note that for the longer URLs, the length of the path dominates the length of the overall URL.
Table 1 lists the six project sites which contained only the domain name and no path.
| Project | URL |
| --- | --- |
| BALTICSEAWEB | http://www.baltic.vtt.fi/ |
| CASA | http://www.casa.issn.org:1999/ |
| DER@L | http://deral.infc.ulst.ac.uk/ |
| EUROPAGATE | http://europagate.dtv.dk/ |
| MALVINE | http://www.malvine.org/ |
| TOLIMAC | http://tolimac.ulb.ac.be/ |
It should be noted that only the MALVINE project has its own domain - the other projects included the project name before the organisational name.
In contrast, Table 2 lists the projects with the longest paths.
| Project | URL |
| --- | --- |
| SPRINTELB | http://www.iol.ie/resource/dublincitylibrary/sprintel/index.html |
| BIBDEL | http://www.uclan.ac.uk/research/centre/cerlim/projects/bibdelhp.htm |
| TRANSLIB | http://peterpan.uc3m.es/proyectos/translib/HomePage.htm |
| HARMONICA | http://www.svb.nl/project/harmonica/harmonica.htm |
| COBRA | http://portico.bl.uk/gabriel/en/projects/cobra.html |
| DECIMAL | http://www.mmu.ac.uk/h-ss/dic/research/decimal.htm |
In Table 2 it can be noticed that:
The top-level and second-level domains were extracted from the hostname of each URL. Table 3 shows the findings. Figure 2 gives a pie-chart representation of the data.
Figure 2 - Top level domains found in URLs
Figure 3 shows the second level domains that comprise the data for Table 3.
Figure 3 - Second level domains found in URLs
Figure 3 shows that ac.uk accounts for more project sites than any other second-level domain. This in turn means that the top-level uk domain has the highest number of project sites.
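A tally of this kind can be sketched in a few lines of Python. The example below uses only the twelve URLs shown in Tables 1 and 2, not the full set of 50, so its counts will not match the figures. The names `SAMPLE_URLS` and `domain_counts` are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

# Only the URLs printed in Tables 1 and 2; the full data set is not reproduced here.
SAMPLE_URLS = [
    "http://www.baltic.vtt.fi/",
    "http://www.casa.issn.org:1999/",
    "http://deral.infc.ulst.ac.uk/",
    "http://europagate.dtv.dk/",
    "http://www.malvine.org/",
    "http://tolimac.ulb.ac.be/",
    "http://www.iol.ie/resource/dublincitylibrary/sprintel/index.html",
    "http://www.uclan.ac.uk/research/centre/cerlim/projects/bibdelhp.htm",
    "http://peterpan.uc3m.es/proyectos/translib/HomePage.htm",
    "http://www.svb.nl/project/harmonica/harmonica.htm",
    "http://portico.bl.uk/gabriel/en/projects/cobra.html",
    "http://www.mmu.ac.uk/h-ss/dic/research/decimal.htm",
]


def domain_counts(urls):
    """Count top-level and second-level domains across a list of URLs."""
    top_level = Counter()
    second_level = Counter()
    for url in urls:
        labels = (urlsplit(url).hostname or "").split(".")
        top_level[labels[-1]] += 1
        # Second-level domain taken as the last two labels, e.g. "ac.uk".
        second_level[".".join(labels[-2:])] += 1
    return top_level, second_level


if __name__ == "__main__":
    top_level, second_level = domain_counts(SAMPLE_URLS)
    print(top_level.most_common())
    print(second_level.most_common())
```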
It is desirable for projects to provide a short URL for the main entry point for the project, as this is more memorable and less likely to cause mistakes when citing the address (either in print or when speaking).
One way of shortening a long path name is to avoid including the filename, by making use of the web server's default naming convention. For example, the URL for the fictitious project Microscape, <http://www.foo.bar.com/projects/microscape/microscape.htm>, could be replaced by <http://www.foo.bar.com/projects/microscape/>. This is not only shorter, but also avoids potential mistakes in typing the suffix (e.g. "did she say .htm or .html?").
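As a small illustration of the shortening described above, the following Python sketch drops a trailing .htm or .html filename from a URL. The helper name `directory_url` is hypothetical, and the sketch only rewrites the string: the shortened URL works only if the web server is configured to serve the page as the default document for that directory, which varies between servers.

```python
from urllib.parse import urlsplit, urlunsplit


def directory_url(url):
    """Return the URL with a trailing .htm/.html filename removed (hypothetical helper)."""
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")
    if filename.lower().endswith((".htm", ".html")):
        path = directory + "/"
    else:
        # No HTML filename at the end, so leave the path untouched.
        path = parts.path
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


if __name__ == "__main__":
    print(directory_url("http://www.foo.bar.com/projects/microscape/microscape.htm"))
```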
Organisations may have policies governing the directory structure which may result in long URLs. Use of the ~name convention may provide a shortened URL - e.g. <http://www.foo.bar.com/~microscape/>. A problem with this approach is that the ~name convention is often used for personal home pages (many universities use this approach to provide web space for students). End users who have experienced use of this approach may place low value on URLs containing the tilde character, as described in SOSIG's Internet Detective [3]. This tutorial guide to finding quality resources on the Internet states "If the URL contains a tilde then be aware that you are probably (although not definitely) looking at a personal page with personal opinions rather than an official site giving the official line." [4].
If the host organisation permits it, it may be desirable to include the project name within the domain name. For example the Europagate project which is hosted by the DTV (Danmarks Tekniske Videncenter & Bibliotek) has the URL <http://europagate.dtv.dk/>.
Rather than relying on the host organisation's policies for hosting web sites, projects may choose to obtain their own domain name. For example Exploit Interactive obtained the domain name exploit-lib.org. The Exploit Interactive website is hosted at the URL <http://www.exploit-lib.org/>. The domain name was obtained from InterNIC [5]. The first choice of <http://www.exploit.org/> had already been taken.
Another alternative could be to make use of the EU.org [6] organisation. EU.org's organisational home page states that "The goal of EU.org is to provide free subdomain registration to users or non-profit organizations who cannot afford the outrageous fees demanded by some NICs, especially in Europe". Using EU.org it would be possible to use the domain name exploit.eu.org. Exploit Interactive decided not to pursue this option, since little was known about the EU.org organisation.
A final alternative which could be considered is the use of PURLs [7]. Instead of pointing directly to the location of an Internet resource, a PURL (Persistent URL) points to an intermediate resolution service. Since EU project deliverables may well be sought after once the project has finished, it may not be desirable to provide a URL which may be deleted once the project has completed (as could happen if the website is hosted by a large organisation, and files are automatically deleted when project staff leave).
In the longer term the use of DOIs (Digital Object Identifiers) [8] should be considered for use by projects.
If you have any comments on this article, please contact the editors (exploit-editor@ukoln.ac.uk).
Brian Kelly
UK Web Focus
UKOLN: UK Office for Library and Information Networking
UKOLN
University of Bath
Bath
UK
BA1 7AY
Tel: +44 1225 323943
Fax: +44 1225 826838
URL: <http://www.ukoln.ac.uk/>
Ian Peacock
Netcraft
Bath
UK
Email: ip@netcraft.com
URL: http://www.netcraft.com/
Brian Kelly is employed as UK Web Focus, at UKOLN (UK Office for Library and Information Networking) at the University of Bath, England. Brian's responsibilities include keeping the UK Higher Education community informed of web developments.
Ian Peacock recently left UKOLN to join Netcraft, a networking consultancy based in Bath, England. Netcraft is well known worldwide for its Web Server Survey, which is widely considered a primary empirical metric for the number of web sites and the relative popularity of web server software on the internet. Clients include IBM, Hewlett Packard, Sun Microsystems, and Microsoft.
For citation purposes:
Brian Kelly and Ian Peacock, "URLs for Telematics for Libraries Project Pages," Exploit Interactive, issue 1, 10 April 1999
URL: <http://www.exploit-lib.org/issue1/urls/>
A UKOLN Service. Copyright © 1999-2006
Last Updated: 9 May 1999