Flexible Access to Statistics, Tables and Electronic Resources

Simon Musgrave introduces the FASTER project (and preceding NESSTAR project) that provides flexible access to statistics, tables and electronic resources held at social science data archives and other data publishers.

Introduction - Social Science Data Archives

Data about society, whether economic or social, is collected by many government departments, research institutes and companies. Many of these data collections are available for re-use (secondary analysis). Within the academic sector social science data archives have been established in several European countries to provide researchers and students with ready access to these data [1]. Some of these archives have been in existence for 2-3 decades and house the largest collections of accessible computer-readable data in the social sciences in their respective countries. The primary goals of the archives have been to safeguard the data and make it as easily accessible as possible for teaching and research, regardless of whether the users are able to pay for the services or not.

The NESSTAR project (1998-2000) [2] aimed to increase the use of these data by developing a set of generic tools that make it easier to:

The social science data archives are rarely engaged in the collection of primary data, but serve as brokers between various data providers and the academic community. They not only preserve data for future use but also add their own value to the data:

Overview

The basic strategic goal of the FASTER project (2000-2001) [3] is to link the world of the data archives described above with that of official statistics and to develop advanced metadata models and systems that build on the leading developments in both. To over-simplify, it could be said that the archives and libraries have concentrated on metadata for resource discovery and data transfer, hence the concentration on study descriptions and codebooks. Conversely the statistical offices and other data generators have concentrated on metadata for questionnaires, process control and tabular outputs. By combining the work of these traditions, we hope to generate metadata models and systems that are consistent all the way from data conceptualisation and collection through to end use and interpretation. Much of the drive for the FASTER project comes from the innovative NESSTAR project [2]. An example of the on-line browsing is shown in the screen-shot below.

Figure 1: Online Browsing

The graph above is a NESSTAR bookmark: clicking on it makes it possible to interact with the diagram, find out more about the data, or add another explanatory variable such as gender. To run this it is necessary to have the NESSTAR client installed on your desktop [4].

Components

In order to understand the needs of the next generation of data analyst ‘dream machine’ it is necessary to review existing metadata traditions and develop a wider understanding of the different types of metadata that could, or should, be considered part of the system.

Catalogue

The first part of the metadata, and in many ways the most traditional, is the catalogue. This is the part of the metadata that includes title, ownership, location, keywords and other attributes. The data library community has adopted the Data Documentation Initiative (DDI) [5]. It includes fields for almost every conceivable aspect of data cataloguing and includes mapping to the Dublin Core.
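The idea of mapping catalogue fields onto Dublin Core can be illustrated with a small sketch. The right-hand names below are genuine Dublin Core elements, but the left-hand field names and the particular pairings are assumptions made for illustration; they are not taken from the DDI specification.

```python
# Hypothetical catalogue field names (left) paired with Dublin Core
# elements (right). The pairings are illustrative assumptions only.
CATALOGUE_TO_DC = {
    "title": "title",
    "producer": "creator",
    "distributor": "publisher",
    "abstract": "description",
    "keywords": "subject",
    "collection_date": "date",
}


def to_dublin_core(record):
    """Return a Dublin Core view of a catalogue record.

    Fields with no Dublin Core counterpart in the mapping are dropped,
    which is why the richer catalogue record remains the master copy.
    """
    return {
        CATALOGUE_TO_DC[field]: value
        for field, value in record.items()
        if field in CATALOGUE_TO_DC
    }


example = {
    "title": "Example Household Survey",
    "producer": "Example Statistical Office",
    "sampling": "quota",  # no Dublin Core counterpart in this sketch
}
print(to_dublin_core(example))
```

A mapping of this kind is one-way: a portal that harvests only the Dublin Core view sees less than the full catalogue record.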

It is to be hoped that the exciting developments in the digital library community in developing a range of digitally based services will include effective links to the virtual data services being developed in the NESSTAR/FASTER projects. This is the part of the metadata that is most usable by the widest community and it is recommended that services are developed which speak a range of protocols, including Z39.50, so that they can be accessed, or ‘harvested’, by a range of library portals and services.

Dictionary

When the DDI, as a draft specification, was converted from SGML to XML this, almost inadvertently, opened the door to a range of new Web based services. This was because XML was designed as a key part of the new semantic driven Web, designed for machine to machine interoperability as well as human readability. The new specification of the codebook (or data dictionary) in the DDI format allows the data library, and hence the data publishing software, to formalise the structure and descriptions of the data content.

The standardisation of this level of metadata has been one of the most exciting developments of the current standard. No longer does the user just have to rely on the skill of the cataloguers, but by use of appropriate search engines the user is able to search directly on the specific value labels embedded in the dictionary. This is especially useful for categorical variables, for example ethnicity, occupation, geographies, gender.
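Searching directly on value labels might look something like the following sketch. The XML fragment is loosely modelled on a DDI codebook (the element names `var`, `catgry`, `catValu` and `labl` follow that style), but the fragment itself, the variable names and the function `find_by_value_label` are invented for illustration and should not be read as a valid DDI instance.

```python
import xml.etree.ElementTree as ET

# A made-up codebook fragment in the style of a DDI data dictionary.
CODEBOOK = """
<codeBook>
  <dataDscr>
    <var name="SEX">
      <labl>Sex of respondent</labl>
      <catgry><catValu>1</catValu><labl>Male</labl></catgry>
      <catgry><catValu>2</catValu><labl>Female</labl></catgry>
    </var>
    <var name="TENURE">
      <labl>Housing tenure</labl>
      <catgry><catValu>1</catValu><labl>Owner occupier</labl></catgry>
      <catgry><catValu>2</catValu><labl>Private renter</labl></catgry>
    </var>
  </dataDscr>
</codeBook>
"""


def find_by_value_label(xml_text, term):
    """Return (variable name, category value, label) for matching labels.

    The match is a case-insensitive substring test on category labels,
    not on the variable-level label.
    """
    root = ET.fromstring(xml_text.strip())
    hits = []
    for var in root.iter("var"):
        for category in var.iter("catgry"):
            label = category.findtext("labl", default="")
            if term.lower() in label.lower():
                hits.append((var.get("name"), category.findtext("catValu"), label))
    return hits


print(find_by_value_label(CODEBOOK, "renter"))
```

Because the search runs over the dictionary itself, a user looking for "renter" finds the TENURE variable even if no cataloguer chose that word as a keyword.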

Context

Reflecting back on the fact that the DDI was essentially designed as a storage and transfer format for the predominantly survey datasets in the existing libraries and archives, it is easy to see that this structure, rich and powerful as it is, is unable to address some of the wider issues, such as contextual metadata. On one level the potential for exciting multi-media is clear. Categorical variables, such as housing, occupation and geographies could easily be not just enlivened, but also clarified, by the appropriate use of pictures, video and maps. These objects could be embedded in, or more appropriately linked to, many parts of the DDI metadata.

Simple textual metadata, such as user guides and interviewer questionnaires (or links to computer aided questionnaire systems), can and should be added. They provide detailed background information for the advanced researcher. However there is also a need for wider contextual information for the inexperienced user. This might include access to teaching and learning material, for example to understand the meaning of a quota sample or to be alerted to the issues of question routing or variable weighting.

Quality

The fourth level of metadata, and one of growing concern, relates to data quality. The growth of advanced Web systems that link to many datasets can leave the user, particularly the naïve user, with serious problems. It is difficult to know whether a dataset is of high data quality or of high usability quality from a simple list of hits. An experienced user may be able to assess data quality from a well-documented dataset, and the DDI does include fields for such items as response rates. However this is likely to be beyond the reach of the casual user. Conversely, even when a high quality dataset has been identified, the user will want to know how accessible it is from their particular environment.

The solution to this dilemma could come in one of two ways. On the one hand it might be possible to establish some sort of quality stamp for data, even a grading. These types of proposals have not been met with universal accord due to the difficulty in agreeing them, let alone implementing them. A second, more ad hoc but practical, solution is the development of added value portals or collections. This is likely to be the domain of the higher quality data libraries who are able to gather a set of working links to a range of reliable and trusted data sources. They might be grouped around course or discipline requirements and may have notes added that are semi-confidential.

People

The concept of specialist recommendation leads to the next type of metadata, which is not really metadata at all in the traditional sense. This is the human knowledge that relates to a dataset. A large part of this is likely to be unstructured and held in the head of one or two experts or expert centres. Whilst the attempt to record any common points, for example in Frequently Asked Questions repositories, is important, a good metadata system should include links to people as well. These could be advanced help desk systems that integrate video with screen duplication, or simply telephone numbers of experts. The support can be of several types, ranging from statistical analysis through data management and data content through to basic IT skills.

Supplementary

The final type of metadata, and perhaps the most exciting, is the metadata generated by the user – the supplementary metadata. This could be comment on a dataset, tips, corrections and the like. However it can also be the very use of the dataset. The NESSTAR system allows the user to bookmark operations such as a search or a table or a graph. These bookmarks can be saved and added into electronic articles, especially Web based ones. So it is easy to envisage the scenario in which a researcher carries out an analysis on a particular dataset and creates an on-line article. In this they publish their analysis, not just as a graph or table, but also include a live link back to the environment in which that graph or table was created. See the NESSTAR example above for how this might work.
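One way to picture a bookmark is as a small record of an operation that can be serialised into a link and later turned back into the same operation. The sketch below is purely hypothetical: the `Bookmark` class, its fields and the query-string layout are assumptions for illustration and do not describe the actual NESSTAR bookmark format.

```python
from dataclasses import dataclass
from urllib.parse import parse_qs, urlencode


@dataclass(frozen=True)
class Bookmark:
    """A hypothetical record of one operation on one dataset."""

    dataset: str
    operation: str  # for example "search", "table" or "graph"
    variables: tuple

    def to_query(self):
        """Serialise the bookmark so it can be embedded in a link."""
        return urlencode(
            {"dataset": self.dataset, "op": self.operation, "var": list(self.variables)},
            doseq=True,  # repeat "var" once per variable
        )

    @classmethod
    def from_query(cls, query):
        """Rebuild the bookmark from the string produced by to_query."""
        fields = parse_qs(query)
        return cls(fields["dataset"][0], fields["op"][0], tuple(fields.get("var", [])))


saved = Bookmark("example-survey", "table", ("income", "gender"))
link = saved.to_query()
print(link)
print(Bookmark.from_query(link) == saved)
```

The point of the round trip is that the article carries only the short string, while the data and the analysis environment stay with the data publisher.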

Nature of social science data

The discussion on metadata components is largely based around the type of dataset, the survey, that is the bread and butter of the existing data services. However the FASTER project is broadening the discussion beyond these types of dataset and has engaged in detailed ‘brainstorming’ with the developers of other types of metadata systems. Major players in the production of aggregate tables are the national statistical offices. The FASTER project benefits from having participation from three statistical offices, those of the Netherlands, Ireland and Norway, and close links with a fourth, the UK. These types of tabulation systems, whether interactive, such as Statline, or simply table publications, demand a cell based metadata system.

Microdata or Aggregate Data

A major part of the discussion in the FASTER project has centred on the distinction between microdata and aggregate data. Statistics Netherlands has proposed a CRISTAL data model that handles both micro and aggregate data (or data cubes) in the same logical model. If both representations of data can be handled by the same model, can the data cube, the multi-dimensional table, itself be called a dataset? Traditionally the Census aggregate statistics (the small area statistics) were considered as datasets, and so it is correct to consider every data table as a dataset in its own right. In some instances a table can be derived from the original microdata and so can be considered as a tabulation and even recorded in a bookmark as described earlier. However the table may be much more complex than this and may have been derived from a multiplicity of sources, including a synthetic database as described below.
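The simple case, in which a table is a tabulation of microdata, can be sketched in a few lines. The records and the function `tabulate` are invented for illustration and have no connection to the CRISTAL model itself; the sketch only shows how a data cube of counts falls out of unit records once a set of dimensions is chosen.

```python
from collections import Counter

# Made-up microdata: one dictionary per respondent.
MICRODATA = [
    {"gender": "F", "tenure": "owner"},
    {"gender": "F", "tenure": "renter"},
    {"gender": "M", "tenure": "owner"},
    {"gender": "F", "tenure": "owner"},
]


def tabulate(microdata, dimensions):
    """Count records in each cell of the cube defined by `dimensions`.

    Each key of the result is a tuple of category values, one per
    dimension, and each value is the number of records in that cell.
    """
    return Counter(tuple(row[d] for d in dimensions) for row in microdata)


cube = tabulate(MICRODATA, ["gender", "tenure"])
for cell, count in sorted(cube.items()):
    print(cell, count)
```

Here the cube can always be regenerated from the microdata plus the list of dimensions, which is what makes it recordable as a bookmark. A table built from several sources has no such single recipe.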

Synthetic Databases

The major data collectors, in particular the National Statistical Institutes (NSIs), have very different approaches to data collection (see Bethlehem et al. 1999). Some, notably the Anglo-Saxon countries, are heavily biased towards surveys. Others, notably the Netherlands and Scandinavia, are biased towards registers. Given the high cost of surveys, and the imperative to cut costs, it is likely that the use of secondary data collection, via administrative data such as registers, will grow. In addition one survey is used for multiple logical data collections. In this way a synthetic database can be imputed from a mixture of register and survey data. As a result a published table may have no underlying dataset and the metadata with the table will focus mainly on explaining how it was derived, analogous to describing how a survey was collected.

These trends will result in more data being released as multi-dimensional tables and not as surveys and hence the traditional archive model of simply holding discrete surveys will begin to break down for many organisations. The current bias towards surveys is illustrated by the first version of the Data Documentation Initiative (DDI), which is primarily focused on the description of discrete surveys.

Diversity

The coming years will, we anticipate, lead to a much wider and more diverse range of data resources to serve both the research community and the wider and growing community of data users, whether sophisticated analysts or casual data browsers. The retrieval and replication of the resulting analysis is essential and one of the goals of the project is to investigate ways in which this might be possible.

Client

Given the need for more extensive metadata and the changing nature of the data themselves, what sort of data client is necessary in the next generation of the ‘dream machine’? The FASTER project believes that the new client should have the following features:

User Profile driven

All of the metadata above is important in making proper use of the data resources once they have been identified and located. However the initial identification and location of data remains a fundamental issue. Most of the existing systems rely on the user knowing which sites to search. The development of more advanced search engines, such as Google, Clever (in development at IBM) and ResearchIndex, has the potential to make the data services more accessible. The techniques that these services use provide exemplars of the type of advanced search techniques that will be required to sift and sort the growing number of hits in the wider data web that is developing. For example it would be sensible to rank datasets in more intelligent ways, such as the number of publications based on the dataset weighted for the age of the dataset. Another ranking mechanism would be to use the quality measures discussed above.
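The suggested ranking by publications weighted for age could take many forms. The sketch below assumes the simplest one, publications per year since release, which is our own reading rather than anything specified by the project; the dataset names and figures are invented.

```python
def score(publications, release_year, current_year):
    """Publications per year of availability (an assumed weighting)."""
    age_in_years = max(1, current_year - release_year + 1)
    return publications / age_in_years


def rank(datasets, current_year):
    """Return datasets ordered from highest to lowest score."""
    return sorted(
        datasets,
        key=lambda d: score(d["publications"], d["released"], current_year),
        reverse=True,
    )


# Invented example figures.
DATASETS = [
    {"name": "older survey", "publications": 30, "released": 1985},
    {"name": "newer survey", "publications": 10, "released": 1998},
]

for dataset in rank(DATASETS, current_year=2000):
    print(dataset["name"])
```

With these figures the newer survey ranks first, because ten publications in three years outweighs thirty in sixteen. A raw publication count would have reversed the order.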

In addition to the need to provide some more sophisticated ranking of a generic nature, it is also necessary to provide a way for users to either develop their own profile or choose one of a number of profiles. Developing a user profile depends on the ability of the user to store, retrieve and edit their own preferences for searching. Within NESSTAR this can be done by saving a bookmark of search parameters. These can be saved and re-run by an active agent.

User profiles can be represented in XML and used intelligently to map onto the organisational profiles described later on. They can be stored locally, or on servers. They can be used to add extra functionality at the local organisational level, as well as the user level. The development of user profiles will be carried out using any existing developments in related spheres, for example within the electronic library world.
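As a rough illustration of a profile held in XML, the sketch below writes a few search preferences to an XML string and reads them back. The element and attribute names (`profile`, `user`, `maxHits`, `topic`) are hypothetical and are not drawn from any NESSTAR or FASTER schema.

```python
import xml.etree.ElementTree as ET


def profile_to_xml(user, topics, max_hits):
    """Serialise a user's search preferences as an XML string."""
    root = ET.Element("profile", {"user": user})
    ET.SubElement(root, "maxHits").text = str(max_hits)
    for topic in topics:
        ET.SubElement(root, "topic").text = topic
    return ET.tostring(root, encoding="unicode")


def profile_from_xml(xml_text):
    """Rebuild the preferences from the string made by profile_to_xml."""
    root = ET.fromstring(xml_text)
    return {
        "user": root.get("user"),
        "max_hits": int(root.findtext("maxHits")),
        "topics": [topic.text for topic in root.findall("topic")],
    }


xml_text = profile_to_xml("example-user", ["housing", "employment"], 20)
print(xml_text)
print(profile_from_xml(xml_text))
```

Because the profile is plain text, the same string can sit on the user's desktop or on a server, and an organisation can read it without knowing which client produced it.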

Adjusting to the type of data

In addition to the changes in the client that are driven by the user, there are many changes that are driven by the type of data, or metadata. It is possible to conceive of systems that load up a client depending on the type of data being delivered across the web. This is, of course, similar to the well-established functionality of the mime type in which different data types are able to initialise different applications. Other types of data that are of specific interest include geographical data, for example linking demographic information to maps, and time series data. The FASTER system will include ways of ensuring that these types of data can be displayed intelligently, maybe via an alternative system altogether.
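The mime type analogy can be sketched as a lookup from content type to viewer, with a generic fallback. The viewer functions are placeholders and the two `application/x-example-...` types are invented for illustration; only `text/csv` is a registered media type.

```python
def show_table(payload):
    return f"table viewer: {payload}"


def show_map(payload):
    return f"map viewer: {payload}"


def show_time_series(payload):
    return f"time series viewer: {payload}"


def show_generic(payload):
    return f"generic viewer: {payload}"


# Content type -> viewer. The "x-example" types are hypothetical.
VIEWERS = {
    "text/csv": show_table,
    "application/x-example-geodata": show_map,
    "application/x-example-timeseries": show_time_series,
}


def open_with_viewer(content_type, payload):
    """Pick a viewer from the content type, ignoring any parameters."""
    media_type = content_type.split(";")[0].strip().lower()
    viewer = VIEWERS.get(media_type, show_generic)
    return viewer(payload)


print(open_with_viewer("text/csv; charset=utf-8", "household counts"))
print(open_with_viewer("application/x-example-geodata", "ward boundaries"))
print(open_with_viewer("application/octet-stream", "unknown"))
```

Adding support for a new kind of data then means registering one more entry, which fits the idea of handing some data types to an alternative system altogether.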

It is worth, at this juncture, pointing out that a major part of the NESSTAR development has been the creation of an XML based protocol for the communication between the client and server. This protocol is in the public domain and available to anyone who wants to create new clients and servers to interact with the overall NESSTAR service. As a result the project expects that some services for specific types of data will be linked into the overall NESSTAR network. Indeed this is one of the most exciting outworkings of the project – the ability of multiple data servers and clients to inter-operate at a variety of levels (e.g. resource discovery, data browsing, data delivery and local portals).

Workbench

A further issue to be resolved at the client level is the development of an intelligent way of sifting and sorting the variety of data and related services. To use the analogy of the social science workbench, the user wants to have easy access to the right collection of raw materials (typically the data) and the right collection of tools (typically analysis applications). It does not help them to be cluttered up with a huge amount of irrelevant material. Rather the user wants to be able to sort relevant resources quickly and easily and then store the re-arranged environment so that it can be returned to quickly and easily.

Conclusions

The FASTER project is very ambitious: it is seeking to address many of the issues that have been thrown up by the success of the NESSTAR project. However it is seeking to solve these problems by focusing on the development of the data web, rather than one-off solutions. It is the fervent hope of the author that the infrastructure that is being developed by both the NESSTAR and FASTER projects can be both participatory and ever expanding. In this way our dreams, and those of the data-using community, can both grow and be realised [6] [7].

References

  1. Cessda Council of European Social Science Data Archives
    URL: <http://www.nsd.uib.no/Cessda/>
  2. Networked Social Science Tools and Resources
    URL: <http://www.nesstar.org/>
  3. Flexible Access to Statistics, Tables and Electronic Resources
    URL: <http://www.faster-data.org/>
  4. Install NESSTAR Explorer
    URL: <http://www.nsd.uib.no/nesstarexplorer/>
  5. Data Documentation Initiative (DDI)
    URL: <http://www.icpsr.umich.edu/DDI/>
  6. Bethlehem, J., Kent, J., Willeboordse, A. & Ypma, W. (1999). On the use of metadata in Statistical data processing. Working Paper No. 23, UN/ECE Work Session on Statistical Metadata, Geneva, Switzerland, 22-24 September 1999.
  7. Musgrave, S. & Ryssevik, J. (1999). The Social Science Dream Machine. Resource Discovery, Analysis and Delivery on the Web. Forthcoming in the Social Science Computing Review.

Author Details

Simon Musgrave
Deputy Director of the Data Archive
NESSTAR and FASTER project leader
University of Essex
Colchester
CO4 3SQ

Tel: 44 (0)1206 872321
Fax: 44 (0)1206 872003
<http://www.data-archive.ac.uk/>
<http://www.nesstar.org/>

Simon Musgrave is Deputy Director of the UK Data Archive, a national centre for the collection and dissemination of economic and social data. He is co-ordinator of the NESSTAR and FASTER projects.

For citation purposes:
Simon Musgrave, "Flexible Access to Statistics, Tables and Electronic Resources", Exploit Interactive, issue 7, 2nd October 2000
URL: <http://www.exploit-lib.org/issue7/data/>

