
News Services for the Library Community

Many commercial Web sites carry news feeds from third parties. Suppliers of news feeds often use email to deliver user-selected areas of news. Karen Neal looks at ways in which the European Library community can provide their own news services to their community. She describes the NewsAgent project, which provides an email and Web-based current awareness service, and reviews developments to standards for providing similar services.

Background

As people suffer from “information overload” in both their professional and personal lives, news services that aggregate and disseminate targeted or personalised updates have become a common tool. Typical services deliver updates to the user’s inbox or set cookies on the user’s computer so that their personal news categories take over the front page of the site. While many individual organisations choose to disseminate their information in the form of an e-mail newsletter, larger entities are currently pulling information from key sources and disseminating it to interested parties. Internet news services such as Moreover.com and Individual.com currently pull in news feeds and articles from online journals and newspapers, and disseminate the information to users in specific news channels. Legal, medical and business information are all represented by some form of news and content updating service. Smaller, more specialised information groups do not necessarily have the resources to organise and disseminate the information from their chosen resources. This can be attributed to the lack of sufficient technology to deal with the task of pulling information from disparate resources, the lack of well-formatted content for harvesting, and the lack of any means for disparate resources (such as listservs and newsgroups) to have their most valuable content efficiently disseminated.

NewsAgent

NewsAgent began in 1996 when funding was awarded for the design and implementation of a service that would sort and deliver key information resources to library and information professionals. The funding was given under the Electronic Journals strand of the Electronic Libraries Programme (eLib) to research and develop the concept of an open-standards-based, electronic, personalised, current awareness service.

Dublin Core metadata tags are recognised by the system, as are NewsAgent-specific tagging terms. In addition, specific keywords are employed to map stored user subject profiles against new information resources. Software robots have been used to build an Oracle database where both user profiles and document attributes are stored.

Users can join the service using the project Web page [1] (see Figure 1). Once within the system, users may sign up to receive information updates by email as well as search for information resources currently within the database. Users can receive updates on predefined topics to which they subscribe as well as create their own personalised searches.

Figure 1: NewsAgent Log In Screen

The harvester monitors Web sites and creates records when changes occur. When a change is detected, information is gathered from the HTML coding of the page and a new record is created. In addition, messages to listservs to which the system is subscribed also enter the system. The subject field of each message becomes the title of the entry, and the message is placed on a Web page that is accessible for 30 days. The NewsAgent administrator also has the ability to send news and information into the system via e-mail or by creating a new record within the administration client.
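
The handling of listserv messages described above can be sketched roughly as follows. This is only an illustration, not the NewsAgent code: the record layout, the function name record_from_message and the sample message are assumptions made for the example. Only the use of the subject line as the title and the 30-day window come from the description above.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string

# The 30-day window is taken from the article; everything else is hypothetical.
RETENTION = timedelta(days=30)


def record_from_message(raw_message, received_at):
    """Turn one raw listserv message into a simple record dictionary."""
    msg = message_from_string(raw_message)
    # The subject line of the message becomes the title of the record.
    title = (msg.get("Subject") or "(no subject)").strip()
    body = msg.get_payload()
    if not isinstance(body, str):
        # Multipart messages are ignored in this sketch.
        body = ""
    return {
        "title": title,
        "body": body,
        "imported": received_at,
        "expires": received_at + RETENTION,
    }


# Hypothetical sample message.
raw = (
    "From: someone@example.org\n"
    "Subject: Filtering software in public libraries\n"
    "\n"
    "Has anyone evaluated filtering tools?\n"
)
record = record_from_message(raw, datetime(2000, 10, 2, tzinfo=timezone.utc))
print(record["title"], record["expires"].date())
```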

Figure 2: Results of an Online Search on the topic of Censorship

The new entries are automatically analyzed for their content by way of DC tags and title tags. An index is created nightly with the new information being matched up against the profile of the users. The updates are then sent out. The format of the update is up to the user who can choose between daily or weekly, long or short entries, and text or HTML. The longer entries contain the description information when it is available from harvesting or the first few lines of a message that has been received via email.
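
As an illustration of the long and short entry formats, the sketch below builds one plain-text entry for an update. The field names (title, url, description, body) and the choice of three preview lines for “the first few lines” are assumptions made for the example, not details of the NewsAgent implementation.

```python
def format_entry(record, long_form, preview_lines=3):
    """Format one record for a plain-text update.

    Short entries carry only the title and URL. Long entries add the
    harvested description when there is one, otherwise the first few
    non-empty lines of the message body.
    """
    lines = [record["title"], record["url"]]
    if long_form:
        detail = record.get("description")
        if not detail:
            body_lines = [
                line.strip()
                for line in record.get("body", "").splitlines()
                if line.strip()
            ]
            detail = " ".join(body_lines[:preview_lines])
        if detail:
            lines.append(detail)
    return "\n".join(lines)


# Hypothetical record created from a listserv message (no description).
message_record = {
    "title": "Filtering software in public libraries",
    "url": "http://www.example.org/messages/42.html",
    "body": "Has anyone evaluated filtering tools?\n\nWe are reviewing options.\n",
}
short_entry = format_entry(message_record, long_form=False)
long_entry = format_entry(message_record, long_form=True)
print(long_entry)
```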

Issues in Service

While there are numerous news-alerting services available online, their content is of a general nature, coming from wire services. Because the NewsAgent service pulls information from resources as uncontrolled as listservs, and from Web pages by means of automated tools, many issues have arisen. The primary factors have been the lack of metadata, the inadequacy of URLs for delineating collections of resources, the inadequacy of HTTP modification dates for identifying changes to resources, and the inability of the harvester to be more finely tuned.

With the numerous listservs to which NewsAgent is subscribed, there had to be a method in place by which information could be efficiently disseminated to subscribers. To do this, all of the e-mail messages coming into the server are converted into HTML pages, with the subject of the e-mail becoming the title of the item’s record. Words in the title are then matched against the set of keywords associated with specific searches. If a title contains any of these words, the message is sent out with the content for that news section. Users are also able to create customized searches that work on the same principle: any new messages containing words they have entered into a search statement will be delivered within their next update. The administrator can review all of the messages that have been imported overnight prior to their dissemination. This allows multiple postings, solved inquiries, and non-topical exchanges to be deleted. This is a time-consuming process and would be greatly improved if users of listservs put additional key terminology in their subject lines. After all, many listserv archives are searchable online. Even with variable terminology, additional content information on the subject line will assist current users and will also help any future technology for gathering and disseminating such information.
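
A minimal sketch of this kind of title matching is given below. The topic names, keyword sets and function names are invented for the example. The sketch also shows why an uninformative subject line defeats the approach: a title with no keyword in it matches nothing.

```python
import re


def title_words(title):
    """Split a title into a set of lower-case words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def matching_topics(title, topics):
    """Return the names of topics with at least one keyword in the title."""
    words = title_words(title)
    return sorted(
        name
        for name, keywords in topics.items()
        if words & {keyword.lower() for keyword in keywords}
    )


# Hypothetical topics and keyword sets.
topics = {
    "Censorship": {"censorship", "filtering"},
    "Copyright": {"copyright", "licensing"},
}

print(matching_topics("Re: Filtering software in public libraries", topics))
print(matching_topics("Help needed", topics))
```

The same function could serve a personalised search if the words of a user's search statement were passed in as a one-entry topic dictionary.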

Problems with Older Harvesters

The older harvesters, such as the one that NewsAgent is currently using, are not able to tell the difference between a change of banner advertisement and a change to the content of a Web page. Minor changes such as these can result in pages being harvested without bringing any new information or value into a system. In the case of NewsAgent, every time a page with a revolving banner advertisement is visited, the page is harvested and a new record is created. This means that, without human intervention, those receiving updates will find XYZ Press Releases listed very frequently even though the Web page has not had any new press releases for several weeks. Also, a person who searches for company XYZ from the NewsAgent Web site will see a large number of records for the company’s press release page unless duplicates are manually deleted. Sites that bring this additional strain to the NewsAgent harvester are currently being monitored using Netmind’s free Mind-it service. As these pages change, the NewsAgent administrator reviews the “changed page” and decides whether any new content has been added and whether to add the information to the NewsAgent system.
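
One possible way for a harvester to be less sensitive to revolving banners is to compare only the visible text of a page between visits, ignoring markup such as image and link addresses. The sketch below illustrates that idea. It is not how the NewsAgent harvester or Mind-it works, and the page samples and names are invented.

```python
import hashlib
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collect visible text, ignoring tags, attributes, scripts and styles."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def content_fingerprint(html):
    """Hash the whitespace-normalised visible text of a page."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical pages: the first two differ only in the banner image.
page = '<a href="/ad?id={ad}"><img src="/banners/{ad}.gif"></a><h1>XYZ Press Releases</h1><p>{news}</p>'
monday = page.format(ad="1", news="New product launched.")
tuesday = page.format(ad="2", news="New product launched.")
friday = page.format(ad="3", news="Annual report published.")

print(content_fingerprint(monday) == content_fingerprint(tuesday))
print(content_fingerprint(monday) == content_fingerprint(friday))
```

A text-only comparison of this kind would still be fooled by advertisements that rotate as text rather than as images, so some administrator review would probably remain necessary.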

Problems with URLs

Harvesters can be configured to block out certain parts of Web sites that have been deemed of little value for the target information. This can include pages for areas such as feedback listings, advertising rates and archival information. One primary method by which a harvester can block such information is by using the file path within the URL. For example, it can be configured to ignore everything which includes “/1999/” or “/volume4/” in the URL. This ability to sort the wheat from the chaff of a Web site breaks down, however, if the information within the Web site is not organized in such a way that the harvester can work around specific sections. The publishers who have been contacted regarding the organization of their information have responded most favorably. However, as many of these individuals are working with a low budget, are understaffed or are the only ones responsible for the site, they do not have the resources necessary to move the files to a more suitable structure.
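
A URL-based exclusion rule of this kind might look like the following sketch. The two path segments are the ones mentioned above. The function name and the example.org addresses are hypothetical.

```python
from urllib.parse import urlsplit

# Path segments taken from the examples in the text.
EXCLUDED_SEGMENTS = ("/1999/", "/volume4/")


def should_harvest(url, excluded=EXCLUDED_SEGMENTS):
    """Return False when the URL path contains an excluded segment."""
    path = urlsplit(url).path
    return not any(segment in path for segment in excluded)


print(should_harvest("http://www.example.org/journal/volume4/article1.html"))
print(should_harvest("http://www.example.org/journal/volume5/article1.html"))
```

The rule only works when a site keeps its archives in separate directories. If old and new articles share one directory, no path segment distinguishes them, which is the problem described above.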

Record Dates

The date of the information harvested can also be a source of problems. There is no way to add information to the system without it possibly being disseminated to the users. This means that an older article thought to have value by the administrator and added to the system, or an older item which is harvested, may well be listed on a user’s update of new resources in addition to being available to users who perform online searches. The item record contains only the import date, so users cannot determine the date of the item’s creation or last update unless the title or subject line contains the relevant information.

Tagging

The lack of efficient tagging has also been an issue for the NewsAgent harvester. While information such as the page’s author is not generally necessary, having a title that accurately reflects the contents of the page is vital. The contents tags have also been found to be underused by many content producers. In general, many pages have titles that reflect only their content and neglect to mention the name of the site. Consider a press release page. It is not uncommon for a company or organization to use the title “Press Releases” in the tags for the page. Whose press releases are being viewed? If that particular page is harvested, only the information in the title tag will be used. A user harvesting information from the site would have to look at the URL to attempt to decipher the company’s name. Another user who had bookmarked the page would have the same question regarding which company’s press releases they have, unless they edit the entry or look at the URL. The subject of the Web page appears to be overlooked quite often. Without any coded subject information, the NewsAgent system will simply copy the title into the subject area of the item's record.
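
The effect of missing tags can be illustrated with a short sketch that reads the title tag and any Dublin Core meta tags from a page and falls back to the title when no subject is coded. The meta names DC.Title and DC.Subject follow the usual convention for embedding Dublin Core in HTML. The precedence given to DC.Title over the title tag, the function name and the sample pages are assumptions made for the example.

```python
from html.parser import HTMLParser


class HeadMetadata(HTMLParser):
    """Collect the <title> text and named <meta> tags of a page."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") and attrs.get("content"):
            self.meta[attrs["name"].lower()] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def describe_page(html):
    """Build the title and subject of a record from a page's tags."""
    parser = HeadMetadata()
    parser.feed(html)
    parser.close()
    # Assumption: a DC.Title tag, when present, is preferred to <title>.
    title = parser.meta.get("dc.title") or parser.title.strip()
    # With no coded subject, the title is copied into the subject.
    subject = parser.meta.get("dc.subject") or title
    return {"title": title, "subject": subject}


# Hypothetical pages: one untagged, one with a fuller title and a subject.
bare = "<html><head><title>Press Releases</title></head><body></body></html>"
tagged = (
    "<html><head><title>XYZ Ltd: Press Releases</title>"
    '<meta name="DC.Subject" content="library automation"></head></html>'
)
print(describe_page(bare))
print(describe_page(tagged))
```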

Overall

Overall, the NewsAgent service is currently functioning in its role for professional information dissemination. However, the need for administration and monitoring is quite high. While some minor changes to the service have been made, additional developments will be needed to make the system more autonomous. New technology and future changes to the system may help to increase the system’s capabilities while decreasing the administration costs.

RSS

There are currently several initiatives underway to assist in making the information now available on the Internet much more logical and easily integrated into dissemination initiatives. Integration of Rich Site Summary (RSS) [2] information within a system such as NewsAgent could be beneficial for the content producer, the information disseminator as well as the information recipient. While time and resources would be necessary up front from the harvesters as well as the content producers, the final result would be much more accurate and user friendly. Information regarding copyright, editor, dates of publication/update, and image information could sit alongside the title and descriptions of a site. Additionally, a content producer can let aggregators know which hours or days to avoid collecting from a site. This will dramatically reduce the number of times a page is collected in error or a site is accessed at inopportune times wasting time and resources. Of course some pages, such as those for press releases, will be updated at an irregular interval. In cases such as these, a harvester can simply be set to look for new content headlines available from the RSS files for such a site.
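
To make this concrete, the sketch below reads a small RSS channel and checks whether a harvester should visit at a given time. The element names (copyright, skipHours, skipDays, item) are those of RSS 0.91. The sample channel, the example.org addresses and the function names are invented, and the sketch does not attempt to cover the whole format.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 0.91 channel for a press release page.
SAMPLE_RSS = """<rss version="0.91">
  <channel>
    <title>XYZ Press Releases</title>
    <link>http://www.example.org/press/</link>
    <description>Press releases from XYZ</description>
    <copyright>Copyright 2000 XYZ</copyright>
    <skipHours><hour>9</hour><hour>10</hour></skipHours>
    <skipDays><day>Saturday</day><day>Sunday</day></skipDays>
    <item>
      <title>New product launched</title>
      <link>http://www.example.org/press/2000-10-01.html</link>
    </item>
  </channel>
</rss>"""


def read_channel(rss_text):
    """Extract the channel details a harvester would need."""
    channel = ET.fromstring(rss_text).find("channel")
    return {
        "title": channel.findtext("title", default=""),
        "copyright": channel.findtext("copyright", default=""),
        "skip_hours": {int(h.text) for h in channel.findall("skipHours/hour")},
        "skip_days": {d.text for d in channel.findall("skipDays/day")},
        "items": [
            (i.findtext("title", default=""), i.findtext("link", default=""))
            for i in channel.findall("item")
        ],
    }


def may_collect(channel, day_name, hour):
    """Return False on the days and hours the producer asked to avoid."""
    return day_name not in channel["skip_days"] and hour not in channel["skip_hours"]


channel = read_channel(SAMPLE_RSS)
print(channel["items"])
print(may_collect(channel, "Monday", 14))
```

Because each item carries its own title and link, a harvester working from such a file would create records from the headlines themselves and would not need to guess whether a page has changed.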

While integration of RSS will take time at the beginning, it will eventually save a great deal of time and frustration. Rather than relying on continual workarounds when confronting dissemination and harvesting problems, taking the first step towards embracing additional technological capabilities should help to keep libraries and information professionals at the forefront of changes within information technology. Quite often with new technologies, groups and organizations wait to see how many others are integrating a particular technology before investing in it. If the same yardstick is applied to technologies such as RSS, it will take a great deal of time before the “magic number” of users is reached and the technology is embraced by a majority. With the plethora of information and resources available, it would be in every organisation's best interests to examine how they can best integrate RSS information into their Web site. Many institutions are stretching their resources and feel unable to consider the cost or time involved in changing their current practices, but a quick look at how much has changed in just the past few years should encourage content producers to move towards RSS. Whether one consults with others who are currently involved in RSS or works in cooperation with other organizations, there are ways to move one’s technological capabilities forward with minimal impact on the bottom line.

References

  1. NewsAgent project Web site
    URL: <http://www.newsagent.sbu.ac.uk>
  2. UKOLN Metadata Resources - Rich Site Summary - RSS
    URL: <http://www.ukoln.ac.uk/metadata/resources/rss/>

Author Details

Karen Neal
Researcher
LITC
South Bank University

Karen has now left LITC. Her new email address is
Email: ateah@hotmail.com

All enquiries regarding NewsAgent should be sent to:
Andrew Cox
LITC
South Bank University
103 Borough Road
London
SE1 0AA

Email: coxam@sbu.ac.uk

For citation purposes:
Karen Neal, "News Services for the Library Community", Exploit Interactive, issue 7, 2nd October 2000
URL: <http://www.exploit-lib.org/issue7/newsagent/>

