The Active Web

Mediated Access: a Novel Technique for Searching the World Wide Web.

Mourad Mechkour, David J. Harper, Gheorghe Muresan
The Robert Gordon University
St Andrew Street, Aberdeen, AB25 1HG.
United Kingdom
{mrm, djh, gm}@scms.rgu.ac.uk

Most users currently have difficulty using existing tools to satisfy their information-seeking needs on the World Wide Web (WWW). Some of these difficulties arise from the nature of the information available on the web: it is huge, its heterogeneous content spans a multitude of domains of interest, it has no semantic organization, and only reference links between documents give it structure. Others stem from the tools themselves, most of which do not give users enough assistance. Users feel that most of the relevant information on the web is inaccessible because they lack the right tools to find it and to separate it from the bulk of non-relevant information (a typical query retrieves far too many non-relevant documents). In addition, almost no intelligent assistance is available for query formulation, i.e. for choosing the appropriate combination of terms to describe an information need.

We believe that some form of intelligent assistance is needed to filter the information available on the WWW and reduce it to specialized subsets selected according to the domain of interest of a particular user or group of users. This assistance should also guide the query formulation stage, helping the user build an effective query that is representative of their information need and precise enough to exclude most non-relevant documents from the result.


Figure 1: Mediating Access to the World Wide Web.

 

In the WebCluster project we propose a new approach to searching the World Wide Web based on the concept of mediated access. Mediated access is the use of one document collection (the source collection) as a domain-specific filter for accessing the WWW (the target collection). This limits the users' view to a subcollection of the web containing mostly documents relevant to the domain covered by the source collection. It also allows the semantic structure of the domain, along with the operations available on it, to be transposed to the web.

Defining a domain of interest (the filter) by a collection of documents representative of it offers many advantages. Document clustering techniques [6, 7] make it possible to extract the semantic structure of the domain automatically, and hence its main concepts, and to keep this structure up to date. The approach is simple and dynamic: users can express the evolution of their interest (shift, reinforcement) just by adding documents to the collection or removing them from it. It allows personalization as well, since different users can use different sets of documents to define their own view of a domain of interest. Document clustering also makes it easy to transpose this structure to the web, since each cluster can be represented by a hypothetical document computed from the documents it contains, and then extended by adding the web documents that are similar to it.
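The last point, representing a cluster by a hypothetical document and extending it with similar web documents, can be illustrated with a minimal Python sketch. It is not the WebCluster implementation: it assumes raw term-frequency vectors, a mean-vector (centroid) representative, cosine similarity, and an arbitrary threshold of 0.3. The function names (`term_vector`, `centroid`, `cosine`, `extend_cluster`) are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector for one document (assumed weighting)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Hypothetical document: mean term weight over the documents of a cluster."""
    total = Counter()
    for vector in vectors:
        total.update(vector)
    n = len(vectors)
    return {term: weight / n for term, weight in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def extend_cluster(cluster_docs, web_docs, threshold=0.3):
    """Return the web documents similar enough to the cluster representative.

    The threshold is an illustrative assumption, not a value from the project.
    """
    representative = centroid([term_vector(d) for d in cluster_docs])
    return [d for d in web_docs
            if cosine(representative, term_vector(d)) >= threshold]
```

For example, a cluster holding "document clustering retrieval" and "clustering retrieval evaluation" would be extended with a web page such as "clustering retrieval methods", while a page sharing no terms with the cluster would be left out.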

In mediated access (see Figure 1), the users, instead of querying the target collection directly and going through multiple iterations of the process (query formulation, query evaluation, relevance assessment), first interact with the source collection to formulate a query, which is then issued to the target collection to retrieve the relevant documents available there. Searching the WWW therefore becomes a two-stage process:

1. The query formulation stage helps the users formulate a precision-oriented query. They search (by browsing or querying) the structured source collection to learn the inherent semantic structure of the domain it covers and to retrieve some relevant source documents or clusters of documents. This stage allows them to learn the important concepts of the domain, which of these correspond to their information need, and how they are represented in this document collection. At the end of this stage the users should have identified their information need either as a concept (or set of concepts) of the specific domain, or as a cluster of relevant documents.
2. The mediated access stage uses the expression of the information need identified in the previous stage to issue a query to the target collection, and allows the users to browse the result.
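The two stages can be sketched in Python as follows. This is only an illustration under stated assumptions: the user has already selected a cluster of relevant source documents in stage 1, the query is built from the terms occurring in the most documents of that cluster, and `search_target` is a hypothetical callable standing in for a web search engine. None of these names or choices come from the WebCluster tool itself.

```python
from collections import Counter

# Small illustrative stopword list (an assumption, not from the source).
STOPWORDS = frozenset({"the", "of", "a", "and", "in", "for"})


def top_terms(cluster_docs, k=3):
    """Stage 1 output: the k terms that occur in the most cluster documents."""
    counts = Counter()
    for doc in cluster_docs:
        # Count each term once per document (document frequency).
        counts.update({w for w in doc.lower().split() if w not in STOPWORDS})
    # Sort by descending frequency, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def mediated_search(cluster_docs, search_target, k=3):
    """Stage 2: issue the query derived from the cluster to the target collection."""
    query = " ".join(top_terms(cluster_docs, k))
    return query, search_target(query)


def toy_engine(query):
    """Hypothetical stand-in for a web search engine over three fake pages."""
    pages = [
        "hierarchical clustering for text retrieval",
        "retrieval of pension records",
        "clustering galaxies",
    ]
    terms = query.split()
    return [p for p in pages if all(t in p.split() for t in terms)]
```

Calling `mediated_search(selected_cluster, toy_engine, k=2)` returns both the derived query and the matching pages, so the user can inspect the query before browsing the result.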

The mediated access approach offers numerous advantages: it provides a semantic structuring of the web without clustering the whole of it, it gives substantial help in the query formulation process, and it increases the precision of the search.

In this project we have developed a two-part tool:

1. The WebCluster server, which implements the basic clustering and searching facilities. This component lets users define the most effective clustering of their document collection by choosing the most appropriate method from a wide range of state-of-the-art clustering methods, and implements a set of search strategies (browsing/querying) that can be used to search the clustered collections.
2. The WebCluster client (interface), which implements the idea of mediated access to the web. This interface combines three displays. The first is dedicated to interaction with the source collection (querying/browsing of the clustered collection). The second shows the result of a query as a list of documents (from the source collection or from the web). The third gives a detailed view of a cluster (the concepts and documents representative of it) or of a document (from the source collection or from the web).

An experimental evaluation of this tool is under way. Its goal is to measure how effective WebCluster is, and what improvements it offers users compared with a classical web search engine.

Acknowledgments

We thank T. Bratvold, J. Myllymäki, M. L. Barja, G. Sonnenberger, and H-P. Frei from Ubilab for their valuable comments.

This project is supported by Ubilab of the Union Bank of Switzerland.

Bibliography

1
Hearst M. A. and Pedersen J. O.

Reexamining the cluster hypothesis: Scatter/Gather on retrieval results.
In 19th ACM SIGIR Conference, pages 76-84, August 18-22, 1996.

2
Croft W. B.

A model for cluster searching based on classification.
Information Systems, 5:189-195, 1980.

3
Hendry D. G. and Harper D. J.

An informal information-seeking environment.
JASIS, 48(11):1036-1048, November 1997.

4
Salton G.

Automatic Text Processing.
Addison-Wesley, Reading, Mass., USA, 1989.

5
Parsons J. and Wand Y.

Choosing classes in conceptual modeling.
Communications of the ACM, 40(6):63-69, 1997.

6
Van Rijsbergen C. J.

Information Retrieval, 2nd edition.
Butterworths, London, 1979.

7
Jain A. K. and Dubes R. C.

Algorithms for Clustering Data.
Prentice Hall, 1988.

8
Leouski A. and Croft W. B.

An evaluation of techniques for clustering search results.
Technical Report IR-76, University of Massachusetts at Amherst, 1996.

9
Mechkour M., Harper D. J., and Muresan G.

The WebCluster project: Using clustering for mediating access to the World Wide Web.
In ACM-SIGIR'98, Melbourne, Australia, pages 357-358, August 1998.
Poster.
