Mourad Mechkour, David J. Harper, Gheorghe Muresan
Most users currently have difficulty completing information-seeking tasks on the World Wide Web (WWW) with the existing tools. Some of these difficulties arise from the nature of the information available on the web: its huge size, its heterogeneous content spanning a multitude of domains of interest, its lack of semantic organization, and the fact that only reference links between documents give it any structure. Others relate to the tools themselves, most of which give users too little assistance. Users feel that much of the relevant information on the web is inaccessible because they lack the tools to find it and separate it from the bulk of non-relevant information (a typical query retrieves far too many non-relevant documents), and because almost no intelligent assistance is available for query formulation, i.e. for choosing the combination of terms that best describes their information need.
We believe that some form of intelligent assistance is needed to filter the information available on the WWW and reduce it to specialized subsets selected according to the domain of interest of a particular user or group of users. This assistance should also guide the query formulation stage, helping the user build an effective query that represents their information need and is precise enough to exclude most non-relevant documents from the result.

Figure 1: Mediating Access to the World Wide Web.
In the WebCluster project we propose a new approach to searching the World Wide Web based on the concept of mediated access. Mediated access is the use of one document collection (the source collection) as a domain-specific filter for accessing the WWW (the target collection). This restricts the users' view to a subcollection of the web containing mostly documents relevant to the domain covered by the source collection. It also allows the semantic structure of the domain, along with the operations available on it, to be transposed to the web.
Defining a domain of interest (filter) by a collection of documents representative of it offers many advantages. Document clustering techniques [6, 7] can extract the semantic structure of the domain automatically, and hence its main concepts, and keep this structure up to date. The definition is simple and dynamic: users express the evolution of their interest (shift, reinforcement) just by adding documents to the collection or removing documents from it. It allows personalization as well, since different users can use different sets of documents to define their own view of a domain of interest. Document clustering also makes it easy to transpose this structure to the web, since each cluster can be represented by a hypothetical document computed from the documents it contains and then extended by adding the web documents that are similar to that representative.
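The cluster representation and extension described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the WebCluster implementation: the hypothetical document is assumed to be the mean term-frequency vector (centroid) of the cluster, similarity is assumed to be the cosine measure, and the function names and the threshold of 0.3 are illustrative choices that do not come from the project.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def centroid(vectors):
    """Assumed form of the 'hypothetical document': mean of the member vectors."""
    total = Counter()
    for vector in vectors:
        total.update(vector)  # adds term counts
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def extend_cluster(cluster_docs, web_docs, threshold=0.3):
    """Return the web documents similar enough to the cluster representative.

    The threshold is an illustrative assumption, not a value from the project.
    """
    representative = centroid([term_vector(d) for d in cluster_docs])
    return [d for d in web_docs
            if cosine(representative, term_vector(d)) >= threshold]


source_cluster = ["document clustering retrieval",
                  "clustering search results retrieval"]
web_pages = ["clustering web retrieval",
             "football match results tonight"]
extended = extend_cluster(source_cluster, web_pages)
```

Only the representative of each cluster is compared with web documents here, which is why the structure can be carried over to the web without clustering the web itself.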
In mediated access (see Figure 1), the users, instead of querying the target collection directly and going through multiple iterations of the search process (query formulation, query evaluation, relevance assessment), first interact with the source collection to formulate a query, which is then issued to the target collection to retrieve the relevant documents available there. Searching the WWW therefore becomes a two-stage process: query formulation against the source collection, followed by retrieval from the target collection.
The mediated access approach offers numerous advantages. Among them, it gives the web a semantic structure without clustering the whole of it, it provides substantial help in the query formulation process, and it increases the precision of the search.
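The two stages of mediated access can be sketched as below. This is only an illustration: the abstract does not say how the query is derived from the source collection, so the sketch assumes that the cluster closest to the user's initial terms is selected and that its most frequent terms form the query. The names `formulate_query` and `search_target`, and the parameters `k` and `top_n`, are hypothetical.

```python
import math
from collections import Counter


def tf(text):
    """Term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def formulate_query(user_terms, source_clusters, k=3):
    """Stage 1 (assumed strategy): pick the source cluster closest to the
    user's terms and return its k most frequent terms as the query."""
    user_vector = tf(user_terms)
    summed = []
    for docs in source_clusters:
        cluster_vector = Counter()
        for doc in docs:
            cluster_vector.update(tf(doc))
        summed.append(cluster_vector)
    best = max(summed, key=lambda vec: cosine(user_vector, vec))
    return [term for term, _ in best.most_common(k)]


def search_target(query_terms, target_docs, top_n=2):
    """Stage 2: rank target-collection documents against the query."""
    query_vector = Counter(query_terms)
    ranked = sorted(target_docs,
                    key=lambda doc: cosine(query_vector, tf(doc)),
                    reverse=True)
    return ranked[:top_n]


source_clusters = [
    ["clustering retrieval clustering", "retrieval clustering documents"],
    ["bank loan interest", "bank interest rates"],
]
target_docs = [
    "web page about clustering and retrieval",
    "bank holiday schedule",
    "retrieval of lost luggage",
]
query = formulate_query("clustering", source_clusters, k=2)
hits = search_target(query, target_docs, top_n=1)
```

In this toy run the single user term is expanded with a second term taken from the matching source cluster before the target collection is searched, which is the kind of query formulation help the approach aims to provide.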
In this project we have developed a two-part tool:
An experimental evaluation of this tool is under way. Its goal is to measure how effective WebCluster is and what improvements it offers users compared to a classical web search engine.
We thank T. Bratvold, J. Myllymki, M. L. Barja, G. Sonnenberger, and H-P. Frei from Ubilab for their valuable comments.
This project is supported by Ubilab of the Union Bank of Switzerland.
Reexamining the cluster hypothesis: Scatter/gather on retrieval.
In 19th ACM SIGIR Conference, pages 76-84, August 18-22 1996.
A model for cluster searching based on classification.
Information Systems, 5:189-195, 1980.
An informal information-seeking environment.
JASIS, 48(11):1036-1048, November 1997.
Automatic Text processing.
Addison-Wesley, Reading, Mass, USA, 1989.
Choosing classes in conceptual modeling.
Communications of the ACM, 40(6):63-69, 1997.
Information Retrieval, 2nd edition.
Algorithms for Clustering Data.
An evaluation of techniques for clustering search results.
Technical Report IR-76, University of Massachusetts at Amherst, 1996.
The WebCluster project: Using clustering for mediating access to the World Wide Web.
In ACM-SIGIR'98, Melbourne, Australia, pages 357-358, August 1998.
Poster.