Generating unambiguous URL clusters from Web search
Date
2009
Authors
Smith, G.S.
Brailsforde, T.
Donner, C.
Hooijmaijers, D.
Truran, M.
Goulding, J.
Ashman, H.L.
Editors
Baeza-Yates, B.-Y.
Ricardo, R.
Type
Conference paper
Citation
Proceedings of WSDM'09 Second ACM International Conference on Web Search and Web Data Mining : Workshop on Web Search Click Data, 2009 / Baeza-Yates, B.-Y., Ricardo, R. (ed./s), pp.28-34
Conference Name
WSDM'09 Second ACM International Conference on Web Search and Web Data Mining (9 Feb 2009 : Barcelona, Spain)
Abstract
This paper reports on the generation of unambiguous clusters of URLs from clickthrough data in the MSN search query log excerpt (the RFP 2006 dataset). Selections (clickthroughs) made by a single user in response to a single query can be assumed to have some mutual semantic relevance, and the URLs coselected in this way can be aggregated to form single-sense clusters. When the coselection graphs for a single term separate into distinct clusters, those clusters can be interpreted as disambiguated aggregations of URLs. This principle had previously been tested on smaller and more constrained datasets; this paper reports findings from applying a method based on it to the RFP 2006 dataset.
This paper evaluates the proposed coselection method for generating single-sense clusters against two other methods, with varying parameters. The evaluation combines a human assessment of the quality of the clusters generated by the different methods with a simple "edit distance" analysis of how much the content of those clusters differs between methods.
The main questions addressed are i) whether it is feasible to generate single-sense (sense-coherent) clusters, and ii) whether, in a closed world, it would be feasible to discover ambiguous terms. The experiments showed that sense-coherent clusters were found, and further indicated that ambiguous terms could be detected by observing small overlap between large clusters.
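The coselection principle described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `(user_id, query, url)` record layout, the function name `coselection_clusters`, and the use of plain connected components as the clustering step are all assumptions made for illustration, and the example URLs are invented.

```python
from collections import defaultdict
from itertools import combinations


def coselection_clusters(clicks):
    """Group coselected URLs into clusters, per query.

    clicks: iterable of (user_id, query, url) tuples (hypothetical log format).
    Returns {query: [set_of_urls, ...]}, one set per connected component.
    """
    # URLs clicked by the same user for the same query count as coselected.
    sessions = defaultdict(set)
    for user_id, query, url in clicks:
        sessions[(user_id, query)].add(url)

    # Build an undirected coselection graph for each query.
    graphs = defaultdict(lambda: defaultdict(set))
    for (_, query), urls in sessions.items():
        for a, b in combinations(sorted(urls), 2):
            graphs[query][a].add(b)
            graphs[query][b].add(a)

    # Assumption: each connected component is a candidate single-sense cluster.
    clusters = {}
    for query, graph in graphs.items():
        seen, components = set(), []
        for start in graph:
            if start in seen:
                continue
            stack, component = [start], set()
            while stack:
                node = stack.pop()
                if node in component:
                    continue
                component.add(node)
                stack.extend(graph[node] - component)
            seen |= component
            components.append(component)
        clusters[query] = components
    return clusters


# Invented example: three users searching for the same ambiguous term.
example_clicks = [
    ("u1", "jaguar", "cars.example/xk"),
    ("u1", "jaguar", "cars.example/dealers"),
    ("u2", "jaguar", "zoo.example/jaguar"),
    ("u2", "jaguar", "wildlife.example/big-cats"),
    ("u3", "jaguar", "cars.example/dealers"),
    ("u3", "jaguar", "autos.example/reviews"),
]
example_clusters = coselection_clusters(example_clicks)
```

In this sketch a URL that was the only click in its session contributes no edge and so appears in no cluster. A term whose graph splits into several large components with little or no overlap would be a candidate ambiguous term in the sense the abstract describes.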
Rights
Copyright 2009 Association for Computing Machinery
Access Condition Notes: Accepted manuscript available open access