Generating unambiguous URL clusters from Web search

Date

2009

Authors

Smith, G.S.
Brailsford, T.
Donner, C.
Hooijmaijers, D.
Truran, M.
Goulding, J.
Ashman, H.L.

Editors

Baeza-Yates, R.

Type

Conference paper

Citation

Proceedings of WSDM'09 Second ACM International Conference on Web Search and Web Data Mining: Workshop on Web Search Click Data, 2009 / Baeza-Yates, R. (ed.), pp.28-34

Conference Name

WSDM'09 Second ACM International Conference on Web Search and Web Data Mining (9 Feb 2009 : Barcelona, Spain)

Abstract

This paper reports on the generation of unambiguous clusters of URLs from clickthrough data in the MSN search query log excerpt (the RFP 2006 dataset). Selections (clickthroughs) by a single user from a single query can be assumed to have some mutual semantic relevance, and the URLs coselected in this way can be aggregated to form single-sense clusters. When the graphs for a single term separate into distinct clusters, the semantics of the distinct clusters can be interpreted as disambiguated aggregations of URLs. This principle had been tested previously on smaller and more constrained datasets, and this paper reports on findings from applying a method based on the principle to the RFP 2006 dataset. The paper evaluates the proposed coselection method for generating single-sense clusters against two other methods, with varying parameters. The evaluation combines a human assessment of the quality of the clusters generated by the different methods with a simple "edit distance" analysis of how the content of the clusters differs between methods. The main questions addressed are i) whether it is feasible to generate single-sense / sense-coherent clusters, and ii) whether, in a closed world, it would be feasible to discover ambiguous terms. The experimentation showed that sense-coherent clusters were found, and further indicated that ambiguous terms could be detected by observing small overlap between large clusters.
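
The coselection principle described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each log record is a hypothetical `(user_id, query, clicked_urls)` tuple, links every pair of URLs clicked in the same record, and treats the connected components of the resulting graph as clusters, using a small union-find structure chosen here for brevity. The sample URLs and the function name `coselection_clusters` are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def coselection_clusters(sessions):
    """Group URLs into clusters of mutually coselected URLs.

    sessions: iterable of (user_id, query, clicked_urls) tuples (assumed shape).
    Returns a list of sets, largest cluster first.
    """
    parent = {}

    def find(url):
        # Register unseen URLs, then walk up to the root with path halving.
        parent.setdefault(url, url)
        while parent[url] != url:
            parent[url] = parent[parent[url]]
            url = parent[url]
        return url

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for _user, _query, urls in sessions:
        unique = list(dict.fromkeys(urls))  # drop repeat clicks, keep order
        for url in unique:
            find(url)  # a lone click still forms a singleton cluster
        for a, b in combinations(unique, 2):
            union(a, b)  # coselected URLs share a cluster

    groups = defaultdict(set)
    for url in parent:
        groups[find(url)].add(url)
    return sorted(groups.values(), key=lambda s: (-len(s), sorted(s)))


# Invented click records for the ambiguous query "jaguar".
sessions = [
    ("u1", "jaguar", ["cars.example/jaguar", "dealer.example/jaguar"]),
    ("u2", "jaguar", ["dealer.example/jaguar", "reviews.example/jaguar-xf"]),
    ("u3", "jaguar", ["zoo.example/jaguar", "wildlife.example/jaguar"]),
]
clusters = coselection_clusters(sessions)
for cluster in clusters:
    print(sorted(cluster))
```

In this toy input the car-related and animal-related URLs never share a record, so they fall into separate clusters. This is the kind of separation the abstract interprets as a sign that the query term is ambiguous.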

Rights

Copyright 2009 Association for Computing Machinery. Access Condition Notes: Accepted manuscript available open access.