Type: Journal article
Title: Duplicate detection in programming question answering communities
Author: Zhang, W.E.
Sheng, Q.Z.
Lau, J.H.
Abebe, E.
Ruan, W.
Citation: ACM Transactions on Internet Technology, 2018; 18(3):37-1-37-21
Publisher: Association for Computing Machinery
Issue Date: 2018
ISSN: 1533-5399
Statement of Responsibility: Wei Emma Zhang, Quan Z. Sheng, Jey Han Lau, Ermyas Abebe, Wenjie Ruan
Abstract: Community-based Question Answering (CQA) websites have attracted increasing numbers of users and contributors in recent years. However, duplicate questions frequently occur on CQA websites and are currently identified manually by moderators. Automatic duplicate detection, on the one hand, alleviates this laborious effort for moderators before they take close actions and, on the other hand, helps question issuers find answers quickly. A number of studies have looked into related problems, but very few target duplicate detection in Programming CQA (PCQA), a branch of CQA dedicated to programmers. Existing works frame the task as a supervised learning problem over question pairs and rely only on textual features. Moreover, the issue of selecting candidate duplicates from large volumes of historical questions is often left unaddressed. To tackle these issues, we model duplicate detection as a two-stage “ranking-classification” problem over question pairs. In the first stage, we rank the historical questions by their similarity to the newly issued question and select the top-ranked ones as candidates to reduce the search space. In the second stage, we develop novel features that capture both textual similarity and latent semantics of question pairs, leveraging techniques from the deep learning and information retrieval literature. Experiments on real-world questions about multiple programming languages demonstrate that our method works very well, in some cases achieving up to 25% improvement over state-of-the-art benchmarks.
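The two-stage "ranking-classification" idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: TF-IDF cosine similarity stands in for the ranking stage, and a Jaccard-overlap threshold stands in for the trained classifier and the paper's textual and latent-semantic features. All function names, the sample questions, and the values of `k` and `threshold` are assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, treating '?' as a separator."""
    return text.lower().replace("?", " ").split()


def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors (term -> weight dicts) for token lists."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_candidates(new_question, history, k=2):
    """Stage 1 (stand-in): indices of the k historical questions most similar to the new one."""
    docs = [tokenize(q) for q in history] + [tokenize(new_question)]
    vectors = tfidf_vectors(docs)
    query = vectors[-1]
    scored = sorted(
        ((cosine(query, v), i) for i, v in enumerate(vectors[:-1])),
        reverse=True,
    )
    return [i for _, i in scored[:k]]


def pair_features(q1, q2):
    """Toy pair features: token Jaccard overlap and vocabulary-size ratio."""
    a, b = set(tokenize(q1)), set(tokenize(q2))
    jaccard = len(a & b) / len(a | b) if a | b else 0.0
    size_ratio = min(len(a), len(b)) / max(len(a), len(b)) if a and b else 0.0
    return [jaccard, size_ratio]


def is_duplicate(q1, q2, threshold=0.5):
    """Stage 2 (stand-in): a fixed threshold in place of a trained classifier."""
    return pair_features(q1, q2)[0] >= threshold


# Hypothetical sample data, not taken from the paper's datasets.
history = [
    "How do I reverse a list in Python?",
    "What is a null pointer exception in Java?",
    "How to sort a dictionary by value in Python?",
]
new_question = "How do I reverse a list in Python"

candidates = rank_candidates(new_question, history, k=2)
duplicates = [i for i in candidates if is_duplicate(new_question, history[i])]
```

Only the top-ranked candidates reach the pairwise check, which is how the first stage reduces the search space for the second.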
Keywords: Community-based question answering; question quality; classification; latent semantics; association rules
Rights: © 2018 ACM
RMID: 0030121118
DOI: 10.1145/3169795
Appears in Collections: Computer Science publications