Choosing an NLP library for analyzing software documentation: a systematic literature review and a series of experiments

Al Omran, F.; Treude, C.

Please use this identifier to cite or link to this item: https://hdl.handle.net/2440/109334


Type:	Conference paper
Title:	Choosing an NLP library for analyzing software documentation: a systematic literature review and a series of experiments
Author:	Al Omran, F.; Treude, C.
Citation:	IEEE International Working Conference on Mining Software Repositories, 2017, pp.187-197
Publisher:	IEEE
Issue Date:	2017
Series/Report no.:	IEEE International Working Conference on Mining Software Repositories
ISBN:	978-1-5386-1544-7
ISSN:	2160-1852, 2160-1860
Conference Name:	2017 IEEE/ACM 14th International Conference on Mining Software Repositories (MSR 2017) (20 May 2017 - 21 May 2017 : Buenos Aires, Argentina)
Statement of Responsibility:	Fouad Nasser A Al Omran, Christoph Treude
Abstract:	To uncover interesting and actionable information from natural language documents authored by software developers, many researchers rely on "out-of-the-box" NLP libraries. However, software artifacts written in natural language are different from other textual documents due to the technical language used. In this paper, we first analyze the state of the art through a systematic literature review in which we find that only a small minority of papers justify their choice of an NLP library. We then report on a series of experiments in which we applied four state-of-the-art NLP libraries to publicly available software artifacts from three different sources. Our results show low agreement between different libraries (only between 60% and 71% of tokens were assigned the same part-of-speech tag by all four libraries) as well as differences in accuracy depending on source: For example, spaCy achieved the best accuracy on Stack Overflow data with nearly 90% of tokens tagged correctly, while it was clearly outperformed by Google's SyntaxNet when parsing GitHub ReadMe files. Our work implies that researchers should make an informed decision about the particular NLP library they choose and that customizations to libraries might be necessary to achieve good results when analyzing software artifacts written in natural language.
Keywords:	Natural language processing, NLP libraries, part-of-speech tagging, software documentation
Rights:	© 2017 IEEE
DOI:	10.1109/MSR.2017.42
Published version:	http://dx.doi.org/10.1109/msr.2017.42
Appears in Collections:	Aurora harvest 7 Computer Science publications

Files in This Item:

File	Description	Size	Format
hdl_109334.pdf	Accepted version	187.27 kB	Adobe PDF


Adelaide Research & Scholarship