Exploring software reusability metrics with Q&A forum data

Matthew T Patrick

doi:10.1016/j.jss.2020.110652

Abstract

Question and answer (Q&A) forums contain valuable information regarding software reuse, but they can be challenging to analyze due to their unstructured free text. Here we introduce a new approach (LANLAN), using word embeddings and machine learning, to harness information available in StackOverflow. Specifically, we consider two different kinds of user communication describing difficulties encountered in software reuse: ‘problem reports’ point to potential defects, while ‘support requests’ ask for clarification on software usage. Word embeddings were trained on 1.6 billion tokens from StackOverflow and applied to identify which Q&A forum messages (from two large open source projects: Eclipse and Bioconductor) correspond to problem reports or support requests. LANLAN achieved an area under the receiver operator curve (AUROC) of over 0.9; it can be used to explore the relationship between software reusability metrics and difficulties encountered by users, as well as predict the number of difficulties users will face in the future. Q&A forum data can help improve understanding of software reuse, and may be harnessed as an additional resource to evaluate software reusability metrics.

Full Text