Abstract

Sentiment analysis methods have become popular for investigating human communication, including discussions related to software projects. Since general-purpose sentiment analysis tools do not fit well with the information exchanged by software developers, new tools, specific to software engineering (SE), have been developed. We investigate to what extent off-the-shelf SE-specific tools for sentiment analysis mitigate the threats to conclusion validity of empirical studies in software engineering highlighted by previous research. First, we replicate two studies addressing the role of sentiment in security discussions on GitHub and in question-writing on Stack Overflow. Then, we extend the previous studies by assessing to what extent the tools agree with each other and with the manual annotation on a gold standard of 600 documents. We find that, when used off-the-shelf, different SE-specific sentiment analysis tools might lead to contradictory results at a fine-grained level. Moreover, platform-specific tuning or retraining might be needed to take into account differences in platform conventions, jargon, or document lengths.
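
To make the agreement analysis described above concrete, here is a minimal sketch that compares the polarity labels of two hypothetical tools against a manually annotated gold standard using weighted Cohen's kappa, a common agreement metric for ordinal polarity scales. The tool names and labels below are invented for illustration; only the idea of a manually annotated gold standard comes from the study.

```python
# Illustrative only: made-up polarity labels for a handful of documents.
from sklearn.metrics import cohen_kappa_score

gold   = ["pos", "neg", "neu", "neu", "pos", "neg"]  # manual annotation
tool_a = ["pos", "neg", "neu", "pos", "pos", "neu"]  # hypothetical tool A
tool_b = ["pos", "neu", "neu", "neu", "pos", "neg"]  # hypothetical tool B

# Encode labels on an ordinal scale so that linear weighting penalizes
# pos<->neg disagreements more heavily than pos<->neu ones.
order = {"neg": 0, "neu": 1, "pos": 2}
a, b, g = ([order[x] for x in seq] for seq in (tool_a, tool_b, gold))

print("tool A vs gold:", cohen_kappa_score(a, g, weights="linear"))
print("tool B vs gold:", cohen_kappa_score(b, g, weights="linear"))
print("tool A vs tool B:", cohen_kappa_score(a, b, weights="linear"))
```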

Highlights

  • Sentiment analysis, i.e., the task of extracting the positive or negative semantic orientation of a text (Pang and Lee 2008), has emerged as a tool for empirical software engineering studies to mine emotions and opinions from textual developer-generated content (Novielli et al 2019) in the ‘social programmer’ (Storey 2012) ecosystem

  • Our results suggest that fine-tuning sentiment analysis tools to the software engineering domain might not be enough to improve accuracy; platform-specific tuning or retraining might be needed to adjust model performance to the shifts in lexical semantics due to different platform jargon or conventions (see the sketch after this list)

  • We reported the results of an extended replication aimed at assessing to what extent SE-specific sentiment analysis tools mitigate the threats to conclusion validity highlighted by previous research
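
As a hedged illustration of the platform-specific retraining mentioned in the second highlight, the sketch below fits a simple polarity classifier directly on labeled documents from the target platform instead of reusing a model trained on another platform. The pipeline and the tiny training set are hypothetical, not the tooling or data used in the study.

```python
# Hypothetical sketch: retraining a polarity classifier on platform-specific
# data (e.g., Stack Overflow posts). A real study would need far more data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented platform-specific training examples (text, polarity label).
docs = [
    "Thanks, this workaround saved my day!",
    "This API is a nightmare to configure.",
    "How do I parse JSON in Python 3?",
    "Great answer, works perfectly.",
    "The docs are useless and the error message is cryptic.",
    "What is the difference between a list and a tuple?",
]
labels = ["pos", "neg", "neu", "pos", "neg", "neu"]

# Word uni/bigrams let the model pick up platform jargon and conventions.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(docs, labels)

print(model.predict(["The compiler error is driving me crazy."]))
```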

Introduction

Sentiment analysis, i.e., the task of extracting the positive or negative semantic orientation of a text (Pang and Lee 2008), has emerged as a tool for empirical software engineering studies to mine emotions and opinions from textual developer-generated content (Novielli et al 2019) in the ‘social programmer’ (Storey 2012) ecosystem. Given the disagreement among general-purpose sentiment analysis tools, Jongeling et al (2017) replicated previously published empirical studies on sentiment analysis in software engineering to understand to what extent the choice of an instrument affects the results. They observed contradictory findings and concluded that previous studies’ results cannot be replicated when different, general-purpose sentiment analysis tools are used, i.e., the instrument choice can induce threats to conclusion validity. To overcome such limitations, researchers have started developing SE-specific sentiment analysis tools to mine developers’ emotions (e.g., Calefato et al (2018a), Ahmed et al (2017), Islam and Zibran (2017), and Chen et al (2019)) and opinions (e.g., Lin et al 2019; Uddin and Khomh 2017).
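
To see why general-purpose tools fit developer text poorly, consider the following sketch, which runs NLTK's general-purpose VADER analyzer on technically neutral developer sentences. Affect-laden words such as "kills" or "fails" are likely to pull the compound score negative even though developers use them neutrally; the sentences are invented for illustration.

```python
# Illustrative: a general-purpose analyzer applied to developer jargon.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

# Technically neutral sentences that contain affect-laden words.
for text in [
    "This patch kills the orphaned process cleanly.",
    "The build fails because of a missing dependency.",
]:
    scores = sia.polarity_scores(text)  # keys: neg, neu, pos, compound
    print(f"{scores['compound']:+.2f}  {text}")
```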
