Representation Learning for Stack Overflow Posts: How Far Are We?

Junda He, Ting Zhang, David Lo, Ferdian Thung, Zhou Yang, Bowen Xu, Xin Zhou, Kisub Kim, Ivana Clairine Irsan

doi:10.1145/3635711

Abstract

The tremendous success of Stack Overflow has produced an extensive corpus of software engineering knowledge, motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the choice of representation model for Stack Overflow posts. The growing volume of literature on Stack Overflow highlights the need for a powerful post representation model and drives researchers’ interest in developing specialized models that can capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural networks such as convolutional neural networks and the transformer architecture (e.g., BERT). Despite their promising results, these representation methods have not been evaluated in the same experimental setting. To fill this research gap, we first empirically compare the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) on a wide range of related tasks (i.e., tag recommendation, relatedness prediction, and API recommendation). The results show that Post2Vec cannot further improve the SOTA techniques for the considered downstream tasks, and BERTOverflow shows surprisingly poor performance. To find more suitable representation models for the posts, we further explore a diverse set of transformer-based models, including (1) general-domain language models (RoBERTa, Longformer, and GPT2) and (2) language models built with software engineering-related textual artifacts (CodeBERT, GraphCodeBERT, seBERT, CodeT5, PLBart, and CodeGen). This exploration shows that models such as CodeBERT and RoBERTa are suitable for representing Stack Overflow posts. However, it also illustrates the “No Silver Bullet” concept, as none of the models consistently outperforms all the others.
Inspired by these findings, we propose SOBERT, which employs a simple yet effective strategy to improve the representation of Stack Overflow posts: continuing the pre-training phase with textual artifacts from Stack Overflow. The overall experimental results demonstrate that SOBERT consistently outperforms the considered models and significantly improves the SOTA performance on all the downstream tasks.
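
To make the idea of continued pre-training more concrete, the following is a minimal sketch of how one training example might be prepared from a Stack Overflow post. The abstract does not state the pre-training objective, so the sketch assumes the standard BERT-style masked language modeling recipe (select about 15% of tokens; of those, replace 80% with a mask token, 10% with a random token, and leave 10% unchanged). The names `mask_tokens`, `MASK`, `vocab`, and the sample post are hypothetical and are not taken from the paper.

```python
import random

MASK = "[MASK]"  # hypothetical mask symbol, following BERT's convention


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Build one masked-language-modeling example from a token list.

    Returns (corrupted, labels). labels[i] is the original token at every
    position the model is asked to predict, and None everywhere else.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # this position contributes to the loss
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: mask token
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: left unchanged
        else:
            labels.append(None)  # not selected, ignored by the loss
            corrupted.append(tok)
    return corrupted, labels


# Hypothetical post text; real posts would go through a subword tokenizer.
post = "how do i reverse a list in python without using reversed".split()
vocab = sorted(set(post))
corrupted, labels = mask_tokens(post, vocab, mask_prob=0.15, seed=1)
print(corrupted)
print(labels)
```

In an actual continued pre-training run, examples like these would be fed to an already pre-trained model so that its weights adapt to Stack Overflow text before fine-tuning on a downstream task such as tag recommendation.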

Journal: ACM Transactions on Software Engineering and Methodology	Publication Date: Mar 15, 2024
Citations: 6	License type: other-oa

Similar Papers

Post2Vec: Learning Distributed Representations of Stack Overflow Posts
Bowen Xu ... Chengran Yang
IEEE Transactions on Software Engineering | VOL. 48
01 Sep 2022

Stack Overflow: A code laundering platform?
Le An ... Giuliano Antoniol
01 Feb 2017

I Know What You Are Searching for: Code Snippet Recommendation from Stack Overflow Posts
Zhipeng Gao ... Xin Xia
ACM Transactions on Software Engineering and Methodology | VOL. 32
26 Apr 2023

Characterizing Leveraged Stack Overflow Posts
Salvatore Geremia ... Rocco Oliveto
01 Sep 2019
