An automated detection of confusing variable pairs with highly similar compound names in Java and Python programs

Hirohisa Aman, Sousuke Amasaki, Tomoyuki Yokogawa, Minoru Kawahara

doi:10.1007/s10664-023-10339-2

https://doi.org/10.1007/s10664-023-10339-2

Abstract

Variable names are a significant source of information about source code, and naming variables well is key to producing readable code. Programmers often form a compound variable name by concatenating two or more words to make the name more informative and the code more readable. Although each compound variable name is descriptive on its own, a collection of them sometimes contains “confusing” variable pairs whose names are highly similar, e.g., “shippingHeight” vs. “shippingWeight.” A confusing variable pair harms code readability because it can cause variables to be misread or mixed up during programming or code review. Toward automated support for enhancing code readability, this paper conducts a large-scale investigation of compound variable names in Java and Python programs. The investigation collects 116,921,127 pairs of compound-named variables from 1,876 open-source Java projects and 106,943,523 pairs of such variables from 2,427 open-source Python projects. The study then analyzes those variable pairs from two perspectives of name similarity: string similarity and semantic similarity. Through an evaluation study with 30 human participants, the data analyses show that both string and semantic similarity can help detect confusing variable pairs in Java and Python programs. To detect confusing variable pairs automatically, this study also develops support tools.
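
The string-similarity perspective can be illustrated with a minimal Python sketch. The abstract does not say which similarity metric or threshold the study uses, so the normalized Levenshtein similarity, the 0.8 threshold, and the helper names (`split_compound`, `string_similarity`, `confusing_pairs`) below are all assumptions made for illustration; this is not the authors' tool. Semantic similarity is not covered here.

```python
import re
from itertools import combinations


def split_compound(name):
    """Split a camelCase or snake_case name into lowercase words."""
    words = []
    for part in name.split("_"):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def levenshtein(a, b):
    """Edit distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def string_similarity(a, b):
    """Normalized similarity in [0, 1]; an assumed metric, not the paper's."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def confusing_pairs(names, threshold=0.8):
    """Return pairs of compound names at or above the assumed threshold."""
    compound = [n for n in names if len(split_compound(n)) >= 2]
    return [(a, b) for a, b in combinations(compound, 2)
            if string_similarity(a.lower(), b.lower()) >= threshold]


if __name__ == "__main__":
    names = ["shippingHeight", "shippingWeight", "totalCount", "i"]
    print(confusing_pairs(names))
```

With the abstract's own example, “shippingHeight” and “shippingWeight” differ by a single character out of 14, so the pair is flagged, while the single-word name `i` is skipped because it is not a compound name.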

Full Text