Programming language detection from source code excerpts remains an active research field and has already been addressed with machine learning and natural language processing. Identifying the language of short code snippets offers benefits, but also poses challenges, across various scenarios such as embedded code analysis, forums, Q&A systems, search engines, source code repositories, and text editors. Existing approaches to language detection typically require multiple lines or even the entire file contents. In this article, we propose a character-level deep learning model designed to predict the programming language from a single line of code. To this end, we construct a balanced dataset comprising 434.18 million instances across 21 languages, exceeding the size of existing datasets by three orders of magnitude. Leveraging this dataset, we train a deep bidirectional recurrent neural network that achieves 95.07% accuracy and macro-F1 score for a single line of code. To predict the programming language of multiple lines (e.g., code snippets) and entire files, we build a stacking ensemble meta-model that leverages our single-line model to efficiently recognize the language of multiple lines of code. Our system outperforms state-of-the-art approaches not only for a single line of code, but also for snippets of 5 and 10 lines and for whole source code files. We also present PLangRec, an open-source language detection system that includes our trained models. PLangRec is freely available as a user-friendly web application, a web API, and a Python desktop program.
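To illustrate the kind of architecture the abstract describes, the following is a minimal sketch of a character-level deep bidirectional recurrent classifier over 21 languages. It assumes a Keras/TensorFlow stack; the vocabulary size, maximum line length, recurrent cell type, and layer widths are illustrative assumptions, since the abstract does not specify the paper's exact configuration.

import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 128      # assumption: one index per printable ASCII character
MAX_LINE_LEN = 120    # assumption: lines are padded/truncated to this length
NUM_LANGUAGES = 21    # from the abstract: 21 programming languages

model = models.Sequential([
    layers.Input(shape=(MAX_LINE_LEN,)),
    # Learn a dense embedding for each character of the input line
    layers.Embedding(VOCAB_SIZE, 64, mask_zero=True),
    # Stacked bidirectional recurrent layers ("deep bidirectional RNN")
    layers.Bidirectional(layers.GRU(128, return_sequences=True)),
    layers.Bidirectional(layers.GRU(128)),
    # Softmax over the candidate languages
    layers.Dense(NUM_LANGUAGES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

A meta-model in the spirit of the stacking ensemble described above could then average or combine the per-line softmax outputs of this classifier across the lines of a snippet or file before making the final prediction.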