Monolingual, multilingual and cross-lingual code comment classification

Marija Kostić, Vuk Batanović, Boško Nikolić

doi:10.1016/j.engappai.2023.106485

Open Access

https://doi.org/10.1016/j.engappai.2023.106485

Abstract

Code comments are one of the most useful forms of documentation and metadata for understanding software implementation. Previous research on code comment classification has focused only on comments in English, typically extracted from a few programming languages. This paper addresses the problem of code comment classification not only in the monolingual setting, but also in the multilingual and cross-lingual one, in order to examine whether they can outperform the traditional monolingual approach. To tackle this task, we introduce a novel, publicly available code comment dataset, consisting of over 10,000 code comments collected from software projects written in eight programming languages (C, C++, C#, Java, JavaScript/TypeScript, PHP, Python, and SQL). About half of them are written in Serbian while the other half are written in English. This dataset was manually annotated according to a newly proposed taxonomy of code comment categories. We fine-tune and evaluate multiple monolingual and multilingual pre-trained neural language models on the code comment classification task and compare their performances to several baselines. The best results for Serbian comments are obtained using the monolingual neural model BERTić, trained on Serbian and closely related languages. On the other hand, the optimal choice for English is the multilingual neural model multilingual BERT, which successfully extracts useful patterns from data in both languages. Although the cross-lingual setting shows some promise for simple binary classification, it has yet to reach sufficiently high performance levels for practical use. We also analyze model performance across different programming languages.
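
The three evaluation settings named in the abstract differ mainly in which comments a model sees during training. The sketch below illustrates that difference under the usual reading of the terms: monolingual trains and tests on the same language, multilingual trains on both languages, and cross-lingual trains on one language and tests on the other. That reading is an assumption, since the abstract does not spell out the exact protocol. The `Comment` type, the `select_data` helper, the toy comments, and the label names are all hypothetical and are not taken from the paper's dataset or taxonomy.

```python
from typing import List, NamedTuple, Tuple


class Comment(NamedTuple):
    text: str
    lang: str   # natural language of the comment: "sr" or "en"
    label: str  # placeholder category, NOT the paper's taxonomy


def select_data(
    train: List[Comment], test: List[Comment], setting: str, target: str
) -> Tuple[List[Comment], List[Comment]]:
    """Pick training and evaluation comments for one setting and target language."""
    # Evaluation always uses comments written in the target language.
    eval_set = [c for c in test if c.lang == target]
    if setting == "monolingual":
        # Train only on the target language.
        train_set = [c for c in train if c.lang == target]
    elif setting == "multilingual":
        # Train on comments in every available language.
        train_set = list(train)
    elif setting == "cross-lingual":
        # Train only on the other language(s); the target is unseen in training.
        train_set = [c for c in train if c.lang != target]
    else:
        raise ValueError(f"unknown setting: {setting!r}")
    return train_set, eval_set


# Toy data for illustration only.
TRAIN = [
    Comment("returns the user id", "en", "functional"),
    Comment("TODO: remove this hack", "en", "todo"),
    Comment("vraća ID korisnika", "sr", "functional"),
    Comment("TODO: ukloniti ovo", "sr", "todo"),
]
TEST = [
    Comment("opens the file", "en", "functional"),
    Comment("otvara fajl", "sr", "functional"),
]

if __name__ == "__main__":
    for setting in ("monolingual", "multilingual", "cross-lingual"):
        train_set, eval_set = select_data(TRAIN, TEST, setting, target="sr")
        train_langs = sorted({c.lang for c in train_set})
        print(setting, "train languages:", train_langs, "eval size:", len(eval_set))
```

Any classifier, whether a fine-tuned language model or a simpler baseline, could then be trained on the first list and scored on the second, which keeps the comparison between settings limited to the choice of training data.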

Journal: Engineering Applications of Artificial Intelligence
Publication Date: Jun 10, 2023
Citations: 7
License type: CC BY
