JEMMA: An extensible Java dataset for ML4Code applications

Anjan Karmakar, Miltiadis Allamanis, Romain Robbes

doi:10.1007/s10664-022-10275-7

Abstract

Machine Learning for Source Code (ML4Code) is an active research field in which extensive experimentation is needed to discover how to best use source code’s richly structured information. With this in mind, we introduce JEMMA: An Extensible Java Dataset for ML4Code Applications, which is a large-scale, diverse, and high-quality dataset targeted at ML4Code. Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks. JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties (e.g., metrics, static analysis results) for 50,000 Java projects from the 50K-C dataset, with over 1.2 million classes and over 8 million methods. JEMMA is also extensible, allowing users to add new properties and representations to the dataset and evaluate tasks on them. Thus, JEMMA becomes a workbench that researchers can use to experiment with novel representations and tasks operating on source code. To demonstrate the utility of the dataset, we also report results from two empirical studies on our data, ultimately showing that significant work lies ahead in the design of context-aware source code models that can reason over a broader network of source code entities in a software project, the very task that JEMMA is designed to help with.

Journal: Empirical Software Engineering
Publication Date: Mar 1, 2023
Citations: 2
License type: open-access
