Jupyter Notebooks on GitHub: Characteristics and Code Clones

Malin Källén, Tobias Wrigstad

doi:10.22152/programming-journal.org/2021/5/15

Open Access

Abstract

Jupyter notebooks have emerged as a standard tool for data science programming. Programs in Jupyter notebooks differ from typical programs in that they are constructed from a collection of code snippets interleaved with text and visualisation. This allows interactive exploration, and snippets may be executed in different orders, which may give rise to different results due to side effects between snippets. Previous studies have shown the presence of considerable code duplication (code clones) in sources of traditional programs, in both so-called systems programming languages and so-called scripting languages. In this paper we present the first large-scale study of code cloning in Jupyter notebooks. We analyse a corpus of 2.7 million Jupyter notebooks hosted on GitHub, representing 37 million individual snippets and 227 million lines of code. We study clones at the level of individual snippets, and the extent to which snippets recur across multiple notebooks. We study both identical clones and approximate clones, and conduct a small-scale ocular inspection of the most common clones. We find that code cloning is common in Jupyter notebooks: more than 70% of all code snippets are exact copies of other snippets (with possible differences in white space), and around 50% of all notebooks do not have any unique snippet, but consist solely of snippets that are also found elsewhere. In notebooks written in Python, at least 80% of all snippets are approximate clones, and the prevalence of code cloning is higher in Python than in other languages. We further find that clones between different repositories are far more common than clones within the same repository. However, the most common individual repository from which a Jupyter notebook contains clones is the repository in which the notebook itself resides.
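
To make the notion of an exact clone "with possible differences in white space" concrete, here is a minimal sketch of how such clones could be counted. It is an illustration, not the authors' pipeline: the function names `normalise` and `clone_fraction` are hypothetical, and stripping all whitespace before hashing is an assumed (and deliberately crude) normalisation that the abstract does not specify.

```python
import hashlib
from collections import defaultdict


def normalise(snippet: str) -> str:
    """Drop all whitespace so that formatting-only differences vanish.

    Assumption: the abstract only says clones may differ in white space;
    removing every whitespace character is one simple way to model that.
    """
    return "".join(snippet.split())


def clone_fraction(snippets):
    """Return the fraction of non-empty snippets that have at least one exact copy."""
    groups = defaultdict(list)
    for index, snippet in enumerate(snippets):
        key = normalise(snippet)
        if key:  # ignore snippets that are empty after normalisation
            digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
            groups[digest].append(index)
    total = sum(len(members) for members in groups.values())
    cloned = sum(len(members) for members in groups.values() if len(members) > 1)
    return cloned / total if total else 0.0


# Hypothetical snippets: the first two differ only in whitespace.
example = [
    "import numpy as np",
    "import  numpy  as  np\n",
    "x = np.arange(10)",
    "print(x)",
]
print(clone_fraction(example))
```

Grouping by a hash of the normalised text keeps memory use proportional to the number of distinct snippets, which matters at the scale of tens of millions of snippets. Approximate clones would need a different technique and are not covered by this sketch.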

Highlights

Data science, that is, processing, analysing and extracting knowledge from large quantities of data, has emerged as a new inter-disciplinary field or new research paradigm, and an increasingly important component in industry, as many companies strive to be “data-driven”.
In this paper we present the first large-scale study of code cloning in Jupyter notebooks.
We find that code cloning is common in Jupyter notebooks: more than 70% of all code snippets are exact copies of other snippets, and around 50% of all notebooks do not have any unique snippet, but consist solely of snippets that are found elsewhere.

Summary

Introduction

Data science, that is, processing, analysing and extracting knowledge from large quantities of data, has emerged as a new inter-disciplinary field or new research paradigm, and an increasingly important component in industry, as many companies strive to be “data-driven”. The emergence and rapid growth of this field is fuelled by the availability of and easy access to vast quantities of data, the relative ease with which such data sets can be gathered with new technology, and the availability of easy-to-use computational tools that hide most of the complicated data crunching and computation behind (relatively speaking) easy interfaces. This allows a new class of programmers, who would not traditionally view themselves as such, to explore data sets and use statistical methods for business decisions, in research, and in society. Jupyter Notebook files are stored on disk in JSON format.
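
Because notebooks are stored as JSON, their code snippets can be pulled out with a standard JSON parser. The sketch below assumes the version 4 notebook layout, in which a top-level `cells` list holds entries with a `cell_type` and a `source` that is either a string or a list of lines. The function name `code_snippets` and the sample notebook are made up for illustration, and older notebook formats are not handled.

```python
import json


def code_snippets(notebook_json: str):
    """Return the source text of every code cell in a notebook JSON string."""
    notebook = json.loads(notebook_json)
    snippets = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):
            # The source is commonly stored as a list of lines.
            source = "".join(source)
        snippets.append(source)
    return snippets


# A tiny hypothetical notebook with one markdown cell and one code cell.
sample = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,
            "outputs": [],
            "source": ["import math\n", "print(math.pi)"],
        },
    ],
})
print(code_snippets(sample))
```

Each returned string corresponds to one snippet in the sense used above, so a list like this could serve as the input to a snippet-level clone analysis.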

Journal: The Art, Science, and Engineering of Programming
Publication Date: Feb 26, 2021
Citations: 4
License type: cc-by
