Mining structured data in natural language artifacts with island parsing

Alberto Bacchelli, Anthony Cleve, Andrea Mocci, Michele Lanza

doi:10.1016/j.scico.2017.06.009

Open Access

https://doi.org/10.1016/j.scico.2017.06.009

Abstract

Software repositories typically store data composed of structured and unstructured parts. Researchers mine this data to empirically validate research ideas and to support practitioners' activities. Structured data (e.g., source code) has a formal syntax and is straightforward to analyze; unstructured data (e.g., documentation) is a mix of natural language, noise, and snippets of structured data, and is harder to analyze. The structured content (e.g., code snippets) embedded in unstructured data is especially valuable. Researchers have proposed several approaches to recognize, extract, and analyze structured data embedded in natural language. We analyze these approaches and investigate their drawbacks. We then present two novel methods, based on scannerless generalized LR (SGLR) parsing and Parsing Expression Grammars (PEGs), that address these drawbacks and mine structured fragments within unstructured data. We validate and compare the two methods on development emails and Stack Overflow posts containing Java code fragments. Both achieve high precision and recall, but the PEG-based method offers better computational performance and is simpler to engineer.
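
The island parsing idea named in the title can be illustrated with a small sketch. This is not the authors' SGLR or PEG grammar, which the abstract does not describe. It is a toy recognizer written in a PEG-like style (ordered choice: try an island first, otherwise consume water). The island definition (a Java-style, possibly qualified method call with balanced parentheses) and all function names are assumptions made for illustration only.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical island definition: a possibly qualified Java-style method call
# such as `list.add(x)`. Everything else in the text is treated as water.
_QUALIFIED_NAME = re.compile(r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)*")
# Water: one whole word, or any single character (including newlines).
_WATER = re.compile(r"\w+|.", re.DOTALL)


def match_balanced_parens(text: str, pos: int) -> Optional[int]:
    """Return the index just past a balanced (...) group starting at pos."""
    if pos >= len(text) or text[pos] != "(":
        return None
    depth = 0
    for i in range(pos, len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return i + 1
    return None  # unbalanced parentheses: not an island


def match_island(text: str, pos: int) -> Optional[int]:
    """Try to match an island at pos; return its end index or None."""
    name = _QUALIFIED_NAME.match(text, pos)
    if name is None:
        return None
    return match_balanced_parens(text, name.end())


def extract_islands(text: str) -> List[Tuple[int, str]]:
    """Scan text left to right, collecting (offset, fragment) islands."""
    islands: List[Tuple[int, str]] = []
    pos = 0
    while pos < len(text):
        end = match_island(text, pos)  # ordered choice: island first
        if end is not None:
            islands.append((pos, text[pos:end]))
            pos = end
        else:
            pos = _WATER.match(text, pos).end()  # otherwise skip water
    return islands
```

For example, calling `extract_islands("I get an NPE when calling list.add(map.get(key)) after init() returns.")` yields the fragments `list.add(map.get(key))` and `init()`. A realistic island grammar for Java would need many more productions, and this toy rule produces false positives on prose such as `method(s)`.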

Journal: Science of Computer Programming
Publication Date: Aug 1, 2017
Citations: 5
License type: publisher-specific-oa
