Data Provenance

Abstract

Data provenance has evolved from a niche topic to a mainstream area of research in databases and other research communities. This article gives a comprehensive introduction to data provenance. The main focus is on provenance in the context of databases. However, it is also insightful to consider connections to related research in programming languages, software engineering, semantic web, formal logic, and other communities. The target audience is researchers and practitioners who want to gain a solid understanding of data provenance and the state of the art in this research area. The article assumes only that the reader has a basic understanding of database concepts, but not necessarily any prior exposure to provenance.

Similar Papers
  • Research Article
  • Citations: 14
  • 10.1145/3689735
Knowledge Transfer from High-Resource to Low-Resource Programming Languages for Code LLMs
  • Oct 8, 2024
  • Proceedings of the ACM on Programming Languages
  • Federico Cassano + 9 more

Over the past few years, Large Language Models of Code (Code LLMs) have started to have a significant impact on programming practice. Code LLMs are also emerging as building blocks for research in programming languages and software engineering. However, the quality of code produced by a Code LLM varies significantly by programming language. Code LLMs produce impressive results on high-resource programming languages that are well represented in their training data (e.g., Java, Python, or JavaScript), but struggle with low-resource languages that have limited training data available (e.g., OCaml, Racket, and several others). This paper presents an effective approach for boosting the performance of Code LLMs on low-resource languages using semi-synthetic data. Our approach, called MultiPL-T, generates high-quality datasets for low-resource languages, which can then be used to fine-tune any pretrained Code LLM. MultiPL-T translates training data from high-resource languages into training data for low-resource languages in the following way. 1) We use a Code LLM to synthesize unit tests for commented code from a high-resource source language, filtering out faulty tests and code with low test coverage. 2) We use a Code LLM to translate the code from the high-resource source language to a target low-resource language. This gives us a corpus of candidate training data in the target language, but many of these translations are wrong. 3) We use a lightweight compiler to compile the test cases generated in (1) from the source language to the target language, which allows us to filter out obviously wrong translations. The result is a training corpus in the target low-resource language where all items have been validated with test cases. We apply this approach to generate tens of thousands of new, validated training items for five low-resource languages: Julia, Lua, OCaml, R, and Racket, using Python as the source high-resource language.
Furthermore, we use an open Code LLM (StarCoderBase) with open training data (The Stack), which allows us to decontaminate benchmarks, train models without violating licenses, and run experiments that could not otherwise be done. Using datasets generated with MultiPL-T, we present fine-tuned versions of StarCoderBase and Code Llama for Julia, Lua, OCaml, R, and Racket that outperform other fine-tunes of these base models on the natural language to code task. We also present Racket fine-tunes for two very recent models, DeepSeek Coder and StarCoder2, to show that MultiPL-T continues to outperform other fine-tuning approaches for low-resource languages. The MultiPL-T approach is easy to apply to new languages, and is significantly more efficient and effective than alternatives such as training longer.

  • Conference Article
  • Citations: 4
  • 10.1145/3379177.3390305
How to Treat the Use of Grey Literature in Software Engineering
  • Jun 26, 2020
  • Xin Zhou

Context: Following on other scientific disciplines, such as health sciences, the use of grey literature (GL) is becoming widespread in Software Engineering (SE) research. Whilst the number of papers incorporating GL in SE is increasing, there is little empirically known about different aspects of the use of GL in SE research. In particular, there is a lack of good evaluation standards for the quality of GL. Aim: Our research is aimed at systematically reviewing the use of GL in SE, empirically exploring SE researchers' views on GL and providing a guide for using GL in SE and for quality assessment of the GL to be included. Method: We used a mixed-methods approach for this research. We carried out a Systematic Literature Review (SLR) of the use of GL in SE. Then we surveyed the authors of the papers included in the SLR (as GL users) and the invited experts in the SE community on the use of GL in SE research. Results: We systematically selected and reviewed 102 SE secondary studies that incorporate GL in SE research, from which we identified two groups based on their reporting: 1) 76 reviews only claim their use of GL; 2) 26 reviews report the results by including GL. We also obtained 20 replies from the GL users and 24 replies from the invited SE experts. Conclusion: There is no common understanding of the meaning of GL in SE. Researchers define the scopes and the definitions of GL in a variety of ways. We found five main reasons for using GL in SE research. The findings have enabled us to propose a conceptual model for how GL works in the SE research lifecycle. There is a need for research to develop guidelines for using GL in SE and for assessing the quality of GL.

  • Conference Article
  • Citations: 19
  • 10.1145/3377811.3380336
An evidence-based inquiry into the use of grey literature in software engineering
  • Jun 27, 2020
  • He Zhang + 4 more

Context: Following on other scientific disciplines, such as health sciences, the use of Grey Literature (GL) has become widespread in Software Engineering (SE) research. Whilst the number of papers incorporating GL in SE is increasing, there is little empirically known about different aspects of the use of GL in SE research. Method: We used a mixed-methods approach for this research. We carried out a Systematic Literature Review (SLR) of the use of GL in SE, and surveyed the authors of the selected papers included in the SLR (as GL users) and the invited experts in the SE community on the use of GL in SE research. Results: We systematically selected and reviewed 102 SE secondary studies that incorporate GL in SE research, from which we identified two groups based on their reporting: 1) 76 reviews only claim their use of GL; 2) 26 reviews report the results by including GL. We also obtained 20 replies from the GL users and 24 replies from the invited SE experts. Conclusion: There is no common understanding of the meaning of GL in SE. Researchers define the scopes and the definitions of GL in a variety of ways. We found five main reasons for using GL in SE research. The findings have enabled us to propose a conceptual model for how GL works in the SE research lifecycle. There is an apparent need for research to develop guidelines for using GL in SE and for assessing the quality of GL. The current work can provide a panorama of the state of the art of using GL in SE for follow-up research, so as to determine the position of GL in SE research.

  • Research Article
  • 10.9734/ajrcos/2024/v17i2420
A 5 Year Bibliometric Review of Programming Language Research Dynamics in Southeast Asia (2018-2023)
  • Jan 25, 2024
  • Asian Journal of Research in Computer Science
  • Arawela Lou Delmo + 2 more

Aims: To conduct a systematic examination and bibliometric analysis of Scopus-indexed literature focusing on emerging trends in programming languages research within Southeast Asia.
Study Design: This study employs a mixed-methods approach, incorporating both qualitative and bibliometric analysis.
Place and Duration of Study: Publication data for review was obtained from the Scopus database, covering the period from 2018 to 2023, with a specific focus on the progress in programming language and semantics research within ASEAN countries.
Methodology: We used the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) protocol to collect publication data. Bibliometric data was visualized through Biblioshiny and VOSviewer.
Results: From 2018 to 2023, the research production involving programming languages and semantics across ASEAN countries has been strong, yielding a total of 233 documents from 160 unique sources. However, the annual growth rate was -10.87%. There was a total of 882 authors with only 10 sole authors in the field. 46.78% of the documents had international co-authorship with an average of 4.03 authors per document. The literature spanned 764 unique author keywords and 7424 citations with an average of 11.12 citations per document.
Conclusion: Southeast Asia has a rich and collaborative research space in the field of Programming Languages but it faces several barriers such as the absence of a unified research agenda, the lack of adequate funding, and the relatively weak industrial base.

  • Research Article
  • Citations: 9
  • 10.1145/882240.882258
Influences on the design of exception handling: ACM SIGSOFT project on the impact of software engineering research on programming language design
  • Jul 1, 2003
  • ACM SIGSOFT Software Engineering Notes
  • Barbara G Ryder + 1 more

There has long been a close association between research in software engineering and the design of programming languages. Part of the IMPACT project involves an exploration of the interrelations of these two fields and documentation in a report of how fundamental research in software engineering has been a valuable resource for programming language features commonly used today. The resulting report investigates the relationship by considering features in currently used languages, including exceptions, control and data abstractions, types, inheritance, concurrency and visualization mechanisms. This paper, excerpted from the report, focuses on the influence of software engineering research on the development of exceptions. The paper demonstrates that there is a symbiotic relationship between software engineering research and the design of exception handling in programming languages. Publication of these partial results is aimed at soliciting feedback and comments from both the programming languages and software engineering communities.

  • Research Article
  • Citations: 20
  • 10.1016/s0950-5849(05)80002-3
Programming versus databases in the object-oriented paradigm
  • Feb 1, 1993
  • Information and Software Technology
  • Sa Demurjian + 2 more


  • Research Article
  • Citations: 14
  • 10.1145/885638.885644
Influences on the design of exception handling
  • Jun 1, 2003
  • ACM SIGPLAN Notices
  • Barbara G Ryder + 1 more

There has long been a close association between research in software engineering and the design of programming languages. Part of the IMPACT project involves an exploration of the interrelations of these two fields and documentation in a report of how fundamental research in software engineering has been a valuable resource for programming language features commonly used today. The resulting report investigates the relationship by considering features in currently used languages, including exceptions, control and data abstractions, types, inheritance, concurrency and visualization mechanisms. This paper, excerpted from the report, focuses on the influence of software engineering research on the development of exceptions. The paper demonstrates that there is a symbiotic relationship between software engineering research and the design of exception handling in programming languages. Publication of these partial results is aimed at soliciting feedback and comments from both the programming languages and software engineering communities.

  • Book Chapter
  • Citations: 3
  • 10.1007/11926078_85
The Semantic Web: Suppliers and Customers
  • Jan 1, 2006
  • Rudi Studer

The notion of the Semantic Web can be coined as a Web of data when bringing database content to the Web or as a Web of enriched human-readable content when encoding the semantics of web resources in a machine-interpretable form. It has been clear from the beginning that realizing the Semantic Web vision will require interdisciplinary research. At this, the fifth ISWC, it is time to re-examine the extent to which interdisciplinary work has played and can play a role in Semantic Web research, and even how Semantic Web research can contribute to other disciplines. Core Semantic Web research has drawn from various disciplines, such as knowledge representation and formal ontologies, reusing and further developing their techniques in a new context. However, there are several other disciplines that explore research issues very relevant to the Semantic Web in different guises and to differing extents. As a community, we can benefit by also recognizing and drawing from the research in these different disciplines. On the other hand, Semantic Web research also has much to contribute to these disciplines and communities. For example, the Semantic Web offers scenarios that often ask for unprecedented scalability of techniques from other disciplines. Throughout the talk, I will illustrate these points through examples from disciplines such as natural language processing, databases, software engineering and automated reasoning. The industry also has a major role to play in the realization of the Semantic Web vision. I will therefore additionally examine the added value of Semantic Web technologies for commercial applications and discuss issues that should be addressed for broadening the market for Semantic Web technologies.

  • Conference Article
  • Citations: 19
  • 10.1145/3368089.3409767
Community expectations for research artifacts and evaluation processes
  • Nov 7, 2020
  • Ben Hermann + 2 more

Background. Artifact evaluation has been introduced into the software engineering and programming languages research community with a pilot at ESEC/FSE 2011 and has since then enjoyed a healthy adoption throughout the conference landscape. Objective. In this qualitative study, we examine the expectations of the community toward research artifacts and their evaluation processes. Method. We conducted a survey including all members of artifact evaluation committees of major conferences in the software engineering and programming language field since the first pilot and compared the answers to expectations set by calls for artifacts and reviewing guidelines. Results. While we find that some expectations exceed the ones expressed in calls and reviewing guidelines, there is no consensus on quality thresholds for artifacts in general. We observe very specific quality expectations for specific artifact types for review and later usage, but also a lack of their communication in calls. We also find problematic inconsistencies in the terminology used to express artifact evaluation’s most important purpose – replicability. Conclusion. We derive several actionable suggestions which can help to mature artifact evaluation in the inspected community and also to aid its introduction into other communities in computer science.

  • Conference Article
  • Citations: 9
  • 10.1109/icse.2013.6606771
1st International workshop on live programming (LIVE 2013)
  • May 1, 2013
  • Brian Burg + 2 more

Live programming is an idea espoused by programming environments from the earliest days of computing (such as Lisp machines and Smalltalk) but has since lain dormant. Recently, the prevalence of asynchronous feedback in programming languages such as JavaScript and advances in visualizations and user interfaces have led to a resurgence of live programming in online education communities (such as Khan Academy) and in experimental IDEs (such as LightTable). The LIVE 2013 workshop includes 11 papers describing visions, implementations, mashups, and new directions of live programming environments. The participants include both practitioners of live coding and researchers in programming languages and software engineering. Finally, several demos curated on the live workshop page are presented.

  • Conference Article
  • Citations: 2
  • 10.5555/2486788.2487068
1st international workshop on live programming (LIVE 2013)
  • May 18, 2013
  • Brian R Burg + 2 more

Live programming is an idea espoused by programming environments from the earliest days of computing (such as Lisp machines and Smalltalk) but has since lain dormant. Recently, the prevalence of asynchronous feedback in programming languages such as JavaScript and advances in visualizations and user interfaces have led to a resurgence of live programming in online education communities (such as Khan Academy) and in experimental IDEs (such as LightTable). The LIVE 2013 workshop includes 11 papers describing visions, implementations, mashups, and new directions of live programming environments. The participants include both practitioners of live coding and researchers in programming languages and software engineering. Finally, several demos curated on the live workshop page are presented.

  • Research Article
  • Citations: 11
  • 10.1145/984532.984536
Research issues in database specification
  • Apr 1, 1983
  • ACM SIGMOD Record
  • Michael L Brodie

This paper summarizes discussions of a panel on "Type Specifications and Databases" at VLDB in Mexico City. Panel members are listed at the end of the paper. Significant advances have been achieved in software engineering and programming language research in the development of specification techniques. There are important consequences for design, redesign, precision, and analysis of software. The importance of this work to database applications, and indeed data models and data languages, is now becoming apparent. However, specific database issues (e.g., constraints, complex data relationships, shared data, data independence) alter the specification problem as encountered in programming languages. The summary emphasizes the importance of (precise) specification in the database context and relates recent results in both programming languages and databases. It also lists outstanding theoretical problems and the relationship of advances in specification research to the development of semantic data models and high level languages for databases.

  • Conference Article
  • 10.1145/2676726.2682620
Databases and Programming
  • Jan 14, 2015
  • Peter Buneman

The 1990s saw a hugely productive interaction between database and programming language research. Ideas about type systems from programming languages played a central role in generalizing and adapting relational database systems to new data models. At the same time databases provided some of the best concrete examples of the application of concurrency theory and of the benefits of high-level optimization in functional programming languages. One of the driving ambitions behind this research was the idea that database access should be properly embedded in programming languages: one should not have to be bilingual in order to use a database from a programming language; and that goal has to some extent been realized. In the past fifteen years, new data models, both for data storage and for data exchange, have appeared with depressing regularity and with each such model, the inevitable query language. Does programming language research have anything to contribute to these new languages? Should we take the time to worry about embedding these models in conventional languages? Over the same period, some interesting new connections between databases and programming languages have emerged, notably in the areas of scientific databases, annotation and provenance. Will this provide new opportunities for cross-fertilization?

  • Research Article
  • 10.1145/2775051.2682620
Databases and Programming
  • Jan 14, 2015
  • ACM SIGPLAN Notices
  • Peter Buneman

The 1990s saw a hugely productive interaction between database and programming language research. Ideas about type systems from programming languages played a central role in generalizing and adapting relational database systems to new data models. At the same time databases provided some of the best concrete examples of the application of concurrency theory and of the benefits of high-level optimization in functional programming languages. One of the driving ambitions behind this research was the idea that database access should be properly embedded in programming languages: one should not have to be bilingual in order to use a database from a programming language; and that goal has to some extent been realized. In the past fifteen years, new data models, both for data storage and for data exchange, have appeared with depressing regularity and with each such model, the inevitable query language. Does programming language research have anything to contribute to these new languages? Should we take the time to worry about embedding these models in conventional languages? Over the same period, some interesting new connections between databases and programming languages have emerged, notably in the areas of scientific databases, annotation and provenance. Will this provide new opportunities for cross-fertilization?

  • Conference Article
  • Citations: 1
  • 10.1109/icse.1994.672728
Workshop on the intersection between databases and software engineering
  • Jan 1, 1994
  • R King

Summary form only given. The complete workshop presentation was not made available for publication as part of the conference proceedings. In 1989, a workshop was held in Napa, California, USA. The meeting brought together database and software engineering researchers, and resulted in lively discussions concerning research issues of interest to both software engineering and database researchers. This workshop is, in a sense, meant to be a sequel to the Napa Workshop. We hope to not only assess the state of the art in this research specialty, but to also develop the nucleus of a research agenda that spans both communities. Attendance at the workshop is by invitation only. Fifty-three prospective attendees submitted position papers, which were reviewed by a seven person committee representing the software engineering and database fields. Twenty-three papers were accepted. Our goal for the workshop is to have open technical discussions, not presentations of specific papers. The topics of discussion will be motivated largely by the topics covered in the accepted position statements.
