What is a dataset?

Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V

In multi-user environments in which data science and analysis is collaborative, multiple versions of the same datasets are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this work, we introduce Explain-Da-V, a framework aiming to explain changes between two given dataset versions. Explain-Da-V generates explanations that use data transformations to explain changes. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. We empirically show, using an adapted existing benchmark and a newly created benchmark, that Explain-Da-V generates better explanations than existing data transformation synthesis methods.
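To make the idea of explaining a version change with a data transformation concrete, here is a minimal sketch. It is not Explain-Da-V's algorithm: the candidate transformations, the column names (`price`, `quantity`, `total`), and the positional row alignment are all assumptions made for illustration, and the `validity` score below is only a simple stand-in for the validity measure the abstract mentions.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, float]

# Hypothetical candidate transformations (illustrative, not from the paper).
CANDIDATES: Dict[str, Callable[[Row], float]] = {
    "price * quantity": lambda r: r["price"] * r["quantity"],
    "price + quantity": lambda r: r["price"] + r["quantity"],
    "price * 2": lambda r: r["price"] * 2,
}


def validity(transform: Callable[[Row], float],
             old_rows: List[Row], new_rows: List[Row], target: str) -> float:
    """Fraction of rows whose new target value the transformation reproduces.

    Assumes rows of the two versions are aligned by position.
    """
    if not old_rows:
        return 0.0
    hits = sum(
        1 for old, new in zip(old_rows, new_rows)
        if abs(transform(old) - new[target]) < 1e-9
    )
    return hits / len(old_rows)


def explain_new_column(old_rows: List[Row], new_rows: List[Row],
                       target: str) -> Tuple[Optional[str], float]:
    """Return the candidate that best reproduces the new column, with its score."""
    best_name, best_score = None, 0.0
    for name, fn in CANDIDATES.items():
        score = validity(fn, old_rows, new_rows, target)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Toy data: the new version adds a "total" column to the old one.
old_version = [
    {"price": 2.0, "quantity": 3.0},
    {"price": 1.5, "quantity": 4.0},
    {"price": 10.0, "quantity": 1.0},
]
new_version = [dict(row, total=row["price"] * row["quantity"]) for row in old_version]

explanation, score = explain_new_column(old_version, new_version, "total")
```

In this toy setup the added column is explained by the first candidate for every row; a real system would have to search a far larger space of transformations and handle rows that were added, removed, or reordered.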

Proceedings of the VLDB Endowment
Citations: 4
Feb 1, 2023
Roee Shraga + 1

Smart Data Collection System for Brownfield CNC Milling Machines: A New Benchmark Dataset for Data-Driven Machine Monitoring

Manufacturing processes have undergone tremendous technological progress in recent decades. To meet the agile philosophy in industry, data-driven algorithms need to handle growing complexity, particularly in Computer Numerical Control (CNC) machining. To enhance the scalability of machine learning in real-world applications, this paper presents a benchmark dataset for process monitoring of brownfield milling machines based on acceleration data. The data were collected from a real-world production plant using a smart data collection system over a two-year period. In this work, the edge-to-cloud setup is presented, followed by an extensive description of the different normal and abnormal processes. An analysis of the dataset highlights the challenges that environmental and industrial factors pose for machine learning in industry. The new dataset is published with this paper and available at: https://github.com/boschresearch/CNC_Machining.

Procedia CIRP
Citations: 21
Jan 1, 2022
Mohamed-Ali Tnani + 2

Recursion in RDF Data Shape Languages

An RDF data shape is a description of the expected contents of an RDF document (aka graph) or dataset. A major part of this description is the set of constraints that the document or dataset is required to satisfy. W3C recently (2014) chartered the RDF Data Shapes Working Group to define SHACL, a standard RDF data shape language. We refer to the ability to name and reference shape language elements as recursion. This article provides a precise definition of the meaning of recursion as used in Resource Shape 2.0. The definition of recursion presented in this article is largely independent of language-specific details. We speculate that it also applies to ShEx and to all three of the current proposals for SHACL. In particular, recursion is not permitted in the SHACL-SPARQL proposal, but we conjecture that recursion could be added by using the definition proposed here as a top-level control structure.

May 19, 2015
Arthur Ryman

Collection and Validation of Psychophysiological Data from Professional and Amateur Players: a Multimodal eSports Dataset

Proper training and analytics in eSports require accurately collected and annotated data. Most eSports research focuses exclusively on in-game data analysis, and there is a lack of prior work involving eSports athletes' psychophysiological data. In this paper, we present a dataset collected from professional and amateur teams in 22 matches of the League of Legends video game, with more than 40 hours of recordings. Recorded data include the players' physiological activity (e.g. movements, pulse, saccades) obtained from various sensors, a self-reported post-match survey, and in-game data. An important feature of the dataset is simultaneous data collection from five players, which facilitates the analysis of sensor data at the team level. After collecting the dataset, we validated it. In particular, we demonstrate that stress and concentration levels for professional players are less correlated, which suggests a more independent playstyle. We also show that the absence of team communication does not affect professional players as much as amateur ones. To investigate other possible use cases of the dataset, we trained classical machine learning algorithms for skill prediction and player re-identification using 3-minute sessions of sensor data. The best models achieved accuracy scores of 0.856 for skill prediction and 0.521 for player re-identification (against a chance level of 0.10) on a validation set. The dataset is available at https://github.com/smerdov/eSports_Sensors_Dataset.

Citations: 3
Nov 2, 2020
Andrey Somov + 3

Dataset of mechanical properties and electrical conductivity of copper-based alloys

This article presents a collection of data on approximately 150 copper-based alloys. The data compilation is based on articles published since 1993 and consists of about 1830 records. Each record contains a unique set of descriptors, such as composition and processing route, and targets, including properties such as hardness, yield strength, ultimate tensile strength, and electrical conductivity. The dataset includes information on the composition in mass percent of 20 alloying elements, and hundreds of temperature-time thermal treatments and thermomechanical conditions. The database is continually updated and hosted on an open data repository. Some of the data are presented graphically in the article to aid interpretation. This study intends to promote the identification of more sustainable alternatives to Cu-Be alloys, which is particularly relevant in developing non-toxic and environmentally-friendly alloys.

Scientific Data
Citations: 6
Jul 29, 2023
Stéphane Gorsse + 3

A multi-dataset data-collection strategy produces better diffraction data.

A multi-dataset (MDS) data-collection strategy is proposed and analyzed for macromolecular crystal diffraction data acquisition. The theoretical analysis indicated that the MDS strategy can reduce the standard deviation (background noise) of diffraction data compared with the commonly used single-dataset strategy for a fixed X-ray dose. In order to validate the hypothesis experimentally, a data-quality evaluation process, termed a readiness test of the X-ray data-collection system, was developed. The anomalous signals of sulfur atoms in zinc-free insulin crystals were used as the probe to differentiate the quality of data collected using different data-collection strategies. The data-collection results using home-laboratory-based rotating-anode X-ray and synchrotron X-ray systems indicate that the diffraction data collected with the MDS strategy contain more accurate anomalous signals from sulfur atoms than the data collected with a regular data-collection strategy. In addition, the MDS strategy offered more advantages with respect to radiation-damage-sensitive crystals and better usage of rotating-anode as well as synchrotron X-rays.

Acta Crystallographica Section A Foundations of Crystallography
Citations: 25
Oct 18, 2011
Zhi Jie Liu + 7

Explaining Dataset Changes for Semantic Data Versioning with Explain-Da-V (Technical Report)

In multi-user environments in which data science and analysis is collaborative, multiple versions of the same datasets are generated. While managing and storing data versions has received some attention in the research literature, the semantic nature of such changes has remained under-explored. In this work, we introduce Explain-Da-V, a framework aiming to explain changes between two given dataset versions. Explain-Da-V generates explanations that use data transformations to explain changes. We further introduce a set of measures that evaluate the validity, generalizability, and explainability of these explanations. We empirically show, using an adapted existing benchmark and a newly created benchmark, that Explain-Da-V generates better explanations than existing data transformation synthesis methods.

Jan 30, 2023
Roee Shraga + 1

General Intelligent Dataset Description Method and Application

With the development of the Internet, the growth rate of big data has exceeded the ability of human beings to obtain information from data. In the near future, faced with a sea of data, seemingly logically correct algorithms may reach opposite conclusions, ushering in a “data disaster” era. Therefore, letting data “speak” and enabling data intelligence is an important problem that must be solved in the field of big data research. In this paper, we propose a “data intelligence” framework based on Brooks’s Subsumption Architecture for robotics. In this framework, we define a new dataset description method and encapsulate the dataset described by this method in a software container. Then, we set up the corresponding operations on the dataset container. Finally, we identify an interaction model for a “dataset agent” based on academic work on agents in the field of artificial intelligence. We also designed and implemented some functions of an intelligent document manager, which shows that our design ideas and methods are feasible.

Journal of Physics: Conference Series
Jul 1, 2023
Danchen Ma + 2

Dataset collection from a SubT environment

This article presents a dataset collected from a subterranean (SubT) environment with the current state-of-the-art sensors required for autonomous navigation. The dataset includes sensor measurements collected with RGB, RGB-D, event-based and thermal cameras, 2D and 3D lidars, an inertial measurement unit (IMU), and ultra-wideband (UWB) positioning systems, all mounted on a mobile robot. The overall sensor setup is referred to in the rest of the article as the data collection platform. The dataset contains synchronized raw measurements from all the sensors in the Robot Operating System (ROS) message format, as well as video feeds collected with action and 360 cameras. A detailed description of the sensors embedded in the data collection platform and of the data collection process is given. The collected dataset is intended for evaluating navigation, localization and mapping algorithms in SubT environments. This article is accompanied by the public release of all collected datasets from the SubT environment.

Robotics and Autonomous Systems
Citations: 10
Jun 14, 2022
Anton Koval + 7

Dataset Definition Standard (DDS)

This document gives a set of recommendations for building and manipulating the datasets used to develop and/or validate machine learning models such as deep neural networks. It is one of the three documents defined in [1] to ensure the quality of datasets. This is a work in progress, as good practices evolve along with our understanding of machine learning. The document is divided into three main parts. Section 2 addresses the data collection activity. Section 3 gives recommendations about the annotation process. Finally, Section 4 gives recommendations concerning the breakdown between train, validation, and test datasets. In each part, we first define the desired properties at stake, then explain the objectives targeted to meet those properties, and finally state the recommendations for reaching these objectives.
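As an illustration of the breakdown between train, validation, and test datasets mentioned above, the following is a minimal sketch of a reproducible random split. The abstract does not state any ratios or procedure, so the 70/15/15 proportions, the fixed seed, and the function name `split_dataset` are assumptions chosen for the example rather than recommendations from the standard.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T],
                  train_frac: float = 0.70,  # assumed ratio, not from the DDS
                  val_frac: float = 0.15,    # assumed ratio, not from the DDS
                  seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once with a fixed seed, then cut into three disjoint parts."""
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave a non-empty share for the test set")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]      # whatever remains goes to the test set
    return train, val, test


train, val, test = split_dataset(range(100))
```

Each sample lands in exactly one of the three parts. A purely random split like this one ignores concerns such as class balance or grouping of related samples, which a real data pipeline would likely need to address.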

Jan 7, 2021
Sylvaine Picard + 6
