Abstract

Data-intensive science is a reality in large scientific organizations such as the Max Planck Society, but because our data practices are inefficient when it comes to integrating data from different sources, many projects cannot be carried out and many researchers are excluded. Since, according to surveys, about 80% of the time in data-intensive projects is wasted, we must conclude that we are not fit for the challenges that will come with billions of smart devices producing continuous streams of data: our methods do not scale. Therefore, experts worldwide are looking for strategies and methods that have potential for the future. The first steps have been made: there is now wide agreement, from the Research Data Alliance to the FAIR principles, that data should be associated with persistent identifiers (PIDs) and metadata (MD). In fact, after 20 years of experience we can claim that trustworthy PID systems are already in broad use. It is argued, however, that assigning PIDs is just the first step. If we agree to assign PIDs and also use the PID to store important relationships, such as pointers to the locations where the bit sequences or different metadata can be accessed, we are close to defining Digital Objects (DOs), which could indeed point to a solution for some of the basic problems in data management and processing. In addition to standardizing the way we assign PIDs, metadata and other state information, we could also define a Digital Object Access Protocol as a universal exchange protocol for DOs stored in repositories that use different data models and data organizations. We could also associate with each DO a type and a set of operations allowed on its content, which would pave the way to automatic processing, identified as the major step towards scalability in data science and data industry. A globally connected group of experts is now working on establishing testbeds for a DO-based data infrastructure.
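To make the Digital Object idea more concrete, the following is a minimal sketch in Python, not an implementation of any existing PID system or of the Digital Object Access Protocol. It assumes a PID record that stores pointers to the bit sequence and to metadata plus a type, and a registry that lists the operations allowed for each type. All names here (PIDRecord, TypeRegistry, the field names, the example PID and the example.org URLs) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PIDRecord:
    """State information stored with a persistent identifier (hypothetical layout)."""
    pid: str
    data_locations: List[str]      # where the bit sequence can be accessed
    metadata_locations: List[str]  # where metadata descriptions can be accessed
    do_type: str                   # type used to look up the allowed operations


@dataclass
class TypeRegistry:
    """Maps a DO type to the named operations allowed on its content."""
    operations: Dict[str, Dict[str, Callable[[bytes], object]]] = field(default_factory=dict)

    def register(self, do_type: str, name: str, func: Callable[[bytes], object]) -> None:
        self.operations.setdefault(do_type, {})[name] = func

    def allowed(self, do_type: str) -> List[str]:
        return sorted(self.operations.get(do_type, {}))

    def apply(self, record: PIDRecord, name: str, content: bytes) -> object:
        # Only operations registered for the record's type may run on its content.
        ops = self.operations.get(record.do_type, {})
        if name not in ops:
            raise PermissionError(f"{name!r} is not allowed for type {record.do_type!r}")
        return ops[name](content)


# Hypothetical example values; no real PID or repository is referenced.
registry = TypeRegistry()
registry.register("text/plain", "byte_length", len)
registry.register("text/plain", "decode", lambda b: b.decode("utf-8"))

record = PIDRecord(
    pid="example-prefix/0001",
    data_locations=["https://repo-a.example.org/objects/0001"],
    metadata_locations=["https://repo-a.example.org/metadata/0001"],
    do_type="text/plain",
)

size = registry.apply(record, "byte_length", b"hello")
```

In this sketch an automatic workflow would only need the PID: resolving it yields the record, the record yields the locations and the type, and the type determines which operations may be applied without human intervention.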

Highlights

  • In large research organizations such as the Max Planck Society (MPS), data-driven science has been practised for many years

  • Data-intensive science is a reality in large scientific organizations such as the Max Planck Society, but because our data practices are inefficient when it comes to integrating data from different sources, many projects cannot be carried out and many researchers are excluded

  • If we agree to assign persistent identifiers (PIDs) and use the PID to store important relationships, such as pointers to the locations where the bit sequences or different metadata can be accessed, we are close to defining Digital Objects (DOs), which could point to a solution for some of the basic problems in data management and processing

Summary

DATA INTENSIVE SCIENCE IS REALITY

In large research organizations such as the Max Planck Society (MPS), data-driven science has been practised for many years. In the neurosciences, large initiatives such as the Human Brain Project (Figure 3) are working on methods for relating the phenomena of brain diseases to patterns found in different types of data sets, such as those from gene sequencing and brain imaging. At a time when an increasing number of people are suffering from brain diseases, a deeper understanding of their causes and of the possibilities for early detection is urgently needed. With respect to these and other examples we can make a number of observations:

  • Data from different labs are needed to fit all free parameters of the underlying models

  • An efficient data infrastructure is required to aggregate, manage, and process all data

  • The efforts for data-intensive science are huge, and only advanced labs with sufficient resources can currently carry out such work

DATA REALITY AND TRENDS
PERSISTENT IDENTIFIERS AND METADATA AS BASIC STEPS
DEPENDENCE ON PIDS
DIGITAL OBJECTS
DO-BASED APPROACHES
Findings
CONCLUSIONS