Analyzing complex data sets to infer or discover meaningful knowledge involves, after data collection and cleaning: (i) modeling the data, that is, deriving a suitable representation of the data for analysis; (ii) translating analysis objectives into computations on the generated model instance, which can be as simple as a query or as complex as community detection over multiple layers; (iii) computing the generated expressions with attention to efficiency and scalability; and (iv) drilling down into the results to understand them clearly. Beyond this, it is also useful to visualize results for easier understanding. The Covid-19 visualization dashboard presented in this paper is an example of this.

This paper covers the above steps of the data analysis life cycle using a representation (or model) that is gaining importance. For complex data sets containing multiple entity types and relationships, choosing an appropriate model to represent the data is important. For these data sets, we first establish the advantages of Multilayer Networks (MLNs) as a data model. Then we use an entity-relationship based approach to convert the data set into MLNs for a precise representation of the data set. After that, we outline how expected analysis objectives can be translated, using keyword mapping, into aggregate analysis expressions. Finally, we demonstrate, through a set of example data sets and objectives, how the expressions corresponding to objectives are evaluated using an efficient decoupling-based approach. Results are further drilled down to obtain actionable knowledge from the data set.

Using the widely popular Enhanced Entity Relationship (EER) approach for requirements representation, we demonstrate how to generate EER diagrams for data sets and then generate, algorithmically, MLNs for analysis and a relational schema for drill-down.
Using communities and centrality for aggregate analysis, we demonstrate the flexibility of the chosen model to support a diverse set of objectives. We also show that, compared to current analysis approaches, a “decoupling-based” approach using MLNs is more appropriate because it preserves the structure as well as the semantics of the results and is very efficient. For this computation, we need to derive an expression for each analysis objective using the MLN model. We provide guidelines for translating English queries into analysis expressions based on keywords.

Finally, we use several data sets to establish the effectiveness of modeling using MLNs and of their analysis using the recently proposed decoupling approach. For coverage, we use different types of MLNs for modeling, and community and centrality computations for analysis. The data sets used are from US commercial airlines, IMDb (a large international movie data set), the familiar DBLP bibliography database, and the Covid-19 data set. Our experimental analyses using the identified steps validate the modeling, the breadth of objectives that can be computed, and the overall versatility of the life cycle approach. Correctness of results is verified, where possible, using independently available ground truth. Furthermore, we demonstrate the drill-down afforded by this approach (due to structure and semantics preservation) for a better understanding and visualization of results.
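
To make the decoupling idea concrete, the following is a minimal sketch under stated assumptions, not the algorithm evaluated in the paper. It assumes a small MLN whose layers share one node set, uses connected components as a simple stand-in for community detection, and assumes set intersection as the composition step. The layer names, node labels, and the functions `connected_components` and `compose_and` are hypothetical and introduced only for illustration.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Return the connected components of an undirected layer as frozensets.

    Stand-in for a real community detection algorithm (assumption).
    """
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    seen, components = set(), []
    for start in sorted(nodes):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(frozenset(component))
    return components


def compose_and(groups_a, groups_b, min_size=2):
    """Combine per-layer groups: keep node sets grouped together in both layers.

    Intersection is an assumed composition function for this sketch.
    """
    combined = []
    for a in groups_a:
        for b in groups_b:
            common = a & b
            if len(common) >= min_size:
                combined.append(common)
    return combined


# Hypothetical two-layer network over the same five nodes.
nodes = {"A", "B", "C", "D", "E"}
layers = {
    "layer_1": {("A", "B"), ("B", "C"), ("D", "E")},
    "layer_2": {("A", "B"), ("C", "D"), ("D", "E")},
}

# Step 1: analyze each layer independently (the "decoupled" part).
per_layer = {name: connected_components(nodes, edges)
             for name, edges in layers.items()}

# Step 2: compose the per-layer partial results.
result = compose_and(per_layer["layer_1"], per_layer["layer_2"])
```

Because each layer is analyzed on its own and only the partial results are combined, every group in `result` can be traced back to the layers and nodes that produced it, which is the property that supports drill-down.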