Abstract

The problem of missing values has long been studied by researchers in data science and bioinformatics, especially in the analysis of gene expression data that facilitates early detection of cancer. Many studies report improvements from excluding samples with missing information from the analysis, while others fill the gaps with plausible values. The former is simple, but the latter guards against information loss. For imputation, the neighbour-based (KNN) approach has proven more effective than global estimators. This paper extends that approach by introducing a new summarization method to the KNN model. It is the first study to apply the ordered weighted averaging (OWA) operator in this problem context. In particular, two variations of OWA aggregation are proposed and evaluated against their baseline and other neighbour-based models. Using missing-value ratios from 1% to 20% and a set of six published gene expression datasets, the experimental results suggest that the new methods usually provide more accurate estimates than the compared methods. At missing rates of 5% and 20%, the best NRMSE scores, averaged across datasets, are 0.65 and 0.69, while the highest measures obtained by the existing techniques included in this study are 0.80 and 0.84, respectively.

Highlights

  • DNA microarray technology [1] is used to monitor expression data under a variety of conditions

  • To improve the quality of imputation, the work presented in this paper proposes an organic combination of CKNNimpute with the argument-dependent ordered weighted averaging (OWA) operator [33,34], which has not so far been investigated in the literature

  • The general process of OWA consists of three steps: (i) the input values are rearranged in descending order, (ii) the weights of these inputs are determined using a preferred method, and (iii) the rearranged input values are combined into a single value using the derived weights
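
The three OWA steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (owa, uniform_weights, centered_weights) are hypothetical, and centered_weights is only one possible argument-dependent scheme, giving larger weights to values that lie closer to the mean of the inputs. The weighting actually used with CKNNimpute may differ.

```python
from typing import Callable, List, Sequence

WeightFn = Callable[[Sequence[float]], List[float]]


def uniform_weights(ordered: Sequence[float]) -> List[float]:
    """Equal weights; with these, OWA reduces to the arithmetic mean."""
    n = len(ordered)
    return [1.0 / n] * n


def centered_weights(ordered: Sequence[float]) -> List[float]:
    """Hypothetical argument-dependent weights (an assumption for this sketch).

    Each value is scored by how close it lies to the mean of all inputs,
    and the scores are normalised so that they sum to 1.
    """
    n = len(ordered)
    mean = sum(ordered) / n
    total_dev = sum(abs(v - mean) for v in ordered)
    if total_dev == 0:
        # All inputs are identical (or there is only one): fall back to uniform.
        return uniform_weights(ordered)
    scores = [1.0 - abs(v - mean) / total_dev for v in ordered]
    score_sum = sum(scores)
    return [s / score_sum for s in scores]


def owa(values: Sequence[float], weight_fn: WeightFn = uniform_weights) -> float:
    """Aggregate values with an ordered weighted average."""
    if not values:
        raise ValueError("owa() needs at least one value")
    # Step (i): rearrange the inputs in descending order.
    ordered = sorted(values, reverse=True)
    # Step (ii): derive the weights with the chosen method.
    weights = weight_fn(ordered)
    # Step (iii): combine the ordered inputs using the derived weights.
    return sum(w * v for w, v in zip(weights, ordered))


if __name__ == "__main__":
    # Hypothetical neighbour values for one missing entry, including an outlier.
    neighbours = [1.0, 2.0, 3.0, 10.0]
    print(owa(neighbours))                    # plain average
    print(owa(neighbours, centered_weights))  # outlier is down-weighted
```

With uniform weights the result is the ordinary average of the neighbour values. The centered scheme pulls the estimate towards the bulk of the inputs, which is the kind of behaviour an argument-dependent aggregation aims for when one neighbour is atypical.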


Summary

Introduction

DNA microarray technology [1] is used to monitor expression data under a variety of conditions. A number of alternatives have been proposed under the umbrella of ‘aggregation operator’, which combines multiple sources of information into a global outcome [25]. For this purpose, Yager’s ordered weighted averaging (OWA) operators [26] have proven useful in many problem domains such as data mining, decision making, artificial neural networks, approximate reasoning and fuzzy systems [27]. The new models are evaluated on several published gene expression data sets, in comparison with basic statistical models, the conventional KNNimpute and its weighted variation. The behaviour of these models is assessed at different levels of missing values, with the results providing a guideline for their practical use.

Proposed Method
Acquisition of Gene Clusters
Cluster-Directed Selection of Nearest Neighbours
Application of Argument-Dependent OWA Operator
Performance Evaluation
Experimental Design
Experimental Results
Conclusion
