Abstract

High-dimensional data sets pose serious challenges. With the arrival of new technologies, high-throughput modeling is becoming the norm in many disciplines, such as statistical genetics, epidemiology, astronomy, high energy physics, and ecology. High-dimensional data have emerged from sources as varied as digital images, documents, next-generation sequencing, mass spectrometry, metabolomics, microarrays, proteomics, online videos, and web pages. One area with a growing need for new statistical methods and theory for high-dimensional data is the classification of subgroups. For example, cancer classification has primarily been based on the histopathological appearance of the tumor. However, patients with similar tumor appearance can have different prognoses and responses to treatment. Classifying cancer by pathological review alone may therefore give biased results and misclassify patients' tumor subtypes. Microarray data allow the simultaneous measurement of thousands of genes. These high-dimensional data have become a standard tool for biomedical studies and are now commonly collected from patients in clinical trials. Identifying informative genes may yield potential molecular markers for tumor class prediction, and correct classification can help practitioners identify the right treatment for patients. Because of the cost and/or experimental difficulty of obtaining sufficient biological material, it is common to see studies in which the sample size is much smaller than the number of dimensions. These are referred to as "large p small n" problems, where p is the number of dimensions (for example, genes) and n is the sample size. High-dimensional data pose challenges to traditional statistical methods. For instance, because n is small, the standard estimates of parameters such as means and variances carry increased uncertainty. As a consequence, statistical analyses based on such parameter estimates are usually unreliable. To improve parameter estimation, researchers have developed innovative approaches to this problem.

Highlights

  • There are serious challenges posed by high-dimensional data sets

  • Pang [7] plugged the shrinkage estimates of variances in Tong [9] into the diagonal discriminant scores, forming two shrinkage-based rules: Shrinkage-based Diagonal Quadratic Discriminant Analysis (SDQDA) and Shrinkage-based Diagonal Linear Discriminant Analysis (SDLDA)

  • The assumptions made in the diagonal discriminant analysis and its variations may not be realistic


Summary

Introduction

High-dimensional data sets pose serious challenges. With the arrival of new technologies, high-throughput modeling is becoming the norm in many disciplines, such as statistical genetics, epidemiology, astronomy, high energy physics, and ecology. One area with a growing need for new statistical methods and theory for high-dimensional data is the classification of subgroups. Microarray data allow the simultaneous measurement of thousands of genes. These high-dimensional data have become a standard tool for biomedical studies and are commonly collected from patients in clinical trials. High-dimensional data pose challenges to traditional statistical methods. A common approach to classifying high-dimensional data is discriminant analysis. Note that the sample covariance matrices are singular when the number of dimensions p is larger than the sample size n. Traditional methods such as Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA), which require inverting these matrices, are therefore not directly applicable to high-dimensional data classification.
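The singularity of the sample covariance matrix when p exceeds n can be checked numerically. The following is a minimal sketch rather than anything taken from the paper: it assumes NumPy, simulated standard normal data, and a hypothetical helper name `sample_covariance_rank`, with the sizes (n = 10 or 200, p = 50) chosen only for illustration.

```python
import numpy as np


def sample_covariance_rank(n, p, seed=0):
    """Rank of the p x p sample covariance of n simulated N(0, I_p) draws.

    Hypothetical helper for illustration; the data are simulated, not real.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))      # rows are samples, columns are features
    S = np.cov(X, rowvar=False)          # p x p sample covariance matrix
    return int(np.linalg.matrix_rank(S))


p = 50
rank_small_n = sample_covariance_rank(n=10, p=p)    # "large p small n" setting
rank_large_n = sample_covariance_rank(n=200, p=p)   # classical n > p setting

# Centering the data removes one degree of freedom, so the rank is at most n - 1.
print("n = 10: ", rank_small_n, "of", p)
print("n = 200:", rank_large_n, "of", p)
```

When n = 10 the rank cannot exceed n - 1 = 9, so the 50 x 50 matrix has no inverse and the LDA and QDA scores, which depend on that inverse, cannot be computed as they stand.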

Recent Advances
Discussion
