Abstract
Discriminant analysis has five problems. The first (Problem1) is the defect of the number of misclassifications (NM), which causes many difficulties. We were the first to find the relation between NM and the coefficients of linear discriminant functions (LDFs), and we proposed the minimum NM (MNM) in place of NM. MNM is an important statistic in discriminant analysis. Problem2 is that there had been no study of the discrimination of linearly separable data (LSD). Although Vapnik proposed the hard-margin Support Vector Machine (H-SVM), which defines LSD clearly, nobody had investigated LSD-discrimination. Let MNMk be the MNM of a k-variable model and MNM(k+1) the MNM of the (k+1)-variable model obtained by adding one variable to it. Then MNMk is greater than or equal to MNM(k+1), because the k-variable model is a subset of the (k+1)-variable model (the MNM monotonic decrease). Consequently, if MNMk = 0, every model that includes those k variables is also LSD. In cancer gene analysis, we call an LSD a Matryoshka. If the full model with p variables is a big Matryoshka, it contains many smaller Matryoshkas nested within it because of the MNM monotonic decrease. We showed that the error rates of statistical discriminant functions based on variance-covariance matrices are very high for LSD. In particular, these functions could not correctly discriminate pass/fail decisions from exam scores. If there are two testlets, T1 and T2, and the passing point is 50 points, then f = T1 + T2 - 50 is an obvious LDF and its MNM is 0. However, the error rate of Fisher's LDF exceeds 30%, because the exam data do not satisfy Fisher's assumption and many successful candidates near the discriminant hyperplane are misclassified. Problem3 is the defect of the generalized inverse matrix technique. Problem4 is that discriminant analysis is not a traditional inferential statistical method, because there are no standard errors for the error rates and discriminant coefficients.
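The two-testlet example can be checked numerically. The sketch below is an illustration under stated assumptions, not the authors' data or code: scores are simulated uniformly on 0 to 50 for each testlet, a candidate passes when T1 + T2 >= 50, and a discriminant score of exactly zero is counted as a pass. The helper names `count_nm` and `fisher_ldf` are hypothetical. It confirms that f = T1 + T2 - 50 has NM = 0 on such data and computes the NM of Fisher's LDF on the same cases for comparison. Simulated scores are not expected to reproduce the error rate reported for real exam data.

```python
import random


def count_nm(coeffs, intercept, rows, labels):
    """Number of misclassifications (NM) of f(x) = coeffs . x + intercept."""
    nm = 0
    for x, y in zip(rows, labels):
        score = sum(c * v for c, v in zip(coeffs, x)) + intercept
        predicted = 1 if score >= 0 else -1  # assumption: f = 0 counts as a pass
        if predicted != y:
            nm += 1
    return nm


def fisher_ldf(rows, labels):
    """Fisher's LDF for two variables, using the pooled covariance matrix."""
    pass_rows = [x for x, y in zip(rows, labels) if y == 1]
    fail_rows = [x for x, y in zip(rows, labels) if y == -1]

    def mean(group):
        return [sum(x[j] for x in group) / len(group) for j in range(2)]

    m_pass, m_fail = mean(pass_rows), mean(fail_rows)

    # Pooled within-class covariance matrix (2 x 2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for group, m in ((pass_rows, m_pass), (fail_rows, m_fail)):
        for x in group:
            d = [x[0] - m[0], x[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(rows) - 2
    s = [[v / dof for v in row] for row in s]

    # w = S^{-1} (m_pass - m_fail), with the 2 x 2 inverse written out.
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    diff = [m_pass[0] - m_fail[0], m_pass[1] - m_fail[1]]
    w = [(s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
         (-s[1][0] * diff[0] + s[0][0] * diff[1]) / det]

    # Equal priors: the hyperplane passes through the midpoint of the means.
    mid = [(m_pass[0] + m_fail[0]) / 2, (m_pass[1] + m_fail[1]) / 2]
    intercept = -(w[0] * mid[0] + w[1] * mid[1])
    return w, intercept


# Simulated exam data (an assumption, not the data used in the paper).
rng = random.Random(0)
rows = [(rng.randint(0, 50), rng.randint(0, 50)) for _ in range(200)]
labels = [1 if t1 + t2 >= 50 else -1 for t1, t2 in rows]

nm_obvious = count_nm([1, 1], -50, rows, labels)
w, b = fisher_ldf(rows, labels)
nm_fisher = count_nm(w, b, rows, labels)

print("NM of f = T1 + T2 - 50:", nm_obvious)
print("NM of Fisher's LDF:    ", nm_fisher)
```

Because the labels are generated by the pass rule itself, the obvious LDF separates the two classes exactly, while Fisher's LDF places its hyperplane from the class means and pooled covariance and is not guaranteed to coincide with the pass line.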
We developed four optimal LDFs (OLDFs) and the 100-fold cross-validation for small samples (Method1). We proposed a simple and powerful model selection method that chooses, as the best model, the one with the minimum mean error rate over the validation samples among all possible models (M2). Using six different types of ordinary data, we showed that the M2 of Revised IP-OLDF (RIP), which is based on the MNM criterion, was better than that of other LDFs. On October 28, 2015, we obtained six microarrays and realized we had forgotten Problem5. Many statisticians had not succeeded in cancer gene analysis using microarrays. They pointed out three difficulties, or excuses: 1) small n, large p data; 2) NP-hardness; 3) the difficulty of separating signal from noise. However, no study had found that microarrays are LSD. We developed the Matryoshka Feature Selection Method (Method2), which decomposes a microarray into many small Matryoshkas (SMs) and a noise subspace. That is, Method2 easily resolves the three difficulties. Because the MNM of every SM is zero and each SM is a small sample, we can analyze the SMs very easily with standard statistical methods. However, only logistic regression discriminates all SMs as accurately as RIP and H-SVM; other standard statistical methods cannot accurately separate the two classes, such as cancer and healthy subjects. After many trials, we created new data from the RIP discriminant scores of all SMs instead of the genes. With this breakthrough, we can propose an oncogene diagnosis. In this paper, we explain why Problem5 has remained unsolved since 1970.
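The model selection rule described above (choose, among all possible models, the one with the minimum mean error rate over the validation samples) can be sketched as follows. The abstract does not spell out how the 100 training and validation samples are built, so this sketch makes its own assumptions: 100 training sets are drawn by resampling within each class, and each fitted model is validated on the full original sample. A nearest-centroid rule stands in for RIP, which requires an integer programming solver. The data are simulated, and every function name here is hypothetical.

```python
import itertools
import random


def fit_centroids(rows, labels, features):
    """Stand-in classifier (NOT RIP): class means on the chosen features."""
    centroids = {}
    for cls in (1, -1):
        group = [x for x, y in zip(rows, labels) if y == cls]
        centroids[cls] = [sum(x[j] for x in group) / len(group) for j in features]
    return centroids


def error_rate(centroids, rows, labels, features):
    """Share of cases assigned to the wrong class by the nearest centroid."""
    errors = 0
    for x, y in zip(rows, labels):
        dist = {cls: sum((x[j] - c) ** 2 for j, c in zip(features, centre))
                for cls, centre in centroids.items()}
        if min(dist, key=dist.get) != y:
            errors += 1
    return errors / len(rows)


def select_best_model(rows, labels, n_resamples=100, seed=0):
    """Return the feature subset with the minimum mean validation error rate."""
    rng = random.Random(seed)
    by_class = {cls: [i for i, y in enumerate(labels) if y == cls]
                for cls in (1, -1)}
    # Assumption: stratified resampling, so both classes appear in every
    # training set. The same resamples are reused for every candidate model.
    resamples = [[rng.choice(members)
                  for members in by_class.values() for _ in members]
                 for _ in range(n_resamples)]

    p = len(rows[0])
    mean_errors = {}
    for k in range(1, p + 1):
        for features in itertools.combinations(range(p), k):
            rates = []
            for idx in resamples:
                train_rows = [rows[i] for i in idx]
                train_labels = [labels[i] for i in idx]
                model = fit_centroids(train_rows, train_labels, features)
                # Assumption: validate on the full original sample.
                rates.append(error_rate(model, rows, labels, features))
            mean_errors[features] = sum(rates) / len(rates)

    best = min(mean_errors, key=mean_errors.get)
    return best, mean_errors


# Simulated data: variables 0 and 1 carry signal, variable 2 is pure noise.
rng = random.Random(1)
rows, labels = [], []
for i in range(40):
    y = 1 if i % 2 == 0 else -1
    rows.append((rng.gauss(2.0 * y, 1.0), rng.gauss(1.0 * y, 1.0),
                 rng.gauss(0.0, 1.0)))
    labels.append(y)

best, mean_errors = select_best_model(rows, labels)
print("best feature subset:", best)
print("mean validation error rate:", mean_errors[best])
```

Enumerating all 2^p - 1 subsets is feasible only for small p, which is why such a search suits the small Matryoshkas rather than a full microarray.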