Abstract
This paper introduces stringing via Manifold Learning (ML-stringing), an alternative to the original stringing based on Unidimensional Scaling (UDS). Our proposal is framed within a wider class of methods that map high-dimensional observations to the infinite space of functions, allowing the use of Functional Data Analysis (FDA). Stringing handles general high-dimensional data as scrambled realizations of an unknown stochastic process. Therefore, the essential feature of the method is a rearrangement of the observed values. Motivated by the linear nature of UDS and the increasing number of applications to biosciences (e.g., functional modeling of gene expression arrays and single nucleotide polymorphisms, or the classification of neuroimages) we aim to recover more complex relations between predictors through ML. In simulation studies, it is shown that ML-stringing achieves higher-quality orderings and that, in general, this leads to improvements in the functional representation and modeling of the data. The versatility of our method is also illustrated with an application to a colon cancer study that deals with high-dimensional gene expression arrays. This paper shows that ML-stringing is a feasible alternative to the UDS-based version. Also, it opens a window to new contributions to the field of FDA and the study of high-dimensional data.
Highlights
To study the benefits of using ML-stringing instead of the Unidimensional Scaling (UDS)-based version, we focus mainly on three aspects: (1) the visual representation of the stringed high-dimensional data achieved by the estimated functional predictors; (2) the interpretability of the estimated coefficient function; and (3) the accuracy of the predictions achieved by the SOF
We realized that stringing based on UDS rearranged data according to linear relationships between predictors
Motivated by these findings we introduced ML-stringing, a version of the method that takes into account a more complex structure of the data, like nonlinearities
Summary
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. A considerable literature has grown up around the topic of high-dimensional data. In this scenario, classical statistical tools are insufficient to study the data, as the number of features is generally higher than the sample size. Microarrays measure gene expressions and in most cases can contain up to 105 genes (features or predictors) for less than one hundred subjects (samples). It is common to deal with a huge difference between the sample size n and the number p of features (written as n p). If the data comes with an associated response (say a category indicating ill/healthy patient) tasks such as modeling become very difficult
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.