Abstract
Recent studies have demonstrated that microarray expression data can be used to identify gene regulatory pathways. However, a major challenge is building an efficient computational model from large microarray datasets (genes and microRNAs). There is therefore an urgent need to reduce the dimensionality of these large datasets using machine learning methods without compromising accuracy, which requires an algorithm that can select the significant features. In this study, we use a supervised method based on a Random Forest to identify significant features from three microarray datasets, from prenatal nicotine, alcohol, and combined nicotine and alcohol exposure groups, in two different cell types (dopamine and non-dopamine neurons). Our approach reduced the dimensionality of extremely large microarray datasets in a computationally efficient manner. Furthermore, our results indicated that using only the top 20% of features was sufficient to confirm the genetic pathways previously identified when using all of the features in the model.
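The feature-selection idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes scikit-learn's RandomForestClassifier, impurity-based importances as the ranking criterion, a two-class label, and a synthetic matrix standing in for the expression data. The function name top_fraction_features and all parameter values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def top_fraction_features(X, y, fraction=0.2, n_trees=500, seed=0):
    """Rank features by Random Forest importance and keep the top fraction."""
    forest = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
    forest.fit(X, y)
    n_keep = max(1, int(round(fraction * X.shape[1])))
    # argsort is ascending, so reverse it to put the most important first
    ranked = np.argsort(forest.feature_importances_)[::-1]
    return ranked[:n_keep]


# Synthetic stand-in for an expression matrix (assumption, not real data):
# 60 samples x 200 features, where only features 0-4 carry the class signal.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 30)
X = rng.normal(size=(60, 200))
X[:, :5] += 3.0 * y[:, None]

selected = top_fraction_features(X, y)   # column indices of the top 20%
X_reduced = X[:, selected]               # reduced matrix for downstream models
```

In this sketch, the reduced matrix keeps 40 of the 200 columns. Whether the same pathways are recovered from the reduced set would have to be checked by the downstream analysis, as the study did against the full feature set.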
Highlights
Microarrays enable the global screening of gene expression profiles by quantifying the changes in the regulation of thousands of genes [1]
Animal experiments: the microarray data were collected from dopaminergic and non-dopaminergic neurons obtained from the rat ventral tegmental area (VTA)
All experiments were performed in accordance with the protocols approved by the Institutional Animal Care and Use Committee (IACUC) and the University of Houston Animal Care Operations (ACO)
Summary
Microarrays enable the global screening of gene expression profiles by quantifying the changes in the regulation of thousands of genes [1]. Microarrays have been adopted to identify gene regulation pathways [2] using supervised or unsupervised machine learning methods. However, the large number of features limits model reliability and, in many cases, may cause overfitting [3]. To improve the efficiency of gene regulatory network modelling, the dimensionality of the feature set, which includes messenger RNAs (mRNAs, genes) and microRNAs (miRNAs), needs to be reduced [4]. Two kinds of approaches, unsupervised and supervised, can reduce the dimensionality of complex datasets. In unsupervised learning, large datasets with many features negatively affect the computational performance of the underlying learning algorithm. The Hill Climb (HC) unsupervised learning algorithm for dimensionality reduction has been widely used in practice to improve computational efficiency [5].