Abstract

Background: Although quality controls are applied at different stages of sample preparation and data analysis to ensure the reproducibility and reliability of RNA-seq results, limitations and bias remain in the detectability of certain differentially expressed genes (DEGs). Whether the transcriptional dynamics of a gene can be captured accurately depends on the experimental design and operation and on the subsequent data analysis. The steps of the data processing workflow, such as read alignment, transcript quantification, normalization, and the statistical methods used to identify DEGs, can influence the accuracy and sensitivity of DEG analysis, producing false positives or false negatives. Machine learning (ML) is a multidisciplinary field that draws on computer science, artificial intelligence, computational statistics and information theory to construct algorithms that learn from existing data sets and make predictions on new data sets. ML-based differential network analysis has been applied to predict stress-responsive genes by learning the patterns of 32 expression characteristics of known stress-related genes. In addition, epigenetic regulation plays critical roles in gene expression; DNA and histone methylation data have therefore been shown to be powerful inputs for ML-based models that predict gene expression in many systems, including lung cancer cells. It is therefore promising that ML-based methods could help to identify DEGs that are missed by traditional RNA-seq analysis.

Results: We identified the 23 most informative features by assessing the performance of three feature selection algorithms combined with five classification methods on training and testing data sets. Through comprehensive comparison, we found that the model based on InfoGain feature selection and Logistic Regression classification is powerful for DEG prediction. Moreover, the power and performance of the ML-based prediction were validated by predicting ethylene-regulated gene expression, followed by qRT-PCR.

Conclusions: Our study shows that combining an ML-based method with RNA-seq greatly improves the sensitivity of DEG identification.
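
As an illustration of what information-gain feature selection does, the following is a minimal sketch in plain Python. It is not the authors' implementation, and the source does not state which software was used. The feature names and values in `toy_table` are hypothetical, and features are assumed to be already discretized into categories.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def info_gain(feature_values, labels):
    """Information gain of one discretized feature with respect to the labels."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Entropy of the labels that remains after splitting on the feature.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(table, labels, top_k):
    """Return the names of the top_k features ordered by information gain."""
    scores = {name: info_gain(values, labels) for name, values in table.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical toy data: 1 = differentially expressed, 0 = not.
toy_labels = [1, 1, 0, 0]
toy_table = {
    "h3k9ac_promoter": ["high", "high", "low", "low"],  # separates the classes
    "uninformative": ["a", "b", "a", "b"],              # carries no class signal
}

best = rank_features(toy_table, toy_labels, top_k=1)
```

In this sketch a feature that splits the classes perfectly receives the full label entropy as its gain, while a feature that is independent of the labels receives zero. Selecting the top 23 features, as in the study, would correspond to calling `rank_features` with `top_k=23` on the full feature table.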

Highlights

  • Although different quality controls have been applied at different stages of sample preparation and data analysis to ensure both the reproducibility and reliability of RNA sequencing (RNA-seq) results, there are still limitations and bias in the detectability of certain differentially expressed genes (DEGs)

  • We determined that the model based on InfoGain feature selection and Logistic Regression classification is powerful and robust for the prediction of differentially expressed genes (DEGs)

  • Our study shows that combining a machine learning (ML)-based method with RNA-seq significantly improves the sensitivity of DEG identification


Summary

Results

Summary of input data and features: Previous studies revealed that H3K9Ac, H3K14Ac and H3K23Ac are involved in the regulation of gene expression in response to ethylene [8, 34].

Comparison of prediction using different models: We compared the performance of the three models defined as the most powerful, namely the models based on InfoGain feature selection and Logistic Regression classification, Classification Via Regression, and Random Subspace classification, for gene prediction (Additional file 1: Table S4), using the high or medium (top 60%) expressed genes, which include most ethylene-regulated genes (97.8%) [34]. All the genes predicted by the model based on InfoGain and Logistic Regression showed the same regulation by ethylene as in the RNA-seq results (referred to as true positive genes, TP; Fig. 3d), and 4 of them are known differentially. The predicted ethylene-induced alterations in gene expression in Col-0 were reduced or not detected in the ein mutant (Fig. 5d). Taken together, these results suggest that the prediction of changes in gene expression by our model based on InfoGain and Logistic Regression achieved an impressive level of accuracy. Our results obtained with this model demonstrate that the genomic locations relative to each transcript, including promoters, exons and gene bodies (Additional file 1: Table S5), can provide useful information for the prediction of gene expression.
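
To make the classification step concrete, the following is a minimal sketch of logistic regression trained by batch gradient descent in plain Python. It is an illustration under stated assumptions, not the model used in the study: the two input features (hypothetical normalized histone acetylation signals), the toy values, the learning rate and the number of epochs are all invented for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            # Prediction error for this gene under the current parameters.
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x, threshold=0.5):
    """Return 1 (predicted DEG) if the estimated probability reaches the threshold."""
    return int(sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= threshold)


# Hypothetical training data: two normalized feature values per gene.
X_train = [[0.9, 0.8], [0.8, 0.7], [0.2, 0.1], [0.1, 0.3]]
y_train = [1, 1, 0, 0]  # 1 = differentially expressed, 0 = not

w, b = fit_logistic(X_train, y_train)
train_predictions = [predict(w, b, x) for x in X_train]
```

In practice the inputs would be the features retained by the feature selection step, and the predictions would be checked against held-out genes and, as in the study, against independent measurements such as qRT-PCR.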
