Abstract

Protein sequencing has rapidly changed the landscape of healthcare and life science by accelerating the growth of diagnostics and personalized medicine for a variety of fatal diseases. Next-generation nanopore/nanoslit sequencing promises single-molecule resolution with read lengths on the scale of whole chromosomes. However, due to its inherent complexity, high-throughput sequencing of all 20 amino acids demands a different approach. To accelerate the detection of amino acids, we have developed a general machine learning (ML) method for quick and accurate prediction of the transmission function for amino acid sequencing. Among the ML models tested, the XGBoost regression model is the most effective for fast prediction of the transmission function, with a very low test root-mean-square error (RMSE ∼0.05). In addition, using a random forest classifier, we are able to classify the neutral amino acids with a prediction accuracy of 100%. Our approach is thus an initial step toward ML-based prediction of the transmission function and can provide a platform for quick, high-accuracy identification of amino acids.
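For readers unfamiliar with the error metric quoted above, the following is a minimal sketch of how a test RMSE is computed between reference and predicted transmission values. The function name `rmse` and the numbers are hypothetical and chosen only for illustration; they are not data or code from the study.

```python
import math


def rmse(reference, predicted):
    """Root-mean-square error between two equal-length sequences of numbers."""
    if len(reference) != len(predicted):
        raise ValueError("sequences must have the same length")
    if not reference:
        raise ValueError("sequences must be non-empty")
    # Average the squared differences, then take the square root.
    squared_errors = [(r - p) ** 2 for r, p in zip(reference, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical transmission values on a held-out test set (illustrative only).
reference_T = [0.2, 0.5, 0.9, 0.4]
predicted_T = [0.3, 0.4, 1.0, 0.3]

test_rmse = rmse(reference_T, predicted_T)
print(f"test RMSE = {test_rmse:.3f}")
```

Because RMSE is expressed in the same units as the predicted quantity, a lower value means the model's predicted transmission function lies closer to the reference values on average.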
