Abstract

The rapid spread of diseases across the globe has become a major concern for healthcare agencies worldwide, and the research community is actively developing better and more efficient ways of detecting and treating them. The abundance of molecular sequence data has made it easier for researchers to develop Machine Learning (ML) based solutions. The performance of ML models that classify molecular sequences depends heavily on the embedding used to obtain a numerical representation of those sequences. Many embedding approaches for molecular sequence analysis have been introduced in recent years. However, there is still room for improvement in how effectively these methods capture pairwise relationships and patterns, which can affect classification performance. To address this problem, we propose an efficient kernel-based technique for generating embeddings from molecular sequences. The technique computes a kernel matrix from the normalized pairwise distances between k-mers, using the Sinkhorn-Knopp algorithm so that the matrix satisfies the constraints of a probability distribution. Kernel principal component analysis (kernel PCA) is then applied to this matrix, and the top principal components (PCs) are used as the final embedding. In our experiments, the proposed method obtained an ROC-AUC score of 0.657, which is higher than the scores obtained using the baselines. This suggests that the low-dimensional embedding produced by the proposed approach provides an efficient and effective solution for molecular sequence analysis.
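The pipeline described in the abstract (pairwise distances, Sinkhorn-Knopp normalization, kernel PCA) can be illustrated with a minimal sketch. This is not the authors' implementation, and the abstract does not specify the details, so several choices here are assumptions made purely for illustration: the k-mer count featurization, Euclidean distance scaled by its maximum, the Gaussian affinity with bandwidth `sigma`, and all function names (`kmer_spectrum`, `sinkhorn_knopp`, `sinkhorn_kernel_pca`).

```python
import itertools
import numpy as np


def kmer_spectrum(seq, k, alphabet="ACGT"):
    """Count vector over all possible k-mers (assumed featurization, not from the paper)."""
    index = {"".join(p): i for i, p in enumerate(itertools.product(alphabet, repeat=k))}
    counts = np.zeros(len(index))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers with characters outside the alphabet
            counts[j] += 1
    return counts


def sinkhorn_knopp(A, n_iter=1000, tol=1e-10):
    """Alternately rescale rows and columns of a strictly positive matrix
    until both sets of sums are (approximately) 1."""
    P = np.array(A, dtype=float)
    for _ in range(n_iter):
        P /= P.sum(axis=1, keepdims=True)  # make every row sum to 1
        P /= P.sum(axis=0, keepdims=True)  # make every column sum to 1
        if np.abs(P.sum(axis=1) - 1.0).max() < tol:
            break
    return P


def sinkhorn_kernel_pca(seqs, k=3, n_components=2, sigma=0.5):
    """Return (doubly stochastic kernel, low-dimensional embedding)."""
    X = np.stack([kmer_spectrum(s, k) for s in seqs])

    # Pairwise Euclidean distances between k-mer spectra, scaled to [0, 1].
    sq = (X ** 2).sum(axis=1)
    D = np.sqrt(np.clip(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0, None))
    if D.max() > 0:
        D = D / D.max()

    # Assumed Gaussian affinity; strictly positive, so Sinkhorn-Knopp converges.
    A = np.exp(-(D ** 2) / (2.0 * sigma ** 2))
    K = sinkhorn_knopp(A)
    K = 0.5 * (K + K.T)  # remove residual asymmetry from finite iterations

    # Kernel PCA on the precomputed kernel: double-center, then eigendecompose.
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    vals, vecs = np.linalg.eigh(H @ K @ H)
    order = np.argsort(vals)[::-1][:n_components]
    Z = vecs[:, order] * np.sqrt(np.clip(vals[order], 0.0, None))
    return K, Z
```

Calling `sinkhorn_kernel_pca(["ACGTACGTAC", "ACGTTCGAAC", "TTTTAACCGG"], k=3)` would return a 3 x 3 kernel whose rows and columns each sum to approximately 1, together with a 3 x 2 embedding that could be passed to any downstream classifier. The toy sequences and parameter values are placeholders, not data or settings from the paper.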
