Abstract

Machine learning pipelines for protein functional family prediction are urgently needed, especially now that only 1% of raw protein sequences have been manually annotated. Although existing machine learning algorithms achieve decent performance in modeling and predicting the functional families of protein sequences, they have two drawbacks. First, the biological dependencies among nucleotides that these methods capture are not rich enough to describe motifs. Second, existing algorithms are not accurate enough to predict the functional families of newly discovered proteins. To address both limitations simultaneously, we propose DeepPPF, a novel deep learning framework for predicting protein families, which employs the word2vec technique to capture distributional dependencies among nucleotides and discovers rich features from diverse motif lengths to characterize proteins. The novelty of DeepPPF lies in utilizing distributional dependencies among nucleotides. Experimental results on G protein-coupled receptor hierarchical datasets show that DeepPPF achieves state-of-the-art performance, with Matthews correlation coefficients (MCC) of 97.62%, 88.45%, and 83.09% at the family, sub-family, and sub-subfamily hierarchical levels, respectively. DeepPPF also outperformed existing methods in terms of prediction accuracy and MCC on the clusters of orthologous groups (COG) and phage orthologous groups (POG) datasets. Furthermore, we analyzed the ability of the DeepPPF framework to discover rich motifs for the functional classes with the fewest protein sequences. The experimental results show that rich motif discovery is key to improving the modeling performance of protein families through deep learning techniques.
Finally, we investigated the effect of transferring knowledge from a low-level functional domain to a high-level functional domain, and the results show that target domain prediction can be improved with transfer learning. Our proposed deep learning framework can therefore be useful in characterizing protein functional families. The code and datasets are available at https://github.com/CSUBioGroup/DeepPPF.
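
Two ideas in the abstract can be made concrete with a short sketch: treating a sequence as overlapping "words" so that a word2vec-style model can learn distributional dependencies, and scoring predictions with MCC. The code below is an illustration under stated assumptions, not the DeepPPF implementation. The function names `kmer_tokens` and `binary_mcc`, the default word length `k=3`, and the example sequence are all hypothetical, and the MCC shown is the standard binary form; the abstract does not say how scores were aggregated across multiple classes.

```python
import math


def kmer_tokens(sequence, k=3):
    """Split a sequence string into overlapping k-mers ("words").

    Assumption: k=3 is only an illustrative choice; the abstract does not
    state the word length used to build the word2vec vocabulary.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def binary_mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Hypothetical sequence fragment, used only to show the tokenization.
    print(kmer_tokens("MKTAYIA", k=3))
    print(binary_mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

The token lists produced this way could serve as the "sentences" fed to a word2vec trainer, and varying `k` is one simple way to expose a model to motifs of different lengths.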
