Abstract
Identification of novel photosynthetic proteins is important for understanding and improving photosynthetic efficiency. Synergistically, genome neighborhood can provide additional useful information to identify photosynthetic proteins. We, therefore, expected that applying a computational approach, particularly machine learning (ML) with the genome neighborhood-based feature should facilitate the photosynthetic function assignment. Our results revealed a functional relationship between photosynthetic genes and their conserved neighboring genes observed by ‘Phylo score’, indicating their functions could be inferred from the genome neighborhood profile. Therefore, we created a new method for extracting patterns based on the genome neighborhood network (GNN) and applied them for the photosynthetic protein classification using ML algorithms. Random forest (RF) classifier using genome neighborhood-based features achieved the highest accuracy up to 87% in the classification of photosynthetic proteins and also showed better performance (Mathew’s correlation coefficient = 0.718) than other available tools including the sequence similarity search (0.447) and ML-based method (0.361). Furthermore, we demonstrated the ability of our model to identify novel photosynthetic proteins compared to the other methods. Our classifier is available at http://bicep2.kmutt.ac.th/photomod_standalone, https://bit.ly/2S0I2Ox and DockerHub: https://hub.docker.com/r/asangphukieo/photomod.
Highlights
Identification of novel photosynthetic proteins is important for understanding and improving photosynthetic efficiency
Curated gene ontology (GO) profile of photosynthetic genes of seven reference genomes (Supplementary Table S3) retrieved from Uniprot was used as the curated dataset
We investigated the relationship between the Phylo scores−a conservation score of neighboring genes−and a higher prediction specificity achieved when a more stringent criterion is used, and their functional similarity, in terms of GO
Summary
Identification of novel photosynthetic proteins is important for understanding and improving photosynthetic efficiency. We expected that applying a computational approach, machine learning (ML) with the genome neighborhood-based feature should facilitate the photosynthetic function assignment. We created a new method for extracting patterns based on the genome neighborhood network (GNN) and applied them for the photosynthetic protein classification using ML algorithms. Random forest (RF) classifier using genome neighborhood-based features achieved the highest accuracy up to 87% in the classification of photosynthetic proteins and showed better performance (Mathew’s correlation coefficient = 0.718) than other available tools including the sequence similarity search (0.447) and ML-based method (0.361). The application of the model was demonstrated by comparing the prediction performance with other available tools and by predicting 12 novel photosynthetic proteins, which were recently identified by experiments[22,23,24,25,26,27,28]
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.