Abstract

Background: DNA repair is the general term for the collection of critical mechanisms which repair many forms of DNA damage such as methylation or ionizing radiation. DNA repair has mainly been studied in experimental and clinical situations, and relatively few information-based approaches to extracting new DNA repair knowledge exist. As a first step, automatic detection of DNA repair proteins in genomes via informatics techniques is desirable; however, there are many forms of DNA repair, and identifying and classifying repair proteins with a single optimal method is not straightforward. We study the ability of homology and machine learning-based methods to identify and classify DNA repair proteins, and we scan vertebrate genomes for the presence of novel repair proteins. Combinations of primary sequence polypeptide frequency, secondary structure, and homology information are used as feature information for input to a Support Vector Machine (SVM).

Results: We find that SVM techniques can identify portions of DNA repair protein datasets without admitting false positives; at low levels of false positive tolerance, homology can also identify and classify proteins with good performance. Secondary structure information provides improved performance compared to using primary structure alone. Furthermore, we observe that machine learning methods incorporating homology information perform best when the data are filtered by some clustering technique. Applying these methodologies to scans of multiple vertebrate genomes confirms a positive correlation between the size of a genome and the number of DNA repair protein transcripts it is likely to contain, and simultaneously suggests that all organisms have a non-zero minimum number of repair genes. In addition, the scan result clusters several organisms' repair abilities in an evolutionarily consistent fashion. Analysis also identifies several functionally unconfirmed proteins that are highly likely to be involved in the repair process. A new web service, INTREPED, has been made available for the immediate search and annotation of DNA repair proteins in newly sequenced genomes.

Conclusion: Despite the complexity arising from a multitude of repair pathways, combinations of sequence, structure, and homology with Support Vector Machines offer good methods, in addition to existing homology searches, for DNA repair protein identification and functional annotation. Most importantly, this study has uncovered relationships between the size of a genome and a genome's available repair repertoire, and offers a number of new predictions as well as a prediction service, both of which reduce the search time and cost for novel repair genes and proteins.
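
To make the "primary sequence polypeptide frequency" features mentioned in the abstract more concrete, the following is a minimal sketch of how a protein sequence might be turned into a fixed-length frequency vector suitable as SVM input. It is an illustration under stated assumptions, not the authors' implementation: the function name `kmer_frequencies`, the choice of dipeptides (k = 2), and the handling of non-standard residues are all hypothetical.

```python
from itertools import product

# The 20 standard amino acids (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def kmer_frequencies(sequence, k=2):
    """Return a fixed-length vector of k-peptide frequencies for one protein.

    Hypothetical helper: k=2 (dipeptides) is an assumed setting, not one
    taken from the paper.
    """
    # Enumerate all 20**k possible k-peptides in a fixed order so that
    # every protein maps to a vector with the same layout.
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = dict.fromkeys(kmers, 0)

    seq = sequence.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Assumption: windows containing non-standard letters (e.g. X) are skipped.
        if window in counts:
            counts[window] += 1
            total += 1

    if total == 0:
        return [0.0] * len(kmers)
    # Normalize counts to frequencies so sequence length does not dominate.
    return [counts[m] / total for m in kmers]


# Usage: one feature vector per protein (the sequence here is made up).
features = kmer_frequencies("MKTAYIAKQR")
```

Vectors of this kind could be concatenated with secondary structure or homology-derived features before being passed to an SVM implementation; how the paper combines them is not specified in this excerpt.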

Highlights

  • DNA repair is the general term for the collection of critical mechanisms which repair many forms of DNA damage such as methylation or ionizing radiation

  • DNA repair is believed to exist in any organism with metabolic activity, and recent evidence suggests that even ancient bacteria dating back tens of thousands of years were capable of DNA repair [3]

  • We further clarify that the objective of this paper is not to study any specific repair gene in a particular organism, but rather to establish that several general repair patterns exist in all organisms, to provide new computational tools for DNA repair research, to use those tools to identify more proteins involved in repair, and to convey the computational complexity of repair protein prediction, which is analogous to its real-world complexity


Summary

Introduction

DNA repair is the general term for the collection of critical mechanisms which repair many forms of DNA damage such as methylation or ionizing radiation. Of the many forms of DNA repair, nucleotide excision repair (NER) is a critical repair system because of its ability to repair bulky lesions that consist of more than one nucleotide [1] and its complexity in utilizing at least 25 different polypeptides [4]. Another key mechanism is the mismatch repair system, which improves the error rate when copying DNA from one mistake per 10^7 nucleotides to one mistake per 10^9 nucleotides [5], a roughly 100-fold improvement. Some subtopics of DNA repair, such as translesion DNA synthesis (TLS), are still poorly understood [2].

