Abstract
Transcription factors (TFs) are proteins that control the transcription of a gene from DNA to messenger RNA (mRNA). TFs bind to a specific DNA sequence called a binding site. Transcription factor binding sites have not yet been completely identified, and this is considered to be a challenge that could be approached computationally. This challenge is considered to be a classification problem in machine learning. In this paper, the prediction of transcription factor binding sites of SP1 on human chromosome1 is presented using different classification techniques, and a model using voting is proposed. The highest Area Under the Curve (AUC) achieved is 0.97 using K-Nearest Neighbors (KNN), and 0.95 using the proposed voting technique. However, the proposed voting technique is more efficient with noisy data. This study highlights the applicability of the voting technique for the prediction of binding sites, and highlights the outperformance of KNN on this type of data. The study also highlights the significance of using voting.
Highlights
Transcription Factor Binding Sites of DNA sequences of living organisms contain information that create proteins
The results obtained in this paper show an enhanced performance for predicting transcription factors (TFs) binding sites, with an Area Under the Curve (AUC) equals 0.97 using K-Nearest Neighbors (KNN) and 0.95 using the proposed voting technique
The model starts with the dataset section, where the dataset is explained, followed by the classification stage, where different classifiers are employed to predict the transcription factor binding sites (TFBSs) of SP1 human chromosome 1
Summary
Transcription Factor Binding Sites of DNA sequences of living organisms contain information that create proteins. When a gene is to be transcribed, the enzyme must bind itself to the DNA of the gene at a specific sequence called the “promoter” sequence This is done with the help of the transcription factors. Logistic Regression [4] is a commonly used technique in classification problems that is applied when the values are completely different from each other. It is a statistical model in which a logistic curve is fitted to the dataset. Linear Discriminant Analysis (LDA) [5] is mainly used for dimensionality reduction It finds a linear combination of features that separates two or more classes of objects. It could be viewed as a classification algorithm
Published Version (Free)
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have