Domain Adaptation with Logistic Regression for the Task of Splice Site Prediction

Nic Herndon,Doina Caragea

doi:10.1007/978-3-319-19048-8_11

Abstract

AbstractSupervised classifiers are highly dependent on abundant labeled training data. Alternatives for addressing the lack of labeled data include: labeling data (but this is costly and time consuming); training classifiers with abundant data from another domain (however, the classification accuracy usually decreases as the distance between domains increases); or complementing the limited labeled data with abundant unlabeled data from the same domain and learning semi-supervised classifiers (but the unlabeled data can mislead the classifier). A better alternative is to use both the abundant labeled data from a source domain and the limited labeled data from the target domain to train classifiers in a domain adaptation setting. We propose such a classifier, based on logistic regression, and evaluate it for the task of splice site prediction – a difficult and essential step in gene prediction. Our classifier achieved high accuracy, with highest areas under the precision-recall curve between 50.83% and 82.61%.KeywordsDomain adaptationLogistic regressionSplice site predictionImbalanced data

Full Text