Abstract

We consider the problem of adding a large unlabeled sample from the target domain to boost the performance of a domain adaptation algorithm when only a small set of labeled examples is available from the target domain. In particular, we consider the problem setting motivated by the task of splice site prediction. For this task, annotating a genome using machine learning requires a large amount of labeled data, whereas for non-model organisms only a small amount of labeled data and a large amount of unlabeled data are available. With domain adaptation, one can leverage the large amount of data from a related model organism, along with the labeled and unlabeled data from the organism of interest, to train a classifier for the latter. Our goal is to analyze the three ways of incorporating the unlabeled data -- with soft labels only (i.e., Expectation-Maximization), with hard labels only (i.e., self-training), or with both soft and hard labels -- for splice site prediction in particular, and more broadly for a general iterative domain adaptation setting. We provide empirical results on splice site prediction indicating that using soft labels only can lead to a better classifier than the other two approaches.
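The soft-label (EM-style) and hard-label (self-training-style) variants contrasted in the abstract can be illustrated with a minimal sketch. This is not the paper's method: it uses a toy one-dimensional two-class Gaussian model with shared variance, and all function names and modeling choices here are illustrative assumptions. The only difference between the two variants is one line -- whether the unlabeled points keep their fractional posteriors or are thresholded to 0/1 pseudo-labels before refitting.

```python
import numpy as np

def posterior(X, mu0, mu1, var, p1):
    # P(y=1 | x) under a two-class 1-D Gaussian model with shared variance.
    ll0 = -(X - mu0) ** 2 / (2 * var) + np.log(1 - p1)
    ll1 = -(X - mu1) ** 2 / (2 * var) + np.log(p1)
    return 1.0 / (1.0 + np.exp(ll0 - ll1))

def iterate(X_lab, y_lab, X_unlab, mode="soft", n_iter=50):
    """Generic iterative loop over labeled + unlabeled data.

    mode="soft": unlabeled points contribute fractionally via their
                 posterior probabilities (EM-style soft labels).
    mode="hard": unlabeled points are thresholded to 0/1 pseudo-labels
                 (self-training-style hard labels).
    (A "both" variant would combine the two weightings.)
    """
    X = np.concatenate([X_lab, X_unlab])
    q = y_lab.astype(float)            # labeled responsibilities stay fixed
    u = np.full(len(X_unlab), 0.5)     # unlabeled ones start uninformative
    for _ in range(n_iter):
        w1 = np.concatenate([q, u])
        w0 = 1.0 - w1
        # M-step: weighted class means, shared variance, class prior.
        mu0 = np.average(X, weights=w0)
        mu1 = np.average(X, weights=w1)
        var = (np.sum(w0 * (X - mu0) ** 2) + np.sum(w1 * (X - mu1) ** 2)) / len(X)
        p1 = w1.mean()
        # E-step on the unlabeled sample: keep soft posteriors or harden them.
        post = posterior(X_unlab, mu0, mu1, var, p1)
        u = post if mode == "soft" else (post > 0.5).astype(float)
    return mu0, mu1, var, p1
```

With well-separated synthetic data and a handful of labeled anchors, both modes recover the class means; the abstract's empirical finding concerns which variant degrades less gracefully in the realistic, noisier splice-site setting.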

