Abstract

We consider the problem of adding a large unlabeled sample from the target domain to boost the performance of a domain adaptation algorithm when only a small set of labeled target examples is available. In particular, we consider the problem setting motivated by the tasks of splice site prediction and protein localization. For example, for splice site prediction, annotating a genome using machine learning requires a lot of labeled data, whereas for non-model organisms only a small amount of labeled data is available, along with plenty of unlabeled data. With domain adaptation, one can leverage the large amount of data from a related model organism, together with the labeled and unlabeled data from the organism of interest, to train a classifier for the latter. Our goal is to analyze three approaches to incorporating the unlabeled data—with soft labels only (i.e., Expectation-Maximization), with hard labels only (i.e., self-training), or with both soft and hard labels—for splice site prediction and protein localization in particular, and more broadly for a general iterative domain adaptation setting. We provide empirical results on splice site prediction and protein localization indicating that using a combination of soft and hard labels performs as well as the better of the other two approaches to integrating unlabeled data.
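To make the three strategies concrete, below is a minimal, illustrative sketch of an iterative domain adaptation loop; it is not the authors' implementation, and the base classifier, confidence threshold (CONF_THRESHOLD), iteration count (N_ITER), and weighting scheme are all assumptions chosen for readability.

```python
# Hypothetical sketch: three ways to fold unlabeled target data into an
# iterative domain adaptation loop. Not the paper's actual algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression

CONF_THRESHOLD = 0.9  # assumed confidence cutoff for assigning hard labels
N_ITER = 5            # assumed number of self-labeling iterations

def adapt(X_src, y_src, X_tgt_lab, y_tgt_lab, X_tgt_unlab, mode="both"):
    """mode: 'soft' (EM-style), 'hard' (self-training), or 'both'."""
    # Start from the pooled labeled data: source domain + small target set.
    X_lab = np.vstack([X_src, X_tgt_lab])
    y_lab = np.concatenate([y_src, y_tgt_lab])
    clf = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)

    for _ in range(N_ITER):
        proba = clf.predict_proba(X_tgt_unlab)[:, 1]
        if mode == "soft":
            # EM-style soft labels: every unlabeled example contributes to
            # both classes, weighted by its predicted posterior probability.
            X_aug = np.vstack([X_lab, X_tgt_unlab, X_tgt_unlab])
            y_aug = np.concatenate([y_lab,
                                    np.ones(len(proba)),
                                    np.zeros(len(proba))])
            w_aug = np.concatenate([np.ones(len(y_lab)), proba, 1 - proba])
        elif mode == "hard":
            # Self-training hard labels: only confidently predicted examples
            # are added, with the predicted label treated as ground truth.
            conf = np.maximum(proba, 1 - proba) >= CONF_THRESHOLD
            X_aug = np.vstack([X_lab, X_tgt_unlab[conf]])
            y_aug = np.concatenate([y_lab, (proba[conf] >= 0.5).astype(int)])
            w_aug = np.ones(len(y_aug))
        else:
            # Combined: hard labels for confident examples, soft labels
            # (probability-weighted duplicates) for the rest.
            conf = np.maximum(proba, 1 - proba) >= CONF_THRESHOLD
            X_soft, p_soft = X_tgt_unlab[~conf], proba[~conf]
            X_aug = np.vstack([X_lab, X_tgt_unlab[conf], X_soft, X_soft])
            y_aug = np.concatenate([y_lab,
                                    (proba[conf] >= 0.5).astype(int),
                                    np.ones(len(p_soft)),
                                    np.zeros(len(p_soft))])
            w_aug = np.concatenate([np.ones(len(y_lab) + conf.sum()),
                                    p_soft, 1 - p_soft])
        # Retrain on the augmented set and repeat.
        clf = LogisticRegression(max_iter=1000).fit(
            X_aug, y_aug, sample_weight=w_aug)
    return clf
```

The design choice being compared is how much to trust the current model on unlabeled target data: soft labels hedge by spreading each example's weight across classes, hard labels commit fully but only above a confidence threshold, and the combined mode applies each rule where it fits best.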
