Abstract

A data-level sampling method based on target-dataset-oriented instance transfer is proposed. Interactive texts have short sentences, missing sentence parts, and unbalanced class distributions across multiple domains, which lead to a high-dimensional feature space, sparse feature values, and a lack of positive instances; the method is designed to address these difficulties. A function is employed to choose the features used for evaluating instance similarity between the source and target datasets. The function calculates the sum of the information gains of the Top-N common features of the two datasets and their proportions in that sum. Moreover, a homogenization method is presented for the feature spaces of the target and source datasets to overcome the inconsistency between them. Instances are selected and transferred from a domain of the source dataset to the corresponding domain of the target dataset to solve the problem of unbalanced class distribution in multiple domains. Experimental results show that the proposed method effectively alleviates the class imbalance problem in the target dataset. When run with four classic classification methods, i.e., support vector machine, random forest, naive Bayes, and random committee, the proposed method improves the weighted receiver operating characteristic (ROC) curve measure by 11.3% on average.
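The feature-selection function described above can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything here is an assumption: documents are modeled as sets of tokens, information gain is computed for the binary feature "term present", a common feature's score is the sum of its information gains in the two datasets, and a feature's "proportion" is its share of the Top-N total. The names `entropy`, `information_gain`, and `top_n_common_features` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in the document'.

    Assumption: each document is a set of tokens.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = sum(
        len(part) / n * entropy(part)
        for part in (with_term, without_term)
        if part  # skip an empty split
    )
    return entropy(labels) - conditional


def top_n_common_features(src_docs, src_labels, tgt_docs, tgt_labels, n):
    """Return (total, weights) for the Top-N features shared by both datasets.

    Assumption: a shared feature is scored by the sum of its information
    gains in the source and target datasets; weights are each feature's
    proportion of the Top-N total.
    """
    common = set().union(*src_docs) & set().union(*tgt_docs)
    scored = {
        t: information_gain(src_docs, src_labels, t)
        + information_gain(tgt_docs, tgt_labels, t)
        for t in common
    }
    # Sort by descending score; break ties alphabetically for determinism.
    top = sorted(scored, key=lambda t: (-scored[t], t))[:n]
    total = sum(scored[t] for t in top)
    weights = {t: (scored[t] / total if total > 0 else 0.0) for t in top}
    return total, weights
```

Under these assumptions, the returned weights could be used to weight a per-feature similarity between a source instance and the target dataset; how the paper actually combines them is not stated in the abstract.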
