This paper proposes a data-level sampling method based on target-dataset-oriented instance transfer. It addresses the difficulties that arise because interactive texts are short, often contain incomplete sentences, and have unbalanced class distributions across multiple domains: a high-dimensional feature space, sparse feature values, and a lack of positive instances. A function is used to select the features for evaluating instance similarity between the source and target datasets. The function calculates the sum of the information gains of the top-N features common to the two datasets, together with each feature's proportion of that sum. In addition, a homogenization procedure is presented for the feature spaces of the target and source datasets to overcome the inconsistency between them. Instances are then selected and transferred from each domain of the source dataset to the corresponding domain of the target dataset to address the unbalanced class distribution across multiple domains. Experimental results show that the proposed method effectively alleviates the class imbalance in the target dataset. When combined with four classic classifiers (support vector machine, random forest, naive Bayes, and random committee), the proposed method yields an average improvement of 11.3% in the weighted receiver operating characteristic (ROC) curve measure. © 2015, Xi'an Jiaotong University.
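The feature-scoring step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how information gain is combined across the two datasets, so the sketch assumes binary term-presence features, documents represented as sets of tokens, and a combined score equal to the sum of a term's information gain in the source and target datasets. All function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Information gain of the binary feature 'term occurs in document'."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


def top_n_common_feature_weights(src_docs, src_labels, tgt_docs, tgt_labels, n):
    """Return (sum of gains, {term: share of that sum}) for the top-n common terms.

    Assumption: a term's score is its information gain in the source dataset
    plus its information gain in the target dataset.
    """
    common = set().union(*src_docs) & set().union(*tgt_docs)
    scored = {
        t: information_gain(src_docs, src_labels, t)
        + information_gain(tgt_docs, tgt_labels, t)
        for t in common
    }
    # Sort by descending score; break ties alphabetically for determinism.
    top = sorted(scored, key=lambda t: (-scored[t], t))[:n]
    total = sum(scored[t] for t in top)
    weights = {t: (scored[t] / total if total > 0 else 0.0) for t in top}
    return total, weights
```

Under these assumptions, the returned weights could serve as per-feature weights when comparing a source instance with target instances, although the abstract does not specify the similarity measure itself.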
Bibliographical note
This article is available at: http://dx.doi.org/10.7652/xjtuxb201504011
- Imbalanced sentiment classification
- Instance transfer
- Interactive texts
- Multiple domain