Abstract
The lack of sentiment resources in low-resource languages poses challenges for machine-learning-based sentiment analysis. Cross-lingual and semi-supervised learning are the most common approaches for overcoming this issue. However, the performance of existing methods degrades because of the poor quality of translated resources, data sparseness and, more specifically, language divergence. We propose an integrated learning model that combines semi-supervised learning with an ensemble model and exploits the available sentiment resources to tackle language-divergence issues. In addition, to reduce the impact of translation errors and handle the instance-selection problem, we propose a clustering-based bee-colony sample-selection method for the optimal selection of the most distinguishing features representing the target data. To evaluate the proposed model, various experiments are conducted on an English-Arabic cross-lingual data set. The results demonstrate that the proposed model outperforms the baseline approaches in terms of classification performance. Furthermore, the statistical outcomes indicate the advantages of the proposed training-data sampling and target-based feature selection in reducing the negative effect of translation errors. These results highlight that the proposed approach achieves performance close to that of in-language supervised models.
Highlights
With the development of the Web 3.0 era and artificial intelligence (AI), an increasing amount of multilingual user-generated content is available that expresses users' views, feedback or comments on various aspects such as product quality, services, and government policies.
Normalize Yp; repeat the above steps n times from step 3 to build n trained semi-supervised models, each trained with a different feature set. Step (2): the n semi-supervised classifiers vote to determine the final labels for the unlabeled data Yp (a minimal sketch of this voting step follows the highlights below).
The experimental results using a voting ensemble of LR, NB and Maximum Entropy (ME) classifiers on the Books (B), DVDs (D), Electronics (E), and Kitchen Appliances (K) domains are summarized in Table 2 and Figure 2.
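The ensemble step described in the highlights can be illustrated with a short sketch. The code below is a hypothetical simplification, not the authors' implementation: it builds n self-training classifiers, each on a different feature subset, and combines their predictions on the unlabeled target data Yp by majority vote. The function name, the use of scikit-learn's SelfTrainingClassifier with LogisticRegression, and the assumption of non-negative integer class labels are all illustrative choices.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

def ensemble_pseudo_label(X_labeled, y_labeled, X_unlabeled, feature_subsets):
    """Vote over n self-trained models, each using a different feature subset.

    Assumes class labels are non-negative integers (e.g., 0 = negative, 1 = positive).
    """
    votes = []
    for cols in feature_subsets:
        # -1 marks unlabeled instances for scikit-learn's self-training wrapper
        X = np.vstack([X_labeled[:, cols], X_unlabeled[:, cols]])
        y = np.concatenate([y_labeled, -np.ones(len(X_unlabeled), dtype=int)])
        model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
        model.fit(X, y)
        votes.append(model.predict(X_unlabeled[:, cols]))
    votes = np.stack(votes)  # shape: (n_models, n_unlabeled)
    # Majority vote over the n models determines the final label for each unlabeled instance
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

In this sketch the number of models n is simply the number of feature subsets supplied; the paper's target-based feature selection would determine those subsets, which are passed in here as plain column-index lists.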
Summary
With the development of the Web 3.0 era and artificial intelligence (AI), an increasing amount of multilingual user-generated content is available that expresses users' views, feedback or comments on various aspects such as product quality, services, and government policies. To overcome the annotation cost, various solutions have been proposed in the literature to exploit the unlabeled data in the target language (semi-supervised learning) [1], or to explore translated models and/or data available in other languages (transfer learning) [3,4,5,9]. The lack of these annotated resources in the majority of languages has motivated research toward cross-lingual approaches for sentiment analysis. SCLL techniques attempt to make use of existing annotated sentiment resources from a resource-rich language domain (i.e., a different genre and/or topic). These approaches employ machine translation (from the target to the source language, from the source to the target, or in both directions, referred to as bidirectional), bilingual lexicons, or cross-lingual representation learning techniques with parallel corpora to project the labeled data from the source to the target language [1,3,9,10].
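As a concrete reference point for the translation-based projection described above, the following minimal sketch translates labeled English reviews into the target language and trains a standard classifier on the translated text. It is an assumption-laden illustration rather than the paper's method: translate_en_to_ar is a placeholder for any machine-translation system, and the TF-IDF plus logistic-regression pipeline is an arbitrary but common baseline choice.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def translate_en_to_ar(texts):
    # Placeholder: plug in an actual machine-translation model or service here.
    raise NotImplementedError

def train_cross_lingual(source_texts_en, source_labels, target_texts_ar):
    # Project the labeled source-language data into the target language via MT
    translated = translate_en_to_ar(source_texts_en)
    # Train a target-language classifier on the translated, labeled reviews
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(translated, source_labels)
    # Predict sentiment labels for the unlabeled target-language reviews
    return clf.predict(target_texts_ar)

Translation noise in the projected training data is exactly the weakness the paper targets with its training-data sampling and target-based feature selection.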