Abstract

Social media platforms hold rich text data that can be used for data mining and analysis. However, natural language evolves rapidly on social media, and the data are very noisy, which poses a great challenge to the accuracy of data analysis. To overcome this problem, we propose topic-relevant content extraction (TRCE), a method based on deep multiple-instance classification. It leverages existing information and the hierarchical relationships among texts under a thread on social media as weak supervision, extracting strongly topic-relevant data and filtering out noise accurately without manually labeled data. The proposed method introduces latent variables, the Bernoulli distribution, and variational inference into multiple-instance learning (MIL) to generate pseudo labels. We then employ a dual-stream neural network with a three-stage training process to train MIL end-to-end. Experimental results show that TRCE significantly outperforms other MIL methods, while its accuracy and F1 score are only slightly lower than those of supervised text classification. Given that TRCE needs no manually labeled data at all, whereas supervised classification relies heavily on labeled data, TRCE is a competitive method for extracting topic-relevant data and filtering out noise on social media.
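To make the multiple-instance setting concrete, the following is a minimal sketch of the standard MIL assumption, not of TRCE itself. It treats a thread as a bag of texts, gives each text a Bernoulli relevance probability, and combines them with noisy-OR pooling. The function names, the example probabilities, and the 0.5 threshold are all illustrative assumptions; the paper's variational inference and dual-stream network are not reproduced here.

```python
from math import prod

def bag_positive_probability(instance_probs):
    """Noisy-OR pooling: a bag is positive if at least one instance is.

    Each instance is modeled as an independent Bernoulli variable with
    probability p_i of being topic-relevant (an illustrative assumption).
    """
    return 1.0 - prod(1.0 - p for p in instance_probs)

def pseudo_labels(instance_probs, threshold=0.5):
    """Hard pseudo labels obtained by thresholding instance probabilities.

    The threshold of 0.5 is a hypothetical choice, not taken from the paper.
    """
    return [1 if p >= threshold else 0 for p in instance_probs]

# Hypothetical relevance probabilities for three texts under one thread.
thread = [0.9, 0.2, 0.6]
bag_p = bag_positive_probability(thread)
labels = pseudo_labels(thread)
print(bag_p, labels)
```

In this simplified view, only the bag-level signal (for example, the thread's topic) is observed, and the instance-level labels are latent quantities that a model has to infer.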
