Abstract
In recent years, social media messaging app data has served as a valuable source of useful information, such as critical clues and evidence in legal trials and criminal investigations. Although these data can be of various types, they are mostly in the form of natural language text. Therefore, to extract information from them efficiently, practical natural language processing approaches are needed. This study proposes applying a deep-learning-based named-entity recognition (NER) system to these messaging data as a natural language processing approach for information extraction. In addition, a system is presented that automatically constructs NER training data for deep-learning models using the distant supervision method. Because social media messaging app data generally include a significant amount of noise, such as typographical and word-spacing errors, an NER system that is robust to this kind of noise is required to extract information from the messaging data effectively. The results demonstrate that the proposed approach outperforms an NER system trained on manually labeled training data.
Highlights
With the recent popularization of smartphones and social network service (SNS) applications, private interpersonal communication has become easier through social media messaging (SMM)
Precision measures the quality of predictions, and it is represented as the ratio of the number of predicted named entities (NEs) that are correct answers to the number of NEs predicted by the proposed named-entity recognition (NER) system
When syllable embedding with CNN and POS features are added to the baseline, the proposed method improved by 0.82%p on the large automatically labeled data (67,200 messages) generated by distant supervision
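The precision definition in the highlights can be illustrated with a minimal Python sketch. It assumes each named entity is represented as a `(start, end, type)` tuple and that a prediction counts as correct only when all three fields match a gold entity exactly. That matching rule, the function name `precision`, and the toy data are assumptions made for illustration; they are not taken from the paper.

```python
def precision(predicted, gold):
    """Entity-level precision: correct predicted NEs / all predicted NEs."""
    predicted = set(predicted)
    gold = set(gold)
    if not predicted:
        # No predictions at all: define precision as 0.0 (a convention, assumed here).
        return 0.0
    return len(predicted & gold) / len(predicted)


# Toy data (hypothetical): entities as (start, end, type) spans.
gold = {(0, 2, "PER"), (5, 7, "LOC"), (9, 10, "ORG")}
pred = {(0, 2, "PER"), (5, 7, "ORG"), (9, 10, "ORG"), (12, 13, "DAT")}

# Two of the four predicted entities match a gold entity exactly.
p = precision(pred, gold)
print(p)
```

In this toy case the span `(5, 7)` is found but typed as `ORG` instead of `LOC`, so under exact matching it counts as an incorrect prediction.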
Summary
With the recent popularization of smartphones and social network service (SNS) applications, private interpersonal communication has become easier through social media messaging (SMM). Constructing NER training data by manual labeling wastes resources in terms of time and cost. To solve this issue, the distant supervision method, which is a semi-supervised learning method, was used in this study to construct training data automatically, resulting in automatically labeled data for deep-learning-based NER using the SMM app data. Performance improved when the post-training method with automatically labeled data was applied and fine-tuning was conducted on BERT-based NER. The BiLSTM-CRF-based NER system trained on the large automatically labeled data showed a 14.14%p improvement compared to the small sample. This result shows that significant improvement can be achieved by applying the distant supervision technique and using large unlabeled data.
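As a rough illustration of how distant supervision can produce labeled data automatically, the sketch below tags tokens in BIO format by greedy longest match against an entity dictionary. The dictionary contents, the whitespace tokenization, the `max_len` limit, and the function name `distant_label` are all hypothetical; the summary does not describe the paper's actual labeling resources or matching procedure, so this is one common variant and not the authors' system.

```python
def distant_label(tokens, gazetteer, max_len=3):
    """Assign BIO tags by greedy longest match against a gazetteer (assumed scheme)."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in gazetteer:
                label = gazetteer[phrase]
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return tags


# Hypothetical entity dictionary and message.
GAZETTEER = {"Seoul Station": "LOC", "Kim": "PER"}
tokens = "meet Kim at Seoul Station tomorrow".split()
tags = distant_label(tokens, GAZETTEER)
print(list(zip(tokens, tags)))
```

Exact dictionary matching like this misses misspelled or wrongly spaced mentions, which is one reason automatically labeled messaging data tends to be noisy.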