Mining the Frequent Patterns of Named Entities for Long Document Classification

Bohan Wang, Jianwei Zhang, Rui Qi, Wenjun Ke, Xiaoguang Yuan, Jinhua Gao

doi:10.3390/app12052544

Abstract

A large amount of information is now stored as text, and numerous text mining techniques have been developed for applications such as event detection, news topic classification, public opinion detection, and sentiment analysis. Although significant progress has been achieved for short text classification, document-level text classification requires further exploration. Long documents often contain irrelevant, noisy information that obscures indicative features and limits the interpretability of classification results. To alleviate this problem, a model for long document classification called MIPELD (mining the frequent pattern of a named entity for long document classification) is presented, which mines the frequent patterns of named entities as features. The discovered patterns allow semantic generalization among documents and provide clues for verifying the results. Experiments on several datasets yielded good accuracy and macro-F1 values, meeting the requirements for practical application. Further analysis validated the effectiveness of MIPELD in mining interpretable information in text classification.
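
The abstract does not spell out the mining procedure, so the following is only a minimal sketch of the general idea of using frequent patterns of named entities as features, not MIPELD itself. It assumes that named entities have already been extracted for each document, treats a pattern as a small set of co-occurring entities, and uses a brute-force count. The function names, the `TYPE:mention` entity labels, and the `min_support` and `max_size` parameters are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_patterns(entity_sets, min_support=2, max_size=2):
    """Return entity combinations that occur in at least min_support documents.

    entity_sets: one collection of entity labels per document
    (assumed to come from a separate named entity recognizer).
    """
    counts = Counter()
    for entities in entity_sets:
        items = sorted(set(entities))  # canonical order so each pattern has one form
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {pattern for pattern, count in counts.items() if count >= min_support}


def pattern_features(entities, patterns):
    """List the mined patterns that are fully contained in one document."""
    present = set(entities)
    return sorted(p for p in patterns if present.issuperset(p))


if __name__ == "__main__":
    documents = [
        {"ORG:UN", "LOC:Geneva"},
        {"ORG:UN", "LOC:Geneva", "PER:Smith"},
        {"ORG:FIFA", "LOC:Zurich"},
    ]
    mined = frequent_patterns(documents, min_support=2)
    print(pattern_features({"ORG:UN", "LOC:Geneva", "PER:Lee"}, mined))
```

Because each feature is a readable set of entities, the patterns matched by a document can be inspected directly, which is the kind of clue for verifying results that the abstract refers to.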

Highlights

Text classification is a classic topic in the field of natural language processing
We present the idea that identifying key information in long documents is the key to classifying them
We present MIPELD, a model that extracts key information from long document samples, classifies them according to that information, and provides the basis for its classification choices
Our experimental results on three datasets demonstrate that our method has accurate and robust performance
K-nearest neighbor (KNN) [21] is a non-parametric classification method that finds the k-nearest neighbors of the text to be classified in the training set, ranks the candidate types of the text to be classified based on the neighbor types, and selects the type with the highest score as the type of text to be classified
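
The KNN procedure described above can be sketched as follows. This is an illustrative toy, not code from the paper: it assumes a bag-of-words representation with cosine similarity and scores each candidate type by the summed similarity of the neighbors that carry it. The function names and the default `k=3` are hypothetical choices.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map lowercased whitespace tokens to their counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(text, training_set, k=3):
    """Return the type with the highest score among the k nearest neighbors.

    training_set: list of (document_text, type) pairs.
    """
    query = bag_of_words(text)
    scored = [(cosine(query, bag_of_words(doc)), label) for doc, label in training_set]
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    scores = Counter()
    for similarity, label in neighbors:
        scores[label] += similarity  # rank candidate types by neighbor similarity
    return max(scores, key=scores.get)


if __name__ == "__main__":
    training = [
        ("the team won the match", "sports"),
        ("the striker scored a goal", "sports"),
        ("the bank raised interest rates", "finance"),
        ("stocks fell as rates rose", "finance"),
    ]
    print(knn_classify("the goal decided the match", training))
```

There is no training step: all the work happens at query time, which is why the method is called non-parametric and why its cost grows with the size of the training set.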

Summary

Introduction

With the development of information technology, a large amount of document-level text data has emerged on different topics, such as news articles [1], patent documents [2], e-commerce [3], construction specifications [4], and medical narratives [5]. Effective classification of these long documents helps readers acquire target information efficiently. For machine learning-based methods, statistics-based text representation methods are the most commonly used, such as one-hot [6], bag-of-words [7], and n-grams with TF-IDF [8]. These methods are simple and effective, with low computational costs. Neural network methods usually need a large-scale dataset to achieve good performance, and their time consumption is significant.
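
As a rough illustration of the statistics-based representations mentioned above, the sketch below computes TF-IDF weights over word n-grams using only the standard library. It is a minimal example under stated assumptions, not the setup used in the paper: tokens are split on whitespace, term frequency is normalized by document length, and the inverse document frequency is the unsmoothed `log(N / df)`. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf(documents, n=1):
    """Return one {term: weight} dictionary per document."""
    tokenized = [ngrams(doc.lower().split(), n) for doc in documents]

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for terms in tokenized:
        doc_freq.update(set(terms))

    total = len(documents)
    vectors = []
    for terms in tokenized:
        counts = Counter(terms)
        vectors.append({
            term: (count / len(terms)) * math.log(total / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


if __name__ == "__main__":
    docs = ["patent claims a device", "patent claims a method", "news about a match"]
    for vector in tfidf(docs):
        print(vector)
```

With `n=1` this is a weighted bag-of-words; larger `n` keeps some local word order at the cost of a much larger, sparser vocabulary. A term that appears in every document receives weight zero under this unsmoothed formula.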

Methods

Results

Conclusion

Journal: Applied Sciences	Publication Date: Feb 28, 2022
Citations: 2	License type: CC BY 4.0

Similar Papers

SCAT: a system of classification for Arabic texts
Rami Ayadi ... Mohsen Maraoui
International Journal of Internet Technology and Secured Transactions | VOL. 3
01 Jan 2010

Research On Text Classification Based On Deep Neural Network
Deageon Kim
International Journal of Communication Networks and Information Security (IJCNIS) | VOL. 14
31 Dec 2022

Multilevel Classification of Pakistani News using Machine Learning
Anum Ilyas ... Narmeen Zakaria Bawany
21 Dec 2021

Seed-Guided Topic Model for Document Filtering and Classification
Chenliang Li ... Jian Xing
ACM Transactions on Information Systems | VOL. 37
06 Dec 2018
