Abstract
Nowadays, a large amount of information is stored as text, and numerous text mining techniques have been developed for applications such as event detection, news topic classification, public opinion detection, and sentiment analysis. Although significant progress has been achieved in short text classification, document-level text classification requires further exploration. Long documents often contain irrelevant, noisy information that obscures indicative features and limits the interpretability of classification results. To alleviate this problem, this paper presents MIPELD (mining the frequent pattern of a named entity for long document classification), a model that mines the frequent patterns of named entities as features. The discovered patterns allow semantic generalization across documents and provide clues for verifying the results. Experiments on several datasets yielded good accuracy and macro-F1 values, meeting the requirements of practical applications. Further analysis validated the effectiveness of MIPELD in mining interpretable information for text classification.
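The abstract does not spell out how the pattern mining is implemented; purely as an illustration, the following minimal Python sketch shows one way frequent named-entity patterns could be counted across documents, assuming entities have already been extracted by an upstream NER step (the function, thresholds, and toy corpus are hypothetical and not taken from the paper).

```python
# Hypothetical sketch: count named-entity combinations (patterns) that occur
# in at least `min_support` documents. Entity extraction is assumed done upstream.
from collections import Counter
from itertools import combinations

def mine_frequent_entity_patterns(doc_entities, min_support=2, max_size=2):
    counts = Counter()
    for entities in doc_entities:
        unique = sorted(set(entities))          # each document contributes a set
        for size in range(1, max_size + 1):
            for pattern in combinations(unique, size):
                counts[pattern] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

# Toy example: entities already extracted from three documents.
docs = [["NASA", "SpaceX", "Falcon 9"], ["NASA", "SpaceX"], ["NASA", "Boeing"]]
print(mine_frequent_entity_patterns(docs))
# {('NASA',): 3, ('SpaceX',): 2, ('NASA', 'SpaceX'): 2}
```

Patterns that recur across documents of the same class could then serve as interpretable features, which is the kind of clue for verification that the abstract refers to.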
Highlights
Text classification is a classic topic in the field of natural language processing.
We present the idea that identifying key information in long documents is the key to classifying them.
We demonstrate MIPELD, a model that extracts key information from long document samples, classifies them according to that information, and provides the basis for its classification decisions.
Our experimental results on three datasets demonstrate that our method achieves accurate and robust performance.
K-nearest neighbor (KNN) [21] is a non-parametric classification method that finds the k nearest neighbors of the text to be classified in the training set, ranks the candidate classes according to the neighbors' labels, and selects the highest-scoring class as the label of the text to be classified.
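As an illustration of this neighbor-voting idea (not the setup used in [21] or in this paper), the minimal scikit-learn sketch below classifies a toy document with TF-IDF features and a KNN classifier; the corpus, labels, and k value are hypothetical.

```python
# Minimal KNN text-classification sketch: represent texts as TF-IDF vectors,
# then let the k nearest training documents vote on the label.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier

train_texts = [
    "stocks fell sharply today",
    "the team won the final match",
    "central bank raises interest rates",
    "coach praises star striker",
]
train_labels = ["finance", "sports", "finance", "sports"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, train_labels)

X_test = vectorizer.transform(["interest rates hit stocks"])
print(knn.predict(X_test))  # expected: ['finance']
```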
Summary
With the development of information technology, a large amount of document-level text data has emerged on different topics, such as news articles [1], patent documents [2], e-commerce [3], construction specifications [4], and medical narratives [5]. Effective classification of these long documents benefits the efficient acquisition of target information. For machine learning-based methods, statistics-based text representations are the most commonly used, such as one-hot [6], bag-of-words [7], and n-grams with TF-IDF [8]. These methods are simple and effective, with low computational costs. Neural network methods usually need a large-scale dataset to achieve good performance, and their time consumption is significant.
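For illustration only, the sketch below builds the three statistics-based representations mentioned above (binary one-hot word presence, bag-of-words counts, and uni-/bi-gram TF-IDF) with scikit-learn on a toy corpus; it is not code from the paper.

```python
# Three common statistics-based text representations on a toy corpus.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog sat on the log"]

onehot = CountVectorizer(binary=True).fit_transform(corpus)        # word presence (0/1)
bow = CountVectorizer().fit_transform(corpus)                       # raw word counts
tfidf = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(corpus)   # uni-/bi-gram TF-IDF

print(onehot.toarray())
print(bow.toarray())
print(tfidf.shape)  # (2, number of uni- and bi-gram features)
```

Such sparse representations are cheap to compute, which is what makes them attractive compared with neural models that need large training sets and long training times.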