Automatic Text Classification Research Articles

Purpose – The purpose of this paper is to develop a novel feature selection approach for automatic text classification of large digital documents, specifically e-books in an online library system. The main idea is to identify discourse features automatically in order to improve the feature selection process, rather than focusing on the size of the corpus. Design/methodology/approach – The proposed framework automatically identifies the discourse segments within e-books, captures the discourse subtopics that are cohesively expressed in those segments, and treats these subtopics as informative and prominent features. The selected features are then used to train a support vector machine (SVM) classifier and perform the e-book classification task. Findings – The evaluation of the proposed framework shows that identifying discourse segments and capturing subtopic features leads to better performance than two conventional feature selection techniques: TF-IDF and mutual information. It also demonstrates that discourse features play an important role among textual features, especially for large documents such as e-books. Research limitations/implications – Automatically extracted subtopic features cannot be fed directly into the feature selection (FS) process; a threshold must first be applied to control which features are admitted. Practical implications – The proposed technique demonstrates the promise of using discourse analysis to enhance the classification of large digital documents such as e-books, compared with conventional techniques. Originality/value – A new FS technique is proposed that inspects the narrative structure of large documents, which is new to the text classification domain. A further contribution is that the evaluation results provide more evidence for considering discourse information in future text analysis. The proposed system can be integrated into other library management systems.
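The abstract above names mutual information as one of the two conventional feature selection baselines. As a rough illustration of that baseline only (not the paper's discourse-based method), the following minimal sketch scores each term by the mutual information between its presence in a document and the class label. The toy documents, the labels, and the function names `mutual_information` and `select_top_k` are assumptions made up for this example.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Score each term by the mutual information (in bits) between
    its presence in a document and the document's class label."""
    n = len(docs)
    class_counts = Counter(labels)
    # Represent each document as the set of its lowercased tokens.
    doc_sets = [set(d.lower().split()) for d in docs]
    vocab = set().union(*doc_sets)
    scores = {}
    for term in vocab:
        # Joint counts of (term present?, class label).
        joint = Counter((term in s, y) for s, y in zip(doc_sets, labels))
        present = sum(1 for s in doc_sets if term in s)
        mi = 0.0
        for (has_term, y), n_ty in joint.items():
            n_t = present if has_term else n - present
            # P(t, c) * log2( P(t, c) / (P(t) * P(c)) )
            mi += (n_ty / n) * math.log2(n_ty * n / (n_t * class_counts[y]))
        scores[term] = mi
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, invented for illustration.
docs = [
    "the dragon guards the castle",
    "the dragon burns the village",
    "the enzyme binds the protein",
    "the enzyme cuts the protein",
]
labels = ["fiction", "fiction", "science", "science"]

scores = mutual_information(docs, labels)
top_terms = select_top_k(scores, 3)
print(top_terms)  # ['dragon', 'enzyme', 'protein']
```

Terms that appear in every document (here "the") score zero because their presence says nothing about the class, while terms confined to one class score highest. In a full pipeline the selected terms would become the feature columns passed to a classifier such as an SVM.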

Automatic detection of adverse drug reaction (ADR) mentions from text has recently received significant interest in pharmacovigilance research. Current research focuses on various sources of text-based information, including social media, where enormous amounts of user-posted data are available and have the potential for use in pharmacovigilance if collected and filtered accurately. The aims of this study are: (i) to explore natural language processing (NLP) approaches for generating useful features from text, and to use them in optimized machine learning algorithms for automatic classification of ADR-assertive text segments; (ii) to present two data sets that we prepared for the task of ADR detection from user-posted internet data; and (iii) to investigate whether combining training data from distinct corpora can improve automatic classification accuracies. One of our three data sets contains annotated sentences from clinical reports, and the other two, built in-house, consist of annotated posts from social media. Our text classification approach relies on generating a large set of features, representing semantic properties (e.g., sentiment, polarity, and topic), from short text nuggets. Importantly, using our expanded feature sets, we combine training data from different corpora in an attempt to boost classification accuracies. Our feature-rich classification approach performs significantly better than previously published approaches, with ADR class F-scores of 0.812 (previously reported best: 0.770), 0.538, and 0.678 for the three data sets. Combining training data from multiple compatible corpora further improves the ADR F-scores for the in-house data sets to 0.597 (improvement of 5.9 units) and 0.704 (improvement of 2.6 units), respectively. Our research results indicate that using advanced NLP techniques for generating information-rich features from text can significantly improve classification accuracies over existing benchmarks. Our experiments illustrate the benefits of incorporating various semantic features such as topics, concepts, sentiments, and polarities. Finally, we show that integration of information from compatible corpora can significantly improve classification performance. This form of multi-corpus training may be particularly useful in cases where data sets are heavily imbalanced (e.g., social media data), and may reduce the time and costs associated with the annotation of data in the future.
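The results above are reported as F-scores for the ADR class alone, which is the usual choice when the positive class is rare. The minimal sketch below shows how such a per-class F-score is computed from gold and predicted labels, and how the reported "units" of improvement follow from the quoted scores. The label names, the toy label lists, and the function name `class_f_score` are assumptions for illustration and are not taken from the paper.

```python
def class_f_score(gold, predicted, positive="ADR"):
    """F1 score for a single class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical imbalanced toy labels (3 ADR posts out of 8).
gold = ["ADR", "ADR", "ADR", "noADR", "noADR", "noADR", "noADR", "noADR"]
pred = ["ADR", "ADR", "noADR", "ADR", "noADR", "noADR", "noADR", "noADR"]
toy_f = class_f_score(gold, pred)  # tp=2, fp=1, fn=1

# The "units" quoted in the abstract are F-score differences times 100.
gain_first = round((0.597 - 0.538) * 100, 1)
gain_second = round((0.704 - 0.678) * 100, 1)
print(toy_f, gain_first, gain_second)
```

Because true negatives never enter the formula, the score is not inflated by the large majority of non-ADR posts, which is why it suits heavily imbalanced social media data better than plain accuracy.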

Articles published on Automatic Text Classification

The linguistic construal of disciplinarity: A data‐mining approach using register features

Automatic text classification method based on Zipf’s law

Web Page Classification Using SVM and FURIA

A feature selection approach for automatic e-book classification based on discourse segmentation

Automatic Categorization of Documents Using Latent Semantic Analysis and Fuzzy Inference Algorithm of Mamdani

Automatic Classification of Bengali Sentences Based on Sense Definitions Present in Bengali Wordnet

Feature Selection and Reduction for Persian Text Classification

Vehicle Fault Diagnostics Using Text Mining, Vehicle Engineering Structure and Machine Learning

Improvement of automatic Chinese text classification by combining multiple features

Evolving fuzzy grammar for crime texts categorization

Outomatiese genreklassifikasie vir Afrikaans [Automatic genre classification for Afrikaans]

Portable automatic text classification for adverse drug reaction detection via multi-corpus training

An Improved Expectation Maximization based Semi-Supervised Email Classification using Naïve Bayes and K-Nearest Neighbor

Research and Implementation of Text Classification Algorithm

The method of zonal correlation text analysis

Application of a staged learning-based resource allocation network to automatic text categorization

Implementation of Support Vector Machine Technique in Feedback Analysis System

Improved Information Filtering and Feature Dimensionality Reduction using Semantic based Feature Dataset for Text Classification: In Context to Social Network

"Our Grief is Unspeakable": Automatically Measuring the Community Impact of a Tragedy

Automatic classification of documents in a natural language: A conceptual model
