Text Classification System Research Articles

This paper presents a novel feature engineering technique for improving the performance of machine learning-based automatic web document classification systems. To categorize hypertext web documents effectively, the proposed method extends the initial feature set by exploiting hyperlink relationships. Web documents are connected to one another through hyperlinks, and in many cases those links exist between highly related documents. Such link information can therefore improve the quality of the feature set on which a classification model is built. The basic idea is to generate abstracted concept features, each consisting of several raw feature words, by evaluating the semantic similarity between the words (features) in a target document and those in its neighboring documents using the WordNet ontology. The similarity function quantifies the hypernym/hyponym relationships between features within WordNet. When the classification model is built, the abstracted concept features are treated in the same way as ordinary raw features, which contributes to a more accurate model. Experiments with the Web-KB test collection show that the proposed method outperforms conventional ones.
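
The abstract does not give the exact similarity function, so the following is only a minimal sketch of the idea, assuming Wu-Palmer similarity over a hypernym hierarchy as a stand-in. The `HYPERNYM` table is a hypothetical toy taxonomy rather than real WordNet data, and `concept_features`, its `threshold` parameter, and the example words are illustrative names that do not come from the paper.

```python
from itertools import combinations

# Hypothetical toy taxonomy (child -> hypernym). NOT real WordNet data.
HYPERNYM = {
    "entity": None,
    "person": "entity",
    "educator": "person",
    "professor": "educator",
    "lecturer": "educator",
    "student": "person",
    "activity": "entity",
    "course": "activity",
    "seminar": "course",
}


def path_to_root(word):
    """Return [word, hypernym, ..., root] by following hypernym links."""
    path = []
    while word is not None:
        path.append(word)
        word = HYPERNYM[word]
    return path


def depth(word):
    """Depth in the taxonomy, with the root at depth 1."""
    return len(path_to_root(word))


def wu_palmer(a, b):
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # The first ancestor of a that is also an ancestor of b is the deepest
    # common subsumer, because path_to_root walks upward from a.
    lcs = next(w for w in path_to_root(a) if w in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def concept_features(target_words, neighbor_words, threshold=0.7):
    """Group words from a target document and its linked neighbors into
    concept features whenever their similarity reaches the threshold."""
    words = sorted(set(target_words) | set(neighbor_words))
    parent = {w: w for w in words}  # simple union-find forest

    def find(w):
        while parent[w] != w:
            w = parent[w]
        return w

    for a, b in combinations(words, 2):
        if wu_palmer(a, b) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for w in words:
        groups.setdefault(find(w), []).append(w)
    # Only groups with more than one word form a new abstracted feature.
    return [tuple(g) for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    target = ["professor", "course"]               # words in the target page
    neighbors = ["lecturer", "seminar", "student"]  # words in linked pages
    print(concept_features(target, neighbors))
```

In this sketch each returned tuple would be added to the feature set as one extra feature alongside the raw words, mirroring the statement that concept features are treated like ordinary features. The threshold of 0.7 is an arbitrary choice for the toy taxonomy.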

The pervasiveness of information on the internet means that an increasing number of documents must be classified. Text categorization is performed not only by domain experts but also by automatic text categorization systems. A text categorization system with a multi-label classifier is therefore needed to process large numbers of documents. This study develops a multi-label text categorization system to classify multi-label documents. Data mapping transforms the data from a high-dimensional space to a lower-dimensional space of paired SVM output values, which lowers the computational complexity. A pairwise comparison approach is then used to set the membership function for each predicted class so that all candidate classes can be judged. To better illustrate the proposed model, a comparative study on Reuters data sets is performed against several multi-label approaches, including Naive Bayes, Multi-Label Mixture, Jaccard Kernel, and BP-MLL. Although the empirical results indicate that the proposed system outperforms the other methods on the overall performance indices, the comparisons were made without knowing the original parameter settings of those methods. The comparative study also finds that the probabilities of documents appearing in correctly predicted classes and in wrongly predicted classes are important properties, and the empirical results suggest that a membership probability of 0.5 is a good criterion for separating correctly from incorrectly classified documents.
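
The abstract does not specify how the membership function is computed, so the sketch below is only one plausible reading: each pairwise SVM decision value is squashed with a logistic function, a class's membership is the mean of its pairwise values, and every class whose membership reaches 0.5 is kept as a label. The functions `memberships` and `predict_labels`, the logistic mapping, and the class names and decision values in the example are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    """Logistic function mapping a decision value to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def memberships(pairwise, classes):
    """Compute a membership value per class from pairwise decision values.

    pairwise[(a, b)] is an assumed SVM output for the 'a versus b' classifier;
    a positive value favors a, a negative value favors b.
    """
    scores = {}
    for c in classes:
        probs = []
        for other in classes:
            if other == c:
                continue
            if (c, other) in pairwise:
                probs.append(sigmoid(pairwise[(c, other)]))
            else:
                # The stored classifier is 'other versus c', so flip the sign.
                probs.append(sigmoid(-pairwise[(other, c)]))
        scores[c] = sum(probs) / len(probs)
    return scores


def predict_labels(pairwise, classes, threshold=0.5):
    """Return every class whose membership reaches the threshold."""
    m = memberships(pairwise, classes)
    return [c for c in classes if m[c] >= threshold]


if __name__ == "__main__":
    # Hypothetical topic names and decision values for a single document.
    classes = ["grain", "trade", "crude"]
    pairwise = {
        ("grain", "trade"): 0.2,
        ("grain", "crude"): 2.0,
        ("trade", "crude"): 1.5,
    }
    print(memberships(pairwise, classes))
    print(predict_labels(pairwise, classes))
```

Because the threshold is applied to each class independently, a document can receive more than one label, which is what distinguishes this from a single-label argmax over the pairwise votes.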

Articles published on Text Classification System

Improving the Performance of an Automatic Web Document Classification System through WordNet-Based Feature Abstraction

Collaborative Filtering Recommender Systems

Researching in Web Technology Classification Based on Improved Support Vector Machine

Creation and Use of Ontology Related to Genes, Syndromes, Diseases and Symptoms for the Classification of Biomedical Texts

DACS Dewey index-based Arabic Document Categorization System

Design of Text Categorization System Based on SVM

Ontology-guided feature engineering for clinical text classification

Semantic Similarity Metric and its Application in Text Classification

Text Classification Combined an Improved CHI and Category Relevance Factor

Efficient representation of text with multiple perspectives

Multiclass Boosting with Adaptive Group‐Based kNN and Its Application in Text Categorization

Text Classification Using Support Vector Machine with Mixture of Kernel

A Novel Category Vector-Based Cross Language Text Categorization Method

Semi-Automatic Labeling of Training Data Sets in Text Classification

Feature sub-set selection metrics for Arabic text classification

Solving multi-label text categorization problem using support vector machine approach with membership function

A Hybrid Arabic Text Summarization Technique Based on Text Structure and Topic Identification

Symbolic rule-based classification of lung cancer stages from free-text pathology reports

A parametric methodology for text classification

Distributed Text Classification With an Ensemble Kernel-Based Learning Approach
