Abstract
Text categorization can be a very difficult problem in many cases. Although many text categorization algorithms have been developed over the history of computer science, they are not always as accurate as expected. Some are highly accurate in particular cases, while others perform well in different ones. In this work, we compare two well-known text categorization methods: the term weighting algorithm and the logistic regression algorithm. The dataset comes from our previous start-up, “Ume Market Network”, an online peer-to-peer e-commerce system that was synchronized with Facebook sales groups. Every offer in this dataset must be categorized as either a sale or a purchase offer; the problem is therefore a classical binary categorization on a text dataset containing formal as well as colloquial expressions in English, Italian, and German. After resolving the ambiguities, the logistic regression algorithm outperformed the term weighting algorithm by around 25% in accuracy.
Highlights
Collection Frequency Factor: There are many more detailed parameters to consider. For example, the original term frequency, IDF (inverse document frequency), and IDF probability (term relevance) are considered in this method
In closely related research [11], Ifrim, Bakir, and Weikum showed that logistic regression works well for categorizing documents using variable-length n-grams of words or characters, where learning involves automatic tokenization
Standard word tokenization is commonly applied to the training set in text categorization before the learning algorithm is run
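The two kinds of features mentioned in these highlights, word tokens and variable-length character n-grams, can be illustrated with a minimal sketch. The function names `word_tokens` and `char_ngrams` and the default length range are hypothetical choices made for illustration; this is not the authors' code, and it only enumerates n-grams rather than selecting them during learning as in [11].

```python
import re


def word_tokens(text):
    """Lowercase the text and split it into word tokens (runs of letters, digits, underscore)."""
    return re.findall(r"\w+", text.lower())


def char_ngrams(text, n_min=2, n_max=4):
    """Return all character n-grams with lengths n_min..n_max (assumed range), shortest first."""
    s = text.lower()
    return [s[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(s) - n + 1)]
```

Character n-grams need no language-specific rules, which may be convenient for a mixed English, Italian, and German dataset, although the number of features grows quickly with the maximum length.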
Summary
There are many more detailed parameters to consider. For example, the original term frequency, IDF (inverse document frequency), and IDF probability (term relevance) are considered in this method. The factors used are TF (term frequency) and IDF (multiply TF by an inverse document frequency factor). The IDF factor varies inversely with the number of documents n_i that contain the term t_i in a collection of N documents and is typically computed as log(N/n_i). In closely related research [11], Ifrim, Bakir, and Weikum showed that logistic regression works well for categorizing documents using variable-length n-grams of words or characters, where learning involves automatic tokenization. They tried to solve this problem using n-gram logistic regression with a gradient ascent approach. Such an offer could contain pictures, which are a good source of information to extract
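The TF and IDF factors described above can be sketched in a few lines. This is a minimal illustration, assuming raw term counts for TF and the natural logarithm for log(N/n_i); the function name `tf_idf` and the toy offers in the usage note are hypothetical, and the paper's exact weighting variant may differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per document."""
    n_docs = len(docs)  # N: number of documents in the collection
    # n_i: number of documents that contain term t_i (each document counted once)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights
```

For three toy offers such as `[["sell", "bike"], ["buy", "bike"], ["sell", "sell", "phone"]]`, a term that occurs in only one offer gets the largest IDF, log(3), while a term that occurs in every document of a collection gets weight zero.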