Abstract

Text categorization can be a very difficult problem in many cases. Although many text categorization algorithms have been developed over the history of computer science, they are not always as accurate as we expect: some are highly accurate in special cases, while others perform well in different settings. In this work we compare two well-known text categorization methods: the first is the classic term weighting algorithm and the second is the logistic regression algorithm. The dataset comes from our previous start-up, “Ume Market Network”, an online peer-to-peer e-commerce system that was synchronized with Facebook sales groups. Every offer in this dataset must be categorized as a sale or purchase offer; the problem is therefore a classical binary categorization task on a text dataset of formal as well as colloquial expressions in English, Italian, and German. After resolving all the ambiguities, the logistic regression algorithm outperformed the term weighting algorithm by around 25% in accuracy.

Highlights

  • Collection Frequency Factor: there are many more detailed parameters to consider; for example, the original term frequency, IDF (inverse document frequency), and IDF probability (term relevance) are considered in this method

  • In a similar study [11], Ifrim, Bakir, and Weikum showed that logistic regression performs well in categorizing documents using variable-length word or character n-grams, with tokenization learned automatically

  • The standard method of word tokenization is commonly used in text categorization to prepare the training set before the learning algorithm is applied


Summary

Collection Frequency Factor

There are many more detailed parameters to consider; for example, the original term frequency, IDF (inverse document frequency), and IDF probability (term relevance) are considered in this method. The factors used are: TF, the term frequency, multiplied by an inverse document frequency (IDF). The IDF factor varies inversely with the number of documents ni that contain the term ti in a collection of N documents and is typically computed as log(N/ni). In a similar study [11], Ifrim, Bakir, and Weikum showed that logistic regression performs well in categorizing documents using variable-length word or character n-grams, with tokenization learned automatically; they solved this problem with n-gram logistic regression trained by a gradient ascent approach. An offer may also contain pictures, which are a good source of information to extract.
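The TF × IDF weighting described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy documents and tokenization are invented for the example, and TF is taken as the raw term count.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a small tokenized corpus.

    TF is the raw term frequency in a document, and IDF = log(N / ni),
    where N is the number of documents and ni the number of documents
    that contain the term ti.
    """
    N = len(docs)
    # ni: in how many documents each term appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

# Hypothetical sale/purchase offers, already tokenized
docs = [
    ["selling", "used", "bike", "cheap"],
    ["looking", "to", "buy", "a", "bike"],
    ["selling", "phone", "cheap"],
]
w = tf_idf(docs)
# "bike" appears in 2 of 3 documents, once in the first one,
# so its weight there is 1 * log(3/2).
```

Note that a term occurring in every document gets IDF = log(N/N) = 0, which is exactly the "collection frequency" intuition: terms spread across the whole collection carry no discriminative weight.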
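The gradient ascent training mentioned for the logistic regression approach can likewise be sketched. This is a simplified stand-in, not the method of [11]: the n-gram feature extraction is omitted, the feature rows and labels below are invented, and plain per-example gradient ascent on the log-likelihood is used.

```python
import math

def train_logreg(X, y, lr=0.1, epochs=200):
    """Binary logistic regression trained by gradient ascent on the
    log-likelihood. X holds already-vectorized feature rows, y holds
    0/1 labels (e.g. purchase = 0, sale = 1)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
            # Gradient of the log-likelihood w.r.t. the weights: (y - p) * x
            g = yi - p
            w = [wj + lr * g * xj for wj, xj in zip(w, xi)]
            b += lr * g
    return w, b

def predict(w, b, xi):
    z = b + sum(wj * xj for wj, xj in zip(w, xi))
    return 1 if z >= 0 else 0

# Toy sale(1)/purchase(0) examples; each row counts
# occurrences of the hypothetical features ["sell", "buy"]
X = [[2, 0], [1, 0], [0, 2], [0, 1]]
y = [1, 1, 0, 0]
w, b = train_logreg(X, y)
```

After training, offers dominated by the "sell" feature are classified as sales and those dominated by "buy" as purchases; the binary decision is just the sign of the linear score.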

The Problem
Data Structure
General Statistics about the Dataset
Pictures
Offers on the Platform
Problem Solving Approach
Findings
Conclusion