Abstract
In information retrieval (IR), the language model is an alternative to the vector space model and other probabilistic term weighting models. Its basic principle is to construct a model for each document and to rank the documents by a score estimated from that model. The score represents the likelihood that the query was generated from the given document. Because of its simplicity and effectiveness, the language model is an attractive basis for developing new text retrieval strategies. In text classification, which employs methods from the IR domain, documents are generally represented through the vector space model (VSM). The success of the VSM depends on term weighting, an important step that quantifies the contribution of a term to the semantics of a text. In this paper, we investigate the use of the language model for term weighting and its effect on text classification performance. We compare language model based term weighting with several popular traditional term weighting methods, including Binary, TF (Term Frequency), and TF*IDF (Term Frequency-Inverse Document Frequency), on three different Turkish datasets. Our experimental results show that language model based term weighting generally outperforms the traditional methods, with the exception of binary weighting.
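To make the query-likelihood principle concrete, the following is a minimal sketch rather than the method evaluated in this paper. It assumes a unigram document model with Jelinek-Mercer smoothing against the collection; the smoothing choice, the mixing weight `lam`, the function name `query_log_likelihood`, and the toy English documents are all illustrative assumptions.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection_tf, collection_len, lam=0.5):
    """Return log P(query | doc) under a smoothed unigram document model.

    Assumption: Jelinek-Mercer smoothing, i.e. a linear mix of the document
    model and the collection model, weighted by the hypothetical `lam`.
    """
    doc_tf = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_coll = collection_tf[term] / collection_len
        p = (1.0 - lam) * p_doc + lam * p_coll
        if p == 0.0:
            # Term unseen in the whole collection: the query cannot be generated.
            return float("-inf")
        score += math.log(p)
    return score


# Toy collection (illustrative only, not one of the paper's datasets).
docs = {
    "d1": "language model for retrieval".split(),
    "d2": "vector space model for classification".split(),
}
collection_tf = Counter(term for doc in docs.values() for term in doc)
collection_len = sum(collection_tf.values())

query = "language model".split()
ranking = sorted(
    docs,
    key=lambda name: query_log_likelihood(
        query, docs[name], collection_tf, collection_len
    ),
    reverse=True,
)
print(ranking)
```

In this sketch, the document whose model assigns the query the higher likelihood is ranked first. A per-term probability of this kind is one plausible starting point for a language model based term weight, although the exact weighting scheme studied in the paper is not specified in this abstract.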