Abstract

Automatic Document Classification (ADC) is still one of the major information retrieval problems. It usually employs a supervised learning strategy, where we first build a classification model using pre-classified documents and then use this model to classify unseen documents. The majority of supervised algorithms consider that all documents provide equally important information. However, in practice, a document may be considered more or less important to build the classification model according to several factors, such as its timeliness, the venue where it was published in, its authors, among others. In this paper, we are particularly concerned with the impact that temporal effects may have on ADC and how to minimize such impact. In order to deal with these effects, we introduce a temporal weighting function (TWF) and propose a methodology to determine it for document collections. We applied the proposed methodology to ACM-DL and Medline and found that the TWF of both follows a lognormal. We then extend three ADC algorithms (namely kNN, Rocchio and Naïve Bayes) to incorporate the TWF. Experiments showed that the temporally-aware classifiers achieved significant gains, outperforming (or at least matching) state-of-the-art algorithms.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.