Abstract

Classifying documents into a large-scale web taxonomy is a challenging research problem because of the large number of categories and associated documents in the taxonomy. The state-of-the-art solution, known as the narrow-down approach, uses a search engine to reduce the entire category hierarchy to the most relevant categories and then selects the best one among them using a classifier. A recent language modelling approach used top-level category information (global information) to judge the appropriateness of a local category, which led to performance improvements. However, we observe that global information has limited influence on the final category selection under some conditions. First, global information may be inaccurate even though it is generated by a top-level category classifier that uses the entire hierarchy. Second, it has little influence when two competing categories share the same top-level category or when the local category information dominates the final category selection. To resolve these limitations, we propose two external methods: (1) a meta-classifier, based on an ensemble learning framework, with novel dependency features among top-level categories; and (2) a query modification model, based on a statistical feedback method, that improves the representation of the query document rather than only rearranging information already in the hierarchy. Our methods were evaluated using the Open Directory Project test collection.
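
The narrow-down approach and the role of global information described above can be illustrated with a minimal sketch. This is not the method evaluated in the paper: the cosine scoring, the linear interpolation weight `lam`, the function names, and the toy taxonomy are all assumptions made for illustration. A term-overlap search stands in for the search engine, and an interpolated local/global score stands in for the classifier.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_level(path: str) -> str:
    """Top-level category of a path such as 'Science/Physics'."""
    return path.split("/")[0]


def narrow_down(query_text: str, categories: dict, k: int = 3, lam: float = 0.7) -> str:
    """Hypothetical two-stage selection: search, then rerank candidates."""
    query = Counter(query_text.lower().split())

    # Local information: similarity between the query and each category's text.
    local = {
        path: cosine(query, Counter(text.lower().split()))
        for path, text in categories.items()
    }

    # Stage 1 (search): keep only the k most similar categories.
    candidates = sorted(local, key=local.get, reverse=True)[:k]

    # Global information: similarity to the pooled text of each top-level category.
    pooled = {}
    for path, text in categories.items():
        pooled.setdefault(top_level(path), Counter()).update(text.lower().split())
    global_score = {top: cosine(query, bag) for top, bag in pooled.items()}

    # Stage 2 (selection): interpolate local and global scores (assumed form).
    def combined(path: str) -> float:
        return lam * local[path] + (1 - lam) * global_score[top_level(path)]

    return max(candidates, key=combined)


# Toy taxonomy, invented purely for this example.
toy_categories = {
    "Science/Physics": "quantum particle energy physics",
    "Science/Biology": "cell gene protein biology",
    "Sports/Soccer": "goal match league soccer",
    "Arts/Music": "song album band energy",
}

best = narrow_down("quantum energy of a particle", toy_categories)
```

In this sketch, two candidates under the same top-level category receive the same global term, so only their local scores can separate them. That mirrors the second limitation noted in the abstract.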
