Abstract

We address the problem of integrating web taxonomies from different real Internet applications. Integrating web taxonomies is to transfer instances from a source to target taxonomy. Unlike the conventional text categorization problem, in taxonomy integration, the source taxonomy contains extra information that can be used to improve the categorization. The major existing methods can be divided in two types: those that use neighboring categories to smooth the document term vector and those that consider the semantic relationship between corresponding categories of the target and source taxonomies to facilitate categorization. In contrast to the first type of approach, which only uses a flattened hierarchy for smoothing, we apply a hierarchy shrinkage algorithm to smooth child documents by their parents. We also discuss the effect of using different hierarchical levels for smoothing. To extend the second type of approach, we extract fine-grain semantic relationships, which consider the relationships between lower-level categories. In addition, we use the cosine similarity to measure the semantic relationships, which achieves better performance than existing methods. Finally, we integrate the existing approaches and the proposed methods into one machine learning model to find the best feature configuration. The results of experiments on real Internet data demonstrate that our system outperforms standard text classifiers by about 10%.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.