Abstract
Ontology-based knowledge bases such as WordNet and Wikipedia are widely used to measure document similarity. However, several challenges remain, including polysemy, synonymy, and high dimensionality. In this paper, a novel method for calculating the similarity of text documents is proposed. The proposed system exploits an ontological framework to assess the similarity between terms accurately. A modified method for concept extraction using WordNet and Wikipedia is also proposed. Each text document is represented as a conceptual coexistence graph. An index is constructed over the large set of concept and term associations to support scalability and simplify computation. Graph similarity is calculated using vertex similarity. The integrated approach can identify the theme of a document from its disambiguated and extracted concepts. The approach was evaluated on the 20 Newsgroups dataset and on self-generated datasets. Results show that it performs significantly better than the bag-of-words approach.
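The graph representation and vertex-based comparison can be illustrated with a minimal sketch. The abstract does not define how the coexistence graph is built or which vertex similarity measure is used, so everything here is an assumption made for illustration: concepts are taken as already extracted and disambiguated, two concepts are linked when they appear in the same sentence, and vertex similarity is the Jaccard overlap of closed neighbourhoods. The function names `build_coexistence_graph`, `jaccard`, and `graph_similarity` are hypothetical and are not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations


def build_coexistence_graph(sentences):
    """Map each concept to the set of concepts it shares a sentence with.

    `sentences` is a list of concept lists, one per sentence (assumed to be
    the output of a separate concept extraction step).
    """
    graph = defaultdict(set)
    for concepts in sentences:
        unique = set(concepts)
        for concept in unique:
            graph[concept]  # touch the key so isolated concepts become vertices
        for a, b in combinations(sorted(unique), 2):
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def jaccard(left, right):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def graph_similarity(g1, g2):
    """Average vertex similarity over all vertices found in either graph.

    Assumed measure: for a vertex present in both graphs, compare its closed
    neighbourhood (the vertex plus its neighbours) in each graph; a vertex
    present in only one graph contributes 0.
    """
    vertices = set(g1) | set(g2)
    if not vertices:
        return 0.0
    total = 0.0
    for v in vertices:
        if v in g1 and v in g2:
            total += jaccard(g1[v] | {v}, g2[v] | {v})
    return total / len(vertices)


doc_a = build_coexistence_graph([["bank", "money"], ["money", "loan"]])
doc_b = build_coexistence_graph([["bank", "money"]])
doc_c = build_coexistence_graph([["river", "water"]])

score_ab = graph_similarity(doc_a, doc_b)  # partial overlap
score_ac = graph_similarity(doc_a, doc_c)  # no shared concepts
```

Averaging over the union of vertices penalises concepts that appear in only one document, which is one simple way to keep the score between 0 and 1. Other vertex similarity measures could be substituted in `graph_similarity` without changing the graph construction.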