Abstract

Identifying the similarity between two documents is a challenging but important task that benefits applications such as recommender systems and plagiarism detection. One popular approach to processing text documents is the document-term matrix (DTM). The proposed approach processes Sanskrit, one of the oldest and most morphologically complex languages and one that remains largely untouched in this area, and builds a document-term matrix for Sanskrit (DTMS) and a document-synset matrix for Sanskrit (DSMS). DTMS uses term frequencies, whereas DSMS uses synset frequencies in place of term frequencies and thereby reduces the dimensionality. The proposed approach considers the semantics and context of the corpus to address the problem of polysemy. More than 760 documents, including Subhashitas and stories, are processed together. F1 score, precision, accuracy, and the Matthews correlation coefficient (MCC), the most balanced of these measures, are used to demonstrate the improvement offered by the proposed approach.
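
To illustrate how counting synsets rather than terms can shrink the matrix, the following is a minimal sketch and not the authors' implementation. The toy corpus, the transliterated tokens, the `synset_of` lookup, and the `build_matrix` helper are all hypothetical; a real pipeline would need Sanskrit morphological analysis and context-aware sense selection to handle polysemy, which this one-synset-per-term lookup does not attempt.

```python
from collections import Counter

# Hypothetical toy corpus: transliterated, already-tokenized documents.
docs = {
    "d1": ["jala", "nadi", "jala"],
    "d2": ["udaka", "nadi"],
    "d3": ["vari", "surya"],
}

# Hypothetical synset lookup: three synonyms for "water" share one synset ID.
synset_of = {
    "jala": "S_water", "udaka": "S_water", "vari": "S_water",
    "nadi": "S_river", "surya": "S_sun",
}

def build_matrix(docs, key=lambda token: token):
    """Return (columns, rows), where rows[doc] holds the count of each column."""
    counts = {d: Counter(key(t) for t in tokens) for d, tokens in docs.items()}
    columns = sorted({c for counter in counts.values() for c in counter})
    rows = {d: [counter[c] for c in columns] for d, counter in counts.items()}
    return columns, rows

# Document-term matrix: one column per distinct term.
dtm_cols, dtm = build_matrix(docs)

# Document-synset matrix: one column per distinct synset.
dsm_cols, dsm = build_matrix(docs, key=lambda t: synset_of.get(t, t))

print(len(dtm_cols), "term columns vs", len(dsm_cols), "synset columns")
```

In this toy example the three synonyms for water collapse into a single column, so documents that share no surface term can still overlap in the synset matrix.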
