Abstract

Objective: As document corpora grow, data management and data assurance must be highly interoperable so that critical documents can be retrieved from a vast range of services. Focusing on semantic features, which can improve the accuracy of document tracing and retrieval, offers an effective way to address the issues and limitations of present models. Methods/Statistical Analysis: This study examines the robustness and efficacy of clustering techniques based on semantic features, compared with other kinds of clustering techniques. It proposes concept-based document clustering by corpus utility scale (COCUS). The utility scale in COCUS is derived from a topic-related set of selected documents that serves as a knowledge base, which enables documents to be clustered by their concept relevancy. The proposed clustering model is assessed with the state-of-the-art metrics of cluster purity, inverse purity, and cluster-level harmonic mean. Experiments were carried out on datasets comprising a specific kind of literature gathered from open access journals of various publishers. In total, 1509 documents were collected; 497 of them were used as the knowledge base and the remaining 1012 were used for the clustering process. Findings: The experimental study indicates that the proposed model is scalable and robust. The purity and harmonic mean of the resulting clusters confirm that COCUS clusters documents by their concept relevancy with 94% accuracy (the average topic-level harmonic mean of the clusters was 0.94). Application/Improvements: The computational complexity of COCUS is shown to be linear, whereas the majority of the benchmark models are found to be NP-hard.
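The three evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the common textbook definitions: purity as the share of documents that fall in the majority topic of their cluster, inverse purity as the same measure with the roles of clusters and topics swapped, and the harmonic mean taken over those two values. The abstract does not give the exact formulas used for COCUS, so the function names and the toy data below are illustrative assumptions only, not the authors' implementation.

```python
from collections import Counter


def purity(clusters, topics):
    """Share of documents that belong to the majority topic of their cluster.

    `clusters` and `topics` are parallel sequences: one cluster id and one
    ground-truth topic label per document (assumed input format).
    """
    counts_per_cluster = {}
    for cluster_id, topic in zip(clusters, topics):
        counts_per_cluster.setdefault(cluster_id, Counter())[topic] += 1
    majority_total = sum(max(c.values()) for c in counts_per_cluster.values())
    return majority_total / len(topics)


def inverse_purity(clusters, topics):
    """Purity with the roles of clusters and topics swapped."""
    return purity(topics, clusters)


def harmonic_mean(p, ip):
    """Harmonic mean of purity and inverse purity (0.0 if both are zero)."""
    return 0.0 if p + ip == 0 else 2 * p * ip / (p + ip)


# Toy data (hypothetical): six documents, three clusters, two topics.
clusters = [0, 0, 1, 1, 2, 2]
topics = ["a", "a", "a", "a", "b", "b"]

p = purity(clusters, topics)
ip = inverse_purity(clusters, topics)
print(p, ip, harmonic_mean(p, ip))
```

In this toy case every cluster holds a single topic, so purity is perfect, while topic "a" is split across two clusters, which lowers inverse purity. The harmonic mean penalizes that imbalance, which is why it is a stricter summary than purity alone.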
