Abstract

Identifying similar text passages plays an important role in many applications in NLP, such as paraphrase generation, automatic summarization, etc. This paper presents some experiments on detecting and clustering similar sentences of texts in Brazilian Portuguese. We propose an evalution framework based on an incremental and unsupervised clustering method which is combined with statistical similarity metrics to measure the semantic distance between sentences. Experiments show that this method is robust even to treat small data sets. It has achieved 86% and 93% of F-measure and Purity, respectively, and 0.037 of Entropy for the best case.KeywordsSentence SimilaritySentence ClusteringStatistical Metrics

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call