Text Similarity Measures in News Articles by Vector Space Model Using NLP

Ritika Singh, Satwinder Singh

doi:10.1007/s40031-020-00501-5

Abstract

The present global size of online news websites is more than 200 million. According to MarketingProfs, more than 2 million articles are published on the web every day. News websites also make editorial decisions about which articles to display on their home pages and which to highlight, e.g., a larger text size for main news articles. Many of the articles posted on one news website are very similar to those on many other news websites. The selective reporting of top news headlines and the similarity of news across news organizations are well recognized but not well quantified. This paper identifies the top news items on news sites and measures the similarity between two news items in two languages (Hindi and English) that refer to the same event. To accomplish this, a highlighted headline and link extractor has been created to extract top news in both Hindi and English from Google's news feed. First, the Hindi news article is translated into English using Google Translate and then compared with the English news articles. Second, the cosine similarity, Jaccard similarity, and Euclidean distance measures are used to calculate a news similarity score. The frequencies of nouns and of the words that immediately follow them in the news articles are also extracted. Our methodology shows that top news articles can be identified efficiently and that the similarity between news reports can be measured.
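To make the three measures concrete, the following is a minimal Python sketch and not the implementation used in this work. It assumes a simple lowercase word tokenizer and raw term-frequency vectors; the function names (`tokenize`, `cosine_similarity`, `jaccard_similarity`, `euclidean_distance`) and the two sample sentences are illustrative assumptions, and the weighting scheme and preprocessing of the actual pipeline may differ.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the raw term-frequency vectors of two texts."""
    tf_a, tf_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction in the vector space
    return dot / (norm_a * norm_b)


def jaccard_similarity(text_a, text_b):
    """Size of the shared vocabulary divided by the size of the combined vocabulary."""
    set_a, set_b = set(tokenize(text_a)), set(tokenize(text_b))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def euclidean_distance(text_a, text_b):
    """Straight-line distance between the raw term-frequency vectors (0 means identical)."""
    tf_a, tf_b = Counter(tokenize(text_a)), Counter(tokenize(text_b))
    # Counter returns 0 for terms that are missing from one of the texts.
    return math.sqrt(sum((tf_a[t] - tf_b[t]) ** 2 for t in tf_a.keys() | tf_b.keys()))


if __name__ == "__main__":
    # Hypothetical headlines: one written in English, one standing in for a translated article.
    english = "Heavy rain floods several districts in the state"
    translated = "Several districts of the state flooded after heavy rain"
    print("cosine:   ", cosine_similarity(english, translated))
    print("jaccard:  ", jaccard_similarity(english, translated))
    print("euclidean:", euclidean_distance(english, translated))
```

Cosine and Jaccard are similarities, where higher values mean the articles are closer, whereas Euclidean distance is a dissimilarity, where lower values mean the articles are closer, so the three scores need to be read on their own scales.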

Full Text