Abstract

When the graph-based TextRank algorithm constructs the edges of its word graph, the co-occurrence window rule considers only the relationships between local terms, so the information in the document itself is used only to a limited extent. To address this problem, an improved TextRank keyword extraction algorithm based on rough data reasoning combined with word vector clustering, RDD-WRank, is proposed. First, the algorithm uses rough data reasoning to mine the associations between candidate keywords, which expands the search scope and makes the results more comprehensive. Then, drawing on Wikipedia as an open online knowledge base, Word2Vec word embeddings are integrated into the improved algorithm, and the word vectors of the TextRank lexical graph nodes are clustered to adjust the voting importance of the nodes in each cluster. Experimental results show that, compared with the traditional TextRank algorithm and with TextRank combined with Word2Vec, the improved algorithm significantly improves extraction accuracy, which shows that rough data reasoning can effectively improve the keyword extraction performance of the algorithm.

Highlights

  • In this information age, people’s lives are full of information

  • Improved Algorithm Using Word Vector Based on Rough Data-Deduction. The classic TextRank algorithm constructs the graph model of candidate keywords through the co-occurrence relationship and iteratively calculates the weight of each node through the average transition probability matrix until it converges. This approach is relatively simple and effective, but it has certain limitations. The rule of the co-occurrence window only considers the correlation between local words, so some words that are locally associated with certain keywords may be extracted

  • Experimental Data. The experiment selected the Wikipedia Chinese corpus released in February 2020, “zhwiki-20200201-pages-articles-multistream.xml.bz2”, to train Chinese word vectors [43, 44]; it contains a main file of 1.9 GB
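
The classic TextRank procedure mentioned in the highlights (a co-occurrence graph over candidate keywords, followed by iterative weight updates until convergence) can be sketched roughly as follows. This is a minimal illustration of the baseline only, not the RDD-WRank algorithm itself. The function names, the window size, the damping factor of 0.85, and the convergence tolerance are all assumptions chosen for the example and are not taken from the paper.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Connect words that appear within `window` consecutive positions.

    `window` is an assumed parameter: window=2 links only adjacent words.
    """
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1 : i + window]:
            if other != word:
                # Undirected, unweighted edge between co-occurring words.
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)


def textrank(graph, damping=0.85, tol=1e-6, max_iter=100):
    """Iterate the TextRank update until scores change by less than `tol`."""
    if not graph:
        return {}
    scores = {node: 1.0 for node in graph}
    for _ in range(max_iter):
        new_scores = {}
        for node in graph:
            # Each neighbor splits its current score evenly among its edges.
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) + damping * incoming
        delta = max(abs(new_scores[n] - scores[n]) for n in graph)
        scores = new_scores
        if delta < tol:
            break
    return scores


if __name__ == "__main__":
    # Hypothetical, already-segmented candidate keywords.
    tokens = ["graph", "model", "keyword", "graph", "model", "extraction"]
    ranked = textrank(build_cooccurrence_graph(tokens))
    for word, score in sorted(ranked.items(), key=lambda kv: -kv[1]):
        print(f"{word}: {score:.3f}")
```

Because edges come only from the sliding window, a word receives votes solely from its immediate neighbors in the text. This is the locality limitation that the highlights attribute to the co-occurrence window rule.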

Summary

Introduction

People’s lives are full of information. Faced with such a huge amount of data, it is important to quickly and accurately obtain the content that is interesting and valuable to us. To further improve the keyword extraction performance of the TextRank algorithm, Literature [18] proposed PositionRank, an unsupervised model for extracting keywords from academic documents, which biases PageRank using information from all positions at which a word appears. Literature [28] proposed a supervised hybrid clustering algorithm combining cuckoo search and k-means, which divides the data samples into clusters so as to provide training subsets with high diversity. Literature [29] merged the word2vec model into the traditional TextRank algorithm by using word embedding technology to improve the accuracy of keyword extraction.

Research Theory
Rough Data-Deduction
Experimental Data and Evaluation Criteria
Experimental Results and Analysis