Abstract

Text representation is key to text processing. Scientific papers have distinctive structural features: their internal components, including titles, abstracts, keywords, and main texts, carry different degrees of importance, while external structural features, such as topics and authors, also hold value for analysis. However, most traditional methods analyze scientific papers through keyword co-occurrence and citation links, which consider only partial information. Research on the textual and external structural information of scientific papers is lacking, which has prevented deep exploration of their inherent patterns. Therefore, this paper proposes the Multi-Layers Paragraph Vector (MLPV), a text representation method for scientific papers based on Doc2vec and on structural information covering both internal and external structures, and constructs five text representation models: PV-NO, PV-TOP, PV-TAKM, MLPV, and MLPV-PSO. The results show that the MLPV model performs much better than the PV-NO, PV-TOP, and PV-TAKM models. The average accuracy of the MLPV model is both more stable and higher, reaching 91.71%, which demonstrates its validity. Building on the MLPV model, the optimized MLPV-PSO model achieves an accuracy 3.33% higher than MLPV, demonstrating the effectiveness of the optimization algorithm.

Highlights

  • Before natural language processing (NLP) can be performed, text, as unstructured data, must be transformed into structured data that computers can recognize; this process is called text representation

  • This paper proposes the Multi-Layers Paragraph Vector (MLPV) model, based on structural information and the Paragraph Vector model (PV model, Doc2vec), to represent scientific papers


Introduction

Before natural language processing (NLP) can be performed, text, as unstructured data, must be transformed into structured data that computers can recognize; this process is called text representation. Text representation is a basic and important part of NLP, and its quality directly influences the effectiveness of text semantic analysis tasks such as text classification, text clustering, automatic extraction of summaries and keywords, and calculation of text similarity. It has attracted extensive attention from scholars and has made great progress. The traditional text representation models that have been widely used mainly include the Boolean logic model, the probability model, the vector space model, and the N-gram model. Recent research is mostly based on distributed representations of individual words or continuous words, or on deep learning.
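To make the idea of text representation concrete, the sketch below shows the classic vector space model mentioned above: each document is mapped to a term-frequency vector over a shared vocabulary. This is a minimal illustration in plain Python, not the paper's MLPV method (which builds on Doc2vec); the function names and the toy documents are our own.

```python
from collections import Counter

def build_vocab(docs):
    """Collect the sorted set of unique tokens across all documents
    and assign each token a fixed dimension index."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}

def bow_vector(doc, vocab):
    """Map one document to a term-frequency vector over the vocabulary.
    Unstructured text becomes a fixed-length numeric vector."""
    counts = Counter(doc.split())
    return [counts.get(tok, 0) for tok in sorted(vocab, key=vocab.get)]

docs = ["deep learning for text", "text representation for nlp"]
vocab = build_vocab(docs)
vectors = [bow_vector(d, vocab) for d in docs]
# vocab dimensions: deep, for, learning, nlp, representation, text
# vectors[0] → [1, 1, 1, 0, 0, 1]
# vectors[1] → [0, 1, 0, 1, 1, 1]
```

Models like Doc2vec improve on this sparse, order-insensitive encoding by learning dense paragraph vectors in which semantically similar documents lie close together.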
