Abstract
Software knowledge communities contain a large number of software knowledge entities with complex structures and rich semantic relations. Extracting the semantic relations between software knowledge entities is a critical task in software knowledge graph construction, and it has an important impact on knowledge-graph-based tasks such as software document generation and software expert recommendation. Because the user-generated content of software knowledge communities suffers from entity sparsity, relation ambiguity, and a lack of annotated datasets, existing relation extraction methods are difficult to apply in the software knowledge domain. To address these issues, we propose a novel software knowledge entity relation extraction model that combines entity-aware information with syntactic dependency information. A Bidirectional Gated Recurrent Unit (Bi-GRU) and a Graph Convolutional Network (GCN) are used to learn the contextual semantic representation and the syntactic dependency representation, respectively. To capture more syntactic dependency information, a weighted graph convolutional network based on Newton's law of cooling is constructed by computing a weighted adjacency matrix. In addition, an entity-aware attention mechanism is proposed to integrate the entity information and the syntactic dependency information, improving the prediction performance of the model. Experiments on a dataset constructed from StackOverflow texts show that the proposed model outperforms the benchmark models.
Highlights
As a successful software knowledge community, StackOverflow provides a platform for software developers to exchange and share knowledge about software programming, configuration management, and project organization, and it has gradually developed into an important knowledge base in the software field [1]. The social text of StackOverflow contains a large number of specific software knowledge entities with complex structures and rich semantic relations.
Compared with Q&A text, tagWiki text is well standardized and covers domain knowledge thoroughly; it is used to describe the definitions of the various tags and their related resources in StackOverflow. Therefore, we construct the annotated dataset for software knowledge entity relation extraction from both the Q&A text and the tagWiki text of StackOverflow. The detailed construction process is as follows.
Based on an analysis of the syntactic dependency structure, we introduce a Graph Convolutional Network (GCN) to model the syntactic dependency structure of the sentence sequence, and we assign different weights to the entries of the adjacency matrix according to the distance between nodes, which strengthens the representation of syntactic dependencies between nodes. Therefore, on top of the Bidirectional Gated Recurrent Unit (Bi-GRU) model, we compare the performance of software knowledge entity relation extraction with the GCN model and with the weighted GCN model. The experimental results are shown in Table 4 and Figure 4.
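The distance-based weighting described above can be illustrated with a small sketch. The text here does not give the exact weighting formula, so the code assumes an exponential decay `exp(-decay * d)` in the hop distance `d` between two nodes of the dependency tree, modeled on the exponential form of Newton's law of cooling. The function names, the `decay` and `max_hops` parameters, and the toy dependency parse are all hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
from collections import deque

import numpy as np


def tree_distances(n_tokens, edges):
    """All-pairs hop distances over an undirected dependency tree (BFS per node)."""
    neighbors = [[] for _ in range(n_tokens)]
    for head, dependent in edges:
        neighbors[head].append(dependent)
        neighbors[dependent].append(head)
    dist = np.full((n_tokens, n_tokens), np.inf)
    for start in range(n_tokens):
        dist[start, start] = 0.0
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if np.isinf(dist[start, nxt]):
                    dist[start, nxt] = dist[start, node] + 1.0
                    queue.append(nxt)
    return dist


def weighted_adjacency(dist, decay=0.5, max_hops=3):
    """Assumed weighting: exp(-decay * distance), zero beyond max_hops.

    The exponential form is an assumption inspired by Newton's law of
    cooling; the self-loop (distance 0) gets weight 1.
    """
    weights = np.exp(-decay * dist)  # unreachable pairs (inf) become 0
    weights[dist > max_hops] = 0.0
    return weights


def gcn_layer(adj, features, weight):
    """One graph convolution: degree-normalized neighbor aggregation, then ReLU."""
    degree = adj.sum(axis=1, keepdims=True)  # >= 1 because of self-loops
    return np.maximum(0.0, (adj @ features @ weight) / degree)


# Toy parse of "jQuery is a JavaScript library" with "library" (index 4) as root.
edges = [(4, 0), (4, 1), (4, 2), (4, 3)]
dist = tree_distances(5, edges)
adj = weighted_adjacency(dist, decay=0.5, max_hops=3)

rng = np.random.default_rng(0)
token_features = rng.normal(size=(5, 8))  # stand-in for Bi-GRU hidden states
layer_weight = rng.normal(size=(8, 4))
hidden = gcn_layer(adj, token_features, layer_weight)
print(hidden.shape)
```

Setting `max_hops=1` reduces this to a plain GCN adjacency with scaled edge weights, which makes it easy to compare the unweighted and weighted variants on the same input.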
Summary
As a successful software knowledge community, StackOverflow provides a platform for software developers to exchange and share knowledge about software programming, configuration management, and project organization, and it has gradually developed into an important knowledge base in the software field [1]. The social text of StackOverflow contains a large number of specific software knowledge entities with complex structures and rich semantic relations. The machine learning-based relation extraction method uses feature engineering and annotated data to achieve better performance, which effectively reduces the dependence on linguistic and domain knowledge and gives it strong domain migration ability. Zhao et al. [18] proposed a relation triple extraction framework for the software engineering field that combines a dependency parser with rule-based methods. In this framework, a Support Vector Machine (SVM) is used as a classifier to evaluate the domain relevance of candidate relation triples, and a software knowledge graph covering 35,279 relation triples, 44,800 concepts, and 9,660 verb phrases is constructed by combining text features, corpus features, concept features, and source features. In contrast to fields such as financial investment, science education, and biomedicine, publicly available annotated datasets and suitable models are lacking for the software engineering field.