Abstract

Wikipedia contains a large number of concepts and rich semantic information. A number of knowledge base construction projects, such as WikiTaxonomy, DBpedia, and YAGO, have acquired data from Wikipedia. Despite the huge number of relations in Wikipedia, semantic relations (i.e., subsumptions) between domain concepts are rather sparse, especially in the software engineering (SE) area. Hence, it is difficult to derive a software engineering knowledge base directly from Wikipedia. Meanwhile, domain knowledge bases have become indispensable to a growing number of applications in software engineering, so discovering the missing semantic relations between software engineering concepts in Wikipedia is essential. In this paper, we propose an approach to automatically discover the missing subsumption relations between software engineering concepts. Specifically, we first extract the SE domain concepts from Wikipedia. Second, we design a machine-learning-based algorithm with several novel features to calculate the semantic relevance between concepts. Third, we propose a semi-supervised model that incorporates these features to discover the SE subsumptions. Experimental results show that our approach can effectively find the missing subsumption relations between software engineering concepts. Finally, we build a taxonomy that contains 193,593 concepts together with 357,662 subsumption relations. Compared with taxonomies extracted from general-purpose knowledge bases such as WikiTaxonomy, YAGO, and Schema.org, our dataset has a larger scale in the software engineering domain.

Index Terms: Subsumption Extraction, Software Engineering, Wikipedia
