Abstract

The Wikipedia category system was designed to support browsing and navigation of Wikipedia. It is also a useful resource for knowledge organisation and document indexing, especially with automatic approaches, but it has received little attention as a resource for manual indexing. In this article, a three-level hierarchical taxonomy is extracted from the Wikipedia category system and explored as a lightweight alternative to expert-created knowledge organisation systems (e.g. library classification systems) for the manual labelling of open-domain text corpora. The validity of the taxonomy is tested in a crowd-based text labelling study that combines quantitative and qualitative data, and the results are quantified in terms of interrater agreement. While the usefulness of the Wikipedia category system for automatic document indexing is documented in the pertinent literature, our results suggest that at least the taxonomy we derived from it is not a valid instrument for manual subject matter labelling of open-domain text corpora.
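
To make the notion of interrater agreement concrete, the following is a minimal sketch of one common chance-corrected coefficient, Fleiss' kappa. The abstract does not state which agreement measure the study used, so the choice of Fleiss' kappa, the function name `fleiss_kappa` and the toy labels are all assumptions made for illustration only.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each a list of category labels.

    Assumption: every item was labelled by the same number of raters (>= 2).
    Returns NaN when chance agreement is 1 (all labels identical), because
    the coefficient is undefined in that case.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_observed = sum(per_item) / n_items

    # Expected agreement from the overall category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())

    if p_expected == 1.0:
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels: three raters, two texts, full agreement on each text.
toy_ratings = [["Science", "Science", "Science"], ["Arts", "Arts", "Arts"]]
kappa = fleiss_kappa(toy_ratings)
```

Values near 1 indicate agreement well above chance, values near 0 indicate agreement at chance level, and negative values indicate systematic disagreement.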

Highlights

  • The aim of this study was to evaluate whether the Wikipedia category system (WCS), reduced to a depth of three levels, is a valid taxonomy for subject matter labelling of open-domain text corpora

  • The underlying use case is corpus labelling for browsing, searching and navigating text corpora, which is a prerequisite for numerous tasks in corpus and computational linguistics or related fields
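
One way to picture the reduction of a category graph to three levels is a depth-limited breadth-first traversal from a set of top-level categories. The sketch below is an assumption about how such a reduction could be done, not the procedure used in the study; the function name `extract_taxonomy`, the toy graph and the convention that root categories count as level 1 are all hypothetical.

```python
from collections import deque


def extract_taxonomy(subcategories, roots, max_depth=3):
    """Return {category: (level, parent)} for categories within max_depth.

    subcategories maps a category name to a list of its subcategory names.
    Breadth-first order means each category keeps the shallowest level at
    which it is reachable, and already visited categories are skipped, so
    cycles and multiple parents do not cause repeated visits.
    """
    taxonomy = {}
    queue = deque((root, 1, None) for root in roots)
    while queue:
        category, level, parent = queue.popleft()
        if category in taxonomy:
            continue
        taxonomy[category] = (level, parent)
        if level < max_depth:
            for child in subcategories.get(category, []):
                queue.append((child, level + 1, category))
    return taxonomy


# Hypothetical toy graph with a fourth level and a cycle back to the root.
toy_graph = {
    "Science": ["Physics", "Biology"],
    "Physics": ["Optics"],
    "Optics": ["Lasers"],
    "Biology": ["Science"],
}
toy_taxonomy = extract_taxonomy(toy_graph, ["Science"], max_depth=3)
```

In this toy run, "Lasers" sits at a fourth level and is therefore left out, and the cycle from "Biology" back to "Science" is ignored because "Science" has already been placed.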

Summary

Introduction

Being one of the largest knowledge resources and crowd-based endeavours on the web to date, Wikipedia has been studied extensively in numerous disciplines, including information science, linguistics, computer science and natural language processing, to name but a few. Wikipedia-based research mainly falls into two categories: first, research into the Wikipedia phenomenon itself, and, second, studies using Wikipedia as a data source for other research interests and applications [1,2]. A prolific line of inquiry from the second category deals with the use of Wikipedia for knowledge acquisition and knowledge organisation. The community-driven Wikipedia category system (WCS) for subject matter indexing is being explored as an alternative to more traditional knowledge organisation systems created by dedicated expert groups, or expert-created systems for short. While by 2012 the WCS had received relatively little attention [4], a recent bibliographic review [5] has identified a large body of research on the exploration and application of the WCS, highlighting its suitability for various problems in knowledge organisation. The WCS can be used to index collections of data belonging to different types of media
