Abstract

Topic models have been shown to be a useful way of representing the content of large document collections, for example, via visualization interfaces (topic browsers). These systems enable users to explore collections by way of latent topics. A standard way to represent a topic is a term list, that is, the top-n words with the highest conditional probability within the topic. Other topic representations, such as textual and image labels, have also been proposed. However, there has been no comparison of these alternative representations. In this article, we compare three different topic representations in a document retrieval task. Participants were asked to retrieve relevant documents based on predefined queries within a fixed time limit, with topics presented in one of the following modalities: (a) lists of terms, (b) textual phrase labels, and (c) image labels. Results show that textual labels are easier for users to interpret than term lists and image labels. Moreover, the precision of retrieved documents for textual and image labels is comparable to the precision achieved by representing topics using term lists, demonstrating that labeling methods are an effective alternative topic representation.
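The term-list representation mentioned above can be illustrated with a minimal sketch. It assumes a topic is available as a mapping from words to conditional probabilities P(word | topic); the function name `top_n_terms` and the toy distribution are hypothetical and are not taken from the article.

```python
from typing import Dict, List


def top_n_terms(word_probs: Dict[str, float], n: int = 10) -> List[str]:
    """Return the n words with the highest P(word | topic).

    Ties are broken alphabetically so the result is deterministic.
    """
    ranked = sorted(word_probs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]


# Hypothetical toy topic; the probabilities are illustrative only.
topic = {
    "library": 0.12,
    "archive": 0.09,
    "digital": 0.15,
    "search": 0.05,
    "query": 0.03,
}

print(top_n_terms(topic, n=3))  # ['digital', 'library', 'archive']
```

In practice the word probabilities would come from a fitted topic model rather than a hand-written dictionary.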

Highlights

  • In recent years, a large amount of information has been made available online in digital libraries, collections, and archives.

  • We chunk-parse the primary candidates to extract noun chunks and generate component n-grams from the noun chunks, excluding n-grams that do not themselves exist as Wikipedia titles. As this procedure generates a number of labels, we introduce an additional filter to remove labels that have low association with other labels, based on the Related Article Conceptual Overlap (RACO) lexical association method (Grieser et al., 2011).

  • Further analysis is carried out to determine the relevance of the retrieved documents based on the topics that were selected in the first stage.
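The n-gram step in the second highlight can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: noun chunks are assumed to arrive already tokenized, unigrams are assumed to count as component n-grams, `known_titles` is a hypothetical lowercase set standing in for a lookup against Wikipedia titles, and the RACO association filter is not implemented.

```python
from typing import Iterable, List, Set, Tuple


def component_ngrams(chunk: List[str]) -> List[Tuple[str, ...]]:
    """All contiguous n-grams (n >= 1) of a tokenized noun chunk."""
    return [
        tuple(chunk[i:j])
        for i in range(len(chunk))
        for j in range(i + 1, len(chunk) + 1)
    ]


def candidate_labels(
    chunks: Iterable[List[str]], known_titles: Set[str]
) -> List[str]:
    """Keep only n-grams that appear in the (hypothetical) title set.

    Order of first appearance is preserved and duplicates are dropped.
    """
    seen: Set[str] = set()
    labels: List[str] = []
    for chunk in chunks:
        for gram in component_ngrams(chunk):
            label = " ".join(gram)
            if label.lower() in known_titles and label not in seen:
                seen.add(label)
                labels.append(label)
    return labels


# Hypothetical inputs, for illustration only.
chunks = [["digital", "library", "systems"]]
known_titles = {"digital library", "library"}

print(candidate_labels(chunks, known_titles))  # ['digital library', 'library']
```

A further association-based filter, such as the RACO step described above, would then be applied to the surviving labels.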


Summary

Introduction

A large amount of information has been made available online in digital libraries, collections, and archives. Much of this information is stored in an unstructured format (e.g., text) and is not organized using any classification system. The majority of search interfaces rely on keyword-based search. This approach works only when users have sufficient domain knowledge to generate appropriate queries, which is not always the case. Users may not know what information is available, or may not be sufficiently familiar with the information to select appropriate keywords.

