The Searchbench - Combining Sentence-semantic, Full-text and Bibliographic Search in Digital Libraries

Ulrich Schäfer, Christian Spurk, Magdalena Wolska, Benjamin Weitz, Jörg Steffen, Bernd Kiefer, Rui Wang

doi:10.18352/lq.8091

Abstract

We describe a novel approach to precise searching in the full content of digital libraries. The Searchbench (for search workbench) is based on sentence-wise syntactic and semantic natural language processing (NLP) of both born-digital and scanned publications in PDF format. The term born-digital means natively digital, i.e. prepared electronically using typesetting systems such as LaTeX, OpenOffice, and the like. In the Searchbench, queries can be formulated as (possibly underspecified) statements consisting of simple subject-predicate-object constructs such as ‘algorithm improves word alignment’. This reduces the number of false hits that arise in large document collections when the search words happen to appear close to each other but are not semantically related. The method also abstracts from passive voice and predicate synonyms. Moreover, negated statements can be excluded from the search results, and negated antonym predicates again count as synonyms (e.g. not include = exclude).

In the Searchbench, a sentence-semantic search can be combined with search filters for classical full-text, bibliographic metadata and automatically computed domain terms. Auto-suggest fields facilitate text input. Queries can be bookmarked or emailed. Furthermore, a novel citation browser in the Searchbench allows graphical navigation in citation networks, which have been extracted automatically from metadata and paper texts. The citation browser displays short phrases from citation sentences at the edges of the citation graph and thus allows students and researchers to quickly browse publications and immerse themselves in a new research field. Clicking on a citation edge shows the original citation sentence in context, and optionally also in the original PDF layout.

To showcase the usefulness of our research, we have applied it to a collection of currently approx. 25,000 open access research papers in the field of computational linguistics and language technology, the ACL Anthology (http://aclweb.org/anthology). The Searchbench user interface is a web application that runs in every modern, JavaScript-enabled web browser, including those on smartphones and tablet computers. The system is a free and public service at http://aclasb.dfki.de. Because the NLP technology is domain-independent, it could also be applied to newspaper texts, technical documentation, or scientific publications from other disciplines. The aim of this paper is to make the benefits of this new, language-technology-based approach known in library research and related fields.

This article summarises nine peer-reviewed publications from the past three years that have been published in international conferences and workshops in the area of computational linguistics, and tries to present them in a way appropriate to the LIBER audience. The original papers contain more details and are freely available from the author’s homepage[1] or via the Searchbench[2].
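The statement matching described in the abstract can be illustrated with a small Python sketch. It is not the Searchbench implementation: the names `Statement`, `Query`, `SYNONYM_SETS`, `ANTONYMS` and `matches` are hypothetical, the tiny lexicon is invented for illustration, and the sketch assumes that a parser has already reduced each sentence to a subject-predicate-object statement with a lemmatised predicate and a negation flag (so that active and passive sentences yield the same statement).

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy lexicon, invented for this illustration only.
SYNONYM_SETS = [{"improve", "enhance"}, {"exclude", "omit"}]
ANTONYMS = {"include": "exclude", "exclude": "include"}


@dataclass(frozen=True)
class Statement:
    """One statement assumed to come from a parsed sentence."""
    subject: str
    predicate: str  # lemmatised predicate
    obj: str
    negated: bool = False


@dataclass(frozen=True)
class Query:
    """A possibly underspecified query; None leaves a slot open."""
    subject: Optional[str] = None
    predicate: Optional[str] = None
    obj: Optional[str] = None


def synonyms(predicate: str) -> set:
    """Return the synonym set containing the predicate, or just the predicate."""
    for group in SYNONYM_SETS:
        if predicate in group:
            return group
    return {predicate}


def normalize(stmt: Statement) -> Statement:
    """Rewrite a negated predicate with a known antonym (not include -> exclude)."""
    if stmt.negated and stmt.predicate in ANTONYMS:
        return Statement(stmt.subject, ANTONYMS[stmt.predicate], stmt.obj)
    return stmt


def field_matches(wanted: Optional[str], actual: str) -> bool:
    """An open slot matches anything; otherwise use case-insensitive containment."""
    return wanted is None or wanted.lower() in actual.lower()


def matches(query: Query, stmt: Statement) -> bool:
    """Check a statement against a query, excluding statements that stay negated."""
    stmt = normalize(stmt)
    if stmt.negated:
        return False
    if query.predicate is not None and stmt.predicate not in synonyms(query.predicate):
        return False
    return field_matches(query.subject, stmt.subject) and field_matches(query.obj, stmt.obj)


if __name__ == "__main__":
    statements = [
        Statement("the algorithm", "improve", "word alignment"),
        Statement("the method", "improve", "word alignment", negated=True),
        Statement("the corpus", "include", "dialogues", negated=True),
    ]
    query = Query(predicate="enhance", obj="alignment")
    for stmt in statements:
        print(stmt, matches(query, stmt))
```

In this sketch, a query is matched slot by slot rather than by word proximity, which is the property the abstract credits with reducing false hits; the substring comparison in `field_matches` is a simplification chosen only to keep the example short.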

Highlights

Searching in the ever faster growing amount of digitally available publications is tedious and often unsatisfactory.
We summarise our research, conducted over the last three years, on precise searching in digital scientific libraries by using natural language processing, viz. deep syntactic parsing with sentence-semantic output.
The general observation is that a sentence-semantic search often delivers precise results with a low percentage of unrelated results.

Summary

Introduction

Searching in the ever faster growing amount of digitally available publications is tedious and often unsatisfactory. Natural language processing can help in making a search more precise and efficient. We summarise our research, conducted over the last three years, on precise searching in digital scientific libraries by using natural language processing, viz. deep syntactic parsing with sentence-semantic output. The research led to a practical system, the Searchbench, and a free online service, the ACL Anthology Searchbench, that can be used to test the research results and benefits for search in scholarly publications (Schäfer, Kiefer, Spurk, Steffen, & Wang, 2011). The approaches are domain-independent and can be applied to other text domains as long as edited text with well-formed English sentences is predominant.

Objectives

Results

Conclusion