Abstract

We provide an overview of forty years of work with language corpora by the research group that started in 1972 as the Norwegian Computing Centre for the Humanities. A brief history highlights major corpora and tools that have been developed in numerous collaborations, including corpora of literature, dialect recordings, learner language, parallel texts, newspaper articles, blog posts and tweets. Current activities are also described, with a focus on corpus analysis tools, treebanks and social media analysis.

Keywords: corpus building; corpus analysis tools; treebanks; social media analysis

Highlights

  • In 1972 the Norwegian Computing Centre for the Humanities was formed in Bergen

  • The latest incarnation, since 2011, is the Computational Language Unit (CLU) at Uni Computing, which is a department of Uni Research AS, Bergen

  • We survey some highlights of this work, which has included the creation and analysis of corpora of literature, dialect recordings, learner language, parallel texts, newspaper articles, blog posts and tweets

Summary

Introduction

In 1972 the Norwegian Computing Centre for the Humanities was formed in Bergen. Over the subsequent years this group morphed into the Humanities Data Centre, the Humanities and Information Technology Centre, Unifob AKSIS and Uni Digital. The constant thread through these groups has been research and development at the intersection of language, culture and information technology. We survey some highlights of this work, which has included the creation and analysis of corpora of literature, dialect recordings, learner language, parallel texts, newspaper articles, blog posts and tweets. Through this brief history we comment on the impact of earlier work on current activities in Bergen and elsewhere, and summarise how developments in computing technologies have afforded, and continue to afford, opportunities for new kinds of corpus-based research. Looking to the future, we focus on three ongoing strands of work at the Computational Language Unit (CLU): (i) the development of a corpus analysis system to best exploit the recent availability of large-scale and richly annotated corpora; (ii) the development of a system for building and using treebanks, and their application to linguistics and computational linguistics; (iii) corpus-based approaches to extracting information from massive and heterogeneous corpora of social media.

From Ibsen to Twitter
Corpus management and analysis software
Turning corpora into treebanks
Extracting information and information structures from corpora
Closing Remarks
