FemSMA Corpus Workbench. A Tool for Supporting the Qualitative and Quantitative Analysis of Textual Data

Brigitte Krenn

doi:10.13092/lo.76.2818

Abstract

In various areas of (linguistic) research, there is a need to analyse larger amounts of textual data. Digitisation and the availability of computational linguistics tools offer substantial support for analysing such data sets qualitatively and quantitatively. Keeping, maintaining and presenting data and their metadata within one system facilitates data inspection and browsing. Data sets can be quickly assessed for the presence or absence of specific textual characteristics by combining manual annotation of text segments with theory-driven meta-information, automatic analysis using computational linguistics tools, and computerised search. The present contribution introduces the FemSMA Corpus Workbench (CWB), a computational linguistics tool for manual and automatic annotation and analysis of text documents. CWB supports the storage and maintenance of textual data and related metadata, as well as annotation of and search in them. CWB is a client-server application with a web interface as the frontend for data inspection and manual annotation. Data storage and automatic processing take place on the server side. The automatically annotated features are word-level features such as parts of speech; general word features such as capitalisation, character reduplication and abbreviation; and swear words and emotion words. Owing to its modular system architecture, CWB can be flexibly extended; this, however, requires the involvement of computational linguists to adapt and extend CWB’s automatic analysis and search functionalities and to represent the new functionality in the web interface.
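To make the general word features named in the abstract more concrete, the following is a minimal Python sketch of how capitalisation, character reduplication, abbreviation, swear words and emotion words might be flagged per token. It is not CWB's implementation: the function names, the regular expressions, the whitespace tokenisation and the tiny word lists are all assumptions made for illustration, and part-of-speech tagging is left out because it would require an external tagger.

```python
import re

# Hypothetical mini-lexicons for illustration only; not taken from CWB.
SWEAR_WORDS = {"damn", "crap"}
EMOTION_WORDS = {"happy", "sad", "angry", "love"}
KNOWN_ABBREVIATIONS = {"etc.", "dr.", "vs."}

# The same character three or more times in a row, e.g. "sooo".
REDUPLICATION = re.compile(r"(.)\1{2,}")
# Crude heuristic: at least two dotted letter groups, e.g. "e.g." or "i.e."
DOTTED_ABBREVIATION = re.compile(r"^(?:[A-Za-z]+\.){2,}$")


def annotate_token(token: str) -> dict:
    """Return simple surface features for one whitespace-delimited token."""
    # Strip surrounding punctuation before the lexicon lookups.
    bare = token.lower().strip(".,!?;:")
    return {
        "token": token,
        "capitalised": token[:1].isupper(),
        "all_caps": len(token) > 1 and token.isupper(),
        "reduplication": bool(REDUPLICATION.search(token)),
        "abbreviation": bool(DOTTED_ABBREVIATION.match(token))
        or token.lower() in KNOWN_ABBREVIATIONS,
        "swear_word": bare in SWEAR_WORDS,
        "emotion_word": bare in EMOTION_WORDS,
    }


def annotate_text(text: str) -> list:
    """Annotate every token of a text (naive whitespace tokenisation)."""
    return [annotate_token(tok) for tok in text.split()]


if __name__ == "__main__":
    for row in annotate_text("I am sooo HAPPY, e.g. today"):
        flags = [name for name, value in row.items() if value is True]
        print(row["token"], flags)
```

A real system would replace the whitespace split with a proper tokeniser and the toy word lists with curated lexicons; the sketch only shows that each feature can be computed independently and stored as one annotation layer per token.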

Highlights

In various areas of research, there is a need to analyse larger amounts of textual data
Quick assessment of data sets for the presence or absence of specific textual characteristics is supported by the possibility to manually annotate segments of text
Corpus Workbench (CWB) is a client-server application with a web interface as frontend

Summary

Introduction

In various areas of (linguistic) research, collections of texts need to be examined qualitatively in light of theoretical questions. This contribution presents the FemSMA Corpus Workbench (CWB) as a current example of a computational linguistics instrument that combines the following functionalities: automatic search in text documents, and manual and automatic annotation of texts by means of computational linguistics analysis tools. The CWB was originally developed to analyse and annotate social media postings, (i) to study the extent to which author gender can be predicted from textual features of the postings, and (ii) to build a reference corpus for training statistical models that can be used to classify texts by author gender. These lists in turn serve the computational linguists on the team as a basis for assessing which of the features relevant to a (text-)linguistic analysis can be identified and annotated automatically, by which computational linguistics methods, and with what accuracy. Section 0 provides a summary and an outlook on the planned further development of the CWB.

Corpus Workbench CWB

Functionalities for Manual Annotation

Definition of Label Groups and Labels

Conclusion