Language Technology Tools Research Articles

EXTENDED ABSTRACT: The paper presents three language resources enabling better full-text access to digitised printed historical Slovenian texts: a hand-annotated corpus, a hand-annotated lexicon of historical words, and a collection of transcribed texts. The aim of the resources is twofold. On the one hand, they support empirical linguistic research (corpus, collection) and serve as a reference tool for research on historical Slovenian (lexicon). On the other hand, they may serve as training data for developing Human Language Technologies that enable better full-text search in digital libraries containing Slovenian written cultural heritage, modernisation of historical texts, and better technological solutions for text recognition and scanning.
The hand-annotated corpus of historical Slovenian contains the text of 1,000 pages sampled from the years 1750 to 1900; two texts date to the end of the 16th or 17th century. The corpus contains a little more than 250,000 word tokens, each annotated with hand-validated linguistic features: modernised form, lemma or base form, and morpho-syntactic description. For example, the word token »ajfram« is annotated with the normalised form »ajfrom«, the lemma »ajfer«, the morpho-syntactic description »Som«, or in full »Samostalnik« (noun), »občni« (common), »moški« (masculine), and the modern equivalent »gorečnost« (fervour). The corpus was first annotated automatically and then manually verified and corrected.
The lexicon was created automatically from the hand-annotated corpus. It contains only attested word-forms and examples of use. The word-forms are ordered under their modern equivalents. All the modern forms of a particular word constitute a dictionary entry, defined by its lemma together with associated information, i.e. the morpho-syntactic description and the closest contemporary synonyms. For example, the entry »ajfrer/Som/gorečnost« lists two modernised words, »ajfra« and »ajfrom«, their archaic forms »ajfram« and »aifram«, and the attestation: »…shaz noi frihtei tu shebranje karbo sdei udrukono is velzhim aifram noi is flisam inu is andohtjo 3 vezhiere saporedama …« (Tapravi inu tazieli Colemone-Shegen, 1800, p. 183). At present, the lexicon contains over 25,000 entries (including modern words in archaic texts), 50,000 word-forms and 70,000 archaic forms.
The third resource is an extensive collection of digitised texts similar to the corpus. The difference is that the words are annotated automatically by ToTrTaLe, a tool developed to process historical Slovenian text. The tool implements a pipeline: it first tokenises the text and then attempts to transcribe the archaic words to their modern-day equivalents. The text is then tagged and lemmatised using models for modern Slovenian. The collection contains about 5 million words of hand-corrected transcriptions from the following digitised texts:
• Slovenian books and editions of the newspaper »Kmetijske in rokodelske novice«, digitised by the National and University Library (NUK) within the EU project IMPACT (5000 pages);
• Digital library AHLib, comprising Slovenian books translated from German (100 books);
• A selection of Slovenian books.
All three resources (corpus, lexicon, collection) are encoded according to the Text Encoding Initiative Guidelines TEI P5, which enable the definition of XML schemas for encoding texts for scholarly purposes. The resources are accessible from the project home page at http://nl.ijs.si/imp/. The collection and the lexicon are available for on-line browsing, and the corpus and the automatically annotated collection for linguistic searches via a concordancer. All the resources can also be downloaded in their source XML form under the Creative Commons Attribution Licence.
We expect to extend the resources in future; however, even their present scope is sufficient for corpus-based diachronic studies of historical Slovenian and for developing useful language technology tools for processing cultural heritage texts.
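The pipeline order described for ToTrTaLe (tokenise, transcribe archaic forms, then tag and lemmatise with modern-language resources) can be illustrated with a minimal Python sketch. This is not the actual tool: the lookup tables `TRANSCRIPTIONS` and `MODERN_LEXICON`, the `Token` class and all function names are hypothetical, and the toy entries are taken from the »ajfram« example in the abstract. How the real tool performs transcription and tagging is not stated in the abstract, so plain dictionary lookup is an assumption made only to show the order of the steps.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical toy table: archaic spelling -> modernised word form.
TRANSCRIPTIONS = {"ajfram": "ajfrom", "aifram": "ajfrom"}

# Hypothetical toy lexicon for the modern language:
# modern word form -> (lemma, morpho-syntactic description).
MODERN_LEXICON = {"ajfrom": ("ajfer", "Som")}


@dataclass
class Token:
    form: str                    # word as it appears in the historical text
    modern: str                  # modernised form (unchanged if unknown)
    lemma: Optional[str] = None  # lemma, if the modern form is in the lexicon
    msd: Optional[str] = None    # morpho-syntactic description, if known


def tokenise(text: str) -> List[str]:
    # Split into word tokens and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text)


def transcribe(form: str) -> str:
    # Map an archaic form to its modern equivalent; keep unknown forms as-is.
    return TRANSCRIPTIONS.get(form.lower(), form)


def annotate(text: str) -> List[Token]:
    # Step 1: tokenise; step 2: transcribe; step 3: tag and lemmatise
    # by looking up the modernised form (assumed lookup, see lead-in).
    tokens = []
    for form in tokenise(text):
        modern = transcribe(form)
        lemma, msd = MODERN_LEXICON.get(modern, (None, None))
        tokens.append(Token(form, modern, lemma, msd))
    return tokens


if __name__ == "__main__":
    for tok in annotate("is velzhim aifram"):
        print(tok.form, tok.modern, tok.lemma, tok.msd)
```

Transcribing before tagging is what allows models built for modern Slovenian to be reused: the tagger only ever sees modernised forms, while the original spelling is kept alongside for search and display.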

Background: Free text is helpful for entering information into electronic health records, but reusing it is a challenge. The need for language technology for processing Finnish and Swedish healthcare text is therefore evident; however, Finnish and Swedish are linguistically very dissimilar. In this paper we present a comparison of characteristics in Finnish and Swedish free-text nursing narratives from intensive care. This creates a framework for characterising and comparing clinical text and lays the groundwork for developing clinical language technologies.
Methods: Our material included daily nursing narratives from one intensive care unit in Finland and one in Sweden. Inclusion criteria for patients were an inpatient period of at least five days and an age of at least 16 years. We performed a comparative analysis, including both qualitative and quantitative aspects, as part of a collaborative effort between Finnish- and Swedish-speaking healthcare and language technology professionals. The qualitative analysis addressed the content and structure of three average-sized health records from each country. In the quantitative analysis, 514 Finnish and 379 Swedish health records were studied using various language technology tools.
Results: Although the two languages are not closely related, nursing narratives in Finland and Sweden had many properties in common. Both made use of specialised jargon and their content was very similar. However, many of these characteristics posed challenges for developing language technology to support producing and using clinical documentation.
Conclusions: The way Finnish and Swedish intensive care nursing was documented was not country or language dependent, but shared a common context, principles and structural features, and even similar vocabulary elements. Technology solutions are therefore likely to be applicable to a wider range of natural languages, but they need linguistic tailoring.
Availability: The Finnish and Swedish data can be found at: http://www.dsv.su.se/hexanord/data/.

Articles published on Language Technology Tools

Unearthing the latent assumptions inscribed into language tools: the cross-cultural benefits of applying a reflexive lens in co-design

Mit Wortschatz und lexikografischen Ressourcen handeln: kritische Überlegungen zur Anwendung lexikalischer, lexikografischer und digitaler Kompetenzen im virtuellen Raum beim akademischen Schreiben im DaF-Bereich

Universal Dependency Treebank for Santali Language

Infraestrutura de Investigação para a Ciência e Tecnologia da Linguagem - PORTULAN CLARIN

Hierarchical self attention based sequential labelling model for Bhojpuri, Maithili and Magahi languages

Spoken word corpus and dictionary definition for an African language

Towards Kikamba Computational Grammar

Lithuanian-Latvian, Latvian-Lithuanian Parallel Corpus (LILA)

Historical Slovenian Language Resources

INTERVIEW: Knowledge and Terminology Management at Crisplant

Characteristics of Finnish and Swedish intensive care nursing narratives: a comparative analysis to support the development of clinical language technologies

El projecte Atlantis: recursos digitals per a les llengües minoritzades de la UE, recursos per a l'ensenyament del català

Compilation and Exploitation of Parallel Corpora

