Corpus of transcribed parliamentary speeches for authorship attribution and author profiling tasks

Jurgita Kapočiūtė-Dzikienė, Andrius Utka, Ligita Šarkutė

doi:10.15388/klbt.2014.7674

Abstract

We present a corpus of transcribed Lithuanian parliamentary speeches, prepared in a format suited to a range of authorship identification tasks. The corpus consists of approximately 111 thousand texts (24 million words). Each text corresponds to one parliamentary speech delivered during an ordinary session over 7 parliamentary terms, from March 10, 1990 to December 23, 2013. The texts are grouped into 147 categories corresponding to individual authors, so they can be used for authorship attribution tasks; they are also grouped by age, gender and political views, which makes them suitable for author profiling tasks. Because short texts make an author's speaking style harder to recognize and are easily confused with the style of other authors, we included only texts of at least 100 words. To make each category as comprehensive and representative as possible, we included only authors who delivered at least 200 speeches. All texts are lemmatized, morphologically and syntactically annotated, and tokenized into character n-grams. Statistical information about the corpus is also available. We also demonstrate that the corpus can be used effectively in authorship attribution and author profiling tasks with supervised machine learning methods. The corpus structure also allows it to be used with unsupervised machine learning methods, for the creation of rule-based methods, and in various linguistic analyses.
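The two inclusion criteria and the character n-gram tokenization described above can be illustrated with a minimal sketch. This is not the authors' tooling: the function names `filter_corpus` and `char_ngrams`, the `(author, text)` input format, whitespace-based word counting, and the choice to apply the 200-speech threshold after discarding short texts are all assumptions made for illustration.

```python
from collections import defaultdict

# Thresholds stated in the abstract.
MIN_WORDS = 100      # minimum words per speech
MIN_SPEECHES = 200   # minimum speeches per author


def filter_corpus(speeches, min_words=MIN_WORDS, min_speeches=MIN_SPEECHES):
    """Group (author, text) pairs by author and apply both thresholds.

    Assumption: words are counted by splitting on whitespace, and the
    per-author threshold is checked after short texts are dropped.
    """
    by_author = defaultdict(list)
    for author, text in speeches:
        if len(text.split()) >= min_words:
            by_author[author].append(text)
    # Keep only authors with enough remaining speeches.
    return {a: texts for a, texts in by_author.items() if len(texts) >= min_speeches}


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

With toy thresholds, `filter_corpus(pairs, min_words=3, min_speeches=2)` keeps only authors who have at least two speeches of three or more words, and `char_ngrams("seimas", 3)` yields the overlapping trigrams of the word.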

Highlights

The texts were divided into 3 groups after manually collecting information about the parliamentarians' political views and changes in those views
The corpus structure allows it to be used with unsupervised machine learning methods, for the creation of rule-based methods, and in various linguistic analyses

Summary

Introduction

Every person's writing style (the patterns used to form sentences, vocabulary richness, idioms, grammatical or syntactic errors) is a kind of "fingerprint". This progress was driven by the very need for such research, which arose largely from the emergence of electronic texts, especially anonymous ones. Some stylometric tasks address the problem of identifying a specific author: for example, forensic linguists investigate who disclosed a company's confidential information on an internet forum; who sent a threatening e-mail from a completely uninformative address; whether a farewell letter found on a computer was really written by the person who died by suicide; or which of the people presenting themselves on a social network is in fact a disguised paedophile. If authorship identification were limited to author verification (Koppel and Schler 2004, 63), where, given an anonymous text, one must determine whether or not it was written by an author well known to us, it would be easily manageable for a human. When there are several hundred or even thousands of possible authors (candidate authors), however, even a highly representative sample of texts written by each of them hardly helps to determine the authorship of a new, unknown text.

Overview of authorship identification research

Corpus of Seimas session transcripts

Experiments and results

Summary and conclusions

Journal: Kalbotyra
Publication Date: Mar 30, 2016
Citations: 2
License type: CC BY
