Abstract

In this study we use transcripts of the Sejm (the Polish parliament) to predict a speaker’s background: gender, education, party affiliation and birth year. We create learning cases, each consisting of 100 utterances by the same author, and use the rich multi-level annotations of the source corpus to extract a variety of features from them. The features are either text-based (e.g. mean sentence length, percentage of long words or frequency of named entities of certain types) or word-based (unigrams and bigrams of surface forms, lemmas and interpretations). Next, we apply general-purpose feature selection, regression and classification algorithms and obtain results well above the baseline (97% accuracy for gender, 95% for education, 76–88% for party). A comparative study shows that random forest and k-nearest-neighbour classifiers usually outperform other methods commonly used in text mining, such as support vector machines and the naïve Bayes classifier. The evaluation experiments help explain how these methods handle such sparse, high-dimensional data and which of the considered traits influence language the most. We also address difficulties caused by some properties of Polish that are also typical of other Slavonic languages.
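
As a rough illustration of the text-based features named above, the following Python sketch computes mean sentence length and the percentage of long words for one learning case. It is not the authors' pipeline: the function name `text_features`, the regex-based sentence and word splitting, and the 8-character threshold for a "long" word are all assumptions made for this example, whereas the study relies on the annotations of the source corpus.

```python
import re


def text_features(utterances, long_word_min_len=8):
    """Compute two illustrative text-based features for one learning case.

    `utterances` is a list of strings by the same speaker. The threshold
    `long_word_min_len` is a hypothetical choice, not taken from the study.
    """
    sentence_lengths = []
    words = []
    for utterance in utterances:
        # Naive sentence split after '.', '!' or '?' (an assumption; the
        # study uses the sentence segmentation already present in the corpus).
        for sentence in re.split(r"(?<=[.!?])\s+", utterance.strip()):
            tokens = re.findall(r"\w+", sentence)
            if tokens:
                sentence_lengths.append(len(tokens))
                words.extend(tokens)

    if not words:
        return {"mean_sentence_length": 0.0, "long_word_pct": 0.0}

    long_words = sum(1 for w in words if len(w) >= long_word_min_len)
    return {
        "mean_sentence_length": sum(sentence_lengths) / len(sentence_lengths),
        "long_word_pct": 100.0 * long_words / len(words),
    }
```

Calling `text_features` on the 100 utterances of a learning case yields one small feature dictionary; in a setup like the one described, such values would sit alongside the much larger set of word-based n-gram features before feature selection.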
