Abstract

Authorship attribution of text documents is a “hot” research domain; however, almost all of its applications use supervised machine learning (ML) methods. In this research, we explore authorship attribution as a clustering problem; that is, we attempt to perform authorship attribution using unsupervised ML methods. The application domain is responsa, which are answers written by well-known Jewish rabbis in response to various Jewish religious questions. We have built a corpus of 6,079 responsa containing almost 10 million words, composed by five authors who lived mainly in the 20th century. Clustering tasks were performed for two, three, four, and five authors. Clustering was performed using three kinds of word lists: most frequent words (FW), including function words (stopwords); most frequent filtered words (FFW), excluding function words; and words with the highest variance values (HVW). Two unsupervised ML methods were applied: K-means and Expectation Maximization (EM). The best clustering tasks for two, three, and four authors achieved results above 98%, and the improvement rates were above 40% in comparison to the “majority” (baseline) results. The EM method was found to be superior to K-means for the discussed tasks. FW was found to be the best word list, far superior to FFW. FW, in contrast to FFW, includes function words, which are usually regarded as words that have little lexical meaning. This might imply that normalized frequencies of function words can serve as good indicators for authorship attribution using unsupervised ML methods. This finding is consistent with previous findings about the usefulness of function words for other tasks, such as authorship attribution using supervised ML methods, and genre and sentiment classification.
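
To make the three word lists concrete, the following is a minimal Python sketch of how FW, FFW, and HVW lists and normalized-frequency vectors might be built. It is an illustration under stated assumptions, not the procedure used in the study: the function names, the toy English documents, the tiny stopword set, and the use of population variance over per-document normalized frequencies are all hypothetical choices made here.

```python
from collections import Counter
from statistics import pvariance


def normalized_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word within one document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against empty documents
    return [counts[w] / total for w in vocab]


def build_word_lists(docs, stopwords, n):
    """Return (FW, FFW, HVW) word lists with up to n words each.

    FW:  the n most frequent words in the corpus, function words included.
    FFW: the n most frequent words after removing function words.
    HVW: the n words whose normalized frequency varies most across documents
         (assumption: population variance over per-document frequencies).
    """
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    ranked = [w for w, _ in corpus_counts.most_common()]

    fw = ranked[:n]
    ffw = [w for w in ranked if w not in stopwords][:n]

    rows = [normalized_freqs(doc, ranked) for doc in docs]
    variances = {
        w: pvariance([row[i] for row in rows]) for i, w in enumerate(ranked)
    }
    hvw = sorted(ranked, key=lambda w: variances[w], reverse=True)[:n]
    return fw, ffw, hvw


# Toy data (hypothetical; the actual corpus consists of 6,079 responsa).
docs = [
    "the rabbi said that the law is the law".split(),
    "the answer of the rabbi is clear".split(),
    "custom and law differ and custom prevails".split(),
]
stopwords = {"the", "that", "is", "of", "and"}  # illustrative only

fw, ffw, hvw = build_word_lists(docs, stopwords, n=3)

# One feature vector per document; vectors like these could then be passed
# to a K-means or EM clustering implementation.
fw_vectors = [normalized_freqs(doc, fw) for doc in docs]
```

In this sketch the only difference between FW and FFW is the stopword filter, which mirrors the contrast the abstract draws between the two lists.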
