AUTHORSHIP ATTRIBUTION OF RESPONSA USING CLUSTERING

Yaakov Hacohen-Kerner, Orr Margaliot

doi:10.1080/01969722.2014.945311

Abstract

Authorship attribution of text documents is an active research area; however, almost all of its applications use supervised machine learning (ML) methods. In this research, we explore authorship attribution as a clustering problem; that is, we attempt the task using unsupervised ML methods. The application domain is responsa, which are answers written by well-known Jewish rabbis in response to various Jewish religious questions. We built a corpus of 6,079 responsa containing almost 10 million words, composed by five authors who lived mainly in the 20th century. Clustering tasks were performed for two, three, four, and five authors. Clustering used three kinds of word lists: most frequent words (FW), including function words (stopwords); most frequent filtered words (FFW), excluding function words; and words with the highest variance values (HVW). Two unsupervised ML methods were applied: K-means and Expectation Maximization (EM). The best clustering tasks for two, three, or four authors achieved results above 98%, and the improvement rates were above 40% in comparison to the “majority” (baseline) results. EM was found to be superior to K-means for the discussed tasks. FW was found to be the best word list, far superior to FFW. FW, in contrast to FFW, includes function words, which are usually regarded as having little lexical meaning. This might imply that normalized frequencies of function words can serve as good indicators for authorship attribution using unsupervised ML methods. This finding supports previous findings about the usefulness of function words for other tasks, such as authorship attribution using supervised ML methods, and genre and sentiment classification.
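To make the pipeline in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It builds an FW-style list of most frequent words, represents each document by normalized frequencies of those words, clusters with a small K-means, and compares clustering accuracy against the majority baseline. The toy English documents, whitespace tokenization, farthest-first initialization, and all function names are assumptions made for illustration; the EM method and the FFW and HVW word lists are not shown.

```python
from collections import Counter
from itertools import permutations

def top_words(docs, n):
    """Return the n most frequent words over all documents (function words kept)."""
    counts = Counter(w for d in docs for w in d.lower().split())
    return [w for w, _ in counts.most_common(n)]

def normalized_freqs(doc, vocab):
    """Frequency of each vocabulary word divided by the document length."""
    tokens = doc.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in vocab]

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iters=50):
    """Plain K-means with deterministic farthest-first initialization (an assumption)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def clustering_accuracy(labels, authors):
    """Best agreement over all mappings of cluster ids to author names."""
    names = sorted(set(authors))
    best = max(
        sum(perm[l] == a for l, a in zip(labels, authors))
        for perm in permutations(names)
    )
    return best / len(authors)

def majority_baseline(authors):
    """Accuracy obtained by assigning every document to the most common author."""
    return Counter(authors).most_common(1)[0][1] / len(authors)

# Hypothetical toy corpus: two "authors" with different function-word habits.
docs = [
    "the law of the land is the law of the court",
    "the ruling of the sage is the word of the text",
    "the custom of the town is the way of the elders",
    "and he went to pray and to learn and to teach",
    "and she came to ask and to hear and to answer",
    "and they sat to read and to write and to study",
]
authors = ["A", "A", "A", "B", "B", "B"]

vocab = top_words(docs, 4)
points = [normalized_freqs(d, vocab) for d in docs]
labels = kmeans(points, k=2)
print(vocab, clustering_accuracy(labels, authors), majority_baseline(authors))
```

In this toy setting the four most frequent words are all function words, which is why they separate the two groups cleanly. Real corpora would need a much larger word list and a more careful tokenizer.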



More From: Cybernetics and Systems


