Abstract

Author Attribution (AA) is a critical stylometry problem that aims to deduce the identity of the authors of electronic texts (e-texts) by examining only the texts themselves. AA is essential for enhancing various application domains, such as recommender systems and forensics. Nevertheless, existing AA techniques have not been assessed with Emirati social media e-texts, because no suitable dataset exists for evaluating AA techniques in this context. This paper introduces the Khonji-Iraqi Emirati Tweets Author Identification (AID) dataset with 30 authors (KIT-30), together with detailed evaluations. Compound grams, a new definition of grams, are introduced and allow us to achieve higher classification accuracy. In addition, when a suitable data representation is used, the degradation in classification accuracy as the number of suspect authors increases is not as severe as previously reported. Furthermore, to help address the lack of conveniently available implementations of stylometry methods, we have developed an extensive e-text feature extraction library, namely Fextractor, with a highly intuitive API. The library generalizes all existing n-gram-based feature extraction methods under the at least l-frequent, dir-directed, k-skipped n-grams, and allows grams to be diversely defined, including definitions that are based on high-level grammatical aspects, such as Part of Speech (POS) tags, as well as lower-level ones, such as the distribution of function words and word shapes.
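To make the "at least l-frequent, k-skipped n-grams" idea more concrete, the following is a minimal Python sketch using only the standard library. It is not Fextractor's API: the function names `skip_ngrams` and `frequent_skip_ngrams` are hypothetical, the sketch assumes the common reading of k-skipped n-grams (at most k tokens skipped in total within a gram), and it leaves out the dir (direction) parameter because the abstract does not define it. The tokens may be characters, words, POS tags, or any other gram definition.

```python
from collections import Counter
from itertools import combinations


def skip_ngrams(tokens, n, k):
    """Return all n-grams over `tokens` with at most k skipped tokens in total.

    Hypothetical helper for illustration; not part of Fextractor.
    """
    grams = []
    for start in range(len(tokens)):
        # The last element of a gram may sit at most n - 1 + k positions
        # after the first one, so at most k tokens are skipped in total.
        window_end = min(len(tokens), start + n + k)
        following = range(start + 1, window_end)
        for combo in combinations(following, n - 1):
            grams.append((tokens[start],) + tuple(tokens[i] for i in combo))
    return grams


def frequent_skip_ngrams(tokens, n, k, l):
    """Keep only the k-skipped n-grams that occur at least l times."""
    counts = Counter(skip_ngrams(tokens, n, k))
    return {gram: count for gram, count in counts.items() if count >= l}


if __name__ == "__main__":
    words = "a b a b c".split()
    print(skip_ngrams(words, 2, 1))
    print(frequent_skip_ngrams(words, 2, 0, 2))
```

With k = 0 and l = 1 the sketch reduces to ordinary contiguous n-grams, which is the sense in which a skip and frequency parameterization can subsume the plain n-gram case.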

Highlights

  • E-text stylometry is concerned with analyzing the writing styles of input e-texts in order to extract information about their authors

  • Author Identification (AID) can take the form of the Author Attribution (AA) problem addressed in this paper, the Author Verification (AV) problem, or the Author Diarization (AD) problem

  • This constraint increases the difficulty of the AA problem, as it substantially reduces the possibility of test tweets being chronologically too close to their learning counterparts

Summary

INTRODUCTION

E-text stylometry is concerned with analyzing the writing styles of input e-texts in order to extract information about their authors. This work is motivated by the lack of evaluation datasets for stylometry problem solvers when executed against e-texts written in Emirati Arabic, a dialect of the Arabic language that is natively spoken in the United Arab Emirates (UAE). The Fextractor library is released under a permissive open-source license. We hope that this will enable other researchers to conveniently study the feature extraction methods, or evaluate their own methods against the existing ones, without facing the time and effort barrier that is currently required to implement the many methods. The results show that even when the number of suspect authors increases to 30, some AA models can achieve high accuracy in the context of Emirati tweets when using suitable text vectorization methods.

RELATED WORKS
AUTHOR ATTRIBUTION MODEL
COMPOUND GRAMS
EVALUATION METHODOLOGY
FEXTRACTOR
EXAMPLES
CONCLUSION