Automatic authorship attribution in Albanian texts

Abstract

Automatic authorship identification is a challenging task that has been the focus of extensive research in natural language processing. Despite the progress made in attributing authorship, the lack of corpora in under-resourced languages impedes the advancement and evaluation of existing methods. To address this gap, we investigate the problem of authorship attribution in Albanian. We introduce a newly compiled corpus of Albanian newsroom columns and literary works and analyze machine-learning methods for detecting authorship. We create a set of hand-crafted features targeting various categories (lexical, morphological, and structural) relevant to Albanian and experiment with multiple classifiers using two different multiclass classification strategies. Furthermore, we compare our results to those obtained using deep learning models. Our investigation focuses on identifying the best combination of features and classification methods. The results reveal that lexical features are the most effective set of linguistic features, significantly improving the performance of various algorithms in the authorship attribution task. Among the machine learning algorithms evaluated, XGBoost demonstrated the best overall performance, achieving an F1 score of 0.982 on literary works and 0.905 on newsroom columns. Additionally, deep learning models such as fastText and BERT-multilingual showed promising results, highlighting their potential applicability in specific scenarios in Albanian writings. These findings contribute to the understanding of effective methods for authorship attribution in low-resource languages and provide a robust framework for future research in this area. The careful analysis of the different scenarios and the conclusions drawn from the results provide valuable insights into the potential and limitations of the methods and highlight the challenges in detecting authorship in Albanian.
Promising results are reported, with implications for improving the methods used in Albanian authorship attribution. This study provides a valuable resource for future research and a reference for researchers in this domain.
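
To make the idea of hand-crafted lexical features concrete, the following is a minimal sketch in Python. The abstract does not list the actual features used in the study, so the three features below (type-token ratio, average word length, hapax ratio) and the function name `lexical_features` are illustrative assumptions, not the authors' feature set.

```python
import re
from collections import Counter


def lexical_features(text: str) -> dict[str, float]:
    """Compute a few illustrative lexical style features for one document.

    Hypothetical helper: the feature choice is an assumption for illustration.
    """
    # \w is Unicode-aware in Python 3, so precomposed Albanian letters
    # such as ë and ç stay inside their words.
    words = re.findall(r"\w+", text.lower())
    if not words:
        return {"type_token_ratio": 0.0, "avg_word_length": 0.0, "hapax_ratio": 0.0}
    counts = Counter(words)
    # Hapax legomena: word types that occur exactly once in the document.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "type_token_ratio": len(counts) / len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "hapax_ratio": hapaxes / len(words),
    }
```

Feature dictionaries like this, computed per document, could then be turned into vectors and passed to any multiclass classifier; which classifier and strategy work best is exactly what the study evaluates.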

Similar Papers
  • Research Article
  • Cite Count Icon 4
  • 10.1371/journal.pone.0310057
Automatic authorship attribution in Albanian texts.
  • Oct 22, 2024
  • PloS one
  • Arta Misini + 3 more

  • Research Article
  • 10.52783/jes.1506
Enabling Natural Language Processing and AI Research in Low-Resource Languages: Development and Description of an Assamese UPoS Tagged Dataset
  • Apr 4, 2024
  • Journal of Electrical Systems
  • Kuwali Talukdar, Shikhar Kumar Sarma

This paper describes in detail the Universal Parts of Speech (UPoS) tagged dataset for the Assamese language. A PoS-tagged dataset in a language is crucial for experimenting and creating resources for various Natural Language Processing (NLP) and AI research. With the growing usage of Universal Dependency standards, tagged datasets with Universal PoS tags are becoming essential for contemporary experiments in the NLP community. NLP research in Assamese, an Indo-Aryan language, is relatively new, and the language is considered a low-resource language. The dataset of UPoS-tagged Assamese text is created with the aim of enriching this low-resource language for NLP and AI tasks. The size of the dataset is 283506 tokens of Assamese vocabulary, across a total of 20280 sentences, tagged with 17 standard UPoS tags of core lexical categories. The raw data are taken from an open-source corpus originally tagged with the BIS tagset. The original size of 453457 tokens across 29504 sentences was reduced, after data filtering, to this clean resource of 283506 tokens. Lexical category mapping from the BIS to the UPoS tagset is done with linguistic expertise. The mapped pattern was used for a first-level conversion of BIS tags to UPoS tags. Linguistic validation is also performed with linguistic experts, and inter-annotator agreements and disagreements were recorded. Second-level validation resolved the disagreements, producing the final version of the dataset. This Assamese UPoS-tagged dataset is the first of its kind with UPoS annotations and shall serve a wider Assamese NLP research community for model training using Machine Learning/Deep Learning techniques.

  • Research Article
  • Cite Count Icon 3
  • 10.64539/sjer.v1i1.2025.6
A Bibliometric Analysis of Natural Language Processing and Classification: Trends, Impact, and Future Directions
  • Jan 3, 2025
  • Scientific Journal of Engineering Research
  • Setiawan Ardi Wijaya + 5 more

This study presents a bibliometric analysis of Natural Language Processing (NLP) and classification research, examining trends, impacts, and future directions. NLP, a key field in artificial intelligence, focuses on enabling computers to process and understand human language through tasks such as text classification, sentiment analysis, and speech recognition. Classification plays a crucial role in organizing textual data, facilitating applications like spam detection and content recommendation. The research employs bibliometric analysis to evaluate publication trends, citation networks, and emerging themes from 1992 to 2025. Using data retrieved from Scopus, descriptive statistical analysis and bibliometric mapping with VOSviewer reveal key contributors, influential publications, and subject area distributions. Findings indicate a significant rise in NLP research, with deep learning models, particularly transformers, driving advancements in the field. The study highlights dominant research areas, including computer science, engineering, and medicine, and identifies leading countries in NLP research, such as the United States, China, and India. Additionally, ethical concerns, including bias and fairness in NLP applications, are discussed as critical challenges for future research. The insights derived from this analysis provide valuable guidance for researchers and policymakers in shaping the next phase of NLP development.

  • Research Article
  • Cite Count Icon 4
  • 10.1007/s43681-024-00606-3
Speciesism in natural language processing research
  • Nov 27, 2024
  • AI and Ethics
  • Masashi Takeshita + 1 more

Natural Language Processing (NLP) research on AI Safety and social bias in AI has focused on safety for humans and social bias against human minorities. However, some AI ethicists have argued that the moral significance of nonhuman animals has been ignored in AI research. Therefore, the purpose of this study is to investigate whether there is speciesism, i.e., discrimination against nonhuman animals, in NLP research. First, we explain why nonhuman animals are relevant in NLP research. Next, we survey the findings of existing research on speciesism in NLP researchers, data, and models and further investigate this problem in this study. The findings of this study suggest that speciesism exists within researchers, data, and models, respectively. Specifically, our survey and experiments show that (a) among NLP researchers, even those who study social bias in AI, do not recognize speciesism or speciesist bias; (b) among NLP data, speciesist bias is inherent in the data annotated in the datasets used to evaluate NLP models; (c) OpenAI GPTs, recent NLP models, exhibit speciesist bias by default. Finally, we discuss how we can reduce speciesism in NLP research.

  • Research Article
  • Cite Count Icon 37
  • 10.1080/0960085x.2020.1816145
From semantics to pragmatics: where IS can lead in Natural Language Processing (NLP) research
  • Sep 24, 2020
  • European Journal of Information Systems
  • Yan Li + 2 more

Natural Language Processing (NLP) is now widely integrated into web and mobile applications, enabling natural interactions between humans and computers. Although there is a large body of NLP studies published in Information Systems (IS), a comprehensive review of how NLP research is conceptualised and realised in the context of IS has not been conducted. To assess the current state of NLP research in IS, we use a variety of techniques to analyse a literature corpus comprising 356 NLP research articles published in IS journals between 2004 and 2018. Our analysis indicates the need to move from semantics to pragmatics. More importantly, our findings unpack the challenges and assumptions underlying current research trends in NLP. We argue that overcoming these challenges will require a renewed disciplinary IS focus. By proposing a roadmap of NLP research in IS, we draw attention to three NLP research perspectives and present future directions that IS researchers are uniquely positioned to address.

  • Research Article
  • Cite Count Icon 15
  • 10.1186/s12911-019-0778-z
A two-site survey of medical center personnel’s willingness to share clinical data for research: implications for reproducible health NLP research
  • Apr 1, 2019
  • BMC Medical Informatics and Decision Making
  • Chunhua Weng + 3 more

Background: A shareable repository of clinical notes is critical for advancing natural language processing (NLP) research, and therefore a goal of many NLP researchers is to create a shareable repository of clinical notes, that has breadth (from multiple institutions) as well as depth (as much individual data as possible). Methods: We aimed to assess the degree to which individuals would be willing to contribute their health data to such a repository. A compact e-survey probed willingness to share demographic and clinical data categories. Participants were faculty, staff, and students in two geographically diverse major medical centers (Utah and New York). Such a sample could be expected to respond like a typical potential participant from the general public who is given complete and fully informed consent about the pros and cons of participating in a research study. Results: Two thousand one hundred forty respondents completed the surveys. 56% of respondents were “somewhat/definitely willing” to share clinical data with identifiers, while 89% of respondents were “somewhat (17%)/definitely willing (72%)” to share without identifiers. Results were consistent across gender, age, and education, but there were some differences by geographical region. Individuals were most reluctant (50–74%) sharing mental health, substance abuse, and domestic violence data. Conclusions: We conclude that a substantial fraction of potential patient participants, once educated about risks and benefits, would be willing to donate de-identified clinical data to a shared research repository. A slight majority even would be willing to share absent de-identification, suggesting that perceptions about data misuse are not a major concern. Such a repository of clinical notes should be invaluable for clinical NLP research and advancement.

  • Research Article
  • Cite Count Icon 21
  • 10.1162/coli_a_00420
Natural Language Processing and Computational Linguistics
  • Dec 7, 2021
  • Computational Linguistics
  • Junichi Tsujii

  • Research Article
  • 10.14445/23488387/ijcse-v11i6p101
English
  • Jun 30, 2024
  • International Journal of Computer Science and Engineering
  • Ekambaram Kesavulu Reddy

Recent years have been an active testing ground for artificial neural networks for language understanding, a very important aspect of NLP. In this respect, emerging NLP technologies are largely motivated by the rising requirements to cope with the issues raised by different NLP tasks, allowing the processing and analysis of large text data samples, uncovering complex language behaviors, as well as extracting valuable information from disorganized text. NLP (Natural Language Processing) has proven to be the most successful field of machine learning thanks to its capability to teach itself and detect all kinds of features on its own based on enormous amounts of data. In NLP tasks like language modelling, text classification, emotion analysis, and machine translation, RNNs, CNNs, and transformer-based models have been used in new ways. While the difficulties NLP faces are generally agreed upon, the progress of technology also gives rise to unexpected challenges. Thus, two factors, namely the expanding collections of large text datasets and the resulting pressing need for more accurate and time-saving NLP models, are giving rise to new kinds of deep learning models and techniques. This paper analyzes as a whole the most recent achievements of neural architectures for natural language processing applications. The discussion introduces current models and approaches in NLP, highlights their strengths and weaknesses, and identifies the areas to be researched in the future.
Then, this paper will go on and investigate the of one in NLP, together with the importance of constantly improving architectures which are responsible for tackling these hard tasks. Subsequently, it will discuss the recent breakthroughs in deep learning models, namely RNNs, CNNs, transformer-based models, and attention mechanisms.
Finally, this paper will cover the ever-evolving roofline in NLP research, including transfer learning, self-supervised learning, and multimodal learning. Moreover, this paper will also underline the current shortcomings of existing NLP models and locate the themes where research needs to be reevaluated. Through this review of deep learning architectures for NLP, the article offers a full-range overview of recent advances in deep learning and is developed as a valuable corpus for researchers, practitioners, and students in the field of NLP.

  • Conference Article
  • Cite Count Icon 406
  • 10.18653/v1/p18-1128
The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing
  • Jan 1, 2018
  • Rotem Dror + 3 more

Statistical significance testing is a standard statistical tool designed to ensure that experimental results are not coincidental. In this opinion/theoretical paper we discuss the role of statistical significance testing in Natural Language Processing (NLP) research. We establish the fundamental concepts of significance testing and discuss the specific aspects of NLP tasks, experimental setups and evaluation measures that affect the choice of significance tests in NLP research. Based on this discussion we propose a simple practical protocol for statistical significance test selection in NLP setups and accompany this protocol with a brief survey of the most relevant tests. We then survey recent empirical papers published in ACL and TACL during 2017 and show that while our community assigns great value to experimental results, statistical significance testing is often ignored or misused. We conclude with a brief discussion of open issues that should be properly addressed so that this important tool can be applied in NLP research in a statistically sound manner.
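
As an illustration of what a significance test between two NLP systems can look like, here is a minimal sketch of a paired permutation (sign-flip) test in Python. The abstract does not say which tests the proposed protocol recommends, so this particular test, the function name `paired_permutation_test`, and its parameters are assumptions chosen for illustration. It expects one score per test example for each system, computed on the same examples.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided paired permutation (sign-flip) test on the summed score difference.

    Hypothetical helper for illustration; returns an estimated p-value.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the two systems are exchangeable,
        # so each paired difference is equally likely to have either sign.
        resampled = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(resampled) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)
```

A small p-value suggests the observed difference would be unlikely if the two systems were interchangeable. Whether this test suits a given evaluation measure depends on the setup, which is the kind of choice the paper's protocol addresses.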

  • Research Article
  • 10.64751/ijdim.2025.v4.n4(1).pp15-20
CATEGORY-BASED SENTIMENT ANALYSIS OF SINDHI NEWS HEADLINES USING MACHINE LEARNING DEEP LEARNING AND TRANSFORMER MODELS
  • Nov 22, 2025
  • International Journal of Data Science and IoT Management System
  • Dr Rakesh + 3 more

The rapid growth of digital content has made sentiment analysis (SA) an essential tool for understanding public sentiment and classifying textual data. Despite significant progress in natural language processing (NLP), low-resource languages, particularly Sindhi, remain underexplored due to the lack of computational tools and annotated datasets. This study addresses this gap by introducing the Sindhi News Headlines Dataset (SNHD), a novel corpus annotated for both SA and category classification across eight categories: Crime, Economy, Entertainment, Health, Politics, Science & Technology, Social, and Sports. To evaluate the effectiveness of different machine learning (ML), deep learning (DL), and transformer-based approaches, we conduct a comparative analysis of various models on SA and category classification tasks. Furthermore, we leverage Explainable Artificial Intelligence (XAI) techniques, such as Local Interpretable Model-Agnostic Explanations (LIME), to gain insights into model decision-making. Experimental results show that traditional ML models outperform DL and transformer-based models on the SNHD dataset. Specifically, Support Vector Machines with Radial Basis Function (SVM-RBF) achieves the highest performance for SA (0.74 accuracy and weighted F-score), while the Ridge Classifier (RC) delivers the best results for category classification (0.84 accuracy and weighted F-score). Among transformer models, XLM-RoBERTa demonstrates strong performance in category classification (0.82 accuracy and weighted F-score). These findings establish a benchmark for future research in Sindhi NLP and highlight the potential of hybrid approaches in tackling challenges associated with low-resource languages. This work provides a foundational resource for NLP researchers seeking to advance computational methods for Sindhi and similar underrepresented languages.

  • Research Article
  • Cite Count Icon 3
  • 10.55124/jaim.v2i1.238
Enhancing Named Entity Recognition in Low-Resource Dravidian Languages A Comparative Analysis of Multilingual Learning and Transfer Learning Techniques
  • Jan 1, 2024
  • Journal of Artificial intelligence and Machine Learning
  • Kiranmaye Panchadara

Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP), yet its performance in low-resource languages, such as Kannada, Malayalam, Tamil, and Telugu of the Dravidian language family, remains challenging due to limited linguistic resources. Originating from India, these languages represent a rich linguistic diversity, but they often lack adequate resources for technological advancements. In this study, we explore methods to enhance NER performance in these low-resource Indian languages using multilingual learning and transfer learning techniques. Leveraging mBERT, RoBERTa, and XLM-RoBERTa algorithms, we conduct a comprehensive analysis. Initially, we evaluate each algorithm's performance on individual languages, obtaining accuracy scores. Subsequently, we merged datasets from pairs of languages to investigate cross-lingual transfer learning. For instance, combining Kannada and Tamil datasets yields a better accuracy, surpassing Kannada's standalone accuracy. We repeat this process for Tamil, Malayalam, and Telugu subsequently, assessing both individual and multilingual accuracies. Our experiments provide insights into the efficacy of multilingual learning and transfer learning across diverse Dravidian languages, contributing to bridging the technological gap between urban and rural communities in India. By analyzing the impact of algorithm choice and cross-lingual transfer, we uncover valuable findings to advance NER performance in underrepresented languages. This study demonstrates the potential of technological advancements to empower diverse linguistic communities and foster inclusivity in NLP research and applications.

  • Research Article
  • Cite Count Icon 4
  • 10.1371/journal.pcbi.1012755
The role of chromatin state in intron retention: A case study in leveraging large scale deep learning models.
  • Jan 10, 2025
  • PLoS computational biology
  • Ahmed Daoud + 1 more

Complex deep learning models trained on very large datasets have become key enabling tools for current research in natural language processing and computer vision. By providing pre-trained models that can be fine-tuned for specific applications, they enable researchers to create accurate models with minimal effort and computational resources. Large scale genomics deep learning models come in two flavors: the first are large language models of DNA sequences trained in a self-supervised fashion, similar to the corresponding natural language models; the second are supervised learning models that leverage large scale genomics datasets from ENCODE and other sources. We argue that these models are the equivalent of foundation models in natural language processing in their utility, as they encode within them chromatin state in its different aspects, providing useful representations that allow quick deployment of accurate models of gene regulation. We demonstrate this premise by leveraging the recently created Sei model to develop simple, interpretable models of intron retention, and demonstrate their advantage over models based on the DNA language model DNABERT-2. Our work also demonstrates the impact of chromatin state on the regulation of intron retention. Using representations learned by Sei, our model is able to discover the involvement of transcription factors and chromatin marks in regulating intron retention, providing better accuracy than a recently published custom model developed for this purpose.

  • Research Article
  • 10.63163/jpehss.v4i1.1260
Text Classification Performance Analysis through Machine Learning and Deep Learning on the Low-Resourced Balochi Language
  • Mar 31, 2026
  • Physical Education, Health and Social Sciences
  • Muhammad Ameen Chhajro + 5 more

Text classification is a crucial task in Natural Language Processing (NLP). The purpose of text classification research is to classify the text into pre-defined classes automatically. Low-resource languages still receive less attention in NLP tasks due to the scarcity of publicly annotated datasets and computational resources. Similarly, Balochi, a low-resource language with a 2500-year history and cultural significance, has not been considered much for the development of NLP applications. This research study implements a text classification task in Balochi and compares machine learning, Deep Learning, and Transformer-based models. Balochi-language’s unlabelled dataset of approximately 5.5k sentences was collected, and various pre-processing techniques, including tokenization, stop words removal, and text normalization, were applied. The experimental results of this research conclude that, among machine learning models, the SGD classifier achieved the highest accuracy of 98.83%. Among Deep Learning models, the BiLSTM achieved the highest accuracy of 98%. However, the Transformer-based model, the pre-trained XLM-RoBERTa, performed exceptionally well, achieving 99% accuracy on the Balochi classification task. These research findings provide a foundation for future multilingual pre-trained models for low-resource languages and aim to develop consistent Balochi language models for NLP applications.

  • Conference Article
  • Cite Count Icon 38
  • 10.1109/bracis.2016.071
Authorship Attribution via Network Motifs Identification
  • Oct 1, 2016
  • Vanessa Queiroz Marinho + 2 more

Concepts and methods of complex networks can be used to analyse texts at their different complexity levels. Examples of natural language processing (NLP) tasks studied via topological analysis of networks are keyword identification, automatic extractive summarization and authorship attribution. Even though a myriad of network measurements have been applied to study the authorship attribution problem, the use of motifs for text analysis has been restricted to a few works. The goal of this paper is to apply the concept of motifs, recurrent interconnection patterns, in the authorship attribution task. The absolute frequencies of all thirteen directed motifs with three nodes were extracted from the co-occurrence networks and used as classification features. The effectiveness of these features was verified with four machine learning methods. The results show that motifs are able to distinguish the writing style of different authors. In our best scenario, 57.5% of the books were correctly classified. The chance baseline for this problem is 12.5%. In addition, we have found that function words play an important role in these recurrent patterns. Taken together, our findings suggest that motifs should be further explored in other related linguistic tasks.
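
The motif-counting idea can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the paper's implementation: the network here links each word to the word that immediately follows it, motifs are counted as induced subgraphs on every connected set of three words, and the names `cooccurrence_edges`, `triad_code`, and `motif_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations, permutations


def cooccurrence_edges(tokens):
    """Directed edges from each word to the word that follows it (assumed network model)."""
    return {(a, b) for a, b in zip(tokens, tokens[1:]) if a != b}


def triad_code(nodes, edges):
    """Canonical label for the directed subgraph induced on three nodes.

    The label is the smallest adjacency pattern over all node orderings,
    so isomorphic triads receive the same label.
    """
    return min(
        tuple((p[i], p[j]) in edges for i in range(3) for j in range(3) if i != j)
        for p in permutations(nodes)
    )


def motif_counts(tokens):
    """Absolute frequency of each connected three-node directed motif."""
    edges = cooccurrence_edges(tokens)
    nodes = sorted({n for e in edges for n in e})
    counts = Counter()
    for triple in combinations(nodes, 3):
        # Three nodes are weakly connected when at least two of the three pairs are linked.
        linked = sum(
            1 for a, b in combinations(triple, 2) if (a, b) in edges or (b, a) in edges
        )
        if linked >= 2:
            counts[triad_code(triple, edges)] += 1
    return counts
```

The resulting counts form a feature vector with at most thirteen entries, one per connected directed triad type, which can be fed to a classifier. The brute-force loop over all triples is cubic in vocabulary size, so a real implementation for book-length texts would need a more efficient enumeration.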

  • Conference Article
  • Cite Count Icon 10
  • 10.1145/3342827.3342834
Authorship Attribution of Russian Forum Posts with Different Types of N-gram Features
  • Jun 28, 2019
  • Tatiana Litvinova + 2 more

Authorship attribution is an important field in online security. Recently there have been numerous successful works in authorship attribution in various European languages. Character n-grams are reported to be the best choice in authorship attribution, as they encode both style and content information. We evaluate different types of character n-gram features in an authorship attribution task in a real-world noisy dataset of Russian forum posts. We also supplement them with a number of new simple n-gram features capturing syntactic and discourse patterns. We perform authorship attribution in a single-topic and a cross-topic setting, as the research question is whether character n-grams capture both style and content information. Our results show that character n-grams are indeed very successful in Russian forum post authorship attribution. However, there is no clear distinction between style and content n-grams, as the same types of n-grams work well for both single-topic and cross-topic settings. In our experiments, the generalized simple n-gram features, which reveal syntactic and discourse patterns, also proved to be very important in authorship attribution of short informal Russian texts. They represent a different kind of authorship information and are a successful addition to the character n-grams in authorship attribution of forum texts in the Russian language.
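
For readers unfamiliar with character n-grams, the basic extraction step can be sketched as follows. This is a generic illustration, not the typed n-gram scheme evaluated in the paper; the function name `char_ngram_profile` and the default `n = 3` are assumptions.

```python
from collections import Counter


def char_ngram_profile(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams, keeping spaces and punctuation.

    Hypothetical helper for illustration.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the text one character at a time.
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

Because the window slides over raw characters, the profile picks up word fragments, affixes, and punctuation habits alike, which is one reason such features are said to mix style and content information.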
