Automatic authorship attribution in Albanian texts
Automatic authorship identification is a challenging task that has been the focus of extensive research in natural language processing. Despite the progress made in attributing authorship, the lack of corpora in under-resourced languages impedes the advancement and evaluation of current methods. To address this gap, we investigate the problem of authorship attribution in Albanian. We introduce a newly compiled corpus of Albanian newsroom columns and literary works and analyze machine-learning methods for detecting authorship. We create a set of hand-crafted features targeting various categories (lexical, morphological, and structural) relevant to Albanian and experiment with multiple classifiers using two different multiclass classification strategies. Furthermore, we compare our results to those obtained using deep learning models. Our investigation focuses on identifying the best combination of features and classification methods. The results reveal that lexical features are the most effective set of linguistic features, significantly improving the performance of various algorithms in the authorship attribution task. Among the machine learning algorithms evaluated, XGBoost demonstrated the best overall performance, achieving an F1 score of 0.982 on literary works and 0.905 on newsroom columns. Additionally, deep learning models such as fastText and BERT-multilingual showed promising results, highlighting their potential applicability in specific scenarios in Albanian writings. These findings contribute to the understanding of effective methods for authorship attribution in low-resource languages and provide a robust framework for future research in this area. The careful analysis of the different scenarios and the conclusions drawn from the results provide valuable insights into the potential and limitations of the methods and highlight the challenges in detecting authorship in Albanian.
Promising results are reported, with implications for improving the methods used in Albanian authorship attribution. This study provides a valuable resource for future research and a reference for researchers in this domain.
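As a rough illustration of what hand-crafted lexical features can look like, the sketch below computes a few common stylometric measures for one document. The function name `lexical_features` and the particular features chosen are assumptions made for illustration; the abstract does not list the features the authors actually used.

```python
import re
from collections import Counter


def lexical_features(text: str) -> dict:
    """Compute a few simple lexical features for one document (illustrative only)."""
    # For str patterns, \w is Unicode-aware in Python 3, so accented letters count as word characters.
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return {"n_tokens": 0, "type_token_ratio": 0.0,
                "avg_word_length": 0.0, "hapax_ratio": 0.0}
    counts = Counter(tokens)
    # Hapax legomena: word types that occur exactly once in the document.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "n_tokens": len(tokens),
        "type_token_ratio": len(counts) / len(tokens),
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens),
        "hapax_ratio": hapaxes / len(tokens),
    }
```

Feature dictionaries of this kind could then be turned into vectors and passed to any multiclass classifier.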
- Research Article
4
- 10.1371/journal.pone.0310057
- Oct 22, 2024
- PloS one
- Research Article
- 10.52783/jes.1506
- Apr 4, 2024
- Journal of Electrical Systems
This paper describes in detail the Universal Parts of Speech (UPoS) tagged dataset for the Assamese language. A PoS-tagged dataset in a language is crucial for experimenting and creating resources for various Natural Language Processing (NLP) and AI research. With the growing usage of Universal Dependency standards, datasets tagged with Universal PoS tags are becoming essential for contemporary experiments in the NLP community. NLP research in Assamese, an Indo-Aryan language, is relatively new, and the language is considered a low-resource language. The dataset of UPoS-tagged Assamese text is created with the aim of enriching this low-resource language for NLP and AI tasks. The dataset contains 283506 tokens of Assamese vocabulary in a total of 20280 sentences, tagged with the 17 standard UPoS tags of core lexical categories. The raw data are taken from an open-source corpus originally tagged with the BIS tagset. The original 453457 tokens in 29504 sentences were reduced, after data filtering, to this clean resource of 283506 tokens. The mapping of lexical categories from the BIS to the UPoS tagset was done with linguistic expertise. The mapped pattern was used for a first-level conversion of BIS tags to UPoS tags. Linguistic validation was also performed with linguistic experts, and inter-annotator agreements and disagreements were recorded. A second level of validation resolved the disagreements, producing the final version of the dataset. This Assamese UPoS-tagged dataset is the first of its kind with UPoS annotations and shall serve a wider Assamese NLP research community for model training using Machine Learning/Deep Learning techniques.
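The first-level conversion described above can be sketched as a dictionary lookup that flags any tag without a mapping for manual review. The mapping table `ILLUSTRATIVE_BIS_TO_UPOS` is a small hypothetical subset written for this example, not the mapping the authors produced, and `convert_sentence` is an assumed helper name.

```python
# Hypothetical, partial BIS -> UPoS mapping for illustration only.
ILLUSTRATIVE_BIS_TO_UPOS = {
    "N_NN": "NOUN",
    "N_NNP": "PROPN",
    "V_VM": "VERB",
    "V_VAUX": "AUX",
    "JJ": "ADJ",
    "RB": "ADV",
    "PR_PRP": "PRON",
    "PSP": "ADP",
    "RD_PUNC": "PUNCT",
}


def convert_sentence(tagged, mapping=ILLUSTRATIVE_BIS_TO_UPOS):
    """Convert (token, BIS tag) pairs to (token, UPoS tag) pairs.

    Returns the converted sentence and the indices of tokens whose BIS tag
    has no mapping, so they can be passed on to manual validation.
    """
    converted, unresolved = [], []
    for index, (token, bis_tag) in enumerate(tagged):
        upos = mapping.get(bis_tag)
        if upos is None:
            unresolved.append(index)  # leave the tag empty and flag the token
        converted.append((token, upos))
    return converted, unresolved
```

Keeping unresolved tokens separate, rather than guessing a tag, leaves ambiguous cases to the expert validation step.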
- Research Article
3
- 10.64539/sjer.v1i1.2025.6
- Jan 3, 2025
- Scientific Journal of Engineering Research
This study presents a bibliometric analysis of Natural Language Processing (NLP) and classification research, examining trends, impacts, and future directions. NLP, a key field in artificial intelligence, focuses on enabling computers to process and understand human language through tasks such as text classification, sentiment analysis, and speech recognition. Classification plays a crucial role in organizing textual data, facilitating applications like spam detection and content recommendation. The research employs bibliometric analysis to evaluate publication trends, citation networks, and emerging themes from 1992 to 2025. Using data retrieved from Scopus, descriptive statistical analysis and bibliometric mapping with VOSviewer reveal key contributors, influential publications, and subject area distributions. Findings indicate a significant rise in NLP research, with deep learning models, particularly transformers, driving advancements in the field. The study highlights dominant research areas, including computer science, engineering, and medicine, and identifies leading countries in NLP research, such as the United States, China, and India. Additionally, ethical concerns, including bias and fairness in NLP applications, are discussed as critical challenges for future research. The insights derived from this analysis provide valuable guidance for researchers and policymakers in shaping the next phase of NLP development.
- Research Article
4
- 10.1007/s43681-024-00606-3
- Nov 27, 2024
- AI and Ethics
Natural Language Processing (NLP) research on AI Safety and social bias in AI has focused on safety for humans and social bias against human minorities. However, some AI ethicists have argued that the moral significance of nonhuman animals has been ignored in AI research. Therefore, the purpose of this study is to investigate whether there is speciesism, i.e., discrimination against nonhuman animals, in NLP research. First, we explain why nonhuman animals are relevant in NLP research. Next, we survey the findings of existing research on speciesism in NLP researchers, data, and models and further investigate this problem in this study. The findings of this study suggest that speciesism exists within researchers, data, and models, respectively. Specifically, our survey and experiments show that (a) NLP researchers, even those who study social bias in AI, do not recognize speciesism or speciesist bias; (b) speciesist bias is inherent in the annotations of the datasets used to evaluate NLP models; (c) OpenAI GPTs, which are recent NLP models, exhibit speciesist bias by default. Finally, we discuss how we can reduce speciesism in NLP research.
- Research Article
37
- 10.1080/0960085x.2020.1816145
- Sep 24, 2020
- European Journal of Information Systems
Natural Language Processing (NLP) is now widely integrated into web and mobile applications, enabling natural interactions between humans and computers. Although there is a large body of NLP studies published in Information Systems (IS), a comprehensive review of how NLP research is conceptualised and realised in the context of IS has not been conducted. To assess the current state of NLP research in IS, we use a variety of techniques to analyse a literature corpus comprising 356 NLP research articles published in IS journals between 2004 and 2018. Our analysis indicates the need to move from semantics to pragmatics. More importantly, our findings unpack the challenges and assumptions underlying current research trends in NLP. We argue that overcoming these challenges will require a renewed disciplinary IS focus. By proposing a roadmap of NLP research in IS, we draw attention to three NLP research perspectives and present future directions that IS researchers are uniquely positioned to address.
- Research Article
15
- 10.1186/s12911-019-0778-z
- Apr 1, 2019
- BMC Medical Informatics and Decision Making
Background: A shareable repository of clinical notes is critical for advancing natural language processing (NLP) research, and therefore a goal of many NLP researchers is to create a shareable repository of clinical notes, that has breadth (from multiple institutions) as well as depth (as much individual data as possible). Methods: We aimed to assess the degree to which individuals would be willing to contribute their health data to such a repository. A compact e-survey probed willingness to share demographic and clinical data categories. Participants were faculty, staff, and students in two geographically diverse major medical centers (Utah and New York). Such a sample could be expected to respond like a typical potential participant from the general public who is given complete and fully informed consent about the pros and cons of participating in a research study. Results: Two thousand one hundred forty respondents completed the surveys. 56% of respondents were “somewhat/definitely willing” to share clinical data with identifiers, while 89% of respondents were “somewhat (17%)/definitely willing (72%)” to share without identifiers. Results were consistent across gender, age, and education, but there were some differences by geographical region. Individuals were most reluctant (50–74%) sharing mental health, substance abuse, and domestic violence data. Conclusions: We conclude that a substantial fraction of potential patient participants, once educated about risks and benefits, would be willing to donate de-identified clinical data to a shared research repository. A slight majority even would be willing to share absent de-identification, suggesting that perceptions about data misuse are not a major concern. Such a repository of clinical notes should be invaluable for clinical NLP research and advancement.
- Research Article
21
- 10.1162/coli_a_00420
- Dec 7, 2021
- Computational Linguistics
Natural Language Processing and Computational Linguistics
- Research Article
- 10.14445/23488387/ijcse-v11i6p101
- Jun 30, 2024
- International Journal of Computer Science and Engineering
Recent years have been an active testing ground for artificial neural networks for language understanding, a very important aspect of NLP. Emerging NLP technologies are largely motivated by the rising need to cope with the issues raised by different NLP tasks: processing and analyzing large text data samples, uncovering complex language behaviors, and extracting valuable information from disorganized text. NLP (Natural Language Processing) has proven to be the most successful field of machine learning thanks to the ability of its models to learn and detect all kinds of features on their own from enormous amounts of data. In NLP tasks like language modelling, text classification, emotion analysis, and machine translation, RNNs, CNNs, and transformer-based models have been used in new ways. While the difficulties NLP faces are generally agreed upon, the progress of technology also gives rise to unexpected challenges. Thus, two factors, namely the expanding collections of large text datasets and the resulting need for more accurate and time-saving NLP models, are giving rise to new kinds of deep learning models and techniques. This paper reviews the most recent achievements of neural architectures for natural language processing applications: it introduces current models and approaches in NLP, highlights their strengths and weaknesses, and identifies areas to be researched in the future. Then, this paper will go on and investigate the of one in NLP, together with the importance of constantly improving the architectures responsible for tackling these hard tasks. Subsequently, it discusses recent breakthroughs in deep learning models, namely RNNs, CNNs, transformer-based models, and attention mechanisms.
Finally, this paper covers the ever-evolving roofline in NLP research, including transfer learning, self-supervised learning, and multimodal learning. Moreover, it underlines the current shortcomings of existing NLP models and locates the themes where research needs to be reevaluated. Through this review of deep learning architectures for NLP, the article offers a full-range overview of recent advances in deep learning and is intended as a valuable resource for researchers, practitioners, and students in the field of NLP.
- Conference Article
406
- 10.18653/v1/p18-1128
- Jan 1, 2018
Statistical significance testing is a standard statistical tool designed to ensure that experimental results are not coincidental. In this opinion/theoretical paper we discuss the role of statistical significance testing in Natural Language Processing (NLP) research. We establish the fundamental concepts of significance testing and discuss the specific aspects of NLP tasks, experimental setups and evaluation measures that affect the choice of significance tests in NLP research. Based on this discussion we propose a simple practical protocol for statistical significance test selection in NLP setups and accompany this protocol with a brief survey of the most relevant tests. We then survey recent empirical papers published in ACL and TACL during 2017 and show that while our community assigns great value to experimental results, statistical significance testing is often ignored or misused. We conclude with a brief discussion of open issues that should be properly addressed so that this important tool can be applied in NLP research in a statistically sound manner.
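To make the idea of a significance test for comparing two NLP systems concrete, here is a minimal sketch of a paired sign-flip permutation test on per-example scores. It is one common sampling-based test, chosen here for illustration; it is not presented as the protocol the paper proposes, and the function name and defaults are assumptions.

```python
import random


def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided paired sign-flip permutation test on per-example scores.

    Returns an estimated p-value for the null hypothesis that systems A and B
    perform equally, i.e. that each per-example difference is equally likely
    to have either sign.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    extreme = 0
    for _ in range(n_resamples):
        # Randomly swap the two systems' scores on each example.
        total = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(total) / len(diffs) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)
```

The inputs would typically be per-sentence or per-document scores of two systems on the same test set; a small p-value suggests the observed difference is unlikely under the null hypothesis.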
- Research Article
- 10.64751/ijdim.2025.v4.n4(1).pp15-20
- Nov 22, 2025
- International Journal of Data Science and IoT Management System
The rapid growth of digital content has made sentiment analysis (SA) an essential tool for understanding public sentiment and classifying textual data. Despite significant progress in natural language processing (NLP), low-resource languages, particularly Sindhi, remain underexplored due to the lack of computational tools and annotated datasets. This study addresses this gap by introducing the Sindhi News Headlines Dataset (SNHD), a novel corpus annotated for both SA and category classification across eight categories: Crime, Economy, Entertainment, Health, Politics, Science & Technology, Social, and Sports. To evaluate the effectiveness of different machine learning (ML), deep learning (DL), and transformer-based approaches, we conduct a comparative analysis of various models on SA and category classification tasks. Furthermore, we leverage Explainable Artificial Intelligence (XAI) techniques, such as Local Interpretable Model-Agnostic Explanations (LIME), to gain insights into model decision-making. Experimental results show that traditional ML models outperform DL and transformer-based models on the SNHD dataset. Specifically, Support Vector Machines with Radial Basis Function (SVM-RBF) achieves the highest performance for SA (0.74 accuracy and weighted F-score), while the Ridge Classifier (RC) delivers the best results for category classification (0.84 accuracy and weighted F-score). Among transformer models, XLM-RoBERTa demonstrates strong performance in category classification (0.82 accuracy and weighted F-score). These findings establish a benchmark for future research in Sindhi NLP and highlight the potential of hybrid approaches in tackling challenges associated with low-resource languages. This work provides a foundational resource for NLP researchers seeking to advance computational methods for Sindhi and similar underrepresented languages.
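The weighted F-score reported above averages per-class F1 scores, weighting each class by its number of true instances. The pure-Python sketch below shows that computation under the standard definition; the function name `weighted_f1` is an assumption, and scikit-learn's `f1_score` with `average="weighted"` is the usual library route.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    support = Counter(y_true)  # number of true instances per class
    total = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / n
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        total += f1 * n  # weight each class by its support
    return total / len(y_true)
```

Because classes are weighted by support, frequent classes dominate the score, which matters when categories are imbalanced.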
- Research Article
3
- 10.55124/jaim.v2i1.238
- Jan 1, 2024
- Journal of Artificial intelligence and Machine Learning
Named Entity Recognition (NER) is a crucial task in Natural Language Processing (NLP), yet its performance in low-resource languages, such as Kannada, Malayalam, Tamil, and Telugu of the Dravidian language family, remains challenging due to limited linguistic resources. Originating from India, these languages represent a rich linguistic diversity, but they often lack adequate resources for technological advancements. In this study, we explore methods to enhance NER performance in these low-resource Indian languages using multilingual learning and transfer learning techniques. Leveraging mBERT, RoBERTa, and XLM-RoBERTa algorithms, we conduct a comprehensive analysis. Initially, we evaluate each algorithm's performance on individual languages, obtaining accuracy scores. Subsequently, we merged datasets from pairs of languages to investigate cross-lingual transfer learning. For instance, combining Kannada and Tamil datasets yields a better accuracy, surpassing Kannada's standalone accuracy. We repeat this process for Tamil, Malayalam, and Telugu subsequently, assessing both individual and multilingual accuracies. Our experiments provide insights into the efficacy of multilingual learning and transfer learning across diverse Dravidian languages, contributing to bridging the technological gap between urban and rural communities in India. By analyzing the impact of algorithm choice and cross-lingual transfer, we uncover valuable findings to advance NER performance in underrepresented languages. This study demonstrates the potential of technological advancements to empower diverse linguistic communities and foster inclusivity in NLP research and applications.
- Research Article
4
- 10.1371/journal.pcbi.1012755
- Jan 10, 2025
- PLoS computational biology
Complex deep learning models trained on very large datasets have become key enabling tools for current research in natural language processing and computer vision. By providing pre-trained models that can be fine-tuned for specific applications, they enable researchers to create accurate models with minimal effort and computational resources. Large scale genomics deep learning models come in two flavors: the first are large language models of DNA sequences trained in a self-supervised fashion, similar to the corresponding natural language models; the second are supervised learning models that leverage large scale genomics datasets from ENCODE and other sources. We argue that these models are the equivalent of foundation models in natural language processing in their utility, as they encode within them chromatin state in its different aspects, providing useful representations that allow quick deployment of accurate models of gene regulation. We demonstrate this premise by leveraging the recently created Sei model to develop simple, interpretable models of intron retention, and demonstrate their advantage over models based on the DNA language model DNABERT-2. Our work also demonstrates the impact of chromatin state on the regulation of intron retention. Using representations learned by Sei, our model is able to discover the involvement of transcription factors and chromatin marks in regulating intron retention, providing better accuracy than a recently published custom model developed for this purpose.
- Research Article
- 10.63163/jpehss.v4i1.1260
- Mar 31, 2026
- Physical Education, Health and Social Sciences
Text classification is a crucial task in Natural Language Processing (NLP). The purpose of text classification research is to classify text into pre-defined classes automatically. Low-resource languages still receive less attention in NLP tasks due to the scarcity of publicly annotated datasets and computational resources. Similarly, Balochi, a low-resource language with a 2500-year history and cultural significance, has not been considered much for the development of NLP applications. This research study implements a text classification task in Balochi and compares machine learning, Deep Learning, and Transformer-based models. An unlabelled Balochi-language dataset of approximately 5.5k sentences was collected, and various pre-processing techniques, including tokenization, stop-word removal, and text normalization, were applied. The experimental results show that, among machine learning models, the SGD classifier achieved the highest accuracy of 98.83%. Among Deep Learning models, the BiLSTM achieved the highest accuracy of 98%. However, the Transformer-based model, the pre-trained XLM-RoBERTa, performed exceptionally well, achieving 99% accuracy on the Balochi classification task. These research findings provide a foundation for future multilingual pre-trained models for low-resource languages and aim to develop consistent Balochi language models for NLP applications.
- Conference Article
38
- 10.1109/bracis.2016.071
- Oct 1, 2016
Concepts and methods of complex networks can be used to analyse texts at their different complexity levels. Examples of natural language processing (NLP) tasks studied via topological analysis of networks are keyword identification, automatic extractive summarization and authorship attribution. Even though a myriad of network measurements have been applied to study the authorship attribution problem, the use of motifs for text analysis has been restricted to a few works. The goal of this paper is to apply the concept of motifs, recurrent interconnection patterns, in the authorship attribution task. The absolute frequencies of all thirteen directed motifs with three nodes were extracted from the co-occurrence networks and used as classification features. The effectiveness of these features was verified with four machine learning methods. The results show that motifs are able to distinguish the writing style of different authors. In our best scenario, 57.5% of the books were correctly classified. The chance baseline for this problem is 12.5%. In addition, we have found that function words play an important role in these recurrent patterns. Taken together, our findings suggest that motifs should be further explored in other related linguistic tasks.
- Conference Article
10
- 10.1145/3342827.3342834
- Jun 28, 2019
Authorship attribution is an important field in online security. Recently there have been numerous successful works in authorship attribution in various European languages. Character n-grams are reported to be the best choice in authorship attribution, as they encode both style and content information. We evaluate different types of character n-gram features in an authorship attribution task in a real-world noisy dataset of Russian forum posts. We also supplement them with a number of new simple n-gram features capturing syntactic and discourse patterns. We perform authorship attribution in a single-topic and a cross-topic setting, as the research question is whether character n-grams capture both style and content information. Our results show that character n-grams are indeed very successful in Russian forum post authorship attribution. However, there is no clear distinction between style and content n-grams, as the same types of n-grams work well for both single-topic and cross-topic settings. In our experiments, the generalized simple n-gram features, which reveal syntactic and discourse patterns, also proved very important in authorship attribution of short informal Russian texts. They represent a different kind of authorship information and are a successful addition to the character n-grams in authorship attribution of forum texts in the Russian language.
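A character n-gram profile of the kind discussed above can be sketched in a few lines. This is a generic illustration, not the feature extraction used in the paper; the function name `char_ngram_profile`, the lowercasing, and the whitespace padding are all assumptions.

```python
from collections import Counter


def char_ngram_profile(text, n=3, top_k=None):
    """Relative frequencies of character n-grams in a text.

    Whitespace is collapsed and the text is padded with one space on each
    side so that word-initial and word-final n-grams are captured.
    """
    padded = f" {' '.join(text.lower().split())} "
    grams = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    total = sum(grams.values())
    if total == 0:
        return {}
    # most_common(None) returns all n-grams; otherwise only the top_k most frequent.
    return {gram: count / total for gram, count in grams.most_common(top_k)}
```

Profiles from different posts can be aligned on a shared n-gram vocabulary to form feature vectors for a classifier. Since the function works on Unicode strings, Cyrillic text is handled the same way as Latin text.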