DAAB: Deep Authorship Attribution in Bengali

Abstract

Authorship attribution identifies the true author of a document whose authorship is unknown. It plays a crucial role in plagiarism detection and blackmailer identification; however, existing studies on authorship attribution in Bengali are limited. In this paper, we propose an instance-based deep authorship attribution model, called DAAB, to identify authors in Bengali. DAAB fuses features from convolutional neural networks with another set of features from an artificial neural network to learn the stylometry of an author. Extensive experiments on three real benchmark datasets, namely Bengali-Quora and two online Bengali corpora, demonstrate the superiority of our authorship attribution model.
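
The fusion idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the DAAB architecture, so everything below is an assumption made for illustration: the layer sizes, the random weights, the choice of concatenation as the fusion step, and the hypothetical names `conv1d_maxpool`, `dense`, and `fused_forward`. It shows only a forward pass in which convolutional features over a token sequence are joined with features from a small dense network over hand-crafted style features, then mapped to a probability per candidate author.

```python
import math
import random


def conv1d_maxpool(seq, kernels):
    """Slide each kernel over the sequence, apply ReLU, keep the max per kernel.

    seq: list of T embedding vectors, each of length D.
    kernels: list of filters, each a list of `width` vectors of length D.
    """
    dim = len(seq[0])
    feats = []
    for kernel in kernels:
        width = len(kernel)
        best = 0.0  # starting at 0 is equivalent to ReLU followed by max-pooling
        for start in range(len(seq) - width + 1):
            act = sum(kernel[i][d] * seq[start + i][d]
                      for i in range(width) for d in range(dim))
            best = max(best, act)
        feats.append(best)
    return feats


def dense(x, weights, bias):
    """Fully connected layer with ReLU; weights has shape (out, in)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [v / total for v in exps]


def fused_forward(seq, style, params):
    """Concatenate CNN features and dense-network features, then classify."""
    cnn_feats = conv1d_maxpool(seq, params["kernels"])
    ann_feats = dense(style, params["W_style"], params["b_style"])
    fused = cnn_feats + ann_feats  # list concatenation is the assumed fusion step
    logits = [sum(w * f for w, f in zip(row, fused)) + b
              for row, b in zip(params["W_out"], params["b_out"])]
    return softmax(logits)


# Illustrative sizes and random (untrained) weights; all values are assumptions.
rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


EMB_DIM, N_FILTERS, WIDTH, N_STYLE, N_HIDDEN, N_AUTHORS = 4, 3, 2, 5, 2, 3
params = {
    "kernels": [rand_matrix(WIDTH, EMB_DIM) for _ in range(N_FILTERS)],
    "W_style": rand_matrix(N_HIDDEN, N_STYLE),
    "b_style": [0.0] * N_HIDDEN,
    "W_out": rand_matrix(N_AUTHORS, N_FILTERS + N_HIDDEN),
    "b_out": [0.0] * N_AUTHORS,
}
seq = rand_matrix(10, EMB_DIM)                 # 10 tokens, each a 4-dim embedding
style = [rng.random() for _ in range(N_STYLE)]  # e.g. normalized stylometric counts
probs = fused_forward(seq, style, params)       # one probability per candidate author
```

With untrained weights the probabilities are meaningless; in practice both branches and the output layer would be trained jointly on labeled documents.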

Similar Papers
  • Conference Article
  • Cite count: 4
  • 10.1109/icasert.2019.8934492
Authorship Attribution in Bengali Literature using Convolutional Neural Networks with fastText’s word embedding model
  • May 1, 2019
  • Hemayet Ahmed Chowdhury + 3 more

Authorship attribution (AA) is the process of attempting to identify the likely authorship of a given document by analyzing previous works of the authors in question. This paper proposes deep neural network-based continuous skip-gram models for authorship attribution in Bengali literature. We present a data set of 2400 Bengali blog posts from 6 contemporary authors and compare the performance of traditional lexical n-gram based models to our proposed approaches. We achieve a best accuracy of more than 92% on the held-out dataset with a deep convolutional neural network that uses fastText skip-gram word embeddings as features, which outperforms the other traditional models examined in this paper on the Bengali language. The results provide a clear indication that extracting features from continuous word embeddings with the hidden layers of deep neural networks works better as a feature set for authorship attribution systems on Bengali literature than sparse lexical n-gram based features and shallow classifiers.

  • Conference Article
  • Cite count: 39
  • 10.1109/icsc.2012.46
Translate Once, Translate Twice, Translate Thrice and Attribute: Identifying Authors and Machine Translation Tools in Translated Text
  • Sep 1, 2012
  • Aylin Caliskan + 1 more

In this paper, we investigate the effects of machine translation tools on translated texts and the accuracy of authorship and translator attribution of translated texts. We show that the more translation performed on a text by a specific machine translation tool, the more effects unique to that translator are observed. We also propose a novel method to perform machine translator and authorship attribution of translated texts using a feature set that led to 91.13% and 91.54% accuracy on average, respectively. We claim that the features leading to highest accuracy in translator attribution are translator-dependent features and that even though translator-effect-heavy features are present in translated text, we can still succeed in authorship attribution. These findings demonstrate that stylometric features of the original text are preserved at some level despite multiple consequent translations and the introduction of translator-dependent features. The main contribution of our work is the discovery of a feature set used to accurately perform both translator and authorship attribution on a corpus of diverse topics from the twenty-first century, which has been consequently translated multiple times using machine translation tools.

  • Dissertation
  • Cite count: 1
  • 10.35662/unine-thesis-2876
An empirical comparison of recurrent neural network models on authorship analysis tasks
  • Jan 1, 2021
  • Nils Schaetti

In the last few years, a machine learning field named Deep-Learning (DL) has improved the results of several challenging tasks mainly in the field of computer vision. Deep architectures such as Convolutional Neural Networks (CNN) have been shown to be very powerful for computer vision tasks. For tasks related to language and time series, state-of-the-art models such as Long Short-Term Memory (LSTM) have a recurrent component that takes into account the order of inputs and is able to memorise them. Among these tasks related to Natural Language Processing (NLP), an important problem in computational linguistics is authorship attribution where the goal is to find the true author of a text or, in an author profiling perspective, to extract information such as gender, origin and socio-economic background. However, few works have tackled the issue of authorship analysis with recurrent neural networks (RNNs). Consequently, we have decided to explore in this study the performances of several recurrent neural models, such as Echo State Networks (ESN), LSTM and Gated Recurrent Units (GRU) on three authorship analysis tasks. The first is the classical authorship attribution task using the Reuters C50 dataset, where models have to predict the true author of a document in a set of candidate authors. The second task is referred to as author profiling, as the model must determine the gender (male/female) of the author of a set of tweets using the PAN 2017 dataset from the CLEF conference. The third task is referred to as author verification, using an in-house dataset named SFGram and composed of dozens of science-fiction magazines from the 50s to the 70s. This task is separated into two problems. In the first, the goal is to extract passages written by a particular author inside a magazine co-written by several dozen authors. The second is to find out if a magazine contains passages written by a particular author.
In order for our research to be applicable in authorship studies, we limited evaluated models to those with a so-called many-to-many architecture. This fulfills a fundamental constraint of the field of stylometry, which is the ability to provide evidence for each prediction made. To evaluate these three models, we defined a set of experiments, performance measures and hyperparameters that could impact the output. We carried out these experiments with each model and their corresponding hyperparameters. Then we used statistical tests to detect significant differences between these models, and with state-of-the-art baseline methods in authorship analysis. Our results show that shallow and simple RNNs such as ESNs can be competitive with traditional methods in authorship studies while keeping a learning time that can be used in practice and a reasonable number of parameters. These properties allow them to outperform much more complex neural models such as LSTMs and GRUs considered as state of the art in NLP. We also show that pretraining word and character features can be useful on stylometry problems if these are trained on a similar dataset. Consequently, interesting results are achievable on such tasks where the quantity of data is limited and therefore difficult to solve for deep learning methods. We also show that representations based on words and combinations of three characters (trigrams) are the most effective for this kind of method. Finally, we draw a landscape of possible research paths for the future of neural networks and deep learning methods in the field of authorship analysis.

  • Research Article
  • Cite count: 3
  • 10.22214/ijraset.2024.64168
Deep Learning for Stylometry and Authorship Attribution: a Review of Literature
  • Sep 30, 2024
  • International Journal for Research in Applied Science and Engineering Technology
  • Nishchal Sharma + 1 more

The application of deep learning techniques to stylometry and authorship attribution has emerged as a promising frontier in computational linguistics, offering new possibilities for understanding literary style and authorship in both historical and contemporary contexts. This review paper synthesizes recent advances in the use of deep learning models, such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and transformer architectures, for identifying and attributing authorship based on stylistic analysis. We examine the effectiveness of these models in comparison to traditional statistical methods, highlighting their ability to capture complex linguistic patterns and nuances that are often overlooked by conventional approaches. Furthermore, we explore how deep learning models handle challenges such as multilingual texts, limited data, and variations across genres and periods. This review also addresses the interpretability of neural networks in the context of stylometry and discusses the implications of these methods for fields ranging from literary studies to digital forensics. By providing a comprehensive overview of the current state of research, this paper identifies key trends, challenges, and future directions for the application of deep learning to stylometry and authorship attribution.

  • Research Article
  • Cite count: 24
  • 10.1038/s41598-021-97195-6
Research on improved convolutional wavelet neural network
  • Sep 9, 2021
  • Scientific Reports
  • Jingwei Liu + 4 more

Artificial neural networks (ANN) which include deep learning neural networks (DNN) have problems such as the local minimal problem of Back propagation neural network (BPNN), the unstable problem of Radial basis function neural network (RBFNN) and the limited maximum precision problem of Convolutional neural network (CNN). Performance (training speed, precision, etc.) of BPNN, RBFNN and CNN are expected to be improved. Main works are as follows: Firstly, based on existing BPNN and RBFNN, Wavelet neural network (WNN) is implemented in order to get better performance for further improving CNN. WNN adopts the network structure of BPNN in order to get faster training speed. WNN adopts the wavelet function as an activation function, whose form is similar to the radial basis function of RBFNN, in order to solve the local minimum problem. Secondly, WNN-based Convolutional wavelet neural network (CWNN) method is proposed, in which the fully connected layers (FCL) of CNN is replaced by WNN. Thirdly, comparative simulations based on MNIST and CIFAR-10 datasets among the discussed methods of BPNN, RBFNN, CNN and CWNN are implemented and analyzed. Fourthly, the wavelet-based Convolutional Neural Network (WCNN) is proposed, where the wavelet transformation is adopted as the activation function in Convolutional Pool Neural Network (CPNN) of CNN. Fifthly, simulations based on CWNN are implemented and analyzed on the MNIST dataset. Effects are as follows: Firstly, WNN can solve the problems of BPNN and RBFNN and have better performance. Secondly, the proposed CWNN can reduce the mean square error and the error rate of CNN, which means CWNN has better maximum precision than CNN. Thirdly, the proposed WCNN can reduce the mean square error and the error rate of CWNN, which means WCNN has better maximum precision than CWNN.

  • Book Chapter
  • Cite count: 8
  • 10.1007/978-3-319-99579-3_50
A Comparative Survey of Authorship Attribution on Short Arabic Texts
  • Jan 1, 2018
  • Siham Ouamour + 1 more

In this paper, we deal with the problem of authorship attribution (AA) on short Arabic texts. So, we make a survey on a set of several features and classifiers that are employed for the task of AA. This investigation uses characters, character bigrams, character trigrams, character tetragrams, words, word bigrams and rare words. The AA is ensured by 4 different measures, 3 classifiers (Multi-Layer Perceptron (MLP), Support Vector Machines (SVM) and Linear Regression (LR)) and a new proposed fusion called VBF (i.e. Vote Based Fusion). The evaluation is done on short Arabic texts extracted from the AAAT dataset (AA of Ancient Arabic Texts). Although the task of AA is known to be difficult on short texts, the different results have revealed interesting information on the performances of the features and classification techniques on Arabic text data. For instance, character-based features appear to be better than word-based features for short texts. Furthermore, the proposed VBF fusion provided high performances with an accuracy of 90% of good AA, which is higher than the score of the original classifier using only one feature. Globally, the results of this investigation shed light on the efficiency and pertinency of several features and classifiers in AA of short Arabic texts.
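
The abstract above names a Vote Based Fusion (VBF) but does not define its voting rule. As a rough illustration of the general idea only, the sketch below uses plain plurality voting over the labels predicted by several feature and classifier pairs. The function name `vote_fusion` and the first-seen tie-breaking rule are assumptions for this example, not the method of that paper.

```python
from collections import Counter


def vote_fusion(predictions):
    """Return the author label predicted most often.

    predictions: non-empty list of labels, one per (feature, classifier) pair.
    Ties are broken in favor of the label that appears first in the list
    (an arbitrary choice made for this sketch).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Hypothetical outputs from three classifiers run on two feature types.
votes = ["author_2", "author_1", "author_2", "author_2", "author_3", "author_1"]
fused_label = vote_fusion(votes)
```

A weighted variant, where each vote counts in proportion to a classifier's validation accuracy, is a common alternative when the individual classifiers differ widely in quality.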

  • Research Article
  • Cite count: 13
  • 10.1145/3487061
Authorship Attribution for a Resource Poor Language—Urdu
  • Dec 13, 2021
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Zulqarnain Nazir + 5 more

Authorship attribution refers to examining the writing style of authors to determine the likelihood of the original author of a document from a given set of potential authors. Due to the wide range of authorship attribution applications, a plethora of studies have been conducted for various Western, as well as Asian, languages. However, authorship attribution research in the Urdu language has just begun, although Urdu is widely acknowledged as a prominent South Asian language. Furthermore, the existing studies on authorship attribution in Urdu have addressed a considerably easier problem of having fewer than 20 candidate authors, which is far from real-world settings. Therefore, the findings from these studies may not be applicable to real-world settings. To that end, we have made three key contributions: First, we have developed a large authorship attribution corpus for Urdu, which is a low-resource language. The corpus is composed of over 2.6 million tokens and 21,938 news articles by 94 authors, which makes it a closer substitute for real-world settings. Second, we have analyzed hundreds of stylometry features used in the literature to identify 194 features that are applicable to the Urdu language and developed a taxonomy of these features. Finally, we have performed 66 experiments using two heterogeneous datasets to evaluate the effectiveness of four traditional and three deep learning techniques. The experimental results show the following: (a) our developed corpus is many times larger than the existing corpora, and it is more challenging than its counterparts for the authorship attribution task, and (b) the Convolutional Neural Network is the most effective technique, as it achieved a nearly perfect F1 score of 0.989 for an existing corpus and 0.910 for our newly developed corpus.

  • Conference Article
  • Cite count: 2
  • 10.1109/bracis.2019.00146
A Multiview Clustering Approach for Mining Authorial Affinities in Literary Texts
  • Oct 1, 2019
  • Andrea B Duque + 2 more

In this work, we investigate the use of multiview learning for the task of authorship attribution. The main goal of this task is to assign authors to texts whose authorship is unknown or disputed. It has gained substantial attention recently because of other applications such as plagiarism detection and forensic investigation. Although the problem is traditionally seen as a supervised learning task, recent works have advocated the use of unsupervised methods as an alternative. The main argument for such an approach is that, by placing a text with a disputed or unknown author in a cluster of works from another author or group of authors, the method is revealing authorial affinities due to stylistic similarities that may be better used by domain experts. Nonetheless, there is no consensus in the literature on what set of features should be used to determine these stylistic similarities. Since the nature of the features may vary drastically, e.g. word frequencies (lexical) versus part-of-speech tags (syntactic), we adopt an agnostic view on which is the best, and, instead, believe that each set of features provides relevant, if not complementary, perspectives on the writing styles of the authors. In this sense, we investigate the use of multiview unsupervised learning for the task of authorship attribution. We use a real-world traditional corpus in authorship attribution research to assess the performance of our approach. Our experiments with the corpus containing plays from different authors from the Shakespeare Era indicate significant improvement compared to the ordinary single-view clustering approach.

  • Research Article
  • Cite count: 4
  • 10.1145/3655620
Crossing Linguistic Barriers: Authorship Attribution in Sinhala Texts
  • May 10, 2024
  • ACM Transactions on Asian and Low-Resource Language Information Processing
  • Raheem Sarwar + 4 more

Authorship attribution involves determining the original author of an anonymous text from a pool of potential authors. The author attribution task has applications in several domains, such as plagiarism detection, digital text forensics, and information retrieval. While these applications extend beyond any single language, existing research has predominantly centered on English, posing challenges for application in languages such as Sinhala due to linguistic disparities and a lack of language processing tools. We present the first comprehensive study on cross-topic authorship attribution for Sinhala texts and propose a solution that can effectively perform the authorship attribution task even if the topics within the test and training samples differ. Our solution consists of three main parts: (i) extraction of topic-independent stylometric features, (ii) generation of a small candidate author set with the help of similarity search, and (iii) identification of the true author. Several experimental studies were carried out to demonstrate that the proposed solution can effectively handle real-world scenarios involving a large number of candidate authors and a limited number of text samples for each candidate author.

  • Research Article
  • Cite count: 44
  • 10.1016/j.bspc.2019.03.009
Automatic staging model of heart failure based on deep learning
  • Apr 2, 2019
  • Biomedical Signal Processing and Control
  • Dengao Li + 3 more

  • Research Article
  • Cite count: 1
  • 10.48084/etasr.8302
Authorship Attribution for English Short Texts
  • Oct 9, 2024
  • Engineering, Technology & Applied Science Research
  • Tawfeeq Alsanoosy + 2 more

The explosive growth of the Internet and social media has led to the rapid and widespread dissemination of information, which often takes place anonymously. This anonymity has fostered the rise of uncredited copying, posing a significant threat of copyright infringement and raising serious concerns in fields where verifying information's authenticity is paramount. Authorship Attribution (AA), a critical classification task within Natural Language Processing (NLP), aims to mitigate these concerns by identifying the original source of content. Although extensive research exists for longer texts, AA for short texts, namely informal texts like tweets, remains challenging due to the latter’s brevity and stylistic variation. Thus, this study aims to investigate and measure the performance of various Machine Learning (ML) and Deep Learning (DL) methods deployed for feature extraction from short text data, using tweets. The employed feature extraction methods were: Bag-of-Words (BoW), TF-IDF, n-grams, word-level, and character-level features. These methods were evaluated in conjunction with six ML classifiers, i.e. Naive Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN), and Random Forest (RF), along with two DL architectures, i.e. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The highest accuracy achieved with an ML model was 92.34%, using an SVM with TF-IDF features. Even though the basic CNN DL model reached only 88% accuracy, this outcome still surpassed the previously established baseline for this task. The findings of this research not only advance the technical capabilities of AA, but also extend its practical applications, providing tools that can be adapted across various domains to ensure proper attribution and expose copyright infringement.

  • Conference Article
  • Cite count: 2
  • 10.1109/smc53992.2023.10393898
Integrating Bidirectional Long Short-Term Memory with Subword Embedding for Authorship Attribution
  • Oct 1, 2023
  • Abiodun Modupe + 3 more

The problem of unveiling the author of a given text document from multiple candidate authors is called authorship attribution. Manifold word-based stylistic markers have been successfully used in deep learning methods to deal with the intrinsic problem of authorship attribution. Unfortunately, the performance of word-based authorship attribution systems is limited by the vocabulary of the training corpus. Literature has recommended character-based stylistic markers as an alternative to overcome the hidden word problem. However, character-based methods often fail to capture the sequential relationship of words in texts, which leaves a gap for further improvement. The question addressed in this paper is whether it is possible to address the ambiguity of hidden words in text documents while preserving the sequential context of words. Consequently, a method based on bidirectional long short-term memory (BLSTM) with a 2-dimensional convolutional neural network (CNN) is proposed to capture sequential writing styles for authorship attribution. The BLSTM was used to obtain the sequential relationship among characteristics using subword information. The 2-dimensional CNN was applied to understand the local syntactical position of the style from unlabeled input text. The proposed method was experimentally evaluated against numerous state-of-the-art methods across the public corpora of CCAT50, IMDb62, Blog50, and Twitter50. Experimental results indicate accuracy improvements of 1.07% and 0.96% on CCAT50 and Twitter, respectively, and comparable results on the remaining datasets.

  • Conference Article
  • 10.1109/ijcnn52387.2021.9533360
Approaching authorship attribution as a multi-view supervised learning task
  • Jul 18, 2021
  • Luis Goncalves + 1 more

Authorship attribution is the problem of identifying the author of texts based on the author's writing style. It is usually assumed that the writing style contains traits inaccessible to conscious manipulation and can thus be safely used to identify the author of a text. Several style markers have been proposed in the literature, nevertheless, there is still no consensus on which best represent the choices of authors. Here we assume an agnostic viewpoint on the dispute for the best set of features that represents an author's writing style. We rather investigate how different sources of information may unveil different aspects of an author's style, complementing each other to improve the overall process of authorship attribution. For this we model authorship attribution as a multi-view learning task. We assess the effectiveness of our proposal applying it to a set of well-studied corpora. We compare the performance of our proposal to the state-of-the-art approaches for authorship attribution. We thoroughly analyze how the multi-view approach improves on methods that use a single data source. We confirm that our approach improves both in accuracy and consistency of the methods and discuss how these improvements are beneficial for linguists and domain specialists.

  • Research Article
  • Cite count: 2
  • 10.34229/2707-451x.21.3.6
Comparative Analysis of the Application of Multilayer and Convolutional Neural Networks for Recognition of Handwritten Letters of the Azerbaijani Alphabet
  • Sep 30, 2021
  • Cybernetics and Computer Technologies
  • Elshan Mustafayev + 1 more

Introduction. The implementation of information technologies in various spheres of public life dictates the creation of efficient and productive systems for entering information into computer systems. In such systems it is important to build an effective recognition module. At the moment, the most effective method for solving this problem is the use of artificial multilayer neural and convolutional networks. The purpose of the paper. This paper is devoted to a comparative analysis of the recognition results of handwritten characters of the Azerbaijani alphabet using neural and convolutional neural networks. Results. The analysis of the dependence of the recognition results on the following parameters is carried out: the architecture of neural networks, the size of the training base, the choice of the subsampling algorithm, the use of the feature extraction algorithm. To increase the training sample, the image augmentation technique was used. Based on the real base of 14000 characters, the bases of 28000, 42000 and 72000 characters were formed. The description of the feature extraction algorithm is given. Conclusions. Analysis of recognition results on the test sample showed: as expected, convolutional neural networks showed higher results than multilayer neural networks; the classical convolutional network LeNet-5 showed the highest results among all types of neural networks. However, the multi-layer 3-layer network, which was input by the feature extraction results; showed rather high results comparable with convolutional networks; there is no definite advantage in the choice of the method in the subsampling layer. The choice of the subsampling method (max-pooling or average-pooling) for a particular model can be selected experimentally; increasing the training database for this task did not give a tangible improvement in recognition results for convolutional networks and networks with preliminary feature extraction. 
However, for networks learning without feature extraction, an increase in the size of the database led to a noticeable improvement in performance. Keywords: neural networks, feature extraction, OCR.

  • Research Article
  • Cite count: 24
  • 10.3390/info15030131
Authorship Attribution Methods, Challenges, and Future Research Directions: A Comprehensive Survey
  • Feb 28, 2024
  • Information
  • Xie He + 3 more

Over the past few decades, researchers have put their effort and paid significant attention to the authorship attribution field, as it plays an important role in software forensics analysis, plagiarism detection, security attack detection, and protection of trade secrets, patent claims, copyright infringement, or cases of software theft. It helps new researchers understand the state-of-the-art works on authorship attribution methods, identify and examine the emerging methods for authorship attribution, and discuss their key concepts, associated challenges, and potential future work that could help newcomers in this field. This paper comprehensively surveys authorship attribution methods and their key classifications, used feature types, available datasets, model evaluation criteria and metrics, and challenges and limitations. In addition, we discuss the potential future research directions of the authorship attribution field based on the insights and lessons learned from this survey work.
