Named Entity Recognition (NER) is a significant information extraction task, as it is an important component of many natural language processing applications such as Information Retrieval, Question Answering, and Speech Recognition. The complexity and morphological richness of the Arabic language are the main reasons why most existing Arabic NER systems rely heavily on hand-crafted feature engineering. In this paper, we propose to augment the existing LSTM neural tagging model for Arabic NER with a Convolutional Neural Network (CNN) that extracts relevant character-level features. By operating at the character level, the proposed model is able to handle out-of-vocabulary words. Our results show that the character CNN outperforms the previously used character-level Bi-directional Long Short-Term Memory network (BiLSTM) in many settings. Moreover, our observations indicate that CNNs tend to perform better than BiLSTMs on relatively longer tokens. In addition, we compare four different pre-trained word vector models for Arabic NER; the results show that a Skip-Gram Word2vec model, pre-trained on a subset of the Arabic Gigaword corpus, is generally sufficient to obtain acceptable Arabic NER performance.
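The character-CNN idea above can be sketched in a few lines: embed a token's characters, slide a convolution window over the embedding sequence, and max-pool over time so every token, however long or out-of-vocabulary, yields a fixed-size feature vector. The sketch below uses NumPy with random weights and hypothetical sizes (vocabulary, embedding dimension, filter count, and window width are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperparameters (illustrative only, not from the paper):
# character vocabulary size, character embedding dim, number of CNN
# filters, and convolution window width.
CHAR_VOCAB, EMB_DIM, N_FILTERS, WIDTH = 40, 8, 16, 3

char_emb = rng.normal(size=(CHAR_VOCAB, EMB_DIM))      # character embedding table
filters = rng.normal(size=(N_FILTERS, WIDTH * EMB_DIM))  # flattened conv filters

def char_cnn_features(char_ids):
    """Map one token's character ids to a fixed-size feature vector.

    A 1D convolution over the character embeddings, a ReLU, then
    max-pooling over time, so tokens of any length (including
    out-of-vocabulary words) produce a vector of length N_FILTERS.
    """
    x = char_emb[char_ids]                              # (token_len, EMB_DIM)
    if len(x) < WIDTH:                                  # pad very short tokens
        x = np.vstack([x, np.zeros((WIDTH - len(x), EMB_DIM))])
    # Each row is one window of WIDTH consecutive character embeddings.
    windows = np.stack([x[i:i + WIDTH].ravel()
                        for i in range(len(x) - WIDTH + 1)])
    conv = np.maximum(windows @ filters.T, 0.0)         # ReLU, (n_windows, N_FILTERS)
    return conv.max(axis=0)                             # max-pool over time

feat = char_cnn_features([1, 5, 7, 2, 9])               # shape (N_FILTERS,)
```

In the full tagger, such a vector would be concatenated with the token's pre-trained word embedding before being fed to the word-level BiLSTM.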