Text Preprocessing Research Articles

This publication reviews the problem of representing textual information for subsequent cluster analysis in the processing and management of high-dimensional information. Modern requirements for analytical, search, and recommendation information systems show that the current information technology market lacks a holistic solution that delivers sufficient speed and quality of results. Finding such a solution requires an objective analysis of existing approaches to representing textual information in vector space, in order to form a complete view of their advantages and disadvantages and to derive criteria for an approach free of the identified weaknesses. The work is analytical and gives an overview of the current state and degree of elaboration of the problem within a limited subject area. Clustering of text data is the automatic formation of subsets whose elements are documents from an unstructured sample of fixed dimension under study. This process is a form of unsupervised learning, which implies the absence of an expert who personally assigns class indices to the original sample of documents. However, cluster analysis of text data is impossible without pre-processing: the input data must be standardized and reduced to a single format and form. For this stage of cluster analysis, the publication discusses methods for preprocessing text data. The novelty of the publication lies in forming a theoretical basis for the main methods of text data vectorization by systematizing and objectifying the proposed assumptions and by conducting a series of experimental studies. The main difference between this work and previously published scientific works is the systematization and analysis of modern solutions, together with hypotheses about the relevance and effectiveness of the authors' own hybridized approach to text data vectorization.
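As a rough illustration of the preprocessing and vectorization stage described in the abstract above, the following minimal sketch normalizes raw text and maps each document to a TF-IDF vector, so that documents can be compared before clustering. It is not the authors' hybridized approach: the stop-word list, the function names, and the choice of plain TF-IDF with cosine similarity are assumptions made for illustration only.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"a", "an", "the", "of", "and", "is", "in", "to", "for"}


def preprocess(text):
    """Reduce raw text to a single form: lowercase alphabetic tokens, no stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(documents):
    """Return (vocabulary, vectors): one TF-IDF vector per input document."""
    tokenized = [preprocess(d) for d in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(t for doc in tokenized for t in set(doc))
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([
            (counts[t] / total) * math.log(n_docs / doc_freq[t])
            for t in vocabulary
        ])
    return vocabulary, vectors


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    docs = [
        "The cat sat on the mat",
        "A cat and a dog",
        "Stock markets fell sharply",
    ]
    vocab, vecs = tfidf_vectors(docs)
    print(vocab)
    print(cosine_similarity(vecs[0], vecs[1]))  # documents sharing a term
    print(cosine_similarity(vecs[0], vecs[2]))  # documents sharing no terms
```

The resulting vectors could then be passed to any clustering routine; which vectorization method suits high-dimensional text best is exactly the question the publication examines.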


The internet and social networks produce an increasing amount of data. A recommendation system is badly needed because exploring such a huge collection is time-consuming and difficult. This study introduces a multi-modal classifier that uses the outputs of two deep neural networks: a GRU for text analysis and Faster R-CNN for image analysis. Together, these two networks reduce overall complexity and computational time while retaining accuracy. More precisely, the GRU network processes movie reviews, and Faster R-CNN recognizes each frame of the movie trailers. The Gated Recurrent Unit (GRU) is a well-known variety of RNN that processes sequential data through recurrent structures. Faster R-CNN is an enhanced version of Fast R-CNN; it combines rectangular region proposals with features extracted by ResNet-101. Initially, the movie trailer is manually split into frames, which are pre-processed with a fuzzy elliptical filter for image analysis, and the movie reviews are tokenized for text analysis. The pre-processed text is the input to the GRU, which classifies movies as offensive or non-offensive, and the pre-processed images are the input to Faster R-CNN, which classifies movies as violent or non-violent based on features extracted from the trailer. The four classified outputs are then passed to a fuzzy decision-making unit, which recommends the best movies using a Mamdani fuzzy inference system with Gaussian membership functions. The performance of the dual deep neural networks was evaluated using specificity, precision, recall, accuracy, and F1 score. The proposed GRU yields an accuracy of 97.73% for reviews, and Faster R-CNN yields an accuracy of 98.42% for movie trailers.
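The evaluation measures named in the abstract above (specificity, precision, recall, accuracy, and F1 score) all follow from the four cells of a binary confusion matrix. The sketch below shows the standard definitions; the function name and the example counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard binary-classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are the numbers of true positives, false positives,
    true negatives, and false negatives.
    """
    def ratio(numerator, denominator):
        # Return 0.0 instead of raising when a denominator is zero.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)            # also called sensitivity
    specificity = ratio(tn, tn + fp)
    accuracy = ratio(tp + tn, tp + fp + tn + fn)
    f1 = ratio(2 * precision * recall, precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


if __name__ == "__main__":
    # Hypothetical counts for illustration; not results from the paper.
    for name, value in classification_metrics(tp=8, fp=2, tn=6, fn=4).items():
        print(f"{name}: {value:.4f}")
```

Reporting specificity alongside recall is useful for tasks such as offensive versus non-offensive classification, where accuracy alone can hide poor performance on the rarer class.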



Articles published on Text Preprocessing

Multidisciplinary classification for Indonesian scientific articles abstract using pre-trained BERT model

Mobile Message Classification Using Natural Language Processing and Machine Learning Algorithms

TPTS: Text pre-processing Techniques for Sindhi Language

BERT: A Review of Applications in Sentiment Analysis

Implementation of the TF-IDF Algorithm and Support Vector Machine for Detecting Cyberbullying Comments on the TikTok Social Media Platform

Comparative Analysis of Vectorization Methods for High-Dimensional Text Data

Text Classification Using Genetic Programming with Implementation of Map Reduce and Scraping

The Automatic Classification System for Academic Performance Evaluation at the Faculty of Information Technology Atma Jaya University of Makassar

Comparison of the TF-RF and TF-ABS Weighting Methods for News Categorization at BDI Denpasar

Legal Document Analysis

Classification of Sentiment Analysis and Community Opinion Modeling Topics for Application of ICT in Government Operations

Exploring the Capabilities and Limitations of ChatGPT and Alternative Big Language Models

A Scoping Literature Review of Natural Language Processing Application to Safety Occurrence Reports

Clinical Text Classification with Word Representation Features and Machine Learning Algorithms

ArSentBERT: fine-tuned bidirectional encoder representations from transformers model for Arabic sentiment classification

Alternative Text Pre-Processing using Chat GPT Open AI

Sentiment Classification of Sausage Man App Reviews Using the VADER Lexicon and Naïve Bayes Classifier

Sentiment Classification of Indonesian Football Transformation and Reform on Twitter Using the Bernoulli Naïve Bayes Algorithm

Movie recommendation system via fuzzy decision making based dual deep neural networks

Implementation of the Term Frequency Inverse Document Frequency (TF-IDF) Algorithm in Analyzing Public Sentiment Toward the Covid-19 Omicron Variant
