Text Classification and Topic Modelling of Web Extracted Data

Niraj Kumar, R.R Suman, Sanjay Kumar

doi:10.1109/gcat52182.2021.9587459

Abstract

Text classification and topic modelling are the backbone of text analysis for large corpora. As the volume of unstructured data grows, analysing it becomes increasingly difficult, and methods are needed that can extract sensitive and semantic information from a corpus. Text classification is the organised categorization of text so that sensitive information can be interpreted from it, while topic modelling finds the abstract topics in a collection of texts or documents and is frequently used to uncover semantic information in textual data. In this paper we apply parsing techniques to various websites to extract HTML and XML data, including its textual content, and apply preprocessing techniques to clean the data. For text classification, the machine-learning classifiers used in our experiments include Naive Bayes and Logistic Regression. Document models are built using three topic modelling methods: Latent Semantic Analysis, Probabilistic Latent Semantic Analysis and Latent Dirichlet Allocation. We then analyse and compare the performance of the models and classifiers on the processed textual data.
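The abstract does not say which tools or parameters the authors used, so the following is only a minimal sketch of the first three stages it names (HTML parsing, preprocessing, and Naive Bayes classification), written with the Python standard library. The class and function names, the stop-word list, and the four sample pages are all hypothetical and are not taken from the paper. Logistic Regression and the three topic models are not covered here.

```python
import math
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in", "for"}


def html_to_tokens(html):
    """Parse HTML, lowercase the text, keep letter-only tokens, drop stop words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.parts).lower()
    return [t for t in re.findall(r"[a-z]+", text) if t not in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for t in tokens:
                if t in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled pages standing in for web-extracted data.
pages = [
    ("<html><body><h1>Match report</h1><p>The team scored a late goal.</p></body></html>", "sports"),
    ("<html><body><p>The striker and the goalkeeper trained for the match.</p></body></html>", "sports"),
    ("<html><body><p>The new phone has a faster processor.</p><script>var x = 1;</script></body></html>", "tech"),
    ("<html><body><p>Software update improves the laptop battery.</p></body></html>", "tech"),
]
model = NaiveBayes().fit(
    [html_to_tokens(html) for html, _ in pages],
    [label for _, label in pages],
)
prediction = model.predict(html_to_tokens("<p>A goal in the final match</p>"))
print(prediction)
```

In practice one would more likely use an established library for the classifiers and topic models; the hand-written classifier above is only meant to make the parse, clean, classify sequence concrete.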

Similar Papers

An Overview of Topic Representation and Topic Modelling Methods for Short Texts and Long Corpus
D Yamunathangam ... G Shobana
08 Oct 2021

Evaluation of clustering and topic modeling methods over health-related tweets and emails
Juan Antonio Lossio-Ventura ... Jiang Bian
Artificial Intelligence in Medicine | VOL. 117
07 May 2021

Probabilistic Topic Models for Text Data Retrieval and Analysis
Chengxiang Zhai
07 Aug 2017

A Tutorial on Probabilistic Topic Models for Text Data Retrieval and Analysis
Chengxiang Zhai ... Chase Geigle
27 Jun 2018
