Classification and analysis for Focused Crawled Textual Dataset for retrieving Indian origin scientists

Rajesh Bhatia, Shaily Jain, Shivani Gautam

doi:10.52756/ijerr.2023.v34spl.008

Abstract

Text classification (also called text categorization or text tagging) is a crucial and widely used approach in Natural Language Processing (NLP) for assigning unseen documents to predefined categories. In this paper, we examine dataset construction and model evaluation as components of text classification. First, we built a new text classification dataset on Indian-origin scientists, collected using focused crawling and web scraping techniques. We then present an extensive evaluation of several models on this dataset. Our evaluation shows that the Random Forest model outperforms the other supervised models. These results provide a good starting point for further research on text classification for Indian-origin scientists. In experiments that also covered K-Nearest Neighbor, Logistic Regression, and Support Vector Machine, Random Forest performed much better when combined with SMOTE and K-fold cross-validation. We use the area under the ROC curve (AUC) to measure the effectiveness of the chosen models. Overall, the Random Forest classifier gave the best results, with a micro-average AUC of 90%.
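As an illustration of the metric named in the abstract, the following is a minimal sketch of how a micro-average AUC can be computed for a multi-class classifier. It is not the authors' implementation: the function names `binary_auc` and `micro_average_auc` and the toy labels and scores are assumptions made for this example. The sketch treats every (document, class) pair as one binary decision, pools all pairs, and computes a single AUC over the pool using the rank-sum formulation.

```python
from itertools import chain


def binary_auc(labels, scores):
    """AUC from the rank-sum (Mann-Whitney U) statistic; ties get average ranks."""
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores pairs[i:j].
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # 1-based ranks i+1 .. j share their average rank.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


def micro_average_auc(y_true, y_score, n_classes):
    """Pool one-vs-rest decisions over all classes, then compute one AUC.

    y_true:  list of integer class ids, one per document
    y_score: list of per-class score lists, one row per document
    """
    flat_labels = [1 if y == c else 0 for y in y_true for c in range(n_classes)]
    flat_scores = list(chain.from_iterable(y_score))
    return binary_auc(flat_labels, flat_scores)


# Toy data (hypothetical): three documents, three classes, every true class
# scored above every wrong class.
y_true = [0, 1, 2]
y_score = [
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.20, 0.20, 0.60],
]
print(micro_average_auc(y_true, y_score, n_classes=3))  # 1.0
```

Because all classes are pooled before ranking, frequent classes contribute more decisions than rare ones, so a micro-average reflects overall per-decision ranking quality rather than an unweighted mean over classes.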

Journal: International Journal of Experimental Research and Review
Publication Date: Oct 30, 2023
License type: CC BY-NC-ND 4.0

