Abstract

Many artificial intelligence studies focus on designing new neural network models or optimizing hyperparameters to improve model accuracy. To develop a reliable model, appropriate data are required, and data preprocessing is an essential part of acquiring the data. Although various studies regard data preprocessing as part of the data exploration process, those studies lack awareness about the need for separate technologies and solutions for preprocessing. Therefore, this study evaluated combinations of preprocessing types in a text-processing neural network model. Better performance was observed when two preprocessing types were used than when three or more preprocessing types were used for data purification. More specifically, using lemmatization and punctuation splitting together, lemmatization and lowering together, and lowering and punctuation splitting together showed positive effects on accuracy. This study is significant because the results allow better decisions to be made about the selection of the preprocessing types in various research fields, including neural network research.
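The three preprocessing types the abstract pairs up (lemmatization, punctuation splitting, and lowering) can be sketched as composable steps. The toy lemma table below is an illustrative assumption, not the paper's implementation; a real pipeline would use a trained lemmatizer.

```python
import re

def split_punctuation(text):
    # Punctuation splitting: separate punctuation marks into their own tokens
    return re.findall(r"\w+|[^\w\s]", text)

def lower(tokens):
    # Lowering: map every token to lower case
    return [t.lower() for t in tokens]

# Toy lemma table standing in for a real lemmatizer (illustrative assumption)
LEMMAS = {"studies": "study", "observed": "observe", "types": "type"}

def lemmatize(tokens):
    # Lemmatization: reduce each token to its dictionary form when known
    return [LEMMAS.get(t, t) for t in tokens]

def preprocess(text):
    # One of the better-performing pairs reported: punctuation splitting + lowering
    return lower(split_punctuation(text))

print(preprocess("Hello, World!"))  # ['hello', ',', 'world', '!']
```

Combining a third step is a matter of composing another function (e.g. `lemmatize(preprocess(text))`), which is how three-type combinations would be formed for comparison.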

Highlights

  • Attempts have been made to increase work efficiency through studies using similarities between sentences

  • Existing data preprocessing studies have been conducted mainly in the field of data mining. There have been studies that process web data to format them into an analytical form. These studies did not explain the effect of data preprocessing on the algorithm, treating it only as a step in preparing data for analysis [17,18,19]. There is a study that analyzed the effect of data preprocessing on predictive ability, limited to numerical data in neural network models [4, 20]

  • This study analyzed the effect of preprocessing by applying text data preprocessing to sentence models


Summary

Introduction

Attempts have been made to increase work efficiency through studies using similarities between sentences. Studying the similarity between sentences requires a deep understanding of the semantic and structural information of the language. Therefore, attempts have been made to learn a language model that computes probability distributions without extracting features. A method has been proposed that combines a word-embedding method, in which information about the meaning or structure of a word is expressed as a real-valued multidimensional vector, with a deep belief network structure that uses a prelearning method [3]. To improve the prediction accuracy of a high-performance neural-network-based sentence model or a natural-language-based study, confidence in the data should be the highest priority. Data for research studies should be processed through a filtering step, in which the researchers themselves conduct the preprocessing. Therefore, it is necessary to investigate the data preprocessing features that should be selected for machine learning [4], as well as the effects of various preprocessing tasks on the performance of classification models [5,6,7].
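The sentence-similarity idea above is commonly realized by averaging word embeddings into a sentence vector and comparing vectors with cosine similarity. A minimal sketch follows; the two-dimensional toy vectors are illustrative assumptions, not the embeddings used in [3].

```python
import math

# Toy word vectors; real studies would use trained multidimensional embeddings
VECS = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}

def sentence_vector(tokens):
    # Average the word vectors of the tokens (unknown words are skipped)
    vs = [VECS[t] for t in tokens if t in VECS]
    return [sum(dim) / len(vs) for dim in zip(*vs)]

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of vector norms
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

s1 = sentence_vector(["cat", "dog"])
s2 = sentence_vector(["car"])
print(round(cosine(s1, s2), 3))
```

Because the sentence vector is built from token vectors, the preprocessing applied to the tokens (lowering, lemmatization, punctuation splitting) directly changes which embeddings are looked up, which is why preprocessing choices affect such models.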

