Abstract

Current deep learning models for detecting relevant web pages show low accuracy because of the poor quality of their training data. In this paper, we propose a novel algorithm that automatically generates high-quality training data based on the frequency of the document including the entity of interest. Our experimental results on movie and cellphone data sets show that deep learning models (FNN, CNN, Bi-LSTM, and SeqGAN) trained with the data generated by our proposed algorithm reach an average F1-score of up to 0.9992.

Highlights

  • With the advent of the fourth industrial revolution, Artificial Intelligence (AI)-based data mining algorithms play a key role in extracting unknown but informative knowledge from big data to improve enterprise productivity and bring out technological innovation

  • Given a target entity and a set of web pages, we propose a novel automatic algorithm for High-quality Training data Generation (HiTGen), thereby considerably improving the accuracy of the existing deep learning models

  • Pages with high frequency are relatively more relevant to a target entity than ones with low frequency (Kim et al., "High-Quality Train Data Generation for Deep Learning-Based Web Page Classification Models")

Summary

INTRODUCTION

With the advent of the fourth industrial revolution, Artificial Intelligence (AI)-based data mining algorithms play a key role in extracting unknown but informative knowledge from big data to improve enterprise productivity and bring out technological innovation. The other is to use high-quality training data. In the former approach, deep learning models themselves have high complexity with a number of hidden units, weights, and bias parameters. Given a target entity and a set of web pages, we propose a novel automatic algorithm for High-quality Training data Generation (HiTGen), thereby considerably improving the accuracy of the existing deep learning models. Pages with high frequency are relatively more relevant to a target entity than ones with low frequency. With the class label and the feature set, we automatically build the high-quality training set for deep learning-based web page classification models.
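The frequency idea described above can be illustrated with a minimal Python sketch. It is one possible reading of the summary, not the HiTGen algorithm itself: it assumes "frequency" means the number of times the target entity is mentioned in a page, and it assumes a fixed cutoff separates relevant from irrelevant pages. The names `entity_frequency`, `label_pages`, and `min_count`, as well as the sample pages, are hypothetical and do not come from the paper.

```python
import re
from typing import List, Tuple


def entity_frequency(page_text: str, entity: str) -> int:
    """Count case-insensitive whole-phrase mentions of the entity in a page."""
    pattern = r"\b" + re.escape(entity) + r"\b"
    return len(re.findall(pattern, page_text, flags=re.IGNORECASE))


def label_pages(
    pages: List[str], entity: str, min_count: int = 2
) -> List[Tuple[str, int, int]]:
    """Attach a frequency and a binary class label to each page.

    Assumption: a page is labeled relevant (1) when the entity is mentioned
    at least `min_count` times, and irrelevant (0) otherwise. The cutoff is
    an illustrative choice, not a value taken from the paper.
    """
    labeled = []
    for text in pages:
        freq = entity_frequency(text, entity)
        label = 1 if freq >= min_count else 0
        labeled.append((text, freq, label))
    return labeled


# Hypothetical sample pages for the target entity "Inception".
pages = [
    "Inception review: Inception is a 2010 film. Why Inception still matters.",
    "Top ten recipes for summer. One reader compared the cake to Inception.",
    "Weather forecast for the weekend.",
]

for text, freq, label in label_pages(pages, "Inception"):
    print(freq, label, text[:40])
```

The resulting (text, label) pairs could then serve as training examples for a page classifier. In practice the cutoff and the feature set would have to be chosen according to the paper's own procedure, which this summary does not spell out.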

RELATED WORK
MAIN PROPOSAL
DISCUSSION
