Abstract

False information on the Internet is widely regarded as a serious social harm. To recognize false text information, this paper proposes an effective method for mining text features, applied to false drug advertisements. First, false and real drug advertisements were collected from official websites to build a database of both classes. Second, features were extracted from the advertisement text to build a feature matrix of the effective features, and each feature vector in the matrix was labeled positive or negative according to whether it came from a false drug advertisement. Third, several classifiers were trained and tested, the classification model that performed best at identifying false drug advertisements was selected, and the key features that determine the classification were identified. Finally, the best-performing model was used to predict new false drug advertisements collected from Sina Weibo. For identifying false drug advertisements, the support vector machine (SVM) classifier built on the feature set after feature selection performed best. The findings offer the government an effective method for identifying and combating false advertisements, and the study serves as a reference for using text data mining to identify and detect information fraud.

Highlights

  • In recent years, with the rapid development of the Internet and the increasing number of Internet users, false information has begun to spread rapidly and become more serious

  • The various forms of drug advertising on the Internet are not limited to medical websites. They can be hidden in medicine-related post bars, forums, publicity microblogs, and promotion platforms

  • Document frequency is the simplest feature selection algorithm. It determines how many texts contain a certain word in the entire dataset, and document frequency (DF) is calculated for each feature in the training set
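
The document frequency idea in the last highlight can be illustrated with a minimal sketch. It assumes the advertisements have already been tokenized into word lists; the function names, the toy advertisements, and the `min_df` threshold are hypothetical and are not taken from the study.

```python
from collections import Counter


def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        # set() ensures a term is counted once per document,
        # no matter how often it repeats inside that document.
        df.update(set(tokens))
    return df


def select_by_df(docs, min_df=2):
    """Keep terms whose document frequency is at least min_df (assumed threshold)."""
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if count >= min_df)


# Toy, made-up tokenized advertisements (hypothetical data, not from the study).
ads = [
    ["cure", "cancer", "guaranteed", "cure"],
    ["relieves", "cough", "symptoms"],
    ["guaranteed", "cure", "diabetes"],
]

df = document_frequency(ads)
selected = select_by_df(ads, min_df=2)
```

Here `df` maps each term to the number of advertisements that contain it, and `selected` keeps only the terms that reach the threshold. Where to set the threshold, and whether to also drop very frequent terms, is a design choice this sketch leaves open.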

Summary

Introduction

With the rapid development of the Internet and the increasing number of Internet users, false information has begun to spread rapidly and has become a more serious problem. Supervised methods use labeled records of fraudulent and genuine samples to build a model and then assign category labels to new records. They are more effective at classifying types of fraud that have already occurred and perform less well on new types. The effectiveness of text feature mining for recognizing false drug information online has not yet been conclusively established by a substantial body of authoritative studies.

Summary of the Basic Theory
Evaluation
Results and Discussion
Test Design and Result Analysis
Analysis and Comparison of Classification Results
Conclusion and Prospects
