Abstract

Privacy protection for classification algorithms has become a research hotspot in data mining. This paper applies differential privacy to the random forest classification algorithm and proposes a random forest algorithm based on differential privacy, which protects private information during data classification. Differential privacy provides protection by adding perturbation noise, which lowers the classification accuracy of random forest algorithms. First, to reduce this loss of accuracy, a hybrid decision tree algorithm is proposed: when constructing a single decision tree in the random forest, the information gain used in the ID3 algorithm and the information gain ratio used in C4.5 are combined into a new attribute metric, IG_GR, to improve the classification accuracy of a single decision tree. Second, a new privacy budget allocation strategy is proposed: for nodes at different depths of the decision tree, the privacy budget is allocated to the counting query and the attribute query by weight, which balances the signal-to-noise ratio that differential privacy produces at nodes of different depths. The hybrid decision tree algorithm is then applied to the construction of the random forest, balancing the privacy and classification accuracy of the differentially private random forest. Finally, experiments were conducted on the Adult and Mushroom datasets from the UCI repository. The results show that the proposed algorithm achieves better classification accuracy than the traditional decision tree algorithm, and that the Differential Privacy Random Forest (DPRF) provides effective privacy protection while maintaining high classification performance. This work achieves a balance between privacy and classification accuracy and has practical application value.
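
To make the two ingredients of the hybrid metric concrete, the sketch below computes the information gain (the ID3 criterion) and the information gain ratio (the C4.5 criterion) for a categorical attribute. The abstract does not state how the two are combined into IG_GR, so no combination rule is shown. The function names and the toy rows are assumptions made for illustration only.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def information_gain(rows, labels, attr):
    """ID3 criterion: entropy reduction obtained by splitting on `attr`."""
    total = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


def gain_ratio(rows, labels, attr):
    """C4.5 criterion: information gain divided by the split information."""
    split_info = entropy([row[attr] for row in rows])
    if split_info == 0:
        # The attribute has a single value, so it cannot split the data.
        return 0.0
    return information_gain(rows, labels, attr) / split_info


if __name__ == "__main__":
    # Hypothetical toy records, loosely styled after the Mushroom dataset.
    rows = [
        {"odor": "foul", "cap": "flat"},
        {"odor": "foul", "cap": "bell"},
        {"odor": "none", "cap": "flat"},
        {"odor": "none", "cap": "bell"},
    ]
    labels = ["poisonous", "poisonous", "edible", "edible"]
    for attr in ("odor", "cap"):
        print(attr, information_gain(rows, labels, attr), gain_ratio(rows, labels, attr))
```

The gain ratio normalizes the gain by the entropy of the attribute itself, which counteracts the preference of plain information gain for attributes with many distinct values.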

Highlights

  • With the rapid development of Internet technology, many companies, in addition to the government, hold huge amounts of data about citizens’ personal information

  • This paper proposes Differential Privacy Random Forest (DPRF), a new random forest classification algorithm based on differential privacy protection

  • Proof: According to the strategy of dividing the privacy budget by weight across the layers of the decision tree, the weight of the privacy budget assigned to the first layer is w1 = 2 dm, and the actual privacy budget obtained by the root node, according to the unit share of the privacy budget, is e1 = eu ∗ (2 dm); the privacy budget corresponding to the … [Algorithm 2: Differential Privacy Random Forest (DPRF)]
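
The idea of dividing a privacy budget by weight can be sketched generically: compute a unit share as the total budget divided by the sum of the layer weights, then give each layer its weight times that unit share. The weights used below, the function names, and the even split between the counting query and the attribute query are all assumptions for illustration; the paper's actual weight formula is not recoverable from this page.

```python
def allocate_layer_budgets(total_epsilon, weights):
    """Split `total_epsilon` across tree layers in proportion to `weights`.

    The unit share is total_epsilon / sum(weights); layer i receives
    unit_share * weights[i], so the layer budgets sum to total_epsilon.
    """
    if total_epsilon <= 0 or not weights or any(w <= 0 for w in weights):
        raise ValueError("budget and weights must be positive")
    unit_share = total_epsilon / sum(weights)
    return [unit_share * w for w in weights]


def split_node_budget(layer_epsilon, count_share=0.5):
    """Divide one layer's budget between the counting and attribute queries.

    `count_share` is a hypothetical parameter, not a value from the paper.
    """
    count_epsilon = layer_epsilon * count_share
    return count_epsilon, layer_epsilon - count_epsilon


if __name__ == "__main__":
    # Hypothetical weights that favour the layers closer to the root.
    layer_budgets = allocate_layer_budgets(1.0, [4, 2, 1])
    for depth, eps in enumerate(layer_budgets, start=1):
        print(depth, eps, split_node_budget(eps))
```

Because nodes on the same layer hold disjoint subsets of the records, a layer's budget can be reused by every node on that layer, while the budgets of successive layers add up along each root-to-leaf path.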

Summary

INTRODUCTION

With the rapid development of Internet technology, many companies, in addition to the government, hold huge amounts of data about citizens’ personal information. The SuLQ-based ID3 algorithm uses the Laplace mechanism to add noise each time the information gain of a dataset attribute is calculated while constructing the decision tree [19], [20]. This introduces excessive noise, and the resulting accuracy is 30% lower than that of the ID3 algorithm without differential privacy protection [21]. The algorithm does not need to preprocess the dataset; that is, it does not need to discretize continuous attributes before constructing the decision tree. It extends the DiffPRF algorithm, which uses a monotonous privacy budget allocation strategy and can only deal with discrete attributes, and it uses an exponential mechanism to select split points for continuous attributes [30], [31].
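
The Laplace mechanism mentioned above can be illustrated on a single counting query. A count has sensitivity 1, so adding Laplace noise with scale 1/ε gives ε-differential privacy for that query. This is a minimal sketch of the general mechanism, not the SuLQ-based ID3 algorithm itself; the function names and the sample values are assumptions.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    # expovariate takes the rate (1 / mean), so each draw has mean `scale`.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count, epsilon, rng):
    """Answer a counting query with Laplace noise of scale 1 / epsilon."""
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count, epsilon, trials, rng):
    """Average absolute error of the noisy answer over repeated queries."""
    errors = (abs(noisy_count(true_count, epsilon, rng) - true_count)
              for _ in range(trials))
    return sum(errors) / trials


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for eps in (0.1, 1.0):
        print(eps, mean_abs_error(100, eps, 2000, rng))
```

A smaller ε yields larger noise on average. This is why spending a fresh share of the budget on every information gain computation, as described above, can degrade accuracy: each query is left with a small ε and therefore a noisy answer.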

HYBRID DECISION TREE
EXPERIMENTS AND ANALYSIS
EXPERIMENT 1
EXPERIMENT 2
EXPERIMENT 3
EXPERIMENT 4
Findings
CONCLUSION