An Efficient Solution for Privacy-preserving Naïve Bayes Classification in Fully Distributed Data Model

Abstract. Recently, privacy preservation has become one of the most important problems in data mining and machine learning. In this paper, we propose a novel privacy-preserving Naïve Bayes classifier for the fully distributed data scenario, where each record is kept by a unique owner. The proposed solution is based on a secure multi-party computation protocol, so it protects each data owner's privacy while preserving the accuracy of the classification model. Furthermore, our experimental results show that the new solution is efficient enough for practical applications.


I. INTRODUCTION
In recent years, the growth of data mining and machine learning (DM and ML) has brought many benefits to organizations and individuals. However, DM and ML processes can expose sensitive or private information in datasets. Hence, privacy preservation has become one of the most important issues in the DM and ML fields.
In general, privacy-preserving DM and ML solutions have three important properties: privacy, accuracy, and efficiency [1]. They are based on the following approaches:
Randomization approach: the original input data is randomly transformed or perturbed with added noise. As a result, such solutions run efficiently, but they trade off privacy against accuracy.
Cryptography approach: such privacy-preserving DM and ML techniques are often based on secure multi-party computation (SMC) protocols using homomorphic cryptosystems. As a result, cryptography-based solutions can preserve each data holder's privacy and guarantee the accuracy property. However, their performance is quite poor.
Hybrid approach: privacy-preserving DM and ML methods following the hybrid approach combine SMC protocols with randomization techniques. Hence, such solutions must balance the accuracy, privacy, and efficiency properties.
In this paper, we focus on privacy-preserving solutions for the Naïve Bayes algorithm, one of the most common machine learning techniques. In particular, we investigate privacy-preserving Naïve Bayes classification (PPNBC) solutions in the fully distributed setting, which is a special case of the horizontally distributed data model. In this scenario, each data record is kept by a unique holder.

To date, researchers have proposed many privacy-preserving Naïve Bayes classifiers for the fully and horizontally distributed data settings. Such solutions are usually based on the cryptography and hybrid approaches.
(i) Cryptography-based PPNBC solutions: In 2003, Kantarcıoğlu and Vaidya [2] introduced the first privacy-preserving Naïve Bayes classifier for the horizontally distributed data setting (a similar version can be found in [3]). The solution in [2] is based on the simple sum computation protocol [4], [5], which is insecure when several parties collude. Hence, each data holder's privacy in [2] is not securely protected.
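The collusion weakness of a simple sum protocol can be illustrated with a minimal Python sketch. It assumes the common ring-based variant, in which the first party masks its value with a random number and every other party adds its own value to the running total; whether [4], [5] use exactly this variant is an assumption, and the function names and sample values are hypothetical.

```python
import random


def ring_secure_sum(values, modulus, rng):
    """Ring-based sum (assumed variant): party 0 masks its value with a
    random number, each later party adds its own value, and party 0
    removes the mask at the end."""
    mask = rng.randrange(modulus)
    running = (mask + values[0]) % modulus
    messages = [running]  # messages[i] is what party i sends to party i + 1
    for v in values[1:]:
        running = (running + v) % modulus
        messages.append(running)
    total = (running - mask) % modulus
    return total, messages


def collusion_attack(messages, i, modulus):
    """Parties i - 1 and i + 1 collude: the difference between the message
    party i received and the message it sent is party i's private value."""
    return (messages[i] - messages[i - 1]) % modulus


rng = random.Random(0)
values = [3, 7, 1, 4]  # hypothetical private inputs
total, messages = ring_secure_sum(values, modulus=101, rng=rng)
leaked = collusion_attack(messages, i=2, modulus=101)
```

In this sketch the total is computed correctly, yet the two neighbours of party 2 recover its input without knowing the mask, which is the kind of leakage attributed above to [2].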
Yang et al. [6] proposed a PPNBC solution for the fully distributed setting that executes the privacy-preserving frequency mining protocol multiple times. Consequently, although this proposal can preserve the parties' privacy, its cost is high.
In 2008, Yi et al. [7] presented a privacy-preserving Naïve Bayes classifier on distributed data using two semi-trusted mixers who do not collude with each other. As a result, [7] performs well, but each data holder's privacy cannot be protected.
Skarkala et al. [8] proposed privacy-preserving Naïve Bayes classification techniques based on a multi-candidate election scheme. By using the Paillier cryptosystem [9] and authentication methods, data providers are protected. Unfortunately, Skarkala et al.'s solution requires each data provider to share the frequencies of his/her dataset with the miner. Thus, the privacy property of [8] cannot be ensured.
(ii) Hybrid approach-based PPNBC solutions: Based on Gentry's scheme [10], Li et al. proposed an outsourced PPNBC solution [11] for the cloud model. In the training phase of [11], the evaluator approximately computes a classification model from the encrypted training set. Consequently, this solution cannot guarantee the accuracy property. Furthermore, [11] is computationally expensive because Gentry's cryptosystem is costly.
Huai et al. [12] described a PPNBC solution based on a privacy-preserving aggregation protocol combined with a data perturbation technique. Moreover, Huai et al. use a trusted dealer to generate the necessary secret parameters. Thus, this solution trades off privacy against accuracy, and its computational cost is high.
Based on differential privacy methods and homomorphic cryptosystems, the privacy-preserving Naïve Bayes classifiers in [13], [14] can protect data providers' privacy. However, these solutions trade off the data providers' privacy against the accuracy of the classification model. Additionally, data providers incur high costs in performing their tasks.
It can be seen that the existing PPNBC solutions for the fully and horizontally distributed data settings have many disadvantages. Therefore, it is important to construct an efficient privacy-preserving Naïve Bayes classifier that offers a high security level and guarantees the accuracy property.
To build the Naïve Bayes classification model from a dataset under privacy constraints, a miner needs to compute the necessary probabilities from frequency values such as the number of data vectors with a given class label, denoted #(·), while each data owner discloses nothing about his/her data vector.
These frequency values are usually computed privately with secure sum computation or privacy-preserving frequency mining protocols. This paper proposes the secure multi-party sum protocol [15] as one of the most suitable and efficient candidates for privately computing the frequency values used in the Naïve Bayes classifier. In other words, by executing the secure multi-party sum protocol [15] multiple times, we obtain a privacy-preserving Naïve Bayes classifier in the semi-honest model for the fully distributed data setting.
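To make the role of the frequency values concrete, the following minimal Python sketch builds a Naïve Bayes model from counts alone. It assumes the standard formulation, in which the model needs per-class counts and per-(attribute, value, class) counts. The toy records, the function names, and the absence of smoothing are illustrative assumptions, not details from the paper. In the private setting, each counter increment below would be one owner's 0/1 contribution to one run of a secure sum protocol.

```python
from collections import Counter


def train_from_counts(class_counts, joint_counts, n_records):
    """Turn aggregated frequencies into Naive Bayes probabilities."""
    priors = {c: cnt / n_records for c, cnt in class_counts.items()}
    conditionals = {
        (attr, value, c): cnt / class_counts[c]
        for (attr, value, c), cnt in joint_counts.items()
    }
    return priors, conditionals


def classify(x, priors, conditionals):
    """Pick the class maximizing prior * product of conditionals
    (no smoothing; unseen combinations get probability 0)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for attr, value in enumerate(x):
            score *= conditionals.get((attr, value, c), 0.0)
        scores[c] = score
    return max(scores, key=scores.get)


# Hypothetical fully distributed data: one record per owner.
records = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "yes"),
    (("rain", "mild"), "yes"),
    (("rain", "hot"), "no"),
]

class_counts, joint_counts = Counter(), Counter()
for x, c in records:
    class_counts[c] += 1  # privately: one secure sum per class label
    for attr, value in enumerate(x):
        joint_counts[(attr, value, c)] += 1  # one secure sum per combination

priors, conditionals = train_from_counts(class_counts, joint_counts, len(records))
prediction = classify(("rain", "mild"), priors, conditionals)
```

Because the miner only ever sees the aggregated counts, the resulting model is identical to the one trained on the pooled data, provided the sums are computed exactly.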

B. AN EFFICIENT AND SECURE MULTI-PARTY SUM PROTOCOL [15]
This section presents the efficient and secure multi-party sum protocol from our previous work [15] (see Fig. 1), which is employed as the main component of the proposed PPNBC solution.
Note that $p$ and $q$ are two large primes such that $q \mid (p - 1)$, and $g$ is an element of $\mathbb{Z}_p^*$ satisfying $g \neq 1$ and $g^q = 1$. All computations in the protocol [15] are performed in $\mathbb{Z}_p$.
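The parameter conditions can be checked with a short Python sketch. It assumes the usual reading of the setup: primes p and q with q dividing p - 1, and an element g of order q modulo p. The symbol names, the helper functions, and the toy values p = 23, q = 11 are assumptions for illustration. Real deployments use large primes and a probabilistic primality test rather than trial division.

```python
def is_prime(n):
    """Trial-division primality test; adequate only for toy parameters."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def valid_group_params(p, q, g):
    """Check that p and q are primes with q | (p - 1), g != 1, g^q = 1 (mod p)."""
    return (
        is_prime(p)
        and is_prime(q)
        and (p - 1) % q == 0
        and 1 < g < p
        and pow(g, q, p) == 1
    )


def find_generator(p, q):
    """Return an element of order q by raising candidates to (p - 1) / q."""
    for h in range(2, p):
        g = pow(h, (p - 1) // q, p)
        if g != 1:
            return g
    raise ValueError("no element of order q found")


g = find_generator(23, 11)
```

Since q is prime, any g different from 1 with g^q = 1 has order exactly q, so the check above is sufficient for the subgroup used by the protocol.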

Input: $n$ users $\{U_1, \ldots, U_n\}$, each holding a secret value $v_i \in \{0, 1\}$.
Output: the miner obtains $v = \sum_{i=1}^{n} v_i$, while the users do not reveal their private values to anyone.
Step 1: Each user $U_i$ chooses two private keys $x_i, y_i \in [1, q - 1]$ and computes the public keys $X_i = g^{x_i}$ and $Y_i = g^{y_i}$, then shares $X_i, Y_i$ with the miner.
[...] then sends $M_i$ to the miner.
Step 4: The miner aggregates $M = \prod_{i=1}^{n} M_i$ and computes the value $v$ that satisfies $g^{v} = M$.

It can be seen that our privacy-preserving protocol for Naïve Bayes classification in the fully distributed setting is composed of secure multi-party sum protocols. Furthermore, the privacy of the secure multi-party sum protocol was established in [15]. Thus, based on the security definition of a secure cryptographic protocol in the semi-honest model and the composition theorem given in [16], the proposed privacy-preserving Naïve Bayes classification solution is semantically secure.
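The sum protocol of Section B can be sketched end to end in Python. Steps 1 and 4 follow the text. The intermediate steps are not legible in the source, so the sketch fills them with one construction that is consistent with the surrounding steps, and this is an assumption rather than the confirmed protocol of [15]: the miner publishes the products X and Y of all public keys, and user i sends M_i = g^{v_i} * X^{y_i} * Y^{-x_i}. All symbol names, the toy parameters, and the binary secrets are likewise assumptions.

```python
import random

P, Q, G = 23, 11, 2  # toy parameters: Q | (P - 1) and G has order Q mod P


def keygen(rng):
    """Step 1: two private keys in [1, Q - 1] and their public keys."""
    x, y = rng.randrange(1, Q), rng.randrange(1, Q)
    return (x, y), (pow(G, x, P), pow(G, y, P))


def secure_sum(bits, rng):
    """Return sum(bits) as the miner would recover it (requires len(bits) < Q)."""
    keys = [keygen(rng) for _ in bits]

    # Assumed step: the miner aggregates and publishes X and Y.
    X = Y = 1
    for _, (Xi, Yi) in keys:
        X, Y = X * Xi % P, Y * Yi % P

    # Assumed step: user i masks g^{v_i}; Y^{-x} equals Y^{Q - x} because
    # Y lies in the subgroup of order Q.
    messages = [
        pow(G, v, P) * pow(X, y, P) * pow(Y, Q - x, P) % P
        for v, ((x, y), _) in zip(bits, keys)
    ]

    # Step 4: the masks cancel in the product, leaving g^{sum of v_i}.
    M = 1
    for m in messages:
        M = M * m % P

    # The sum is at most len(bits), so a linear search recovers it.
    for v in range(len(bits) + 1):
        if pow(G, v, P) == M:
            return v
    raise ValueError("no matching sum found")


result = secure_sum([1, 0, 1, 1, 0, 1], random.Random(1))
```

Under this assumed construction the masks cancel because the product of all X^{y_i} and the product of all Y^{x_i} both equal g raised to (sum of x_i) * (sum of y_i). The final search is cheap because the sum of binary secrets never exceeds the number of users.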

D. ACCURACY ANALYSIS
Because the correctness of the secure multi-party sum protocol was proved in [15], the proposed protocol computes the frequency values exactly. Hence, the accuracy of the Naïve Bayes classification model is guaranteed in our proposal.

E. EFFICIENCY EVALUATION
To show the efficiency of the proposed solution, this section compares the running time of three typical privacy-preserving Naïve Bayes classifiers: the solution of Yang et al. [6], the solution based on the secure e-voting protocol [17], and the proposed solution. In particular, we consider the total running time of the miner and of each data owner in the compared solutions when tested on the preprocessed German credit dataset [18] from the UCI Machine Learning Repository.
Our experiments are implemented in Python and run on a virtual machine with the Ubuntu operating system, 2 cores of an Intel Core i5-8250 CPU @ 1.6 GHz, 4 threads, and 4 GB of memory.
The experimental results are presented in [...]. Additionally, the number of private keys used in Yang et al.'s solution and in our solution is much smaller than in the Hao-based solution. In summary, the experimental results show that the proposed PPNBC solution is more efficient than the other typical solutions. Thus, our solution is suitable for practical applications.

IV. CONCLUSION
In this work, we proposed an efficient method based on a secure multi-party sum protocol for privacy-preserving Naïve Bayes classification in the fully distributed data setting. Our proposed PPNBC solution not only protects each data owner's privacy but also guarantees the classification model's accuracy. The