Privacy-Preserving Decision Tree Solution in the 2-Part Fully Distributed Setting

Abstract: Data mining has emerged as an important technology for extracting knowledge from big data. However, there are growing concerns that its use infringes on privacy. This work proposes a decision tree mining solution, based on the ID3 algorithm, that preserves privacy in the 2-Part Fully Distributed (2PFD) setting.


I. INTRODUCTION
Data mining is the process of extracting potentially valuable information from large amounts of data stored in databases or data warehouses. More specifically, it extracts hidden, previously unknown, but useful knowledge or patterns from large databases, generalizing discrete facts in the data into regular knowledge that actively supports decision-making. However, because of privacy laws and the information security policies of individuals and organizations, many data holders are not allowed to provide their data sets for mining (for example, the personal data of bank customers or patient medical data). The question is therefore how to permit data mining on such data sets while protecting the private information of the individuals and organizations they describe. Solutions to this problem have been studied since the 2000s and are collectively known as Privacy-Preserving Data Mining (PPDM) [1].
Most PPDM techniques use some form of transformation on the original data to perform privacy protection [2]. There are two main approaches: randomization-based and cryptography-based.
Randomization-based approaches use techniques such as additive data perturbation and random subspace projection to mask the underlying data while preserving the statistical properties of the overall dataset. These approaches are fast and efficient, but they do not provide strong security guarantees and are often susceptible to attacks [3]. They also trade privacy against accuracy: the more privacy is required, the less accurate the mining results become, and vice versa [4].
Cryptography-based PPDM solutions typically treat the entire data (all attributes) as private and use cryptographic protocols such as homomorphic encryption and Yao's garbled circuits. Most of them rely on peer-to-peer communication and are defined for two-party scenarios; extending them to multi-party scenarios often incurs significant communication overhead. These solutions preserve the privacy of the data holders and guarantee accurate output, but their performance is quite poor [5].
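As a minimal illustration of this trade-off, the following Python sketch applies additive Gaussian perturbation to a hypothetical numeric column. The column `ages`, the noise levels, and the helper `perturb` are assumptions made for this example and are not taken from [3] or [4].

```python
import random
import statistics


def perturb(values, sigma, rng):
    """Return a copy of values with independent Gaussian noise added to each entry."""
    return [v + rng.gauss(0.0, sigma) for v in values]


rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
ages = [rng.randint(18, 90) for _ in range(10_000)]  # hypothetical private column

for sigma in (1.0, 10.0, 50.0):
    noisy = perturb(ages, sigma, rng)
    # Error of an aggregate statistic (what the miner wants to keep accurate).
    mean_error = abs(statistics.mean(noisy) - statistics.mean(ages))
    # Average distortion of a single record (what hides individual values).
    record_error = statistics.mean([abs(a - b) for a, b in zip(ages, noisy)])
    print(f"sigma={sigma:5.1f}  mean error={mean_error:.3f}  per-record error={record_error:.2f}")
```

A larger `sigma` masks individual records more strongly, while aggregate statistics such as the mean are estimated less precisely.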
The decision tree is a commonly used algorithm for classification problems, such as letter classification in text recognition. The ID3 (Iterative Dichotomiser 3) decision tree algorithm was introduced early and remains widely used. This paper considers the 2-Part Fully Distributed (2PFD) setting [6], in which the dataset is distributed over a large number of users, each record is owned by two different users, and one user knows only the values of a subset of the attributes while the other knows the values of the remaining attributes. The miner aims to build an ID3 decision tree while protecting the privacy of each user.
Although there have been numerous studies on privacy-preserving ID3, they are limited to the two-party horizontally partitioned data model [7], the horizontally partitioned data model with more than two parties [8,9,10,11,12,13,14,15], or the vertically partitioned data model with more than two parties [16,17,18,14]. Therefore, they cannot be applied to the 2PFD setting.
Our problem can be solved with existing solutions such as [14,19]. However, given the characteristics of the 2PFD setting, letting the parties communicate directly and sequentially with one another, as those solutions do, leads to high communication and time costs. Furthermore, these solutions assume that each pair of participants has a separate channel.
In this paper, we develop a privacy-preserving ID3 decision tree solution in the 2PFD setting. This solution does not require communication channels between different users. Additionally, many phases can be performed in parallel. First, we rewrite the formula that determines the best attribute. Then, we use the privacy-preserving frequency computation protocol in the 2PFD setting [20] to develop the privacy-preserving entropy of attribute protocol. Using this protocol, we construct the privacy-preserving ID3 decision tree solution. Finally, we evaluate the solution's performance and privacy.
The remainder of the paper is structured as follows: Section II reviews the technical preliminaries used in this work. Section III describes our protocol. Finally, Section IV concludes the paper.

A. ELLIPTIC CURVE CRYPTOGRAPHY
Elliptic curve cryptography (ECC) is a public-key cryptosystem based on the discrete logarithm problem on elliptic curves over finite fields. ECC is well known for offering smaller keys and faster operations than other public-key cryptosystems (such as RSA) at the same security level [21].
Let $E(\mathbb{Z}_p)$ be an elliptic curve over a finite field with a point $O$ at infinity, where $p$ is a large prime, on which the elliptic curve discrete logarithm problem is hard. In addition, $G$ is a base point of the elliptic curve with order $q$ (i.e., $q \cdot G = O$). The private key is a random number $x \in [1, q - 1]$, and the corresponding public key is the curve point $Y = x \cdot G$. To encrypt the plaintext $m$, the sender uses the public key $Y$: he randomly chooses $k$ from $[1, q - 1]$ and computes the ciphertext $C = (C_1 = M + k \cdot Y, \; C_2 = k \cdot G)$, where $M$ is a point of $E$ with $M = m \cdot G$. To decrypt the ciphertext $C$ using the private key $x$, the receiver computes $M = C_1 + (-x \cdot C_2)$ and recovers $m$ from $M$.
Under the decisional Diffie-Hellman assumption [22] for the curve $E$, the elliptic curve analog of the ElGamal system is semantically secure.
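The scheme above can be sketched in Python on a toy curve. The curve $y^2 = x^3 + 2x + 2$ over $\mathbb{F}_{17}$ with base point $(5, 1)$ of order 19 is a textbook-sized example chosen only for readability and offers no security; the function names are hypothetical. Encoding the plaintext as the point $m \cdot G$ and recovering it by exhaustive search are assumptions of this sketch that are only practical for small message spaces.

```python
import random

P_MOD, A = 17, 2            # toy curve y^2 = x^3 + 2x + 2 over F_17 (insecure, illustration only)
G, Q_ORDER = (5, 1), 19     # base point and its prime order
INF = None                  # point at infinity


def add(p1, p2):
    """Add two curve points in affine coordinates."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF                                   # p2 is the negative of p1
    if p1 == p2:
        lam = (3 * x1 * x1 + A) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    return (x3, (lam * (x1 - x3) - y1) % P_MOD)


def neg(p):
    return INF if p is INF else (p[0], (-p[1]) % P_MOD)


def mul(k, p):
    """Scalar multiplication k.p by double-and-add."""
    result, addend = INF, p
    while k > 0:
        if k & 1:
            result = add(result, addend)
        addend = add(addend, addend)
        k >>= 1
    return result


def keygen(rng):
    x = rng.randrange(1, Q_ORDER)                    # private key in [1, q - 1]
    return x, mul(x, G)                              # (private key, public key)


def encrypt(m, public_key, rng):
    k = rng.randrange(1, Q_ORDER)                    # fresh randomness per ciphertext
    return add(mul(m, G), mul(k, public_key)), mul(k, G)


def decrypt(ciphertext, x):
    c1, c2 = ciphertext
    point = add(c1, neg(mul(x, c2)))                 # M = C1 + (-x.C2)
    for m in range(Q_ORDER):                         # exhaustive search for m with m.G = M
        if mul(m, G) == point:
            return m
    raise ValueError("plaintext outside the toy message space")


rng = random.Random(2024)                            # seeded only to make the sketch repeatable
x, Y = keygen(rng)
for message in (0, 1, 7):
    assert decrypt(encrypt(message, Y, rng), x) == message
```

Python 3.8 or later is assumed, since the sketch uses the three-argument `pow` with exponent `-1` to compute modular inverses.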

B. THE ID3 ALGORITHM
The main purpose of the algorithm is to construct a decision tree from a data set of examples and their classes using information theory. The ID3 algorithm builds a decision tree in a top-down manner with information about the patterns.
Classification of an object starts at the root of the tree. The information gain is used to select the attribute tested at each node. The information gain of an attribute $A$ is defined [9] as

$$\mathrm{Gain}(S, A) = E(S) - E(A),$$

where $E(S)$, the entropy of a data set $S$ of tuples, is defined as

$$E(S) = -\sum_{j=1}^{l} \frac{|S_j|}{|S|} \log \frac{|S_j|}{|S|},$$

with $l$ the total number of different values the class attribute can take on, and $|S|$ and $|S_j|$ the number of tuples in $S$ and the number of tuples in $S$ having value $c_j$ for the class attribute, respectively. $E(A)$, the entropy of attribute $A$, is defined as

$$E(A) = \sum_{i=1}^{v} \frac{|S_{a_i}|}{|S|} E(S_{a_i}),$$

where $v$ is the number of possible values of the attribute $A$ and $S_{a_i}$ is the subset of $S$ with tuples having value $a_i$ for attribute $A$.

In ID3, the attribute selected at each node is

$$A^* = \arg\max_{A} \mathrm{Gain}(S, A) = \arg\min_{A} E(A),$$

i.e., the attribute that maximizes the information gain.
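The selection rule can be illustrated with a short Python sketch that computes the entropy of a data set, the entropy of each attribute, and the attribute with minimum attribute entropy (equivalently, maximum information gain). The toy records and the function names are hypothetical and serve only as an illustration; base-2 logarithms are assumed.

```python
import math
from collections import Counter


def entropy(labels):
    """Entropy of a list of class labels (base-2 logarithm)."""
    total = len(labels)
    return sum((n / total) * math.log2(total / n) for n in Counter(labels).values())


def attribute_entropy(rows, attr, target):
    """Weighted entropy of the class after splitting rows on attr."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[target])
    return sum(len(g) / len(rows) * entropy(g) for g in groups.values())


def best_attribute(rows, attrs, target):
    """Attribute with minimum attribute entropy, i.e. maximum information gain."""
    return min(attrs, key=lambda a: attribute_entropy(rows, a, target))


# Hypothetical toy data: "outlook" determines the class, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "yes"},
]
labels = [row["play"] for row in rows]
for attr in ("outlook", "windy"):
    gain = entropy(labels) - attribute_entropy(rows, attr, "play")
    print(attr, "gain =", gain)
print("selected:", best_attribute(rows, ["outlook", "windy"], "play"))
```

Because the entropy of the data set is the same for every candidate attribute at a node, minimizing the attribute entropy and maximizing the gain select the same attribute.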
The ID3 algorithm is shown in Figure 1.

[Figure 1: The ID3 algorithm]

In this section, we briefly introduce the privacy-preserving frequency computation protocol in the 2PFD setting proposed in [20]. Let $E(\mathbb{Z}_d)$ be an elliptic curve with a point $O$ at infinity, where $d$ is a large prime, on which the elliptic curve discrete logarithm problem is hard. In addition, $G$ is a base point of the elliptic curve $E$ with order $d$ (i.e., $d \cdot G = O$).
Each user keeps a private value in $\{0, 1\}$ that nobody else knows. Before the privacy-preserving frequency mining (PPFM) protocol starts, each user chooses three private keys from $[1, d - 1]$ and computes the corresponding public keys by multiplying each private key by the base point $G$. These public keys are sent to the miner before the protocol starts.
The privacy-preserving frequency computation protocol in the 2PFD setting consists of five phases, described in Fig. 2.

In this section, we discuss a privacy-preserving ID3 decision tree solution in the 2PFD setting. The miner only knows which attributes are in the system and their respective value domains, but not who owns them.

A. PROBLEM STATEMENT
We consider the 2PFD setting. There are $m$ attributes, $A_1, A_2, \ldots, A_j, \ldots, A_m$, and one class attribute $C$. Each record is owned by two users: one holds the values of a subset of the attributes, and the other holds the values of the remaining attributes and the record's class label, as illustrated in Figure 3. Our purpose is to allow the miner to train the decision tree using data from all users while protecting the privacy of each user. Therefore, our protocol allows the miner to obtain the attribute entropy by privately computing the required frequencies using the primitive presented in Section II.D. This protocol reveals nothing about any user's private information to the miner beyond the frequencies over all users' data. Furthermore, the protocol keeps the miner in the dark about the set of attributes that each user holds. For convenience, in the proposed protocol we denote by $t_k$ a tuple of the domain $D_j \times D_C$. Here $D_j$ is the domain of the attribute $A_j$, $D_C$ is the domain of the attribute $C$, $k$ ($k = 1, \ldots, K$) is the index of the $k$-th tuple in $D_j \times D_C$, $K = |D_j \times D_C|$, and the first and second values of the tuple $t_k$ are denoted by $t_k.a$ and $t_k.c$, respectively.
We assume that each user has private keys and public keys as presented in Section II.D. Note that the security of a ciphertext depends on fresh random values being used for each encryption. In the frequency mining protocol, the private keys are random values, and the associated public values cannot be reused across different runs of the protocol. Therefore, in the privacy-preserving attribute entropy protocol, for each computed frequency each user chooses a random element with which to randomize its public keys, which in turn randomizes the protocol parameters in each computation. In particular, if the protocol is to be run many times, repeated randomization of these values allows the same keys to be reused. Our protocol is depicted as follows.
Basically, the correctness and privacy of our privacy-preserving attribute entropy protocol can be derived from the frequency computing in Section II.C. Therefore, the protocol outputs attribute entropy correctly.
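To see why frequencies suffice for the attribute entropy, it can be written directly in terms of counts. In the sketch below, $S_{a_i}$ is the subset of the data set $S$ with value $a_i$ for attribute $A$, and $|S_{a_i, c_j}|$ denotes the number of tuples with value $a_i$ for $A$ and class value $c_j$. This notation is introduced here for illustration only, and the rewritten formula used by the protocol may be arranged differently. Terms with a zero count are taken to be zero.

```latex
\begin{align*}
E(A) &= \sum_{i} \frac{|S_{a_i}|}{|S|}
        \left( -\sum_{j} \frac{|S_{a_i,c_j}|}{|S_{a_i}|}
               \log \frac{|S_{a_i,c_j}|}{|S_{a_i}|} \right) \\
     &= \frac{1}{|S|} \sum_{i} \sum_{j} |S_{a_i,c_j}|
        \log \frac{|S_{a_i}|}{|S_{a_i,c_j}|},
     \qquad |S_{a_i}| = \sum_{j} |S_{a_i,c_j}| .
\end{align*}
```

Since $|S|$ is the same for every candidate attribute at a given node, the attribute minimizing $E(A)$ can be determined from the joint frequencies $|S_{a_i, c_j}|$ alone.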

Theorem 3.2.
This protocol preserves the privacy of the honest users against the miner and up to $2n - 2$ corrupted users, where $n$ is the number of records. In cases with only two honest users, it remains correct as long as the two honest users do not own the attribute values of the same record.
Proof. In the protocol, the random values are chosen independently for every frequency value, so the computation for each frequency is independent. The theorem therefore follows immediately from the privacy-preserving frequency computation protocol in [20].
From the above two theorems, this protocol ensures accuracy and privacy.

C. SECURE ID3 DECISION TREE ALGORITHM
We assume that each user's data includes sensitive attribute values (without loss of generality, all attribute values of each user are considered sensitive). As a result, no user is prepared to give the miner their data without privacy protection. Furthermore, the miner does not know which attributes each user owns, but only knows the set of attributes and their value domains. To allow the miner to build a decision tree while protecting the privacy of each user, we design a privacy-preserving decision tree solution.
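The overall flow can be sketched in Python as an ID3 builder that sees the data only through frequency queries. Here `make_plain_oracle` is a hypothetical stand-in that counts in the clear; in the proposed solution these counts would instead come from the privacy-preserving protocols. The toy records, the function names, and the tuple-based tree representation are assumptions made for illustration.

```python
import math
from collections import Counter


def make_plain_oracle(rows, target):
    """Hypothetical stand-in for the secure protocol: counts pairs in the clear."""
    def frequencies(path, attr):
        matching = [r for r in rows if all(r[a] == v for a, v in path.items())]
        return Counter((r[attr], r[target]) for r in matching)
    return frequencies


def attribute_entropy(freq):
    """Attribute entropy from a table {(attribute_value, class_value): count}."""
    total = sum(freq.values())
    size = Counter()
    for (value, _cls), n in freq.items():
        size[value] += n
    return sum(n / total * math.log2(size[v] / n) for (v, _cls), n in freq.items() if n)


def id3(attrs, frequencies, path=None):
    """Build a tree from frequency queries only; attrs must be non-empty."""
    path = path or {}
    tables = {a: frequencies(path, a) for a in attrs}
    class_counts = Counter()
    for (_value, cls), n in tables[attrs[0]].items():
        class_counts[cls] += n
    if sum(1 for n in class_counts.values() if n) == 1:
        return class_counts.most_common(1)[0][0]      # pure node: return a leaf
    best = min(attrs, key=lambda a: attribute_entropy(tables[a]))
    remaining = [a for a in attrs if a != best]
    children = {}
    for value in sorted({v for (v, _cls), n in tables[best].items() if n}):
        if remaining:
            children[value] = id3(remaining, frequencies, {**path, best: value})
        else:  # no attribute left: label with the majority class for this value
            counts = Counter({c: n for (v, c), n in tables[best].items() if v == value})
            children[value] = counts.most_common(1)[0][0]
    return (best, children)


# Hypothetical toy records, used only to exercise the sketch.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
    {"outlook": "rain", "windy": "yes", "play": "no"},
    {"outlook": "rain", "windy": "no", "play": "yes"},
]
tree = id3(["outlook", "windy"], make_plain_oracle(rows, "play"))
print(tree)
```

Passing the frequency source as a function keeps the tree-building logic independent of how the counts are obtained, which is the property the secure solution relies on.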
The miner implements the ID3 decision tree algorithm as follows:

Input:
A, a set of attributes.
C, the class attribute.
S, data set of tuples.

7: Choose the attribute with the highest information gain in A as the node.

We assess the proposed solution's correctness, privacy, and performance.

Correctness analysis
The correctness of the privacy-preserving ID3 decision tree solution follows from the secure frequency computation protocol in Section II.C and the secure attribute entropy computation protocol in Section III.B.

Proof. The secure attribute entropy protocol in Section III.B correctly computes the entropy of each attribute in A, so at every node the miner selects the same attribute as the ID3 algorithm would select on the pooled data.

Privacy analysis
Privacy in this solution follows from the privacy-preserving frequency computation protocol in Section II.C and the secure attribute entropy computation protocol in Section III.B.

Proof. To prove the privacy-preservation property of the proposed solution, we use the Composition Theorem for the semi-honest model (Theorem 3.3). A detailed proof of Theorem 3.3 can be found in [11] and is omitted here.

Theorem 3.3 (Composition theorem for the semi-honest model, multi-party case) [11]. Suppose that the m-ary functionality g is privately reducible to the k-ary functionality f and that there exists a k-party protocol for privately computing f. Then there exists an m-party protocol for privately computing g.

According to this theorem, in the semi-honest model, a protocol built by composing (proven) secure subprotocols is itself secure. Combined with the fact that the computation is performed independently for every frequency, the privacy of the solution follows directly from the privacy-preserving frequency computation protocol and the secure attribute entropy computation protocol.

Communication and Computational cost
Next, we compare the performance of our solution with the solution in [14]. The costs are expressed in terms of the number of non-class attributes, the number of class attribute values, the maximum number of values of a non-class attribute, and the length of the encryption key (which is usually very large).
In our solution, to determine the best data classifier attribute, secure attribute entropy computing protocols need to be implemented. In each of these protocols, each user needs to compute 5 ciphertext in phase 1 and phase 3, each user computes 3 ciphertext in phase 2, miner computes 2 ciphertext sum of 2 ciphertext in phase 1, and ciphertext sum of ciphertexts in phase 4, since in phases users and are assumed to execute concurrently, computation cost = ( (8 + 5 ) ). In terms of communication costs, each user sends 4 messages to the miner in phase 1, receives 5 messages from the miner, and sends messages to the miner in phase 3. Each user sends 2 messages to the miner in phase 1, receives 4 messages from the miner, and sends 3 messages to the miner in phase 3, so communication cost = ( (2 + 17 ) . The grid will be 2 horizontally and vertically in the solution of horizontal merge and vertical development [14], therefore the computation cost is ( ( + + )4 3 ) and the communication cost is ( ( + )4 ). As a result, the proposed procedure is more efficient than [14].

IV. CONCLUSION
In this paper, we have proposed a privacy-preserving ID3 decision tree solution in the 2PFD setting. This solution allows the miner to correctly construct the ID3 decision tree while maintaining the privacy of each user's sensitive data. It also protects the privacy of each user's attribute ownership.
In future work, we will continue to study privacy-preserving data mining solutions in the 2PFD setting.