A novel secure deep ensemble learning protocol based on Conjugacy search problem homomorphic encryption scheme

— Machine learning and deep learning are now widely deployed, and user privacy is a central concern in domains such as medicine and finance. Machine learning models must not only produce accurate predictions but also preserve the privacy and security of user data. In this paper, we propose a method for privacy-preserving training and use of deep learning models that employs a homomorphic encryption scheme based on the conjugacy search problem. The method encrypts the data before transferring it to a cloud server, which hosts the participants' local deep learning models and runs prediction on the encrypted data; the encrypted predictions are sent back to the users, who decrypt them to obtain the models' results. These results can also be assembled into a new training dataset for a model on the client side. On the MNIST dataset, our proposed model achieves accuracy over 98% with some very simple network architectures, approximating the accuracy of centralized complex models, which do not ensure privacy.


INTRODUCTION
Deep learning is one of the advanced approaches to machine learning and has drawn increasing attention in recent years. It is widely applied in areas such as image processing, face recognition, voice identification, and medical prediction. The key advantage of a deep learning model is its ability to automatically learn features of the data and construct better new features for prediction.
However, the effectiveness of deep learning models depends heavily on the quality and quantity of the data, so sharing data to fit an effective model is a natural necessity. This creates a trade-off between user privacy and accuracy. Data used in the training and prediction phases can contain highly sensitive medical, financial, or personal information and must be kept confidential. Sharing data among the participants of a collaborative model-building effort risks hacking, copying, reuse, and information leakage, posing a significant threat to user privacy and causing serious damage to the reputation and finances of users and organizations. Unlike traditional models, deep learning models have special features such as a large number of parameters, complex structure, and nonlinear computations over real numbers; as a result, ensuring privacy for deep learning is a distinctive challenge. These approaches can be classified according to the data-sharing model, and research in this field falls mainly into three directions. The first is transforming the input data so that the original data can be used for training without revealing information, while still supporting accurate prediction. In this model, training and usage of the model are provided by a service server: users upload their data and the server carries out the computation, so the data shared between participants and the server is the input data itself. To protect this data, the approach seeks solutions for transforming it: local inputs are transformed by the participants and forwarded directly to the service servers, which perform the computation and training. Typically, the original data is transformed using homomorphic encryption algorithms, secure multi-party computation protocols, secret-sharing techniques, or noise perturbation.
This classical method has been used extensively in earlier work on privacy-preserving data mining and machine learning. Its advantage is that it can be used in both the training and prediction phases. On the other hand, it brings many problems, especially in terms of performance and accuracy. In deep learning models, it is therefore rarely employed in the training phase, which requires many repetitions of complex computations.
The second approach is considered the most practical way to train models on distributed data and has many real-world applications, such as Google Keyboard, IoT, model sharing, and distributed learning, with popular representatives including distributed model training, split learning, SGD with large batch sizes, and federated learning. It is an efficient approach that allows participants to collaborate on training an aggregate model based on their local data: each participant in the training phase trains on its own local data and sends local intermediate training results (parameters or gradients) to the other parties or a server, which assembles a global model.
With the methods mentioned above, the model's architecture and parameters are shared among the participants, along with intermediate training results such as gradients, activation values, and updated weights. Although data is not leaked directly, the authors of [12] have shown that an attacker can approximately reconstruct the original raw data, especially when the architecture and parameters of the model are not protected. Additionally, all models must adopt a common architecture, which causes much inconvenience in operation.
To address these problems, Papernot et al. proposed a promising method called Private Aggregation of Teacher Ensembles (PATE). This is an ensemble-learning approach in a black-box setting: multiple models are trained on distinct datasets, such as records from different subsets of users. Because these models are trained directly on sensitive data, they are not made public; instead, they are used as "teachers" to train a "student" model. The student learns to predict an output selected by the vote of all the teachers and is not allowed to directly access any individual teacher or its underlying parameters.
The student model is trained by all the teachers together; no single teacher or dataset determines the training process, so the teachers' data and models are not revealed even if an adversary obtains the student model.
The advantage of this ensemble approach is that participants do not need to agree on hyperparameters or a model architecture. Its disadvantage is that accuracy decreases significantly, and it requires all teacher models to be of good quality, which is unlikely in practice. Furthermore, once the student model has queried the teacher models enough times, the results can be used to mount a black-box attack on a teacher model to replicate its training data. Moreover, the student must transfer enough publicly available data to the teachers, which is a remarkable privacy risk, especially in medical or financial applications. This study therefore aims to protect and balance the privacy of training data with the availability and maintainability of deep learning model performance. We propose a solution that increases the security and efficiency of this ensemble learning model by using homomorphic encryption based on the conjugacy search problem.
The paper is divided into six sections. Section 1 presents approaches to privacy assurance for machine learning models; Section 2 gives information about the considered model; Section 3 describes the homomorphic encryption scheme based on the conjugacy search problem; Section 4 presents the ensemble learning protocol that conceals inputs using this scheme; Section 5 provides evaluations of the security and performance of the proposed protocol. Finally, a conclusion summarizes the results and limitations of the paper.

II. PROBLEM STATEMENTS
In gathering data for model building and evaluation, no matter how hard we try, the collected data is never all of the available data; collecting everything is completely impractical. Therefore, there is no certainty that a prediction model built on the collected data will always perform well on unseen data. Moreover, the data samples themselves contain noise. Whatever algorithm is used for model building, we therefore need complementary techniques to avoid or reduce overfitting and improve the generalization of the model. Since each algorithm is built on a different approach, and the training data may differ as well, for each problem, by the "law of large numbers", a combination of results from many different models is likely to yield better results. This is called ensemble learning.
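The variance-reduction intuition behind ensemble learning can be illustrated with a toy simulation (our own illustration, not from the paper): averaging the predictions of several independently noisy models shrinks the expected error compared to any single model.

```python
import random

random.seed(0)
TRUE_VALUE = 1.0

def noisy_model():
    # each hypothetical model predicts the true value plus independent noise
    return TRUE_VALUE + random.gauss(0, 0.5)

# error of a single model, averaged over many trials
single_errors = [abs(noisy_model() - TRUE_VALUE) for _ in range(1000)]

# error of a 10-model ensemble that averages its members' predictions
ensemble_errors = []
for _ in range(1000):
    preds = [noisy_model() for _ in range(10)]
    ensemble_errors.append(abs(sum(preds) / len(preds) - TRUE_VALUE))

# averaging 10 independent models shrinks the mean error by roughly 1/sqrt(10)
```

The same effect motivates combining several teacher models rather than relying on a single one.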
Consider the following practical problem. A retail company has information about potential customers and wants to offer them installment packages to stimulate revenue. However, the company has no certainty about a customer's credit history and thus no assurance that the customer will repay on time. The usual solution is for the company to send the banks a list of potential customers for evaluation. This might expose customer information, and it is also difficult to customize specific requirements, such as extending credit to a certain group of clients. The question is: how can the retail company build its own credit model based on its unlabeled data while ensuring the data is not disclosed to anyone, not even the banks?
In the proposed model, we suppose there are N + 1 participants in the training phase. One participant acts as the student and owns an unlabeled dataset D_unlabeled. This data is private, and the student does not wish to disclose it to any other party: the student wants to use the data to construct their own model without revealing private information. The student's difficulty is to label the data without disclosing it.
The other N participants {T_1, T_2, . . . , T_N} play the role of teacher models. Each keeps a local model trained on its own private data, and provides the student with labels for the student's data based on that local model.
In this model, the student sends unlabeled data to the teachers and receives label predictions for it. Based on the results, the student chooses a label for each point in D_unlabeled and then builds a suitable machine learning model from the data and the labels received.
The requirement is a solution that ensures the privacy of the data the student sends to the teacher models, so that no participant (including an attacker) can obtain the student's private information, nor information about the participating teachers' models.
For simplicity in the evaluation, we assume that all participants (teachers and students) are semi-trusted or honest-but-curious. These members will strictly comply with a designed protocol but attempt to infer additional information from other parties when implementing the protocol. In other words, parties do not actively interfere with the protocol, but only try to get as much information as possible from the data obtained.
In addition, for simplicity of evaluation, we suppose that all teachers use the same deep learning architecture with common hyperparameters (architecture, number of neurons, and layer types). The models end up with different parameters after training because their local datasets differ.
Based on these assumptions, the paper proposes a protocol that ensures the privacy of data from a student by a homomorphic cryptosystem based on the Conjugacy search problem.

A. Conjugacy Search Problem (CSP)
The conjugacy search problem is one of the hard (NP-hard) problems used in cryptography to build highly secure cryptosystems. Its special feature is that cryptosystems based on this problem permit operations on real numbers, which is rare among cryptosystems. Conventional cryptosystems based on problems such as the discrete logarithm, prime factorization, or LWE usually require large integers or polynomials, so real-valued inputs must first be transformed, which makes the computation extremely expensive. Cryptosystems based on the conjugacy search problem address this issue. The problem is stated as follows: given a non-commutative algebraic structure G and two conjugate elements g, h ∈ G with h = x g x^{-1} for some x ∈ G, find such an x. The hardness of CSP remains useful even for post-quantum cryptosystems; at present it is still very difficult for known quantum algorithms. In fact, CSP is a special form of the Group Factorization Problem (GFP), which was recently shown to be intractable for d ≥ 4 on the linear group GL_d(R). Therefore, the order of the matrices used in our protocol is chosen to be at least 4, for the security of the protocol.

B. Homomorphic cryptosystem on a non-commutative ring
Initialization: the scheme works on a non-commutative ring of square matrices. The secret key is a random invertible matrix H. To encrypt a plaintext value m, it is embedded as an eigenvalue of a mask matrix M whose remaining entries are random, and the ciphertext is the conjugate C = H M H^{-1}. Decryption computes M = H^{-1} C H and reads off m. Considering the homomorphism with addition: C_1 + C_2 = H (M_1 + M_2) H^{-1}, which encrypts m_1 + m_2; similarly, C_1 C_2 = H (M_1 M_2) H^{-1} encrypts m_1 m_2. Therefore, the cryptosystem is also homomorphic for multiplication.
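A minimal numerical sketch of this conjugation scheme C = H M H^{-1}, under our own assumptions (the plaintext sits at the (0,0) entry of an upper-triangular 2 × 2 mask matrix, floating-point arithmetic, and the helper names `keygen`/`encrypt`/`decrypt` are ours, not the paper's), illustrates the homomorphic properties:

```python
import random

def mat_mul(A, B):
    # 2x2 matrix product
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def mat_add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def mat_inv(A):
    # inverse of a 2x2 matrix
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[ A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det,  A[0][0] / det]]

def keygen():
    # secret key: a random, well-conditioned invertible 2x2 matrix H
    while True:
        H = [[random.uniform(-5, 5) for _ in range(2)] for _ in range(2)]
        if abs(H[0][0] * H[1][1] - H[0][1] * H[1][0]) > 1.0:
            return H

def encrypt(m, H):
    # embed m as the (0,0) entry (an eigenvalue) of an upper-triangular
    # mask matrix M with random remaining entries, then conjugate by H
    M = [[m, random.uniform(1, 5)], [0.0, random.uniform(1, 5)]]
    return mat_mul(mat_mul(H, M), mat_inv(H))   # C = H M H^-1

def decrypt(C, H):
    M = mat_mul(mat_mul(mat_inv(H), C), H)      # M = H^-1 C H
    return M[0][0]

H = keygen()
c1, c2 = encrypt(2.0, H), encrypt(3.0, H)
# decrypt(mat_add(c1, c2), H) ≈ 5.0 and decrypt(mat_mul(c1, c2), H) ≈ 6.0
```

Because C is similar to M, sums and products of ciphertexts under the same key H decrypt to sums and products of the plaintexts, up to floating-point error.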

C. Security of cryptosystem
The security of the above cryptosystem relies on the non-commutativity of matrix multiplication, which makes the encryption scheme one-way. This ensures that an attacker cannot obtain the plaintext message from the ciphertext. In fact, the secrecy of the message rests on the difficulty of the conjugacy search and of recovering the eigenvalues of the ciphertext matrix over a noncommutative ring.
Due to the difficulty of the conjugacy search problem, an adversary cannot recover the factors of the ciphertext C = H M H^{-1}, so the message m is not leaked by splitting the ciphertext C. According to the encryption algorithm, the plaintext message can be regarded as an eigenvalue of the ciphertext matrix C, since C is similar to M.

IV. THE PROPOSED PROTOCOL
To ensure that the input forwarded from a student to a teacher is not revealed, the paper proposes applying the homomorphic cryptosystem based on the conjugacy search problem. In general, a teacher can be any machine learning model; however, for simplicity and consistency in the evaluation, we assume that all teacher models are deep neural networks. As mentioned previously, classical models can be approximated by deep neural models with an appropriate architecture.
Data submitted by the student is passed through each teacher's deep neural network to produce a prediction based on that teacher's model. The predictions are then used to build the student model, which is also a deep neural network. The data sent from the student to the teachers is encrypted using the homomorphic encryption algorithm based on the conjugacy search problem described in the previous section.
However, note that deep learning models require nonlinear activation functions, which the homomorphic encryption scheme does not support. We therefore need to transform the deep learning model; in other words, we must replace its nonlinear activation functions with forms suitable for homomorphic computation.
To preserve the properties of the homomorphic cryptosystem, the model is modified as follows. Activation layer: common activation layers use nonlinear functions such as ReLU, Sigmoid, and Tanh. We therefore need alternative activation functions that approximately preserve the behavior of these functions while remaining computable under the scheme.
Pooling layer: max pooling cannot be computed in a homomorphic cryptosystem. Consequently, we use average pooling instead; it involves only addition and a public scaling factor, so it can be applied to homomorphically encrypted data.
Dropout layer: dropout discards random activations during training, so it cannot be used.
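The layer substitutions above can be sketched minimally. The choice of x² as the polynomial stand-in for ReLU/Sigmoid is our assumption for illustration (a common choice in the HE literature); the paper does not fix a particular replacement function.

```python
def square_activation(x):
    # polynomial stand-in for ReLU/Sigmoid: uses only multiplication,
    # so it can be evaluated on homomorphically encrypted values
    return x * x

def average_pool(window):
    # mean pooling needs only addition and a public scaling constant,
    # unlike max pooling, whose comparisons are unavailable under HE
    return sum(window) / len(window)
```

Together with removing dropout, these substitutions leave only additions and multiplications in the forward pass, which are exactly the operations the cryptosystem supports.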
From the above modifications, we can build a non-disclosure ensemble learning protocol using the homomorphic cryptosystem based on the conjugacy search problem, as in Algorithm 1 below.
- Select a label for data point x based on the results obtained from the teacher models;
- Remove the labeled element from D_unlabeled.

Student:
- Use the labeled data to train the student model.
- Return the model W_S.
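The student-side labeling loop of the algorithm can be sketched as follows. Encryption and decryption of each data point are elided here, and the toy threshold "teachers" are purely illustrative stand-ins for trained local models.

```python
from collections import Counter

def label_dataset(unlabeled, teachers):
    """Query every teacher for each data point, select the label by
    majority vote, and move the point into the labeled training set."""
    labeled = []
    remaining = list(unlabeled)
    while remaining:
        x = remaining.pop()                        # remove element from D_unlabeled
        votes = [teach(x) for teach in teachers]   # one prediction per teacher
        y = Counter(votes).most_common(1)[0][0]    # select label by majority vote
        labeled.append((x, y))
    return labeled

# toy "teachers": threshold classifiers standing in for trained local models
teachers = [lambda x: int(x > 0.5),
            lambda x: int(x > 0.4),
            lambda x: int(x > 0.6)]
labeled = label_dataset([0.1, 0.9, 0.45], teachers)
```

In the full protocol, each x would be encrypted before being sent and each teacher's answer decrypted on return; the aggregation step itself is unchanged.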
Note that we can extend the size of the matrix H and the mask matrices X_i to 2m × 2m for any m ≥ 1. However, to reduce communication and computation costs, we choose the minimum size, as in the algorithm above.

A. Protocol security analysis
Based on the security of the cryptosystem and the conjugacy search problem, even if an attacker obtains a ciphertext C_i, it is infeasible to recover the corresponding plaintext x_i.
Therefore, the protocol is secure during data transmission from the student to the teachers. For each data point, the student uses a different randomly generated key, so the security of a data point depends only on its own key in the transmission phase; the protocol is therefore unaffected by key-sharing attacks.

B. Communication and computation cost analysis
The protocol has two communication phases. In the first, the student computes ciphertexts and sends them to the teacher machines. The number of ciphertexts is n, the size of the input vector, and each ciphertext is a 2 × 2 matrix, so the student sends 4n values to each teacher. With N teachers, the total amount of data sent is 4nN. In the second phase, each teacher computes and returns a 2 × 2 encrypted prediction matrix to the client, so the total bandwidth required in this phase is 4N.
The bandwidth required over the entire protocol is therefore 4nN + 4N, where n is the size of the input vector and N is the number of teachers.
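The cost formula can be checked with a small helper. The example values (a flattened 28 × 28 MNIST image giving n = 784, and N = 5 teachers) and the unit of "matrix entries transmitted" are our assumptions for illustration.

```python
def protocol_bandwidth(n, N):
    """Matrix entries transmitted per data point: input size n, N teachers."""
    upload = 4 * n * N   # n ciphertexts of 2x2 = 4 entries each, to N teachers
    download = 4 * N     # one 2x2 encrypted prediction matrix back per teacher
    return upload + download

# e.g. a flattened 28x28 MNIST image (n = 784) with N = 5 teachers
cost = protocol_bandwidth(784, 5)  # → 4*784*5 + 4*5 = 15700 entries
```

The download term 4N is negligible next to the upload term 4nN, so the cost scales essentially linearly in both the input size and the number of teachers.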
Within the protocol, the mathematical operations are very simple low-order matrix operations, so speed and performance are good; execution performance is almost unaffected compared to the case without encryption.

C. Evaluate accuracy of the training model
To evaluate the proposed model, we use the MNIST handwritten-digit dataset, split into three subsets. The first consists of 50,000 labeled samples distributed among the teachers. The second has 10,000 unlabeled samples kept by the student, representing the unlabeled dataset. Finally, 10,000 samples are reserved for testing the accuracy of the model.
We divide the first, labeled 50,000-sample dataset by the number of teachers in order to evaluate the effect of the number of teachers on the protocol. In practice, the number of teachers is rarely large, and each teacher requires acceptable accuracy, so we evaluate with 2, 3, 4, and 5 teachers. The teacher models are selected randomly between VGG11 and VGG13 for simplicity of computation. Models with more parameters might yield better results, but our goal is not to build a model with the highest accuracy but to confirm the effectiveness of the privacy-protection protocol, so we do not use models with complex architectures. With 2 teachers, each teacher keeps around 25,000 samples of the labeled training dataset, one with the VGG11 architecture and the other with VGG13.
With 3 teachers, two use the VGG11 architecture and the remaining one VGG13, with datasets of 17,000, 16,000, and 17,000 samples, respectively.
With 4 teachers, three use VGG11 and the other VGG13, with datasets of 5,000, 10,000, 15,000, and 20,000 samples, respectively, to ensure relative randomness and imbalance in the data distribution.
With 5 teachers, three use VGG11 and the others VGG13, with 10,000 samples distributed evenly to each teacher.
On the student side, we use a VGG11 model trained on the data selected and labeled by the teachers. To evaluate the effect of the number of samples possessed by the student, we divide the 10,000-sample unlabeled dataset into subsets and train the student model on 500, 1,000, 2,000, 5,000, and 10,000 samples, respectively. In each scenario, we evaluate both the model without the proposed protocol and the model with it.
The results indicate that the proposed protocol achieves accuracy close to that of the model without the protocol under the same conditions, above 98% in both cases. In some cases, the protocol even yields slightly higher accuracy than the unprotected model.
In general, the proposed protocol gives an accuracy comparable to the application of native models without protection.
In summary, the proposed protocol is feasible in practice and does not affect the efficiency of the training model much.

VI. CONCLUSION
This paper has presented a general approach to ensuring privacy for machine learning and deep learning models based on the ensemble learning model. We analyzed the advantages and disadvantages of this model and then proposed an improved solution for the privacy of the training process of deep learning networks under the ensemble learning model. The proposed model uses a homomorphic cryptosystem and preserves the privacy of the student's data as well as of the prediction results. The results show that the proposed protocols are effective in practice: they maintain the privacy of the data while achieving accuracy of more than 98%, almost equivalent to centralized, non-private models, which demonstrates the efficiency of the approach.