Convolutional neural network based side-channel attacks

The profiled attack is considered one of the most effective side-channel attack (SCA) methods for revealing the secret key and evaluating the security of cryptographic devices. When framed as a classification problem, profiled SCA can be carried out successfully with machine learning techniques, as recent works have shown. However, these studies provide only the general principles of the attack. This paper therefore presents the technical aspects and specific instructions an attacker needs to perform a profiled attack on a specific cryptographic device using a popular deep learning technique, the convolutional neural network. The experimental process and the results of the attack on AES-128 are presented to demonstrate the effectiveness of the attack procedure.

Side-channel attacks (SCA) are a powerful class of cryptanalytic techniques that exploit the information leaked from the physical implementations of cryptographic algorithms to recover the secret key [1]. SCA can be classified into two main types: non-profiled attacks, such as Differential Power Analysis (DPA) [1] and Correlation Power Analysis (CPA) [2], and profiled side-channel attacks. Profiled attacks play an important role in the security evaluation of cryptographic implementations [3]. Indeed, they provide a security assessment under the worst-case scenario. Profiled SCA attacks based on supervised learning techniques have recently received significant attention in the SCA community. Researchers in the security field have explored different machine learning techniques to assess their effectiveness in the SCA context. As a consequence, there are several papers at the intersection of machine learning techniques and profiled SCA attacks [4] [5]. While different scenarios usually require different machine learning techniques, almost all of this work shows that Support Vector Machines (SVM) and Random Forests (RF) are good baseline algorithms for profiled SCA attacks.
Although machine learning-based profiled attacks relax the need for probability distributions of side-channel leakage samples, they still require specific extraction techniques to identify points of interest (POIs) on the trace. For unprotected devices, finding POIs is fairly easy with methods such as the signal-to-noise ratio (SNR), the sum of squared differences (SOSD), and correlation power analysis (CPA) [4] [3] [5]. For protected devices, however, determining POIs is a challenge for SCA [6] [7]. So far, no effective method has been proposed for selecting POIs for such devices. Fortunately, deep learning can solve the modelling problem without extracting specific features in the pre-processing phase of the traces [8] [6] [7]. Therefore, in recent years, deep learning has begun to demonstrate its efficiency in profiled SCA attacks, because deep networks can closely approximate a very wide range of functions.
Several studies have already investigated the performance of deep neural networks in profiled SCA attacks. Maghrebi et al. [8] first compared the SCA efficiency of deep learning and machine learning in terms of the number of side-channel traces. The work by Cagli et al. [9] evaluates the performance of convolutional neural networks (CNNs) in scenarios where power consumption traces are misaligned due to countermeasures or hardware-related effects. Their research shows that CNNs combined with data augmentation techniques can effectively suppress those misalignment effects. Prouff et al. [6] give an empirical solution to the problem of choosing hyper-parameters for CNNs and multilayer perceptrons (MLP), and further establish the power of applying deep learning to profiled SCA attacks. Their other important contribution is the release of the public ASCAD dataset, which provides side-channel traces of a masked 128-bit AES implementation. The ASCAD dataset makes it easy for researchers to improve existing models or compare new deep neural network architectures. Zaid et al. [7] highlight the importance of configuring the hyperparameters and architecture; without proper configuration, the models do not perform well. They state that without understanding the influence of a hyperparameter, the maximum potential of deep learning architectures cannot be realized. However, the above studies describe only the theoretical aspects of the attack, without giving the details of the attack process from training the CNN to finding the secret key of the device under attack. Therefore, in this article, we cover the attack comprehensively, giving specific instructions for an attacker to execute a profiled attack using a CNN and to evaluate the security of specific devices.
The paper is structured as follows: Part 2 introduces the basics of profiled attacks and deep learning. Part 3 presents the method of profiled attacks using a convolutional neural network. Experiments and experimental results are presented in Part 4. The conclusions of the paper are presented in Part 5.

A. Profiled side-channel attacks
For profiled SCA attacks, the adversary is assumed to have a pair of identical devices: a profiling device and a target device. In the attack scenario of our paper, the target device runs a symmetric cryptographic algorithm with a fixed secret key. The attacker controls the input and the key of the profiling device, so he can characterize the leaked information very precisely by applying statistical techniques. Profiled SCA attacks are performed in two phases: the profiling phase and the attack phase.
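In the profiling phase, a dataset of $N_p$ profiling traces is acquired on the profiling device. It is seen as a realization of the random variable $S_{N_p} \triangleq \{(t_1, z_1), \ldots, (t_{N_p}, z_{N_p})\}$, where $t_i$ is the $i$-th trace and $z_i$ its target value. The template attack (TA) is a typical profiled attack that assumes that $P(t \mid z_j)$ follows a Gaussian distribution for each target value $z_j$:

$$P(t \mid z_j) = \frac{1}{\sqrt{(2\pi)^{N} \, |\Sigma_j|}} \exp\left(-\frac{1}{2} (t - \mu_j)^{T} \Sigma_j^{-1} (t - \mu_j)\right) \qquad (1)$$

where $t$ represents an $N$-dimensional vector, $\mu_j$ is the mean vector, and $\Sigma_j$ is the covariance matrix; the pairs $(\mu_j, \Sigma_j)$ are called templates. For TA, the attacker builds a separate template for each class in the profiling phase, each class corresponding to a different intermediate value $z_j$. In the attack phase, the attacker uses maximum likelihood estimation (1) for the key recovery process.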
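The template construction and the likelihood evaluation described above can be sketched in a few lines of NumPy. This is only an illustration on synthetic data: the function names, the toy traces, and the small ridge term added to each covariance matrix are assumptions made for the example, not part of the attack described in this paper.

```python
import numpy as np

def build_templates(traces, labels, n_classes):
    """Estimate one (mean, covariance) template per class from POI-reduced traces."""
    n_poi = traces.shape[1]
    means = np.zeros((n_classes, n_poi))
    covs = np.zeros((n_classes, n_poi, n_poi))
    for c in range(n_classes):
        cls = traces[labels == c]
        means[c] = cls.mean(axis=0)
        # Small ridge term (an assumption) keeps the covariance invertible.
        covs[c] = np.cov(cls, rowvar=False) + 1e-6 * np.eye(n_poi)
    return means, covs

def log_likelihoods(trace, means, covs):
    """Gaussian log-density of one trace under every class template."""
    n_poi = trace.shape[0]
    out = np.empty(len(means))
    for c, (mu, cov) in enumerate(zip(means, covs)):
        diff = trace - mu
        _, logdet = np.linalg.slogdet(cov)
        maha = diff @ np.linalg.solve(cov, diff)
        out[c] = -0.5 * (n_poi * np.log(2 * np.pi) + logdet + maha)
    return out

# Toy data (hypothetical): 4 classes, 3 points of interest, class-dependent mean.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=2000)
traces = labels[:, None] * 1.0 + rng.normal(0.0, 0.3, size=(2000, 3))

means, covs = build_templates(traces, labels, n_classes=4)
test_trace = np.array([2.0, 2.0, 2.0])
best_class = int(np.argmax(log_likelihoods(test_trace, means, covs)))
```

Working with log-densities rather than densities avoids numerical underflow when the scores of many traces are later combined.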

B. Deep Learning
Deep learning is a branch of machine learning that has been applied to image classification, speech recognition, and other fields [14]. Classical machine learning usually requires manual feature engineering, while CNNs learn features automatically from the raw data. Furthermore, the features extracted by convolutional layers are independent of their position in the data, and dense layers can identify the features related to the labeled traces. Therefore, convolutional neural networks should be robust to jitter effects from unstable clock domains or even desynchronization [9].
The common architecture of CNNs consists of two parts, namely feature extraction and classification. The main block of a CNN is a convolution layer (CONV) directly followed by an activation layer (ACT). The former locally extracts information from the input by means of filters, and the latter increases the complexity of the learned classification function through its non-linearity. After the activation, batch normalization (BN) is used to make the training of deep neural networks faster and more stable. After some (CONV • ACT • BN) blocks, a pooling layer (POOL) is usually added to reduce the number of neurons. This block is repeated in the neural network until an output of a reasonable size is achieved. Then, some fully connected (FC) layers are introduced to obtain a global result that depends on the entire input. The last layer of the CNN is the output layer, whose number of neurons equals the number of classes to be distinguished and whose activation function is softmax. To sum up, a common convolutional network can be characterized by the following formula:

$$\mathrm{SOFTMAX} \circ [\mathrm{FC}]^{n_2} \circ [\mathrm{POOL} \circ \mathrm{BN} \circ \mathrm{ACT} \circ \mathrm{CONV}]^{n_1}$$

where $n_1$ and $n_2$ are the numbers of convolution and fully connected layers.
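To make the layer ordering concrete, the following NumPy sketch runs a forward pass through one CONV, ACT, BN, POOL block followed by one FC layer and a softmax output. It is a minimal illustration, not a trainable implementation: the ReLU activation, average pooling, random weights, and the trace length of 100 samples are assumptions chosen for the example.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv1d(x, w, b):
    """'Valid' 1-D convolution. x: (batch, length, c_in); w: (kernel, c_in, c_out)."""
    windows = sliding_window_view(x, w.shape[0], axis=1)  # (batch, L-k+1, c_in, kernel)
    return np.einsum("blck,kco->blo", windows, w) + b

def relu(x):
    return np.maximum(x, 0.0)

def batch_norm(x, eps=1e-5):
    """Per-channel normalisation over batch and time axes; learned scale/shift omitted."""
    mean = x.mean(axis=(0, 1), keepdims=True)
    var = x.var(axis=(0, 1), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def avg_pool(x, size):
    """Non-overlapping average pooling along the time axis."""
    b, length, c = x.shape
    l_out = length // size
    return x[:, :l_out * size].reshape(b, l_out, size, c).mean(axis=2)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cnn_forward(x, params):
    h = x
    for w, b in params["conv"]:            # n1 blocks: CONV -> ACT -> BN -> POOL
        h = avg_pool(batch_norm(relu(conv1d(h, w, b))), 2)
    h = h.reshape(h.shape[0], -1)          # flatten before the dense part
    for w, b in params["fc"]:              # n2 hidden fully connected layers
        h = relu(h @ w + b)
    w, b = params["out"]
    return softmax(h @ w + b)              # one probability per class

# Hypothetical toy setup: 8 traces of 100 samples, random (untrained) weights.
rng = np.random.default_rng(0)
params = {
    "conv": [(rng.normal(0, 0.1, (3, 1, 4)), np.zeros(4))],      # 4 filters, kernel 3
    "fc": [(rng.normal(0, 0.1, (49 * 4, 10)), np.zeros(10))],    # (100-3+1)//2 = 49 steps
    "out": (rng.normal(0, 0.1, (10, 256)), np.zeros(256)),
}
traces = rng.normal(size=(8, 100, 1))
probs = cnn_forward(traces, params)        # shape (8, 256), each row sums to 1
```

In practice a deep learning framework would supply these layers together with training; the sketch only shows how data flows through the blocks.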

A. Attack procedure
The application of deep learning requires carefully analyzing the problem and configuring the neural network. A network for performing SCA attacks on cryptographic devices requires at least one part that detects and learns the features of the traces and one part that performs the classification. Of the deep learning architectures, the convolutional neural network (CNN) satisfies both purposes effectively. In a CNN, the convolution layers are responsible for detecting the features of the traces, and the hidden neurons in the MLP part are responsible for classifying them. Therefore, the deep learning architecture proposed for profiled attacks is the CNN, and the general procedure of the attack is shown in Figure 1.
The profiled attack using a CNN in Figure 1 proceeds through two phases: a profiling phase and an attack phase. In the profiling phase, traces are collected while the cryptographic algorithm runs on the profiling device to form a trace set. This trace set is labeled according to the intermediate value of the algorithm that needs to be profiled, $z_1, \ldots, z_{|Z|}$. Usually, these intermediate values are taken at the output of the S-box. The labeled set of traces is used to train a CNN, yielding a model that describes how the power consumption of the device depends on the intermediate value $z_j$. Specifically, in the profiling phase, the attacker does the following:
B1: First, the attacker procures a device similar or identical to the target device.
In the attack phase, the attacker tries to reveal the secret key of the device under attack. A set of $N_a$ unlabeled traces collected from the target is classified by the trained CNN model to determine the probability of each trace for the classes $z_1, \ldots, z_{|Z|}$. These class probabilities are then associated with key byte hypotheses in order to compute the likelihood (equation (2)) of each key byte candidate, and the key $\hat{k}$ with the highest score is the most likely prediction:

$$\hat{k} = \arg\max_{k} \sum_{i=1}^{N_a} \log P(z_i \mid t_i) \qquad (2)$$

where $z_i = f(p_i, k)$, $i = 1, \ldots, N_a$, and $p_i$ is the plaintext of trace $t_i$. Specifically, during the attack phase, the attacker does the following:
A1: The attacker finds a target device that is the same as or similar to the profiled device.
A2: He identifies the attack point; the same attack point as in the characterization phase must be used, such as the S-Box operation.
A3: He creates an attack set by recording multiple traces of the identified operation.
A4: He applies the measured traces to the trained CNN, which predicts the probability of each class. The result of this step is given in Equation (3), where $\hat{y}(t_i)_j$ represents the output of the CNN for trace $t_i$ and class $j$, $N_a$ is the number of recorded traces, and the number of classes in the CNN is 256 (from 0 to 255).

$$\hat{Y} = \begin{pmatrix} \hat{y}(t_1)_0 & \hat{y}(t_1)_1 & \cdots & \hat{y}(t_1)_{255} \\ \vdots & \vdots & \ddots & \vdots \\ \hat{y}(t_{N_a})_0 & \hat{y}(t_{N_a})_1 & \cdots & \hat{y}(t_{N_a})_{255} \end{pmatrix} \qquad (3)$$

A5: Finally, he recovers the key value, one sub-key at a time, using the log-likelihood function. We explain this recovery in three steps:
- Suppose the attacker wants to recover the $s$-th sub-key $k$ while attacking an encryption algorithm. He first computes the S-box value of the XOR of every plaintext byte $p_i^{(s)}$ with every possible key value $k_j$. Assume that all the results are stored in a matrix $P$ of size $N_a \times 256$. The element $P_{i,j}$, where $i$ and $j$ are the row and column indices, holds the value $\mathrm{Sbox}(p_i^{(s)} \oplus k_j)$, i.e., the S-box value of the XOR of the $s$-th plaintext byte of trace $i$ and the sub-key value $k_j = j$:

$$P_{i,j} = \mathrm{Sbox}\left(p_i^{(s)} \oplus k_j\right), \quad i = 1, \ldots, N_a, \; j = 0, \ldots, 255 \qquad (4)$$

The matrix $S$ holds, for each trace and each key hypothesis, the CNN output of the corresponding class:

$$S_{i,j} = \hat{y}(t_i)_{P_{i,j}} \qquad (5)$$

- Finally, he sums the logarithms of each column of matrix $S$ in order to identify the most probable key value; the index of the column with the largest sum is the value of the sub-key.
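The sub-key recovery in step A5 maps directly onto array operations. The sketch below is a hedged illustration: `recover_subkey` is a hypothetical helper name, the CNN outputs are simulated rather than produced by a trained model, and `toy_sbox` is a random permutation standing in for the AES S-box (a real attack would pass the actual S-box table).

```python
import numpy as np

def recover_subkey(probs, plaintext_bytes, sbox):
    """probs: (N_a, 256) CNN outputs; plaintext_bytes: (N_a,) ints; sbox: (256,) int table."""
    keys = np.arange(256)
    # P[i, j] = Sbox(p_i XOR k_j) for every trace i and key hypothesis j.
    P = sbox[plaintext_bytes[:, None] ^ keys[None, :]]
    # S[i, j] = probability the CNN assigns to class P[i, j] for trace i.
    S = np.take_along_axis(probs, P, axis=1)
    # Column-wise sum of logarithms; the tiny offset guards against log(0).
    scores = np.log(S + 1e-40).sum(axis=0)
    return int(np.argmax(scores)), scores

# Simulated example (all values hypothetical).
rng = np.random.default_rng(0)
toy_sbox = rng.permutation(256)            # placeholder permutation, NOT the AES S-box
true_key = 77
plaintexts = rng.integers(0, 256, size=50)

# Noisy "CNN outputs" with extra mass on the class of the true intermediate value.
probs = rng.random((50, 256))
probs[np.arange(50), toy_sbox[plaintexts ^ true_key]] += 1.0
probs /= probs.sum(axis=1, keepdims=True)

best_key, scores = recover_subkey(probs, plaintexts, toy_sbox)
```

Because the S-box is a permutation, each wrong key hypothesis selects a class other than the true one for every trace, so its column of $S$ collects only noise while the correct column accumulates the larger probabilities.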

B. CNN architectures selection for the attacks
The basic architecture of a CNN consists of convolutional layers, used to detect features of power consumption traces, and hidden neuron layers, used to classify the traces. The ability of a CNN to classify power consumption traces is greatly influenced by its main parameters, such as the number of convolutional layers, the kernel size of each convolutional layer, the number of hidden layers, and their number of neurons. For each cryptographic device, these parameters should be selected appropriately so that the CNN reaches the maximum classification accuracy.
For unprotected devices, according to [7], the more convolutional layers a CNN has, the less reliable its feature detection becomes, because information in the trace is lost each time it passes through a pooling layer; and the smaller the kernel size, the better the network focuses on detecting the features of a trace. Therefore, the CNN architecture recommended for unprotected devices consists of one convolutional layer with 4 filters of kernel size 3, one pooling layer with pooling size and stride 2, one hidden layer of 10 neurons, and an output layer of 256 neurons with a softmax activation function.
For devices protected by random delay insertion [10]: The protection uses the random delay countermeasure described by Coron and Kizhvatov [10]. Adding random delays to the normal operation of a cryptographic algorithm misaligns the important features, which in turn makes the attack more difficult to conduct. According to [7], the CNN architecture used to attack this kind of device consists of 3 convolutional layers: the first layer, with a small kernel size, detects the features of the power traces; the second layer tries to detect the desynchronization caused by the delays in the power traces; and the third layer reduces the dimensionality of each trace in order to focus the network on the relevant points and remove the irrelevant ones. The details of the CNN architecture are as follows: first convolution layer with 4 filters of size 3; second convolution layer with 8 filters of size 50; third convolution layer with 8 filters of size 3; followed by two hidden layers with 20 neurons, and finally the output layer with 256 neurons using the softmax activation function.
For masking-protected devices [6]: Attacks against masking-protected devices are known as higher-order side-channel attacks, in which an attacker needs to combine independent features produced by the operations on the mask values and the masked values. To conduct a successful CNN-based profiled attack, the network must be able to detect the features of the power traces and the combinations between them. According to [11], the CNN architecture used to attack this kind of device consists of 2 convolutional layers: the first layer, with a small kernel size, detects the features of the power traces, and the second layer tries to generate the combinations between features. The details of the CNN architecture are as follows: first convolution layer with 4 filters of size 3; second convolution layer with 8 filters of size 51; followed by two hidden layers with 10 neurons, and finally the output layer with 256 neurons using the softmax activation function.
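For reference, the three configurations described in this section can be written down as plain data. The sketch below is only a summary aid: the `CnnConfig` class and the dictionary keys are hypothetical names, each hidden layer is assumed to have the stated number of neurons, and pooling and activation choices are left out because the text specifies them only for the unprotected case. The helper counts the trainable parameters of the convolutional part, assuming single-channel input traces.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CnnConfig:
    conv: tuple          # ((filters, kernel_size), ...) in layer order
    hidden: tuple        # neurons per hidden (fully connected) layer
    n_classes: int = 256 # softmax output neurons

ARCHITECTURES = {
    "unprotected": CnnConfig(conv=((4, 3),), hidden=(10,)),
    "random_delay": CnnConfig(conv=((4, 3), (8, 50), (8, 3)), hidden=(20, 20)),
    "masking": CnnConfig(conv=((4, 3), (8, 51)), hidden=(10, 10)),
}

def conv_parameter_count(cfg, in_channels=1):
    """Weights plus biases of all convolution layers (input assumed single-channel)."""
    total = 0
    for filters, kernel in cfg.conv:
        total += filters * (kernel * in_channels + 1)
        in_channels = filters  # the next layer sees one channel per filter
    return total

conv_params = {name: conv_parameter_count(cfg) for name, cfg in ARCHITECTURES.items()}
```

The large second-layer kernels (50 and 51) dominate the convolutional parameter count of the two protected-device architectures, which reflects their role of spanning delays or combining distant features.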

IV. EXPERIMENTS
In this section, we present the experimental results of profiled attacks based on the CNN architectures and of TA attacks on different devices. The parameters used to evaluate effectiveness are as follows:
- The ability to reveal the correct key: To confirm that our profiled attacks can reveal the correct key used by AES-128, we compute the estimated probability of every hypothetical key and check that the correct key has the highest one.
- The guessing entropy (GE) [12]: This metric is widely used to evaluate the effect of attacks in multi-trace experimental scenarios. When maximum likelihood estimation is used to recover the secret key, the probabilities output for each side-channel trace are accumulated, and the resulting scores of the key candidates are ranked in descending order. The guessing entropy is then defined as the rank of the real key within the sorted scores. We are interested in the number of traces required to achieve a guessing entropy of zero, that is, the number of traces required to recover the key. We estimate the guessing entropy over 10 independent attacks.
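The guessing entropy computation can be sketched as follows. This is one plausible reading of the metric, stated as an assumption: each independent attack uses a random ordering of the attack traces, the rank of the true key is recorded after every additional trace, and the ranks are averaged across attacks. The function names and the synthetic scores are hypothetical.

```python
import numpy as np

def key_rank_curve(log_probs_per_trace, true_key):
    """Rank of the true key (0 = ranked first) after 1, 2, ..., N traces.

    log_probs_per_trace: (N, n_keys) per-trace log-likelihood of each key candidate.
    """
    cumulative = np.cumsum(log_probs_per_trace, axis=0)
    # Rank = number of candidates with a strictly higher accumulated score.
    return (cumulative > cumulative[:, [true_key]]).sum(axis=1)

def guessing_entropy(log_probs_per_trace, true_key, n_attacks=10, seed=0):
    """Average key rank over n_attacks random orderings of the attack traces."""
    rng = np.random.default_rng(seed)
    n_total = log_probs_per_trace.shape[0]
    ranks = np.zeros((n_attacks, n_total))
    for a in range(n_attacks):
        order = rng.permutation(n_total)
        ranks[a] = key_rank_curve(log_probs_per_trace[order], true_key)
    return ranks.mean(axis=0)

# Synthetic per-trace scores (hypothetical): the true key gets a small advantage.
rng = np.random.default_rng(1)
scores = rng.normal(0.0, 1.0, size=(200, 256))
scores[:, 42] += 1.0
ge = guessing_entropy(scores, true_key=42)

# Number of traces after which the averaged rank stays at zero.
nonzero = np.nonzero(ge)[0]
traces_needed = int(nonzero[-1]) + 2 if nonzero.size else 1
```

Reporting the first point after which the curve stays at zero, rather than the first time it touches zero, avoids counting a lucky early hit as a successful attack.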

A. Results with an unprotected device
To conduct the attack on this type of device, we use the DPA contest v4 trace data set. The set consists of 100000 traces, each consisting of 4000 features, of a masked AES implementation. However, the traces leak first-order information, so the dataset is used as an unprotected dataset after unmasking the S-box output. The targeted sensitive variable is the output of the S-box, $\mathrm{Sbox}(P \oplus k^{*}) \oplus M$, where $P$ is the plaintext byte, $k^{*}$ is the secret key byte, and $M$ is the known mask. This dataset is publicly available at: http://www.dpacontest.org/v4.
In the attack phase, the estimated probability of each hypothetical key is determined by maximum likelihood estimation. The key with the highest probability is taken as the correct key. Figure 2 shows that the revealed correct key (130) has the largest probability value. The GE values obtained by the CNN-based attack and by TA are shown in Figure 3. The CNN-based attack is more effective in terms of the number of traces required for GE to reach 0: it requires only 2 traces, while TA requires more than 7 traces. This result demonstrates that the CNN can profile the characteristics of the power traces more precisely than the template attack.

B. Results with protected devices
For devices protected by the random delay insertion countermeasure, the AES-RD [10] trace data set is used. AES-RD is obtained from an 8-bit AVR microcontroller on which random delay desynchronization is implemented. For masking-protected devices, the ASCAD trace data set presented in [13] is used. This data set is organized like the MNIST dataset and has 50000 profiling traces and 10000 attack traces. The traces are recorded from an 8-bit AVR microcontroller running a masked implementation of AES-128.
The attack results in Figure 4 and Figure 6 show that the profiled attack using a CNN is able to recover the correct key of the protected devices. The correct keys found are 43 and 224, which have the highest decision scores among all hypothetical keys. Figure 5 and Figure 7 compare the attack efficiency: the template attack needs more than 500 traces, while the profiled attack using a CNN needs 5 traces on AES-RD and about 190 traces on ASCAD to rank the correct key first. The efficiency of profiled attacks using a CNN is much better because the CNN can automatically learn the hidden features in the power consumption traces and thereby classify the traces with high accuracy. For the template attack, the trace features must be selected manually before the attack, which lowers the attack efficiency.

V. CONCLUSION
This article presents in detail the technical aspects of conducting profiled attacks using the CNN deep learning technique. By using a CNN, the attack can succeed on different cryptographic devices with better efficiency than the template attack. However, when performing attacks on different devices, the CNN architecture needs to be configured in accordance with the characteristics of the traces of each device.

B2: He selects an intermediate attack point of the target cryptographic algorithm; for AES, for example, this is the S-box output.
B3: He records several traces of the targeted operation and labels them according to $z_i = \mathrm{Sbox}(p_i \oplus k)$, where $p_i$ is the plaintext of the $i$-th trace and $k$ is the secret key of the profiling device.
B4: He selects a CNN and trains it on the training trace set obtained in step B3. During training, the dataset is divided into two unequally sized groups: approximately 10% of the data set is randomly selected and used for validation, while the other 90% is used as the training set. Once the accuracy of the neural network is high enough, the training ends. As a result, the attacker has a trained CNN model describing the power consumption characteristics of the device at the attack point determined in step B2.
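Steps B3 and B4 (labeling and the 90%/10% split) can be sketched as below. The AES S-box is generated from its algebraic definition (inversion in GF(2^8) followed by the affine map) to keep the block self-contained. The function names, the random stand-in traces, and the key byte `0x2B` are assumptions made for the example; CNN training itself is not shown.

```python
import numpy as np

def aes_sbox():
    """Build the AES S-box: inversion in GF(2^8) followed by the affine map."""
    exp, log = [0] * 256, [0] * 256
    x = 1
    for i in range(255):                   # powers of the generator 3
        exp[i], log[x] = x, i
        x ^= (x << 1) ^ (0x11B if x & 0x80 else 0)   # x * 3 = x XOR (x * 2)
    rotl = lambda v, n: ((v << n) | (v >> (8 - n))) & 0xFF
    table = []
    for a in range(256):
        inv = 0 if a == 0 else exp[(255 - log[a]) % 255]
        table.append(inv ^ rotl(inv, 1) ^ rotl(inv, 2) ^ rotl(inv, 3) ^ rotl(inv, 4) ^ 0x63)
    return np.array(table, dtype=np.int64)

def label_and_split(traces, plaintext_bytes, key_byte, val_fraction=0.1, seed=0):
    """Label traces with Sbox(p XOR k) and split them randomly into train/validation."""
    labels = aes_sbox()[plaintext_bytes ^ key_byte]
    order = np.random.default_rng(seed).permutation(len(traces))
    n_val = int(round(val_fraction * len(traces)))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (traces[train_idx], labels[train_idx]), (traces[val_idx], labels[val_idx])

# Hypothetical profiling set: 1000 random "traces" of 50 samples, known key byte.
rng = np.random.default_rng(0)
traces = rng.normal(size=(1000, 50))
plaintexts = rng.integers(0, 256, size=1000)
(train_x, train_y), (val_x, val_y) = label_and_split(traces, plaintexts, key_byte=0x2B)
```

Shuffling before the split matters when traces were acquired in a structured order, since the validation set should be drawn at random from the profiling data.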

- He replaces each value $P_{i,j}$ in the matrix $P$ by the CNN output $\hat{y}(t_i)_{P_{i,j}}$ taken from the matrix of equation (3) to obtain the matrix $S$ (equation (5)), in which each element is the probability that trace $t_i$ was encrypted with the key $k_j = j$.

Figure 2. Estimation probability of all hypothetical keys for unprotected devices.

Figure 3. Guessing entropy results for unprotected devices.

Figure 4. Estimation probability of all hypothetical keys for delay insertion-protected devices.

Figure 5. Guessing entropy results for random delay insertion-protected devices.

Figure 6. Estimation probability of all hypothetical keys for masking-protected devices.

Figure 7. Guessing entropy results for masking-protected devices.
