Deep Learning Techniques to Detect Botnet
Doan Trung Son, Nguyen Thi Khanh Tram, Pham Minh Hieu

In recent years, the world has witnessed an unprecedented explosion of Deep Learning. Alongside the development of Information Technology, security and safety threats are also increasing, one of which is the Botnet. Botnets are increasingly complex and difficult to detect, and traditional techniques are no longer effective, so finding an effective solution for detecting Botnets is one of today's urgent problems [2]. Based on characteristics of deep learning such as scalability and interpretability, in this paper the authors propose using deep learning techniques to detect Botnets.


I. INTRODUCTION
"Botnet" is shorthand for "bots network"; the word combines "robot" and "network". A Botnet is a network of infected computers (bots, or zombies) dominated by another computer: a group of compromised Internet devices that are controlled remotely by cybercriminals, who use it to launch coordinated attacks and perform other malicious activities. The larger the Botnet, the higher the danger.
Here, a cybercriminal plays the role of a "botmaster", using a Trojan to compromise the security of several computers and connect them to the network for malicious purposes. Each computer on the network acts as a "bot" and is controlled by the attacker to spread malware, spam, or malicious content in order to launch the attack. The number, scale, and level of danger of Botnets are growing, and in particular their ability to hide is increasingly sophisticated and complex. In Vietnam, according to the Vietnam Computer Emergency Response Center (VNCERT), in the first two quarters of 2019 nearly 100,000 Vietnamese network (IP) addresses were querying or connecting to Botnet computer networks every day, and there were up to 6,219 cyberattack incidents on Vietnamese websites.
BotHunter is one of the earliest behavior-based Botnet detection systems. It uses the SNORT software to generate alerts about the behavior of each individual machine. However, because it works by observing packet payloads, it is not very effective at detecting Botnets that use encrypted connections. BotMiner appeared next; it works by grouping the behavior of different machines in the same Botnet.
In the same early period, there was some research on detecting malicious Botnet network flows with machine learning, but these flows came only from internal machines in the LAN. After that, machine learning became a popular technique in Botnet detection. One two-stage system consists of a feature extraction phase and a machine learning phase. It detects Botnets based on IRC (Internet Relay Chat, a protocol designed for real-time chat communication on a client-server architecture) by using a Bayesian classification method to detect network traffic from the Botnet's C&C (Command-and-Control) machine. This system achieves a 90% detection rate with a 15.4% false detection rate, which is still high [4].
Several other Botnet detection systems work on DNS queries and have achieved a malicious DNS query detection rate of up to 92.5% [3]. However, detection based on DNS queries works only against Botnets that use the DNS system to locate their C&C servers (most of these are centralized Botnets).
One system for detecting P2P Botnet network flows assumes that network traffic generated by users fluctuates widely, unlike the network traffic of a P2P Botnet. This system achieves detection rates of up to 98%, but its false detection rate is high (30%) [4].
Botnets spread very quickly, and Botnet types keep evolving, which makes them increasingly dangerous and difficult to detect. Because they spread so easily, any industry, profession, or field is at risk of being compromised. Therefore, when a Botnet signal appears, it should be possible to investigate and determine whether a device has been infected, how the infection took place, and what activities the attacker has performed on the device over a period of time. Today, the network environment is becoming more and more complex, so security requirements are harder to meet than ever.
Because a Botnet is controlled by the botmaster through the C&C channel, it is easy to hide and is an effective tool for cybercriminals to perform various types of malicious behavior, including spamming, phishing, click fraud, distributed denial of service (DDoS) attacks, and malicious program distribution. Botnets are frequently used in DDoS attacks: an attacker can control a large number of hijacked computers from a remote station, exploit their bandwidth, and send connection requests to the target machine. Many networks have suffered terrible consequences from attacks of this type [5].

II. BOTNET DETECTION AND DEEP LEARNING TECHNIQUES

A. BOTNET DETECTION TECHNIQUES
Haddadi and Zincir-Heywood [8] selected features to detect anomalies. Wang and Paschalidis [9] implemented a two-stage early Botnet detection method. Stevanovic and Pedersen [8] introduced a data-flow-based Botnet detection method in which the random forest algorithm achieves 94% accuracy. Kirubavathi and Anitha [8] introduced a Botnet detection method that models the behavior of network traffic using data flow characteristics and supervised machine learning (ML) methods. Nogueira et al. [7] proposed a Botnet detection method based on characteristic network traffic patterns. Guntuku et al. [8] integrated a Bayesian regularized model to preprocess the features and select the most representative feature set. Although the size of the benign network traffic dataset and its categories are not given, the proposed method detects Botnets with 99.2% accuracy. The evaluation results of the three methods above show that they are superior to other related methods and can detect Botnets with an accuracy of up to 99%.
In addition, the model training time is acceptable when applied to a fairly large Botnet dataset, the detection results are very accurate, and the methods have great potential for practical application. Moreover, the Botnet detection model built on an Artificial Neural Network (ANN) with a Softmax output activation function is the most widely applied and most effective model.

B. DEEP LEARNING TECHNIQUE TO DETECT BOTNET
In this paper, the authors build a classification model based on deep learning. The model uses the Softmax function as its output activation and is trained and tested on separate train and test sets, for which the 10 best features have been extracted beforehand. The model is built on Keras with TensorFlow support and run in Google Colaboratory (Colab).

Introduction of dataset:
The dataset used by the authors is Bot-IoT, built by Nickolaos Koroniotis, Nour Moustafa, Elena Sitnikova, and Benjamin Turnbull [6] of the University of New South Wales Canberra, Australia. The dataset combines normal simulated IoT network flows with diverse attack methods. It is huge, with 72,000,000 records: 16.7 GB of CSV files and 69.3 GB of pcap files.
The Bot-IoT dataset has been built and analyzed thoroughly and completely, so machine learning and deep learning methods can be applied to it easily. The dataset shows its advantages when compared with other datasets on seven criteria (T = true, F = false):

Dataset       1  2  3  4  5  6  7
Darpa98       T  F  T  F  T  T  F
KDD99         T  F  T  F  T  T  T
DEFCON-8      F  F  F  F  T  T  F
UNIBS         T  T  T  F  F  T  F
CAIDA         T  T  F  F  F  F  F
LBNL          F  T  F  F  T  F  F
UNSW-NB15     T  T  T  F  T  T  T
ISCX          T  T  T  F  T  T  T
CICIDS2017    T  T  T  F  T  T  T
TUIDS         T  T  T  F  T  T  T
Bot-IoT       T  T  T  T  T  T  T

The dataset is built on three components: the networking platform, the simulation of IoT devices, and feature extraction with investigation and analysis. First, the networking platform consists of normal and attack virtual machines (VMs). Second, the IoT devices are simulated with the Node-RED tool. Finally, for feature extraction and analysis, the Argus tool is used to extract features from the data so that machine learning techniques can then be applied.
Data preprocessing: First, the dataset is imported: we initialize the path to the dataset and then read it with Python's function for reading CSV-formatted data.
Next, we obtain the data features. The columns used to identify a data flow (saddr, daddr, proto, sport, dport) are removed from the features. We also remove the label columns (attack, category, and subcategory) from the features; these are the labels of the data flows, so we keep them separately.
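The import and column-removal steps can be sketched as follows. This is a minimal illustration, assuming pandas is the CSV reader; the two-row sample and the feature columns mean and rate are hypothetical stand-ins for a real Bot-IoT file, and split_features_and_labels is a helper name introduced only for this example.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for a Bot-IoT CSV file; "mean" and "rate"
# are example feature columns, not the full feature set.
SAMPLE_CSV = """saddr,daddr,proto,sport,dport,mean,rate,attack,category,subcategory
192.168.100.3,192.168.100.5,tcp,80,1234,0.5,10.0,1,DDoS,TCP
192.168.100.7,192.168.100.5,udp,53,4321,0.1,2.5,0,Normal,Normal
"""

# Columns that only identify a flow, and columns that hold its labels.
ID_COLUMNS = ["saddr", "daddr", "proto", "sport", "dport"]
LABEL_COLUMNS = ["attack", "category", "subcategory"]


def split_features_and_labels(df):
    """Return (features, labels): labels are kept separately, not discarded."""
    labels = df[LABEL_COLUMNS].copy()
    features = df.drop(columns=ID_COLUMNS + LABEL_COLUMNS)
    return features, labels


# With a real file this would be pd.read_csv(path_to_csv).
df = pd.read_csv(io.StringIO(SAMPLE_CSV))
features, labels = split_features_and_labels(df)
print(list(features.columns))
```

Keeping the labels in their own frame means the category column is still available for the label conversion step.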
The label used for classification is category. We need to convert the labels in the category column to numeric values through a mapping similar to a dictionary of 5 values, {0: A, 1: B, 2: C, 3: D, 4: E}, where label A is replaced with 0, label B with 1, and so on. Then we convert the number for each label into a vector. For example, if the first record's label is 0, it becomes the 5-dimensional vector [1,0,0,0,0], and if the second record's label is 4, it becomes [0,0,0,0,1]. This conversion is needed because the neural network model requires its label input to be a vector.
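The label conversion described here can be sketched with NumPy. The placeholder names A to E follow the text; the helper encode_labels and the choice to number the labels in sorted order are assumptions made for this example.

```python
import numpy as np


def encode_labels(categories):
    """Map each category name to an integer, then to a one-hot vector."""
    classes = sorted(set(categories))  # assumption: labels numbered in sorted order
    label_to_index = {name: i for i, name in enumerate(classes)}
    indices = np.array([label_to_index[c] for c in categories])
    # Row i of the identity matrix is the one-hot vector for label i.
    one_hot = np.eye(len(classes), dtype=np.float32)[indices]
    return label_to_index, indices, one_hot


categories = ["A", "E", "C", "B", "D"]
mapping, indices, one_hot = encode_labels(categories)
print(mapping)     # {'A': 0, 'B': 1, 'C': 2, 'D': 3, 'E': 4}
print(one_hot[0])  # label 0 -> [1. 0. 0. 0. 0.]
print(one_hot[1])  # label 4 -> [0. 0. 0. 0. 1.]
```

Keras also provides keras.utils.to_categorical for the second step; the NumPy version is shown only to make the mapping explicit.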
For the extracted dataset, we classify the columns into numeric and literal (string) columns. The variable num_keys contains the columns with numeric values and the variable cat_keys contains the columns with literal values; the iloc command is used to browse each row in the columns (keys), and the select_dtypes command is used to choose columns by data type. For columns with numeric values, we need to adjust the feature data to a common scale with a sufficiently small range of values, which helps classifiers work most accurately and effectively. The technique applied here is MinMax Scaling.
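A minimal sketch of the column split, assuming a pandas DataFrame. The names num_keys and cat_keys come from the text; the small frame and its column names are illustrative only.

```python
import pandas as pd

# Illustrative frame; the column names are examples, not the full feature set.
df = pd.DataFrame({
    "pkts": [10, 250, 3],
    "rate": [0.5, 120.0, 0.01],
    "state": ["CON", "INT", "CON"],
    "flgs": ["e", "e s", "e"],
})

# select_dtypes picks columns by data type.
num_keys = df.select_dtypes(include="number").columns.tolist()
cat_keys = df.select_dtypes(exclude="number").columns.tolist()

# iloc addresses rows by position, here the first row of the numeric columns.
first_row_numeric = df[num_keys].iloc[0]

print(num_keys)  # ['pkts', 'rate']
print(cat_keys)  # ['state', 'flgs']
```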
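MinMax Scaling maps each value x of a feature to x' = (x - x_min) / (x_max - x_min), where x_min and x_max are the smallest and largest values of that feature, which normalizes the features to values in the range [0, 1]. However, because the values in the dataset are too large, MinMax Scaling cannot be applied directly. Therefore, we first use the base-10 logarithm to reduce the magnitude of the data.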
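The two-step scaling of a numeric column can be sketched as below. Adding 1 before taking the logarithm is an assumption made here so that zero values stay finite; the text only states that a base-10 logarithm is applied. The helper name log_min_max_scale is also hypothetical.

```python
import numpy as np


def log_min_max_scale(values):
    """Compress values with log10, then rescale them to the range [0, 1]."""
    x = np.asarray(values, dtype=np.float64)
    logged = np.log10(x + 1.0)  # assumption: +1 keeps log10(0) finite
    lo, hi = logged.min(), logged.max()
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return np.zeros_like(logged)
    return (logged - lo) / (hi - lo)


scaled = log_min_max_scale([0, 9, 99, 999_999])
print(scaled)  # approximately [0, 1/6, 1/3, 1]
```

Without the logarithm, the single value 999,999 would push the other three values to almost exactly 0 after MinMax Scaling.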
Similarly, for columns with literal values, we also use MinMax Scaling to bring the data into the range [0, 1]. Before that, however, we need a lexicographic method to convert the literal values into numbers.
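One possible reading of the lexicographic conversion is to sort the distinct literal values and use each value's position as its code. The sketch below follows that assumption; the helper name lexicographic_min_max and the sample values are hypothetical.

```python
import numpy as np


def lexicographic_min_max(values):
    """Encode strings by their rank in sorted order, then scale to [0, 1]."""
    ordered = sorted(set(values))  # lexicographic order of the distinct values
    code = {v: i for i, v in enumerate(ordered)}
    codes = np.array([code[v] for v in values], dtype=np.float64)
    # The codes run from 0 to len(ordered) - 1, so MinMax Scaling reduces
    # to dividing by the largest code.
    span = len(ordered) - 1
    return codes / span if span > 0 else np.zeros_like(codes)


scaled = lexicographic_min_max(["tcp", "udp", "arp", "tcp"])
print(scaled)  # [0.5 1.  0.  0.5]
```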
Building the neural network model: The neural network model uses the Softmax function as its output activation. The model consists of 3 layers after the input: the input is a vector of 14 attribute values, followed by the first hidden layer with 128 nodes, then the second hidden layer with 64 nodes, and finally an output layer with 5 labels. To increase model efficiency and reduce overfitting during training, we use the dropout technique.
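A model with this shape could be written in Keras roughly as follows. The layer sizes and the Softmax output come from the text; the ReLU hidden activations, the dropout rate of 0.2, the Adam optimizer, and the categorical cross-entropy loss are assumptions, since the text does not specify them.

```python
import tensorflow as tf


def build_model(n_features=14, n_classes=5, dropout_rate=0.2):
    """14 inputs -> Dense(128) -> Dense(64) -> Softmax over 5 labels."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(n_features,)),
        tf.keras.layers.Dense(128, activation="relu"),  # assumed activation
        tf.keras.layers.Dropout(dropout_rate),           # assumed rate
        tf.keras.layers.Dense(64, activation="relu"),   # assumed activation
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(n_classes, activation="softmax"),
    ])
    # Assumed training configuration for one-hot labels.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_model()
model.summary()
```

The Softmax output gives one probability per label, which matches the 5-dimensional one-hot label vectors used for training.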
After initializing the neural network, we train the model. Here, epoch is the number of passes over the training data and batch_size is the number of samples processed in each training step; these two values are calculated according to the given formula. During training, dropout is applied and the model is adjusted to achieve the lowest loss and highest accuracy, and a graph is used to display the training results based on the accuracy.
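The training call can be sketched as below on random stand-in data. The values EPOCHS = 3 and BATCH_SIZE = 32, the 20% validation split, the ReLU activations, and the dropout rate are placeholders chosen for this example, not the values used by the authors.

```python
import numpy as np
import tensorflow as tf

# Random stand-in data with the shapes described in the text:
# 14 input features and 5 one-hot encoded labels.
rng = np.random.default_rng(0)
x_train = rng.random((256, 14)).astype("float32")
y_train = tf.keras.utils.to_categorical(rng.integers(0, 5, size=256), num_classes=5)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(14,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

EPOCHS = 3       # placeholder: number of passes over the training data
BATCH_SIZE = 32  # placeholder: samples per gradient update

history = model.fit(x_train, y_train,
                    epochs=EPOCHS, batch_size=BATCH_SIZE,
                    validation_split=0.2, verbose=0)

# One accuracy value per epoch; these are the values a training graph would plot.
accuracy_per_epoch = history.history["accuracy"]
print(accuracy_per_epoch)
```

Dropout layers are active only during fit; Keras disables them automatically for evaluation and prediction.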

A. MODEL EVALUATION
To keep the assessment simple and effective, the authors evaluated the model on 4 common metrics: Precision, Recall, F1-Score, and Accuracy. Each score is the average of the values for the individual labels. The evaluation was done by running the algorithm 3 times and calculating the results.
To make the assessment more intuitive, the authors also calculated Precision, Recall, and F1-score for each label. From the per-label measures, we can observe that when the number of positive samples of a type is large, the prediction of negative and positive values for that type is more accurate. When the number of positive samples is small, as for the Normal type, the prediction is no longer accurate; but because the difference between the numbers of positive and negative samples is so large, the TNR and NPV measures still give high results. The Theft type behaves similarly. Therefore, the model's performance is most accurately evaluated with Precision, Recall, and F1-score.
Figure 7. Comparison of methods
Thus, when compared with other methods, the applied deep learning method gives better results, along with a shorter training time. From the table, we can also see that the 2 other neural network methods, RNN and LSTM, give high results compared to the SVM method. However, the Precision of the SVM is 1, which shows that SVM predicts positive cases very accurately and does not mistake negatives for positives, but it has the disadvantage that it easily misses many positive cases.
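The per-label measures discussed in this section can be computed from a confusion matrix, as in the sketch below. The toy labels are invented for illustration: class 0 stands for a frequent type and class 1 for a rare type with only two samples. The helper name per_class_metrics is hypothetical.

```python
import numpy as np


def per_class_metrics(y_true, y_pred, n_classes):
    """Precision, recall, F1, TNR and NPV per label, plus overall accuracy."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # rows: true label, columns: predicted label

    tp = np.diag(cm).astype(np.float64)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn

    def safe_div(a, b):
        # Return 0 where the denominator is 0 instead of dividing by zero.
        return np.divide(a, b, out=np.zeros_like(a, dtype=np.float64), where=b != 0)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
        "tnr": safe_div(tn, tn + fp),
        "npv": safe_div(tn, tn + fn),
        "accuracy": tp.sum() / cm.sum(),
        # Macro average: the mean of the per-label values.
        "macro_precision": precision.mean(),
    }


# Invented example: 98 samples of class 0 and only 2 samples of rare class 1.
y_true = [0] * 98 + [1] * 2
y_pred = [0] * 97 + [1] + [1, 0]
m = per_class_metrics(y_true, y_pred, n_classes=2)
print(m["precision"][1], m["recall"][1], m["f1"][1])  # weak on the rare class
print(m["tnr"][1], m["npv"][1])                       # still close to 1
```

In this toy case the rare class reaches only 0.5 for Precision, Recall, and F1-score, while its TNR and NPV are both about 0.99, which is why the latter two say little about a rare label.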
Therefore, it can be concluded that the SVM method is better than deep learning methods on small datasets or datasets with few positive cases. However, within the scope of Network Forensics data, the advantage of deep learning methods is demonstrated more clearly.

IV. CONCLUSION
This article has described the role of deep learning in Botnet detection and implemented Botnet detection based on deep learning. The article has outlined two Botnet detection methods based on deep learning, BoTShark-SA and BoTShark-CNN, which are built on two techniques, Autoencoder and CNN. The obtained results show the effectiveness and superiority of the deep learning model for Botnet detection. In the future, the authors wish to run the model directly on the raw dataset obtained after intercepting network data, thereby reducing data processing time and enhancing the applicability of the model. Building a deep learning application to detect Botnets can contribute to ensuring information security and protecting state secrets in the People's Public Security force in the coming time.