Machine learning approach detects DDoS attacks

Abstract: Denial of Service attacks have been around since the dawn of the internet age. As the Internet has grown explosively, these attacks have become increasingly powerful and now pose a serious threat in cyberspace. This article evaluates four machine learning algorithms, K-nearest neighbor (KNN), Decision Tree, Random Forest and Support Vector Machine (SVM), on several metrics for detecting DDoS attacks. The main objective of the paper is to analyze the algorithms, collect data and evaluate the effectiveness of the algorithms in DDoS attack detection.


I. INTRODUCTION
A Distributed Denial of Service (DDoS) attack is carried out by flooding a server with online traffic from multiple sources, which causes the server to run out of resources and bandwidth. DDoS first appeared in 1999.
Vietnam faces a high risk from DDoS attacks, ranking 6th globally after China, the US, France, Russia and Brazil, 2nd in the Asia-Pacific region and first in Southeast Asia [1].
DDoS involves sending requests from a botnet, a network of millions of computers with different IP addresses over which control has previously been established. Computers and other networked resources such as IoT devices together create "tsunamis" of traffic. A DDoS attack can be understood as a sudden traffic jam that blocks a highway, preventing normal traffic from reaching its destination. Because the attack is spread across many access points in different IP ranges, DDoS is much stronger than DoS, and DDoS attacks are often difficult to recognize or prevent.
Different types of DDoS attacks target different components of a network connection. Based on target and behavior, DDoS attacks can be classified into three types: traffic/fragmentation attacks, bandwidth/volume attacks, and application layer attacks.
In late 1999, CERT first published its report on the threat of DDoS attacks and outlined specific prevention actions to mitigate this threat [2]. A few months later, the Internet suffered its first large-scale DDoS attack [3], followed by successive attacks of increasingly large scale in the following years. Since then, researchers have analyzed a number of tools used to launch DDoS attacks [4,5,6], measured their impact on the Internet, and proposed a number of defense methods [7]. These research efforts have resulted in a number of effective and reliable anti-DDoS products offered as stand-alone devices or cloud-based services.
In recent years, along with the strong development of Artificial Intelligence (AI), machine learning (ML) and deep learning methods are being used more and more in detecting DDoS attacks. Sambadi and Gondi propose an approach that uses multiple linear regression to detect DDoS attacks [8].

Nguyen Thi Khanh Tram, Doan Trung Son, Nguyen Thi Thu Huong, Tran Thi Thu
P. Sangkatsanee et al. [9] built a real-time detection mechanism applying machine learning techniques. It proposes 12 essential network traffic characteristics that distinguish normal data from DDoS traffic.
Sofi et al. [10] built a new dataset consisting of 27 features and five different traffic classes. Four machine learning algorithms, namely Naive Bayes, SVM, decision tree and MLP, were applied to identify DDoS attacks. Of these, the MLP algorithm gave the best results.
Mahadev et al. [11] used the Naive Bayes classifier in the Weka tool to analyze network traffic flow and found that it provides 99% accuracy in detecting DDoS attacks.
S. Duque et al. [12] show that the K-means clustering algorithm becomes more efficient when the number of clusters is chosen correctly. They further note that when the number of clusters increases beyond the number of data types, the false-negative rate and the detection rate decrease, but the false-positive rate increases.

II. MACHINE LEARNING ALGORITHM
The four algorithms used for DDoS attack detection in this paper are KNN, Decision Tree, Random Forest and SVM. All of them are commonly used classical machine learning algorithms.

A. KNN
The K-nearest neighbor (KNN) algorithm is one of the simplest supervised learning algorithms in machine learning, and it is effective in some cases. During training, the algorithm does not learn anything from the training data; all calculations are performed when it needs to predict the outcome for new data. In a classification problem, KNN infers the label of a new data point directly from the K nearest data points in the training set, using a distance measure such as Euclidean distance, Manhattan distance or Minkowski distance.
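As a minimal sketch of this idea (not the authors' implementation), KNN prediction can be written as three steps: compute distances, find the nearest neighbors, and take a majority vote. The function names and the two toy features (packet rate, mean packet size) are assumptions made for illustration only.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def knn_predict(train_X, train_y, query, k=3, p=2):
    # Step 1: distance from the query to every training point.
    distances = [(minkowski(x, query, p), label)
                 for x, label in zip(train_X, train_y)]
    # Step 2: keep the k nearest neighbors.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: predict the label by majority vote among those neighbors.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (packet rate, mean packet size).
train_X = [(10, 500), (12, 480), (11, 520), (900, 60), (950, 64), (1000, 58)]
train_y = ["normal", "normal", "normal", "attack", "attack", "attack"]

print(knn_predict(train_X, train_y, (920, 70), k=3))
```

Because no model is fitted in advance, every prediction scans the whole training set, which is why KNN prediction becomes slow on large traffic datasets.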
Implementation steps:
Step 1. Calculate the distance.
Step 2. Find the nearest neighbors.
Step 3. Predict labels.

B. DECISION TREE
A Decision Tree is a supervised, nonparametric learning algorithm used for classification and regression. The method creates an accurate, stable, and easy-to-follow tree model and eliminates unnecessary attributes. Each inner node corresponds to a variable, and each arc leads to a child node corresponding to a possible value of that variable. The leaves correspond to the predicted target values.
Decision tree learning is also a very popular method in data mining. There, a decision tree describes a tree structure in which leaves represent classes and branches represent combinations of features that lead to those classes. A tree can be learned by dividing the source set into subsets based on the values of the test attributes. This process is repeated on each resulting subset. The recursion ends when a subset cannot be divided any further or when every element of the subset has the same label. Decision trees can be described by calculating conditional probabilities, and can be seen as a combination of learning techniques and computational algorithms that support the description, classification, and generalization of a given data set.
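The recursive partitioning described above can be sketched as follows. This is an illustrative toy version that assumes numeric features, threshold tests of the form "feature <= t", and Gini impurity as the split criterion; the paper does not state which criterion was used, and all names here are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(X, y):
    """Return (feature index, threshold) that lowers impurity most, or None."""
    best, best_score = None, gini(y)
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def build_tree(X, y):
    """Split recursively until a subset is pure or cannot be improved."""
    split = best_split(X, y)
    if split is None:
        return Counter(y).most_common(1)[0][0]  # leaf: majority class
    f, t = split
    left = [(r, lab) for r, lab in zip(X, y) if r[f] <= t]
    right = [(r, lab) for r, lab in zip(X, y) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [lab for _, lab in left]),
        "right": build_tree([r for r, _ in right], [lab for _, lab in right]),
    }


def predict(tree, row):
    """Follow the branches from the root down to a leaf label."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


# Hypothetical toy data: (packet rate, mean packet size).
X = [(10, 500), (12, 480), (900, 60), (950, 64)]
y = ["normal", "normal", "attack", "attack"]
tree = build_tree(X, y)
print(tree, predict(tree, (1000, 50)))
```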

C. RANDOM FOREST
Random Forest builds many decision trees using the Decision Tree algorithm, but each tree is different because a random element is introduced. The prediction results are then aggregated across the trees. Random Forest is a supervised learning algorithm that can solve both regression and classification problems. Random Forest works in 4 steps:
Step 1. Select random samples from the given data set.
Step 2. Build a decision tree for each sample and get a prediction result from each tree.
Step 3. Collect a vote for each prediction result.
Step 4. Select the prediction with the most votes as the final prediction.
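The four steps can be sketched as below. To keep the example short, each "tree" is a one-level decision stump on a randomly chosen feature; this is a deliberate simplification and an assumption of this sketch, not the full Decision Tree learner the paper relies on. All names and the toy data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def train_stump(X, y, feature):
    """Stand-in for a full tree: one threshold test on a single feature."""
    best = None
    for t in sorted({row[feature] for row in X}):
        left = [lab for row, lab in zip(X, y) if row[feature] <= t]
        right = [lab for row, lab in zip(X, y) if row[feature] > t]
        if not right:
            continue
        l_lab, r_lab = majority(left), majority(right)
        errors = (sum(lab != l_lab for lab in left)
                  + sum(lab != r_lab for lab in right))
        if best is None or errors < best[0]:
            best = (errors, t, l_lab, r_lab)
    if best is None:  # only one distinct value: constant prediction
        m = majority(y)
        return (feature, float("inf"), m, m)
    return (feature, best[1], best[2], best[3])


def stump_predict(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        # Step 1: draw a random (bootstrap) sample with replacement.
        idx = [rng.randrange(len(X)) for _ in X]
        sample_X = [X[i] for i in idx]
        sample_y = [y[i] for i in idx]
        # Step 2: fit one tree per sample, on a randomly chosen feature.
        feature = rng.randrange(len(X[0]))
        forest.append(train_stump(sample_X, sample_y, feature))
    return forest


def forest_predict(forest, row):
    # Step 3: collect one vote per tree.
    votes = Counter(stump_predict(stump, row) for stump in forest)
    # Step 4: the most common vote is the final prediction.
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (packet rate, mean packet size).
X = [(10, 500), (12, 480), (11, 520), (900, 60), (950, 64), (1000, 58)]
y = ["normal", "normal", "normal", "attack", "attack", "attack"]
forest = train_forest(X, y)
print(forest_predict(forest, (920, 50)), forest_predict(forest, (9, 510)))
```

An odd number of trees is used so that a two-class vote cannot end in a tie.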
In addition, Random Forest has the following notable characteristics:
• A collection of uncorrelated trees performing the same task together is better than any of the trees working individually;
• The trees are assumed to be independent of each other in error rate, or to have little correlation with each other;
• Feature selection must be good enough for each tree to classify better than random selection;
• The predictions and errors of the individual trees have little correlation with each other.

D. SVM
Support vector machine (SVM) is a supervised machine learning algorithm that is very commonly used today in classification and regression problems. The idea of SVM is to find a hyperplane that separates the data points. This hyperplane divides the space into different domains, and each domain contains one type of data.
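To make the separating hyperplane concrete, the sketch below compares two hand-picked candidate hyperplanes w·x + b = 0 by their margin, i.e. the distance from the hyperplane to the closest data point. It only illustrates the geometry: a real SVM solves an optimization problem to find the hyperplane with the largest margin. The data, the candidate weights and the function names are all assumptions.

```python
import math


def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, X, y):
    """Distance of the closest point to the hyperplane.

    Labels must be +1 or -1; the result is negative if any point
    lies on the wrong side of the hyperplane.
    """
    return min(label * signed_distance(w, b, x) for x, label in zip(X, y))


# Hypothetical, linearly separable toy data.
X = [(1.0, 1.0), (2.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
y = [-1, -1, 1, 1]

# Two hand-picked candidates (w, b); both separate the data.
candidates = {
    "A": ((1.0, 1.0), -5.5),
    "B": ((1.0, 0.0), -3.0),
}

margins = {name: margin(w, b, X, y) for name, (w, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins, best)
```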
The optimal hyperplane is the separating hyperplane with the largest margin. Machine learning theory has shown that such a hyperplane minimizes the bound on the error.

Table 1 lists the record counts for these types of attacks. Table 2 shows the processed features of the data set. The proposed data collection system follows these steps:
• Collect and control: all network traffic from NIDS is collected and examined;
• Preprocessing data format: remove redundant and duplicate records;
• Feature extraction: extract feature parameters from the collected network traffic and assign each feature to a data column; they will be used as a vector in the new dataset;
• Statistical measurements: in this step, the features are additionally calculated using statistical equations.
The authors use a data collection system inherited from the project "Building an application to collect transmission data for network investigation" by Nguyen The Hoang.
After the data set is collected, it is fed into the system to identify denial of service attacks. The steps are as follows:
• Receiving input dataset: the system receives user-provided network attack datasets;
• Machine learning model training: the system stores machine learning algorithms commonly used in network attack detection, then trains those algorithms with the input dataset;
• Changing model parameters: the system adjusts some parameters of each algorithm to increase its accuracy;
• Display training and model evaluation results: the system outputs the trained model's weights and the model's accuracy evaluation parameters;
• Make a conclusion whether the network behavior is a denial of service attack or not.

B. DATA PROCESSING
The above dataset is processed before it is used in the experiment. The input data must be processed consistently, so data cleaning is always the first step in designing a machine learning model. The symbolic features, such as PKT_TYPE, FLAGS, NODE_NAME_FROM, NODE_NAME_TO and PKT_CLASS, and unimportant features such as SRC_ADD and DES_ADD are removed.
Because the data set has a relatively high number of records belonging to normal behavior, 10000 records are taken for the two labels Normal and UDP Flood to balance the machine learning model. The input data set is divided into training and testing sets in a ratio of 7:3.
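The processing steps above (drop symbolic and unimportant columns, balance the two labels, split 7:3) might look like the sketch below. It assumes each record is a (features, label) pair with the label stored separately from the feature columns, and that the same number of records is drawn per label; the source does not spell out either point. PKT_SIZE and the function names are placeholders.

```python
import random
from collections import Counter

# Column names taken from the text above.
SYMBOLIC = ["PKT_TYPE", "FLAGS", "NODE_NAME_FROM", "NODE_NAME_TO", "PKT_CLASS"]
UNIMPORTANT = ["SRC_ADD", "DES_ADD"]
DROPPED = set(SYMBOLIC + UNIMPORTANT)


def prepare(records, labels, per_label, train_ratio=0.7, seed=0):
    """Clean, balance and split a list of (features dict, label) pairs."""
    rng = random.Random(seed)
    balanced = []
    for label in labels:
        # Keep only the requested labels, with an equal share for each.
        pool = [r for r in records if r[1] == label]
        balanced.extend(rng.sample(pool, min(per_label, len(pool))))
    # Remove symbolic and unimportant columns from every record.
    cleaned = [({k: v for k, v in feats.items() if k not in DROPPED}, lab)
               for feats, lab in balanced]
    # Shuffle, then cut at the train ratio (7:3 by default).
    rng.shuffle(cleaned)
    cut = round(len(cleaned) * train_ratio)
    return cleaned[:cut], cleaned[cut:]


# Hypothetical toy records; the real data set is far larger.
records = (
    [({"PKT_SIZE": 500 + i, "PKT_TYPE": "tcp", "SRC_ADD": "1.0"}, "Normal")
     for i in range(50)]
    + [({"PKT_SIZE": 60 + i, "PKT_TYPE": "cbr", "SRC_ADD": "2.0"}, "UDP Flood")
       for i in range(20)]
    + [({"PKT_SIZE": 300 + i, "PKT_TYPE": "tcp", "SRC_ADD": "3.0"}, "HTTP Flood")
       for i in range(5)]
)

train, test = prepare(records, labels=("Normal", "UDP Flood"), per_label=10)
print(len(train), len(test), Counter(lab for _, lab in train + test))
```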

C. HYPERPARAMETER SELECTION
Hyperparameter tuning is an important step in machine learning. Hyperparameters are user-defined parameters that control the training process of the model and play an important role in determining its performance. Tuning is usually done by traversing a predefined grid of parameters. The grid can consist of defined values, or of random values drawn from a given distribution or condition. In this paper, a parameter grid with defined values is used, as shown in the following table.
The indicators used to evaluate the results include:
Accuracy: the ratio of correctly predicted points to the total number of points in the test dataset.
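Grid traversal and the accuracy indicator can be illustrated together: every combination of values in the grid is tried, and the combination with the highest accuracy on held-out data is kept. The grid below (k and p for a small KNN classifier) and the one-feature toy data are assumptions for illustration; they are not the grid used in the paper.

```python
from collections import Counter
from itertools import product


def accuracy(y_true, y_pred):
    """Share of correctly predicted points."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def knn_predict(train, query, k, p):
    """Tiny KNN used as the model being tuned (Minkowski order p)."""
    ranked = sorted(
        train,
        key=lambda item: sum(abs(a - b) ** p for a, b in zip(item[0], query)),
    )
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]


def grid_search(grid, train, validation):
    """Try every combination in the grid and keep the most accurate one."""
    names = list(grid)
    best_params, best_score = None, -1.0
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        preds = [knn_predict(train, x, **params) for x, _ in validation]
        score = accuracy([label for _, label in validation], preds)
        if score > best_score:  # ties keep the earlier combination
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical one-feature data with one noisy "attack" point at 4.
train = [((1,), "normal"), ((2,), "normal"), ((3,), "normal"),
         ((4,), "attack"), ((8,), "attack"), ((9,), "attack"),
         ((10,), "attack")]
validation = [((4.2,), "normal"), ((1.5,), "normal"),
              ((9.5,), "attack"), ((8.5,), "attack")]

grid = {"k": [1, 3, 5], "p": [1, 2]}  # hypothetical parameter grid
best_params, best_score = grid_search(grid, train, validation)
print(best_params, best_score)
```

With k = 1 the noisy training point misleads the classifier on the validation point 4.2, so the search settles on a larger k.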
Precision, or Positive Predictive Value (PPV): the ratio of the number of attack points that the model correctly predicts to the total number of points that the model predicts as attacks. The higher the precision, the larger the share of points predicted as attacks that really are attacks. Precision = 1 means that every point the model predicts as an attack is indeed an attack, i.e. no point labeled as normal behavior is mistakenly predicted as an attack.
Recall: the ratio of the number of points that the model correctly predicts as attacks to the total number of points that are actually attacks (i.e. the total number of points originally labeled as attacks). The higher the recall, the fewer attacks are missed. Recall = 1 means that all points labeled as attack behavior are recognized by the model. Recall is also known as True Positive Rate (TPR), sensitivity or hit rate.
F1-score: the harmonic mean of Precision and Recall when these two quantities are non-zero. It is calculated by the formula:
$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
False Positive Rate (FPR), also known as the false alarm rate: the rate of false detections, where a behavior is normal but the model considers it attack behavior.
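The indicators defined above can all be computed from the four confusion-matrix counts. The sketch below uses the standard definitions with "attack" as the positive class; the function names and the ten example labels are made up for illustration and are not results from the paper.

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Count true/false positives and negatives for the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1  # normal behavior flagged as an attack
        else:
            if truth == positive:
                fn += 1  # attack that was missed
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive="attack"):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


# Made-up labels: 4 attacks and 6 normal flows.
y_true = ["attack"] * 4 + ["normal"] * 6
y_pred = ["attack", "attack", "attack", "normal",
          "attack", "normal", "normal", "normal", "normal", "normal"]

scores = evaluate(y_true, y_pred)
print(scores)
```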

IV. RESULTS AND DISCUSSION
The results of running the four algorithms are presented in the following table. According to the results in Table 4, the Decision Tree algorithm gives the lowest probability of correct detection (90.93%) as well as the highest false detection rate, while the Random Forest algorithm gives the highest probability of correct detection (95.08%). SVM has the longest running time and the lowest false detection rate. In general, the four algorithms, implemented with the scikit-learn library, provide relatively good results and are optimized for better performance.

VI. CONCLUSION
Based on the newly collected dataset, which contains four types of DDoS attacks (including HTTP Flood, SIDDOS and UDP Flood) and no redundant or duplicate records, the author conducted experiments with four machine learning algorithms for DDoS attack detection. As a result, all four algorithms are capable of detecting DDoS attacks with high accuracy, fast speed and efficiency.
Recently, with the continuous development of 5G, a large number of insecure Internet of Things (IoT) devices have been connected to the Internet, which presents great challenges for protection against DDoS attacks, especially as attackers try to "recruit" more devices into botnets (for example, the Mirai botnet) to increase the frequency, size and throughput of DDoS attacks worldwide. In the future, attackers will most likely take advantage of artificial intelligence and machine learning to alter attacks automatically so that they evolve toward more effective attack techniques. In that case, it will be necessary to improve DDoS attack detection algorithms toward real-time processing of the raw attack data.