A new approach to improving web application firewall performance based on the support vector machine method with analysis of HTTP requests

- The number of attacks on information systems is rapidly increasing, not only in quantity but also in sophistication. Each attack violates the confidentiality, integrity, or availability of information. Most attacks pursue financial gain, web attacks in particular, because almost all companies rely on web applications for their business. The issue of protecting personal data from these attacks has become critical for all organizations and companies, so the need for intrusion detection and intrusion prevention systems to protect these data is pressing. Traditional means of protecting access to the corporate network (firewalls) cannot protect against most threats directed at web resources. The reason is that attacks on such resources most often occur at the application level, in the form of HTTP/HTTPS requests to the site, where traditional firewalls have extremely limited capabilities for analyzing and detecting attacks. For protecting web resources from attacks at the application level there is a special class of tools: the web application firewall (WAF), whose task is to detect and block attacks on web resources at the application level. However, analysis of information security incidents shows that even this class of attack detection tools does not provide a 100% detection rate. With the aim of applying machine learning methods to improve WAF performance, the author discusses popular types of attacks on web applications and surveys machine learning methods for the attack detection task, in order to build an algorithm for automatic attack detection based on the support vector machine and analysis of HTTP requests.

In addition, users of web applications are at risk, as successful attacks can steal credentials, perform actions on websites on behalf of users, and infect workstations with malware. According to Positive Technologies' Web Application Firewall survey, the largest average number of attacks per day (approximately 3,500) was recorded during pilot projects in government agencies. Online stores rank second: about 2,200 attacks were registered per day, and almost all of them were carried out without automated scanning tools. The task of protecting the information technology systems of organizations is therefore urgent.

II. COMMON ATTACKS ON WEB APPLICATIONS
In the age of information technology, every web application has vulnerabilities. Knowing which types of vulnerabilities are the most dangerous and how to mitigate their risks gives the system administrator an edge in protecting the web applications of companies and organizations.
A. CODE INJECTION
Code injection occurs when an attacker sends invalid, untrusted data to an application as part of a command or query, with the malicious intent of forcing the application to perform unintended behavior in order to collect data or cause damage. Some of the most common types of injection are:
 SQL injection [2-4];
 OS command injection;
 LDAP injection [5].
The main cause of such injections is the lack of validation and sanitization of the data used by the application. Injection prevention guidelines depend on the technology the programmers are using. In general, the systems specialist must ensure that the team adheres to security requirements when constructing commands in the system.
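The standard validation-based defense is to separate code from data. A minimal sketch using parameterized queries with Python's standard sqlite3 module (table and column names are illustrative):

```python
import sqlite3

def find_user(conn, username):
    """Look up a user with a parameterized query: the driver treats
    `username` strictly as data, so input such as "' OR 1=1 --" cannot
    change the structure of the SQL statement."""
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (username,))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

print(find_user(conn, "alice"))        # the legitimate row
print(find_user(conn, "' OR 1=1 --"))  # empty: the injection payload is inert
```

Had the query been built by string concatenation, the second call would have returned every row in the table.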

B. BROKEN AUTHENTICATION AND SESSION MANAGEMENT
Invalid authentication and session control vulnerabilities allow attackers to use manual or automatic methods to gain control over any account on the system. Web applications are among the most vulnerable and common targets. Attackers have access to hundreds of millions of leaked username and password combinations, which lets them easily perform dictionary attacks and automated brute-force attacks, including with GPU-based cracking tools, to gain access to the system.

C. SENSITIVE DATA EXPOSURE
Leakage of confidential data is one of the most dangerous vulnerabilities for users of online resources: an attacker gains access to users' personal data without their permission.
Sensitive data such as passwords, credit card numbers, credentials, social security numbers, and medical records require additional protection, so it is important for any organization to understand the necessity of protecting user data.
According to OWASP, one of the most common and serious situations is a website that does not use TLS for all pages or supports weak encryption. Confidential data must be encrypted at all times, both in transit and at rest, with no exceptions: credit card information and user passwords should never be sent or stored unencrypted. Weak or outdated encryption and hashing algorithms do not provide adequate security. Web security standards recommend using AES (256 bits or more) and RSA (2048 bits or more).
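For stored passwords in particular, best practice is a salted, memory-hard key derivation function rather than a plain hash. A minimal sketch with Python's standard library (the cost parameters n, r, p are illustrative defaults, not a recommendation from this paper):

```python
import hashlib, hmac, os

def hash_password(password: str, salt: bytes) -> bytes:
    # scrypt is a memory-hard KDF; n, r, p control its CPU/memory cost.
    return hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)

def verify_password(password: str, salt: bytes, stored: bytes) -> bool:
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(hash_password(password, salt), stored)

salt = os.urandom(16)            # a fresh random salt per user
stored = hash_password("s3cret", salt)
print(verify_password("s3cret", salt, stored))   # True
print(verify_password("wrong", salt, stored))    # False
```

Even if the database leaks, the attacker must pay the full KDF cost per guess, per user.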

D. CROSS-SITE SCRIPTING.
Cross-site scripting (XSS) [6,7] is a user data validation flaw that allows an attacker to deliver malicious code to the server from the browser, after which the code is executed in other users' browsers.
The complexity of countering this attack lies in the fact that the algorithm for filtering incoming data must not create unreasonable restrictions for legitimate users, while at the same time making an XSS attack impossible. The OWASP XSS Prevention Cheat Sheet provides details on the required data-escaping techniques.
To protect a web application from this attack, the administrator needs to apply context-sensitive encoding when modifying the browser document on the client side to counter DOM-based XSS, and to escape untrusted HTTP request data according to its context in the HTML output (body, attribute, JavaScript, CSS, or URL) to eliminate stored XSS vulnerabilities.
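For the HTML body context, escaping means turning markup characters into entities. A minimal sketch with Python's standard library (the `render_comment` wrapper is illustrative):

```python
from html import escape

def render_comment(user_input: str) -> str:
    # Context-sensitive output encoding for an HTML body context:
    # <, >, &, and quotes become entities, so the browser renders the
    # payload as inert text instead of executing it.
    return "<p>" + escape(user_input, quote=True) + "</p>"

payload = '<script>alert("xss")</script>'
print(render_comment(payload))
# <p>&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;</p>
```

Other output contexts (attributes, JavaScript, CSS, URL) each require their own encoding rules, as the OWASP guidance notes.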

E. DENIAL OF SERVICE
Denial of service [8][9][10] is one of the most popular types of attacks on web applications.
DDoS attacks are getting stronger, more sophisticated, and more difficult to prevent. The types of DDoS attacks vary, but they all affect the performance of an organization's website. Therefore, organizations and online stores need to understand and anticipate possible risks.

F. CROSS-SITE REQUEST FORGERY
CSRF (cross-site request forgery) allows an attacker to perform actions on the server on behalf of the victim: the browser is misled by a third party into abusing the victim's privileges.
CSRF appears very rarely among CVEs (Common Vulnerabilities and Exposures), less than 0.1% in 2008, but in reality it is a "sleeping giant" and is becoming an important security issue.
While there are a huge number of attacks on web applications, there are also many known ways to detect them.
Let us now consider the main methods of detection, as well as the advantages and disadvantages of their application in the problem of classifying attacks, which is solved to detect them.

Journal of Science and Technology on Information security
Special Issue CS (15) 2022

III. METHODS OF DETECTING ATTACKS ON WEB APPLICATIONS
Among these detection methods, the most popular is the one based on machine learning.

A. SIGNATURE METHODS
Signature analysis is based on the assumption that the attack scenario is known and that an attempt to carry it out can be detected in event logs or by analyzing network traffic. Ideally, the system administrator should fix all known vulnerabilities.
For example, the free product Snort [16,17] is a typical intrusion detection system: in terms of capabilities, Snort is primarily concerned with detecting attacks using a large signature database, and its prevention functionality is limited compared to dedicated attack prevention systems. The intrusion detection system works with signatures, tracking network packets and comparing them with a database of signatures (attributes of known attacks), similar to how antivirus software works. The main problem with this approach is that it may not detect a new attack if no signature for it has yet been added to the database.
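The signature principle can be sketched as a simple matcher over request strings (the signatures below are toy substrings for illustration, not real Snort rules):

```python
# A toy signature matcher: each signature is a substring known to
# appear in a particular attack class (illustrative examples only).
SIGNATURES = {
    "union select": "SQL injection",
    "<script>":     "cross-site scripting",
    "../":          "path traversal",
}

def match_signature(request: str):
    """Return the attack class of the first matching signature, or None.
    Like any signature engine, it cannot flag an attack whose signature
    is not yet in the database."""
    low = request.lower()
    for sig, attack in SIGNATURES.items():
        if sig in low:
            return attack
    return None

print(match_signature("GET /item?id=1 UNION SELECT password FROM users"))
print(match_signature("GET /item?id=42"))  # None: unknown or benign
```

The second call shows the fundamental limitation discussed above: a request carrying a brand-new attack looks identical to a benign one until its signature is added.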

B. ANOMALY DETECTION METHODS
Anomaly-based detection is a way to detect unusual traffic behavior within a network. Based on the anomalies, the IDS monitors network traffic and compares it with an established baseline.
An attacker probing a system typically uses applications unusually often and tries various testing methods, so intrusion operations usually differ from the normal activities of users working on the system, and monitoring software can recognize suspicious activity beyond a certain threshold.
The main problem with this approach is the inability to build an accurate model of all attacks, although it can detect new attacks.
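The baseline-and-threshold idea can be sketched with a simple statistical rule (the feature, sample values, and the k = 3 threshold are illustrative assumptions):

```python
import statistics

def build_baseline(samples):
    """Baseline of 'normal' behaviour: mean and standard deviation of a
    numeric traffic feature (e.g. requests per minute)."""
    return statistics.mean(samples), statistics.pstdev(samples)

def is_anomalous(value, mean, stdev, k=3.0):
    # Flag values more than k standard deviations from the baseline mean.
    return abs(value - mean) > k * stdev

normal_rates = [98, 102, 100, 97, 103, 101, 99, 100]
mean, stdev = build_baseline(normal_rates)
print(is_anomalous(100, mean, stdev))   # False: within the baseline
print(is_anomalous(400, mean, stdev))   # True: a possible new attack
```

Note the trade-off mentioned above: the rule flags the unseen spike without any signature, but a legitimate burst of traffic would be flagged just the same.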
Since the above two methods have advantages and disadvantages, the system administrator needs to combine both methods and apply machine learning to improve the efficiency of the system.

C. MACHINE LEARNING METHODS FOR THE DATA CLASSIFICATION PROBLEM
Machine learning methods [18], as well as methods of computational intelligence, are used both in detecting anomalies and in detecting abuse of granted rights and privileges. Below is a brief overview of the main machine learning methods.

Bayesian network
A Bayesian network [19,20] is a graphical probabilistic model representing a set of variables and their probabilistic dependencies according to Bayes' theorem. Many works use the naive Bayesian algorithm.
A naive Bayesian classifier [21,22] combines this model with a decision rule.
Based on experience with a naive Bayesian classifier on a large dataset, the following conclusions can be drawn:
 When the condition of independence of the variables in the model is satisfied (which depends on the nature of the data), the naive Bayesian classifier is believed to give better results than logistic regression when there is little training data;
 Although its training and classification times are shorter than those of most machine learning methods on a large dataset, the accuracy of this method is low.

K-nearest neighbors
The k-nearest neighbors (k-NN) method [23,24] is a classification method whose basic principle is to assign to an object the class that is the most common among the neighbors of that object. The neighbors are drawn from a set of objects whose classes are already known, based on a given value k (k ≥ 1), and the most numerous class among them is determined.
In [25], a mixed approach was used, combining a genetic algorithm [26] with the k-nearest neighbors classifier to detect denial of service attacks. The purpose of the genetic algorithm is to find the optimal weight vector W = (w_1, ..., w_n), where w_i is the weight of feature i (1 ≤ i ≤ n) and each w_i belongs to the segment [0, 1]. The vector W influences the computation of the distance and thereby the k-NN classification. For any two feature vectors X = (x_1, ..., x_n) and Y = (y_1, ..., y_n), the distance between them is calculated as

d(X, Y) = sqrt( Σ_{i=1..n} w_i (x_i − y_i)² ).

After the evolution of the genetic algorithm at the training stage, the optimal weight vector is obtained, which leads to the best k-NN classification result. The detection accuracy of this approach is approximately 96.75%.
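A minimal sketch of weighted k-NN with this distance (the toy vectors, labels, and the weight vector below are illustrative; in [25] the weights come from the genetic algorithm):

```python
import math
from collections import Counter

def weighted_distance(x, y, w):
    # d(X, Y) = sqrt(sum_i w_i * (x_i - y_i)^2), with each w_i in [0, 1]
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

def knn_classify(train, labels, w, query, k=3):
    """Weighted k-NN: assign the label most common among the k training
    points closest to `query` under the weighted distance."""
    ranked = sorted(range(len(train)),
                    key=lambda i: weighted_distance(train[i], query, w))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy data: feature vectors labeled 0 (normal) and 1 (attack).
train = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.2), (4.8, 5.1)]
labels = [0, 0, 1, 1]
w = (0.7, 0.3)            # hypothetical GA-optimized weights
print(knn_classify(train, labels, w, (1.1, 1.0), k=3))  # 0
print(knn_classify(train, labels, w, (5.0, 5.0), k=3))  # 1
```

The weights rescale each feature's contribution to the distance, so the genetic algorithm can suppress noisy features and emphasize discriminative ones.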

Decision tree
Decision tree [26,27] is a decision support tool used in statistics and data analysis for predictive models. A decision tree is a tree-like structure of "leaves" and "branches".
There are two types of decision trees:
 Regression trees, which estimate real-valued functions rather than being used for classification problems;
 Classification trees, which are used in classification tasks.
Compared to other data analysis methods, the use of decision trees has the following advantages:
 The relationships between input and output data are reflected in the shape of the decision tree, so the characteristics of the input data are easy to see;
 Other machine learning methods require a lot of preprocessing, while a decision tree requires almost none;
 Unlike methods such as neural networks, which are considered black-box models, the decision tree is a white-box model;
 It supports assessing the accuracy of the created models.

Neural network
An artificial neural network (ANN) [28] is a mathematical model, as well as its software or hardware implementation, built on the principle of the organization and functioning of biological neural networks -networks of nerve cells of a living organism.
Since a conventional intrusion detection system is not always able to identify every attack, a system is needed that can regularly learn the signs of new attacks. The purpose of using neural networks is to detect attacks such as the use of malicious code (viruses, Trojans, etc.), packet flooding, network scanning, denial of service (DoS) attacks, privilege escalation, and so on.

Support vector machine
The essence of the support vector machine (SVM) method is to transform the original data space into a new feature space in which a simpler classification is possible. Any point of the dataset is mapped to a specific coordinate in this space.
SVM has some advantages:
 It obtains a classification function with a minimal upper estimate of the expected risk (the level of classification error);
 It uses a linear classifier to work with nonlinearly separable data, combining simplicity with efficiency.
The disadvantages of the method are as follows:
 When the number of attributes in the dataset is much larger than the number of samples, the algorithm gives rather poor results;
 Classification by this method is an attempt to separate objects into two classes by a hyperplane, and it does not estimate the probability of points appearing in the separating set.
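The separating-hyperplane idea can be sketched with a minimal linear SVM trained by subgradient descent on the regularized hinge loss (toy data and hyperparameters are illustrative; this is the principle only, not the solver used in the paper):

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimal linear SVM: subgradient descent on the regularized hinge
    loss. Labels must be -1 or +1. A sketch of the principle only --
    a production WAF would use a tuned library solver."""
    n = len(X[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:      # point inside the margin: hinge is active
                w = [wj + lr * (yi * xj - lam * wj) for wj, xj in zip(w, xi)]
                b += lr * yi
            else:               # only the regularizer shrinks w
                w = [wj - lr * lam * wj for wj in w]
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Linearly separable toy data: -1 = normal request features, +1 = attack.
X = [(0.0, 0.1), (0.2, 0.0), (1.0, 1.1), (0.9, 1.0)]
y = [-1, -1, 1, 1]
w, b = train_linear_svm(X, y)
print(predict(w, b, (0.0, 0.0)))   # -1
print(predict(w, b, (2.0, 2.0)))   # 1
```

The hinge condition (margin < 1) is what maximizes the separating margin; points with margin ≥ 1 contribute nothing except regularization, which mirrors the "support vectors only" property of the method.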
Let us consider the essence of the proposed new approach in the problem of classification of queries based on the support vector machine and some additional attributes of queries.

IV. A NEW APPROACH TO IMPROVING WAF PERFORMANCE IN THE CLASSIFICATION PROBLEM
There are many algorithms for classifying queries, and a modern WAF contains several built-in modules: a regular expression module, a behavior module, a tokenization module, an artificial intelligence module, and so on. The new approach is to use a machine learning method (the support vector machine) together with some additional query attributes.
The classification process consists of 7 main stages:
1. data collection to create a query base (A1);
2. preliminary data processing (A2);
3. payload comparison (A3);
4. checking regular expressions (A4);
5. calculation of request attributes (A5);
6. converting text data into vectors (A6);
7. classification of queries based on the support vector machine (A7).
Each stage is described in detail below. Data preprocessing is a very important step in solving any machine learning problem: most datasets used in machine learning tasks need to be processed, cleaned, and transformed before an algorithm can be trained on them.
The datasets for different problems naturally differ. In this task, the data are sorted into tables by fields. Before sorting the data, we deleted all words that carry no meaning (stop words). These words are identified by a software function developed by the author (in Anaconda with Python 3) together with the author's stop-word library.
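The stop-word removal step can be sketched as follows (the stop-word list and the tokenization regex are illustrative; the paper uses the author's own library):

```python
import re

# An illustrative stop-word list; the paper uses the author's own library.
STOP_WORDS = {"the", "a", "an", "of", "and", "or", "to", "in"}

def preprocess(request: str):
    """Lowercase a request string, split it into word tokens, and drop
    stop words that carry no meaning for classification."""
    tokens = re.findall(r"[a-z0-9]+", request.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("GET /search?q=the+price+of+an+item"))
# ['get', 'search', 'q', 'price', 'item']
```

The cleaned token stream is what the later tf-idf stage (A6) consumes.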

C. PAYLOAD COMPARISON (A3)
Payload is an important term used in many fields of science, including information technology.
 In the learning phase: when dangerous queries are examined, the payload is extracted and stored in the database.
 In the discovery phase: after the data has been processed, the payload comparison module is launched. The payload is extracted from the input request and compared to the list of saved payloads. If the payload is found, the request is blocked. Otherwise, the remaining requests are sent to the "checking regular expressions" module.
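The discovery-phase comparison reduces to a set-membership test (the stored payloads below are illustrative; the paper keeps them in a MySQL database):

```python
# Payloads harvested from dangerous queries during the learning phase
# (illustrative examples only).
known_payloads = {"' or 1=1 --", "<script>alert(1)</script>"}

def check_payload(payload: str) -> str:
    """Discovery phase: block the request if its payload is already in
    the stored list, otherwise pass it on to the regex module."""
    if payload.lower() in known_payloads:
        return "blocked"
    return "forward to regex check"

print(check_payload("' OR 1=1 --"))           # blocked
print(check_payload("harmless search term"))  # forward to regex check
```

Using a hash-based set makes this lookup O(1) per request regardless of how many payloads have been stored.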

D. CHECKING REGULAR EXPRESSIONS (A4)
Regular expressions are patterns used to find character sets within strings. In many programming languages, such as JavaScript, C#, and so on, regular expressions are also objects.
 In the learning phase: regular expressions were created from the investigation of dangerous queries.
 In the discovery phase: after the payload comparison module, the regular expression validation module is launched. Input queries are checked against the list of stored regular expressions. If a request matches at least one saved attack pattern, the request is blocked. Otherwise, the remaining requests are sent to the "calculation of request attributes" module.
Table 1 presents some regular expressions for detecting code injection attacks.
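The regex-checking stage can be sketched as follows (the patterns below are simplified illustrations in the spirit of Table 1, not the paper's actual rule set):

```python
import re

# Simplified detection patterns (illustrative only); real WAF rule sets
# are far more extensive.
ATTACK_PATTERNS = [
    re.compile(r"union\s+select", re.IGNORECASE),            # SQL injection
    re.compile(r"<\s*script", re.IGNORECASE),                # XSS
    re.compile(r"(;|\|\|)\s*(cat|ls|rm)\b", re.IGNORECASE),  # OS command
]

def check_regexes(request: str) -> str:
    """Block the request if it matches any stored attack pattern,
    otherwise forward it to the attribute-calculation module."""
    for pattern in ATTACK_PATTERNS:
        if pattern.search(request):
            return "blocked"
    return "forward to attribute calculation"

print(check_regexes("id=1 UNION SELECT passwd FROM users"))  # blocked
print(check_regexes("id=42&sort=price"))   # forward to attribute calculation
```

Compiling the patterns once and reusing them keeps the per-request cost low even with a large rule list.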

E. CALCULATION OF REQUEST ATTRIBUTES (A5)
After examining the structure of HTTP requests from the datasets, we added three new attributes to the request classification process: the change in request length, the change in the length of the request arguments, and the frequency of occurrence of key characters. During the training phase, we saved all the necessary values, such as request lengths, request argument lengths, and key symbols. These values are stored in the database (our task uses the MySQL database management system).

1. The length of the request sent from the browser
We assume that the length of a request sent from the user's browser varies only slightly within a certain range. In the event of an attack, however, the data field may change and the length of the request grows, for example in the case of SQL injection or cross-site scripting. Therefore, it is proposed to use the change in request length to detect attacks from users.
 In the learning phase: the lengths of legitimate requests are collected, and the mathematical expectation μ_u and variance σ_u² of this data set are computed.
 In the discovery phase: Chebyshev's inequality is applied to estimate the probability that a random variable takes a value far from its mean:

P(|x − μ_u| ≥ t_u) ≤ σ_u² / t_u²,

where x is a random variable and t_u is any positive value.

Accordingly, for any probability distribution with mean μ_u and variance σ_u², given an observed value x, the probability that x deviates from the mean μ_u by more than t_u is bounded by the threshold σ_u² / t_u².
In this case t_u = |l_a − μ_u| is chosen, where l_a is the length of the input request. The higher the resulting value σ_u² / t_u², the closer l_a is to the average; a small value is a sign of an attack.
2. The length of the arguments of the request sent from the browser
 In the learning phase: we take the lengths of the arguments of the input data, l_a1, l_a2, ..., l_an, and compute the expectation μ_a and variance σ_a² of this data set.
 In the discovery phase: Chebyshev's inequality is applied with the stored expectation μ_a and variance σ_a² to estimate the probability that a random variable takes a value far from its mean:

P(|x − μ_a| ≥ t_a) ≤ σ_a² / t_a²,

where x is a random variable and t_a is any positive value.
Accordingly, for any probability distribution with mean μ_a and variance σ_a², given an observed value x, the deviation of x from the mean μ_a is judged against the bound σ_a² / t_a² in the same way as for the request length.
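The Chebyshev-based length attribute can be sketched as follows (the training lengths are toy values; the paper reads them from its database):

```python
import statistics

def chebyshev_bound(lengths, observed):
    """Attribute based on Chebyshev's inequality: the bound
    sigma^2 / t^2 with t = |l - mu|. Large values mean the observed
    length is close to the training mean; small values suggest an attack."""
    mu = statistics.mean(lengths)
    var = statistics.pvariance(lengths)
    t = abs(observed - mu)
    if t == 0:
        return float("inf")   # exactly the mean: maximally "normal"
    return var / (t * t)

# Lengths of legitimate requests collected in the learning phase (toy data).
train_lengths = [60, 62, 58, 61, 59, 60]
print(chebyshev_bound(train_lengths, 61))   # large: typical length
print(chebyshev_bound(train_lengths, 400))  # tiny: likely injected payload
```

The same function applies unchanged to argument lengths; only the stored statistics differ.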

The frequency of occurrence of key symbols
From the training sample of legitimate requests, distinct non-repeating characters (including those in different encodings) are selected to compose a set of alphabet characters S. When a symbol b ∈ S appears in a request, the counter p_b for this attribute is increased by one. The value of the attribute itself is calculated as the ratio of the counter value to the cardinality of the alphabet set: p_b / |S|.

F. CONVERTING TEXT DATA INTO VECTORS (A6)
The module for converting string data into vectors is implemented using the tf-idf technique, which makes it possible to evaluate the importance of a word in a query string.
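A minimal pure-Python sketch of tf-idf weighting over tokenized queries (toy data; the real system builds its vocabulary from the training queries and appends the three attributes of stage A5):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # tf(t, d): frequency of t among all words of the query d
    counts = Counter(doc_tokens)
    return counts[term] / len(doc_tokens)

def idf(term, all_docs):
    # idf(t, D) = log(|D| / number of queries containing t)
    containing = sum(1 for d in all_docs if term in d)
    return math.log(len(all_docs) / containing)

def tfidf_vector(doc_tokens, vocab, all_docs):
    """Map one tokenized query to a vector over a fixed vocabulary."""
    return [tf(t, doc_tokens) * idf(t, all_docs) if t in doc_tokens else 0.0
            for t in vocab]

docs = [["select", "id", "from", "users"],
        ["search", "price", "item"],
        ["select", "price"]]
vocab = sorted({t for d in docs for t in d})
vec = tfidf_vector(docs[0], vocab, docs)
print(vocab)
print([round(v, 3) for v in vec])
```

Words that appear in every query get an idf of zero and so carry no weight, while rare, query-specific words dominate the vector.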
Applying tf-idf in this task, for each request we find the words in the request. For each word t in query d from the set of queries D, the value tf-idf(t, d, D) = tf(t, d) × idf(t, D) is used, where

tf(t, d) = n_t / Σ_v n_v (n_t is the number of occurrences of word t in query d, and the sum runs over all words v of the query d);
idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| ).

After the tf-idf computation, the query string data are converted into vectors, and the 3 attribute values described above are appended to the query vectors. Each request is labeled 0 (not an attack) or 1 (an attack).

G. CLASSIFICATION OF QUERIES BASED ON THE SUPPORT VECTOR MACHINE (A7)
We build a linear threshold classifier of the form

a(x) = sign( Σ_{i=1..n} λ_i y_i ⟨x_i, x⟩ − b ),

where λ = (λ_1, ..., λ_n) is the vector of dual variables, (x_i, y_i) are the training samples, and b is the bias.
Solving the dual optimization problem yields the coefficients λ_i and the bias b; substituting them into the classifier determines the class of each incoming request.
After applying the support vector machine to classify queries, two sets of queries are obtained, tagged "0" or "1". All requests tagged "0" are executed on the server, and the rest, tagged "1", are blocked.
The experiment used data from popular sources (the source address is specified in paragraph 4.1) and 3-grams with cross-validation to verify the results of the approach. The result of checking the dataset (20,000 dangerous queries and 100,000 normal queries) using tf-idf (80% of the data for training and 20% for testing) is presented in Table 2. Since the accuracy of the support vector machine depends on the choice of its parameters, the method was evaluated experimentally while varying some of them; the results of this assessment are shown in Figure 1. Note that the new three-attribute support vector machine approach gives better results than the known ones, with accuracy values approaching 1 (80% training data and 20% testing data). However, the new approach is effective only for attacks that change the query length (code injection, cross-site scripting) and requires high computational power as the dataset grows.

VI. CONCLUSION
The paper provides a brief overview of popular attacks on web applications and of methods for detecting them, together with a comparative analysis of these methods. Each method has its own advantages and disadvantages; hence the study used not only the signature method but also machine learning methods to improve the performance of the WAF. Machine learning is widespread and is used in many intrusion detection and prevention systems.
Combining signature-based methods with machine learning methods makes intrusion detection systems more intelligent and autonomous when new attacks are detected, since static methods can be bypassed by attackers.
To improve the accuracy of the proposed approach, it is proposed:
 to use a combination of machine learning methods;
 to increase the number of quality attributes of queries;
 to add regular expressions for specific attacks;
 to update the signature databases, since the WAF works not only with signatures but also with anomaly detection (including machine learning methods).
Further research will focus on cloud intrusion detection systems and a firewall service for cloud web applications, as cloud computing is a major paradigm shift for computer networks.