Static Feature Selection for IoT Malware Detection

Nguyen Ngoc Toan, Luong The Dung, Dang Quang Thang

Our world has recently witnessed the explosive growth of IoT networks as one of the pillars of the 4th industrial revolution. Malware on IoT devices has grown accordingly, both in number and in the sophistication of its techniques. It is therefore necessary to develop more efficient approaches to IoT malware detection, with machine learning models that can run in resource-constrained solutions. In this paper, we study and evaluate the efficiency of a feature selection method based on term frequency-inverse document frequency (TF-IDF) weights, combined with an effective machine learning model, for IoT malware detection based on opcode sequence features. We performed experiments on a MIPS ELF dataset that includes 4,511 malicious samples in four main classes and 4,393 benign programs. The experimental results show that the proposed method performs very well on this dataset, with detection and classification accuracies of 99.8% and 95.8% respectively, while the models use only the 20 opcodes with the highest weight values.


I. INTRODUCTION
The number of IoT devices in information systems has increased rapidly, from over 16 billion in 2015 to 30 billion in 2020 [2], and is projected to reach 30.9 billion installed devices by 2025 [3]. Common cyberattacks involve malware, advanced persistent threats (APTs), and ransomware, which take control of and destroy systems. According to Gibert et al. [1], malware metamorphosis on mobile devices increased by over 50% in 2017. In addition, IoT attacks increased by nearly 600%, with the Mirai malware and its variants responsible for some of the most powerful distributed denial-of-service attacks to date. Therefore, protecting IoT devices against attacks and malware is a crucial requirement today.
As usual, the fight against malicious software begins with knowledge about the signatures of malicious activities. The widespread use of the Internet has led organizations to build more advanced technologies and standard security solutions to protect IoT devices from attackers. There are two main methods for malware detection: behavior-based and signature-based detection. Signature-based detection is very effective at detecting known malware thanks to its high accuracy and the clarity with which detection systems can be built. However, it is weak at detecting unknown, polymorphic, and metamorphic malware. In behavior-based detection, on the other hand, an anomaly is defined as an observation that appears to be inconsistent with the remainder of the data set [4]. Malicious behavior is identified as a deviation from the regular behavior of the users in the information system. The advantage of behavior-based detection lies in detecting unknown intrusions. Anomaly-based detection methods can detect unknown malware, but modeling normal behavior is complex work. The normal behavior is modeled using machine learning techniques. Therefore, anomaly-based detection methods can be used for IoT malware detection on previously underserved architectural platforms such as Embedded Linux OS.
Most researchers in the information security community focus on techniques for identifying and detecting Windows malware, particularly on Intel's x86-64 architecture. However, the MIPS processor architecture is used in popular IoT devices such as switches, routers, access points, and IP cameras [3]. When an application runs on different processor architectures and operating systems, its behaviors differ. Therefore, it is necessary to study malware detection on IoT devices that use the Embedded Linux OS and MIPS processors.
In this direction, researchers have obtained many promising results on machine learning-based malware detection using static or dynamic features. Dynamic features include memory usage [30], instruction traces [9], network traffic [11], and API call traces [10], [12]. The effectiveness of dynamic analysis is highly dependent on the malware execution environment. Commonly used static features include strings [13], byte n-grams [14], opcodes [15], function call graphs [16], entropy-based features [17], etc. In this paper, we focus on an IoT malware detection method using opcode sequences. Quality features are crucial for building effective machine learning-based classification models. Opcode sequences are used effectively in malware detection, but too many features increase model complexity and cause the "curse of dimensionality" problem [8]. In addition, data normally contains significant noise and irrelevant features that add little or no value to the performance of the learning algorithms. Therefore, we propose an effective opcode feature selection method that overcomes these problems. The three main contributions of our work are as follows:
- We propose a list of top MIPS opcode features that works well for IoT malware detection and classification.
- We formulate a framework for IoT malware detection based on static features and evaluate the malware detection and classification capabilities of the feature selection techniques using machine learning models.
- We demonstrate the usefulness of feature selection in IoT malware classification systems.
The rest of the paper is structured as follows. Related works on malware detection based on static features are discussed in Section II. Our feature selection method is introduced in Section III. Section IV presents the framework used in this paper together with experiments and evaluations. Finally, the conclusion and future works are discussed.

II. RELATED WORKS
A. MALWARE DETECTION BASED ON STATIC FEATURES
Malware analysis is the process of determining the malicious behavior of a program. It is usually based on static and dynamic features [3]. Dynamic behavior-based malware detection methods usually need a secure and controlled environment, such as a virtual machine or a sandbox. In IoT malware analysis, dynamic feature-based methods have significant limitations: they are time consuming, require considerable analyst attention, and depend on the analysis environment. Static analysis, in contrast, is unaffected by the analysis environment because static features are extracted by analyzing the source code or binary code of the software from the perspectives of syntax and semantics, without running it.
Much research has been done and many approaches have been proposed to detect malware with machine learning using static features. Specific features for training the machine learning model include operation code (opcode) sequences extracted after disassembling the binary file, the control flow graph (CFG) extracted from the assembly file, API calls extracted from the binary, etc.
J. Z. Kolter and M. A. Maloof [5] proposed using byte n-grams rather than raw byte sequences and compared the performance of Naive Bayes, decision trees, and support vector machines for malware detection. Later, artificial neural networks [6], [7] were also used for malware detection. However, they require more processing time on large amounts of data.
Ding et al. [24] proposed a static feature extraction method based on control flow graph (CFG) features that detects malware with high effectiveness. However, the method seems to work efficiently only for small files, where the average number of CFG vertices is fewer than 6,000.
Bilar [2] analyzed executable files and demonstrated a difference between the opcode frequency distributions of benign files and malware. This research shows that rare opcodes seem to be a stronger predictor, explaining 12-63% of the frequency variation. However, the study focuses only on the frequency of opcodes, and the sample size in its experiments was only 97.

B. OPCODE FEATURE SELECTION
Operation code (Opcode) is one of the common features used in malware detection. An opcode is a single instruction that can be executed by a microprocessor (CPU), which can describe the behavior of an executable file.
The performance of any detection, classification, or prediction system depends closely on its features. Today, feature selection approaches are organized into filter, wrapper, embedded, and hybrid methods. Filter methods select a subset of features without altering their original representation [20]. They are used with many machine learning models because they are not tied to any particular learning method. Wrapper-based feature selection uses a classification model to assess the suitability of the features.
Canedo et al. [19] analyzed seven filter-based, two wrapper-based, and two embedded feature selection methods on microarray data sets under four machine learning classifiers. Xue et al. [21] compared wrapper-based feature selection and filter-based feature subset selection methods with respect to classification performance and execution time. These studies show that filter methods are faster than wrapper methods but achieve much lower classification performance.

C. MACHINE LEARNING METHODS FOR MALWARE DETECTION BASED ON OPCODE FEATURES
A faster classification method is needed to keep pace with the exponentially increasing number of malware variants. A machine learning approach to malware detection should be adopted for faster and more efficient classification. Machine learning models for malware detection and classification can be trained on opcode features extracted from executable programs.
The use of opcode features has proven efficient in detecting malware in general and IoT malware in particular. Santos et al. [25] extracted opcodes by disassembling the binary file and converting them into subsequences of length n. The authors use the frequency of the opcode and the category of the source binary to assign a weight to each opcode. Machine learning models including K-nearest neighbors (KNN), Decision Tree (DT), Support Vector Machine (SVM), Naive Bayes classifier, and Bayesian network were then used for detection, and their performance was compared for various sizes of n.
A Support Vector Machine (SVM) finds the separating hyperplane that maximizes the margin, that is, the distance from the hyperplane to the closest data points, so that the data points are as "safe" as possible with respect to the separating surface. SVM classifiers are widely used in text classification and malware classification, and they work well with the n-gram feature extraction method.
A decision tree (DT) is a hierarchical tree structure used to classify objects based on a series of rules. The attributes of the object (except the class attribute) can be of different data types, including binary, nominal, ordinal, and quantitative values, while the class attribute must be binary or ordinal. Given data about objects consisting of attributes and their classes, the decision tree generates rules to predict the class of unseen data.
Random Forest (RF) is a member of the family of decision tree algorithms. Random Forest treats each decision tree as an independent voter (as in a real election), and the answer with the most votes from the decision trees is selected. To make sure that not all decision trees give the same answer, each tree is trained on a random bootstrap sample of the training data, in which some observations are left out and others are repeated.
Naive Bayes (NB) is based on probability calculations and has given good results in detecting malicious code [22]. The Naive Bayes classifier applies Bayes' theorem, under the assumption that features are independent given the class, to classify data based on observed statistics. It is widely used to make predictions from a collected dataset.
Yewale et al. [26] selected the 20 most frequently used opcodes from a set of benign programs and detected malicious programs using DT and SVM models. However, only a small-scale dataset was used for model training and testing, so the same performance cannot be guaranteed on a larger dataset. Jerome et al. [27] used opcode sequences with machine learning and experimented with 2-, 3-, 4-, and 5-gram opcode features. Feature ranking and selection were done by computing the information gain of each n-gram of opcode sequences. BooJoong Kang [28] presented and evaluated a method to identify and categorize Android malware based on n-gram opcode features and machine learning. The method allows features to be extracted and learned automatically from given datasets. Good classification accuracy can be achieved by using opcode frequencies with a small n.

III. PROPOSED METHOD
As shown in Figure 1, the general automated classification framework for IoT malware detection consists of four phases: (1) the preprocessing phase (which includes feature extraction), (2) the opcode feature selection phase, (3) the detection phase, and (4) the classification phase.
The opcode sequence collection process starts by using the IDA Pro tool [29] to disassemble the ELF file samples. The resulting assembly files are then processed with a Python script to obtain the opcode sequences. After extraction, the opcode sequences of samples that are packed or too short are removed. The result of this process is the opcode dataset.
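The preprocessing step can be illustrated with a minimal Python sketch. It is not the authors' script: the listing layout (a `.text:ADDRESS` prefix followed by the mnemonic, with `#` starting a comment), the helper names `extract_opcodes` and `keep_sequence`, and the sample listing are all assumptions made for illustration. The minimum length of 50 is the threshold reported in the data collection section; packer detection is not covered here.

```python
import re

# Assumed listing layout: ".text:<hex address>  <mnemonic> <operands>  # comment"
TEXT_LINE = re.compile(r"^\.text:[0-9A-Fa-f]+\s+(.*)$")
MNEMONIC = re.compile(r"^[a-z][a-z0-9.]*$")


def extract_opcodes(asm_text):
    """Return the list of opcode mnemonics found in a disassembly listing."""
    opcodes = []
    for line in asm_text.splitlines():
        match = TEXT_LINE.match(line)
        if not match:
            continue  # not a code-segment line
        body = match.group(1).split("#", 1)[0].strip()  # drop trailing comment
        if not body:
            continue
        token = body.split()[0].lower()
        # Labels ("main:"), directives (".set") and symbols ("var_10") are skipped.
        if MNEMONIC.match(token):
            opcodes.append(token)
    return opcodes


def keep_sequence(opcodes, min_length=50):
    """Discard sequences that are too short to be informative."""
    return len(opcodes) >= min_length


sample_listing = """\
.text:00400000 main:
.text:00400000                 addiu   $sp, -0x20   # prologue
.text:00400004                 sw      $ra, 0x1C($sp)
.text:00400008                 .set noreorder
.text:00400008                 jal     sub_400100
.data:00410000 buf: .word 0
"""
sample_opcodes = extract_opcodes(sample_listing)
```

In this toy listing only the three instruction lines contribute opcodes; a real pipeline would apply `keep_sequence` to every extracted sequence before building the dataset.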

B. FEATURE SELECTION
Both the number and the quality of the features matter when training ELF file classification models with respectable accuracy and fewer resources. Therefore, feature selection is used to remove irrelevant, constant, redundant, and correlated features from the raw feature dataset before training the models. A variety of methods for selecting the best features have been widely deployed in malware detection research. In our approach, an opcode is treated as a word in a language model and an opcode sequence as a sentence, so that the meaning of a sentence can be predicted from some of its keywords. In an opcode sequence, each opcode therefore carries a different level of meaning, and some opcodes can represent the whole opcode sequence.
Generally, the opcode feature selection method follows the typical scenario described in Fig. 2. Our paper estimates the significance of an opcode based on its weight in the Term Frequency-Inverse Document Frequency (TF-IDF) model. TF-IDF is a statistical weight that expresses the importance of a word in a document, which itself belongs to a set of documents. Term frequency (TF) estimates how frequently opcodes occur in the opcode sequences. Because sequence lengths differ, the number of occurrences of an opcode can vary greatly. The number of occurrences is therefore divided by the length of the sequence (the equivalent of the total number of words in a text). The term frequency of opcode x is calculated as follows:

$$TF(x) = \frac{fr(x)}{sum(s)} \tag{1}$$

where fr(x) is the number of times x occurs in opcode sequence s and sum(s) is the total number of opcodes in sequence s.
Inverse Document Frequency (IDF) estimates the influence of opcodes. When only the frequency of occurrence is calculated, all opcodes are considered equally important. However, some opcodes are used often but are not important for expressing the meaning of the opcode sequence. IDF therefore reduces the weight of common opcodes that appear in almost every sequence. The Inverse Document Frequency is defined as:

$$IDF(x, D) = \log \frac{N}{D(x)} \tag{2}$$

where N is the total number of opcode sequences in the set and D(x) is the number of opcode sequences containing opcode x.
Determining the importance of opcodes in sequences has much in common with determining the relevance of words to documents. Researchers have obtained many promising results with the term frequency-inverse document frequency of words in documents, such as [32], [33]. Therefore, we propose to use the Term Frequency-Inverse Document Frequency (TF-IDF) of opcode x, which is determined as:

$$\text{TF-IDF}(x) = TF(x) \cdot IDF(x, D) \tag{3}$$

After determining the TF-IDF of the opcodes in the opcode dataset, we use the n highest-weighted opcodes for the malware detection and classification problems.
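The weighting and selection step can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: TF is computed over the whole sequence set (total occurrences of an opcode divided by the total number of opcodes), the natural logarithm is used for IDF, ties are broken alphabetically, and the function names and toy sequences are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(sequences):
    """Map each opcode to TF(x) * IDF(x, D) over a set of opcode sequences."""
    total_opcodes = sum(len(seq) for seq in sequences)           # sum(s), corpus level
    occurrences = Counter(op for seq in sequences for op in seq)  # fr(x)
    containing = Counter(op for seq in sequences for op in set(seq))  # D(x)
    n_sequences = len(sequences)                                  # N
    return {
        op: (occurrences[op] / total_opcodes)
        * math.log(n_sequences / containing[op])
        for op in occurrences
    }


def top_opcodes(sequences, n):
    """Return the n opcodes with the highest TF-IDF weight."""
    weights = tfidf_weights(sequences)
    # Sort by descending weight; alphabetical order breaks ties (assumption).
    return sorted(weights, key=lambda op: (-weights[op], op))[:n]


toy_sequences = [
    ["addiu", "lw", "jal"],
    ["addiu", "sw", "jal", "jal"],
    ["addiu", "lw", "syscall"],
]
selected = top_opcodes(toy_sequences, 2)
```

In the toy data, `addiu` occurs in every sequence, so its IDF is log(1) = 0 and it receives zero weight despite being frequent, which is exactly the effect the IDF term is meant to have.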
In addition, the n-gram method is used to calculate the quantities fr(x) and D(x) in the TF-IDF formula and to extract features for training the machine learning models. An n-gram is a contiguous subsequence of n items taken from a longer sequence, such as n consecutive words in a corpus. In this paper, a sequence of opcodes is embedded into a vector space using n-grams. Each element of a feature vector represents the presence or absence of the corresponding n-gram in the opcode sequence. The n-gram method has proven effective in sequence-based malware detection [18], [22]. Kang et al. [23] presented and evaluated an approach based on n-gram opcode features that uses machine learning to identify and categorize Android malware with an F-measure of 98%. In n-gram feature extraction, if n is too small (unigrams), the information obtained is only the frequency of occurrence of single opcodes. If n is too large, the number of features becomes very large, especially for malware that uses transformation techniques. Other studies also show that bigrams and trigrams can give good results.
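As a rough illustration of the presence/absence embedding, the following Python sketch builds n-gram feature vectors from opcode sequences. How the selected opcodes are combined with the n-grams is not spelled out in the text; the sketch assumes that each sequence is first reduced to the selected opcodes and that n-grams are then taken over the reduced sequence. All function names and the toy data are hypothetical.

```python
def filter_to_selected(sequence, selected):
    """Keep only the opcodes chosen by feature selection (assumed combination step)."""
    return [op for op in sequence if op in selected]


def ngrams(sequence, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(sequence[i:i + n]) for i in range(len(sequence) - n + 1)]


def build_vocabulary(sequences, n):
    """Collect every n-gram seen in the dataset, in a fixed (sorted) order."""
    return sorted({gram for seq in sequences for gram in ngrams(seq, n)})


def presence_vector(sequence, vocabulary, n):
    """1 if the n-gram occurs in the sequence, 0 otherwise."""
    present = set(ngrams(sequence, n))
    return [1 if gram in present else 0 for gram in vocabulary]


selected_opcodes = {"jal", "sw", "lw"}
raw_sequences = [
    ["addiu", "sw", "jal", "lw"],
    ["lw", "addiu", "jal", "sw"],
]
reduced = [filter_to_selected(seq, selected_opcodes) for seq in raw_sequences]
vocabulary = build_vocabulary(reduced, 2)
vectors = [presence_vector(seq, vocabulary, 2) for seq in reduced]
```

With k selected opcodes there are at most k^n distinct n-grams, which is why restricting the vocabulary to a small opcode list keeps the feature space manageable even for 3-grams and 4-grams.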

C. MALWARE DETECTION AND CLASSIFICATION MODELS
The ideal detection and classification models for evaluating the direct impact of feature selection algorithms are models that do not embed feature selection themselves. The classification methods mentioned above are not only popular machine learning models but also do not perform embedded feature selection. Therefore, they are suitable for evaluating the proposed feature selection method.
In this paper, we choose the most effective machine learning model experimentally among supervised learning algorithms such as SVM, RF, and NB.

IV. EXPERIMENTS AND EVALUATIONS
A. DATA COLLECTION
The IoT dataset used for testing includes 8,904 MIPS ELF samples: 4,511 malware and 4,393 benign samples. The dataset is collected from different sources on the Internet such as IoTPOT, VirusShare, VirusTotal, and Detux, as well as from programs available on Embedded Linux. In our experiments, malware samples are labeled with the Symantec label because Symantec detects malware well and names it explicitly. There are 37 different malware families, including many popular ones such as Mirai, LightAidra/Aidra/Zendran, Gafgyt/BASHLITE/Lizkebab/Torlus, Dofloo/Spike/MrBlack/Wrkatk/Sotdas/AES.DDoS/DnsAmp, Moose, Hajime, Tsunami/Kaiten, Trojan.Gen, SecurityRisk, etc. We select only the 4 families with more than 100 samples each, to avoid classes with too few samples when classifying. The number of malware samples per label is shown in Figure 3.
Then, assembly files are extracted with IDA Pro 6.6, and opcode sequences are generated by a Python script. The opcode sequences of samples that are packed or shorter than 50 opcodes are removed. The resulting opcode sequence dataset is shown in Table I.
The top MIPS opcode features are combined with the n-gram method for feature selection before machine learning models are used in the classification stage.

C. EVALUATION METRIC
In this paper, several evaluation metrics are used to measure the effectiveness of the approach.
Accuracy is defined as:

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

where True Positive (TP) is the number of malware samples identified correctly; False Positive (FP) is the number of benign samples wrongly predicted to be malware; True Negative (TN) is the number of trusted applications identified correctly; and False Negative (FN) is the number of malware samples taken as trusted programs.
The F1-score is the harmonic mean of the recall and precision of one class:

$$F1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}, \qquad Precision = \frac{TP}{TP + FP}$$

In the above formula, False Positive (FP) is the number of trusted programs detected as malware.
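For concreteness, these metrics can be computed from the four confusion counts as in the short Python sketch below. Recall, which enters the F1-score, is taken in its standard form TP / (TP + FN). The function names and the toy labels (1 for malware, 0 for benign) are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen here, not something stated in the paper.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem (positive = malware)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(tp, fp, tn, fn):
    """Return accuracy, precision, recall and F1 from the confusion counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0   # 0.0 by convention if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
scores = evaluate(*counts)
```

For the multi-class classification task, the same per-class quantities can be averaged with weights proportional to the class sizes, which is presumably what the weighted F1 values reported in the results refer to.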
Recall is the fraction of samples in the ground truth that are correctly classified:

$$Recall = \frac{TP}{TP + FN}$$

The experiments analyze the influence of the feature selection methods on the performance of the classifier models. Our experiments were run on a 64-bit Windows 10 operating system with an Intel Core i7-6500U at 2.59 GHz and 8 GB of RAM. We consider the predictive performance of the opcode features recommended by the feature selection methods discussed in Section III. To evaluate performance, three machine learning models that use the n-gram feature selection method are run on the same dataset as our method. For SVM, we chose the parameters gamma='scale', decision_function_shape='ovr', cache_size=500, tol=2e-3, and break_ties=True. For RF, max_depth=200 and random_state=2 are used. The comparison results are shown in the tables below.
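The reported hyperparameters match the keyword arguments of scikit-learn's `SVC` and `RandomForestClassifier`, so the three models could be configured roughly as follows. The use of scikit-learn is inferred from the parameter names rather than stated in the paper; `break_ties` is given as the Boolean `True`; the Naive Bayes variant is not specified, so `BernoulliNB` is an assumption motivated by the binary presence/absence features. The random matrix stands in for real n-gram vectors and is for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC


def build_models():
    """Return the three classifiers with the hyperparameters listed in the text."""
    return {
        "SVM": SVC(gamma="scale", decision_function_shape="ovr",
                   cache_size=500, tol=2e-3, break_ties=True),
        "RF": RandomForestClassifier(max_depth=200, random_state=2),
        # Assumption: the NB variant is not stated; BernoulliNB suits 0/1 features.
        "NB": BernoulliNB(),
    }


# Toy stand-in for the n-gram presence matrix (rows: samples, columns: n-grams).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(60, 8))
y = X[:, 0] | X[:, 1]  # synthetic labels, not real malware data

models = build_models()
predictions = {name: model.fit(X, y).predict(X) for name, model in models.items()}
```

All other arguments are left at the library defaults, since the text does not report them.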
In the malware detection approach, when the number of top opcode features is 20, the Random Forest model achieves the highest accuracy of 99.8% and a weighted F1-score of 99.8%. If the number of opcode features is too small, malware detection results are unsatisfactory because there is not enough information to characterize malware and benign programs. Conversely, if too many opcode features are chosen, the set contains noisy features, which makes classifying malicious code less effective. In [31], the 14 most frequent opcodes on Intel are selected for the detection models. In our experiments, we vary the number of top opcode features (5, 10, 14, 16, 20, 30, and 40) with three machine learning models based on n-grams; the results are given in Table III and Table IV. These results can be explained by the fact that the unfiltered opcode features contain various anomalies, such as noise, that degrade the performance of a machine learning algorithm.
Overall, Random Forest gives the best results compared to the other machine learning models, with an accuracy of 99.8% for both 2-grams and 3-grams when 20 opcodes are selected. In addition, Figure 5 shows the execution time of the various models. We can observe from the figure that Naive Bayes (NB) is the fastest classifier. RF is naturally robust with a small number of opcodes, but it consumes more memory and computation time for training. In the same setting, SVM needs more computation time for training because of its large memory requirements. By filtering out extraneous opcode features from the feature set, the running time and space complexity of the learning algorithms can be significantly reduced, which also yields a more general classifier. Therefore, an optimal set of features is necessary to build efficient machine learning models. In the malware classification approach with 20 selected opcodes, the highest accuracy of 95.8% and a weighted F1-score of 95.7% are achieved with Random Forest based on 4-grams. The malware classification results using 20 opcode features are shown in Figure 5.
Similar to the detection approach, Figure 6 shows that Naive Bayes (NB) is also the fastest classifier model.
In summary, the opcode sequences of programs have been used and have shown good performance in detecting IoT malware. In our proposed method, we estimate the significance of an opcode based on its Term Frequency-Inverse Document Frequency weight in order to select the most effective opcodes for malware detection and classification. To evaluate the performance of our method, experiments were conducted, and the results show that it achieves a highest accuracy of 99.8% for detection and 95.8% for classification with only 20 opcodes.
In static analysis in general, and in opcode-based malware analysis in particular, feature extraction remains difficult when malware uses complex techniques such as code encryption, obfuscation, and polymorphism. In the future, other opcode sequence analysis methods can be extended to solve more complicated malware detection problems, for example by combining dynamic and static features. Deep learning methods combined with additional features could also be considered to achieve early detection and accurate classification.