Detection of source code vulnerabilities using natural language processing and deep graph network
DOI:
https://doi.org/10.54654/isj.v3i23.1057
Keywords:
Model, classification, graph, neural network, BERT
Abstract
Automated code generation benefits the software industry, but the resulting code can contain vulnerabilities. This research presents a hybrid model, termed GBD, for detecting vulnerabilities in software written in C and C++. It integrates a Graph Convolutional Network (GCN), Bidirectional Encoder Representations from Transformers (BERT), and Dropout. In Phase 2 of the GBD model, the following tasks are executed concurrently: (i) extracting node and edge features with the GCN; (ii) deriving segment features with the BERT model; (iii) constructing a source code profile via the Code Property Graph (CPG). Phase 3 applies the Dropout strategy to mitigate overfitting. Phase 4 is the classifier that determines whether the source code contains vulnerabilities. Experimental results show that the proposed model outperforms alternative methods, attaining a prediction accuracy of 61.21% for vulnerable code and 88.94% for normal files. The classification results also show that, with a token length of 512, the GBD model yields the most balanced results across all metrics: Accuracy (86.65%), Precision (38.59%), Recall (66.21%), and F1-score (48.76%). This is consistent with our analysis of the Verum experimental dataset, in which over 70% of the source code files have lengths greater than 256 but less than 512. Furthermore, the GBD model performs strongly on both individual and multiple datasets. For example, on the Verum dataset, the GBD model surpasses five alternative methods (REVEAL [1], Russell [2], VulDeePecker [3], SySeVR [4], and Devign [5]) by 4% in Accuracy and by between 15% and 57% in Precision, Recall, and F1-score. Compared with SySeVR [4], the GBD model is better by 3% to 25% across all metrics. Compared with Devign [5], GBD achieves improvements of 5% to 39% in Precision, Recall, and F1-score.
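The phases described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the toy graph, the mean pooling, the concatenation of graph and text features, the weights, and all function names are assumptions made for illustration, and `text_embedding` is a hand-written stand-in for a real BERT segment embedding. The graph convolution follows the normalized propagation rule of Kipf and Welling.

```python
import math
import random

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a dense adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def gcn_layer(adj_norm, feats, weight):
    """One graph convolution: ReLU(A_norm @ X @ W)."""
    hidden = matmul(matmul(adj_norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in hidden]

def mean_pool(node_feats):
    """Average node vectors into one graph-level vector (assumed readout)."""
    n = len(node_feats)
    return [sum(col) / n for col in zip(*node_feats)]

def dropout(vec, rate, rng, training=True):
    """Inverted dropout: zero units with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]

def classify(vec, weights, bias):
    """Logistic output read as the probability that the code is vulnerable."""
    z = sum(w * v for w, v in zip(weights, vec)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical 3-node code graph (edges 0-1 and 1-2), 2 features per node.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
node_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gcn_weight = [[0.5, -0.2], [0.1, 0.3]]
text_embedding = [0.2, -0.1, 0.4]  # stand-in for a BERT segment embedding

# Phase 2: graph features and text features, fused by concatenation (assumed).
adj_norm = normalize_adjacency(adj)
graph_vec = mean_pool(gcn_layer(adj_norm, node_feats, gcn_weight))
fused = graph_vec + text_embedding

# Phase 3: dropout is active only during training.
rng = random.Random(0)
dropped = dropout(fused, 0.5, rng, training=True)

# Phase 4: classifier applied at inference time (dropout disabled).
prob = classify(dropout(fused, 0.5, rng, training=False), [0.3] * 5, 0.0)
```

In a real system the GCN weights and the classifier would be trained jointly, and the text vector would come from a pretrained BERT encoder applied to the tokenized source code.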
On the FFmpeg+Qemu dataset, the GBD model improves Accuracy by 0.2% to 10% over all other studies. In terms of Precision, GBD surpasses alternative methods by 0.3% to 9%. In terms of Recall, GBD is 1.5% below REVEAL but surpasses all other methods by 10% to over 31%. In terms of F1-score, GBD is 0.3% below REVEAL but surpasses other studies by 7% to 30%. The results indicate that the GBD model is effective on both individual and multiple datasets.
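The metrics quoted in the abstract are related by standard formulas, which makes a quick consistency check possible. The sketch below is illustrative only: the function names and the confusion-matrix counts are hypothetical, and only the Precision and Recall values for token length 512 are taken from the abstract.

```python
def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall, f1_from(precision, recall)

# Cross-check of the reported values for token length 512:
# Precision 38.59% and Recall 66.21% give an F1-score of 48.76%.
reported_f1 = round(100 * f1_from(0.3859, 0.6621), 2)

# Hypothetical counts, only to show how the four metrics are derived.
accuracy, precision, recall, f1 = classification_metrics(tp=2, fp=1, tn=6, fn=1)
```

Accuracy cannot be recovered from Precision and Recall alone because it also depends on the number of true negatives, which the abstract does not report.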
References
S. Chakraborty, R. Krishna, Y. Ding and B. Ray, “Deep Learning based Vulnerability Detection: Are We There Yet?”, IEEE Transactions on Software Engineering, vol. 48, no. 9, pp. 3280-3296, 2022, doi: 10.1109/TSE.2021.3087402.
R. L. Russell, L. Kim, L. H. Hamilton, T. Lazovich, J. A. Harer, O. Ozdemir, P. M. Ellingwood and M. W. McConley, “Automated Vulnerability Detection in Source Code Using Deep Representation Learning”, In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018; pp. 757-762, doi: 10.1109/ICMLA.2018.00120.
Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng and Y. Zhong, “VulDeePecker: A Deep Learning-Based System for Vulnerability Detection”, in Network and Distributed Systems Security (NDSS) Symposium 2018, 18-21 February 2018, San Diego, CA, USA, https://arxiv.org/abs/1801.01681.
Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu and Z. Chen, “SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities”, in IEEE Transactions on Dependable and Secure Computing, vol. 19, no. 4, pp. 2244-2258, July-Aug. 2022, doi: 10.1109/TDSC.2021.3051525.
Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks”, in 33rd Int. Conf. Neural Inf. Process. Syst., Red Hook, NY, USA, no. 915, pp. 10197–10207, Dec. 2019.
NIST, Artificial intelligence (2022), Access time: 20/06/2024, https://www.nist.gov/.
B. Casey, J. C. S. Santos, G. Perry, “A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks”, ACM Comput. Surv. 37, vol. 4, no. 111, 35 pages, March 2024. doi: 10.48550/arXiv.2403.10646.
CVE, All News, https://www.cve.org/Media/News/AllNews (2024), Access time: 20/06/2024.
CWE, CWE Top 25 Most Dangerous Software Weaknesses (2021), Access time: 20/06/2024, https://cwe.mitre.org/top25/archive/2021/2021_cwe_top25.html.
C. D. Xuan, D. H. Mai, M. C. Thanh and B. V. Cong, “A novel approach for software vulnerability detection based on intelligent cognitive computing”, the Journal of Supercomputing, vol 79, pp. 17042–17078, 2023. https://doi.org/10.1007/s11227-023-05282-4.
J. C. S. Santos, K. Tarrit and M. Mirakhorli, “A Catalog of Security Architecture Weaknesses”, Conference: 2017 IEEE International Conference on Software Architecture Workshops (ICSAW), 2017, pp. 220–223.
W. Cai, J. Chen, J. Yu and L. Gao, “A software vulnerability detection method based on deep learning with complex network analysis and subgraph partition”, in Information and Software Technology, vol. 164, no. 7, December 2023, doi: https://doi.org/10.1016/j.infsof.2023.10732.
H. Wang, G. Ye, Z. Tang, S. H. Tan, S. Huang and D. Fang, “Combining Graph-Based Learning With Automated Data Collection for Code Vulnerability Detection”, in IEEE Transactions on Information Forensics and Security, vol. 16, pp. 1943-1958, 2021, doi: 10.1109/TIFS.2020.3044773.
H. Wei and M. Li, “Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code”, in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, pp. 3034–3040, August 2017.
X. Li, L. Wang, Y. Xin, Y. Yang, Q. Tang and Y. Chen, “Automated Software Vulnerability Detection Based on Hybrid Neural Network”, Appl. Sci. 2021, vol. 11, no. 7, pp. 3201. https://doi.org/10.3390/app11073201.
P. Zeng, G. Lin, L. Pan, Y. Tai and J. Zhang, “Software Vulnerability Analysis and Discovery Using Deep Learning Techniques: A Survey”, in IEEE Access, vol. 8, pp. 197158-197172, 2020, doi: 10.1109/ACCESS.2020.3034766.
V. K. Linh, N. V. Hung, T. N. Anh, D. D. Nhuan and D. C. Hien, “Enhance deep learning model for malware detection with a new image representation method”, the Journal of Science and Technology on Information security, vol. 21, no. 1, pp. 31-39, 2024, doi: https://doi.org/10.54654/isj.v1i21.1000.
F. Yamaguchi, N. Golde, D. Arp and K. Rieck, “Modeling and Discovering Vulnerabilities with Code Property Graphs”, IEEE Symposium on Security and Privacy, Berkeley, CA, USA, 2014, doi: 10.1109/SP.2014.44
J. Devlin, M. W. Chang, K. Lee and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2018, pp. 4171-4186, arXiv:1810.04805.
N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique”, Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.
B. Liu, W. Guan, C. Yang, Z. Fang and Z. Lu, “Transformer and Graph Convolutional Network for Text Classification”, International Journal of Computational Intelligence Systems, vol. 16, October 2023. https://doi.org/10.1007/s44196-023-00337-z.
F. Subhan, X. Wu, L. Bo, X. Sun and M. Rahman, “A deep learning‐based approach for software vulnerability detection using code metrics”, Institution of Engineering and Technology - IET, vol. 16, pp. 516-526, 2022.
V. C. Bui and X. C. Do, “Detecting software vulnerabilities based on source code analysis using GCN transformer”, in 2023 RIVF Int. Conf. Comput. Commun. Technol. (RIVF), pp.112–117, 2023.
G. Tang, L. Yang, L. Zhang, W. Cao, L. Meng, H. He, H. Kuang, F. Yang and H. Wang, “An attention-based automatic vulnerability detection approach with GGNN”, Int. J. Mach. Learn. & Cyber, vol. 14, pp. 3113–3127, 2023, https://doi.org/10.1007/s13042-023-01824-7
Download Ffmpeg, Access time: 20/06/2024, https://ffmpeg.org/download.html.
T. T. Nguyen and H. D. Vo, “Context-based statement-level vulnerability localization”, Information and Software Technology, vol. 169, 107406 pages, 2024.
JOERN, The Bug Hunter's Workbench (2024), Access time: 20/06/2024, https://joern.io/.
B. Chernis and R. Verma, “Machine Learning Methods for Software Vulnerability Detection”, in IWSPA '18: Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics, March 19–21, 2018, Tempe, AZ, USA, pp. 31-39. https://doi.org/10.1145/3180445.3180453.
Q. Li, J. Song, D. Tan, H. Wang and J. Liu, “PDGraph: A Large-Scale Empirical Study on Project Dependency of Security Vulnerabilities”, in 2021 51st Annual IEEE/IFIP Int. Conf. Depen. Sys. Net. (DSN), pp.161–173, 2021.
T. N. Kipf and M. Welling, “Semi-Supervised Classification with Graph Convolutional Networks”, International Conference on Learning Representations, 9 September 2016, doi: 10.48550/arXiv.1609.02907.
K. Yang, P. Miller and J. Martinez-Del-Rincon, “Convolutional Neural Network for Software Vulnerability Detection”, in IEEE Transactions on Information Forensics and Security, 2022, DOI: 10.1109/Cyber-CI55324.2022.10032684
J. Chen, Y. Yin, S. Cai, W. Wang, S. Wang and J. Chen, “iGnnVD: A novel software vulnerability detection model based on integrated graph neural networks”, Science of Computer Programming, vol. 238, pp. 103156, 2024.
H. Wang, Z. Qu and L. Sun, “E-GVD: Efficient Software Vulnerability Detection Techniques Based on Graph Neural Network”, ICST Transactions on Scalable Information Systems, vol. 11, March 2024. doi:10.4108/eetsis.5056
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929-1958, 2014.
P. Baldi and P. J. Sadowski, “Understanding Dropout”, In: Proceedings in the Advances in Neural Information Processing Systems 26. Red Hook, NY, USA, December 2013.
X. Li, S. Chen, X. Hu and J. Yang, “Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift”, In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019; pp. 2677-2685, doi: 10.1109/CVPR.2019.00279.
K. Duan, S. S. Keerthi, W. Chu, S. K. Shevade and A. N. Poo, “Multi-category Classification by Soft-Max Combination of Binary Classifiers”, In proceedings of the 4th International Workshop, MCS 2003 Guildford, UK, 11–13, pp 125–134, June 2003. doi: 10.1007/3-540-44938-8_13.
X. Xu, C. Liu, Q. Feng, H. Yin, L. Song and D. Song, “Neural network-based graph embedding for cross-platform binary code similarity detection”, in Proc. ACM SIGSAC Conf. Comput. Commun. Secur, pp. 363–376, Oct. 2017.
Y. Li, S. Wang and T. N. Nguyen, “Vulnerability detection with fine-grained interpretations”, Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 292–303, August 2021. https://doi.org/10.1145/3468264.3468597.
License
Proposed Policy for Journals That Offer Open Access
Authors who publish with this journal agree to the following terms:
1. Authors retain copyright and grant the journal right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
2. Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
3. Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work (See The Effect of Open Access).