Detection of source code vulnerabilities using Nature language processing and deep graph network

Bùi Văn Cong; Do Xuan Cho; Do Trung Tuan

doi:10.54654/isj.v3i23.1057

Authors

Bui Van Cong, Department of Information Technology, University of Economics and Technical Industries
Do Xuan Cho
Do Trung Tuan

Keywords:

Model, classification, graph, neural network, BERT

Abstract

The software industry benefits from automated code generation techniques, yet the resulting code can contain vulnerabilities. This research presents a hybrid model, termed GBD, for detecting vulnerabilities in software written in C and C++. It integrates a Graph Convolutional Network (GCN), Bidirectional Encoder Representations from Transformers (BERT), and Dropout. During Phase 2 of the GBD model, the following tasks are executed concurrently: (i) extracting node and edge features using the GCN; (ii) deriving segment features using the BERT model; (iii) constructing a source code profile via the Code Property Graph (CPG). Phase 3 applies Dropout to mitigate overfitting. Phase 4 is the classifier that determines whether the source code contains vulnerabilities. Experimental results show that the proposed model outperforms alternative methods, attaining a prediction accuracy of 61.21% for vulnerable code and 88.94% for normal files. The classification results also show that, with a token length of 512, the GBD model yields the most balanced results across all metrics: Accuracy (86.65%), Precision (38.59%), Recall (66.21%), and F1-score (48.76%). This is consistent with our analysis of the Verum experimental dataset, which indicates that over 70% of the source code files have lengths greater than 256 and less than 512 tokens. Furthermore, the GBD model performs strongly on both individual and multiple datasets. For example, on the Verum dataset, the GBD model surpasses five alternative methods (REVEAL [1], Russell [2], VulDeePecker [3], SySeVR [4], and Devign [5]) by 4% in Accuracy and by between 15% and 57% in Precision, Recall, and F1-score. Compared with SySeVR [4], the GBD model is better by 3% to 25% across all metrics. Compared with Devign [5], GBD achieves improvements of 5% to 39% in Precision, Recall, and F1-score.
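As an illustration of how the phases described above could fit together, the following is a minimal NumPy sketch and not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names, the undirected adjacency matrix, mean pooling over nodes, fusion by concatenation, the dropout rate of 0.5, the sigmoid output, and the random weights. The `text_embedding` vector is a placeholder standing in for a BERT output; no BERT model is loaded here. The graph convolution follows the propagation rule of Kipf and Welling's GCN.

```python
import numpy as np


def normalized_adjacency(adj: np.ndarray) -> np.ndarray:
    """Return D^{-1/2} (A + I) D^{-1/2} for an undirected adjacency matrix A."""
    a_hat = adj + np.eye(adj.shape[0])          # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(adj_norm: np.ndarray, features: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    return np.maximum(adj_norm @ features @ weight, 0.0)


def dropout(x: np.ndarray, rate: float, rng: np.random.Generator, training: bool = True) -> np.ndarray:
    """Inverted dropout: zero each unit with probability `rate`, rescale survivors."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)


def classify(adj, node_feats, text_embedding, params, rng, training=False):
    """Fuse graph and text features, apply dropout, return P(vulnerable)."""
    hidden = gcn_layer(normalized_adjacency(adj), node_feats, params["w_gcn"])
    graph_vec = hidden.mean(axis=0)             # assumed pooling: mean over nodes
    fused = np.concatenate([graph_vec, text_embedding])
    fused = dropout(fused, rate=0.5, rng=rng, training=training)
    logit = fused @ params["w_out"] + params["b_out"]
    return 1.0 / (1.0 + np.exp(-logit))         # sigmoid for a binary decision


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical 3-node graph with 4 features per node.
    adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
    node_feats = rng.normal(size=(3, 4))
    text_embedding = rng.normal(size=8)         # placeholder for a BERT vector
    params = {
        "w_gcn": rng.normal(scale=0.1, size=(4, 5)),
        "w_out": rng.normal(scale=0.1, size=5 + 8),
        "b_out": 0.0,
    }
    print(classify(adj, node_feats, text_embedding, params, rng))
```

With `training=False` the dropout step is the identity, so repeated calls on the same input return the same probability; with `training=True` each call samples a fresh mask.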
On the FFmpeg+Qemu dataset, the GBD model improves Accuracy by 0.2% to 10% over all other studies. In Precision, GBD surpasses alternative methods by 0.3% to 9%. In Recall, GBD is 1.5% below REVEAL but surpasses all other methods by 10% to over 31%. In F1-score, GBD is 0.3% below REVEAL but surpasses other studies by 7% to 30%. The results indicate that the GBD model is effective on both individual and multiple datasets.
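The four metrics quoted in the abstract are related through the confusion matrix, and the short Python sketch below shows the standard definitions. The function names are chosen for illustration, and the confusion-matrix counts in the usage example are hypothetical, not taken from the paper; they only show how high Accuracy can coexist with low Precision when vulnerable samples are rare. The one check that does use the reported figures is the F1-score: the harmonic mean of Precision 38.59% and Recall 66.21% reproduces the reported 48.76%.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
    }


if __name__ == "__main__":
    # Figures reported in the abstract for token length 512.
    print(round(f1_score(38.59, 66.21), 2))  # 48.76

    # Hypothetical imbalanced test set: 100 vulnerable, 1000 normal samples.
    print(confusion_metrics(tp=66, fp=105, tn=895, fn=34))
```

In the hypothetical counts, most samples are normal and correctly classified, so Accuracy stays high even though more than half of the samples flagged as vulnerable are false positives.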

References

S. Chakraborty, R. Krishna, Y. Ding and B. Ray, “Deep Learning based Vulnerability Detection: Are We There Yet?”, IEEE Transactions on Software Engineering, vol. 48, no. 9, pp. 3280-3296, 2022, doi: 10.1109/TSE.2021.3087402.

R. L. Russell, L. Kim, L. H. Hamilton, T. Lazovich, J. A. Harer, O. Ozdemir, P. M. Ellingwood and M. W. McConley, “Automated Vulnerability Detection in Source Code Using Deep Representation Learning”, In: 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), 2018; pp. 757-762, doi: 10.1109/ICMLA.2018.00120.

Z. Li, D. Zou, S. Xu, X. Ou, H. Jin, S. Wang, Z. Deng and Y. Zhong, “VulDeePecker: A Deep Learning-Based System for Vulnerability Detection”, in Network and Distributed Systems Security (NDSS) Symposium 2018, 18-21 February 2018, San Diego, CA, USA, https://arxiv.org/abs/1801.01681.

Z. Li, D. Zou, S. Xu, H. Jin, Y. Zhu and Z. Chen, “SySeVR: A Framework for Using Deep Learning to Detect Software Vulnerabilities”, in IEEE Transactions on Dependable and Secure Computing, vol. 19, no. 4, pp. 2244-2258, July-Aug. 2022, doi: 10.1109/TDSC.2021.3051525.

Y. Zhou, S. Liu, J. Siow, X. Du, and Y. Liu, “Devign: Effective vulnerability identification by learning comprehensive program semantics via graph neural networks”, in 33rd Int. Conf. Neural Inf. Process. Syst., Red Hook, NY, USA, no. 915, pp. 10197–10207, Dec. 2019.

NIST, Artificial intelligence (2022), Access time: 20/06/2024, https://www.nist.gov/.

B. Casey, J. C. S. Santos, G. Perry, “A Survey of Source Code Representations for Machine Learning-Based Cybersecurity Tasks”, ACM Comput. Surv. 37, vol. 4, no. 111, 35 pages, March 2024, doi: 10.48550/arXiv.2403.10646.

CVE, All News, https://www.cve.org/Media/News/AllNews (2024), Access time: 20/06/2024.

CWE, CWE Top 25 Most Dangerous Software Weaknesses (2021), Access time: 20/06/2024, https://cwe.mitre.org/top25/archive/2021/2021_cwe_top25.html.

C. D. Xuan, D. H. Mai, M. C. Thanh and B. V. Cong, “A novel approach for software vulnerability detection based on intelligent cognitive computing”, The Journal of Supercomputing, vol. 79, pp. 17042–17078, 2023. https://doi.org/10.1007/s11227-023-05282-4.

J. C. S. Santos, K. Tarrit and M. Mirakhorli, “A Catalog of Security Architecture Weaknesses”, Conference: 2017 IEEE International Conference on Software Architecture Workshops (ICSAW), 2017, pp. 220–223.

W. Cai, J. Chen, J. Yu and L. Gao, “A software vulnerability detection method based on deep learning with complex network analysis and subgraph partition”, in Information and Software Technology, vol. 164, no. 7, December 2023, doi: 10.1016/j.infsof.2023.10732.

H. Wang, G. Ye, Z. Tang, S. H. Tan, S. Huang and D. Fang, “Combining Graph-Based Learning With Automated Data Collection for Code Vulnerability Detection”, in IEEE Transactions on Information Forensics and Security, vol. 16, pp. 1943-1958, 2021, doi: 10.1109/TIFS.2020.3044773.

H. Wei and M. Li, “Supervised deep features for software functional clone detection by exploiting lexical and syntactical information in source code”, in Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, Melbourne, Australia, pp. 3034–3040, August 2017.

X. Li, L. Wang, Y. Xin, Y. Yang, Q. Tang and Y. Chen, “Automated Software Vulnerability Detection Based on Hybrid Neural Network”, Appl. Sci. 2021, vol. 11, no. 7, pp. 3201. https://doi.org/10.3390/app11073201.

P. Zeng, G. Lin, L. Pan, Y. Tai and J. Zhang, “Software Vulnerability Analysis and Discovery Using Deep Learning Techniques: A Survey”, in IEEE Access, vol. 8, pp. 197158-197172, 2020, doi: 10.1109/ACCESS.2020.3034766.

V. K. Linh, N. V. Hung, T. N. Anh, D. D. Nhuan and D. C. Hien, “Enhance deep learning model for malware detection with a new image representation method”, the Journal of Science and Technology on Information security, vol. 21, no. 1, pp. 31-39, 2024, doi: https://doi.org/10.54654/isj.v1i21.1000.

F. Yamaguchi, N. Golde, D. Arp and K. Rieck, “Modeling and Discovering Vulnerabilities with Code Property Graphs”, IEEE Symposium on Security and Privacy, Berkeley, CA, USA, 2014, doi: 10.1109/SP.2014.44

J. Devlin, M. W. Chang, K. Lee and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, 2018, pp. 4171-4186, arXiv:1810.04805.

N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique”, Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.

B. Liu, W. Guan, C. Yang, Z. Fang and Z. Lu, “Transformer and Graph Convolutional Network for Text Classification”, International Journal of Computational Intelligence Systems, vol. 16, October 2023. https://doi.org/10.1007/s44196-023-00337-z.

F. Subhan, X. Wu, L. Bo, X. Sun and M. Rahman, “A deep learning‐based approach for software vulnerability detection using code metrics”, Institution of Engineering and Technology - IET, vol. 16, pp. 516-526, 2022.

V. C. Bui and X. C. Do, “Detecting software vulnerabilities based on source code analysis using GCN transformer”, in 2023 RIVF Int. Conf. Comput. Commun. Technol. (RIVF), pp.112–117, 2023.

G. Tang, L. Yang, L. Zhang, W. Cao, L. Meng, H. He, H. Kuang, F. Yang and H. Wang, “An attention-based automatic vulnerability detection approach with GGNN”, Int. J. Mach. Learn. & Cyber, vol. 14, pp. 3113–3127, 2023, https://doi.org/10.1007/s13042-023-01824-7

Download Ffmpeg, Access time: 20/06/2024, https://ffmpeg.org/download.html.

T. T. Nguyen and H. D. Vo, “Context-based statement-level vulnerability localization”, Information and Software Technology, vol. 169, 107406 pages, 2024.

JOERN, The Bug Hunter's Workbench (2024), Access time: 20/06/2024, https://joern.io/.

B. Chernis and R. Verma, “Machine Learning Methods for Software Vulnerability Detection”, in IWSPA '18: Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics, March 19–21, 2018, Tempe, AZ, USA, pp. 31-39. https://doi.org/10.1145/3180445.3180453.

Q. Li, J. Song, D. Tan, H. Wang and J. Liu, “PDGraph: A Large-Scale Empirical Study on Project Dependency of Security Vulnerabilities”, in 2021 51st Annual IEEE/IFIP Int. Conf. Depen. Sys. Net. (DSN), pp.161–173, 2021.

T. N. Kipf and M. Welling, “Semi-Supervised Classification with Graph Convolutional Networks”, International Conference on Learning Representations, 9 September 2016, doi: 10.48550/arXiv.1609.02907.

K. Yang, P. Miller and J. Martinez-Del-Rincon, “Convolutional Neural Network for Software Vulnerability Detection”, in IEEE Transactions on Information Forensics and Security, 2022, DOI: 10.1109/Cyber-CI55324.2022.10032684

J. Chen, Y. Yin, S. Cai, W. Wang, S. Wang and J. Chen, “iGnnVD: A novel software vulnerability detection model based on integrated graph neural networks”, Science of Computer Programming, vol. 238, pp. 103156, 2024.

H. Wang, Z. Qu and L. Sun, “E-GVD: Efficient Software Vulnerability Detection Techniques Based on Graph Neural Network”, ICST Transactions on Scalable Information Systems, vol. 11, March 2024. doi:10.4108/eetsis.5056

N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever and R. Salakhutdinov, “Dropout: A Simple Way to Prevent Neural Networks from Overfitting”, Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929-1958, 2014.

P. Baldi and P. J. Sadowski, “Understanding Dropout”, in Proceedings of the Advances in Neural Information Processing Systems 26, Red Hook, NY, USA, December 2013.

X. Li, S. Chen, X. Hu and J. Yang, “Understanding the Disharmony Between Dropout and Batch Normalization by Variance Shift”, in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2677-2685, doi: 10.1109/CVPR.2019.00279.

K. Duan, S. S. Keerthi, W. Chu, S. K. Shevade and A. N. Poo, “Multi-category Classification by Soft-Max Combination of Binary Classifiers”, in Proceedings of the 4th International Workshop, MCS 2003, Guildford, UK, 11–13 June 2003, pp. 125–134, doi: 10.1007/3-540-44938-8_13.

X. Xu, C. Liu, Q. Feng, H. Yin, L. Song and D. Song, “Neural network-based graph embedding for cross-platform binary code similarity detection”, in Proc. ACM SIGSAC Conf. Comput. Commun. Secur., pp. 363–376, Oct. 2017.

Y. Li, S. Wang and T. N. Nguyen, “Vulnerability detection with fine-grained interpretations”, Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pp. 292–303, August 2021. https://doi.org/10.1145/3468264.3468597.
