Evaluating the Efficiency of Vietnamese SMS Spam Detection Techniques

Authors

  • Vu Minh Tuan
  • Nguyen Xuan Thang
  • Tran Quang Anh

DOI:

https://doi.org/10.54654/isj.v1i18.932

Keywords:

SMS Spam, Vietnamese SMS Spam, machine learning, deep learning, transfer learning, PhoBert

Abstract

This paper evaluates the efficiency of Vietnamese SMS spam detection methods on different variants of Vietnamese datasets, using both traditional machine learning models and deep learning models. The researchers experimented with five algorithms on three Vietnamese datasets: Support Vector Machine (SVM), Naive Bayes (NB), Random Forests (RF), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM). The findings show that the LSTM and CNN models, supported by the transfer learning model PhoBert, were more efficient than the traditional machine learning models. The LSTM model achieved the highest accuracy, 97.77%, on the full-accent (fully diacritized) Vietnamese dataset. Similarly, the CNN model and the PhoBert model achieved the highest accuracy, 95.56%, on the non-diacritic Vietnamese dataset.
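To make the distinction between the full-accent and non-diacritic dataset variants concrete, the following is a minimal Python sketch of how a non-diacritic variant could be derived from accented Vietnamese text. It is an illustration only: the abstract does not say how the authors built their datasets, and the function name `strip_diacritics` and the sample message are hypothetical.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Return text with Vietnamese tone and vowel marks removed.

    Hypothetical helper for illustration; not the authors' preprocessing.
    """
    # 'đ' and 'Đ' have no canonical decomposition, so map them by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks (Unicode category Mn), keep everything else.
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose so the result is in the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


# Hypothetical spam-like message, used only to show the transformation.
sample = "Chúc mừng bạn đã trúng thưởng"
print(strip_diacritics(sample))
```

Removing diacritics collapses distinct Vietnamese words into the same ASCII string, which is one plausible reason a classifier would score lower on the non-diacritic variant than on the full-accent one.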

Published

2023-06-23

How to Cite

Tuấn, V. M., Thắng, N. X., & Anh, T. Q. (2023). Evaluating the Efficiency of Vietnamese SMS Spam Detection Techniques. Journal of Science and Technology on Information Security, 1(18), 30-37. https://doi.org/10.54654/isj.v1i18.932

Section

Papers