Named entity recognition in Vietnamese document using machine learning and application in ensuring cyber security

Authors

  • Nguyễn Ngọc Toàn, People's Security Academy
  • Lê Xuân Tuấn
  • Lương Thế Dũng
  • Trần Nghi Phú

DOI:

https://doi.org/10.54654/isj.v1i16.824

Keywords:

named entity recognition, NER system, machine learning, Vietnamese text, negative, reactionary

Abstract

Named Entity Recognition (NER) in Vietnamese documents is currently a challenging task because standard datasets are lacking, or the available datasets are not large enough. Moreover, recognition models are mostly built on deep learning methods. In this paper, we present a systematic approach to building entity recognition models for Vietnamese documents, beginning with collecting and building datasets and then applying and refining machine learning models. In addition, we propose several application scenarios that demonstrate the capability of our model in dealing with information security problems. Specifically, we built a dataset of more than 5000 Vietnamese-language documents collected from social networks, identified the entities in these documents and assigned each of them 1 of 4 predefined labels, and then applied the pre-trained model XLM-RoBERTa, fine-tuned with appropriately chosen initial parameters, to recognize these entities. Preliminary results show that the proposed system recognizes entities effectively, achieving an F1-measure of up to 95.6%, which is better than some NER systems currently available for Vietnamese documents when evaluated on the same dataset that we built. The proposed model is currently being used to build support systems for cybersecurity protection.
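The F1-measure reported above can be illustrated with a minimal sketch. The abstract does not state how the score is computed, so this example assumes the common entity-level, exact-match convention over BIO-tagged tokens. The label names (`PER`, `ORG`, `LOC`) and the helper names `extract_spans` and `entity_f1` are hypothetical and are not taken from the paper, which does not name its 4 labels here.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive); hypothetical representation
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO tag sequence."""
    spans: Set[Span] = set()
    start, label = None, None
    # A trailing "O" sentinel closes any span still open at the end.
    for i, tag in enumerate(tags + ["O"]):
        prefix, _, kind = tag.partition("-")
        # An open span continues only on an I- tag of the same type.
        if label is not None and not (prefix == "I" and kind == label):
            spans.add((label, start, i))
            start, label = None, None
        # B- always opens a span; a stray I- is treated leniently as an opening.
        if prefix == "B" or (prefix == "I" and label is None):
            start, label = i, kind
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 over exactly matching entity spans (assumed metric)."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = extract_spans(gold_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(gold_spans & pred_spans)   # same type and same boundaries
        fp += len(pred_spans - gold_spans)   # predicted but not in gold
        fn += len(gold_spans - pred_spans)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: one sentence where one of two entities is labeled with the wrong type.
gold = [["B-PER", "I-PER", "O", "B-ORG"]]
pred = [["B-PER", "I-PER", "O", "B-LOC"]]
score = entity_f1(gold, pred)
```

Under this convention an entity counts as correct only when both its boundaries and its type match, so the toy prediction above gets one true positive, one false positive, and one false negative.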

Published

2023-02-13

How to Cite

Toàn, N. N., Tuấn, L. X., Dũng, L. T., & Phú, T. N. (2023). Named entity recognition in Vietnamese document using machine learning and application in ensuring cyber security. Journal of Science and Technology on Information Security, 2(16), 39-49. https://doi.org/10.54654/isj.v1i16.824

Issue

Section

Papers