Named entity recognition in Vietnamese document using machine learning and application in ensuring cyber security
DOI: https://doi.org/10.54654/isj.v1i16.824
Keywords: named entity recognition, NER system, machine learning, Vietnamese text, negative, reactionary
Abstract: Named Entity Recognition (NER) in Vietnamese documents remains a challenging task because standard datasets are lacking or are not large enough. Moreover, recognition models are mostly built on deep learning methods. In this paper, we present a systematic approach to building entity recognition models for Vietnamese documents, beginning with collecting and constructing datasets and then applying and refining machine learning models. We also propose several application scenarios that demonstrate the capability of our model in dealing with information security problems. Specifically, we built a dataset of more than 5000 Vietnamese-language documents collected from social networks, identified the entities in the documents and assigned each one of 4 predefined labels, and then applied the pre-trained XLM-RoBERTa model with appropriately fine-tuned initial parameters to recognize these entities. Preliminary results show that the proposed system recognizes entities effectively, achieving an F1-measure of up to 95.6%, which is better than some currently available NER systems for Vietnamese documents evaluated on the same dataset that we built. The proposed model is currently used in building support systems for cybersecurity protection.
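The F1-measure reported in the abstract is commonly computed for NER at the entity level, where a predicted entity counts as correct only if both its span and its label match the gold annotation. The following is a minimal sketch of that computation, not the authors' evaluation code: the abstract does not say which tagging scheme or matching rule was used, and it does not name the 4 labels, so the BIO scheme, the exact-match rule, and the tag names `PER` and `LOC` are assumptions made for illustration.

```python
from typing import List, Set, Tuple

# (label, start index, end index exclusive); hypothetical representation
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence (assumed scheme)."""
    spans: Set[Span] = set()
    label, start = None, 0
    # A trailing "O" sentinel flushes an entity that ends the sentence.
    for i, tag in enumerate(tags + ["O"]):
        if tag.startswith("I-") and label == tag[2:]:
            continue  # still inside the current entity
        if label is not None:
            spans.add((label, start, i))
            label = None
        if tag.startswith("B-"):
            label, start = tag[2:], i
        # An I- tag with no matching open entity is ignored (strict reading).
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged exact-match F1 over all sentences."""
    tp = fp = fn = 0
    for gold_tags, pred_tags in zip(gold, pred):
        gold_spans = extract_spans(gold_tags)
        pred_spans = extract_spans(pred_tags)
        tp += len(gold_spans & pred_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example with assumed label names: one of two gold entities is found.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)
```

In this toy example precision is 1.0 and recall is 0.5, so the F1 is 2/3. A token-level F1 would give a different number, which is one reason scores from different NER systems are only comparable when the same dataset and matching rule are used.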