A Proposed Ensemble Approach for Searching Hacking News Semantically

Authors

  • Do Ngoc Long
  • Nguyen The Hung
  • Nguyen Trung Dung
  • Do Van Khanh
  • Nguyen Anh Tu
  • Pham Thi Bich Van

DOI:

https://doi.org/10.54654/isj.v2i22.1033

Keywords:

semantic search, large language models, hacking news

Abstract

Efficiently searching hacking information has been a topic of great discussion in recent years, and it poses several challenges. In particular, researchers may encounter unfamiliar and potentially difficult terms, ideas, tools, and other items unique to hacking, so effective handling of synonyms and polysemy is necessary. These challenges motivate our effort to develop an effective method for semantic search over hacking information. Semantic search, built on advanced NLP techniques, has transformed information retrieval by improving the accuracy and relevance of search results. Unlike traditional lexical methods, neural models such as sentence-transformers handle synonyms and polysemy efficiently; however, processing time increases with model size. This paper proposes a novel ensemble semantic search (NESS) approach that aggregates mini and small neural embedding models, leveraging their distinct advantages. Evaluated on a dataset of over 300,000 Hacker News stories, the proposed method significantly improves ranking quality and retrieval accuracy compared with existing techniques, while requiring half the processing time of the best-performing large model. The findings underscore the trade-offs between model complexity, retrieval accuracy, and processing efficiency, offering insights for optimizing semantic search systems.
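The abstract does not spell out how NESS aggregates the result lists produced by the individual mini and small embedding models. As a minimal sketch only, one common and assumed choice for combining several rankings is reciprocal rank fusion (RRF), which rewards documents that rank highly under many models:

```python
from collections import defaultdict

def rrf_fuse(rankings, k=60):
    """Fuse several per-model rankings with Reciprocal Rank Fusion (RRF).

    rankings: list of result lists, each ordered best-first by one
              embedding model (e.g. cosine-similarity order).
    k:        smoothing constant; larger k flattens the rank rewards.
    Returns doc ids ordered by fused score, best first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each model contributes 1/(k + rank) for every doc it returns.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: three small embedding models rank five
# Hacker News stories differently; fusion reconciles them.
model_a = ["s1", "s3", "s2", "s5", "s4"]
model_b = ["s3", "s1", "s4", "s2", "s5"]
model_c = ["s1", "s2", "s3", "s4", "s5"]
fused = rrf_fuse([model_a, model_b, model_c])
# fused[0] == "s1": ranked first by two of the three models.
```

The per-model rankings would in practice come from encoding the query and stories with each sentence-transformer and sorting by cosine similarity; the model names and aggregation rule above are illustrative assumptions, not the paper's exact method.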


References

Sun, Nan, et al. (2023). Cyber threat intelligence mining for proactive cybersecurity defense: a survey and new perspectives. IEEE Communications Surveys & Tutorials.

Thakur, Manikant. Cyber security threats and countermeasures in digital age. Journal of Applied Science and Education (JASE) 4.1 (2024): 1-20.

Benjamin, Victor, and Hsinchun Chen. "Developing understanding of hacker language through the use of lexical semantics." 2015 IEEE International Conference on Intelligence and Security Informatics (ISI). IEEE, 2015.

Li, Ying, et al. "NEDetector: Automatically extracting cybersecurity neologisms from hacker forums." Journal of Information Security and Applications 58 (2021): 102784.

Satyapanich, Taneeya, Tim Finin, and Francis Ferraro. "Extracting rich semantic information about cybersecurity events." 2019 IEEE International Conference on Big Data (Big Data). IEEE, 2019.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781.

Mitra, B., & Craswell, N. (2018). An introduction to neural information retrieval. Foundations and Trends in Information Retrieval, 13(1), 1-126.

Mehrish, A., Majumder, N., Bharadwaj, R., Mihalcea, R., & Poria, S. (2023). A review of deep learning techniques for speech processing. Information Fusion, 101869.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. Technical report, OpenAI.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998-6008).

Hugo Touvron, Thibaut Lavril, Xavier Martinet, et al. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.

Tom Kenter, Alexey Borisov, and Maarten de Rijke. 2016. Siamese CBOW: Optimizing word embeddings for sentence representations. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 941-951).

Siva Reddy, Danqi Chen, and Christopher D. Manning. 2019. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7, 249-266.

Benjamin Nye, Junyi Jessy Li, Roma Patel, et al. 2018. A corpus with multi-level annotations of patients, interventions and outcomes to support language processing for medical literature. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 197-207).

Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM) (pp. 55-64).

JD Prater. AI-Powered Semantic Search: Everything You Need to Know. https://www.graft.com/blog/the-future-is-semantic-transforming-search-in-the-age-of-ai (2023).

Trotman, Andrew, Antti Puurula, and Blake Burgess (2014). Improvements to BM25 and language models examined. Proceedings of the 2014 Australasian Document Computing Symposium (ADCS 2014) (pp. 58-65).

Stephen E. Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval 3, 4 (2009), 333-389.

Zhu, Y., Yuan, H., Wang, S., Liu, J., Liu, W., Deng, C., Dou, Z., & Wen, J. 2023. Large Language Models for Information Retrieval: A Survey. arXiv, abs/2308.07107.

Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. 2021. MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2140–2151, Online. Association for Computational Linguistics.

Reimers, Nils and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (pp. 3982-3992).

Khattab, Omar, and Matei Zaharia. 2020. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 39-48).

Sanh, Victor et al. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv abs/1910.01108.

Chen, J., Xiao, S., Zhang, P., Luo, K., Lian, D., & Liu, Z. 2024. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv, abs/2402.03216.

Ryan Michael. kerinin/hackernews-stories. huggingface.co/datasets/kerinin/hackernews-stories.


Published

2024-10-01

How to Cite

Long, D. N., Hung, N. T., Dung, N. T., Khanh, D. V., Tu, N. A., & Van, P. T. B. (2024). A Proposed Ensemble Approach for Searching Hacking News Semantically. Journal of Science and Technology on Information Security, 2(22), 83-92. https://doi.org/10.54654/isj.v2i22.1033

Section

Papers