Volume 26, Issue 3

Cybersecurity Named Entity Recognition Using Bidirectional Long Short-Term Memory with Conditional Random Fields

Pingchuan Ma, Bo Jiang, Zhigang Lu, Ning Li, Zhengwei Jiang
School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100049, China.
Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093, China.

Abstract

Network texts have become important carriers of cybersecurity information on the Internet. These texts report the latest security events, such as vulnerability exploitations, attack discoveries, and advanced persistent threats. Extracting cybersecurity entities from these unstructured texts is a critical and fundamental task in many cybersecurity applications. However, most Named Entity Recognition (NER) models are suited only to general domains, and little research has focused on entity extraction in the security domain. To this end, we propose a novel cybersecurity entity identification model based on Bidirectional Long Short-Term Memory with Conditional Random Fields (Bi-LSTM with CRF) to extract security-related concepts and entities from unstructured text. This model, which we have named XBiLSTM-CRF, consists of a word-embedding layer, a bidirectional LSTM layer, and a CRF layer, and concatenates X input with bidirectional LSTM output. Via extensive experiments on an open-source dataset containing an office security bulletin, security blogs, and the Common Vulnerabilities and Exposures list, we demonstrate that XBiLSTM-CRF achieves better cybersecurity entity extraction than state-of-the-art models.
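As a rough illustration of the data flow described above, the following pure-Python sketch concatenates per-token input vectors with per-token BiLSTM outputs, projects the result to per-tag scores, and decodes the best tag sequence with the Viterbi algorithm, as a CRF layer commonly does at inference time. It is not the authors' implementation. The tag set, the function names, the reading of "X input" as a per-token input vector, and all numeric scores are assumptions made for illustration; the BiLSTM itself is not modeled and its outputs are stand-in values.

```python
from typing import List

TAGS = ["O", "B-ENT", "I-ENT"]  # hypothetical BIO tag set, not from the paper

def concat_features(x: List[List[float]], h: List[List[float]]) -> List[List[float]]:
    """Concatenate each token's input vector x_t with its BiLSTM output h_t (assumed reading of 'X input')."""
    return [list(x_t) + list(h_t) for x_t, h_t in zip(x, h)]

def emissions(features: List[List[float]], weights: List[List[float]]) -> List[List[float]]:
    """Linear projection: one score per tag (one row of weights per tag) for every token."""
    return [[sum(w * f for w, f in zip(row, feat)) for row in weights]
            for feat in features]

def viterbi(emis: List[List[float]], trans: List[List[float]]) -> List[int]:
    """Return the tag-index sequence maximizing summed emission and transition scores."""
    n_tags = len(trans)
    score = list(emis[0])          # best score of any path ending in each tag
    backptr = []                   # backptr[t][j] = best previous tag for tag j
    for e_t in emis[1:]:
        new_score, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + trans[i][j])
            new_score.append(score[best_i] + trans[best_i][j] + e_t[j])
            ptr.append(best_i)
        score = new_score
        backptr.append(ptr)
    path = [max(range(n_tags), key=lambda j: score[j])]
    for ptr in reversed(backptr):  # follow back-pointers from the last token
        path.append(ptr[path[-1]])
    path.reverse()
    return path

# Toy scores for a three-token sentence; trans[i][j] scores moving from tag i to tag j.
# The large negative entry forbids the invalid BIO transition O -> I-ENT.
toy_emis = [[0.1, 2.0, 0.0], [0.2, 0.0, 1.5], [2.0, 0.1, 0.1]]
toy_trans = [[0.0, 0.0, -100.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print([TAGS[i] for i in viterbi(toy_emis, toy_trans)])  # ['B-ENT', 'I-ENT', 'O']
```

The transition matrix is what distinguishes this decoding from picking the best tag per token independently: with the penalty on O -> I-ENT, a token whose best individual tag is I-ENT can pull the preceding token toward B-ENT. In a trained model the transition scores would be learned rather than set by hand.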

Keywords: security blogs, Long Short-Term Memory (LSTM), Named Entity Recognition (NER)


Publication history

Received: 19 July 2019
Accepted: 26 July 2019
Published: 12 October 2020
Issue date: June 2021

Copyright

© The author(s) 2021.

Acknowledgements

This research was supported by the National Natural Science Foundation of China (Nos. 61702508, 61802404, and U1836209), the National Key Research and Development Program of China (Nos. 2018YFB0803602 and 2016QY06X1204), and the National Social Science Foundation of China (No. 19BSH022). This research was also supported by the Key Laboratory of Network Assessment Technology, Chinese Academy of Sciences, and Beijing Key Laboratory of Network Security and Protection Technology.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
