Volume 7, Issue 2


Unstructured Big Data Threat Intelligence Parallel Mining Algorithm

Zhihua Li1, Xinye Yu1, Tao Wei1, Junhao Qian2
1 School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
2 School of IoT Engineering, Jiangnan University, Wuxi 214122, China

Abstract

To mine threat intelligence efficiently from the large volume of open-source cybersecurity analysis reports on the web, we develop the Parallel Deep Forest-based Multi-Label Classification (PDFMLC) algorithm. First, open-source cybersecurity analysis reports are collected and converted into a standardized text format. The reports are then annotated with five tactics category labels, yielding a multi-label dataset for tactics classification. To address the low execution efficiency and limited scalability of the sequential deep forest algorithm, PDFMLC uses broadcast variables and the Lempel-Ziv-Welch (LZW) algorithm, which significantly improves its speedup ratio. PDFMLC also takes label mutual information computed from the constructed dataset as input features; this captures latent associations between labels and significantly improves classification accuracy. Finally, we present the PDFMLC-based Threat Intelligence Mining (PDFMLC-TIM) method. Experimental results show that PDFMLC scales well with the number of nodes and executes efficiently, and that PDFMLC-TIM classifies the text of cybersecurity analysis reports and extracts tactics entities to construct threat intelligence, which is output in STIX 2.1 format.
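The abstract names the Lempel-Ziv-Welch (LZW) algorithm but does not say which data PDFMLC compresses or how compression is combined with broadcast variables. As a reminder of what the technique does, the following is a minimal textbook sketch of LZW in Python. The function names `lzw_compress` and `lzw_decompress` are illustrative, and the sketch assumes input characters with code points below 256; it is not the authors' implementation.

```python
def lzw_compress(text):
    """Encode a string as a list of integer LZW codes.

    Assumption for this sketch: every character has a code point < 256.
    """
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    current = ""
    codes = []
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            # Keep extending the longest phrase already in the dictionary.
            current = candidate
        else:
            # Emit the code for the known phrase and register the new one.
            codes.append(dictionary[current])
            dictionary[candidate] = next_code
            next_code += 1
            current = ch
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decompress(codes):
    """Rebuild the original string from a list of LZW codes."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    pieces = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # The code refers to the phrase that is being defined right now.
            entry = previous + previous[0]
        else:
            raise ValueError(f"invalid LZW code: {code}")
        pieces.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(pieces)


if __name__ == "__main__":
    sample = "TOBEORNOTTOBEORTOBEORNOT"
    encoded = lzw_compress(sample)
    print(len(sample), "characters ->", len(encoded), "codes")
    print(lzw_decompress(encoded) == sample)
```

Repetitive input produces fewer codes than characters, which is presumably why compression helps when the same data must be shipped to many worker nodes; that reading is an inference from the abstract, not a statement of the paper's design.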

Keywords: unstructured big data mining, parallel deep forest, multi-label classification algorithm, threat intelligence
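The abstract states that label mutual information from the dataset is used as input features, without giving the formula or how the values are fed into the forest. One common way to quantify association between two binary labels is the mutual information I(A; B) = sum over a, b of p(a, b) log2(p(a, b) / (p(a) p(b))). The sketch below computes it for every label pair of a small binary label matrix. The function name `label_mutual_information` and the toy matrix are hypothetical, and this is only one plausible reading of "label mutual information", not the paper's definition.

```python
import math
from itertools import combinations


def label_mutual_information(label_matrix):
    """Return {(i, j): mutual information in bits} for each pair of labels.

    label_matrix is a list of rows; each row is a list of 0/1 label
    indicators for one document (hypothetical layout for this sketch).
    """
    n_samples = len(label_matrix)
    n_labels = len(label_matrix[0])
    result = {}
    for i, j in combinations(range(n_labels), 2):
        total = 0.0
        for a in (0, 1):
            p_a = sum(1 for row in label_matrix if row[i] == a) / n_samples
            for b in (0, 1):
                p_b = sum(1 for row in label_matrix if row[j] == b) / n_samples
                p_ab = sum(
                    1 for row in label_matrix if row[i] == a and row[j] == b
                ) / n_samples
                if p_ab > 0:
                    # Zero-probability cells contribute nothing to the sum.
                    total += p_ab * math.log2(p_ab / (p_a * p_b))
        result[(i, j)] = total
    return result


if __name__ == "__main__":
    # Toy data: labels 0 and 1 always co-occur; label 2 is independent of both.
    toy_labels = [
        [1, 1, 0],
        [0, 0, 1],
        [1, 1, 1],
        [0, 0, 0],
    ]
    for pair, value in label_mutual_information(toy_labels).items():
        print(pair, round(value, 3))
```

In the toy matrix, the perfectly co-occurring pair carries one bit of mutual information and the independent pairs carry none, which illustrates the kind of latent label association the abstract refers to.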

Publication history

Received: 09 August 2023
Revised: 23 October 2023
Accepted: 02 November 2023
Published: 22 April 2024
Issue date: June 2024

Copyright

© The author(s) 2023.

Rights and permissions

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).