Open Access

Unstructured Big Data Threat Intelligence Parallel Mining Algorithm

School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
School of IoT Engineering, Jiangnan University, Wuxi 214122, China

Abstract

To mine threat intelligence efficiently from the large volume of open-source cybersecurity analysis reports on the web, we propose the Parallel Deep Forest-based Multi-Label Classification (PDFMLC) algorithm. First, open-source cybersecurity analysis reports are collected and converted into a standardized text format. The reports are then annotated with labels from five tactics categories, yielding a multi-label dataset for tactics classification. To address the low execution efficiency and limited scalability of the sequential deep forest algorithm, PDFMLC uses broadcast variables and the Lempel-Ziv-Welch (LZW) compression algorithm, which significantly improves its speedup ratio. PDFMLC also takes the mutual information between labels in the constructed dataset as input features, which captures latent label associations and significantly improves classification accuracy. Finally, we present the PDFMLC-based Threat Intelligence Mining (PDFMLC-TIM) method. Experimental results show that PDFMLC scales well with the number of nodes and executes efficiently. The PDFMLC-TIM method classifies the text of cybersecurity analysis reports and extracts tactics entities, from which threat intelligence in STIX 2.1 format is constructed.
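
The label mutual information mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify how PDFMLC builds these features, so the code below only shows the standard definition of mutual information between two binary label columns of a multi-label matrix. The function names `label_mutual_information` and `pairwise_label_mi` and the toy label matrix are hypothetical and are not taken from the paper.

```python
import math
from itertools import combinations


def label_mutual_information(Y, a, b):
    """Mutual information (in bits) between label columns a and b of Y.

    Y is a list of rows, one row per report, each row a list of 0/1 labels.
    Hypothetical helper for illustration; not the paper's implementation.
    """
    n = len(Y)
    # Count joint occurrences of the two label values.
    joint = {}
    for row in Y:
        key = (row[a], row[b])
        joint[key] = joint.get(key, 0) + 1
    # Derive marginal counts from the joint counts.
    count_a, count_b = {}, {}
    for (va, vb), c in joint.items():
        count_a[va] = count_a.get(va, 0) + c
        count_b[vb] = count_b.get(vb, 0) + c
    # Sum p(a,b) * log2(p(a,b) / (p(a) * p(b))) over observed value pairs.
    mi = 0.0
    for (va, vb), c in joint.items():
        p_ab = c / n
        mi += p_ab * math.log2(p_ab / ((count_a[va] / n) * (count_b[vb] / n)))
    return mi


def pairwise_label_mi(Y):
    """Mutual information for every unordered pair of label columns."""
    n_labels = len(Y[0])
    return {
        (a, b): label_mutual_information(Y, a, b)
        for a, b in combinations(range(n_labels), 2)
    }


# Toy label matrix (made up): columns 0 and 1 always agree, column 2 is independent.
toy_labels = [
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
]
mi_table = pairwise_label_mi(toy_labels)
```

In this toy matrix, the pair of labels that always co-occur receives high mutual information and the independent pair receives zero, which is the kind of label association the abstract describes feeding into the classifier as additional features.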


Big Data Mining and Analytics
Pages 531-546
Cite this article:
Li Z, Yu X, Wei T, et al. Unstructured Big Data Threat Intelligence Parallel Mining Algorithm. Big Data Mining and Analytics, 2024, 7(2): 531-546. https://doi.org/10.26599/BDMA.2023.9020032

Views: 949; Downloads: 327; Crossref: 2; Web of Science: 0; Scopus: 2; CSCD: 0

Received: 09 August 2023
Revised: 23 October 2023
Accepted: 02 November 2023
Published: 22 April 2024
© The author(s) 2023.

The articles published in this open access journal are distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
