Unstructured Big Data Threat Intelligence Parallel Mining Algorithm

Zhihua Li; Xinye Yu; Tao Wei; Junhao Qian

doi:10.26599/BDMA.2023.9020032

Open Access

Zhihua Li¹, Xinye Yu¹, Tao Wei¹, Junhao Qian²

¹ School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China

² School of IoT Engineering, Jiangnan University, Wuxi 214122, China

Abstract

To efficiently mine threat intelligence from the large volume of open-source cybersecurity analysis reports on the web, we develop the Parallel Deep Forest-based Multi-Label Classification (PDFMLC) algorithm. First, open-source cybersecurity analysis reports are collected and converted into a standardized text format. Five tactics category labels are then annotated, yielding a multi-label dataset for tactics classification. To address the low execution efficiency and limited scalability of the sequential deep forest algorithm, PDFMLC employs broadcast variables and the Lempel-Ziv-Welch (LZW) algorithm, which significantly improves its acceleration ratio. PDFMLC also takes label mutual information computed from the constructed dataset as input features; this captures latent label associations and significantly improves classification accuracy. Finally, we present the PDFMLC-based Threat Intelligence Mining (PDFMLC-TIM) method. Experimental results show that the PDFMLC algorithm offers strong node scalability and execution efficiency. The PDFMLC-TIM method, in turn, classifies the text of cybersecurity analysis reports and extracts tactics entities to construct comprehensive threat intelligence, which is output in STIX 2.1 format.
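The abstract does not say how label mutual information is computed or fed into the model. As a rough illustration only, the sketch below estimates pairwise mutual information (in bits) between binary label columns of a multi-label dataset. The function names `label_mutual_information` and `mutual_information_matrix` and the 0/1 row layout are assumptions made for this example, not the authors' implementation.

```python
import math
from collections import Counter


def label_mutual_information(labels_a, labels_b):
    """Empirical mutual information (bits) between two label columns.

    Hypothetical helper: both arguments are equal-length sequences of
    0/1 values, one entry per report.
    """
    n = len(labels_a)
    joint = Counter(zip(labels_a, labels_b))
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    mi = 0.0
    # Sum only over value pairs that actually occur, so p_ab > 0.
    for (a, b), c in joint.items():
        p_ab = c / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


def mutual_information_matrix(label_rows):
    """Pairwise mutual information between all label columns.

    label_rows: list of rows, each a list of 0/1 tactic labels
    (assumed layout: one row per report, one column per tactic).
    """
    columns = list(zip(*label_rows))
    return [[label_mutual_information(ci, cj) for cj in columns]
            for ci in columns]


if __name__ == "__main__":
    # Toy data: 4 reports, 3 tactic labels (values are made up).
    rows = [[1, 0, 1], [0, 1, 0], [1, 1, 1], [0, 0, 0]]
    for row in mutual_information_matrix(rows):
        print([round(v, 3) for v in row])
```

In this toy data the first and third labels always co-occur, so their mutual information is high, while the second label is independent of both. How such scores are turned into input features for the deep forest is left to the full paper.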

Keywords

unstructured big data mining; parallel deep forest; multi-label classification algorithm; threat intelligence

Big Data Mining and Analytics

Volume 7, Issue 2, June 2024

Pages 531-546

DOI: 10.26599/BDMA.2023.9020032

Cite this article:

Li Z, Yu X, Wei T, et al. Unstructured Big Data Threat Intelligence Parallel Mining Algorithm. Big Data Mining and Analytics, 2024, 7(2): 531-546. https://doi.org/10.26599/BDMA.2023.9020032