Efficient Feature Extraction Using Apache Spark for Network Behavior Anomaly Detection

Xiaoming Ye; Xingshu Chen; Dunhu Liu; Wenxian Wang; Li Yang; Gang Liang; Guolin Shao

doi:10.26599/TST.2018.9010021

Tsinghua Science and Technology 2018, 23(5): 561-573 https://doi.org/10.26599/TST.2018.9010021

Open Access | Issue | Published: 17 September 2018


School of Cybersecurity, Chengdu University of Information Technology, Chengdu 610225, China.

College of Cybersecurity, Sichuan University, Chengdu 610065, China.

School of Management, Chengdu University of Information Technology, Chengdu 610103, China.

College of Computer Science, Sichuan University, Chengdu 610065, China.

Keywords:

feature extraction, network behavior, anomaly detection, graph theory, Apache Spark

Cite this article:

Ye X, Chen X, Liu D, et al. Efficient Feature Extraction Using Apache Spark for Network Behavior Anomaly Detection. Tsinghua Science and Technology, 2018, 23(5): 561-573. https://doi.org/10.26599/TST.2018.9010021


Abstract

Extracting and analyzing network traffic features is fundamental to the design and implementation of network behavior anomaly detection methods. Traditional network traffic feature extraction focuses on the statistical features of traffic volume. However, these features are not sufficient to reflect communication patterns. A different approach is required to detect anomalous behaviors that do not exhibit traffic volume changes, such as low-intensity anomalous behaviors caused by Denial of Service/Distributed Denial of Service (DoS/DDoS) attacks, Internet worms and scanning, and botnets. We propose an efficient traffic feature extraction architecture that combines the benefits of traffic volume features and network communication pattern features. This method can detect both low-intensity anomalous network behaviors and conventional traffic volume anomalies. We implemented our approach on Spark Streaming and validated our feature set using a labelled real-world dataset collected from the Sichuan University campus network. Our results demonstrate that the traffic feature extraction approach is efficient in detecting both traffic variations and communication structure changes. In our evaluation on the MIT-DARPA dataset, the same detection approach achieves a detection precision of 82.3% with traffic volume features alone and 89.9% with communication pattern features alone. Our proposed combined feature set raises precision to 94%.
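To make the idea of combining the two feature families concrete, the following is a minimal plain-Python sketch, not the authors' Spark Streaming implementation. The flow record layout `(timestamp, src_ip, dst_ip, byte_count)`, the 60-second window, the function name `extract_features`, and the specific features chosen (flow count, byte total, node count, edge count, maximum out-degree) are all assumptions made for illustration; the abstract does not list the actual feature set.

```python
from collections import defaultdict


def extract_features(flows, window_seconds=60):
    """Compute per-window volume and communication-pattern features.

    `flows` is an iterable of hypothetical records:
    (timestamp_seconds, src_ip, dst_ip, byte_count).
    """
    # Bucket flow records into fixed-length time windows.
    windows = defaultdict(list)
    for ts, src, dst, nbytes in flows:
        windows[int(ts // window_seconds)].append((src, dst, nbytes))

    features = {}
    for window_id, records in sorted(windows.items()):
        # Communication graph for this window: hosts are nodes,
        # each distinct (src, dst) pair is a directed edge.
        edges = {(src, dst) for src, dst, _ in records}
        nodes = {ip for edge in edges for ip in edge}
        out_degree = defaultdict(int)
        for src, _ in edges:
            out_degree[src] += 1

        features[window_id] = {
            # Traffic volume features (illustrative choice).
            "flow_count": len(records),
            "total_bytes": sum(nbytes for _, _, nbytes in records),
            # Communication pattern features (illustrative choice).
            "node_count": len(nodes),
            "edge_count": len(edges),
            "max_out_degree": max(out_degree.values()),
        }
    return features


# Window 0: ordinary bulk transfer between two hosts.
# Window 1: one host probing several destinations with tiny flows.
flows = [
    (0, "10.0.0.1", "10.0.0.2", 5000),
    (10, "10.0.0.1", "10.0.0.2", 5000),
    (65, "10.0.0.9", "10.0.0.2", 100),
    (70, "10.0.0.9", "10.0.0.3", 100),
    (75, "10.0.0.9", "10.0.0.4", 100),
]
features = extract_features(flows)
for window_id, row in features.items():
    print(window_id, row)
```

In this toy input the second window carries far fewer bytes than the first, so a volume-only view would not flag it, while its edge count and maximum out-degree rise from 1 to 3. This is the kind of low-intensity, scan-like change that communication pattern features are meant to expose.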




Publication history

Received: 24 September 2017

Accepted: 29 September 2017

Published: 17 September 2018

Issue date: October 2018


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61272447), Sichuan Province Science and Technology Planning (Nos. 2016GZ0042, 16ZHSF0483, and 2017GZ0168), Key Research Project of Sichuan Provincial Department of Education (Nos. 17ZA0238 and 17ZA0200), and Scientific Research Starting Foundation for Young Teachers of Sichuan University (No. 2015SCU11079).