Open Access
Issue
Published:
*04 April 2019*

Keywords:

Spark, frequent subgraphs, isomorphism
Cite this article:

Jena BS, Khan C, Sunderraman R.
High Performance Frequent Subgraph Mining on Transaction Datasets: A Survey and Performance Comparison.
Big Data Mining and Analytics,
2019, 2(3): 159-180.
https://doi.org/10.26599/BDMA.2019.9020006
Download citation

693

Views

43

Downloads

Citations

7

Crossref

5

WoS

9

Scopus

0

CSCD

Graph data mining has been a crucial as well as inevitable area of research. Large amounts of graph data are produced in many areas, such as Bioinformatics, Cheminformatics, Social Networks, etc. Scalable graph data mining methods are getting increasingly popular and necessary due to increased graph complexities. Frequent subgraph mining is one such area where the task is to find overly recurring patterns/subgraphs. To tackle this problem, many main memory-based methods were proposed, which proved to be inefficient as the data size grew exponentially over time. In the past few years, several research groups have attempted to handle the Frequent Subgraph Mining (FSM) problem in multiple ways. Many authors have tried to achieve better performance using Graphic Processing Units (GPUs) which has multi-fold improvement over in-memory while dealing with large datasets. Later, Google’s MapReduce model with the Hadoop framework proved to be a major breakthrough in high performance large batch processing. Although MapReduce came with many benefits, its disk I/O and non-iterative style model could not help much for FSM domain since subgraph mining process is an iterative approach. In recent years, Spark has emerged to be the De Facto industry standard with its distributed in-memory computing capability. This is a right fit solution for iterative style of programming as well. In this survey, we cover how high-performance computing has helped in improving the performance tremendously in the transactional directed and undirected aspect of graphs and performance comparisons of various FSM techniques are done based on experimental results.

menu

Abstract

Full text

Outline

About this article

Graph data mining has been a crucial as well as inevitable area of research. Large amounts of graph data are produced in many areas, such as Bioinformatics, Cheminformatics, Social Networks, etc. Scalable graph data mining methods are getting increasingly popular and necessary due to increased graph complexities. Frequent subgraph mining is one such area where the task is to find overly recurring patterns/subgraphs. To tackle this problem, many main memory-based methods were proposed, which proved to be inefficient as the data size grew exponentially over time. In the past few years, several research groups have attempted to handle the Frequent Subgraph Mining (FSM) problem in multiple ways. Many authors have tried to achieve better performance using Graphic Processing Units (GPUs) which has multi-fold improvement over in-memory while dealing with large datasets. Later, Google’s MapReduce model with the Hadoop framework proved to be a major breakthrough in high performance large batch processing. Although MapReduce came with many benefits, its disk I/O and non-iterative style model could not help much for FSM domain since subgraph mining process is an iterative approach. In recent years, Spark has emerged to be the De Facto industry standard with its distributed in-memory computing capability. This is a right fit solution for iterative style of programming as well. In this survey, we cover how high-performance computing has helped in improving the performance tremendously in the transactional directed and undirected aspect of graphs and performance comparisons of various FSM techniques are done based on experimental results.

[1]

R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in Proc. 20th Int. Conf. Very Large Data Bases, Santiago, Chile, 1994, pp. 487-499.

[2]

C. Liu, X. F. Yan, L. Fei, J. W. Han, and S. P. Midkiff, SOBER: Statistical model-based bug localization, *ACM SIGSOFT Soft. Eng. Notes*, vol. 30, no. 5, pp. 286-295, 2005.

[3]

X. Jiang, H. Xiong, C. Wang, and A. H. Tan, Mining globally distributed frequent subgraphs in a single labeled graph, *Data Know. Eng*., vol. 68, no. 10, pp. 1034-1058, 2009.

[4]

M. Kuramochi and G. Karypis, Finding frequent patterns in a large sparse graph, *Data Min. Know. Dis*., vol. 11, no. 3, pp. 243-271, 2005.

[5]

M. Kuramochi and G. Karypis, Grew—A scalable frequent subgraph discovery algorithm, in Proc. Fourth IEEE Int. Conf. Data Mining, Brighton, UK, 2004, pp. 439-442.

[6]

S. Ghazizadeh and S. S. Chawathe, SEuS: Structure extraction using summaries, in *Discovery Science*, S. Lange, K. Satoh, and C. H. Smith, eds. Springer, 2002, pp. 71-85.

[7]

N. Vanetik, E. Gudes, and S. E. Shimony, Computing frequent graph patterns from semistructured data, in Proc. 2002 IEEE Int. Conf. Data Mining, Maebashi City, Japan, 2002, pp. 458-465.

[8]

K. Yoshida, H. Motoda, and N. Indurkhya, Graph-based induction as a unified learning framework, *Appl. Intell*., vol. 4, no. 3, pp. 297-316, 1994.

[9]

L. B. Holder, D. J. Cook, and S. Djoko, Substucture discovery in the SUBDUE system, in Proc. Workshop on Knowledge Discovery in Databases, 1994, pp. 169-180.

[10]

U. Kang, C. E. Tsourakakis, and C. Faloutsos, PEGASUS: A peta-scale graph mining system implementation and observations, in Proc. Ninth IEEE Int. Conf. Data Mining, Miami, FL, USA, 2009, pp. 229-238.

[11]

S. P. Reinhardt and G. Karypis, A multi-level parallel implementation of a program for finding frequent patterns in a large sparse graph, in Proc. 12th Int. Workshop on High-Level Parallel Programming Models and Supportive Environments, Rome, Italy, 2007, pp. 1-8.

[12]

B. Wu and Y. L. Bai, An efficient distributed subgraph mining algorithm in extreme large graphs, in *Artificial Intelligence and Computational Intelligence*, F. L. Wang, H. Deng, Y. Gao, and J. Lei, eds. Springer, 2010, pp. 107-115.

[13]

L. Dehaspe, H. Toivonen, and R. D. King, Finding frequent substructures in chemical compounds, in Proc. Fourth Int. Conf. Knowledge Discovery and Data Mining, 1998.

[14]

A. Inokuchi, T. Washio, and H. Motoda, An apriori-based algorithm for mining frequent substructures from graph data, in *Principles of Data Mining and Knowledge Discovery*, D. A. Zighed, J. Komorowski, and J. Zytkow, eds. Springer, 2000, pp. 13-23.

[15]

A. Inokuchi, T. Washio, K. Nishimura, and H. Motoda, *A fast algorithm for mining frequent connected subgraphs*, IBM Research report, RT0448, 2002.

[16]

X. F. Yan and J. W. Han, gSpan: Graph-based substructure pattern mining, in Proc. 2002 IEEE Int. Conf. Data Mining, Maebashi City, Japan, 2002, pp. 721-724.

[17]

J. Huan, W. Wang, and J. Prins, Efficient mining of frequent subgraphs in the presence of isomorphism, in Proc. Third IEEE Int. Conf. Data Mining, Melbourne, FL, USA, 2003, pp. 549-552.

[18]

S. Nijssen and J. N. Kok, A quickstart in frequent structure mining can make a difference, in Proc. Tenth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2004, pp. 647-652.

[19]

S. Fortin, The graph isomorphism problem, Technical Report 96-20, University of Alberta, Edomonton, Alberta, Canada, 1996.

[20]

C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, Mining behavior graphs for "Backtrace" of noncrashing bugs, in Proc. 5th SIAM Int. Conf. on Data Mining, 2005, pp. 286-297.

[21]

R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, Network motifs: Simple building blocks of complex networks, *Science*, vol. 298, no. 5594, pp. 824-827, 2002.

[22]

R. Milo, N. Kashtan, S. Itzkovitz, M. E. J. Newman, and U. Alon, On the uniform generation of random graphs with prescribed degree sequences, arXiv preprint arXiv:cond-mat/0312028v2, 2003.

[23]

C. Borgelt and M. R. Berthold, Mining molecular fragments: Finding relevant substructures of molecules, in Proc. 2002 IEEE Int. Conf. Data Mining, Maebashi City, Japan, 2002, pp. 51-58.

[24]

B. Srichandan and R. Sunderraman, Oo-fsg: An object-oriented approach to mine frequent subgraphs, in Proc. Ninth Australasian Data Mining Conf, Darlinghurst, Australia, 2011, pp. 221-228.

[25]

S. Hill, B. Srichandan, and R. Sunderraman, An iterative MapReduce approach to frequent subgraph mining in biological datasets, in Proc. ACM Conf. Bioinformatics, Computational Biology and Biomedicine, New York, NY, USA, 2012, pp. 661-666.

[26]

J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, *Commun. ACM*, vol. 51, no. 1, pp. 107-113, 2008.

[27]

B. Jena, C. Khan, and R. Sunderraman, SparkFSM: A highly scalable frequent subgraph mining approach using apache spark, in Proc. ICDM Workshops 2018, 2018.

[28]

J. W. Han and M. Kamber, *Data Mining*: *Concepts and Techniques, 3rd edition*. San Francisco, CA, USA: Morgan Kaufmann, 2012.

[29]

R. Agrawal, T. Imieliński, and A. Swami, Mining association rules between sets of items in large databases, *ACM Sigmod Record Homep*., vol. 22, no. 2, pp. 207-216, 1993.

[30]

D. Burdick, M. Calimlim, and J. Gehrke, MAFIA: A maximal frequent itemset algorithm for transactional databases, in Proc. 17th Int. Conf. Data Engineering, Heidelberg, Germany, 2001, pp. 443-452.

[31]

R. J. Bayardo Jr, Efficiently mining long patterns from databases, *ACM Sigmod Record*, vol. 27, no. 2, pp. 85-93, 1998.

[32]

J. W. Han, J. Pei, and Y. W. Yin, Mining frequent patterns without candidate generation, *ACM SIGMOD Record*, vol. 29, no. 2, pp. 1-12, 2000.

[33]

M. J. Zaki and C. J. Hsiao, CHARM: An efficient algorithm for closed itemset mining, in Proc. 2nd SIAM Int. Conf. Data Mining, 2002, pp. 457-473.

[34]

M. J. Zaki and K. Gouda, Fast vertical mining using diffsets, in Proc. Ninth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2003, pp. 326-335.

[35]

J. Pei, J. W. Han, B. Mortazavi-Asl, H. Pinto, Q. M. Chen, U. Dayal, and M. C. Hsu, PrefixSpan,: Mining sequential patterns efficiently by prefix-projected pattern growth, in Proc. 17th Int. Conf. Data Engineering, Heidelberg, Germany, 2001.

[36]

H. Mannila, H. Toivonen, and A. I. Verkamo, Discovery of frequent episodes in event sequences, *Data Min. Know. Disc*., vol. 1, no. 3, pp. 259-289, 1997.

[37]

R. Agrawal and R. Srikant, Mining sequential patterns, in Proc. Eleventh Int. Conf. Data Engineering, Washington, DC, USA, 1995, pp. 3-14.

[38]

T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa, Efficient substructure discovery from large semi-structured data, in Proc. 2002 SIAM Int. Conf. Data Mining, Arlington, VA, USA, 2002.

[39]

M. J. Zaki, Efficiently mining frequent trees in a forest, in Proc. Eighth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2002, pp. 71-80.

[40]

R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, A tree projection algorithm for generation of frequent item sets, *J. Parall. Distrib. Comput*., vol. 61, no. 3, pp. 350-371, 2001.

[41]

M. Kuramochi and G. Karypis, Frequent subgraph discovery, in Proc. IEEE Int. Conf. Data Mining, San Jose, CA, USA, 2001, pp. 313-320.

[42]

X. Yan, Mining, indexing and similarity search in large graph data sets, PhD dissertation, University of Illinois at Urbana-Champaign, IL, USA, 2006.

[43]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, *Introduction to Algorithms*. Boston, MA, USA: MIT Press and McGraw-Hill, 2001.

[44]

A. Savasere, E. Omiecinski, and S. B. Navathe, An efficient algorithm for mining association rules in large databases, in Proc. 21th Int. Conf. Very Large Data Bases, San Francisco, CA, USA, 1995, pp. 432-444.

[45]

C. Wang, W. Wang, J. Pei, Y. T. Zhu, and B. L. Shi, Scalable mining of large disk-based graph databases, in Proc. Tenth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2004, pp. 316-325.

[46]

J. M. Wang, W. Hsu, M. L. Lee, and C. Sheng, A partition-based approach to graph mining, in Proc. 22nd Int. Conf. Data Engineering, Atlanta, GA, USA, 2006, p. 74.

[47]

S. N. Nguyen, M. E. Orlowska, and X. Li, Graph mining based on a data partitioning approach, in Proc. Nineteenth Conf. Australasian Database, Darlinghurst, Australia, 2007, pp. 31-37.

[48]

S. N. Nguyen and M. E. Orlowska, Improvements in the data partitioning approach for frequent itemsets mining, in *Knowledge Discovery in Databases*: *PKDD 2005*, A. M. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, eds. Springer, 2005, pp. 625-633.

[49]

S. Chakravarthy, R. Beera, and R. Balachandran, DB-Subdue: Database approach to graph mining, in *Advances in Knowledge Discovery and Data Mining*, H. Dai, R. Srikant, and C. Zhang, eds. Springer, 2004, pp. 341-350.

[50]

D. J. Cook and L. B. Holder, Graph-based data mining, *IEEE Intell. Syst. Their Appl*., vol. 15, no. 2, pp. 32-41, 2000.

[51]

J. Rissanen, *Stochastic Complexity in Statistical Inquiry*. Teaneck, NJ, USA: World Scientific Publishing Co., Inc, 1989.

[52]

R. Balachandran, S. Padmanabhan, and S. Chakravarthy, Enhanced DB-subdue: Supporting subtle aspects of graph mining using a relational approach, in *Advances in Knowledge Discovery and Data Mining*, W. K. Ng, M. Kitsuregawa, J. Li, and K. Chang, eds. Springer, 2006, pp. 673-678.

[53]

S. Padmanabhan and S. Chakravarthy, HDB-subdue: A scalable approach to graph mining, in *Data Warehousing and Knowledge Discovery*, T. B. Pedersen, M. K. Mohania, and A. M. Tjoa, eds. Springer, 2009, pp. 325-338.

[54]

S. Chakravarthy and S. Pradhan, DB-FSG: An SQL-based approach for frequent subgraph mining, in *Database and Expert Systems Applications*, S. S. Bhowmick, J. Küng, and R. Wagner, eds. Springer, 2008, pp. 684-692.

[55]

H. F. Li and N. Zhang, Mining maximal frequent itemsets on graphics processors, in Proc. 2010 Seventh Int. Conf. Fuzzy Systems and Knowledge Discovery, Yantai, China, 2010, pp. 1461-1464.

[56]

W. B. Fang, M. Lu, X. Y. Xiao, B. S. He, and Q. Luo, Frequent itemset mining on graphics processors, in Proc. Fifth Int. Workshop on Data Management on New Hardware, New York, NY, USA, 2009, pp. 34-42.

[57]

H. F. Li, A GPU-based maximal frequent itemsets mining algorithm over stream, in Proc. 2010 Int. Conf. Computer and Communication Technologies in Agriculture Engineering, Chengdu, China, 2010, pp. 289-292.

[58]

R. R. Amossen and R. Pagh, A new data layout for set intersection on GPUs, in Proc. 2011 IEEE Int. Parallel & Distributed Processing Symp., Anchorage, AK, USA, 2011, pp. 698-708.

[59]

G. Teodoro, N. Mariano, W. Meira Jr, and R. Ferreira, Tree projection-based frequent itemset mining on multicore CPUs and GPUs, in Proc. 2010 22nd Int. Symp. Computer Architecture and High Performance Computing, Petropolis, Brazil, 2010, pp. 47-54.

[60]

D. W. Cheung, J. W. Han, V. T. Ng, A. W. Fu, and Y. J. Fu, A fast distributed algorithm for mining association rules, in Proc. Fourth Int. Conf. Parallel and Distributed Information Systems, Miami Beach, FL, USA, 1996, pp. 31-42.

[61]

L. Liu, E. Li, Y. M. Zhang, and Z. Z. Tang, Optimization of frequent itemset mining on multiple-core processor, in Proc. 33rd Int. Conf. Very Large Data Bases, Vienna, Austria, 2007, pp. 1275-1285.

[62]

H. Y. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, PFP: Parallel FP-growth for query recommendation, in Proc. 2008 ACM Conf. Recommender Systems, New York, NY, USA, 2008, pp. 107-114.

[63]

I. Miliaraki, K. Berberich, R. Gemulla, and S. Zoupanos, Mind the gap: Large-scale frequent sequence mining, in Proc. 2013 ACM SIGMOD Int. Conf. Management of Data, New York, NY, USA, 2013, pp. 797-808.

[64]

Y. Liu, X. H. Jiang, H. J. Chen, J. Ma, and X. Y. Zhang, MapReduce-based pattern finding algorithm applied in motif detection for prescription compatibility network, in *Advanced Parallel Processing Technologies*, Y. Dou, R. Gruber, and J. M. Joller, eds. Springer, 2009, pp. 341-355.

[65]

C. Wang and S. Parthasarathy, Parallel algorithms for mining frequent structural motifs in scientific data, in Proc. 18th Ann. Int. Conf. Supercomputing, New York, NY, USA, 2004, pp. 31-40.

[66]

M. Coatney and S. Parthasarathy, Motifminer: A general toolkit for efficiently identifying common substructures in molecules, in Proc. Third IEEE Symp. Bioinformatics and Bioengineering, Bethesda, MD, USA, 2003, pp. 336-340.

[67]

D. J. Cook, L. B. Holder, G. Galal, and R. Maglothin, Approaches to parallel graph-based knowledge discovery, *J. Parall. Distrib. Comput*., vol. 61, no. 3, pp. 427-446, 2001.

[68]

T. Meinl, I. Fischer, and M. Philippsen, Parallel mining for frequent fragments on a shared-memory multiprocessor-results and java-obstacles, in Proc. Lernen, Wissensentdeckung und Adaptivität, Saarbrücken, Germany, 2005.

[69]

Y. Shiloach and U. Vishkin, An O (logn ) parallel connectivity algorithm, *J. Algorithms*, vol. 3, no. 1, pp. 57-67, 1982.

[70]

B. Awerbuch and T. Singh, New connectivity and MSF algorithms for ultracomputer and PRAM, *IEEE Transcactions on Computers*, vol. 36, no. 10, pp. 1258-1263, 1987.

[71]

D. S. Hirschberg, A. K. Chandra, and D. V. Sarwate, Computing connected components on parallel computers, *Commun. ACM*, vol. 22, no. 8, pp. 461-464, 1979.

[72]

U. Kang, C. E. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec, Hadi: Mining radii of large graphs, *ACM Trans. Know. Discovery Data*, vol. 5, no. 2, p. 8, 2011.

[73]

Z. Zhao, G. Y. Wang, A. R. Butt, M. Khan, V. A. Kumar, and M. V. Marathe, SAHAD: Subgraph analysis in massive networks using Hadoop, in Proc. 2012 IEEE 26th Int. Parallel and Distributed Processing Symp., Shanghai, China, 2012, pp. 390-401.

[74]

N. Alon, P. Dao, I. Hajirasouliha, F. Hormozdiari, and S. C. Sahinalp, Biomolecular network motif counting and discovery by color coding, *Bioinformatics*, vol. 24, no. 13, pp. i241-i249, 2008.

[75]

N. Alon, R. Yuster, and U. Zwick, Color-coding, *J. ACM*, vol. 42, no. 4, pp. 844-856, 1995.

[76]

F. N. Afrati, D. Fotakis, and J. D. Ullman, Enumerating subgraph instances using map-reduce, in Proc. 2013 IEEE 29th Int. Conf. Data Engineering, Brisbane, Australia, 2013, pp. 62-73.

[77]

F. N. Afrati and J. D. Ullman, Optimizing multiway joins in a map-reduce environment, *IEEE Trans. Know. Data Eng*., vol. 23, no. 9, pp. 1282-1298, 2011.

[78]

J. G. Xiang, C. Guo, and A. Aboulnaga, Scalable maximum clique computation using MapReduce, in Proc. 2013 IEEE 29th Int. Conf. Data Engineering, Brisbane, Australia, 2013, pp. 74-85.

[79]

Di G. Fatta and M. R. Berthold, Dynamic load balancing for the distributed mining of molecular structures, *IEEE Trans. Parall. Distrib. Syst*., vol. 17, no. 8, pp. 773-785, 2006.

[80]

Y. F. Luo, J. H. Guan, and S. G. Zhou, Towards efficient subgraph search in cloud computing environments, in Proc. 16th Int. Conf. Database Systems for Advanced Applications, Hong Kong, China, 2011, pp. 2-13.

[81]

G. Buehrer, S. Parthasarathy, and Y. K. Chen, Adaptive parallel graph mining for CMP architectures, in Proc. Sixth Int. Conf. Data Mining, Hong Kong, China, 2006, pp. 97-106.

[82]

S. Aridhi, L. D’ Orazio, M. Maddouri, and M. E. Nguifo, Density-based data partitioning strategy to approximate large-scale subgraph mining, *Inf. Syst*., vol. 48, pp. 213-223, 2013.

[83]

M. A. Bhuiyan and M. A. Hasan, MIRAGE: An iterative MapReduce based frequent subgraph mining algorithm, arXiv Preprint arXiv: 1307.5894, 2013.

[84]

C. H. Teixeira, A. J. Fonseca, M. Serafini, G. Siganos, M. J. Zaki, and A. Aboulnaga, Arabesque: A system for distributed graph mining, in Proc. 25th Symp Operating Systems Principles, New York, NY, USA, 2015, pp. 425-440.

[85]

W. Q. Lin, X. K. Xiao, and G. Ghinita, Large-scale frequent subgraph mining in MapReduce, in Proc. 2014 IEEE 30th Int. Conf. Data Engineering, Chicago, IL, USA, 2014, pp. 844-855.

[86]

F. C. Qiao, X. Zhang, P. Li, Z. Y. Ding, S. S. Jia, and H. Wang, A parallel approach for frequent subgraph mining in a single large graph using Spark, *Appl. Sci*., vol. 8, no. 2, p. 230. 2018.

[87]

A. M. Petermann, M. Junghanns, and E. Rahm, DIMSpan-transactional frequent subgraph mining with distributed in-memory dataflow systems, in Proc. Fourth IEEE/ACM Int. Conf. Big Data Computing, Applications and Technologies, New York, NY, USA, 2017.

Publication history

Copyright

Rights and permissions

Received: 13 November 2018

Accepted: 15 February 2019

Published:
04 April 2019

Issue date: September 2019

© The author(s) 2019