High Performance Frequent Subgraph Mining on Transaction Datasets: A Survey and Performance Comparison

Bismita S. Jena; Cynthia Khan; Rajshekhar Sunderraman

doi:10.26599/BDMA.2019.9020006

AI Chat Paper

Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.

Chat more with AI

| Sign up

Browse by Subject

Search for peer-reviewed journals with full access.

Journals A - Z

About Us

Discover the SciOpen Platform and Achieve Your Research Goals with Ease.

About Us

Publish with Us

Support

Journals A - Z

About Us

Publish with Us

Support

PDF (6.8 MB)

Cite

EndNote(RIS) BibTeX

Collect

Submit Manuscript

AI Chat Paper

Show Outline

Outline

Show full outline

Hide outline

Outline

Show full outline

Hide outline

Open Access

High Performance Frequent Subgraph Mining on Transaction Datasets: A Survey and Performance Comparison

Bismita S. Jena(

), Cynthia Khan, Rajshekhar Sunderraman

∙ Department of Computer Science, Georgia State University, Atlanta, GA 30302, USA.

Show Author Information

Abstract

Graph data mining has been a crucial as well as inevitable area of research. Large amounts of graph data are produced in many areas, such as Bioinformatics, Cheminformatics, Social Networks, etc. Scalable graph data mining methods are getting increasingly popular and necessary due to increased graph complexities. Frequent subgraph mining is one such area where the task is to find overly recurring patterns/subgraphs. To tackle this problem, many main memory-based methods were proposed, which proved to be inefficient as the data size grew exponentially over time. In the past few years, several research groups have attempted to handle the Frequent Subgraph Mining (FSM) problem in multiple ways. Many authors have tried to achieve better performance using Graphic Processing Units (GPUs) which has multi-fold improvement over in-memory while dealing with large datasets. Later, Google’s MapReduce model with the Hadoop framework proved to be a major breakthrough in high performance large batch processing. Although MapReduce came with many benefits, its disk I/O and non-iterative style model could not help much for FSM domain since subgraph mining process is an iterative approach. In recent years, Spark has emerged to be the De Facto industry standard with its distributed in-memory computing capability. This is a right fit solution for iterative style of programming as well. In this survey, we cover how high-performance computing has helped in improving the performance tremendously in the transactional directed and undirected aspect of graphs and performance comparisons of various FSM techniques are done based on experimental results.

Keywords

Spark frequent subgraphs isomorphism

References

[1]

R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in Proc. 20th Int. Conf. Very Large Data Bases, Santiago, Chile, 1994, pp. 487-499.

[2]

C. Liu, X. F. Yan, L. Fei, J. W. Han, and S. P. Midkiff, SOBER: Statistical model-based bug localization, ACM SIGSOFT Soft. Eng. Notes, vol. 30, no. 5, pp. 286-295, 2005.

Crossref Google Scholar

[3]

X. Jiang, H. Xiong, C. Wang, and A. H. Tan, Mining globally distributed frequent subgraphs in a single labeled graph, Data Know. Eng., vol. 68, no. 10, pp. 1034-1058, 2009.

Crossref Google Scholar

[4]

M. Kuramochi and G. Karypis, Finding frequent patterns in a large sparse graph, Data Min. Know. Dis., vol. 11, no. 3, pp. 243-271, 2005.

Crossref Google Scholar

[5]

M. Kuramochi and G. Karypis, Grew—A scalable frequent subgraph discovery algorithm, in Proc. Fourth IEEE Int. Conf. Data Mining, Brighton, UK, 2004, pp. 439-442.

Crossref

[6]

S. Ghazizadeh and S. S. Chawathe, SEuS: Structure extraction using summaries, in Discovery Science, S. Lange, K. Satoh, and C. H. Smith, eds. Springer, 2002, pp. 71-85.

Crossref

[7]

N. Vanetik, E. Gudes, and S. E. Shimony, Computing frequent graph patterns from semistructured data, in Proc. 2002 IEEE Int. Conf. Data Mining, Maebashi City, Japan, 2002, pp. 458-465.

[8]

K. Yoshida, H. Motoda, and N. Indurkhya, Graph-based induction as a unified learning framework, Appl. Intell., vol. 4, no. 3, pp. 297-316, 1994.

Crossref Google Scholar

[9]

L. B. Holder, D. J. Cook, and S. Djoko, Substucture discovery in the SUBDUE system, in Proc. Workshop on Knowledge Discovery in Databases, 1994, pp. 169-180.

[10]

U. Kang, C. E. Tsourakakis, and C. Faloutsos, PEGASUS: A peta-scale graph mining system implementation and observations, in Proc. Ninth IEEE Int. Conf. Data Mining, Miami, FL, USA, 2009, pp. 229-238.

Crossref

[11]

S. P. Reinhardt and G. Karypis, A multi-level parallel implementation of a program for finding frequent patterns in a large sparse graph, in Proc. 12th Int. Workshop on High-Level Parallel Programming Models and Supportive Environments, Rome, Italy, 2007, pp. 1-8.

Crossref

[12]

B. Wu and Y. L. Bai, An efficient distributed subgraph mining algorithm in extreme large graphs, in Artificial Intelligence and Computational Intelligence, F. L. Wang, H. Deng, Y. Gao, and J. Lei, eds. Springer, 2010, pp. 107-115.

Crossref

[13]

L. Dehaspe, H. Toivonen, and R. D. King, Finding frequent substructures in chemical compounds, in Proc. Fourth Int. Conf. Knowledge Discovery and Data Mining, 1998.

[14]

A. Inokuchi, T. Washio, and H. Motoda, An apriori-based algorithm for mining frequent substructures from graph data, in Principles of Data Mining and Knowledge Discovery, D. A. Zighed, J. Komorowski, and J. Zytkow, eds. Springer, 2000, pp. 13-23.

Crossref

[15]

A. Inokuchi, T. Washio, K. Nishimura, and H. Motoda, A fast algorithm for mining frequent connected subgraphs, IBM Research report, RT0448, 2002.

Google Scholar

[16]

X. F. Yan and J. W. Han, gSpan: Graph-based substructure pattern mining, in Proc. 2002 IEEE Int. Conf. Data Mining, Maebashi City, Japan, 2002, pp. 721-724.

[17]

J. Huan, W. Wang, and J. Prins, Efficient mining of frequent subgraphs in the presence of isomorphism, in Proc. Third IEEE Int. Conf. Data Mining, Melbourne, FL, USA, 2003, pp. 549-552.

[18]

S. Nijssen and J. N. Kok, A quickstart in frequent structure mining can make a difference, in Proc. Tenth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2004, pp. 647-652.

Crossref

[19]

S. Fortin, The graph isomorphism problem, Technical Report 96-20, University of Alberta, Edomonton, Alberta, Canada, 1996.

[20]

C. Liu, X. Yan, H. Yu, J. Han, and P. S. Yu, Mining behavior graphs for "Backtrace" of noncrashing bugs, in Proc. 5th SIAM Int. Conf. on Data Mining, 2005, pp. 286-297.

Crossref

[21]

R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon, Network motifs: Simple building blocks of complex networks, Science, vol. 298, no. 5594, pp. 824-827, 2002.

Crossref Google Scholar

[22]

R. Milo, N. Kashtan, S. Itzkovitz, M. E. J. Newman, and U. Alon, On the uniform generation of random graphs with prescribed degree sequences, arXiv preprint arXiv:cond-mat/0312028v2, 2003.

Google Scholar

[23]

C. Borgelt and M. R. Berthold, Mining molecular fragments: Finding relevant substructures of molecules, in Proc. 2002 IEEE Int. Conf. Data Mining, Maebashi City, Japan, 2002, pp. 51-58.

[24]

B. Srichandan and R. Sunderraman, Oo-fsg: An object-oriented approach to mine frequent subgraphs, in Proc. Ninth Australasian Data Mining Conf, Darlinghurst, Australia, 2011, pp. 221-228.

[25]

S. Hill, B. Srichandan, and R. Sunderraman, An iterative MapReduce approach to frequent subgraph mining in biological datasets, in Proc. ACM Conf. Bioinformatics, Computational Biology and Biomedicine, New York, NY, USA, 2012, pp. 661-666.

Crossref

[26]

J. Dean and S. Ghemawat, MapReduce: Simplified data processing on large clusters, Commun. ACM, vol. 51, no. 1, pp. 107-113, 2008.

Crossref Google Scholar

[27]

B. Jena, C. Khan, and R. Sunderraman, SparkFSM: A highly scalable frequent subgraph mining approach using apache spark, in Proc. ICDM Workshops 2018, 2018.

Crossref

[28]

J. W. Han and M. Kamber, Data Mining: Concepts and Techniques, 3rd edition. San Francisco, CA, USA: Morgan Kaufmann, 2012.

[29]

R. Agrawal, T. Imieliński, and A. Swami, Mining association rules between sets of items in large databases, ACM Sigmod Record Homep., vol. 22, no. 2, pp. 207-216, 1993.

Crossref Google Scholar

[30]

D. Burdick, M. Calimlim, and J. Gehrke, MAFIA: A maximal frequent itemset algorithm for transactional databases, in Proc. 17th Int. Conf. Data Engineering, Heidelberg, Germany, 2001, pp. 443-452.

[31]

R. J. Bayardo Jr, Efficiently mining long patterns from databases, ACM Sigmod Record, vol. 27, no. 2, pp. 85-93, 1998.

Crossref Google Scholar

[32]

J. W. Han, J. Pei, and Y. W. Yin, Mining frequent patterns without candidate generation, ACM SIGMOD Record, vol. 29, no. 2, pp. 1-12, 2000.

Crossref Google Scholar

[33]

M. J. Zaki and C. J. Hsiao, CHARM: An efficient algorithm for closed itemset mining, in Proc. 2nd SIAM Int. Conf. Data Mining, 2002, pp. 457-473.

Crossref

[34]

M. J. Zaki and K. Gouda, Fast vertical mining using diffsets, in Proc. Ninth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2003, pp. 326-335.

Crossref

[35]

J. Pei, J. W. Han, B. Mortazavi-Asl, H. Pinto, Q. M. Chen, U. Dayal, and M. C. Hsu, PrefixSpan,: Mining sequential patterns efficiently by prefix-projected pattern growth, in Proc. 17th Int. Conf. Data Engineering, Heidelberg, Germany, 2001.

[36]

H. Mannila, H. Toivonen, and A. I. Verkamo, Discovery of frequent episodes in event sequences, Data Min. Know. Disc., vol. 1, no. 3, pp. 259-289, 1997.

Crossref Google Scholar

[37]

R. Agrawal and R. Srikant, Mining sequential patterns, in Proc. Eleventh Int. Conf. Data Engineering, Washington, DC, USA, 1995, pp. 3-14.

[38]

T. Asai, K. Abe, S. Kawasoe, H. Arimura, H. Sakamoto, and S. Arikawa, Efficient substructure discovery from large semi-structured data, in Proc. 2002 SIAM Int. Conf. Data Mining, Arlington, VA, USA, 2002.

Crossref

[39]

M. J. Zaki, Efficiently mining frequent trees in a forest, in Proc. Eighth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2002, pp. 71-80.

Crossref

[40]

R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, A tree projection algorithm for generation of frequent item sets, J. Parall. Distrib. Comput., vol. 61, no. 3, pp. 350-371, 2001.

Crossref Google Scholar

[41]

M. Kuramochi and G. Karypis, Frequent subgraph discovery, in Proc. IEEE Int. Conf. Data Mining, San Jose, CA, USA, 2001, pp. 313-320.

[42]

X. Yan, Mining, indexing and similarity search in large graph data sets, PhD dissertation, University of Illinois at Urbana-Champaign, IL, USA, 2006.

[43]

T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. Boston, MA, USA: MIT Press and McGraw-Hill, 2001.

[44]

A. Savasere, E. Omiecinski, and S. B. Navathe, An efficient algorithm for mining association rules in large databases, in Proc. 21th Int. Conf. Very Large Data Bases, San Francisco, CA, USA, 1995, pp. 432-444.

[45]

C. Wang, W. Wang, J. Pei, Y. T. Zhu, and B. L. Shi, Scalable mining of large disk-based graph databases, in Proc. Tenth ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2004, pp. 316-325.

Crossref

[46]

J. M. Wang, W. Hsu, M. L. Lee, and C. Sheng, A partition-based approach to graph mining, in Proc. 22nd Int. Conf. Data Engineering, Atlanta, GA, USA, 2006, p. 74.

[47]

S. N. Nguyen, M. E. Orlowska, and X. Li, Graph mining based on a data partitioning approach, in Proc. Nineteenth Conf. Australasian Database, Darlinghurst, Australia, 2007, pp. 31-37.

[48]

S. N. Nguyen and M. E. Orlowska, Improvements in the data partitioning approach for frequent itemsets mining, in Knowledge Discovery in Databases: PKDD 2005, A. M. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, eds. Springer, 2005, pp. 625-633.

Crossref

[49]

S. Chakravarthy, R. Beera, and R. Balachandran, DB-Subdue: Database approach to graph mining, in Advances in Knowledge Discovery and Data Mining, H. Dai, R. Srikant, and C. Zhang, eds. Springer, 2004, pp. 341-350.

Crossref

[50]

D. J. Cook and L. B. Holder, Graph-based data mining, IEEE Intell. Syst. Their Appl., vol. 15, no. 2, pp. 32-41, 2000.

Crossref Google Scholar

[51]

J. Rissanen, Stochastic Complexity in Statistical Inquiry. Teaneck, NJ, USA: World Scientific Publishing Co., Inc, 1989.

[52]

R. Balachandran, S. Padmanabhan, and S. Chakravarthy, Enhanced DB-subdue: Supporting subtle aspects of graph mining using a relational approach, in Advances in Knowledge Discovery and Data Mining, W. K. Ng, M. Kitsuregawa, J. Li, and K. Chang, eds. Springer, 2006, pp. 673-678.

Crossref

[53]

S. Padmanabhan and S. Chakravarthy, HDB-subdue: A scalable approach to graph mining, in Data Warehousing and Knowledge Discovery, T. B. Pedersen, M. K. Mohania, and A. M. Tjoa, eds. Springer, 2009, pp. 325-338.

Crossref

[54]

S. Chakravarthy and S. Pradhan, DB-FSG: An SQL-based approach for frequent subgraph mining, in Database and Expert Systems Applications, S. S. Bhowmick, J. Küng, and R. Wagner, eds. Springer, 2008, pp. 684-692.

[55]

H. F. Li and N. Zhang, Mining maximal frequent itemsets on graphics processors, in Proc. 2010 Seventh Int. Conf. Fuzzy Systems and Knowledge Discovery, Yantai, China, 2010, pp. 1461-1464.

Crossref

[56]

W. B. Fang, M. Lu, X. Y. Xiao, B. S. He, and Q. Luo, Frequent itemset mining on graphics processors, in Proc. Fifth Int. Workshop on Data Management on New Hardware, New York, NY, USA, 2009, pp. 34-42.

Crossref

[57]

H. F. Li, A GPU-based maximal frequent itemsets mining algorithm over stream, in Proc. 2010 Int. Conf. Computer and Communication Technologies in Agriculture Engineering, Chengdu, China, 2010, pp. 289-292.

Crossref

[58]

R. R. Amossen and R. Pagh, A new data layout for set intersection on GPUs, in Proc. 2011 IEEE Int. Parallel & Distributed Processing Symp., Anchorage, AK, USA, 2011, pp. 698-708.

Crossref

[59]

G. Teodoro, N. Mariano, W. Meira Jr, and R. Ferreira, Tree projection-based frequent itemset mining on multicore CPUs and GPUs, in Proc. 2010 22nd Int. Symp. Computer Architecture and High Performance Computing, Petropolis, Brazil, 2010, pp. 47-54.

Crossref

[60]

D. W. Cheung, J. W. Han, V. T. Ng, A. W. Fu, and Y. J. Fu, A fast distributed algorithm for mining association rules, in Proc. Fourth Int. Conf. Parallel and Distributed Information Systems, Miami Beach, FL, USA, 1996, pp. 31-42.

[61]

L. Liu, E. Li, Y. M. Zhang, and Z. Z. Tang, Optimization of frequent itemset mining on multiple-core processor, in Proc. 33rd Int. Conf. Very Large Data Bases, Vienna, Austria, 2007, pp. 1275-1285.

[62]

H. Y. Li, Y. Wang, D. Zhang, M. Zhang, and E. Y. Chang, PFP: Parallel FP-growth for query recommendation, in Proc. 2008 ACM Conf. Recommender Systems, New York, NY, USA, 2008, pp. 107-114.

Crossref

[63]

I. Miliaraki, K. Berberich, R. Gemulla, and S. Zoupanos, Mind the gap: Large-scale frequent sequence mining, in Proc. 2013 ACM SIGMOD Int. Conf. Management of Data, New York, NY, USA, 2013, pp. 797-808.

Crossref

[64]

Y. Liu, X. H. Jiang, H. J. Chen, J. Ma, and X. Y. Zhang, MapReduce-based pattern finding algorithm applied in motif detection for prescription compatibility network, in Advanced Parallel Processing Technologies, Y. Dou, R. Gruber, and J. M. Joller, eds. Springer, 2009, pp. 341-355.

[65]

C. Wang and S. Parthasarathy, Parallel algorithms for mining frequent structural motifs in scientific data, in Proc. 18th Ann. Int. Conf. Supercomputing, New York, NY, USA, 2004, pp. 31-40.

Crossref

[66]

M. Coatney and S. Parthasarathy, Motifminer: A general toolkit for efficiently identifying common substructures in molecules, in Proc. Third IEEE Symp. Bioinformatics and Bioengineering, Bethesda, MD, USA, 2003, pp. 336-340.

[67]

D. J. Cook, L. B. Holder, G. Galal, and R. Maglothin, Approaches to parallel graph-based knowledge discovery, J. Parall. Distrib. Comput., vol. 61, no. 3, pp. 427-446, 2001.

Crossref Google Scholar

[68]

T. Meinl, I. Fischer, and M. Philippsen, Parallel mining for frequent fragments on a shared-memory multiprocessor-results and java-obstacles, in Proc. Lernen, Wissensentdeckung und Adaptivität, Saarbrücken, Germany, 2005.

[69]

Y. Shiloach and U. Vishkin, An O(logn) parallel connectivity algorithm, J. Algorithms, vol. 3, no. 1, pp. 57-67, 1982.

Crossref Google Scholar

[70]

B. Awerbuch and T. Singh, New connectivity and MSF algorithms for ultracomputer and PRAM, IEEE Transcactions on Computers, vol. 36, no. 10, pp. 1258-1263, 1987.

Crossref Google Scholar

[71]

D. S. Hirschberg, A. K. Chandra, and D. V. Sarwate, Computing connected components on parallel computers, Commun. ACM, vol. 22, no. 8, pp. 461-464, 1979.

Crossref Google Scholar

[72]

U. Kang, C. E. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec, Hadi: Mining radii of large graphs, ACM Trans. Know. Discovery Data, vol. 5, no. 2, p. 8, 2011.

Crossref Google Scholar

[73]

Z. Zhao, G. Y. Wang, A. R. Butt, M. Khan, V. A. Kumar, and M. V. Marathe, SAHAD: Subgraph analysis in massive networks using Hadoop, in Proc. 2012 IEEE 26th Int. Parallel and Distributed Processing Symp., Shanghai, China, 2012, pp. 390-401.

Crossref

[74]

N. Alon, P. Dao, I. Hajirasouliha, F. Hormozdiari, and S. C. Sahinalp, Biomolecular network motif counting and discovery by color coding, Bioinformatics, vol. 24, no. 13, pp. i241-i249, 2008.

Crossref Google Scholar

[75]

N. Alon, R. Yuster, and U. Zwick, Color-coding, J. ACM, vol. 42, no. 4, pp. 844-856, 1995.

Crossref Google Scholar

[76]

F. N. Afrati, D. Fotakis, and J. D. Ullman, Enumerating subgraph instances using map-reduce, in Proc. 2013 IEEE 29th Int. Conf. Data Engineering, Brisbane, Australia, 2013, pp. 62-73.

Crossref

[77]

F. N. Afrati and J. D. Ullman, Optimizing multiway joins in a map-reduce environment, IEEE Trans. Know. Data Eng., vol. 23, no. 9, pp. 1282-1298, 2011.

Crossref Google Scholar

[78]

J. G. Xiang, C. Guo, and A. Aboulnaga, Scalable maximum clique computation using MapReduce, in Proc. 2013 IEEE 29th Int. Conf. Data Engineering, Brisbane, Australia, 2013, pp. 74-85.

[79]

Di G. Fatta and M. R. Berthold, Dynamic load balancing for the distributed mining of molecular structures, IEEE Trans. Parall. Distrib. Syst., vol. 17, no. 8, pp. 773-785, 2006.

Crossref Google Scholar

[80]

Y. F. Luo, J. H. Guan, and S. G. Zhou, Towards efficient subgraph search in cloud computing environments, in Proc. 16th Int. Conf. Database Systems for Advanced Applications, Hong Kong, China, 2011, pp. 2-13.

Crossref

[81]

G. Buehrer, S. Parthasarathy, and Y. K. Chen, Adaptive parallel graph mining for CMP architectures, in Proc. Sixth Int. Conf. Data Mining, Hong Kong, China, 2006, pp. 97-106.

Crossref

[82]

S. Aridhi, L. D’ Orazio, M. Maddouri, and M. E. Nguifo, Density-based data partitioning strategy to approximate large-scale subgraph mining, Inf. Syst., vol. 48, pp. 213-223, 2013.

Crossref Google Scholar

[83]

M. A. Bhuiyan and M. A. Hasan, MIRAGE: An iterative MapReduce based frequent subgraph mining algorithm, arXiv Preprint arXiv: 1307.5894, 2013.

Google Scholar

[84]

C. H. Teixeira, A. J. Fonseca, M. Serafini, G. Siganos, M. J. Zaki, and A. Aboulnaga, Arabesque: A system for distributed graph mining, in Proc. 25th Symp Operating Systems Principles, New York, NY, USA, 2015, pp. 425-440.

Crossref

[85]

W. Q. Lin, X. K. Xiao, and G. Ghinita, Large-scale frequent subgraph mining in MapReduce, in Proc. 2014 IEEE 30th Int. Conf. Data Engineering, Chicago, IL, USA, 2014, pp. 844-855.

Crossref

[86]

F. C. Qiao, X. Zhang, P. Li, Z. Y. Ding, S. S. Jia, and H. Wang, A parallel approach for frequent subgraph mining in a single large graph using Spark, Appl. Sci., vol. 8, no. 2, p. 230. 2018.

Crossref Google Scholar

[87]

A. M. Petermann, M. Junghanns, and E. Rahm, DIMSpan-transactional frequent subgraph mining with distributed in-memory dataflow systems, in Proc. Fourth IEEE/ACM Int. Conf. Big Data Computing, Applications and Technologies, New York, NY, USA, 2017.

Crossref

Big Data Mining and Analytics

Volume 2 Issue 3,
September 2019

Pages 159-180

DOI: 10.26599/BDMA.2019.9020006

Cite this article:

Jena BS, Khan C, Sunderraman R. High Performance Frequent Subgraph Mining on Transaction Datasets: A Survey and Performance Comparison. Big Data Mining and Analytics, 2019, 2(3): 159-180. https://doi.org/10.26599/BDMA.2019.9020006

746

Views

Downloads

Crossref

Web of Science

Scopus

CSCD

Google Scholar
Citation

Altmetrics

Received: 13 November 2018

Accepted: 15 February 2019

Published: 04 April 2019