Open Access
Issue
Published:
*25 January 2018*

Keywords:

latent Dirichlet allocation, collapsed Gibbs sampling, Monte-Carlo, graph computing, large-scale machine learning
Cite this article:

Zhao B, Zhou H, Li G, et al.
ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform.
Big Data Mining and Analytics,
2018, 1(1): 57-74.
https://doi.org/10.26599/BDMA.2018.9020006
Download citation

374

Views

16

Downloads

Citations

16

Crossref

N/A

WoS

22

Scopus

0

CSCD

Recently, topic models such as Latent Dirichlet Allocation (LDA) have been widely used in large-scale web mining. Many large-scale LDA training systems have been developed, which usually prefer a customized design from top to bottom with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which follows a generalized design for the distributed data-parallel platform. The novelty of ZenLDA consists of three main aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm to a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently to reduce computation complexity; (3) it proposes a distributed LDA training framework, which represents the corpus as a directed graph with the parameters annotated as corresponding vertices and implements ZenLDA and other well-known inference methods based on Spark. Experimental results indicate that MCCB converges with accuracy similar to that of CGS, while running much faster. On top of MCCB, the ZenLDA formula decomposition achieved the fastest speed among other well-known inference methods. ZenLDA also showed good scalability when dealing with large-scale topic models on the data-parallel platform. Overall, ZenLDA could achieve comparable and even better computing performance with state-of-the-art dedicated systems.

menu

Abstract

Full text

Outline

About this article

Recently, topic models such as Latent Dirichlet Allocation (LDA) have been widely used in large-scale web mining. Many large-scale LDA training systems have been developed, which usually prefer a customized design from top to bottom with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which follows a generalized design for the distributed data-parallel platform. The novelty of ZenLDA consists of three main aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm to a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently to reduce computation complexity; (3) it proposes a distributed LDA training framework, which represents the corpus as a directed graph with the parameters annotated as corresponding vertices and implements ZenLDA and other well-known inference methods based on Spark. Experimental results indicate that MCCB converges with accuracy similar to that of CGS, while running much faster. On top of MCCB, the ZenLDA formula decomposition achieved the fastest speed among other well-known inference methods. ZenLDA also showed good scalability when dealing with large-scale topic models on the data-parallel platform. Overall, ZenLDA could achieve comparable and even better computing performance with state-of-the-art dedicated systems.

[1]

Hofmann T., Probabilistic latent semantic analysis, in Proc. 5th Conf. Uncertainty in Artificial Intelligence, Stockholm, Sweden, 1999, pp. 289-296.

[2]

Blei D. M., Ng A. Y., and Jordan M. I., Latent Dirichlet allocation, *J. Mach. Learn. Res.*, vol. 3, pp. 993-1022, 2003.

[3]

Newman D., Asuncion A., Smyth P., and Welling M., Distributed algorithms for topic models, *J. Mach. Learn. Res.*, vol. 10, pp. 1801-1828, 2009.

[4]

Wang Y., Bai H. J., Stanton M., Chen W. Y., and Chang E. Y., PLDA: Parallel latent dirichlet allocation for large-scale applications, in Proc. 5th Int. Conf. Algorithmic Aspects in Information and Management, San Francisco, CA, USA, 2009, pp. 301-314.

[5]

Ahmed A., Aly M., Gonzalez J., Narayanamurthy S., and Smola A. J., Scalable inference in latent variable models, in Proc. 5th ACM Int. Conf. Web Search and Data Mining, Seattle, WA, USA, 2012, pp. 123-132.

[6]

Liu Z. Y., Zhang Y. Z., Chang E. Y., and Sun M. S., PLDA+: Parallel latent dirichlet allocation with data placement and pipeline processing, *ACM Trans. Intell. Syst. Technol.*, vol. 2, no. 3, p. 26, 2011.

[7]

Ho Q. R., Cipar J., Cui H. G., Kim J. K., Lee S., Gibbons P. B., Gibson G. A., Ganger G. R., and Xing E. P., More effective distributed ML via a stale synchronous parallel parameter server, in Proc. 26th Int. Conf. Neural Information Processing Systems, Lake Tahoe, NV, USA, 2013, pp. 1223-1231.

[8]

Yuan J. H., Gao F., Ho Q. R., Dai W., Wei J. L., Zheng X., Xing E. P., Liu T. Y., and Ma W. Y., LightLDA: Big topic models on modest computer clusters, in Proc. 24th Int. Conf. World Wide Web, Florence, Italy, 2015, pp. 1351-1361.

[9]

Xing E. P., Ho Q. R., Dai W., Kim J. K., Wei J. L., Lee S., Zheng X., Xie P. T., Kumar A., and Yu Y. L., Petuum: A new platform for distributed machine learning on big data, in Proc. 21th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Sydney, Australia, 2015, pp. 1335-1344.

[10]

MPI Forum, Message passing interface, http://mpiforum.org/, 2017.

[11]

OpenMP ARB, OpenMP specifications, http://www.openmp.org/specifications/, 2017.

[12]

Wang Y., Zhao X. M., Sun Z. L., Yan H., Wang L. F., Jin Z. H., Wang L. B., Gao Y., Law C., and Zeng J., Peacock: Learning long-tail topic features for industrial applications, *ACM Trans. Intell. Syst. Technol.*, vol. 6, no. 4, p. 47, 2015.

[13]

Tora S. and Eguchi K., MPI/OpenMP hybrid parallel inference methods for latent dirichlet allocation—Approximation and evaluation, *IEICE Trans. Inf. Syst.*, vol. E96.D, no. 5, pp. 1006-1015, 2013.

[14]

Li M., Andersen D. G., Park J. W., Smola A. J., Ahmed A., Josifovski V., Long J., Shekita E. J., and Su B. Y., Scaling distributed machine learning with the parameter server, in Proc. 11th USENIX Symposium on Operating Systems Design and Implementation, Broomfield, CO, USA, 2014, pp. 583-598.

[15]

Dean J. and Ghemawat S., MapReduce: Simplified data processing on large clusters, in Proc. 6th Conf. Symposium on Opearting Systems Design & Implementation, San Francisco, CA, USA, 2004, p. 10.

[16]

Isard M., Budiu M., Yu Y., Birrell A., and Fetterly D., Dryad: Distributed data-parallel programs from sequential building blocks, in Proc. 2nd ACM SIGOPS/EuroSys European Conf. Computer Systems 2007, Lisbon, Portugal, 2007, pp. 59-72.

[17]

Zaharia M., Chowdhury M., Das T., Dave A., Ma J., McCauley M., Franklin M. J., Shenker S., and Stoica I., Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in Proc. 9th USENIX Conf. Networked Systems Design and Implementation, San Jose, CA, USA, 2012, p. 2.

[18]

Meng X. R., Bradley J., Yavuz B., Sparks E., Venkataraman S., Liu D., Freeman J., Tsai D. B., Amde M., Owen S., et al., MLlib: Machine learning in apache spark, *J. Mach. Learn. Res.*, vol. 17, no. 1, pp. 1235-1241, 2016.

[19]

Qiu Z. L., Wu B., Wang B., Shi C., and Yu L., Gibbs collapsed sampling for latent Dirichlet allocation on spark, in Proc. 3rd Int. Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, New York, NY, USA, 2014, pp. 17-28.

[20]

Bradley J., Topic modeling with LDA: MLlib meets GraphX, https://databricks.com/blog/2015/03/25/topicmodeling-with-lda-mllib-meets-graphx.html, 2015.

[21]

Microsoft Research, Distributed machine learning toolkit, https://github.com/microsoft/dmtk, 2017.

[22]

Griffiths T. L. and Steyvers M., Finding scientific topics, Proc. Natl. Acad. Sci. USA, vol. 101, no. S1, pp. 5228-5235, 2004.

[23]

Li A. Q., Ahmed A., Ravi S., and Smola A. J., Reducing the sampling complexity of topic models, in Proc. 20th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2014, pp. 891-900.

[24]

Yu H. F., Hsieh C. J., Yun H., Vishwanathan S. V. N., and Dhillon I. S., A scalable asynchronous distributed algorithm for topic modeling, in Proc. 24th Int. Conf. World Wide Web, Florence, Italy, 2015, pp. 1340-1350.

[25]

Teh Y. W., Newman D., and Welling M., A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation, in Proc. 19th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2006, pp. 1353-1360.

[26]

Chen J. F., Li K. W., Zhu J., and Chen W., WarpLDA: A cache efficient O(1) algorithm for latent Dirichlet allocation, Proc. VLDB Endowment, vol. 9, no. 10, pp. 744-755, 2016.

[27]

Zaheer M., Wick M., Tristan J. B., Smola A., and Steele Jr G. L., Exponential stochastic cellular automata for massively parallel inference, in Proc. 19th Int. Conf. Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain, 2016, pp. 966-975.

[28]

Gonzalez J. E., Xin R. S., Dave A., Crankshaw D., Franklin M. J., and Stoica I., GraphX: Graph processing in a distributed dataflow framework, in Proc. 11th USENIX Conf. Operating Systems Design and Implementation, Broomfield, CO, USA, 2014, pp. 599-613.

[29]

Yao L. M., Mimno D., and McCallum A., Efficient methods for topic model inference on streaming document collections, in Proc. 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Paris, France, 2009, pp. 937-946.

[30]

Gonzalez J. E., Low Y., Gu H. J., Bickson D., and Guestrin C., PowerGraph: Distributed graph-parallel computation on natural graphs, in Proc. 10th USENIX Conf. Operating Systems Design and Implementation, Hollywood, CA, USA, 2012, pp. 17-30.

[31]

Xie C., Yan L., Li W. J., and Zhang Z. H., Distributed power-law graph computing: theoretical and empirical analysis, in Proc. 27th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2014, pp. 1673-1681.

[32]

Esoteric Software, Kryo, https://github.com/Esoteric-Software/kryo, 2017.

[33]

Lemire D., JavaFastPFOR: A simple integer compression library in Java, https://github.com/lemire/JavaFastPFOR, 2017.

[34]

Hall D., Breeze, https://github.com/scalanlp/breeze, 2017.

[35]

Vavilapalli V. K., Murthy A. C., Douglas C., Agarwal S., Konar M., Evans R., Graves T., Lowe J., Shah H., Seth S., et al., Apache Hadoop YARN: Yet another resource negotiator, in Proc. 4th Annual Symposium on Cloud Computing, Santa Clara, CA, USA, 2013.

[36]

Wallach H. M., Mimno D., and McCallum A., Rethinking LDA: Why priors matter, in Proc. 23rd Annu. Conf. on Neural Information Processing Systems, Vancouver, Canada, 2009, pp. 1973-1981.

Publication history

Copyright

Acknowledgements

Rights and permissions

Received: 22 August 2017

Accepted: 30 November 2017

Published:
25 January 2018

Issue date: March 2018

© The author(s) 2018

This work was initially studied during the internship at Microsoft Research Asia (MSRA). We thank MSRA for the support for this work. This work was partially supported by the National Natural Science Foundation of China (No. 61572250) and the Science and Technology Program of Jiangsu Province (No. BE2017155).