ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform

Bo Zhao; Hucheng Zhou; Guoqiang Li; Yihua Huang

doi:10.26599/BDMA.2018.9020006

AI Chat Paper

Note: Please note that the following content is generated by AMiner AI. SciOpen does not take any responsibility related to this content.

Chat more with AI

| Sign up

Browse by Subject

Search for peer-reviewed journals with full access.

Journals A - Z

About Us

Discover the SciOpen Platform and Achieve Your Research Goals with Ease.

About Us

Publish with Us

Support

Journals A - Z

About Us

Publish with Us

Support

PDF (2.2 MB)

Cite

EndNote(RIS) BibTeX

Collect

Submit Manuscript

AI Chat Paper

Show Outline

Outline

Show full outline

Hide outline

Outline

Show full outline

Hide outline

Open Access

ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform

Bo Zhao, Hucheng Zhou, Guoqiang Li, Yihua Huang(

)

∙ National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China and Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210023, China.

∙ Microsoft Research, Beijing 100080, China.

∙ Huawei Technologies Co., Ltd., Shenzhen 518129, China.

Show Author Information

Abstract

Recently, topic models such as Latent Dirichlet Allocation (LDA) have been widely used in large-scale web mining. Many large-scale LDA training systems have been developed, which usually prefer a customized design from top to bottom with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which follows a generalized design for the distributed data-parallel platform. The novelty of ZenLDA consists of three main aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm to a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently to reduce computation complexity; (3) it proposes a distributed LDA training framework, which represents the corpus as a directed graph with the parameters annotated as corresponding vertices and implements ZenLDA and other well-known inference methods based on Spark. Experimental results indicate that MCCB converges with accuracy similar to that of CGS, while running much faster. On top of MCCB, the ZenLDA formula decomposition achieved the fastest speed among other well-known inference methods. ZenLDA also showed good scalability when dealing with large-scale topic models on the data-parallel platform. Overall, ZenLDA could achieve comparable and even better computing performance with state-of-the-art dedicated systems.

Keywords

latent Dirichlet allocation collapsed Gibbs sampling Monte-Carlo graph computing large-scale machine learning

References

[1]

Hofmann T., Probabilistic latent semantic analysis, in Proc. 5th Conf. Uncertainty in Artificial Intelligence, Stockholm, Sweden, 1999, pp. 289-296.

Crossref

[2]

Blei D. M., Ng A. Y., and Jordan M. I., Latent Dirichlet allocation, J. Mach. Learn. Res., vol. 3, pp. 993-1022, 2003.

Google Scholar

[3]

Newman D., Asuncion A., Smyth P., and Welling M., Distributed algorithms for topic models, J. Mach. Learn. Res., vol. 10, pp. 1801-1828, 2009.

Google Scholar

[4]

Wang Y., Bai H. J., Stanton M., Chen W. Y., and Chang E. Y., PLDA: Parallel latent dirichlet allocation for large-scale applications, in Proc. 5th Int. Conf. Algorithmic Aspects in Information and Management, San Francisco, CA, USA, 2009, pp. 301-314.

Crossref

[5]

Ahmed A., Aly M., Gonzalez J., Narayanamurthy S., and Smola A. J., Scalable inference in latent variable models, in Proc. 5th ACM Int. Conf. Web Search and Data Mining, Seattle, WA, USA, 2012, pp. 123-132.

Crossref

[6]

Liu Z. Y., Zhang Y. Z., Chang E. Y., and Sun M. S., PLDA+: Parallel latent dirichlet allocation with data placement and pipeline processing, ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, p. 26, 2011.

Crossref Google Scholar

[7]

Ho Q. R., Cipar J., Cui H. G., Kim J. K., Lee S., Gibbons P. B., Gibson G. A., Ganger G. R., and Xing E. P., More effective distributed ML via a stale synchronous parallel parameter server, in Proc. 26th Int. Conf. Neural Information Processing Systems, Lake Tahoe, NV, USA, 2013, pp. 1223-1231.

[8]

Yuan J. H., Gao F., Ho Q. R., Dai W., Wei J. L., Zheng X., Xing E. P., Liu T. Y., and Ma W. Y., LightLDA: Big topic models on modest computer clusters, in Proc. 24th Int. Conf. World Wide Web, Florence, Italy, 2015, pp. 1351-1361.

Crossref

[9]

Xing E. P., Ho Q. R., Dai W., Kim J. K., Wei J. L., Lee S., Zheng X., Xie P. T., Kumar A., and Yu Y. L., Petuum: A new platform for distributed machine learning on big data, in Proc. 21th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Sydney, Australia, 2015, pp. 1335-1344.

Crossref

[10]

MPI Forum, Message passing interface, http://mpiforum.org/, 2017.

[11]

OpenMP ARB, OpenMP specifications, http://www.openmp.org/specifications/, 2017.

[12]

Wang Y., Zhao X. M., Sun Z. L., Yan H., Wang L. F., Jin Z. H., Wang L. B., Gao Y., Law C., and Zeng J., Peacock: Learning long-tail topic features for industrial applications, ACM Trans. Intell. Syst. Technol., vol. 6, no. 4, p. 47, 2015.

Crossref Google Scholar

[13]

Tora S. and Eguchi K., MPI/OpenMP hybrid parallel inference methods for latent dirichlet allocation—Approximation and evaluation, IEICE Trans. Inf. Syst., vol. E96.D, no. 5, pp. 1006-1015, 2013.

Crossref Google Scholar

[14]

Li M., Andersen D. G., Park J. W., Smola A. J., Ahmed A., Josifovski V., Long J., Shekita E. J., and Su B. Y., Scaling distributed machine learning with the parameter server, in Proc. 11th USENIX Symposium on Operating Systems Design and Implementation, Broomfield, CO, USA, 2014, pp. 583-598.

Crossref

[15]

Dean J. and Ghemawat S., MapReduce: Simplified data processing on large clusters, in Proc. 6th Conf. Symposium on Opearting Systems Design & Implementation, San Francisco, CA, USA, 2004, p. 10.

[16]

Isard M., Budiu M., Yu Y., Birrell A., and Fetterly D., Dryad: Distributed data-parallel programs from sequential building blocks, in Proc. 2nd ACM SIGOPS/EuroSys European Conf. Computer Systems 2007, Lisbon, Portugal, 2007, pp. 59-72.

Crossref

[17]

Zaharia M., Chowdhury M., Das T., Dave A., Ma J., McCauley M., Franklin M. J., Shenker S., and Stoica I., Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing, in Proc. 9th USENIX Conf. Networked Systems Design and Implementation, San Jose, CA, USA, 2012, p. 2.

[18]

Meng X. R., Bradley J., Yavuz B., Sparks E., Venkataraman S., Liu D., Freeman J., Tsai D. B., Amde M., Owen S., et al., MLlib: Machine learning in apache spark, J. Mach. Learn. Res., vol. 17, no. 1, pp. 1235-1241, 2016.

Google Scholar

[19]

Qiu Z. L., Wu B., Wang B., Shi C., and Yu L., Gibbs collapsed sampling for latent Dirichlet allocation on spark, in Proc. 3rd Int. Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, New York, NY, USA, 2014, pp. 17-28.

[20]

Bradley J., Topic modeling with LDA: MLlib meets GraphX, https://databricks.com/blog/2015/03/25/topicmodeling-with-lda-mllib-meets-graphx.html, 2015.

[21]

Microsoft Research, Distributed machine learning toolkit, https://github.com/microsoft/dmtk, 2017.

[22]

Griffiths T. L. and Steyvers M., Finding scientific topics, Proc. Natl. Acad. Sci. USA, vol. 101, no. S1, pp. 5228-5235, 2004.

Crossref

[23]

Li A. Q., Ahmed A., Ravi S., and Smola A. J., Reducing the sampling complexity of topic models, in Proc. 20th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, New York, NY, USA, 2014, pp. 891-900.

[24]

Yu H. F., Hsieh C. J., Yun H., Vishwanathan S. V. N., and Dhillon I. S., A scalable asynchronous distributed algorithm for topic modeling, in Proc. 24th Int. Conf. World Wide Web, Florence, Italy, 2015, pp. 1340-1350.

[25]

Teh Y. W., Newman D., and Welling M., A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation, in Proc. 19th Int. Conf. Neural Information Processing Systems, Vancouver, Canada, 2006, pp. 1353-1360.

Crossref

[26]

Chen J. F., Li K. W., Zhu J., and Chen W., WarpLDA: A cache efficient O(1) algorithm for latent Dirichlet allocation, Proc. VLDB Endowment, vol. 9, no. 10, pp. 744-755, 2016.

Crossref

[27]

Zaheer M., Wick M., Tristan J. B., Smola A., and Steele Jr G. L., Exponential stochastic cellular automata for massively parallel inference, in Proc. 19th Int. Conf. Artificial Intelligence and Statistics (AISTATS) 2016, Cadiz, Spain, 2016, pp. 966-975.

[28]

Gonzalez J. E., Xin R. S., Dave A., Crankshaw D., Franklin M. J., and Stoica I., GraphX: Graph processing in a distributed dataflow framework, in Proc. 11th USENIX Conf. Operating Systems Design and Implementation, Broomfield, CO, USA, 2014, pp. 599-613.

[29]

Yao L. M., Mimno D., and McCallum A., Efficient methods for topic model inference on streaming document collections, in Proc. 15th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, Paris, France, 2009, pp. 937-946.

Crossref

[30]

Gonzalez J. E., Low Y., Gu H. J., Bickson D., and Guestrin C., PowerGraph: Distributed graph-parallel computation on natural graphs, in Proc. 10th USENIX Conf. Operating Systems Design and Implementation, Hollywood, CA, USA, 2012, pp. 17-30.

[31]

Xie C., Yan L., Li W. J., and Zhang Z. H., Distributed power-law graph computing: theoretical and empirical analysis, in Proc. 27th Int. Conf. Neural Information Processing Systems, Montreal, Canada, 2014, pp. 1673-1681.

[32]

Esoteric Software, Kryo, https://github.com/Esoteric-Software/kryo, 2017.

[33]

Lemire D., JavaFastPFOR: A simple integer compression library in Java, https://github.com/lemire/JavaFastPFOR, 2017.

[34]

Hall D., Breeze, https://github.com/scalanlp/breeze, 2017.

[35]

Vavilapalli V. K., Murthy A. C., Douglas C., Agarwal S., Konar M., Evans R., Graves T., Lowe J., Shah H., Seth S., et al., Apache Hadoop YARN: Yet another resource negotiator, in Proc. 4th Annual Symposium on Cloud Computing, Santa Clara, CA, USA, 2013.

Crossref

[36]

Wallach H. M., Mimno D., and McCallum A., Rethinking LDA: Why priors matter, in Proc. 23rd Annu. Conf. on Neural Information Processing Systems, Vancouver, Canada, 2009, pp. 1973-1981.

Big Data Mining and Analytics

Volume 1 Issue 1,
March 2018

Pages 57-74

DOI: 10.26599/BDMA.2018.9020006

Cite this article:

Zhao B, Zhou H, Li G, et al. ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform. Big Data Mining and Analytics, 2018, 1(1): 57-74. https://doi.org/10.26599/BDMA.2018.9020006

861

Views

Downloads

Crossref

Web of Science

Scopus

CSCD

Google Scholar
Citation

Altmetrics

Received: 22 August 2017

Accepted: 30 November 2017

Published: 25 January 2018