Journal Home > Volume 24 , Issue 1

Inverted indexes are widely adopted in the vast majority of information systems. Growing requirements for efficient query processing have motivated the development of various compression techniques with different space-time characteristics. Although a single encoder yields a relatively stable point in the space-time tradeoff curve, flexibly transforming its characteristic along the curve to fit different information retrieval tasks can be a better way to prepare the index. Recent research comes out with an idea of integrating different encoders within the same index, namely, exploiting access skewness by compressing frequently accessed regions with faster encoders and rarely accessed regions with succinct encoders, thereby improving the efficiency while minimizing the compressed size. However, these methods are either inefficient or result in coarse granularity. To address these issues, we introduce the concept of bicriteria compression, which aims to formalize the problem of optimally trading the compressed size and query processing time for inverted index. We also adopt a Lagrangian relaxation algorithm to solve this problem by reducing it to a knapsack-type problem, which works in O(nlogn) time and O(n) space, with a negligible additive approximation. Furthermore, this algorithm can be extended via dynamic programming pursuing improved query efficiency. We perform an extensive experiment to show that, given a bounded time/space budget, our method can optimally trade one for another with more efficient indexing and query performance.


menu
Abstract
Full text
Outline
About this article

A Flexible Space-Time Tradeoff on Hybrid Index with Bicriteria Optimization

Show Author's information Xingshen Song( )Yuexiang YangYu Jiang
School of Advanced Interdisciplinary Studies, National University of Defense Technology, Changsha 410000, China.
College of Computer, National University of Defense Technology, Changsha 410000, China.
Northwest Institute of Nuclear Technology, Xi’an 710000, China.

Abstract

Inverted indexes are widely adopted in the vast majority of information systems. Growing requirements for efficient query processing have motivated the development of various compression techniques with different space-time characteristics. Although a single encoder yields a relatively stable point in the space-time tradeoff curve, flexibly transforming its characteristic along the curve to fit different information retrieval tasks can be a better way to prepare the index. Recent research comes out with an idea of integrating different encoders within the same index, namely, exploiting access skewness by compressing frequently accessed regions with faster encoders and rarely accessed regions with succinct encoders, thereby improving the efficiency while minimizing the compressed size. However, these methods are either inefficient or result in coarse granularity. To address these issues, we introduce the concept of bicriteria compression, which aims to formalize the problem of optimally trading the compressed size and query processing time for inverted index. We also adopt a Lagrangian relaxation algorithm to solve this problem by reducing it to a knapsack-type problem, which works in O(nlogn) time and O(n) space, with a negligible additive approximation. Furthermore, this algorithm can be extended via dynamic programming pursuing improved query efficiency. We perform an extensive experiment to show that, given a bounded time/space budget, our method can optimally trade one for another with more efficient indexing and query performance.

Keywords: inverted index, bicriteria compression, Lagrangian relaxation

References(34)

[1]
Catena M., Macdonald C., and Ounis I., On inverted index compression for search engine efficiency, in Advances in Information Retrieval, de Rijke M., Kenter T., de Vries A. P., Zhai C. X., de Jong F., Radinsky K., and Hofmann K., eds. Springer, 2014, pp. 359-371.
[2]
Trotman A., Compression, SIMD, and postings lists, in Proc. 2014 Australasian Document Computing Symp., Melbourne, Australia, 2014, p. 50.
DOI
[3]
Zobel J. and Moffat A., Inverted files for text search engines, ACM Comput. Surv. (CSUR), vol. 38, no. 2, p. 6, 2006.
[4]
Lemire D. and Boytsov L., Decoding billions of integers per second through vectorization, Softw.: Pract. Exp., vol. 45, no. 1, pp. 1-29, 2015.
[5]
Lemire D., Boytsov L., and Kurz N., SIMD compression and the intersection of sorted integers, Softw.: Pract. Exp., vol. 46, no. 6, pp. 723-749, 2015.
[6]
Zhao W. X., Zhang X. D., Lemire D., Shan D. D., Nie J. Y., Yan H. F., and Wen J. R., A general SIMD-based approach to accelerating compression algorithms, ACM Trans. Inf. Syst. (TOIS), vol. 33, no. 3, p. 15, 2015.
[7]
Lemire D., Kurz N., and Rupp C., Stream VByte: Faster byte-oriented integer compression, Inf. Process. Lett., vol. 130, pp. 1-6, 2018.
[8]
Ottaviano G., Tonellotto N., and Venturini R., Optimal space-time tradeoffs for inverted indexes, in Proc. 8th ACM Int. Conf. Web Search and Data Mining, Shanghai, China, 2015, pp. 47-56.
DOI
[9]
Ottaviano G. and Venturini R., Partitioned Elias-Fano indexes, in Proc. 37th Int. ACM SIGIR Conf. Research & Development in Information Retrieval, Gold Coast, Australia, 2014, pp. 273-282.
DOI
[10]
Petri M., Moffat A., and Culpepper J. S., Score-safe term-dependency processing with hybrid indexes, in Proc. 37th Int. ACM SIGIR Conf. Research & Development in Information Retrieval, Gold Coast, Australia, 2014, pp. 899-902.
DOI
[11]
Silvestri F. and Venturini R., VSEncoding: Efficient coding and fast decoding of integer lists via dynamic programming, in Proc. 19th ACM Int. Conf. Information and Knowledge Management, Toronto, Canada, 2010, pp. 1219-1228.
DOI
[12]
Moffat A. and Stuiver L., Binary interpolative coding for effective index compression, Inf. Retr., vol. 3, no. 1, pp. 25-47, 2000.
[13]
Lin J., Crane M., Trotman A., Callan J., Chattopadhyaya I., Foley J., Ingersoll G., Macdonald C., and Vigna S., Toward reproducible baselines: The open-source IR reproducibility challenge, in Proc. 38th European Conf. IR Research, Padua, Italy, 2016, pp. 408-420.
DOI
[14]
Yan H., Ding S., and Suel T., Inverted index compression and query processing with optimized document ordering, in Proc. 18th Int. Conf. World Wide Web, Madrid, Spain, 2009, p. 401.
DOI
[15]
Farruggia A., Ferragina P., Frangioni A., and Venturini R., Bicriteria data compression, in Proc. 25th Ann. ACM-SIAM Symp. Discrete Algorithms, Portland, OR, USA, 2014, pp. 1582-1595.
DOI
[16]
Mehlhorn K. and Ziegelmann M., Resource constrained shortest paths, in Proc. 8th Ann. European Symp. Algorithms, Saarbrücken, Germany, 2000, pp. 326-337.
DOI
[17]
Handler G. Y. and Zang I., A dual algorithm for the constrained shortest path problem, Networks, vol. 10, no. 4, pp. 293-309, 1980.
[18]
Ding S. and Suel T., Faster top-K document retrieval using block-max indexes, in Proc. 34th Int. ACM SIGIR Conf. Research and Development in Information Retrieval, Beijing, China, 2011, pp. 993-1002.
DOI
[19]
Konow R., Navarro G., Clarke C. L. A., and López-Ortíz A., Faster and smaller inverted indices with treaps, in Proc. 36th Int. ACM SIGIR Conf. Research and Development in Information Retrieval, Dublin, Ireland, 2013, pp. 193-202.
DOI
[20]
Navarro G. and Puglisi S. J., Dual-sorted inverted lists, in String Processing and Information Retrieval, Chavez E. and Lonardi S., eds. Springer, 2010, pp. 309-321.
DOI
[21]
Stepanov A. A., Gangolli A. R., Rose D. E., Ernst R. J., and Oberoi P. S., SIMD-based decoding of posting lists, in Proc. 20th ACM Int. Conf. Information and Knowledge Management, Glasgow, UK, 2011, pp. 317-326.
DOI
[22]
Delbru R., Campinas S., Samp K., and Tummarello G., Adaptive frame of reference for compressing inverted lists, Technical Report, DERI–Digital Enterprise Research Institute, Ireland, 2010.
[23]
Goldstein J., Ramakrishnan R., and Shaft U., Compressing relations and indexes, in Proc. 14th Int. Conf. Data Engineering, Orlando, FL, USA, 1998, pp. 370-379.
[24]
Anh V. N. and Moffat A., Index compression using fixed binary codewords, in Proc. 15th Australasian Database Conf.-Volume 27, Dunedin, New Zealand, 2004, pp. 61-67.
[25]
Anh V. N. and Moffat A., Inverted index compression using word-aligned binary codes, Inf. Retr., vol. 8, no. 1, pp. 151-166, 2005.
[26]
Anh V. N. and Moffat A., Improved word-aligned binary compression for text indexing, IEEE Trans. Knowl. Data Eng., vol. 18, no. 6, pp. 857-861, 2006.
[27]
Anh V. N. and Moffat A., Index compression using 64-bit words, Softw.: Pract. Exp., vol. 40, no. 2, pp. 131-147, 2010.
[28]
Delbru R., Campinas S., and Tummarello G., Searching web data: An entity retrieval and high-performance indexing model, Web Semant., vol. 10, pp. 33-58, 2012.
[29]
Zukowski M., Heman S., Nes N., and Boncz P., Super-scalar RAM-CPU cache compression, in Proc. 22nd Int. Conf. Data Engineering, Atlanta, GA, USA, 2006, p. 59.
DOI
[30]
Ferragina P., Nitto I., and Venturini R., On optimally partitioning a text to improve its compression, Algorithmica, vol. 61, no. 1, pp. 51-74, 2011.
[31]
Song X. S., Jiang K., Jiang Y., and Yang Y. X., On optimizing partitioning strategies for faster inverted index compression, in Proc. 16th Int. Conf. Computational Science and Its Applications, Beijing, China, 2016, pp. 246-260.
DOI
[32]
Lawler E. L., Combinatorial Optimization: Networks and Matroids. New York, NY, USA: Courier Corporation, 2001.
[33]
Pass G., Chowdhury A., and Torgeson C., A picture of search, in Proc. 1st Int. Conf. Scalable Information Systems, Hong Kong, China, 2006, p. 1.
DOI
[34]
Giuseppe Ottaviano, https://github.com/ot/ds2i.git, 2017.
Publication history
Copyright
Acknowledgements
Rights and permissions

Publication history

Received: 02 August 2017
Accepted: 30 November 2017
Published: 08 November 2018
Issue date: February 2019

Copyright

© The author(s) 2019

Acknowledgements

Our implementation is based on the codes provided by Giuseppe Ottaviano. We are grateful to the anonymous reviewers for their valuable comments. We thank the Natural Science Foundation of Hunan Province (No. 2016JJ2007) for their financial support.

Rights and permissions

Return