Journal Home > Volume 18 , Issue 1

Keyword search has become a ubiquitous method for users to access text data in the face of information explosion. Inverted lists are usually used to index underlying documents to retrieve documents according to a set of keywords efficiently. Since inverted lists are usually large, many compression techniques have been proposed to reduce the storage space and disk I/O time. However, these techniques usually perform decompression operations on the fly, which increases the CPU time. This paper presents a more efficient index structure, the Generalized INverted IndeX (Ginix), which merges consecutive IDs in inverted lists into intervals to save storage space. With this index structure, more efficient algorithms can be devised to perform basic keyword search operations, i.e., the union and the intersection operations, by taking the advantage of intervals. Specifically, these algorithms do not require conversions from interval lists back to ID lists. As a result, keyword search using Ginix can be more efficient than those using traditional inverted indices. The performance of Ginix is also improved by reordering the documents in datasets using two scalable algorithms. Experiments on the performance and scalability of Ginix on real datasets show that Ginix not only requires less storage space, but also improves the keyword search performance, compared with traditional inverted indexes.


menu
Abstract
Full text
Outline
About this article

Ginix: Generalized Inverted Index for Keyword Search

Show Author's information Hao Wu( )Guoliang LiLizhu Zhou
Department of Computer Science and Technology, Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, Beijing 100084, China

Abstract

Keyword search has become a ubiquitous method for users to access text data in the face of information explosion. Inverted lists are usually used to index underlying documents to retrieve documents according to a set of keywords efficiently. Since inverted lists are usually large, many compression techniques have been proposed to reduce the storage space and disk I/O time. However, these techniques usually perform decompression operations on the fly, which increases the CPU time. This paper presents a more efficient index structure, the Generalized INverted IndeX (Ginix), which merges consecutive IDs in inverted lists into intervals to save storage space. With this index structure, more efficient algorithms can be devised to perform basic keyword search operations, i.e., the union and the intersection operations, by taking the advantage of intervals. Specifically, these algorithms do not require conversions from interval lists back to ID lists. As a result, keyword search using Ginix can be more efficient than those using traditional inverted indices. The performance of Ginix is also improved by reordering the documents in datasets using two scalable algorithms. Experiments on the performance and scalability of Ginix on real datasets show that Ginix not only requires less storage space, but also improves the keyword search performance, compared with traditional inverted indexes.

Keywords: keyword search, index compression, document reordering

References(25)

[1]
F. Scholer, H. E. Williams, J. Yiannis, and J. Zobel, Compression of inverted indexes for fast query evaluation, in Proc. of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Tammpere, Finland, 2002, pp. 222-229.
DOI
[2]
M. Zukowski, S. H¨¦man, N. Nes, and P. A. Boncz, Super-scalar RAM-CPU cache compression, in Proc. of the 22nd International Conference on Data Engineering, Atlanta, Georgia, USA, 2006, pp. 59.
DOI
[3]
W. Shieh, T. Chen, J. J. Shann, and C. Chung, Inverted file compression through document identifier reassignment, Information Processing and Management, vol. 39, no. 1, pp. 117-131, 2003.
[4]
R. Blanco and A. Barreiro, TSP and cluster-based solutions to the reassignment of document identifiers, Information Retrieval, vol. 9, no. 4, pp. 499-517, 2006.
[5]
F. Silvestri, Sorting out the document identifier assignment problem, in Proc. of the 29th European Conference on IR Research, Rome, Italy, 2007, pp. 101-112.
DOI
[6]
H. Yan, S. Ding, and T. Suel, Inverted index compression and query processing with optimized document ordering, in Proc. of the 18th International Conference on World Wide Web, Madrid, Spain, 2009, pp. 401-410.
DOI
[7]
S. Ding, J. Attenberg, and T. Suel, Scalable techniques for document identifier assignment ininverted indexes. in Proc. of the 19th International Conference on World Wide Web, Raleigh, North Carolina, USA, 2010, pp. 311-320.
DOI
[8]
M. Hadjieleftheriou, A. Chandel, N. Koudas, and D. Srivastava, Fast indexes and algorithms for set similarity selection queries, in Proc. of the 24th International Conference on Data Engineering, Cancun, Mexico, 2008, pp. 267-276.
DOI
[9]
W. J. Bouknight, A procedure for generation of three-dimensional half-toned computer graphics presentations, Communications of the ACM, vol. 13, no. 9, pp.527-536, September 1970.
[10]
Home page of DBLP bibliography, http://www.informatik.uni-trier.de/ ley/db, 2012.
[11]
Home page of PubMed, http://www.ncbi.nlm.nih.gov/pubmed, 2012.
[12]
S. Agrawal, S. Chaudhuri, and G. Das, DBXplorer: A system for keyword-based search over relational databases, in Proc. of the 18th International Conference on Data Engineering, San Jose, California, USA, 2002, pp. 5-16.
DOI
[13]
V. Hristidis and Y. Papakonstantinou, DISCOVER: Keyword search in relational databases, in Proc. of the 28th International Conference on Very Large Databases, Hong Kong, China, 2002, pp. 670-681.
DOI
[14]
V. Hristidis, L. Gravano, and Y. Papakonstantinou, Efficient IR-style keyword search over relational databases, in Proc. of the 29th International Conference on Very Large Databases, Berlin, Germany, 2003, pp. 850-861.
DOI
[15]
H. He, H. Wang, J. Yang, and P. S. Yu, BLINKS: ranked keyword searches on graphs, in Proc. of the ACM SIGMOD International Conference on Management of Data, Beijing, China, 2007, pp. 305-316.
DOI
[16]
Y. Luo, X. Lin, W. Wang, and X. Zhou, SPARK: Top-k keyword query in relational databases, in Proc. of the ACM SIGMOD International Conference on Management of Data, Beijing, China, 2007, pp. 115-126.
DOI
[17]
F. Liu, C. T. Yu, W. Meng, and A. Chowdhury, Effective keyword search in relational databases, in Proc. of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, 2006, pp. 563-574.
DOI
[18]
G. Li, B. C. Ooi, J. Feng, J. Wang, and L. Zhou, EASE: An effective 3-in-1 keyword search method for unstructured, semi-structured and structured data, in Proc. of the ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 2008, pp. 903-914.
DOI
[19]
G. Li, J. Feng, and L. Zhou, Interactive search in XML data, in Proc. of the 18th International Conference on World Wide Web, Madrid, Spain, 2009, pp. 1063-1064.
DOI
[20]
J. Zobel and A. Moffat, Inverted files for text search engines, ACM Computing Surveys, vol. 38, no. 2, pp. 6, 2006.
[21]
A. Moffat and L. Stuiver, Binary interpolative coding for effective index compression, Information Retrieval, vol. 3, no. 1, pp. 25-47, 2000.
[22]
V. N. Anh and A. Moffat, Inverted index compression using word-aligned binary codes, Information Retrieval, vol. 8, no. 1, pp. 151-166, 2005.
[23]
V. N. Anh and A. Moffat, Improved word-aligned binary compression for text indexing, IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 6, pp. 857-861, 2006.
[24]
J. Zhang, X. Long, and T. Suel, Performance of compressed inverted list caching in search engines, in Proc. of the 17th International Conference on World Wide Web, Beijing, China, 2008, pp. 387-396.
DOI
[25]
H. Edelsbrunner, A new approach to rectangle intersections, International Journal of Computer Mathematics, vol. 13, no. 3/4, pp. 209-219/221-229, 1983.
Publication history
Copyright
Acknowledgements
Rights and permissions

Publication history

Received: 03 May 2012
Accepted: 21 December 2012
Published: 07 February 2013
Issue date: February 2013

Copyright

© The author(s) 2013

Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 60833003).

Rights and permissions

Return