Streaming similarity search over one billion tweets using parallel locality-sensitive hashing

Authors:
Narayanan Sundaram (Parallel Computing Lab, Intel)
Aizana Turmukhametova (CSAIL, MIT)
Nadathur Satish (Parallel Computing Lab, Intel)
Todd Mostak (CSAIL, MIT)
Piotr Indyk (CSAIL, MIT)
Samuel Madden (CSAIL, MIT)
Pradeep Dubey (Parallel Computing Lab, Intel)

Proceedings of the VLDB Endowment, Volume 6, Issue 14, pp. 1930–1941. https://doi.org/10.14778/2556549.2556574

Published: 01 September 2013

Abstract

Finding nearest neighbors has become an important operation on databases, with applications to text search, multimedia indexing, and many other areas. One popular algorithm for similarity search, especially for high-dimensional data (where spatial indexes like kd-trees do not perform well), is Locality Sensitive Hashing (LSH), an approximation algorithm for finding similar objects.

In this paper, we describe a new variant of LSH, called Parallel LSH (PLSH), designed to be extremely efficient, capable of scaling out on multiple nodes and multiple cores, and able to support high-throughput streaming of new data. Our approach employs several novel ideas, including: a cache-conscious hash table layout, using a 2-level merge algorithm for hash table construction; an efficient algorithm for duplicate elimination during hash-table querying; an insert-optimized hash table structure and an efficient data expiration algorithm for streaming data; and a performance model that accurately estimates the performance of the algorithm and can be used to optimize parameter settings. We show that on a workload where we perform similarity search on a dataset of more than 1 billion tweets, with hundreds of millions of new tweets per day, we can achieve query times of 1-2.5 ms. We show that this is an order of magnitude faster than existing indexing schemes, such as inverted indexes. To the best of our knowledge, this is the fastest implementation of LSH, with table construction times up to 3.7× faster and query times that are 8.3× faster than a basic implementation.
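For readers unfamiliar with the baseline technique, the following is a minimal sketch of a basic multi-table LSH index. It is not the PLSH implementation described above. It assumes random-hyperplane (sign) hashing for angular similarity, and the class name `SimpleLSHIndex` and the parameters `num_tables` and `bits_per_table` are hypothetical names chosen for illustration. It shows only two of the ideas the abstract mentions: inserting new items into several hash tables, and removing duplicate candidates at query time (here with a plain Python set, not the paper's optimized method).

```python
import random
from collections import defaultdict


class SimpleLSHIndex:
    """Toy multi-table LSH index using random-hyperplane (sign) hashes."""

    def __init__(self, dim, num_tables=8, bits_per_table=6, seed=0):
        rng = random.Random(seed)
        # One independent set of random Gaussian hyperplanes per table.
        self.planes = [
            [[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(bits_per_table)]
            for _ in range(num_tables)
        ]
        self.tables = [defaultdict(list) for _ in range(num_tables)]
        self.vectors = []

    def _key(self, table_idx, vec):
        # Each bit records which side of one hyperplane the vector lies on.
        key = 0
        for plane in self.planes[table_idx]:
            dot = sum(p * x for p, x in zip(plane, vec))
            key = (key << 1) | (1 if dot >= 0 else 0)
        return key

    def insert(self, vec):
        # Append the vector and register its id in every table.
        item_id = len(self.vectors)
        self.vectors.append(vec)
        for t, table in enumerate(self.tables):
            table[self._key(t, vec)].append(item_id)
        return item_id

    def query(self, vec):
        # The same item may collide with the query in several tables;
        # collecting ids in a set eliminates those duplicates.
        candidates = set()
        for t, table in enumerate(self.tables):
            candidates.update(table.get(self._key(t, vec), ()))
        return candidates


index = SimpleLSHIndex(dim=4, num_tables=4, bits_per_table=3, seed=1)
first = index.insert([1.0, 0.5, -0.2, 0.3])
second = index.insert([-1.0, -0.5, 0.2, -0.3])
# A positively scaled copy of the first vector has the same sign pattern.
matches = index.query([2.0, 1.0, -0.4, 0.6])
```

In a full system the candidate set would then be filtered by an exact similarity computation. Using more tables raises the chance that a true neighbor collides with the query, at the cost of more memory and more duplicate candidates to eliminate.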


Index Terms

Streaming similarity search over one billion tweets using parallel locality-sensitive hashing
1. General and reference
  1. Cross-computing tools and techniques
    1. Performance
2. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information retrieval
    1. Document representation

Index terms have been assigned to the content through auto-classification.


Published in

Proceedings of the VLDB Endowment, Volume 6, Issue 14
September 2013
384 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
Qualifier: article
