skip to main content
10.1145/2736277.2741285acmotherconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
research-article

Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

Published:18 May 2015Publication History

ABSTRACT

Minwise hashing (Minhash) is a widely popular indexing scheme in practice. Minhash is designed for estimating set resemblance and is known to be suboptimal in many applications where the desired measure is set overlap (i.e., inner product between binary vectors) or set containment. Minhash has inherent bias towards smaller sets, which adversely affects its performance in applications where such a penalization is not desirable. In this paper, we propose asymmetric minwise hashing ({\em MH-ALSH}), to provide a solution to this well-known problem. The new scheme utilizes asymmetric transformations to cancel the bias of traditional minhash towards smaller sets, making the final ``collision probability'' monotonic in the inner product. Our theoretical comparisons show that, for the task of retrieving with binary inner products, asymmetric minhash is provably better than traditional minhash and other recently proposed hashing algorithms for general inner products. Thus, we obtain an algorithmic improvement over existing approaches in the literature. Experimental evaluations on four publicly available high-dimensional datasets validate our claims. The proposed scheme outperforms, often significantly, other hashing algorithms on the task of near neighbor retrieval with set containment. Our proposal is simple and easy to implement in practice.

References

  1. P. Agrawal, A. Arasu, and R. Kaushik. On indexing error-tolerant set containment. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, pages 927--938. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. A. Andoni and P. Indyk. E2lsh: Exact euclidean locality sensitive hashing. Technical report, 2004.Google ScholarGoogle Scholar
  3. Y. Bachrach, Y. Finkelstein, R. Gilad-Bachrach, L. Katzir, N. Koenigstein, N. Nice, and U. Paquet. Speeding up the xbox recommender system using a euclidean transformation for inner-product spaces. In RecSys, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Y. Bachrach, E. Porat, and J. S. Rosenschein. Sketching techniques for collaborative filtering. In Proceedings of the 21st International Jont Conference on Artifical Intelligence, IJCAI'09, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. R. J. Bayardo, Y. Ma, and R. Srikant. Scaling up all pairs similarity search. In WWW, pages 131--140, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Z. Broder. On the resemblance and containment of documents. In the Compression and Complexity of Sequences, pages 21--29, Positano, Italy, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In STOC, pages 327--336, Dallas, TX, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. T. Chandra, E. Ie, K. Goldman, T. L. Llinares, J. McFadden, F. Pereira, J. Redstone, T. Shaked, and Y. Singer. Sibyl: a system for large scale machine learning.Google ScholarGoogle Scholar
  9. O. Chapelle, P. Haffner, and V. N. Vapnik. Support vector machines for histogram-based image classification. IEEE Transactions on Neural Networks, 10(5):1055--1064, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380--388, Montreal, Quebec, Canada, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Chaudhuri, V. Ganti, and R. Kaushik. A primitive operatior for similarity joins in data cleaning. In ICDE, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Chaudhuri, V. Ganti, and D. Xin. Mining document collections to facilitate accurate approximate entity matching. Proceedings of the VLDB Endowment, 2(1):395--406, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. F. Chierichetti, R. Kumar, S. Lattanzi, M. Mitzenmacher, A. Panconesi, and P. Raghavan. On compressing social networks. In KDD, pages 219--228, Paris, France, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. G. Cormode and S. Muthukrishnan. Space efficient mining of multigraph streams. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 271--282. ACM, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. S. Das, M. Datar, A. Garg, and S. Rajaram. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th international conference on World Wide Web, pages 271--280. ACM, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokn. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253 -- 262, Brooklyn, NY, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. S. Geva and C. M. De Vries. Topsig: Topology preserving document signatures. In CIKM, pages 333--338, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. M. X. Goemans and D. P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of ACM, 42(6):1115--1145, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. B. Haeupler, M. Manasse, and K. Talwar. Consistent weighted sampling made fast, small, and easy. Technical report, arXiv:1410.4266, 2014.Google ScholarGoogle Scholar
  20. M. Hein and O. Bousquet. Hilbertian metrics and positive definite kernels on probability measures. In AISTATS, pages 136--143, Barbados, 2005.Google ScholarGoogle Scholar
  21. M. R. Henzinger. Finding near-duplicate web pages: a large-scale evaluation of algorithms. In SIGIR, pages 284--291, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. P. Indyk and R. Motwani. Approximate nearest neighbors: Towards removing the curse of dimensionality. In STOC, pages 604--613, Dallas, TX, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. S. Ioffe. Improved consistent sampling, weighted minhash and L1 sketching. In ICDM, pages 246--255, Sydney, AU, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Y. Jiang, C. Ngo, and J. Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. In CIVR, pages 494--501, Amsterdam, Netherlands, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In Proceedings of the 2006 ACM SIGMOD international conference on Management of data, pages 802--803. ACM, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. P. Li. Min-max kernels. Technical report, arXiv:1503. 0173, 2015.Google ScholarGoogle Scholar
  27. P. Li and A. C. König. Theory and applications b-bit minwise hashing. Commun. ACM, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. P. Li, M. Mitzenmacher, and A. Shrivastava. Coding for random projections and approximate near neighbor search. Technical report, arXiv:1403.8144, 2014.Google ScholarGoogle Scholar
  29. P. Li, A. B. Owen, and C.-H. Zhang. One permutation hashing. In NIPS, Lake Tahoe, NV, 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. M. Manasse, F. McSherry, and K. Talwar. Consistent weighted sampling. Technical Report MSR-TR-2010-73, Microsoft Research, 2010.Google ScholarGoogle Scholar
  31. S. Melnik and H. Garcia-Molina. Adaptive algorithms for set containment joins. ACM Transactions on Database Systems (TODS), 28(1):56--99, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. B. Neyshabur and N. Srebro. A simpler and better lsh for maximum inner product search (mips). Technical report, arXiv:1410.5518, 2014.Google ScholarGoogle Scholar
  33. A. Rajaraman and J. Ullman. Mining of Massive Datasets. http://i.stanford.edu/ ullman/mmds.html. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. K. Ramasamy, J. F. Naughton, and R. Kaushik. Set containment joins: The good, the bad and the ugly.Google ScholarGoogle Scholar
  35. A. Shrivastava and P. Li. Asymmetric LSH (ALSH) for sublinear time maximum inner product search (mips). In NIPS, Montreal, CA, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. A. Shrivastava and P. Li. Densifying one permutation hashing via rotation for fast near neighbor search. In ICML, Beijing, China, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. A. Shrivastava and P. Li. Improved asymmetric locality sensitive hashing (ALSH) for maximum inner product search (MIPS). arXiv:1410.5410 (submitted to AISTATS), 2014.Google ScholarGoogle Scholar
  38. A. Shrivastava and P. Li. Improved densification of one permutation hashing. In UAI, Quebec City, CA, 2014.Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. A. Shrivastava and P. Li. In defense of minhash over simhash. In AISTATS, 2014.Google ScholarGoogle Scholar
  40. R. Weber, H.-J. Schek, and S. Blott. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In VLDB, pages 194--205, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Asymmetric Minwise Hashing for Indexing Binary Inner Products and Set Containment

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Other conferences
      WWW '15: Proceedings of the 24th International Conference on World Wide Web
      May 2015
      1460 pages
      ISBN:9781450334693

      Copyright © 2015 Copyright is held by the International World Wide Web Conference Committee (IW3C2)

      Publisher

      International World Wide Web Conferences Steering Committee

      Republic and Canton of Geneva, Switzerland

      Publication History

      • Published: 18 May 2015

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      WWW '15 Paper Acceptance Rate131of929submissions,14%Overall Acceptance Rate1,899of8,196submissions,23%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader