Skip to main content
Top
Published in: Annals of Data Science 4/2016

26-10-2016

A Systematic Review on Minwise Hashing Algorithms

Authors: Jingjing Tang, Yingjie Tian

Published in: Annals of Data Science | Issue 4/2016

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Similarity detection technology captures a host of researchers’ attention. Minwise hashing schemes become the current researching hot spots in machine learning for similarity preservation. During the data preprocessing stage, the basic idea of minwise hashing schemes is to transfer the original data into binary codes which are good proxies of original data to preserve the similarity. Minwise hashing schemes can improve the computation efficiency and save the storage space without notable loss of accuracy. Thus, they have been studied extensively and developed rapidly for decades. Considering minwise hashing algorithm and its variants, a systematic survey is needed and beneficial to understand and utilize this kind of data preprocessing techniques more easily. The purpose of this paper is to review minwise hashing algorithms in detail and provide an insightful understanding of current developments. In order to show the application prospect of the minwise hashing algorithms, various algorithms have combined with linear Support Vector Machine for large-scale classification. Both theoretical analysis and experimental results demonstrate that these algorithms can achieve massive advantages in accuracy, efficiency and energy-consumption. Furthermore, their limitations, major opportunities and challenges, extensions and variants as well as potential important research directions have been pointed out.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Andoni A, Indyk P (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science, pp 459–468 Andoni A, Indyk P (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science, pp 459–468
2.
go back to reference Broder AZ (1997) On the resemblance and containment of documents. Compression Complex Seq 1997:21–29 Broder AZ (1997) On the resemblance and containment of documents. Compression Complex Seq 1997:21–29
3.
go back to reference Broder AZ, Glassman SC, Manasse MS, Zweig G (Sep 1997) Syntactic clustering of the web. In: Proceedings of the 6th international conference on World Wide Web, Vol 29, p 1157–1166 Broder AZ, Glassman SC, Manasse MS, Zweig G (Sep 1997) Syntactic clustering of the web. In: Proceedings of the 6th international conference on World Wide Web, Vol 29, p 1157–1166
4.
go back to reference Gollapudi S, Sharma A (2009) An axiomatic approach for result diversification. In: Proceedings of the 18th international conference on World Wide Web, p 381–390 Gollapudi S, Sharma A (2009) An axiomatic approach for result diversification. In: Proceedings of the 18th international conference on World Wide Web, p 381–390
5.
go back to reference Kalpakis K, Tang S (2008) Collaborative data gathering in wireless sensor networks using measurement co-occurrence. Comput Commun 31(10):19791992CrossRef Kalpakis K, Tang S (2008) Collaborative data gathering in wireless sensor networks using measurement co-occurrence. Comput Commun 31(10):19791992CrossRef
6.
go back to reference Najork M, Gollapudi S, Panigrahy R (2009) Less is more: sampling the neighborhood graph makes salsa better and faster. Web Search and Data Mining, p 242–251 Najork M, Gollapudi S, Panigrahy R (2009) Less is more: sampling the neighborhood graph makes salsa better and faster. Web Search and Data Mining, p 242–251
7.
go back to reference Urvoy T, Chauveau E, Filoche P, Lavergne T (2008) Tracking web spam with html style similarities. ACM Trans Web 2(1):1–28CrossRef Urvoy T, Chauveau E, Filoche P, Lavergne T (2008) Tracking web spam with html style similarities. ACM Trans Web 2(1):1–28CrossRef
8.
go back to reference Gong Y, Kumar S, Verma V, Lazebnik S (2012) Angular quantization-based binary codes for fast similarity search. In: Advances in neural information processing systems, p 1196–1204 Gong Y, Kumar S, Verma V, Lazebnik S (2012) Angular quantization-based binary codes for fast similarity search. In: Advances in neural information processing systems, p 1196–1204
9.
go back to reference He J, Chang S, Radhakrishnan R. Bauer C (2011) Compact hashing with joint optimization of search accuracy and time. In IEEE Conference on Computer Vision and Pattern Recognition, p 753–760, IEEE He J, Chang S, Radhakrishnan R. Bauer C (2011) Compact hashing with joint optimization of search accuracy and time. In IEEE Conference on Computer Vision and Pattern Recognition, p 753–760, IEEE
10.
go back to reference Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. VLDB 99:518–529 Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. VLDB 99:518–529
11.
go back to reference Andoni A, Indyk P (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), p 459–468,IEEE Andoni A, Indyk P (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), p 459–468,IEEE
12.
go back to reference Torralba A, Fergus R, Weiss Y (2008) Small codes and large image databases for recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, p 1–8. IEEE Torralba A, Fergus R, Weiss Y (2008) Small codes and large image databases for recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, p 1–8. IEEE
13.
go back to reference Liu W, Wang J, Kumar S, Chang S (2011) Hashing with graphs. In: Proceedings of the 28th international conference on machine learning, p 1–8 Liu W, Wang J, Kumar S, Chang S (2011) Hashing with graphs. In: Proceedings of the 28th international conference on machine learning, p 1–8
14.
go back to reference Shi QF, Petterson J, Dror G, Langford J, Smola A, Vishwanathan SVN (2009) Hash kernels for structured data. J Mach Learn Res 10:2615–2637 Shi QF, Petterson J, Dror G, Langford J, Smola A, Vishwanathan SVN (2009) Hash kernels for structured data. J Mach Learn Res 10:2615–2637
15.
go back to reference Mu YD, Hua G, Fan W, Chang SF (2014) Hash-SVM: scalable kernel machines for large-scale visual classification. In: IEEE Conference on Computer Vision and Pattern Recognition, p 979–986 Mu YD, Hua G, Fan W, Chang SF (2014) Hash-SVM: scalable kernel machines for large-scale visual classification. In: IEEE Conference on Computer Vision and Pattern Recognition, p 979–986
16.
go back to reference Litayem S, Joly A, Boujemaa N (2012) Hash-based support vector machines approximation for large scale prediction. Br Mach Vis Conf 34(6):1092–1104 Litayem S, Joly A, Boujemaa N (2012) Hash-based support vector machines approximation for large scale prediction. Br Mach Vis Conf 34(6):1092–1104
17.
go back to reference Gong Y, Lazebnik S (2011) Iterative quantization: a procrustean approach to learning binary codes. In: IEEE Conference on Computer Vision and Pattern Recognition, p 817–824. IEEE Gong Y, Lazebnik S (2011) Iterative quantization: a procrustean approach to learning binary codes. In: IEEE Conference on Computer Vision and Pattern Recognition, p 817–824. IEEE
18.
go back to reference Liu W, Wang J, Ji R, Jiang Y, Chang (2012) Supervised hashing with kernels. In: IEEE Conference on Computer Vision and Pattern Recognition, p 2074–2081. IEEE Liu W, Wang J, Ji R, Jiang Y, Chang (2012) Supervised hashing with kernels. In: IEEE Conference on Computer Vision and Pattern Recognition, p 2074–2081. IEEE
19.
go back to reference Joly A, Buisson O (2008) A posteriori multi-probe locality sensitive hashing. In: Proceedings of the 16th ACM international conference on Multimedia, p 209–218. ACM Joly A, Buisson O (2008) A posteriori multi-probe locality sensitive hashing. In: Proceedings of the 16th ACM international conference on Multimedia, p 209–218. ACM
20.
go back to reference Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th annual symposium on Computational geometry, p 253–262. ACM, 2004 Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th annual symposium on Computational geometry, p 253–262. ACM, 2004
21.
go back to reference Kulis B, Darrell T (2009) Learning to hash with binary reconstructive embeddings. In: Advances in neural information processing systems, p 1042–1050 Kulis B, Darrell T (2009) Learning to hash with binary reconstructive embeddings. In: Advances in neural information processing systems, p 1042–1050
22.
go back to reference Kulis B, Grauman K (2012) Kernelized locality-sensitive hashing. IEEE Trans Pattern Anal Mach Intell 34(6):1092–1104CrossRef Kulis B, Grauman K (2012) Kernelized locality-sensitive hashing. IEEE Trans Pattern Anal Mach Intell 34(6):1092–1104CrossRef
23.
go back to reference Raginsky M, Lazebnik S (2009) Locality-sensitive binary codes from shift-invariant kernels. In: Advances in neural information processing systems, p 1509–1517 Raginsky M, Lazebnik S (2009) Locality-sensitive binary codes from shift-invariant kernels. In: Advances in neural information processing systems, p 1509–1517
24.
go back to reference Salakhutdinov R, Hinton G (2009) Semantic hashing. Int J Approx Reason 50(7):969–978CrossRef Salakhutdinov R, Hinton G (2009) Semantic hashing. Int J Approx Reason 50(7):969–978CrossRef
25.
go back to reference Weiss Y, Torralba A, Fergus R (2009) Spectral hashing. In: Advances in neural information processing systems, p 1753–1760 Weiss Y, Torralba A, Fergus R (2009) Spectral hashing. In: Advances in neural information processing systems, p 1753–1760
26.
go back to reference Wang J, Kumar S, Chang S (2010) Sequential projection learning for hashing with compact codes. In: Proceedings of the 27th international conference on machine learning, p 1127–1134 Wang J, Kumar S, Chang S (2010) Sequential projection learning for hashing with compact codes. In: Proceedings of the 27th international conference on machine learning, p 1127–1134
27.
go back to reference Zhang D, Wang J, Cai D, Lu J (2010) Self-taught hashing for fast similarity search. In: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, p 18–25. ACM Zhang D, Wang J, Cai D, Lu J (2010) Self-taught hashing for fast similarity search. In: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, p 18–25. ACM
28.
go back to reference Zhang D, Wang F, Si L (2011) Composite hashing with multiple information sources. In: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, p 225–234. ACM Zhang D, Wang F, Si L (2011) Composite hashing with multiple information sources. In: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, p 225–234. ACM
29.
go back to reference Norouzi M, Blei DM (2011) Minimal loss hashing for compact binary codes. In: Proceedings of the 28th international conference on machine learning, p 353–360 Norouzi M, Blei DM (2011) Minimal loss hashing for compact binary codes. In: Proceedings of the 28th international conference on machine learning, p 353–360
30.
go back to reference Pandey S, Broder A, Chierichetti F, Josifovski V, Kumar R, Vassilvitskii S (2009) Nearest-neighbor caching for content-match applications. In: Proceedings of the 18th international conference on World Wide Web, p 441–450. ACM Pandey S, Broder A, Chierichetti F, Josifovski V, Kumar R, Vassilvitskii S (2009) Nearest-neighbor caching for content-match applications. In: Proceedings of the 18th international conference on World Wide Web, p 441–450. ACM
31.
go back to reference Shrivastava A, Li P (2012) Fast near neighbor search in high-dimensional binary data. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, p 474–489. Springer Shrivastava A, Li P (2012) Fast near neighbor search in high-dimensional binary data. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, p 474–489. Springer
32.
go back to reference Li P, Knig AC, Gui WH (2010) b-bit minwise hashing for estimating three-way similarities. In: Advances in Neural Information Processing Systems Li P, Knig AC, Gui WH (2010) b-bit minwise hashing for estimating three-way similarities. In: Advances in Neural Information Processing Systems
33.
go back to reference Li P, Knig AC (2010) b-bit minwise hashing. In: Proceedings of the 19th international conference on World Wide Web, p 671–680. ACM Li P, Knig AC (2010) b-bit minwise hashing. In: Proceedings of the 19th international conference on World Wide Web, p 671–680. ACM
34.
go back to reference Li P, Knig AC (2011) Theory and applications of b-bit minwise hashing. Commun ACM 54(8):101–109CrossRef Li P, Knig AC (2011) Theory and applications of b-bit minwise hashing. Commun ACM 54(8):101–109CrossRef
35.
go back to reference Li P, Shrivastava A, Moore J, Knig AC (2011) b-bit minwise hashing for large-scale learning. In: Advances in Neural Information Processing Systems. Neural Information Processing Foundation Li P, Shrivastava A, Moore J, Knig AC (2011) b-bit minwise hashing for large-scale learning. In: Advances in Neural Information Processing Systems. Neural Information Processing Foundation
36.
go back to reference Li P, Shrivastava A, Moore J, Knig AC (2011) Hashing algorithms for large-scale learning. In: Advances in neural information processing systems, p 2672–2680 Li P, Shrivastava A, Moore J, Knig AC (2011) Hashing algorithms for large-scale learning. In: Advances in neural information processing systems, p 2672–2680
37.
go back to reference Li P, Moore JL, Knig AC (2011) b-bit minwise hashing for large-scale linear svm. Technical report Li P, Moore JL, Knig AC (2011) b-bit minwise hashing for large-scale linear svm. Technical report
38.
go back to reference Yuan XP, Long J, Zhang ZP, Luo YY, Zhang H, Gui WH (2013) Connected bit minwise hashing. J Comput Res Dev 50(4):883–890 Yuan XP, Long J, Zhang ZP, Luo YY, Zhang H, Gui WH (2013) Connected bit minwise hashing. J Comput Res Dev 50(4):883–890
39.
go back to reference Yuan XP, Long J, Zhang ZP, Luo YY, Zhang H, Gui WH (2012) f-fractional bit minwise hashing. JSoftw 7(1):228–236 Yuan XP, Long J, Zhang ZP, Luo YY, Zhang H, Gui WH (2012) f-fractional bit minwise hashing. JSoftw 7(1):228–236
40.
go back to reference Li P, Owen A, Zhang CH (2012) One permutation hashing. In: Advances in Neural Information Processing Systems, p 3113–3121 Li P, Owen A, Zhang CH (2012) One permutation hashing. In: Advances in Neural Information Processing Systems, p 3113–3121
41.
go back to reference Fetterly D, Manasse M, Najork M, Wiener J (2003) A large-scale study of the evolution of web pages. In: Proceedings of the 12th international conference on World Wide Web, p 669678, New York, NY, USA. Proceedings of the 12th International World Wide Web Conference, ACM Fetterly D, Manasse M, Najork M, Wiener J (2003) A large-scale study of the evolution of web pages. In: Proceedings of the 12th international conference on World Wide Web, p 669678, New York, NY, USA. Proceedings of the 12th International World Wide Web Conference, ACM
42.
go back to reference Tang JJ, Tian YJ (2015) f-fractional bit minwise hashing for large-scale learning. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Vol 3, p 60–63. IEEE Tang JJ, Tian YJ (2015) f-fractional bit minwise hashing for large-scale learning. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Vol 3, p 60–63. IEEE
43.
go back to reference Tang JJ, Tian YJ, Liu DL (2015) Connected bit minwise hashing for large-scale linear svm. In: Fuzzy Systems and Knowledge Discovery (FSKD), 2015 12th International Conference on, p 995–1002. IEEE Tang JJ, Tian YJ, Liu DL (2015) Connected bit minwise hashing for large-scale linear svm. In: Fuzzy Systems and Knowledge Discovery (FSKD), 2015 12th International Conference on, p 995–1002. IEEE
44.
go back to reference Tian YJ, Ping Y (2014) Large-scale linear nonparallel support vector machine solver. Neural Netw 50:166–174CrossRef Tian YJ, Ping Y (2014) Large-scale linear nonparallel support vector machine solver. Neural Netw 50:166–174CrossRef
45.
go back to reference Manku GS, Jain A, Sarma AD (2007) Detecting near-duplicates for web crawlings. Proceedings of the 16th international conference on World Wide Web, (10):141–150 Manku GS, Jain A, Sarma AD (2007) Detecting near-duplicates for web crawlings. Proceedings of the 16th international conference on World Wide Web, (10):141–150
46.
go back to reference Tashev I, Acero A (2010) Statistical modeling of the speech signal. In: International Workshop on Acoustic, Echo, and Noise Control (IWAENC) Tashev I, Acero A (2010) Statistical modeling of the speech signal. In: International Workshop on Acoustic, Echo, and Noise Control (IWAENC)
47.
go back to reference Yuan XP, Long J, Zhang ZP, Luo YY, Zhang H, Gui WH (August 2012) Research on optimal fractional bit minwise hashing. Computer Science, 39(8) Yuan XP, Long J, Zhang ZP, Luo YY, Zhang H, Gui WH (August 2012) Research on optimal fractional bit minwise hashing. Computer Science, 39(8)
48.
go back to reference Hsieh CJ, Chang KW, Lin CJ, Keerthi SS, Sundararajan S (2008) A dual coordinate descent method for large-scale linear svm. In: International Conference on Machine Learning, p 408–415 Hsieh CJ, Chang KW, Lin CJ, Keerthi SS, Sundararajan S (2008) A dual coordinate descent method for large-scale linear svm. In: International Conference on Machine Learning, p 408–415
49.
go back to reference Cherkasova L, Eshghi K, Morrey CB, Tucek J, Veitch A (2009) Applying syntactic similarity algorithms for enterprise information management. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, p 1087–1096. ACM Cherkasova L, Eshghi K, Morrey CB, Tucek J, Veitch A (2009) Applying syntactic similarity algorithms for enterprise information management. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, p 1087–1096. ACM
50.
go back to reference Forman G, Eshghi K, Suermondt J (2009) Efficient detection of large-scale redundancy in enterprise file systems. ACM SIGOPS Op Sys Rev 43(1):84–91CrossRef Forman G, Eshghi K, Suermondt J (2009) Efficient detection of large-scale redundancy in enterprise file systems. ACM SIGOPS Op Sys Rev 43(1):84–91CrossRef
51.
go back to reference Bendersky M, Croft WB (2009) Finding text reuse on the web. In: Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, p 262–271. ACM Bendersky M, Croft WB (2009) Finding text reuse on the web. In: Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, p 262–271. ACM
52.
go back to reference Biggio B, Fumera G, Roli F (2014) Security evaluation of pattern classifiers under attack. IEEE Trans Knowl Data Eng 26(4):984–996CrossRef Biggio B, Fumera G, Roli F (2014) Security evaluation of pattern classifiers under attack. IEEE Trans Knowl Data Eng 26(4):984–996CrossRef
53.
go back to reference Cox RV, Kamm CA, Rabiner L, Schroeter J, Wilpon JG (2000) Speech and language processing for next-millennium communications services. Proc IEEE 88(8):1314–1337CrossRef Cox RV, Kamm CA, Rabiner L, Schroeter J, Wilpon JG (2000) Speech and language processing for next-millennium communications services. Proc IEEE 88(8):1314–1337CrossRef
54.
go back to reference Camastra F, Verri A (2005) A novel kernel method for clustering. IEEE Trans Pattern Anal Mach Intell 27(5):801–805CrossRef Camastra F, Verri A (2005) A novel kernel method for clustering. IEEE Trans Pattern Anal Mach Intell 27(5):801–805CrossRef
55.
go back to reference Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) Liblinear: a library for large linear classification. J Mach Learn Res 9(4):1871–1874 Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) Liblinear: a library for large linear classification. J Mach Learn Res 9(4):1871–1874
56.
go back to reference Joachims T (2006) Training linear svms in linear time. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (10):217–226. KDD ’06 Joachims T (2006) Training linear svms in linear time. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (10):217–226. KDD ’06
57.
go back to reference Border AZ, Bottou L, Gallinari P (2009) Sgd-qn: careful quasi-newton stochastic gradient descent. J Mach Learn Res 10:1737–1754 Border AZ, Bottou L, Gallinari P (2009) Sgd-qn: careful quasi-newton stochastic gradient descent. J Mach Learn Res 10:1737–1754
58.
go back to reference Zhang T (July 4-8 2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. International Conference on Machine Learning Zhang T (July 4-8 2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. International Conference on Machine Learning
59.
go back to reference Singer Y, Srebro N (2007) Pegasos: Primal estimated sub-gradient solver for svm. International Conference on Machine Learning, pages 807–814 Singer Y, Srebro N (2007) Pegasos: Primal estimated sub-gradient solver for svm. International Conference on Machine Learning, pages 807–814
60.
go back to reference Indyk P (2001) A small approximately min-wise independent family of hash functions. J Algorithms 38(1):84–90CrossRef Indyk P (2001) A small approximately min-wise independent family of hash functions. J Algorithms 38(1):84–90CrossRef
61.
62.
go back to reference Li P, Shrivastava A, Konig C (2011) Training logistic regression and svm on 200gb data using b-bit minwise hashing and comparisons with vowpal wabbit (vw). arXiv preprint arXiv:1108.3072 Li P, Shrivastava A, Konig C (2011) Training logistic regression and svm on 200gb data using b-bit minwise hashing and comparisons with vowpal wabbit (vw). arXiv preprint arXiv:​1108.​3072
63.
go back to reference Li P (2015) 0-bit consistent weighted sampling. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p 665–674. ACM Li P (2015) 0-bit consistent weighted sampling. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p 665–674. ACM
64.
go back to reference Li P, Mitzenmacher M, Shrivastava A (2016) 2-bit random projections, nonlinear estimators, and approximate near neighbor search. arXiv preprint arXiv:1602.06577 Li P, Mitzenmacher M, Shrivastava A (2016) 2-bit random projections, nonlinear estimators, and approximate near neighbor search. arXiv preprint arXiv:​1602.​06577
66.
go back to reference Shrivastava A, Li P (2015) Asymmetric minwise hashing for indexing binary inner products and set containment. In: Proceedings of the 24th International Conference on World Wide Web, p 981–991. ACM Shrivastava A, Li P (2015) Asymmetric minwise hashing for indexing binary inner products and set containment. In: Proceedings of the 24th International Conference on World Wide Web, p 981–991. ACM
67.
go back to reference Shrivastava A, Li P (2014) Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In: Advances in Neural Information Processing Systems, p 2321–2329 Shrivastava A, Li P (2014) Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In: Advances in Neural Information Processing Systems, p 2321–2329
69.
go back to reference Shrivastava A, Li P (2014) Densifying one permutation hashing via rotation for fast near neighbor search. In: International Conference on Machine Learning, p 557–565 Shrivastava A, Li P (2014) Densifying one permutation hashing via rotation for fast near neighbor search. In: International Conference on Machine Learning, p 557–565
70.
go back to reference Shah RD, Meinshausen N (2013) Min-wise hashing for large-scale regression and classification with sparse data. arXiv preprint arXiv:1308.1269 Shah RD, Meinshausen N (2013) Min-wise hashing for large-scale regression and classification with sparse data. arXiv preprint arXiv:​1308.​1269
71.
go back to reference Li P, Shrivastava A, Konig CA (2012) Gpu-based minwise hashing: Gpu-based minwise hashing. In: Proceedings of the 21st International Conference on World Wide Web, p 565–566. ACM Li P, Shrivastava A, Konig CA (2012) Gpu-based minwise hashing: Gpu-based minwise hashing. In: Proceedings of the 21st International Conference on World Wide Web, p 565–566. ACM
72.
go back to reference Tabei Y, Yamanishi Y (2013) Scalable prediction of compound-protein interactions using minwise hashing. BMC Syst Biol 7(6):1 Tabei Y, Yamanishi Y (2013) Scalable prediction of compound-protein interactions using minwise hashing. BMC Syst Biol 7(6):1
73.
go back to reference Dahlgaard S, Thorup M (2014) Approximately minwise independence with twisted tabulation. In: Scandinavian Workshop on Algorithm Theory, p 134–145. Springer Dahlgaard S, Thorup M (2014) Approximately minwise independence with twisted tabulation. In: Scandinavian Workshop on Algorithm Theory, p 134–145. Springer
Metadata
Title
A Systematic Review on Minwise Hashing Algorithms
Authors
Jingjing Tang
Yingjie Tian
Publication date
26-10-2016
Publisher
Springer Berlin Heidelberg
Published in
Annals of Data Science / Issue 4/2016
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI
https://doi.org/10.1007/s40745-016-0091-y

Other articles of this Issue 4/2016

Annals of Data Science 4/2016 Go to the issue