Top

Annals of Data Science

Published in:

26-10-2016

A Systematic Review on Minwise Hashing Algorithms

Authors: Jingjing Tang, Yingjie Tian

Published in: Annals of Data Science | Issue 4/2016

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

Similarity detection technology captures a host of researchers’ attention. Minwise hashing schemes become the current researching hot spots in machine learning for similarity preservation. During the data preprocessing stage, the basic idea of minwise hashing schemes is to transfer the original data into binary codes which are good proxies of original data to preserve the similarity. Minwise hashing schemes can improve the computation efficiency and save the storage space without notable loss of accuracy. Thus, they have been studied extensively and developed rapidly for decades. Considering minwise hashing algorithm and its variants, a systematic survey is needed and beneficial to understand and utilize this kind of data preprocessing techniques more easily. The purpose of this paper is to review minwise hashing algorithms in detail and provide an insightful understanding of current developments. In order to show the application prospect of the minwise hashing algorithms, various algorithms have combined with linear Support Vector Machine for large-scale classification. Both theoretical analysis and experimental results demonstrate that these algorithms can achieve massive advantages in accuracy, efficiency and energy-consumption. Furthermore, their limitations, major opportunities and challenges, extensions and variants as well as potential important research directions have been pointed out.

previous article Targeted Marketing Using Balance Optimization Subset Selection

next article A Frame Work for XML Ontology to STEP-PDM from Express Entities: A String Matching Approach

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Andoni A, Indyk P (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Foundations of Computer Science, pp 459–468

Broder AZ (1997) On the resemblance and containment of documents. Compression Complex Seq 1997:21–29

Broder AZ, Glassman SC, Manasse MS, Zweig G (Sep 1997) Syntactic clustering of the web. In: Proceedings of the 6th international conference on World Wide Web, Vol 29, p 1157–1166

Gollapudi S, Sharma A (2009) An axiomatic approach for result diversification. In: Proceedings of the 18th international conference on World Wide Web, p 381–390

Kalpakis K, Tang S (2008) Collaborative data gathering in wireless sensor networks using measurement co-occurrence. Comput Commun 31(10):19791992CrossRef

Najork M, Gollapudi S, Panigrahy R (2009) Less is more: sampling the neighborhood graph makes salsa better and faster. Web Search and Data Mining, p 242–251

Urvoy T, Chauveau E, Filoche P, Lavergne T (2008) Tracking web spam with html style similarities. ACM Trans Web 2(1):1–28CrossRef

Gong Y, Kumar S, Verma V, Lazebnik S (2012) Angular quantization-based binary codes for fast similarity search. In: Advances in neural information processing systems, p 1196–1204

He J, Chang S, Radhakrishnan R. Bauer C (2011) Compact hashing with joint optimization of search accuracy and time. In IEEE Conference on Computer Vision and Pattern Recognition, p 753–760, IEEE

10.

Gionis A, Indyk P, Motwani R (1999) Similarity search in high dimensions via hashing. VLDB 99:518–529

11.

Andoni A, Indyk P (2006) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), p 459–468,IEEE

12.

Torralba A, Fergus R, Weiss Y (2008) Small codes and large image databases for recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, p 1–8. IEEE

13.

Liu W, Wang J, Kumar S, Chang S (2011) Hashing with graphs. In: Proceedings of the 28th international conference on machine learning, p 1–8

14.

Shi QF, Petterson J, Dror G, Langford J, Smola A, Vishwanathan SVN (2009) Hash kernels for structured data. J Mach Learn Res 10:2615–2637

15.

Mu YD, Hua G, Fan W, Chang SF (2014) Hash-SVM: scalable kernel machines for large-scale visual classification. In: IEEE Conference on Computer Vision and Pattern Recognition, p 979–986

16.

Litayem S, Joly A, Boujemaa N (2012) Hash-based support vector machines approximation for large scale prediction. Br Mach Vis Conf 34(6):1092–1104

17.

Gong Y, Lazebnik S (2011) Iterative quantization: a procrustean approach to learning binary codes. In: IEEE Conference on Computer Vision and Pattern Recognition, p 817–824. IEEE

18.

Liu W, Wang J, Ji R, Jiang Y, Chang (2012) Supervised hashing with kernels. In: IEEE Conference on Computer Vision and Pattern Recognition, p 2074–2081. IEEE

19.

Joly A, Buisson O (2008) A posteriori multi-probe locality sensitive hashing. In: Proceedings of the 16th ACM international conference on Multimedia, p 209–218. ACM

20.

Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In Proceedings of the 20th annual symposium on Computational geometry, p 253–262. ACM, 2004

21.

Kulis B, Darrell T (2009) Learning to hash with binary reconstructive embeddings. In: Advances in neural information processing systems, p 1042–1050

22.

Kulis B, Grauman K (2012) Kernelized locality-sensitive hashing. IEEE Trans Pattern Anal Mach Intell 34(6):1092–1104CrossRef

23.

Raginsky M, Lazebnik S (2009) Locality-sensitive binary codes from shift-invariant kernels. In: Advances in neural information processing systems, p 1509–1517

24.

Salakhutdinov R, Hinton G (2009) Semantic hashing. Int J Approx Reason 50(7):969–978CrossRef

25.

Weiss Y, Torralba A, Fergus R (2009) Spectral hashing. In: Advances in neural information processing systems, p 1753–1760

26.

Wang J, Kumar S, Chang S (2010) Sequential projection learning for hashing with compact codes. In: Proceedings of the 27th international conference on machine learning, p 1127–1134

27.

Zhang D, Wang J, Cai D, Lu J (2010) Self-taught hashing for fast similarity search. In: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, p 18–25. ACM

28.

Zhang D, Wang F, Si L (2011) Composite hashing with multiple information sources. In: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, p 225–234. ACM

29.

Norouzi M, Blei DM (2011) Minimal loss hashing for compact binary codes. In: Proceedings of the 28th international conference on machine learning, p 353–360

30.

Pandey S, Broder A, Chierichetti F, Josifovski V, Kumar R, Vassilvitskii S (2009) Nearest-neighbor caching for content-match applications. In: Proceedings of the 18th international conference on World Wide Web, p 441–450. ACM

31.

Shrivastava A, Li P (2012) Fast near neighbor search in high-dimensional binary data. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, p 474–489. Springer

32.

Li P, Knig AC, Gui WH (2010) b-bit minwise hashing for estimating three-way similarities. In: Advances in Neural Information Processing Systems

33.

Li P, Knig AC (2010) b-bit minwise hashing. In: Proceedings of the 19th international conference on World Wide Web, p 671–680. ACM

34.

Li P, Knig AC (2011) Theory and applications of b-bit minwise hashing. Commun ACM 54(8):101–109CrossRef

35.

Li P, Shrivastava A, Moore J, Knig AC (2011) b-bit minwise hashing for large-scale learning. In: Advances in Neural Information Processing Systems. Neural Information Processing Foundation

36.

Li P, Shrivastava A, Moore J, Knig AC (2011) Hashing algorithms for large-scale learning. In: Advances in neural information processing systems, p 2672–2680

37.

Li P, Moore JL, Knig AC (2011) b-bit minwise hashing for large-scale linear svm. Technical report

38.

Yuan XP, Long J, Zhang ZP, Luo YY, Zhang H, Gui WH (2013) Connected bit minwise hashing. J Comput Res Dev 50(4):883–890

39.

Yuan XP, Long J, Zhang ZP, Luo YY, Zhang H, Gui WH (2012) f-fractional bit minwise hashing. JSoftw 7(1):228–236

40.

Li P, Owen A, Zhang CH (2012) One permutation hashing. In: Advances in Neural Information Processing Systems, p 3113–3121

41.

Fetterly D, Manasse M, Najork M, Wiener J (2003) A large-scale study of the evolution of web pages. In: Proceedings of the 12th international conference on World Wide Web, p 669678, New York, NY, USA. Proceedings of the 12th International World Wide Web Conference, ACM

42.

Tang JJ, Tian YJ (2015) f-fractional bit minwise hashing for large-scale learning. In: 2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology (WI-IAT), Vol 3, p 60–63. IEEE

43.

Tang JJ, Tian YJ, Liu DL (2015) Connected bit minwise hashing for large-scale linear svm. In: Fuzzy Systems and Knowledge Discovery (FSKD), 2015 12th International Conference on, p 995–1002. IEEE

44.

Tian YJ, Ping Y (2014) Large-scale linear nonparallel support vector machine solver. Neural Netw 50:166–174CrossRef

45.

Manku GS, Jain A, Sarma AD (2007) Detecting near-duplicates for web crawlings. Proceedings of the 16th international conference on World Wide Web, (10):141–150

46.

Tashev I, Acero A (2010) Statistical modeling of the speech signal. In: International Workshop on Acoustic, Echo, and Noise Control (IWAENC)

47.

Yuan XP, Long J, Zhang ZP, Luo YY, Zhang H, Gui WH (August 2012) Research on optimal fractional bit minwise hashing. Computer Science, 39(8)

48.

Hsieh CJ, Chang KW, Lin CJ, Keerthi SS, Sundararajan S (2008) A dual coordinate descent method for large-scale linear svm. In: International Conference on Machine Learning, p 408–415

49.

Cherkasova L, Eshghi K, Morrey CB, Tucek J, Veitch A (2009) Applying syntactic similarity algorithms for enterprise information management. In: Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, p 1087–1096. ACM

50.

Forman G, Eshghi K, Suermondt J (2009) Efficient detection of large-scale redundancy in enterprise file systems. ACM SIGOPS Op Sys Rev 43(1):84–91CrossRef

51.

Bendersky M, Croft WB (2009) Finding text reuse on the web. In: Proceedings of the 2nd ACM International Conference on Web Search and Data Mining, p 262–271. ACM

52.

Biggio B, Fumera G, Roli F (2014) Security evaluation of pattern classifiers under attack. IEEE Trans Knowl Data Eng 26(4):984–996CrossRef

53.

Cox RV, Kamm CA, Rabiner L, Schroeter J, Wilpon JG (2000) Speech and language processing for next-millennium communications services. Proc IEEE 88(8):1314–1337CrossRef

54.

Camastra F, Verri A (2005) A novel kernel method for clustering. IEEE Trans Pattern Anal Mach Intell 27(5):801–805CrossRef

55.

Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) Liblinear: a library for large linear classification. J Mach Learn Res 9(4):1871–1874

56.

Joachims T (2006) Training linear svms in linear time. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, (10):217–226. KDD ’06

57.

Border AZ, Bottou L, Gallinari P (2009) Sgd-qn: careful quasi-newton stochastic gradient descent. J Mach Learn Res 10:1737–1754

58.

Zhang T (July 4-8 2004) Solving large scale linear prediction problems using stochastic gradient descent algorithms. International Conference on Machine Learning

59.

Singer Y, Srebro N (2007) Pegasos: Primal estimated sub-gradient solver for svm. International Conference on Machine Learning, pages 807–814

60.

Indyk P (2001) A small approximately min-wise independent family of hash functions. J Algorithms 38(1):84–90CrossRef

61.

Li P, Konig C (2011). Accurate estimators for improving minwise hashing and b-bit minwise hashing. arXiv preprint arXiv:1108.0895

62.

Li P, Shrivastava A, Konig C (2011) Training logistic regression and svm on 200gb data using b-bit minwise hashing and comparisons with vowpal wabbit (vw). arXiv preprint arXiv:1108.3072

63.

Li P (2015) 0-bit consistent weighted sampling. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p 665–674. ACM

64.

Li P, Mitzenmacher M, Shrivastava A (2016) 2-bit random projections, nonlinear estimators, and approximate near neighbor search. arXiv preprint arXiv:1602.06577

65.

Shrivastava A, Li P (2014) Asymmetric minwise hashing. arXiv preprint arXiv:1411.3787

66.

Shrivastava A, Li P (2015) Asymmetric minwise hashing for indexing binary inner products and set containment. In: Proceedings of the 24th International Conference on World Wide Web, p 981–991. ACM

67.

Shrivastava A, Li P (2014) Asymmetric lsh (alsh) for sublinear time maximum inner product search (mips). In: Advances in Neural Information Processing Systems, p 2321–2329

68.

Shrivastava A (2016) Exact weighted minwise hashing in constant time. arXiv preprint arXiv:1602.08393

69.

Shrivastava A, Li P (2014) Densifying one permutation hashing via rotation for fast near neighbor search. In: International Conference on Machine Learning, p 557–565

70.

Shah RD, Meinshausen N (2013) Min-wise hashing for large-scale regression and classification with sparse data. arXiv preprint arXiv:1308.1269

71.

Li P, Shrivastava A, Konig CA (2012) Gpu-based minwise hashing: Gpu-based minwise hashing. In: Proceedings of the 21st International Conference on World Wide Web, p 565–566. ACM

72.

Tabei Y, Yamanishi Y (2013) Scalable prediction of compound-protein interactions using minwise hashing. BMC Syst Biol 7(6):1

73.

Dahlgaard S, Thorup M (2014) Approximately minwise independence with twisted tabulation. In: Scandinavian Workshop on Algorithm Theory, p 134–145. Springer

Title: A Systematic Review on Minwise Hashing Algorithms
Authors: Jingjing Tang
Yingjie Tian
Publication date: 26-10-2016
Publisher: Springer Berlin Heidelberg
Published in: Annals of Data Science / Issue 4/2016
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI: https://doi.org/10.1007/s40745-016-0091-y

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Technik"

Springer Professional "Wirtschaft+Technik"

Springer Professional "Wirtschaft"

Other articles of this Issue 4/2016

A Frame Work for XML Ontology to STEP-PDM from Express Entities: A String Matching Approach

Building Computational Virtual Reality Environment for Anesthesia

Targeted Marketing Using Balance Optimization Subset Selection

N-Folded Parallel String Matching Mechanism

A Goodness-of-Fit Test for Rayleigh Distribution Based on Hellinger Distance

Estimation of the Lomax Distribution in the Presence of Outliers