2018 | OriginalPaper | Chapter

An LSH-Based Model-Words-Driven Product Duplicate Detection Method

Authors: Aron Hartveld, Max van Keulen, Diederik Mathol, Thomas van Noort, Thomas Plaatsman, Flavius Frasincar, Kim Schouten

Published in: Advanced Information Systems Engineering

Publisher: Springer International Publishing

Abstract

The online shopping market is growing rapidly in the 21st century, leading to a huge number of duplicate products being sold online. Duplicate detection is an important component of aggregating online products, but it is a time-consuming process. In this paper, we focus on reducing the number of candidate duplicate pairs passed as input to the Multi-component Similarity Method (MSM), a state-of-the-art duplicate detection solution. To find the candidate pairs, Locality Sensitive Hashing (LSH) is employed. A previously proposed LSH-based algorithm makes use of binary vectors based on the model words in the product titles. This paper proposes several extensions to this algorithm: advanced data cleaning and the additional use of information from the key-value pairs. Compared to MSM, the MSMP+ method proposed in this paper leads to a minor reduction of \(6\%\) in the \(F_1\)-measure while reducing the number of needed computations by \(95\%\).
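
The candidate-pair idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the model-word definition, the hash family, or the LSH parameters, so the regular expression (a token mixing letters and digits), the linear min-hash functions, and the `bands` and `rows_per_band` values below are all assumptions chosen for illustration. The sketch uses a common min-hash-and-banding scheme over the set of model words in each title, which is equivalent to hashing the binary presence vector.

```python
import random
import re
from collections import defaultdict
from itertools import combinations

# Assumption: a "model word" is a title token containing both a letter and a digit.
MODEL_WORD = re.compile(r"\b(?=\w*[a-z])(?=\w*\d)\w+\b")
PRIME = 2_147_483_647  # modulus for the illustrative linear hash functions


def model_words(title):
    """Return the set of model words found in a lower-cased title."""
    return set(MODEL_WORD.findall(title.lower()))


def minhash_signature(words, index, hash_params):
    """Min-hash the vocabulary positions of `words` with (a*x + b) mod PRIME."""
    rows = [index[w] for w in words]
    return [min((a * r + b) % PRIME for r in rows) for a, b in hash_params]


def candidate_pairs(titles, bands=4, rows_per_band=2, seed=0):
    """Return index pairs of titles that share at least one LSH bucket."""
    rng = random.Random(seed)
    word_sets = [model_words(t) for t in titles]
    vocabulary = sorted(set().union(*word_sets))
    index = {w: i for i, w in enumerate(vocabulary)}
    hash_params = [
        (rng.randrange(1, PRIME), rng.randrange(0, PRIME))
        for _ in range(bands * rows_per_band)
    ]
    pairs = set()
    for band in range(bands):
        lo, hi = band * rows_per_band, (band + 1) * rows_per_band
        buckets = defaultdict(list)
        for pid, words in enumerate(word_sets):
            if not words:
                continue  # titles without model words cannot be hashed here
            signature = minhash_signature(words, index, hash_params)
            buckets[tuple(signature[lo:hi])].append(pid)
        for bucket in buckets.values():
            pairs.update(combinations(bucket, 2))
    return pairs


# Hypothetical product titles, made up for this example.
titles = [
    'Samsung UN46ES6580 46" 1080p LED TV',
    "Samsung LED TV UN46ES6580 (1080p)",
    'LG 55LM7600 55" 3D TV 240hz',
]
pairs = candidate_pairs(titles)
```

In this toy input, only the two Samsung titles share model words, so only that pair would be passed on to a more expensive similarity method such as MSM; the other two of the three possible pairs are never compared. More bands make the filter more permissive, while more rows per band make it stricter.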

DOI: https://doi.org/10.1007/978-3-319-91563-0_25
