2016 | OriginalPaper | Book chapter

13. Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection

Written by: Ritu Arora, Jessica Trelogan, Trung Nguyen Ba

Published in: Conquering Big Data with High Performance Computing

Publisher: Springer International Publishing

Abstract

Detecting duplicate and related content is a critical data curation task for digital research collections. In large, unstructured, and noisy collections, doing this manually can be challenging, if not impossible. While many automated solutions exist for deduplicating data that contain large numbers of identical copies, it can be particularly difficult to find a solution for identifying redundancy within image-heavy collections that have evolved over a long span of time or have been created collaboratively by large groups. Such collections, especially in academic research settings where the datasets are used for a wide range of publication, teaching, and research activities, can be characterized by (1) large numbers of heterogeneous file formats, (2) repetitive photographic documentation of the same subjects in a variety of conditions, (3) multiple copies or subsets of images with slight modifications (e.g., cropping or color-balancing), and (4) complex file structures and naming conventions that may not be consistent throughout. In this chapter, we present a scalable and automated approach for detecting duplicate, similar, and related images, along with subimages, in digital data collections. Our approach can assist in efficiently managing redundancy in any large image collection on High Performance Computing (HPC) resources. While we illustrate the approach with a large archaeological collection, it is domain-neutral and widely applicable to image-heavy collections on any HPC platform that has general-purpose processors.
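
The contrast the abstract draws between identical copies and slightly modified ones can be illustrated with a minimal sketch. This is not the chapter's method: it is a generic content-hash baseline, and the helper names (`file_digest`, `group_exact_duplicates`) and the sample file names are hypothetical. It groups byte-identical files and shows that a copy differing by a single byte is missed, which is the gap that similarity-based approaches are meant to address.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path
from typing import List


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_exact_duplicates(root: Path) -> List[List[Path]]:
    """Group files under root whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[file_digest(path)].append(path)
    # Keep only digests shared by more than one file.
    return [paths for paths in by_digest.values() if len(paths) > 1]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Stand-in bytes, not real image data (illustrative assumption).
        (root / "a.img").write_bytes(b"\x00\x01\x02\x03")
        (root / "copy_of_a.img").write_bytes(b"\x00\x01\x02\x03")
        (root / "a_edited.img").write_bytes(b"\x00\x01\x02\x04")
        for group in group_exact_duplicates(root):
            print([p.name for p in group])
```

In this sketch only `a.img` and `copy_of_a.img` are grouped; `a_edited.img` is treated as unrelated even though it differs by one byte. A cropped or color-balanced image would be missed in the same way.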

Metadata
Title
Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection
Written by
Ritu Arora
Jessica Trelogan
Trung Nguyen Ba
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-33742-5_13