ABSTRACT
Most researchers working on high-dimensional indexing agree on the following three trends: (i) the size of the multimedia collections to index are now reaching millions if not billions of items, (ii) the computers we use every day now come with multiple cores and (iii) hardware becomes more available, thanks to easier access to Grids and/or Clouds. This paper shows how the Map-Reduce paradigm can be applied to indexing algorithms and demonstrates that great scalability can be achieved using Hadoop, a popular Map-Reduce-based framework. Dramatic performance improvements are not however guaranteed a priori: such frameworks are rigid, they severely constrain the possible access patterns to data and scares resource RAM has to be shared. Furthermore, algorithms require major redesign, and may have to settle for sub-optimal behavior. The benefits, however, are many: simplicity for programmers, automatic distribution, fault tolerance, failure detection and automatic re-runs and, last but not least, scalability. We share our experience of adapting a clustering-based high-dimensional indexing algorithm to the Map-Reduce model, and of testing it at large scale with Hadoop as we index 30 billion SIFT descriptors. We foresee that lessons drawn from our work could minimize time, effort and energy invested by other researchers and practitioners working in similar directions.
- M. Batko, F. Falchi, C. Lucchese, D. Novak, R. Perego, F. Rabitti, J. Sedmidubský, and P. Zezula. Building a web-scale image similarity search system. Multimedia Tools Appl., 47(3), 2010. Google ScholarDigital Library
- F. Chierichetti, A. Panconesi, P. Raghavan, M. Sozio, A. Tiberi, and E. Upfal. Finding near neighbors through cluster pruning. In PODS, 2007. Google ScholarDigital Library
- M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, 2004. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1), 2008. Google ScholarDigital Library
- M. Douze, H. Jégou, H. Singh, L. Amsaleg, and C. Schmid. Evaluation of gist descriptors for web-scale image search. In CIVR, 2009. Google ScholarDigital Library
- S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. In SOSP, 2003. Google ScholarDigital Library
- A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, 1999. Google ScholarDigital Library
- G. Gudmundsson, B. T. Jónsson, and L. Amsaleg. A large-scale performance study of cluster-based high-dimensional indexing. In VLS-MCMR Workshop with ACM MM, 2010. Google ScholarDigital Library
- J. S. Hare, S. Samangooei, D. P. Dupplaw, and P. H. Lewis. Imageterrier: an extensible platform for scalable high-performance image retrieval. In ICMR, 2012. Google ScholarDigital Library
- H. Jégou, M. Douze, and C. Schmid. Hamming embedding and weak geometric consistency for large scale image search. In ECCV, 2008.Google ScholarDigital Library
- H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. on PAMI, 2011. Google ScholarDigital Library
- H. Jégou, F. Perronnin, M. Douze, J. Sánchez, P. Pérez, and C. Schmid. Aggregating local image descriptors into compact codes. IEEE Trans. on PAMI, 2011.Google Scholar
- H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg. Searching in one billion vectors: Re-rank with source coding. In ICASSP, 2011.Google ScholarCross Ref
- Y. Jégou, S. Lantéri, J. Leduc, and all. Grid'5000: a large scale and highly reconfigurable experimental Grid testbed. Intl. Journal of HPC Applications, 20(4), 2006. Google ScholarDigital Library
- A. Joly and O. Buisson. A posteriori multi-probe locality sensitive hashing. In MM, 2008. Google ScholarDigital Library
- H. Lejsek, F. H. Amundsson, B. T. Jónsson, and L. Amsaleg. NV-Tree: An efficient disk-based index for approximate search in very large high-dimensional collections. IEEE Trans. on PAMI, 2009. Google ScholarDigital Library
- H. Lejsek, B. T. Jónsson, and L. Amsaleg. NV-Tree: nearest neighbors at the billion scale. In ICMR, 2011. Google ScholarDigital Library
- D. Lowe. Distinctive image features from scale invariant keypoints. IJCV, 60(2), 2004. Google ScholarDigital Library
- Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-probe lsh: efficient indexing for high-dimensional similarity search. In VLDB, 2007. Google ScholarDigital Library
- L. Paulevé, H. Jégou, and L. Amsaleg. Locality sensitive hashing: A comparison of hash function types and querying mechanisms. Pattern Recognition Letters, 2010. Google ScholarDigital Library
- J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In CVPR, 2008.Google ScholarCross Ref
- K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In MSST, 2010. Google ScholarDigital Library
- J. Sivic and A. Zisserman. Video google: A text retrieval approach to object matching in videos. In ICCV, 2003. Google ScholarDigital Library
Index Terms
- Indexing and searching 100M images with map-reduce
Recommendations
Scale-out beyond map-reduce
KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data miningThe amount and variety of data being collected in the enterprise is growing at a staggering pace. The default now is to capture and store any and all data, in anticipation of potential future strategic value, and vast amounts of data are being generated ...
Comparison of Map-Reduce and SQL on Large-Scale Data Processing
ISPA '10: Proceedings of the International Symposium on Parallel and Distributed Processing with ApplicationsPopularity for the term ‘Cloud-Computing’ has been increasing in recent years. There are many great companies such as Yahoo, Google etc. tried to provide related services to business community, even through public users. In addition to the SQL technique,...
Rainfall Prediction using Artificial Neural Network on Map-Reduce Framework
WCI '15: Proceedings of the Third International Symposium on Women in Computing and InformaticsBig data is a celebrated topic in Business as well as research community for several years. With the revolution of Big Data, it is becoming easy and less expensive to store tremendous amount of data for future analysis. Weather data gets accumulated very ...
Comments