ABSTRACT
Multimedia search systems must deal with an increasingly large and heterogeneous amount of data. Several challenges exist when deploying real-world search engines for such data. Existing literature does not properly tackle the many efficiency issues that such task requires. In this paper, we address several of the key efficiency aspects required to deploy a distributed search engine, capable of handling several millions of multimedia documents. The search engine builds on a framework designed to: first, ease the distribution of documents and queries across cluster-nodes, second, index media efficiently for fast similarity search and third aggregate ranked results from several heterogeneous sources. Moreover, the proposed framework is flexible enough to support several state-of-the-art indexing and aggregation techniques.
At the heart of the indexing architecture lies an inverse index structure optimized for sparse hashes, that speeds up the retrieval of similar descriptors. To leverage the distributed nature of the search framework, the proposed aggregation technique offers a low temporal complexity overhead and it is agnostic to the index type (a key aspect to support simultaneous modalities). A comprehensive evaluation with both general IR metrics and efficiency metrics, provides a unique assessment of the several efficiency bottlenecks faced by a search engine. In addition, we test the scalability of the search framework to multiple index sizes, i.e., up to 5 million documents per cluster-node.
- M. Abedini, L. Cao, N. Codella, J. H. Connell, A. Garnavi, Rahiland Geva, M. Merler, Q.-B. Nguyen, J. U. Pankanti, Sharathchandraand R. Smith, and A. Sun, Xingzhiand Tzadok. Ibm research at imageclef 2013 medical tasks. In WN of CLEF'13, 2013.Google Scholar
- M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An Algorithm for Designing Overcomplete Dictionaries for Sparse Representation. IEEE Trans. on Signal Processing, 54(11):4311--4322, 2006. Google ScholarDigital Library
- R. Aly, D. Hiemstra, and T. Demeester. Taily: Shard selection using the tail of score distributions. In SIGIR'13, pages 673--682, 2013. Google ScholarDigital Library
- A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Commun. ACM, 51(1):117--122, 2008. Google ScholarDigital Library
- C. Badue, R. Baeza-Yates, B. Ribeiro-neto, and N. Ziviani. Distributed query processing using partitioned inverted files. In SPIRE'01, pages 10--20. IEEE CS Press, 2001.Google ScholarCross Ref
- R. Baeza-Yates, V. Murdock, and C. Hauff. Efficiency trade-offs in two-tier web search systems. In SIGIR'09, pages 163--170, 2009. Google ScholarDigital Library
- M. Batko, F. Falchi, C. Lucchese, D. Novak, R. Perego, F. Rabitti, J. Sedmidubsky, and P. Zezula. Building a web-scale image similarity search system. Multimedia Tools and Applications, 47(3):599--629, 2010. Google ScholarDigital Library
- P. Borges, A. Mourão, and J. Magalhães. High-dimensional indexing by sparse approximation. In ICMR'15. ACM, 2015. Google ScholarDigital Library
- S. Büttcher, C. L. Clarke, and G. V. Cormack. Information retrieval: Implementing and evaluating search engines. Mit Press, 2010. Google ScholarDigital Library
- G. V. Cormack, C. L. A. Clarke, and S. Buettcher. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proc of. SIGIR '09, pages 758--759, 2009. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008. Google ScholarDigital Library
- A. Garcia Seco de Herrera, J. Kalpathy-Cramer, D. Demner Fushman, S. Antani, and H. Müller. Overview of the imageclef 2013 medical tasks. In WN on CLEF'13, 2013.Google Scholar
- J.-P. Heo, Y. Lee, J. He, S.-F. Chang, and S.-E. Yoon. Spherical hashing. In CVPR'12, pages 2957--2964, 2012. Google ScholarDigital Library
- H. Jégou, M. Douze, and C. Schmid. Product quantization for nearest neighbor search. IEEE Trans. on PAMI, 33(1):117--128, 2011. Google ScholarDigital Library
- H. Jégou, T. Furon, and J.-J. Fuchs. Anti-sparse coding for approximate nearest neighbor search. In Proc. on ICASSP'98, pages 2029--2032, 2012.Google ScholarCross Ref
- H. Jégou, R. Tavenard, M. Douze, and L. Amsaleg. Searching in one billion vectors: re-rank with source coding. ArXiv e-prints, 2011.Google Scholar
- R. Ji, L.-Y. Duan, J. Chen, L. Xie, H. Yao, and W. Gao. Learning to distribute vocabulary indexing for scalable visual search. IEEE Trans. on Multimedia, 15(1):153--166, 2013. Google ScholarDigital Library
- H. Lee, A. Battle, R. Raina, and A. Ng. Efficient sparse coding algorithms. NIPS'06, pages 801--808, 2006.Google Scholar
- M. S. Lewicki and T. J. Sejnowski. Learning overcomplete representations. Neural Comput., 12(2):337--365, 2000. Google ScholarDigital Library
- C. Macdonald and I. Ounis. Voting for candidates: adapting data fusion techniques for an expert search task. In CIKM'06, 2006. Google ScholarDigital Library
- H. Mohamed and S. Marchand-Maillet. Distributed media indexing based on mpi and mapreduce. Multimedia Tools and Applications, 69(2):513--537, 2014. Google ScholarDigital Library
- D. Moise, D. Shestakov, G. Gudmundsson, and L. Amsaleg. Indexing and searching 100m images with map-reduce. In ICMR'13, pages 17--24, 2013. Google ScholarDigital Library
- M. Montague and J. A. Aslam. Relevance score normalization for metasearch. In CIKM'01, pages 427--433, 2001. Google ScholarDigital Library
- A. Mourão, F. Martins, and J. Magalhães. Inverse square rank fusion for multimodal search. In CBMI'14, pages 1--6. IEEE, 2014.Google ScholarCross Ref
- M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. on PAMI, 36, 2014.Google Scholar
- D. Novak and P. Zezula. M-chord: A scalable distributed similarity search structure. In InfoScale'06, 2006. Google ScholarDigital Library
- Y. Pati, R. Rezaiifar, and P. Krishnaprasad. Orthogonal Matching Pursuit: recursive function approximation with application to wavelet decomposition. In ASILOMAR, pages 40--44, 1993.Google ScholarCross Ref
- J. A. Shaw and E. A. Fox. Combination of multiple searches. In TREC'94, pages 243--252, 1994.Google Scholar
- T. Skopal, J. Pokornỳ, and V. Snasel. Pm-tree: Pivoting metric tree for similarity search in multimedia databases. In ADBIS'04, 2004.Google Scholar
- E. Tiakas, D. Rafailidis, A. Dimou, and P. Daras. Msidx: Multi-sort indexing for efficient content-based image search and retrieval. Trans. on Multimedia, 15(6):1415--1430, 2013. Google ScholarDigital Library
- A. Torralba, R. Fergus, and W. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. Trans. on PAMI, 30(11):1958--1970, 2008. Google ScholarDigital Library
- A. Torralba, R. Fergus, and Y. Weiss. Small codes and large image databases for recognition. In CVPR'08, pages 1--8, 2008.Google ScholarCross Ref
- Y. Weiss, A. Torralba, and R. Fergus. Spectral hashing. NIPS'08, 9(1):1--8, 2008.Google Scholar
- A. Y. Yang, A. Ganesh, Z. Zhou, S. Sastry, and Y. Ma. A review of fast l1-minimization algorithms for robust face recognition. Arxiv CoRR, abs/1007.3753, 2010.Google Scholar
- X. Zhu, Z. Huang, H. Cheng, J. Cui, and H. T. Shen. Sparse hashing for fast multimedia search. ACM Trans. Inf. Syst., 31(2):1--9, 2013. Google ScholarDigital Library
Index Terms
- Scalable Multimodal Search with Distributed Indexing by Sparse Hashing
Recommendations
Scalable Supervised Discrete Hashing for Large-Scale Search
WWW '18: Proceedings of the 2018 World Wide Web ConferenceSupervised hashing methods have attracted much attention in these years. However, most existing supervised hashing algorithms have some of the following problems. First, most of them leverage the pairwise similarity matrix, whose size is quadratic to ...
Effective rank aggregation for metasearching
Nowadays, mashup services and especially metasearch engines play an increasingly important role on the Web. Most of users use them directly or indirectly to access and aggregate information from more than one data sources. Similarly to the rest of the ...
High-Dimensional Sparse Cross-Modal Hashing with Fine-Grained Similarity Embedding
WWW '21: Proceedings of the Web Conference 2021Recently, with the discoveries in neurobiology, high-dimensional sparse hashing has attracted increasing attention. In contrast with general hashing that generates low-dimensional hash codes, the high-dimensional sparse hashing maps inputs into a ...
Comments