ABSTRACT
We consider the problem of similarity search over the large datasets in the distributed environment. The proposed framework employs the Vp-Tree algorithm that integrated on top of the MapReduce framework to achieve good performance as well as meet the scalability and fault tolerance requirements for the system while data scale up. Since VP-Tree algorithm was implemented initially for partition and searching data in the local disk access, we proposed a new approach to using it in the parallel environment. The key point of the Vp-Tree algorithm is that it distributed the similar data points into groups, thereby reducing number of data need to scan during the searching stage. Consequently, the response time of the entire system has been improved. Otherwise, we used an open source computer vision library OpenCV for detect the similarity among images in the dataset. We evaluate the performance of our proposed framework using a synthetic data to show the positive of our approach. The experiment shows that our proposed framework achieves 57% improvement in response time in comparison with running searching job in tradition Hadoop framework. We also compared our application running time on Docker container against VM-based environment. The result points out that deploy our system over Docker container provide higher performance than VM-based environment in term of response time.
- OpenCV Library Tutorials. http://docs.opencv.org/doc/tutorials/tutorials.html.Google Scholar
- H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (surf). Comput. Vis. Image Underst., 110(3):346--359, June 2008. Google ScholarDigital Library
- S. Berchtold, D. A. Keim, and H.-P. Kriegel. The x-tree: An index structure for high-dimensional data. In Proceedings of the 22th International Conference on Very Large Data Bases, VLDB '96, pages 28--39, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- K. Chakrabarti and S. Mehrotra. The hybrid tree: an index structure for high dimensional feature spaces. In Data Engineering, 1999. Proceedings., 15th International Conference on, pages 440--447, Mar 1999. Google ScholarDigital Library
- D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. Mapping the world's photos. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 761--770, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107--113, Jan. 2008. Google ScholarDigital Library
- A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB '99, pages 518--529, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- N. Kumar, L. Zhang, and S. Nayar. What is a good nearest neighbors algorithm for finding similar patches in images? In Proceedings of the 10th European Conference on Computer Vision: Part II, ECCV '08, pages 364--378, Berlin, Heidelberg, 2008. Springer-Verlag. Google ScholarDigital Library
- H. Liao, J. Han, and J. Fang. Multi-dimensional index on hadoop distributed file system. In Proceedings of the 2010 IEEE Fifth International Conference on Networking, Architecture, and Storage, NAS '10, pages 240--249, Washington, DC, USA, 2010. IEEE Computer Society. Google ScholarDigital Library
- D. Merkel. Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239):2, 2014. Google ScholarDigital Library
- D. Moise, D. Shestakov, G. Gudmundsson, and L. Amsaleg. Indexing and searching 100m images with map-reduce. In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, ICMR '13, pages 17--24, New York, NY, USA, 2013. ACM. Google ScholarDigital Library
- M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2, 2009.Google Scholar
- M. Slaney and M. Casey. Locality-sensitive hashing for finding nearest neighbors {lecture notes}. Signal Processing Magazine, IEEE, 25(2):128--131, March 2008.Google ScholarCross Ref
- P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '93, pages 311--321, Philadelphia, PA, USA, 1993. Society for Industrial and Applied Mathematics. Google ScholarDigital Library
Index Terms
- An Index Scheme for Similarity Search on Cloud Computing using MapReduce over Docker Container
Recommendations
An Adaptive Similarity Search in Massive Datasets
Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIII - Volume 9480Similarity search is an important task engaging in different fields of studies as well as in various application domains. The era of big data, however, has been posing challenges on existing information systems in general and on similarity search in ...
Index Structures for Fast Similarity Search for Binary Vectors
This article reviews index structures for fast similarity search for objects represented by binary vectors (with components equal to 0 or 1). Structures for both exact and approximate search by Hamming distance and other similarity measures are ...
An Efficient Document Indexing-Based Similarity Search in Large Datasets
FDSE 2015: Proceedings of the Second International Conference on Future Data and Security Engineering - Volume 9446In this paper, we principally devote our effort to proposing a novel MapReduce-based approach for efficient similarity search in big data. Specifically, we address the drawbacks of using inverted index in similarity search with MapReduce and then ...
Comments