research-article

An Index Scheme for Similarity Search on Cloud Computing using MapReduce over Docker Container

Authors:
DT-Tri Nguyen

Department of Computer Engineer, Kyung Hee University, Suwon, South Korea

Department of Computer Engineer, Kyung Hee University, Suwon, South Korea
View Profile

,
Chan Ho Yong

Department of Computer Engineer, Kyung Hee University, Suwon, South Korea

Department of Computer Engineer, Kyung Hee University, Suwon, South Korea
View Profile

,
Xuan-Qui Pham

Department of Computer Engineer, Kyung Hee University, Suwon, South Korea

Department of Computer Engineer, Kyung Hee University, Suwon, South Korea
View Profile

,
Huu-Quoc Nguyen

Department of Computer Engineer, Kyung Hee University, Suwon, South Korea

Department of Computer Engineer, Kyung Hee University, Suwon, South Korea
View Profile

,
Ton Thi Kim Loan

Department of Computer Engineer, Kyung Hee University, Suwon, South Korea

Department of Computer Engineer, Kyung Hee University, Suwon, South Korea
View Profile

,
Eui-Nam Huh

Department of Computer Engineer, Kyung Hee University, Suwon, South Korea

Department of Computer Engineer, Kyung Hee University, Suwon, South Korea
View Profile

IMCOM '16: Proceedings of the 10th International Conference on Ubiquitous Information Management and CommunicationJanuary 2016Article No.: 60Pages 1–6https://doi.org/10.1145/2857546.2857607

Published:04 January 2016Publication History

IMCOM '16: Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication

Pages 1–6

ABSTRACT

We consider the problem of similarity search over the large datasets in the distributed environment. The proposed framework employs the Vp-Tree algorithm that integrated on top of the MapReduce framework to achieve good performance as well as meet the scalability and fault tolerance requirements for the system while data scale up. Since VP-Tree algorithm was implemented initially for partition and searching data in the local disk access, we proposed a new approach to using it in the parallel environment. The key point of the Vp-Tree algorithm is that it distributed the similar data points into groups, thereby reducing number of data need to scan during the searching stage. Consequently, the response time of the entire system has been improved. Otherwise, we used an open source computer vision library OpenCV for detect the similarity among images in the dataset. We evaluate the performance of our proposed framework using a synthetic data to show the positive of our approach. The experiment shows that our proposed framework achieves 57% improvement in response time in comparison with running searching job in tradition Hadoop framework. We also compared our application running time on Docker container against VM-based environment. The result points out that deploy our system over Docker container provide higher performance than VM-based environment in term of response time.

References

OpenCV Library Tutorials. http://docs.opencv.org/doc/tutorials/tutorials.html.Google Scholar
H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (surf). Comput. Vis. Image Underst., 110(3):346--359, June 2008. Google ScholarDigital Library
S. Berchtold, D. A. Keim, and H.-P. Kriegel. The x-tree: An index structure for high-dimensional data. In Proceedings of the 22th International Conference on Very Large Data Bases, VLDB '96, pages 28--39, San Francisco, CA, USA, 1996. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
K. Chakrabarti and S. Mehrotra. The hybrid tree: an index structure for high dimensional feature spaces. In Data Engineering, 1999. Proceedings., 15th International Conference on, pages 440--447, Mar 1999. Google ScholarDigital Library
D. J. Crandall, L. Backstrom, D. Huttenlocher, and J. Kleinberg. Mapping the world's photos. In Proceedings of the 18th International Conference on World Wide Web, WWW '09, pages 761--770, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. Commun. ACM, 51(1):107--113, Jan. 2008. Google ScholarDigital Library
A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB '99, pages 518--529, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
N. Kumar, L. Zhang, and S. Nayar. What is a good nearest neighbors algorithm for finding similar patches in images? In Proceedings of the 10th European Conference on Computer Vision: Part II, ECCV '08, pages 364--378, Berlin, Heidelberg, 2008. Springer-Verlag. Google ScholarDigital Library
H. Liao, J. Han, and J. Fang. Multi-dimensional index on hadoop distributed file system. In Proceedings of the 2010 IEEE Fifth International Conference on Networking, Architecture, and Storage, NAS '10, pages 240--249, Washington, DC, USA, 2010. IEEE Computer Society. Google ScholarDigital Library
D. Merkel. Docker: lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239):2, 2014. Google ScholarDigital Library
D. Moise, D. Shestakov, G. Gudmundsson, and L. Amsaleg. Indexing and searching 100m images with map-reduce. In Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, ICMR '13, pages 17--24, New York, NY, USA, 2013. ACM. Google ScholarDigital Library
M. Muja and D. G. Lowe. Fast approximate nearest neighbors with automatic algorithm configuration. VISAPP (1), 2, 2009.Google Scholar
M. Slaney and M. Casey. Locality-sensitive hashing for finding nearest neighbors {lecture notes}. Signal Processing Magazine, IEEE, 25(2):128--131, March 2008.Google ScholarCross Ref
P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '93, pages 311--321, Philadelphia, PA, USA, 1993. Society for Industrial and Applied Mathematics. Google ScholarDigital Library

Index Terms

An Index Scheme for Similarity Search on Cloud Computing using MapReduce over Docker Container
1. Information systems
  1. Information retrieval
  2. Information storage systems

Recommendations

An Adaptive Similarity Search in Massive Datasets
Transactions on Large-Scale Data- and Knowledge-Centered Systems XXIII - Volume 9480

Similarity search is an important task engaging in different fields of studies as well as in various application domains. The era of big data, however, has been posing challenges on existing information systems in general and on similarity search in ...
Read More
Index Structures for Fast Similarity Search for Binary Vectors

This article reviews index structures for fast similarity search for objects represented by binary vectors (with components equal to 0 or 1). Structures for both exact and approximate search by Hamming distance and other similarity measures are ...
Read More
An Efficient Document Indexing-Based Similarity Search in Large Datasets
FDSE 2015: Proceedings of the Second International Conference on Future Data and Security Engineering - Volume 9446

In this paper, we principally devote our effort to proposing a novel MapReduce-based approach for efficient similarity search in big data. Specifically, we address the drawbacks of using inverted index in similarity search with MapReduce and then ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in

IMCOM '16: Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication
January 2016
658 pages
ISBN:9781450341424
DOI:10.1145/2857546

Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 4 January 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
MapReduce
Similarity search
index scheme
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
Overall Acceptance Rate213of621submissions,34%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 9
  Total Citations
  View Citations
- 234
  Total Downloads
- Downloads (Last 12 months)5
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

An Index Scheme for Similarity Search on Cloud Computing using MapReduce over Docker Container

IMCOM '16: Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication

ABSTRACT

References

Cited By

Index Terms

Recommendations

An Adaptive Similarity Search in Massive Datasets

Index Structures for Fast Similarity Search for Binary Vectors

An Efficient Document Indexing-Based Similarity Search in Large Datasets

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

An Index Scheme for Similarity Search on Cloud Computing using MapReduce over Docker Container

IMCOM '16: Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication

ABSTRACT

References

Cited By

Index Terms

Recommendations

An Adaptive Similarity Search in Massive Datasets

Index Structures for Fast Similarity Search for Binary Vectors

An Efficient Document Indexing-Based Similarity Search in Large Datasets

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media