ABSTRACT
We describe the design and implementation of a high performance cloud that we have used to archive, analyze and mine large distributed data sets. By a cloud, we mean an infrastructure that provides resources and/or services over the Internet. A storage cloud provides storage services, while a compute cloud provides compute services. We describe the design of the Sector storage cloud and how it provides the storage services required by the Sphere compute cloud. We also describe the programming paradigm supported by the Sphere compute cloud. Sector and Sphere are designed for analyzing large data sets using computer clusters connected with wide area high performance networks (for example, 10+ Gb/s). We describe a distributed data mining application that we have developed using Sector and Sphere. Finally, we describe some experimental studies comparing Sector/Sphere to Hadoop.
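The abstract mentions the programming paradigm supported by the Sphere compute cloud: applying a user-defined function in parallel over segments of a large distributed data set. As a rough illustration only (this is not Sector/Sphere's actual API; the names `udf` and `process` are hypothetical, and Python's `multiprocessing.Pool` stands in for the cluster nodes), the paradigm can be sketched as:

```python
from multiprocessing import Pool

def udf(segment):
    """Hypothetical user-defined function: count the records in one segment."""
    return sum(1 for record in segment if record)

def process(segments, fn, workers=2):
    """Apply fn to each data segment in parallel and gather the results.

    In the sketch, local worker processes play the role that cluster
    nodes holding the data segments would play in a compute cloud.
    """
    with Pool(workers) as pool:
        return pool.map(fn, segments)

if __name__ == "__main__":
    # Toy stand-in for a distributed data set: a list of record segments.
    data = [["a", "b"], ["c"], ["d", "e", "f"]]
    print(process(data, udf))  # [2, 1, 3]
```

The key design point the abstract alludes to is that the function is shipped to where the data segments reside, rather than the data being moved to the computation.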
Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere