ABSTRACT
We describe the design and implementation of a high performance cloud that we have used to archive, analyze and mine large distributed data sets. By a cloud, we mean an infrastructure that provides resources and/or services over the Internet. A storage cloud provides storage services, while a compute cloud provides compute services. We describe the design of the Sector storage cloud and how it provides the storage services required by the Sphere compute cloud. We also describe the programming paradigm supported by the Sphere compute cloud. Sector and Sphere are designed for analyzing large data sets using computer clusters connected with wide area high performance networks (for example, 10+ Gb/s). We describe a distributed data mining application that we have developed using Sector and Sphere. Finally, we describe some experimental studies comparing Sector/Sphere to Hadoop.
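The abstract mentions the programming paradigm supported by the Sphere compute cloud: applying a user-defined function in parallel over segments of a large distributed data set. As a rough illustration only (this is not Sector/Sphere's actual API; the names `udf` and `process` are hypothetical, and Python's `multiprocessing.Pool` stands in for the cluster nodes), the paradigm can be sketched as:

```python
from multiprocessing import Pool

def udf(segment):
    """Hypothetical user-defined function: count the records in one segment."""
    return sum(1 for record in segment if record)

def process(segments, fn, workers=2):
    """Apply fn to each data segment in parallel and gather the results.

    In the sketch, local worker processes play the role that cluster
    nodes holding the data segments would play in a compute cloud.
    """
    with Pool(workers) as pool:
        return pool.map(fn, segments)

if __name__ == "__main__":
    # Toy stand-in for a distributed data set: a list of record segments.
    data = [["a", "b"], ["c"], ["d", "e", "f"]]
    print(process(data, udf))  # [2, 1, 3]
```

The key design point the abstract alludes to is that the function is shipped to where the data segments reside, rather than the data being moved to the computation.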
Data Mining Using High Performance Data Clouds: Experimental Studies Using Sector and Sphere