DOI: 10.1145/1401890.1402000
research-article

Data mining using high performance data clouds: experimental studies using Sector and Sphere

Published: 24 August 2008

ABSTRACT

We describe the design and implementation of a high performance cloud that we have used to archive, analyze and mine large distributed data sets. By a cloud, we mean an infrastructure that provides resources and/or services over the Internet. A storage cloud provides storage services, while a compute cloud provides compute services. We describe the design of the Sector storage cloud and how it provides the storage services required by the Sphere compute cloud. We also describe the programming paradigm supported by the Sphere compute cloud. Sector and Sphere are designed for analyzing large data sets using computer clusters connected with wide area high performance networks (for example, 10+ Gb/s). We describe a distributed data mining application that we have developed using Sector and Sphere. Finally, we describe some experimental studies comparing Sector/Sphere to Hadoop.

Published in

KDD '08: Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2008
1116 pages
ISBN: 9781605581934
DOI: 10.1145/1401890
General Chair: Ying Li
Program Chairs: Bing Liu, Sunita Sarawagi

Copyright © 2008 ACM

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

KDD '08 paper acceptance rate: 118 of 593 submissions, 20%. Overall acceptance rate: 1,133 of 8,635 submissions, 13%.
