research-article

MAD skills: new analysis practices for big data

Published: 01 August 2009

Abstract

As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world's largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.

Published in: Proceedings of the VLDB Endowment, Volume 2, Issue 2, August 2009, 367 pages

Publisher: VLDB Endowment