research-article

MAD skills: new analysis practices for big data

Authors:
Jeffrey Cohen (Greenplum)
Brian Dolan (Fox Audience Network)
Mark Dunlap (Evergreen Technologies)
Joseph M. Hellerstein (U. C. Berkeley)
Caleb Welton (Greenplum)

Proceedings of the VLDB Endowment, Volume 2, Issue 2, pp. 1481–1492. https://doi.org/10.14778/1687553.1687576

Published: 01 August 2009

Abstract

As massive data acquisition and storage becomes increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques and experience providing MAD analytics for one of the world's largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.

Index Terms

Information systems

Published in

Proceedings of the VLDB Endowment Volume 2, Issue 2
August 2009
367 pages
ISSN: 2150-8097
Publisher: VLDB Endowment