Abstract
As massive data acquisition and storage become increasingly affordable, a wide variety of enterprises are employing statisticians to engage in sophisticated data analysis. In this paper we highlight the emerging practice of Magnetic, Agile, Deep (MAD) data analysis as a radical departure from traditional Enterprise Data Warehouses and Business Intelligence. We present our design philosophy, techniques, and experience providing MAD analytics for one of the world's largest advertising networks at Fox Audience Network, using the Greenplum parallel database system. We describe database design methodologies that support the agile working style of analysts in these settings. We present data-parallel algorithms for sophisticated statistical techniques, with a focus on density methods. Finally, we reflect on database system features that enable agile design and flexible algorithm development using both SQL and MapReduce interfaces over a variety of storage mechanisms.