Dremel: interactive analysis of web-scale datasets

Published: 01 June 2011

Abstract

Dremel is a scalable, interactive ad hoc query system for analysis of read-only nested data. By combining multilevel execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. In this paper, we describe the architecture and implementation of Dremel, and explain how it complements MapReduce-based computing. We present a novel columnar storage representation for nested records and discuss experiments on few-thousand-node instances of the system.

          • Published in

            Communications of the ACM, Volume 54, Issue 6
            June 2011
            134 pages
            ISSN: 0001-0782
            EISSN: 1557-7317
            DOI: 10.1145/1953122

            Copyright © 2011 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States
