skip to main content
research-article

Processing a trillion cells per mouse click

Published:01 July 2012Publication History
Skip Abstract Section

Abstract

Column-oriented database systems have been a real game changer for the industry in recent years. Highly tuned and performant systems have evolved that provide users with the possibility of answering ad hoc queries over large datasets in an interactive manner.

In this paper we present the column-oriented datastore developed as one of the central components of PowerDrill. It combines the advantages of columnar data layout with other known techniques (such as using composite range partitions) and extensive algorithmic engineering on key data structures. The main goal of the latter being to reduce the main memory footprint and to increase the efficiency in processing typical user queries. In this combination we achieve large speed-ups. These enable a highly interactive Web UI where it is common that a single mouse click leads to processing a trillion values in the underlying dataset.

References

  1. D. J. Abadi. Query Execution in Column-Oriented Database Systems. PhD thesis, MIT, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. D. J. Abadi, P. A. Boncz, and S. Harizopoulos. Column oriented database systems. PVLDB, 2(2):1664--1665, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. D. J. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In SIGMOD Conference, pages 671--682, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. J. Abadi, S. Madden, and N. Hachem. Column-stores vs. row-stores: how different are they really? In SIGMOD Conference, pages 967--980, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. S. Abiteboul and N. Bidoit. Non first normal form relations to represent hierarchically organized data. In PODS, pages 191--200, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, pages 1--10, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. H. Bast and I. Weber. The completesearch engine: Interactive, efficient, and towards ir& db integration. In CIDR, pages 88--95, 2007.Google ScholarGoogle Scholar
  8. S. Bischof, T. Schickinger, and A. Steger. Load balancing using bisectors - a tight average-case analysis. In ESA, pages 172--183, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. A. Boncz, M. Zukowski, and N. Nes. Monetdb/x100: Hyper-pipelining query execution. In CIDR, pages 225--237, 2005.Google ScholarGoogle Scholar
  10. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1--4:26, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Chaudhuri and U. Dayal. An overview of data warehousing and olap technology. SIGMOD Record, 26(1):65--74, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Dean. Challenges in building large-scale information retrieval systems: invited talk. In WSDM, page 1, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182--209, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. In SOSP, pages 29--43, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Google Web Toolkit. code.google.com/webtoolkit.Google ScholarGoogle Scholar
  17. S. Idreos, M. L. Kersten, and S. Manegold. Self-organizing tuple reconstruction in column-stores. In SIGMOD Conference, pages 297--308, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. D. S. Johnson, S. Krishnan, J. Chhugani, S. Kumar, and S. Venkatasubramanian. Compressing large boolean matrices using reordering techniques. In VLDB, pages 13--23, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. T. Johnson and D. Shasha. 2q: A low overhead high performance buffer management replacement algorithm. In VLDB, pages 439--450, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. M. L. Kersten, S. Idreos, S. Manegold, and E. Liarou. The researcher's guide to the data deluge: Querying a scientific database in just a few seconds. PVLDB, 4(12):1474--1477, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. D. Lemire and O. Kaser. Reordering columns for smaller indexes. Inf. Sci., 181(12):2550--2570, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. N. Megiddo and D. S. Modha. Outperforming lru with an adaptive replacement cache algorithm. IEEE Computer, 37(4):58--65, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. Commun. ACM, 54(6):114--123, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. G. Moerkotte. Small materialized aggregates: A light weight index structure for data warehousing. In VLDB, pages 476--487, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. MonetDB. www.monetdb.org.Google ScholarGoogle Scholar
  26. Netezza. www.netezza.com.Google ScholarGoogle Scholar
  27. Page replacement algorithms. en.wikipedia.org/wiki/Page_replacement_algorithm.Google ScholarGoogle Scholar
  28. Partition databases (wiki). en.wikipedia.org/wiki/Partition_(database).Google ScholarGoogle Scholar
  29. Protocol Buffers: Developer Guide. code.google.com/apis/protocolbuffers/docs/overview.html.Google ScholarGoogle Scholar
  30. QlikTech. www.qlikview.com.Google ScholarGoogle Scholar
  31. R. Sedgewick and K. Wayne. Algorithms. Addison Wesley, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. D. Slezak, J. Wroblewski, V. Eastwood, and P. Synak. Brighthouse: an analytic data warehouse for ad-hoc queries. PVLDB, 1(2):1337--1345, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Snappy compression algorithm. code.google.com/p/snappy.Google ScholarGoogle Scholar
  34. M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-store: A column-oriented dbms. In VLDB, pages 553--564, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. L. Trevisan. When hamming meets euclid: The approximability of geometric tsp and mst (extended abstract). In STOC, pages 21--29, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. VectorWise. www.actian.com/products/vectorwise.Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 5, Issue 11
    July 2012
    608 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 July 2012
    Published in pvldb Volume 5, Issue 11

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader