Abstract
Column-oriented database systems have been a real game changer for the industry in recent years. Highly tuned and performant systems have evolved that provide users with the possibility of answering ad hoc queries over large datasets in an interactive manner.
In this paper we present the column-oriented datastore developed as one of the central components of PowerDrill. It combines the advantages of columnar data layout with other known techniques (such as using composite range partitions) and extensive algorithmic engineering on key data structures. The main goal of the latter being to reduce the main memory footprint and to increase the efficiency in processing typical user queries. In this combination we achieve large speed-ups. These enable a highly interactive Web UI where it is common that a single mouse click leads to processing a trillion values in the underlying dataset.
- D. J. Abadi. Query Execution in Column-Oriented Database Systems. PhD thesis, MIT, 2008. Google ScholarDigital Library
- D. J. Abadi, P. A. Boncz, and S. Harizopoulos. Column oriented database systems. PVLDB, 2(2):1664--1665, 2009. Google ScholarDigital Library
- D. J. Abadi, S. Madden, and M. Ferreira. Integrating compression and execution in column-oriented database systems. In SIGMOD Conference, pages 671--682, 2006. Google ScholarDigital Library
- D. J. Abadi, S. Madden, and N. Hachem. Column-stores vs. row-stores: how different are they really? In SIGMOD Conference, pages 967--980, 2008. Google ScholarDigital Library
- S. Abiteboul and N. Bidoit. Non first normal form relations to represent hierarchically organized data. In PODS, pages 191--200, 1984. Google ScholarDigital Library
- Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In RANDOM, pages 1--10, 2002. Google ScholarDigital Library
- H. Bast and I. Weber. The completesearch engine: Interactive, efficient, and towards ir& db integration. In CIDR, pages 88--95, 2007.Google Scholar
- S. Bischof, T. Schickinger, and A. Steger. Load balancing using bisectors - a tight average-case analysis. In ESA, pages 172--183, 1999. Google ScholarDigital Library
- P. A. Boncz, M. Zukowski, and N. Nes. Monetdb/x100: Hyper-pipelining query execution. In CIDR, pages 225--237, 2005.Google Scholar
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Comput. Syst., 26(2):4:1--4:26, 2008. Google ScholarDigital Library
- S. Chaudhuri and U. Dayal. An overview of data warehousing and olap technology. SIGMOD Record, 26(1):65--74, 1997. Google ScholarDigital Library
- J. Dean. Challenges in building large-scale information retrieval systems: invited talk. In WSDM, page 1, 2009. Google ScholarDigital Library
- J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008. Google ScholarDigital Library
- P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182--209, 1985. Google ScholarDigital Library
- S. Ghemawat, H. Gobioff, and S.-T. Leung. The google file system. In SOSP, pages 29--43, 2003. Google ScholarDigital Library
- Google Web Toolkit. code.google.com/webtoolkit.Google Scholar
- S. Idreos, M. L. Kersten, and S. Manegold. Self-organizing tuple reconstruction in column-stores. In SIGMOD Conference, pages 297--308, 2009. Google ScholarDigital Library
- D. S. Johnson, S. Krishnan, J. Chhugani, S. Kumar, and S. Venkatasubramanian. Compressing large boolean matrices using reordering techniques. In VLDB, pages 13--23, 2004. Google ScholarDigital Library
- T. Johnson and D. Shasha. 2q: A low overhead high performance buffer management replacement algorithm. In VLDB, pages 439--450, 1994. Google ScholarDigital Library
- M. L. Kersten, S. Idreos, S. Manegold, and E. Liarou. The researcher's guide to the data deluge: Querying a scientific database in just a few seconds. PVLDB, 4(12):1474--1477, 2011.Google ScholarDigital Library
- D. Lemire and O. Kaser. Reordering columns for smaller indexes. Inf. Sci., 181(12):2550--2570, 2011. Google ScholarDigital Library
- N. Megiddo and D. S. Modha. Outperforming lru with an adaptive replacement cache algorithm. IEEE Computer, 37(4):58--65, 2004. Google ScholarDigital Library
- S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, and T. Vassilakis. Dremel: interactive analysis of web-scale datasets. Commun. ACM, 54(6):114--123, 2011. Google ScholarDigital Library
- G. Moerkotte. Small materialized aggregates: A light weight index structure for data warehousing. In VLDB, pages 476--487, 1998. Google ScholarDigital Library
- MonetDB. www.monetdb.org.Google Scholar
- Netezza. www.netezza.com.Google Scholar
- Page replacement algorithms. en.wikipedia.org/wiki/Page_replacement_algorithm.Google Scholar
- Partition databases (wiki). en.wikipedia.org/wiki/Partition_(database).Google Scholar
- Protocol Buffers: Developer Guide. code.google.com/apis/protocolbuffers/docs/overview.html.Google Scholar
- QlikTech. www.qlikview.com.Google Scholar
- R. Sedgewick and K. Wayne. Algorithms. Addison Wesley, 2011. Google ScholarDigital Library
- D. Slezak, J. Wroblewski, V. Eastwood, and P. Synak. Brighthouse: an analytic data warehouse for ad-hoc queries. PVLDB, 1(2):1337--1345, 2008. Google ScholarDigital Library
- Snappy compression algorithm. code.google.com/p/snappy.Google Scholar
- M. Stonebraker, D. J. Abadi, A. Batkin, X. Chen, M. Cherniack, M. Ferreira, E. Lau, A. Lin, S. Madden, E. J. O'Neil, P. E. O'Neil, A. Rasin, N. Tran, and S. B. Zdonik. C-store: A column-oriented dbms. In VLDB, pages 553--564, 2005. Google ScholarDigital Library
- L. Trevisan. When hamming meets euclid: The approximability of geometric tsp and mst (extended abstract). In STOC, pages 21--29, 1997. Google ScholarDigital Library
- VectorWise. www.actian.com/products/vectorwise.Google Scholar
Recommendations
Session based click features for recency ranking
AAAI'10: Proceedings of the Twenty-Fourth AAAI Conference on Artificial IntelligenceRecency ranking refers to the ranking of web results by accounting for both relevance and freshness. This is particularly important for "recency sensitive" queries such as breaking news queries. In this study, we propose a set of novel click features to ...
Tool for detecting webpage usability problems from mouse click coordinate logs
HCI'07: Proceedings of the 12th international conference on Human-computer interaction: interaction design and usabilityIn this paper, we propose a method that detects inconsistencies between user interaction logs of a task and desired sequences for the task based on mouse click coordinate logs. The proposed method models two successive clicks as a vector and thus a ...
Click patterns: an empirical representation of complex query intents
CIKM '12: Proceedings of the 21st ACM international conference on Information and knowledge managementUnderstanding users' search intents is critical component of modern search engines. A key limitation made by most query log analyses is the assumption that each clicked web result represents one unique intent. However, there are many search tasks, such ...
Comments