research-article

Kodiak: leveraging materialized views for very low-latency analytics over high-dimensional web-scale data

Authors:
Shaosu Liu

Turn, Inc

Turn, Inc
View Profile

,
Bin Song

Turn, Inc

Turn, Inc
View Profile

,
Sriharsha Gangam

Turn, Inc

Turn, Inc
View Profile

,
Lawrence Lo

Turn, Inc

Turn, Inc
View Profile

,
Khaled Elmeleegy

Turn, Inc

Turn, Inc
View Profile

Proceedings of the VLDB Endowment Volume 9 Issue 13pp 1269–1280https://doi.org/10.14778/3007263.3007266

Published:01 September 2016Publication History

Proceedings of the VLDB Endowment

Abstract

Turn's online advertising campaigns produce petabytes of data. This data is composed of trillions of events, e.g. impressions, clicks, etc., spanning multiple years. In addition to a timestamp, each event includes hundreds of fields describing the user's attributes, campaign's attributes, attributes of where the ad was served, etc.

Advertisers need advanced analytics to monitor their running campaigns' performance, as well as to optimize future campaigns. This involves slicing and dicing the data over tens of dimensions over arbitrary time ranges. Many of these queries need to power the web portal to provide reports and dashboards. For an interactive response time, they have to have tens of milliseconds latency. At Turn's scale of operations, no existing system was able to deliver this performance in a cost effective manner.

Kodiak, a distributed analytical data platform for web-scale high-dimensional data, was built to serve this need. It relies on pre-computations to materialize thousands of views to serve these advanced queries. These views are partitioned and replicated across Kodiak's storage nodes for scalability and reliability. They are system maintained as new events arrive. At query time, the system auto-selects the most suitable view to serve each query.

Kodiak has been used in production for over a year. It hosts 2490 views for over three petabytes of raw data serving over 200K queries daily. It has median and 99% query latencies of 8 ms and 252 ms respectively. Our experiments show that its query latency is 3 orders of magnitude faster than leading big data platforms on head-to-head comparisons using Turn's query workload. Moreover, Kodiak uses 4 orders of magnitude less resources to run the same workload.

References

P. Agrawal, A. Silberstein, B. F. Cooper, U. Srivastava, and R. Ramakrishnan. Asynchronous view maintenance for vlsd databases. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data, SIGMOD '09, pages 179--192, New York, NY, USA, 2009. ACM. Google ScholarDigital Library
HBase: the Hadoop database. http:///hbase.apache.org/.Google Scholar
The Apache Cassandra database. http:///cassandra.apache.org/.Google Scholar
M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1383--1394, New York, NY, USA, 2015. ACM. Google ScholarDigital Library
A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder. Impala: A modern, open-source sql engine for hadoop. In CIDR, 2015.Google Scholar
F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. ACM Trans. Computer Systems, 26(2), 2008. Google ScholarDigital Library
S. Chen. Cheetah: A high performance, custom data warehouse on top of mapreduce. Proc. VLDB Endow., 3(1-2):1459--1468, Sept. 2010. Google ScholarDigital Library
L. S. Colby, T. Griffin, L. Libkin, I. S. Mumick, and H. Trickey. Algorithms for deferred view maintenance. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96, pages 469--480, New York, NY, USA, 1996. ACM. Google ScholarDigital Library
B. F. Cooper, R. Ramakrishnan, U. Srivastava, A. Silberstein, P. Bohannon, H.-A. Jacobsen, N. Puz, D. Weaver, and R. Yerneni. Pnuts: Yahoo!'s hosted data serving platform. Proc. VLDB Endow., 1(2):1277--1288, Aug. 2008. Google ScholarDigital Library
J. C. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford. Spanner: Google's globally-distributed database. In Proceedings of the 10th USENIX Conference on Operating Systems Design and Implementation, OSDI'12, pages 251--264, Berkeley, CA, USA, 2012. USENIX Association. Google ScholarDigital Library
G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. In Proceedings of Twenty-first ACM SIGOPS Symposium on Operating Systems Principles, SOSP '07, pages 205--220, New York, NY, USA, 2007. ACM. Google ScholarDigital Library
H. Elmeleegy, Y. Li, Y. Qi, P. Wilmot, M. Wu, S. Kolay, A. Dasdan, and S. Chen. Overview of turn data management platform for digital advertising. PVLDB, 6(11):1138--1149, 2013. Google ScholarDigital Library
K. Elmeleegy. Piranha: Optimizing short jobs in hadoop. Proc. VLDB Endow., 6(11):985--996, Aug. 2013. Google ScholarDigital Library
A. Gupta and I. S. Mumick. Materialized views. chapter Maintenance of Materialized Views: Problems, Techniques, and Applications, pages 145--157. MIT Press, Cambridge, MA, USA, 1999. Google ScholarDigital Library
P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. Zookeeper: wait-free coordination for internet-scale systems. In USENIXATC'10: Proceedings of the 2010 USENIX conference on USENIX annual technical conference, pages 11--11, 2010. Google ScholarDigital Library
K. Y. Lee and M. H. Kim. Efficient incremental maintenance of data cubes. In Proceedings of the 32Nd International Conference on Very Large Data Bases, VLDB '06, pages 823--833. VLDB Endowment, 2006. Google ScholarDigital Library
C. Mohan. History repeats itself: Sensible and nonsensql aspects of the nosql hoopla. In Proceedings of the 16th International Conference on Extending Database Technology, EDBT '13, pages 11--16, New York, NY, USA, 2013. ACM. Google ScholarDigital Library
I. S. Mumick, D. Quass, and B. S. Mumick. Maintenance of data cubes and summary tables in a warehouse. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, SIGMOD '97, pages 100--111, New York, NY, USA, 1997. ACM. Google ScholarDigital Library
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proc. ACM SIGMOD, 2008. Google ScholarDigital Library
Oracle. Automatic Storage Management. http://docs.oracle.com/cd/E11882_01/server.112/e18951/asmcon.htm.Google Scholar
Oracle Clusterware. http://www.oracle.com/technetwork/database/database-technologies/clusterware/overview/index.html.Google Scholar
L. Qiao, K. Surlaker, S. Das, T. Quiggle, B. Schulman, B. Ghosh, A. Curtis, O. Seeliger, Z. Zhang, A. Auradar, C. Beaver, G. Brandt, M. Gandhi, K. Gopalakrishna, W. Ip, S. Jgadish, S. Lu, A. Pachev, A. Ramesh, A. Sebastian, R. Shanbhag, S. Subramaniam, Y. Sun, S. Topiwala, C. Tran, J. Westerman, and D. Zhang. On brewing fresh espresso: Linkedin's distributed data serving platform. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 1135--1146, New York, NY, USA, 2013. ACM. Google ScholarDigital Library
K. A. Ross, D. Srivastava, and S. Sudarshan. Materialized view maintenance and integrity constraint checking: Trading space for time. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96, pages 447--458, New York, NY, USA, 1996. ACM. Google ScholarDigital Library
K. Salem, K. Beyer, B. Lindsay, and R. Cochrane. How to roll a join: Asynchronous incremental view maintenance. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, SIGMOD '00, pages 129--140, New York, NY, USA, 2000. ACM. Google ScholarDigital Library
J. Shute, R. Vingralek, B. Samwel, B. Handy, C. Whipkey, E. Rollins, M. Oancea, K. Littlefield, D. Menestrina, S. Ellner, J. Cieslewicz, I. Rae, T. Stancescu, and H. Apte. F1: A distributed sql database that scales. In VLDB, 2013. Google ScholarDigital Library
B. Song, S. Liu, S. Kolay, and L. Lo. Antsboa: A new time series pipeline for big data processing, analyzing and querying in online advertising application. In First IEEE International Conference on Big Data Computing Service and Applications, BigDataService 2015, Redwood City, CA, USA, March 30 - April 2, 2015, pages 223--232, 2015. Google ScholarDigital Library
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: A warehousing solution over a map-reduce framework. Proc. VLDB Endow., 2(2):1626--1629, Aug. 2009. Google ScholarDigital Library
M. Traverso. Presto: Interacting with petabytes of data at Facebook. https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920.Google Scholar
Big Data Benchmark - AMPLab. https://amplab.cs.berkeley.edu/benchmark.Google Scholar
VMware, Inc. Tungsten Replicator 3.0 Manual. Technical report, VMware, Inc, 2015.Google Scholar
R. S. Xin, J. Rosen, M. Zaharia, M. J. Franklin, S. Shenker, and I. Stoica. Shark: Sql and rich analytics at scale. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD '13, pages 13--24, New York, NY, USA, 2013. ACM. Google ScholarDigital Library
F. Yang, E. Tschetter, X. Léauté, N. Ray, G. Merlino, and D. Ganguli. Druid: A real-time analytical data store. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD '14, pages 157--168, New York, NY, USA, 2014. ACM. Google ScholarDigital Library

Index Terms

Kodiak: leveraging materialized views for very low-latency analytics over high-dimensional web-scale data
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs

Index terms have been assigned to the content through auto-classification.

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 9, Issue 13
September 2016
378 pages
ISSN:2150-8097
Editor:
Surajit Chaudhuri
Microsoft Research
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 September 2016
Published in pvldb Volume 9, Issue 13
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 5
  Total Citations
  View Citations
- 89
  Total Downloads
- Downloads (Last 12 months)6
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Kodiak: leveraging materialized views for very low-latency analytics over high-dimensional web-scale data

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Big Data Analytics with R and Hadoop

Big Data Analytics

Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Kodiak: leveraging materialized views for very low-latency analytics over high-dimensional web-scale data

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Big Data Analytics with R and Hadoop

Big Data Analytics

Practical Hadoop Ecosystem: A Definitive Guide to Hadoop-Related Frameworks and Tools

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media