research-article

Data warehousing and analytics infrastructure at facebook

Authors:
Ashish Thusoo

Facebook, Palo Alto, CA, USA

Facebook, Palo Alto, CA, USA
View Profile

,
Zheng Shao

Facebook, Palo Alto, CA, USA

Facebook, Palo Alto, CA, USA
View Profile

,
Suresh Anthony

Facebook, Palo Alto, CA, USA

Facebook, Palo Alto, CA, USA
View Profile

,
Dhruba Borthakur

Facebook, Palo Alto, CA, USA

Facebook, Palo Alto, CA, USA
View Profile

,
Namit Jain

Facebook, Palo Alto, CA, USA

Facebook, Palo Alto, CA, USA
View Profile

,
Joydeep Sen Sarma

Facebook, Palo Alto, CA, USA

Facebook, Palo Alto, CA, USA
View Profile

,
Raghotham Murthy

Facebook, Palo Alto, CA, USA

Facebook, Palo Alto, CA, USA
View Profile

,
Hao Liu

Facebook, Palo Alto, CA, USA

Facebook, Palo Alto, CA, USA
View Profile

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of dataJune 2010Pages 1013–1020https://doi.org/10.1145/1807167.1807278

Published:06 June 2010Publication History

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

Pages 1013–1020

ABSTRACT

Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers, to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever increasing amount of data, a flexible infrastructure that scales up in a cost effective manner, is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we will present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day today operations and future capabilities and improvements that we are working on.

References

Apache Hadoop wiki. Available at http://wiki.apache.org/hadoop.Google Scholar
Apache Hadoop Hive wiki. Available at http://wiki.apache.org/hadoop/Hive.Google Scholar
Scribe wiki. Available at http://wiki.github.com/facebook/scribe.Google Scholar
Ghemawat, S., Gobioff, H. and Leung, S. 2003. The Google File System. In Proceedings of the 19th ACM Symposium on Operating Systems Principles (Lake George, NY, Oct. 2003). Google ScholarDigital Library
Dean, J. and Ghemawat S. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of the 6th Symposium on Operating System Design and Implementation (San Francisco, CA, Dec. 2004). OSDI'04. Google ScholarDigital Library
HDFS Architecture. Available at http://hadoop.apache.org/common/docs/current/hdfs_design.pdf.Google Scholar
Hive DDL wiki. Available at http://wiki.apache.org/hadoop/Hive/LanguageManual/DDL.Google Scholar
Thusoo, A., Murthy, R., Sen Sarma, J., Shao, Z., Jain, N., Chakka, P., Anthony, A., Liu, H., Zhang, N. 2010. Hive - A Petabyte Scale Data Warehouse Using Hadoop. In Proceedings of 26th IEEE International Conference on Data Engineering (Long Beach, California, Mar. 2010). ICDE'10.Google ScholarCross Ref
Ailamaki, A., DeWitt, D. J., Hill, M. D., Skounakis, M. 2001. Weaving Relations for Cache Performance. In Proceedings of 27th Very Large Data Base Conference (Roma, Italy, 2001). VLDB'01. Google ScholarDigital Library
Zaharia, M., Borthakur, D., Sen Sarma, J., Elmeleegy, K., Shenker, S., Stoica, I. 2009. Job Scheduling for Multi-User MapReduce Clusters. UC Berkeley Technical Report UCB/EECS-2009-55 (Apr. 2009).Google Scholar
Fan, B., Tantisiriroj, W., Xiao, Lin, Gibson, G. 2009. DiskReduce: RAID for Data-Intensive Scalable Computing. In Proceedings of 4th Petascale Data Storage Workshop Supercomputing Conference (Portland, Oregon, Nov. 2009). Supercomputing PDSW'09. Google ScholarDigital Library

Index Terms

Data warehousing and analytics infrastructure at facebook
1. Information systems

Recommendations

Apache hadoop goes realtime at Facebook
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Facebook recently deployed Facebook Messages, its first ever user-facing application built on the Apache Hadoop platform. Apache HBase is a database-like layer built on Hadoop designed to support billions of messages per day. This paper describes the ...
Read More
Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware
IDEAS '17: Proceedings of the 21st International Database Engineering & Applications Symposium

Big Data is currently conceptualized as data whose volume, variety or velocity impose significant difficulties in traditional techniques and technologies. Big Data Warehousing is emerging as a new concept for Big Data analytics. In this context, SQL-on-...
Read More
Scale-out beyond map-reduce
KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

The amount and variety of data being collected in the enterprise is growing at a staggering pace. The default now is to capture and store any and all data, in anticipation of potential future strategic value, and vast amounts of data are being generated ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
June 2010
1286 pages
ISBN:9781450300322
DOI:10.1145/1807167
General Chair:
Ahmed Elmagarmid
Purdue University, USA
,
Program Chair:
Divyakant Agrawal
University of California at Santa Barbara, USA
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 June 2010
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
analytics
data discovery
data warehouse
distributed file system
distributed systems
facebook
hadoop
hive
log aggregation
map-reduce
resource sharing
scalability
scribe
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 268
  Total Citations
  View Citations
- 6,493
  Total Downloads
- Downloads (Last 12 months)166
- Downloads (Last 6 weeks)27
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Data warehousing and analytics infrastructure at facebook

SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Apache hadoop goes realtime at Facebook

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware

Scale-out beyond map-reduce