DOI: 10.1145/1807167.1807278
research-article

Data warehousing and analytics infrastructure at Facebook

Published: 06 June 2010

ABSTRACT

Scalable analysis on large data sets has been core to the functions of a number of teams at Facebook - both engineering and non-engineering. Apart from ad hoc analysis of data and creation of business intelligence dashboards by analysts across the company, a number of Facebook's site features are also based on analyzing large data sets. These features range from simple reporting applications like Insights for the Facebook Advertisers to more advanced kinds such as friend recommendations. In order to support this diversity of use cases on the ever-increasing amount of data, a flexible infrastructure that scales up in a cost-effective manner is critical. We have leveraged, authored and contributed to a number of open source technologies in order to address these requirements at Facebook. These include Scribe, Hadoop and Hive, which together form the cornerstones of the log collection, storage and analytics infrastructure at Facebook. In this paper we will present how these systems have come together and enabled us to implement a data warehouse that stores more than 15PB of data (2.5PB after compression) and loads more than 60TB of new data (10TB after compression) every day. We discuss the motivations behind our design choices, the capabilities of this solution, the challenges that we face in day-to-day operations, and future capabilities and improvements that we are working on.

    • Published in

      SIGMOD '10: Proceedings of the 2010 ACM SIGMOD International Conference on Management of data
      June 2010
      1286 pages
      ISBN:9781450300322
      DOI:10.1145/1807167

      Copyright © 2010 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 785 of 4,003 submissions, 20%