skip to main content
10.1145/2463676.2465290acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Fast data in the era of big data: Twitter's real-time related query suggestion architecture

Published:22 June 2013Publication History

ABSTRACT

We present the architecture behind Twitter's real-time related query suggestion and spelling correction service. Although these tasks have received much attention in the web search literature, the Twitter context introduces a real-time "twist": after significant breaking news events, we aim to provide relevant results within minutes. This paper provides a case study illustrating the challenges of real-time data processing in the era of "big data". We tell the story of how our system was built twice: our first implementation was built on a typical Hadoop-based analytics stack, but was later replaced because it did not meet the latency requirements necessary to generate meaningful real-time results. The second implementation, which is the system deployed in production today, is a custom in-memory processing engine specifically designed for the task. This experience taught us that the current typical usage of Hadoop as a "big data" platform, while great for experimentation, is not well suited to low-latency processing, and points the way to future work on data analytics platforms that can handle "big" as well as "fast" data.

References

  1. C. Aggarwal. Data Streams: Models and Algorithms. Kluwer Academic Publishers, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. E. Alfonseca, M. Ciaramita, and K. Hall. Gazpacho and summer rash: lexical relationships from temporal patterns of web search queries. In EMNLP, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. G. Ananthanarayanan, S. Kandula, A. Greenberg, I. Stoica, Y. Lu, B. Saha, and E. Harris. Reining in the outliers in Map-Reduce clusters using Mantri. In OSDI, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. Baraglia, F. M. Nardini, C. Castillo, R. Perego, D. Donato, and F. Silvestri. The effects of time on query flow graph-based models for query suggestion. In RIAO, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. Borthakur, J. Gray, J. Sarma, K. Muthukkaruppan, N. Spiegelberg, H. Kuang, K. Ranganathan, D. Molkov, A. Menon, S. Rash, R. Schmidt, and A. Aiyer. Apache Hadoop goes realtime at Facebook. In SIGMOD, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. Busch, K. Gade, B. Larson, P. Lok, S. Luckenbill, and J. Lin. Earlybird: Real-time search at Twitter. In ICDE, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. H. Cao, D. Jiang, J. Pei, Q. He, Z. Liao, E. Chen, and H. Li. Context-aware query suggestion by mining click-through and session data. In KDD, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams--a new class of data management applications. In VLDB, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. B. Chandramouli, J. Goldstein, and S. Duan. Temporal analytics on big data for web advertising. In ICDE, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. Gruber. Bigtable: A distributed storage system for structured data. In OSDI, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Cucerzan and E. Brill. Spelling correction as an iterative process that exploits the collective knowledge of web users. In EMNLP, 2004.Google ScholarGoogle Scholar
  12. H. Cui, J.-R. Wen, J.-Y. Nie, and W.-Y. Ma. Query expansion by mining user logs. TKDE, 15(4):829--839, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. W. Dakka, L. Gravano, and P. Ipeirotis. Answering general time-sensitive queries. TKDE, 24(2):220--235, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Efron and G. Golovchinsky. Estimation methods for ranking recent information. In SIGIR, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Y. Ganjisaffar, R. Caruana, and C. Lopes. Bagging gradient-boosted trees for high precision, low variance ranking models. In SIGIR, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of MapReduce: The Pig experience. In VLDB, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. B. Gedik, H. Andrade, K.-L. Wu, P. Yu, and M. Doo. SPADE: The System S declarative stream processing engine. In SIGMOD, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. Gehrke. Special issue on data stream processing. Bulletin of the Technical Committee on Data Engineering, 26(1):2, 2003.Google ScholarGoogle Scholar
  19. K. Goodhope, J. Koshy, J. Kreps, N. Narkhede, R. Park, J. Rao, and V. Ye. Building LinkedIn's real-time activity data pipeline. Bulletin of the Technical Committee on Data Engineering, 35(2):33--45, 2012.Google ScholarGoogle Scholar
  20. P. Hunt, M. Konar, F. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for Internet-scale systems. In USENIX, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. R. Jones and F. Diaz. Temporal profiles of queries. ACM TOIS, 25(3), 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. R. Jones, B. Rey, O. Madani, and W. Greiner. Generating query substitutions. In WWW, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. Koenemann and N. Belkin. A case for interaction: A study of interactive information retrieval behavior and effectiveness. In CHI, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. In NetDB Workshop, 2011.Google ScholarGoogle Scholar
  25. S. Krishnamurthy, M. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, and N. Thombre. Continuous analytics over discontinuous streams. In SIGMOD, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. SkewTune: Mitigating skew in MapReduce applications. In SIGMOD, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, and A. Doan. Muppet: MapReduce-style processing of fast data. In VLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. V. Lavrenko and W. Croft. Relevance-based language models. In SIGIR, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. G. Lee, J. Lin, C. Liu, A. Lorek, and D. Ryaboy. The unified logging infrastructure for data analytics at Twitter. In VLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. F. Leibert, J. Mannix, J. Lin, and B. Hamadani. Automatic management of partitioned, replicated search services. In SoCC, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. H. Li. Learning to Rank for Information Retrieval and Natural Language Processing. Morgan & Claypool Publishers, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. X. Li and W. Croft. Time-based language models. In CIKM, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. Lin and A. Kolcz. Large-scale machine learning at Twitter. In SIGMOD, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. J. Lin and G. Mishne. A study of "churn" in tweets and real-time search queries. In ICWSM, 2012.Google ScholarGoogle Scholar
  35. J. Lin and D. Ryaboy. Scaling big data mining infrastructure: The Twitter experience. SIGKDD Explorations, 14(2):6--19, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. J. Lin, D. Ryaboy, and K. Weil. Full-text indexing for optimizing selection operations in large-scale data analytics. In MAPREDUCE Workshop, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. C. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, Massachusetts, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Q. Mei, D. Zhou, and K. Church. Query suggestion using hitting time. In CIKM, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. S. Mizzaro. How many relevances in information retrieval? Interacting With Computers, 10(3):305--322, 1998.Google ScholarGoogle ScholarCross RefCross Ref
  40. C. Moretti, J. Bulosan, D. Thain, and P. Flynn. All-Pairs: An abstraction for data-intensive cloud computing. In IPDPS, 2008.Google ScholarGoogle ScholarCross RefCross Ref
  41. L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. In KDCloud Workshop at ICDM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In SIGMOD, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. D. Pearce. A comparative evaluation of collocation extraction techniques. In LREC, 2002.Google ScholarGoogle Scholar
  44. D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In OSDI, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. K. Radinsky, K. Svore, S. Dumais, J. Teevan, A. Bocharov, and E. Horvitz. Modeling and predicting behavioral dynamics on the web. In WWW, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. J. Rocchio. Relevance feedback in information retrieval. In G. Salton, editor, The SMART Retrieval System--Experiments in Automatic Document Processing. Prentice-Hall, 1971.Google ScholarGoogle Scholar
  47. M. Shokouhi. Detecting seasonal queries by time-series analysis. In SIGIR, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. M. Shokouhi and K. Radinsky. Time sensitive query auto-completion. In SIGIR, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. A. Thusoo, Z. Shao, S. Anthony, D. Borthakur, N. Jain, J. Sarma, R. Murthy, and H. Liu. Data warehousing and analytics infrastructure at Facebook. In SIGMOD, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. M. Vlachos, C. Meek, Z. Vagena, and D. Gunopulos. Identifying similarities, periodicities and bursts for online search queries. In SIGMOD, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. J. Xu and W. Croft. Improving the effectiveness of information retrieval with local context analysis. ACM TOIS, 18(1):79--112, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Fast data in the era of big data: Twitter's real-time related query suggestion architecture

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
      June 2013
      1322 pages
      ISBN:9781450320375
      DOI:10.1145/2463676

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 22 June 2013

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      SIGMOD '13 Paper Acceptance Rate76of372submissions,20%Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader