ABSTRACT
Kineograph is a distributed system that takes a stream of incoming data to construct a continuously changing graph, which captures the relationships that exist in the data feed. As a computing platform, Kineograph further supports graph-mining algorithms to extract timely insights from the fast-changing graph structure. To accommodate graph-mining algorithms that assume a static underlying graph, Kineograph creates a series of consistent snapshots, using a novel and efficient epoch commit protocol. To keep up with continuous updates on the graph, Kineograph includes an incremental graph-computation engine. We have developed three applications on top of Kineograph to analyze Twitter data: user ranking, approximate shortest paths, and controversial topic detection. For these applications, Kineograph takes a live Twitter data feed and maintains a graph of edges between all users and hashtags. Our evaluation shows that with 40 machines processing 100K tweets per second, Kineograph is able to continuously compute global properties, such as user ranks, with less than 2.5-minute timeliness guarantees. This rate of traffic is more than 10 times the reported peak rate of Twitter as of October 2011.
- R. Angles and C. Gutierrez. Survey of graph database models. ACM Computing Surveys (CSUR), 40 (1): 1--39, 2008. Google ScholarDigital Library
- P. Bhatotia, A. Wieder, R. Rodrigues, U. Acar, and R. Pasquini. Incoop: MapReduce for incremental computations. In ACM SoCC, 2011. Google ScholarDigital Library
- Y. Bu, B. Howe, M. Balazinska, and M. Ernst. HaLoop: Efficient iterative data processing on large clusters. In VLDB, 2010. Google ScholarDigital Library
- M. Burrows. The Chubby lock service for loosely-coupled distributed systems. In OSDI, 2006. Google ScholarDigital Library
- D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams -- a new class of data management applications. In VLDB, 2002. Google ScholarDigital Library
- S. Chandrasekaran, O. Cooper, A. Deshpande, M. Franklin, J. Hellerstein, W. Hong, S. Krishnamurthy, S. Madden, V. Raman, F. Reiss, and M. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR, 2003. Google ScholarDigital Library
- J. Chen, D. J. Dewitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In SIGMOD, 2000. Google ScholarDigital Library
- T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. MIT Press and McGraw-Hill, 2nd. edition, 2001. Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Communications of the ACM, 51 (1): 107--113, 2008. Google ScholarDigital Library
- P. Gunda, L. Ravindranath, C. Thekkath, Y. Yu, and L. Zhuang. Nectar: Automatic management of data and computation in datacenters. In OSDI, 2010. Google ScholarDigital Library
- B. He, M. Yang, Z. Guo, R. Chen, B. Su, W. Lin, and L. Zhou. Comet: Batched stream processing in data intensive distributed computing. In ACM SoCC, 2010. Google ScholarDigital Library
- P. Hunt, M. Konar, F. P. Junqueira, and B. Reed. ZooKeeper: Wait-free coordination for internet-scale systems. In USENIX ATC, 2010. Google ScholarDigital Library
- R. Kallman, H. Kimura, J. Natkins, A. Pavlo, A. Rasin, S. Zdonik, E. P. C. Jones, S. Madden, M. Stonebraker, Y. Zhang, J. Hugg, and D. J. Abadi. H-Store: A high-performance, distributed main memory transaction processing system. In VLDB, 2008. Google ScholarDigital Library
- U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In IEEE International Conference on Data Mining, 2009. Google ScholarDigital Library
- L. Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16 (2): 133--169, 1998. Google ScholarDigital Library
- D. Logothetis, C. Olston, B. Reed, K. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. In ACM SoCC, 2010. Google ScholarDigital Library
- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. Hellerstein. GraphLab: A new parallel framework for machine learning. In Conference on Uncertainty in Artificial Intelligence(UAI), 2010.Google Scholar
- G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, 2010. Google ScholarDigital Library
- memcached. Memcached: A distributed memory object caching system, 2011. http://memcached.org.Google Scholar
- Neo4j. Neo4j: The graph database, 2011. http://neo4j.org. Accessed October, 2011.Google Scholar
- D. Ongaro, S. M. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum. Fast crash recovery in ramcloud. In SOSP, 2011. Google ScholarDigital Library
- L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: Bringing order to the web. Stanford Technical Report, 1999.Google Scholar
- R. Pearce, M. Gokhale, and N. Amato. Multithreaded asynchronous graph traversal for in-memory and semi-external memory. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2010. Google ScholarDigital Library
- D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In OSDI, 2010. Google ScholarDigital Library
- L. Popa, M. Budiu, Y. Yu, and M. Isard. Dryadinc: Reusing work in large-scale computations. In HotCloud, 2009. Google ScholarDigital Library
- R. Power and J. Li. Piccolo: Building fast, distributed programs with partitioned tables. In OSDI, 2010. Google ScholarDigital Library
- D. Romero, B. Meeder, and J. Kleinberg. Differences in the mechanics of information diffusion across topics: Idioms, political hashtags, and complex contagion on twitter. In WWW, 2011. Google ScholarDigital Library
- A. Sarma, S. Gollapudi, M. Najork, and R. Panigrahy. A sketch-based distance oracle for web-scale graphs. In WSDM, 2010. Google ScholarDigital Library
- D. Sullivan. Tweets about steve jobs spike but don't break twitter peak record, 2011. http://searchengineland.com/tweets-about-steve-jobs-spike-but-dont-break-twitter-record-96048.Google Scholar
- A. Tanenbaum. Distributed Operating Systems. Prentice Hall, 1995. Google ScholarDigital Library
- D. Tunkelang. A twitter analog to pagerank. Retrieved from http://thenoisychannel. com/2009/01/13/a-twitter-analog-to-pagerank, 2009.Google Scholar
Index Terms
- Kineograph: taking the pulse of a fast-changing and connected world
Recommendations
Large-Scale Stream Graph Processing: Doctoral Symposium
DEBS '17: Proceedings of the 11th ACM International Conference on Distributed and Event-based SystemsDynamically changing graphs are a powerful abstraction used to represent temporal relationships and connections occurring between data entities in various real-world organizations, such as social and telecommunication networks. The increasing volume, ...
Graphtides: a framework for evaluating stream-based graph processing platforms
GRADES-NDA '18: Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA)Stream-based graph systems continuously ingest graph-changing events via an established input stream, performing the required computation on the corresponding graph. While there are various benchmarking and evaluation approaches for traditional, batch-...
KickStarter: Fast and Accurate Computations on Streaming Graphs via Trimmed Approximations
Asplos'17Continuous processing of a streaming graph maintains an approximate result of the iterative computation on a recent version of the graph. Upon a user query, the accurate result on the current graph can be quickly computed by feeding the approximate ...
Comments