ABSTRACT
Distributed systems are hard to build, profile, debug, and test. Monitoring a distributed system (to detect and analyze bugs, test for regressions, or identify fault-tolerance problems and security compromises) can be difficult and error-prone. In this paper we argue that declarative development of distributed systems is well suited to these tasks. We present an application logging, monitoring, and debugging facility that we have built on top of the P2 system, comprising an introspection model, an execution tracing component, and a distributed query processor. We use this facility to demonstrate a set of on-line distributed diagnosis tools that range from simple, local state assertions to sophisticated global property detectors on consistent snapshots. These tools are small and simple, and can be deployed piecemeal on-line at any point during a system's life cycle. Our evaluation suggests that the overhead of continuously monitoring and improving running distributed systems with our approach is commensurate with its benefits.
Using queries for distributed monitoring and forensics
Proceedings of the 2006 EuroSys conference