DOI: 10.1145/1217935.1217973
Article

Using queries for distributed monitoring and forensics

Published: 18 April 2006

ABSTRACT

Distributed systems are hard to build, profile, debug, and test. Monitoring a distributed system (to detect and analyze bugs, test for regressions, or identify fault-tolerance problems and security compromises) can be difficult and error-prone. In this paper, we argue that declarative development of distributed systems is well suited to these tasks. We present an application logging, monitoring, and debugging facility that we have built on top of the P2 system, comprising an introspection model, an execution tracing component, and a distributed query processor. We use this facility to demonstrate on-line distributed diagnosis tools that range from simple local state assertions to sophisticated global property detectors on consistent snapshots. These tools are small and simple, and can be deployed piecemeal on-line at any point during a system's life cycle. Our evaluation suggests that the overhead of continuously monitoring and improving running distributed systems with our approach is well in tune with its benefits.

Published in

  • EuroSys '06: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
    April 2006, 420 pages
    ISBN: 1595933220
    DOI: 10.1145/1217935
  • ACM SIGOPS Operating Systems Review, Volume 40, Issue 4 (Proceedings of the 2006 EuroSys conference)
    October 2006, 383 pages
    ISSN: 0163-5980
    DOI: 10.1145/1218063

Copyright © 2006 Authors

Publisher: Association for Computing Machinery, New York, NY, United States

Overall Acceptance Rate: 241 of 1,308 submissions, 18%
