
Using queries for distributed monitoring and forensics

Authors:
Atul Singh (Rice University),
Petros Maniatis (Intel Research Berkeley),
Timothy Roscoe (Intel Research Berkeley),
Peter Druschel (Rice University)

EuroSys '06: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006, April 2006, Pages 389–402. https://doi.org/10.1145/1217935.1217973

Published: 18 April 2006

ABSTRACT

Distributed systems are hard to build, profile, debug, and test. Monitoring a distributed system (to detect and analyze bugs, test for regressions, or identify fault-tolerance problems or security compromises) can be difficult and error-prone. In this paper we argue that declarative development of distributed systems is well suited to tackle these tasks. We present an application logging, monitoring, and debugging facility that we have built on top of the P2 system, comprising an introspection model, an execution tracing component, and a distributed query processor. We use this facility to demonstrate on-line distributed diagnosis tools that range from simple, local state assertions to sophisticated global property detectors on consistent snapshots. These tools are small and simple, and they can be deployed piecemeal on-line at any point during a system's life cycle. Our evaluation suggests that the overhead of continuously monitoring and improving running distributed systems with our approach is well in tune with its benefits.
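
The abstract does not show a concrete query, so the following is only a rough illustration of the general idea of phrasing an invariant check as a declarative query over system state. It is not P2's query language or API: SQL through Python's standard `sqlite3` module stands in for a declarative query processor, and the `link` table, its columns, the node names, and the function `find_ring_violations` are all hypothetical. The sketch assumes a snapshot of a ring-structured overlay has already been gathered into one table, and it reports every node whose successor does not name that node as its predecessor.

```python
import sqlite3


def find_ring_violations(links):
    """Return (node, successor) pairs that break ring symmetry.

    `links` is an iterable of (node, successor, predecessor) rows taken
    from a snapshot. The schema is a hypothetical stand-in, not P2's.
    """
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE link ("
        "node TEXT PRIMARY KEY, successor TEXT, predecessor TEXT)"
    )
    db.executemany("INSERT INTO link VALUES (?, ?, ?)", links)
    # Declarative check: join each node with its successor's row and keep
    # the pairs where the successor's predecessor is some other node.
    rows = db.execute(
        """
        SELECT a.node, a.successor
        FROM link AS a
        JOIN link AS b ON b.node = a.successor
        WHERE b.predecessor <> a.node
        ORDER BY a.node
        """
    ).fetchall()
    db.close()
    return rows


# Hypothetical three-node snapshot in which n3 records the wrong predecessor.
snapshot = [
    ("n1", "n2", "n3"),
    ("n2", "n3", "n1"),
    ("n3", "n1", "n1"),
]
print(find_ring_violations(snapshot))
```

The check itself is a single query, so it could be added to or removed from a running monitor without touching the code that collects the snapshot. That is the kind of piecemeal deployment the abstract describes, though the paper's own mechanism may differ.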

Index Terms

Using queries for distributed monitoring and forensics
1. Networks
  1. Network protocols
2. Software and its engineering
  1. Software organization and properties
    1. Software system structures
      1. Distributed systems organizing principles
        Organizing principles for web applications

Published in
EuroSys '06: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems 2006
April 2006
420 pages
ISBN: 1595933220
DOI: 10.1145/1217935
Conference Chair: Yolande Berbers (K. U. Leuven, Belgium)
Program Chair: Willy Zwaenepoel (EPFL)

ACM SIGOPS Operating Systems Review, Volume 40, Issue 4
Proceedings of the 2006 EuroSys conference
October 2006
383 pages
ISSN: 0163-5980
DOI: 10.1145/1218063
Copyright © 2006 Authors
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
declarative overlays
distributed debugging
distributed monitoring
invariant checking
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 241 of 1,308 submissions, 18%
Article Metrics
- Total Citations: 62
- Total Downloads: 368
- Downloads (Last 12 months): 7
- Downloads (Last 6 weeks): 2