ABSTRACT
Configuration errors (i.e., misconfigurations) are among the dominant causes of system failures. Their importance has inspired many research efforts on detecting, diagnosing, and fixing misconfigurations; such research would benefit greatly from a real-world characteristic study on misconfigurations. Unfortunately, few such studies have been conducted in the past, primarily because historical misconfigurations usually have not been recorded rigorously in databases.
In this work, we undertake one of the first attempts to conduct a real-world misconfiguration characteristic study. We study a total of 546 real-world misconfigurations, including 309 misconfigurations from a commercial storage system deployed at thousands of customers, and 237 from four widely used open-source systems (CentOS, MySQL, Apache HTTP Server, and OpenLDAP). Some of our major findings include: (1) A majority of misconfigurations (70.0%~85.5%) are due to mistakes in setting configuration parameters; however, a significant number of misconfigurations are due to compatibility issues or component configurations (i.e., not parameter-related). (2) 38.1%~53.7% of parameter mistakes are caused by illegal parameters that clearly violate some format or rules, motivating the use of an automatic configuration checker to detect these misconfigurations. (3) A significant percentage (12.2%~29.7%) of parameter-based mistakes are due to inconsistencies between different parameter values. (4) 21.7%~57.3% of the misconfigurations involve configurations external to the examined system, some even on entirely different hosts. (5) A significant portion of misconfigurations can cause hard-to-diagnose failures, such as crashes, hangs, or severe performance degradation, indicating that systems should be better equipped to handle misconfigurations.
Index Terms
- An empirical study on configuration errors in commercial and open source systems