Computer Networks

Volume 51, Issue 5, 11 April 2007, Pages 1239-1255

Learning DFA representations of HTTP for protecting web applications

https://doi.org/10.1016/j.comnet.2006.09.016

Abstract

Intrusion detection is a key technology for self-healing systems designed to prevent or manage damage caused by security threats. Protecting web server-based applications using intrusion detection is challenging, especially when autonomy is required (i.e., without signature updates or extensive administrative overhead). Web applications are difficult to protect because they are large, complex, highly customized, and often created by programmers with little security background. Anomaly-based intrusion detection has been proposed as a strategy to meet these requirements.

This paper describes how DFA (Deterministic Finite Automata) induction can be used to detect malicious web requests. The method is used in combination with rules for reducing variability among requests and heuristics for filtering and grouping anomalies. With this setup, a wide variety of attacks is detectable with few false positives, even when the system is trained on data containing benign attacks (e.g., attacks that fail against properly patched servers).

Introduction

Self-healing systems detect problems and repair them with little or no human intervention, and do so in a timely fashion. Although such detection/response combinations are desirable for many classes of problems, one of the most compelling applications is that of computer security, where Internet-connected computers are routinely attacked by computer viruses, worms, and human adversaries. In living systems, response to damage from analogous threats is coordinated primarily by the immune system. Similarly, “computer immune systems” have been proposed to respond to threats that circumvent conventional computer defenses. Earlier work in computer immunology has addressed many security threats by monitoring program code [1], [2], network connections [3], [4], [5], and low-level program behavior [6], [7], [8], [9]. However, as we explain below, these methods in their current form are not appropriate for protecting web server applications.

Web servers are important to protect because they are so ubiquitous, yet they are currently poorly defended. For example, web-based vulnerabilities have been estimated to account for more than 25% of the total number of reported vulnerabilities from 1999 to 2005 [10]. Web servers must accept complex, highly variable inputs from virtually any host on the Internet, and with the emergence of web application services, they must process those inputs in arbitrarily complex ways. These characteristics make the problem of securing web servers especially challenging.

One approach to protecting computer systems, including web servers, is to craft specific defenses for each observed problem, either in the form of code patches or attack signatures. Both strategies, however, require a human to analyze each problem and develop the solution. This limits the feasible response time to a timescale of hours or days, but attacks by self-replicating programs (viruses or worms) can manifest in a matter of seconds [11]; thus, there is a need for automated mechanisms that can detect and respond to threats in real time. Anomaly-detection methods have been proposed as an alternative because they can potentially detect novel attacks without human intervention [12], [13], [14], [6]. In anomaly detection, a model of normal behavior is developed from empirical observations, and subsequent observations that deviate from the model are labeled anomalies. Anomaly detection is problematic in the case of web servers, however, because web traffic is highly variable and it is difficult to characterize normal behavior in a way that both detects attacks and limits the rate of false alarms.
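
To make the anomaly-detection paradigm concrete, the following is a minimal, hypothetical sketch (not the system described in this paper): a model of normal behavior is fit from observed requests, and later observations whose similarity to the model falls below a threshold are flagged as anomalies. All class and function names are illustrative assumptions.

```python
# Minimal sketch of the anomaly-detection paradigm described above (not the
# paper's DFA-based method).  The "model" here is a trivial set of previously
# seen requests; real systems generalize rather than require exact matches.

class AnomalyDetector:
    def __init__(self, threshold):
        self.threshold = threshold   # minimum acceptable similarity score
        self.normal = set()          # placeholder model of normal behavior

    def train(self, observations):
        # Build the model of normal behavior from empirical observations.
        self.normal.update(observations)

    def score(self, observation):
        # Similarity in [0, 1]; exact-match only, purely for illustration.
        return 1.0 if observation in self.normal else 0.0

    def is_anomalous(self, observation):
        return self.score(observation) < self.threshold


detector = AnomalyDetector(threshold=0.5)
detector.train(["GET /index.html", "GET /about.html"])
print(detector.is_anomalous("GET /index.html"))                # False
print(detector.is_anomalous("GET /cgi-bin/../../etc/passwd"))  # True
```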

Other researchers have addressed the anomaly detection problem for web servers by measuring the frequency distribution of one or more simple features of a web request (the conduit of most attacks on web servers) and combining those features into a single anomaly measure. Although the features taken individually, such as character frequency, are often not sufficient for accurate detection, multiple features can be combined with better results [15]. However, these anomaly detectors must be trained on data that are free of attacks. This requirement is unfortunate because the normal background of today’s Internet contains large numbers of attacks, most of which are harmless against properly patched servers. Signature-scanning intrusion-detection systems (IDS) such as Snort [16] can be used to filter out known harmless attacks; however, achieving the high accuracy required for training demands frequent updates to the attack signature database and careful site-specific tuning to remove rules that generate false alarms. This manual intervention reduces the main advantage of using anomaly detection.
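
As a rough illustration of the feature-combination approach cited above [15] (an assumed simplification, not that system's actual models), one might record the mean and standard deviation of a few simple per-request features during training and sum their normalized deviations into a single anomaly score:

```python
# Illustrative sketch of combining simple per-request features into one
# anomaly measure.  The two toy features (request length and the ratio of
# non-alphanumeric characters) and the z-score combination are assumptions,
# not the models used in the cited work.
import statistics

FEATURES = [
    ("length", lambda req: float(len(req))),
    ("non_alnum_ratio",
     lambda req: sum(not c.isalnum() for c in req) / max(len(req), 1)),
]

def fit(training_requests):
    # Record the mean and standard deviation of each feature over training data.
    model = {}
    for name, feature in FEATURES:
        values = [feature(r) for r in training_requests]
        model[name] = (statistics.mean(values), statistics.pstdev(values) or 1.0)
    return model

def anomaly_score(model, request):
    # Sum of per-feature deviations from the training mean; larger is stranger.
    score = 0.0
    for name, feature in FEATURES:
        mean, stdev = model[name]
        score += abs(feature(request) - mean) / stdev
    return score

model = fit(["GET /index.html HTTP/1.1", "GET /images/logo.png HTTP/1.1"])
print(anomaly_score(model, "GET /index.html HTTP/1.1"))
print(anomaly_score(model, "GET /scripts/..%255c../cmd.exe?/c+dir HTTP/1.0"))
```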

To summarize, there is a need for web servers that can detect and repair damage caused by security violations with minimal human interaction. Accomplishing this requires techniques for detecting anomalous web requests that are tolerant of the benign attacks inevitably present in observations of normal web traffic. This paper describes a system that meets these criteria by learning models of normal requests using deterministic finite automata (DFA) induction. To account for the high variability in normal web requests, we combine the DFA with a set of simple parser heuristics, which remove the most variable parts of the web request before the induction step, and anomaly heuristics, which classify detected anomalies during the testing phase.
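
As a hedged sketch of the flavor of such parser heuristics (the actual rule set is described in the body of the paper and is not reproduced here), one might drop highly variable query-parameter values before tokenization, so that the induced DFA does not have to memorize every distinct value. The function below is an illustrative assumption, not the paper's parser:

```python
# Hypothetical illustration of a variability-reducing parser heuristic:
# query-parameter values are replaced by a generic placeholder before the
# request line is tokenized.  This is a guess at the flavor of such rules,
# not the rule set used by the system described in the paper.
from urllib.parse import urlsplit, parse_qsl

def reduce_variability(request_line):
    method, url, version = request_line.split(" ", 2)
    parts = urlsplit(url)
    # Keep parameter names but drop their (highly variable) values.
    params = [name for name, _value in parse_qsl(parts.query,
                                                 keep_blank_values=True)]
    tokens = [method, *[seg for seg in parts.path.split("/") if seg], version]
    tokens += [f"{name}=VALUE" for name in params]
    return tokens

print(reduce_variability("GET /search.cgi?q=self-healing&page=2 HTTP/1.1"))
# ['GET', 'search.cgi', 'HTTP/1.1', 'q=VALUE', 'page=VALUE']
```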

To test our system, we collected a data set consisting of 65 attacks against web servers, a much larger corpus than has been used in earlier work, and we studied the normal traffic from four distinct web sites over the course of months. The tests show that our system can detect 79% of the attacks with 40 false alarms per day on a complex website; on simpler websites, it detects more than 90% of the attacks with only a few alarms per day. These numbers are competitive with similar methods, yet our approach does not require the training data to be pre-filtered to remove attacks.

In the remaining sections, we first give background information on intrusion detection in Section 2. We then describe the DFA method for modeling HTTP requests (Section 3), the datasets we developed for testing (Section 4), and our experimental results in Section 5. Implications of the work are discussed in Section 6. Section 7 reviews related work, and Section 8 gives a summary of our results and plans for future work.

Section snippets

Background

We give a brief overview of intrusion detection in the context of web server security, with an emphasis on anomaly-based intrusion detection. We then compare our system to the most similar approaches, leaving the discussion of other related work to Section 7.

Modeling web requests

In this section we describe our method for modeling web requests: how we tokenize requests, how we build a DFA model from the resulting tokens, how new behavior is compared against the DFA model, and how we mitigate false positives.
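
As a minimal sketch of this pipeline, assuming token sequences like those produced by the earlier illustration, one could grow a prefix-tree-style automaton from training requests and score a new request by the fraction of its tokens the automaton can consume. The paper's induced DFA generalizes by merging states and uses its own similarity measure and anomaly heuristics, so the code below is only an assumed approximation:

```python
# Assumed, simplified stand-in for DFA induction and comparison: a prefix-tree
# acceptor is built from tokenized training requests, and a test request is
# scored by the fraction of its tokens that follow existing transitions.
# The real system merges states (generalizing the automaton) and applies
# additional anomaly heuristics before raising an alarm.

class SimpleDFA:
    def __init__(self):
        self.transitions = {}   # (state, token) -> next state
        self.next_state = 1     # state 0 is the start state

    def learn(self, tokens):
        state = 0
        for token in tokens:
            key = (state, token)
            if key not in self.transitions:
                self.transitions[key] = self.next_state
                self.next_state += 1
            state = self.transitions[key]

    def similarity(self, tokens):
        # On a missing transition, stay in the current state and continue;
        # this is one of several possible recovery policies.
        if not tokens:
            return 1.0
        state, accepted = 0, 0
        for token in tokens:
            nxt = self.transitions.get((state, token))
            if nxt is not None:
                state = nxt
                accepted += 1
        return accepted / len(tokens)


dfa = SimpleDFA()
dfa.learn(["GET", "index.html", "HTTP/1.1"])
dfa.learn(["GET", "about.html", "HTTP/1.1"])
print(dfa.similarity(["GET", "index.html", "HTTP/1.1"]))   # 1.0
print(dfa.similarity(["GET", "cmd.exe", "HTTP/1.0"]))      # ~0.33
```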

Test data

We collected normal data from four production web sites and developed an extensive database of attacks for testing our system.

Experiments

In this section, we report experimental results obtained using the training and test datasets just described. The experiments measure performance in terms of model size and accuracy (true positives and false positives), for both filtered and unfiltered training data. We also present results on the specificity of our model, illustrating the limited amount of generalization our method performs on web requests.

Discussion

In this section we interpret the experiments and explore their significance, focusing on the accuracy of our method in terms of false and true positives, how our modeling strategy relates to other approaches to anomaly detection, and the suitability of the method for production systems.

Other related work

In general, learning algorithms for one-class, non-stationary problems like the security anomaly detection problem are rare, although this is an area of current interest in the machine learning community (e.g., see [31], [47], [48]). Methods for handling non-stationary data include forgetting, as described by Salganicoff [49]. Other machine learning algorithms, such as neural networks with adaptive critics, have been proposed as anomaly detection methods [50]. However, these methods require a

Conclusion

Protecting web servers is a challenging and important problem, especially in the context of custom web-based applications. We described a method for detecting anomalous web requests via DFA induction that meets many of the requirements of an autonomous security solution, including efficient online unsupervised one-class learning, tolerance for highly variable normal behavior that contains benign attacks, and detection of a wide variety of attacks on web applications. We have also developed and

Acknowledgements

SF and KI gratefully acknowledge the support of the National Science Foundation (Grants ANIR-9986555, CCR-0331580, and CCR-0311686), the Defense Advanced Research Projects Agency (Grant F30602-02-1-0146), and the Santa Fe Institute. AS gratefully acknowledges the support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and MITACS.

References (58)

  • R. Lippmann et al., The 1999 DARPA off-line intrusion detection evaluation, Computer Networks (2000)
  • C. Kruegel et al., A multi-model approach to the detection of web-based attacks, Computer Networks (2005)
  • E. Gold, Complexity of automaton identification from given data, Information and Control (1978)
  • J.O. Kephart, A biologically inspired immune system for computers, in: Proceedings of Artificial Life IV, the Fourth...
  • S. Forrest, A.S. Perelson, L. Allen, R. Cherukuri, Self-nonself discrimination in a computer, in: Proceedings of 1994...
  • S.A. Hofmeyr, S. Forrest, Immunizing computer networks: Getting all the machines in your network to fight the hacker...
  • S.A. Hofmeyr, An immunological model of distributed detection and its application to computer security, Ph.D. thesis....
  • P. Williams, K. Anchor, J. Bebo, G. Gunsch, G. Lamont, CDIS: Towards a computer immune system for detecting network...
  • S. Forrest, S. Hofmeyr, A. Somayaji, T. Longstaff, A sense of self for Unix processes, in: Proceedings of 1996 IEEE...
  • S.A. Hofmeyr, S. Forrest, A. Somayaji, Intrusion detection using sequences of system calls. Journal of Computer...
  • A. Somayaji, S. Forrest, Automated response using system-call delays, in: Proceedings of the 9th USENIX Security...
  • A. Somayaji, Operating system stability and security through process homeostasis, Ph.D. thesis, University of New...
  • W. Robertson, G. Vigna, C. Kruegel, R.A. Kemmerer, Using generalization and characterization techniques in the...
  • D. Moore et al., Inside the Slammer worm, IEEE Security and Privacy (2003)
  • J.P. Anderson, Computer security technology planning study, Technical Report ESD-TR-73-51, United States Air Force,...
  • T.F. Lunt, Detecting Intruders in Computer Systems, in: Proceedings of 1993 Conference on Auditing and Computer...
  • L.T. Heberlein et al., A network security monitor
  • C. Kruegel et al., Anomaly detection of web-based attacks
  • M. Roesch, Snort—lightweight intrusion detection for networks, in: Proceedings of 13th Systems Administration...
  • S. Patton, W. Yurcik, D. Doss, An Achilles’ heel in signature-based IDS: Squealing false positives in SNORT, in:...
  • M.J. Ranum, A network firewall, in: Proceedings of the First World Conference on System Administration and Security,...
  • G.W. Treese, A. Wolman, X through the firewall and other application relays, in: Proceedings of the USENIX Summer...
  • M. Handley, V. Paxson, C. Kreibich, Network intrusion detection: Evasion, traffic normalization, and end-to-end...
  • E. Strother, Denial of service protection: the nozzle, in: Proceedings of Annual Computer Security Applications...
  • C. Ko, G. Fink, K. Levitt, Automated detection of vulnerabilities in privileged programs by execution monitoring, in:...
  • P.G. Neumann, P.A. Porras, Experience with EMERALD to date, in: Proceedings of First USENIX Workshop on Intrusion...
  • S.A. Hofmeyr et al., Architecture for an artificial immune system, Evolutionary Computation (2000)
  • J. Balthrop, S. Forrest, M. Glickman, Revisiting lisys: Parameters and normal behavior, in: Proceedings of the 2002...
  • C. Marceau, Characterizing the behavior of a program using multiple-length n-grams, in: Proceedings of New Security...

Kenneth L. Ingham received his Bachelor of Science (1985) in Computer Science, cum laude, from the University of New Mexico, as well as his Master of Science (1994) in Computer Science from the University of New Mexico. He is currently finishing his Ph.D. in Computer Science at the University of New Mexico, focusing on anomaly detection for protecting network servers. He expects to complete the degree in December 2006.

Anil Somayaji received the BS (1994) degree in mathematics from the Massachusetts Institute of Technology and the PhD (2002) degree in computer science from the University of New Mexico. He is an assistant professor in the School of Computer Science at Carleton University. His research interests include computer security, operating systems, complex adaptive systems, and artificial life.

John Burge was born and raised in New Mexico. He received his Bachelor of Science in Computer Science, cum laude, from the University of New Mexico, as well as his Master's degree in Computer Science. He is currently finishing his Ph.D. in machine learning (specifically, probabilistic Bayesian methods) and expects to complete it in December 2006.

Stephanie Forrest is Professor and Chairman of Computer Science at the University of New Mexico in Albuquerque and a Research Professor at the Santa Fe Institute. She received a Ph.D. in Computer and Communication Sciences from the University of Michigan. Before joining UNM she worked for Teknowledge Inc. and was a Director’s Fellow at the Center for Nonlinear Studies, Los Alamos National Laboratory. She is currently a member of the Santa Fe Institute Science Board and served as SFI’s Interim Vice President 1999–2000. Her research interests are in adaptive systems, including genetic algorithms, computational immunology, biological modeling, and computer security.
