Performance Evaluation (Elsevier)

Volume 54, Issue 1, September 2003, Pages 33-57

A hierarchical and multiscale approach to analyze E-business workloads

https://doi.org/10.1016/S0166-5316(02)00228-6

Abstract

Understanding the characteristics of electronic business (E-business) workloads is a crucial step to improve the quality of service offered to customers in E-business environments. This paper proposes a hierarchical and multiple time scale approach to characterize E-business workloads. The three levels of the hierarchy are user, application, and protocol, and are associated with customer sessions, functions requested, and HTTP requests, respectively. Within each layer, an analysis across several time scales is conducted. The approach is illustrated by presenting a detailed characterization of two actual E-business sites: an online bookstore and an electronic auction site. Our analysis of the workloads showed that the session length, measured in number of requests to execute E-business functions, is heavy-tailed, especially for sites subject to requests generated by robots. An overwhelming majority of the sessions consist of only a handful of requests, which suggests that most customers are human (as opposed to robots). A significant fraction of the functions requested by customers were found to be product selection functions as opposed to product ordering. An analysis of the popularity of search terms revealed that it follows a Zipf distribution. However, Zipf’s law as applied to E-business is time scale dependent due to the shift in popularity of search terms. We also found that requests to execute frequent E-business functions exhibit a pattern similar to the HTTP request arrival process. Finally, we demonstrated that there is a strong correlation in the arrival process at the HTTP request level. These correlations are particularly strong at intermediate time scales of a few minutes.

Introduction

Electronic business (E-business) sites are very complex, composed of several tiers of servers of different types (e.g., web servers, application servers, and database servers), and are subject to workloads that vary in ways that are hard to predict. The quality of service requirements for E-business sites are strict: customers demand fast response times and high availability, or else they turn to competitors. Understanding the nature and characteristics of E-business workloads is a crucial step to improve the quality of service offered to customers in E-business environments. E-business workload characterization can lead to a better understanding of the interaction between customers and web sites and can also help design systems with better performance and availability [16]. This paper presents a hierarchical and multiscale approach to the characterization of E-business workloads.

E-business workloads are composed of sessions. A session is a sequence of requests of different types made by a single customer during a single visit to a site. During a session, a customer requests the execution of various E-business functions such as browse, search, select, add to the shopping cart, register, and pay. A request to execute an E-business function may generate many HTTP requests to the site. For example, several images may have to be retrieved to display the page that contains the results of the execution of an E-business function.
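The mapping from HTTP requests up to E-business functions can be sketched in a few lines. The URL-to-function mapping below is hypothetical, not the actual sites' URL scheme; it simply illustrates how embedded objects (images, stylesheets) fetched to render a result page are folded into the function request that produced them.

```python
# Illustrative sketch only: FUNCTION_PATHS is a hypothetical URL scheme,
# not the actual sites'. Embedded objects are treated as part of the
# E-business function whose result page pulls them in.
FUNCTION_PATHS = {
    "/browse": "browse", "/search": "search", "/item": "select",
    "/cart/add": "add", "/register": "register", "/pay": "pay",
}
EMBEDDED = (".gif", ".jpg", ".png", ".css", ".js")

def to_functions(urls):
    """Return the E-business functions requested, skipping the
    embedded objects each function's result page pulls in."""
    funcs = []
    for url in urls:
        path = url.split("?")[0]
        if path.endswith(EMBEDDED):
            continue  # image/CSS fetched to render the previous page
        funcs.append(FUNCTION_PATHS.get(path, "other"))
    return funcs

log = ["/search?q=networks", "/logo.gif", "/item?id=42",
       "/cover42.jpg", "/cart/add", "/pay"]
print(to_functions(log))  # ['search', 'select', 'add', 'pay']
```

Six HTTP requests thus collapse into four function requests, which is the granularity at which the application layer of the hierarchy is analyzed.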

Past studies of WWW workloads concentrated on information provider sites and found several characteristics common to them [5], [7], [9], [18]. Some of these characteristics deal with file size distributions, file popularity distribution, self-similarity in web traffic, reference locality, and user request patterns. A number of studies of different web sites found file sizes to exhibit heavy-tailed distributions and object popularity to be Zipf-like. Other studies of different web site environments demonstrated long-range dependencies in the user request process, in other words, strong correlations in the user requests. In particular, Arlitt and Williamson [7] identified 10 workload properties, called invariants, across six different data sets, which included different types of information provider web sites. Some of the most relevant invariants are: (i) images and HTML files account for 90–100% of the files transferred; (ii) 10% of the documents account for 90% of all requests and bytes transferred; (iii) file sizes follow the Pareto distribution; and (iv) file inter-reference times are independent and exponentially distributed. Shortly after, Almeida et al. [5] discovered that the popularity of documents served by web sites dedicated to information dissemination follows Zipf’s law. In [9], the authors pointed to the self-similar nature of web server traffic. All these studies were performed almost 5 years ago. Since then, several major changes have been observed in the WWW. The most important are: clients now have much larger bandwidth, the number of users has grown exponentially, and E-business became one of the major applications on the web.

In [12], the authors introduce the notion of a session, consisting of many individual HTTP requests. However, they do not characterize the workload of E-business sites, which is composed of typical requests such as browse, search, select, add, and pay. Their analysis focuses only on the throughput gains obtained by an admission control mechanism that aims at guaranteeing the completion of any accepted session. The work in [19] proposes a workload characterization for E-business servers, where customers follow typical sequences of URLs as they move towards the completion of transactions. The authors, though, do not present any characterization or properties of actual E-business workloads.

There are very few published studies [6], [14], [17] of E-business workloads because of the difficulty in obtaining actual logs from electronic companies. Most companies consider web logs to be very sensitive data. In [17], the authors propose a graph-based methodology for characterizing E-business workloads and apply it to an actual workload to obtain metrics related to the interaction of customers with a site. For example, the paper shows how to obtain information such as the number of sessions, average session length, and buy-to-visit ratio. Ref. [15] presents several models (e.g., customer behavior model graph and customer visit model) for workload characterization of E-business sites. It also shows how workload models can be obtained from HTTP logs. Our previous work [14], extended here, discussed the issue of how to obtain invariants for E-business workloads. In [6], Arlitt et al. characterize the workload of an actual E-commerce site for the purpose of analyzing its scalability. They use performance-related criteria to cluster requests into similar groups. They then use multiclass queuing models to carry out a capacity planning study for the site. In [3], the authors study the impact of time scale on operational analysis for a large web-based shopping system. They show that time-related service level agreements and input parameters for predictive queuing models are sensitive to time scale.

A question that naturally arises is: are the characteristics and invariants found in information provider web sites still valid for E-business workloads? To answer this question, we define a hierarchical and multiscale approach to characterize the workload of E-business sites. The three layers of the hierarchy are: session, function, and HTTP request, as defined in Section 2. Within each layer, an analysis across several time scales is conducted. The approach is illustrated by presenting a detailed characterization of two actual E-business sites: an online bookstore and an electronic auction site. This paper extends our previous work [14], examining statistical and distributional properties of the E-business workloads and comparing these properties across the two data sets. As much as possible, we compare the features of these workloads with the invariants that were discovered for information dissemination web sites and provide an extended multiscale analysis of the workload. The same hierarchical approach was used by the authors to study the presence of robots in web workloads [4].

The rest of the paper is organized as follows. Section 2 shows the approach used to characterize E-business workloads. The next section describes the data collection process. Section 4 analyzes two logs from actual E-business sites and characterizes the workload at the HTTP request level. Characterizations at the E-business function and session levels are provided in Section 5 (Function characterization) and Section 6 (Session characterization), respectively. Finally, Section 7 presents concluding remarks.

Section snippets

Hierarchical multiscale approach

Workload characterization can be accomplished at many levels: user level, application level, protocol level, and network level. An E-business workload can be viewed in a multi-layer hierarchical way, as shown in Fig. 1. This paper focuses on the characterization of three levels, represented by the HTTP request-layer (protocol level), function layer (application level), and session layer (user level). This hierarchy can be used to capture changes in user behavior and map the effects of these

Data collection for case studies

The online bookstore sells exclusively on the Internet. The auction site sells Internet domains. In both cases, the data consist of access logs recorded by the WWW server of each E-business.

The data comprises 2 weeks of accesses to each of these sites. The bookstore logs were collected from 1–15 August 1999, while the auction server logs are from 28 March to 11 April 2000.

During these 2 weeks, the bookstore handled 3,630,964 requests (242,064 daily requests on average), transferring a total of

Request-layer characterization

In this section, we study the statistical nature of the arrival process of HTTP requests to allow for the extraction of statistically significant features towards classification, understanding, and modeling of request workload.
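One simple way to probe such arrival-process correlations is to aggregate request counts at several bin sizes and measure the lag-1 autocorrelation of each count series. This is a minimal sketch, not the paper's analysis (which relies on more powerful wavelet-based multiscale tools); the arrival timestamps here are synthetic, where real input would be epoch times parsed from HTTP logs.

```python
# Sketch: bin synthetic request arrivals at several time scales and
# compute the lag-1 autocorrelation of each count series. For the
# independent (Poisson-like) arrivals generated below, the
# autocorrelation stays near zero at every scale; correlated real
# traffic would show markedly larger values.
import random

def counts_at_scale(arrivals, bin_seconds):
    """Bin arrival timestamps (in seconds) into counts per interval."""
    if not arrivals:
        return []
    start, end = min(arrivals), max(arrivals)
    n_bins = int((end - start) // bin_seconds) + 1
    counts = [0] * n_bins
    for t in arrivals:
        counts[int((t - start) // bin_seconds)] += 1
    return counts

def lag1_autocorr(xs):
    """Sample lag-1 autocorrelation; near 0 for independent counts."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var

random.seed(0)
# Poisson-like arrivals over one hour (exponential gaps, mean 0.5 s).
t, arrivals = 0.0, []
while t < 3600:
    t += random.expovariate(2.0)
    arrivals.append(t)

for scale in (1, 10, 60):  # 1 s, 10 s, 1 min bins
    r1 = lag1_autocorr(counts_at_scale(arrivals, scale))
    print(f"bin={scale:>3}s  lag-1 autocorr={r1:+.3f}")
```

Repeating this over a range of bin sizes on real logs is what reveals scale-dependent behavior such as the stronger correlations the paper reports at intermediate time scales of a few minutes.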

Function characterization

In this section, we characterize the workload at the level of E-business functions. Our first criterion is the nature of the function. When considering an online store, we may divide the functions into four groups: static, product selection, purchase, and other. Static functions comprise the home and informational pages about the store. Product selection includes all functions that allow a client to find and verify a product they are looking for: browse, search, and view. Purchase functions

Session characterization

Session boundaries are delimited by a period of inactivity by a customer. In other words, if a customer has not issued any request for a period longer than a threshold τ, his session is considered finished. Usually, sites enforce this threshold and close inactive sessions to save resources allocated to these sessions. For the auction site, we know that the HTTP server enforced a threshold of 20 min. Since we do not have this information for the bookstore site, we had to estimate the threshold
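The inactivity-threshold rule just described can be sketched in a few lines. The 20 min value matches the auction site's server setting reported above; the request timestamps are hypothetical.

```python
# Sketch of the inactivity-threshold sessionization rule: a gap longer
# than tau between consecutive requests by the same client starts a new
# session. Timestamps below are hypothetical.
TAU = 20 * 60  # inactivity threshold in seconds (auction site setting)

def sessionize(timestamps, tau=TAU):
    """Group one client's request timestamps into sessions."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= tau:
            sessions[-1].append(t)  # within tau of the last request
        else:
            sessions.append([t])    # gap exceeded tau: new session
    return sessions

# Requests at 0 s, 5 min, and 40 min: the 35 min gap splits them.
reqs = [0, 300, 2400]
print([len(s) for s in sessionize(reqs)])  # [2, 1]
```

Varying tau in such a procedure and observing how the resulting session counts change is one way to estimate a reasonable threshold when, as for the bookstore site, the server's setting is unknown.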

Concluding remarks

Several studies have been published regarding the workload of information provider sites. However, very few studies are available for E-business sites. This paper presented a hierarchical and multiscale approach for workload characterization of E-business sites. The characterization was done at the session, E-business function, and request levels. The approach was applied to two actual E-businesses: an online bookstore and an online auction site.

The hierarchical and multiscale characterization

Acknowledgements

The authors would like to thank the anonymous reviewers for their detailed and helpful comments, which greatly improved the quality of this paper. The work of D. Menascé was partially supported by the sponsors of the E-center for E-business at GMU. The work of V. Almeida was partially supported by the Brazilian Research Council (CNPq) and by a grant from SIAM 76.97.1016.00. R. Riedi’s support comes in part from an NSF grant no. ANI-00099148 and from Texas Instruments. He acknowledges the

D.A. Menascé is a Professor of Computer Science at George Mason University, the Co-director of its E-center for E-business, and the Director of its M.S. in E-commerce program. He holds a Ph.D. in computer science from UCLA. Menasce is a fellow of the ACM and the recipient of the 2001 A.A. Michelson award from the Computer Measurement Group. His research interests include performance evaluation of distributed and web-based systems and software performance engineering.

References (22)

  • D.A. Menascé et al., Business-oriented resource management policies for e-commerce servers, Perform. Eval. (2000)
  • P. Abry, P. Flandrin, M. Taqqu, D. Veitch, Wavelets for the analysis, estimation and synthesis of scaling data, in:...
  • P. Abry, P. Gonçalvès, P. Flandrin, Wavelets, spectrum analysis and 1/f processes, in: A. Antoniadis, G. Oppenheim...
  • V. Almeida, M. Arlitt, J. Rolia, Analyzing a web-based system’s performance at multiple time scales, in: Proceedings of...
  • V. Almeida, D. Menascé, R. Riedi, F. Ribeiro, R. Fonseca, W. Meira Jr., Analyzing web robots and their impact on...
  • V. Almeida, M. Crovella, A. Bestavros, A. Oliveira, Characterizing reference locality in the WWW, in: Proceedings of...
  • M. Arlitt et al., Characterizing the scalability of a large web-based shopping system, ACM Trans. Internet Technol. (2001)
  • M. Arlitt, C. Williamson, Web server workload characterization, in: Proceedings of the 1996 SIGMETRICS Conference on...
  • L. Cherkasova, M. Gupta, Characterizing locality, evolution, and life span of accesses in enterprise media server...
  • M. Crovella et al., Self-similarity in world wide web traffic: evidence and possible causes, IEEE/ACM Trans. Networking (1997)
  • I. Daubechies, Ten Lectures on Wavelets, SIAM, New York,...

V.A.F. Almeida is a Professor of Computer Science at the Federal University of Minas Gerais, Brazil. He holds a Ph.D. in computer science from Vanderbilt University. Almeida held visiting positions at Boston University, Xerox Parc, and at HP Research Laboratories in Palo Alto. His research interests include performance evaluation and modeling of large scale distributed systems.

R. Riedi is a Faculty Fellow with the Electrical and Computer Engineering Department at Rice University in Houston, Texas. He holds a Ph.D. in mathematics from the Federal Institute of Technology ETH Zurich, Switzerland. He held positions at Yale University and at the National Research Institute in automation and computing, INRIA, Paris, France. He won the ETHZ Polya prize in 1986. His research interests lie in the theory and practice of multifractals, multiscale analysis and synthesis, especially for network traffic.

F. Ribeiro is a Ph.D. candidate in computer science at the Federal University of Minas Gerais (UFMG), Brazil. She has an M.S. in computer science from the same institution. She was a summer intern at HP Labs, Palo Alto in 2001. Her current interests include web workload modeling and characterization, web data mining, capacity planning and performance analysis.

R. Fonseca is a Ph.D. student in computer science at the University of California, Berkeley. He received his M.S. and B.S. degrees in computer science from the Federal University of Minas Gerais, Brazil. His research interests include workload modeling and characterization of distributed Internet systems such as the web, E-commerce systems, and search engines, as well as service distribution/replication strategies for better scalability of these systems.

W. Meira is an Associate Professor of computer science at the Federal University of Minas Gerais, Brazil. He holds a Ph.D. in computer science from the University of Rochester. His current interests include scalability issues, performance analysis, and modeling of parallel and distributed systems, in particular Internet-based systems.

This is an expanded and revised version of Menascé et al. [14].
