2006 | Book

Springer Handbook of Engineering Statistics

About this book

In today’s global and highly competitive environment, continuous improvement in the processes and products of any field of engineering is essential for survival. Many organisations have shown that the first step to continuous improvement is to integrate the widespread use of statistics and basic data analysis into the manufacturing development process as well as into the day-to-day business decisions taken in regard to engineering processes.

The "Springer Handbook of Engineering Statistics" gathers together the full range of statistical techniques required by engineers from all fields to gain sensible statistical feedback on how their processes or products are functioning and to give them realistic predictions of how these could be improved.

Table of Contents

Frontmatter

Fundamental Statistics and Its Applications

Frontmatter
1. Basic Statistical Concepts

This brief chapter presents some fundamental elements of engineering probability and statistics with which some readers are probably already familiar, but others may not be. Statistics is the study of how best to describe and analyze data and then draw conclusions or inferences from the available data. The first section of this chapter begins with some basic definitions, including probability axioms, basic statistics and reliability measures. The second section describes the most common distribution functions, such as the binomial, Poisson, geometric, exponential, normal, lognormal, Studentʼs t, gamma, Pareto, beta, Rayleigh, Cauchy, Weibull and Vtub-shaped hazard rate distributions, their applications and their use in engineering and applied statistics. The third section describes statistical inference, including parameter estimation and confidence intervals. Statistical inference is the process by which information from sample data is used to draw conclusions about the population from which the sample was selected, in the hope that the sample represents the whole population. This discussion also introduces the maximum likelihood estimation (MLE) method, the method of moments, MLE with censored data, the statistical change-point estimation method, nonparametric tolerance limits, sequential sampling and Bayesian methods. The fourth section briefly discusses stochastic processes, including Markov processes, Poisson processes, renewal processes, quasi-renewal processes, and nonhomogeneous Poisson processes. Finally, the last section provides a short list of books for readers who are interested in advanced engineering and applied statistics.
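
As a concrete illustration of the estimation ideas surveyed in this chapter, the short Python sketch below (not taken from the handbook; the failure times are invented) computes the maximum likelihood estimate of an exponential failure rate and a normal-approximation 95% confidence interval.

    import numpy as np
    from scipy import stats

    # hypothetical failure times (hours)
    t = np.array([120., 340., 95., 410., 230., 180., 510., 275.])

    # MLE of the exponential rate: lambda_hat = n / sum(t)
    lam_hat = len(t) / t.sum()

    # approximate 95% confidence interval from the asymptotic normality of the MLE
    # (standard error of lambda_hat is roughly lambda_hat / sqrt(n))
    z = stats.norm.ppf(0.975)
    se = lam_hat / np.sqrt(len(t))
    print("lambda_hat = %.5f per hour" % lam_hat)
    print("approx. 95%% CI: (%.5f, %.5f)" % (lam_hat - z * se, lam_hat + z * se))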

Hoang Pham
2. Statistical Reliability with Applications

This chapter reviews fundamental ideas in reliability theory and inference. The first part of the chapter covers lifetime distributions used in engineering reliability analysis, including general properties of reliability distributions that pertain to the lifetime of manufactured products. Certain distributions are formulated on the basis of simple physical properties, while others are more or less empirical. The first part of the chapter ends with a description of graphical and analytical methods for finding appropriate lifetime distributions for a set of failure data. The second part of the chapter describes statistical methods for analyzing reliability data, including maximum likelihood estimation and likelihood ratio testing. Degradation data are more prevalent in experiments in which failure is rare and test time is limited. Special regression techniques for degradation data can be used to draw inference on the underlying lifetime distribution, even if failures are rarely observed. The last part of the chapter discusses reliability for systems. Along with the components that make up the system, reliability analysis must take account of the system configuration and (stochastic) component dependencies. System reliability is illustrated with an analysis of logistics systems (e.g., moving goods in a system of product sources and retail outlets). Robust reliability design can be used to construct a supply chain that runs with maximum efficiency or minimum cost.
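
To illustrate how the system configuration enters a reliability calculation, here is a minimal sketch with invented component reliabilities and an independence assumption; it is an illustration only, not the chapterʼs logistics example.

    import numpy as np

    # hypothetical reliabilities of three independent components
    r = np.array([0.95, 0.90, 0.99])

    r_series = np.prod(r)              # series system: all components must work
    r_parallel = 1 - np.prod(1 - r)    # parallel system: at least one must work

    print("series system reliability:   %.4f" % r_series)
    print("parallel system reliability: %.4f" % r_parallel)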

Paul Kvam, Jye-Chyi Lu
3. Weibull Distributions and Their Applications

Weibull models are used to describe various types of observed failures of components and phenomena. They are widely used in reliability and survival analysis. In addition to the traditional two-parameter and three-parameter Weibull distributions in the reliability or statistics literature, many other Weibull-related distributions are available. The purpose of this chapter is to give a brief introduction to those models, with the emphasis on models that have the potential for further applications. After introducing the traditional Weibull distribution, some historical development and basic properties are presented. We also discuss estimation problems and hypothesis-testing issues, with the emphasis on graphical methods. Many extensions and generalizations of the basic Weibull distributions are then summarized. Various applications in the reliability context and some Weibull analysis software are also provided.
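
As a small, hedged illustration of the two-parameter Weibull model discussed above, the following Python sketch fits a Weibull distribution to invented failure times with scipy and evaluates the resulting failure-rate (hazard) function; it is not code from the chapter.

    import numpy as np
    from scipy import stats

    # hypothetical failure times
    t = np.array([105., 220., 80., 340., 190., 265., 150., 400.])

    # fit a two-parameter Weibull (location fixed at zero)
    shape, loc, scale = stats.weibull_min.fit(t, floc=0)
    print("shape (beta) = %.3f, scale (eta) = %.1f" % (shape, scale))

    # failure rate h(t) = (beta/eta) * (t/eta)**(beta - 1)
    grid = np.linspace(50, 400, 5)
    hazard = (shape / scale) * (grid / scale) ** (shape - 1)
    for g, h in zip(grid, hazard):
        print("t = %5.0f   h(t) = %.5f" % (g, h))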

Chin-Diew Lai, D.N. Murthy, Min Xie
4. Characterizations of Probability Distributions

A characterization is a certain distributional or statistical property of a statistic or statistics that uniquely determines the associated stochastic model. This chapter provides a brief survey of the huge literature on this topic. Characterizations based on random (complete or censored) samples from common univariate discrete and continuous distributions, and some multivariate continuous distributions are presented. Characterizations that use the properties of sample moments, order statistics, record statistics, and reliability properties are reviewed. Applications to simulation, stochastic modeling and goodness-of-fit tests are discussed. An introduction to further resources is given.

H. Nagaraja
5. Two-Dimensional Failure Modeling

For many products (for example, automobiles), failures depend on age and usage and, in this case, failures are random points in a two-dimensional plane with the two axes representing age and usage. In contrast to the one-dimensional case (where failures are random points along the time axis), the modeling of two-dimensional failures has received very little attention. In this chapter we discuss various issues (such as the modeling process, parameter estimation and model analysis) for the two-dimensional case and compare it with the one-dimensional case.

D.N. Murthy, Jaiwook Baik, Richard Wilson, Michael Bulmer
6. Prediction Intervals for Reliability Growth Models with Small Sample Sizes

The first section of this chapter provides an introduction to the types of test considered for this growth model and a description of the two main forms of uncertainty encountered within statistical modelling, namely aleatory and epistemic. These two forms are combined to generate prediction intervals for use in reliability growth analysis. The second section of this chapter provides a historical account of the modelling form used to support prediction intervals. An industry-standard model is described and extended to account for both forms of uncertainty in supporting predictions of the time to the detection of the next fault. The third section of this chapter describes the derivation of the prediction intervals. The approach to modelling growth uses a hybrid of the Bayesian and frequentist approaches to statistical inference. A prior distribution is used to describe the number of potential faults believed to exist within a system design, while reliability growth test data are used to estimate the rate at which these faults are detected. After deriving the prediction intervals, the fourth section of this chapter provides an analysis of the statistical properties of the underlying distribution for a range of small sample sizes. The fifth section gives an illustrative example used to demonstrate the computation and interpretation of the prediction intervals within a typical product development process. The strengths and weaknesses of the process are discussed in the final section.

John Quigley, Lesley Walls
7. Promotional Warranty Policies: Analysis and Perspectives

Warranty is a topic that has been studied extensively by different disciplines, including engineering, economics, management science, accounting, and marketing [7.1, p. 47]. This chapter aims to provide an overview of warranties, focusing on the cost and benefit perspective of warranty issuers. After a brief introduction to the current status of warranty research, the second part of this chapter classifies various existing and several new promotional warranty policies to extend the taxonomy initiated by Blischke and Murthy [7.2]. Focusing on the quantitative modeling perspective of both the cost and benefit analyses of warranties, we summarize five problems that are essential to warranty issuers. These problems are: i) what are the warranty cost factors; ii) how to compare different warranty policies; iii) how to analyze the warranty cost of multi-component systems; iv) how to evaluate the warranty benefits; v) how to determine the optimal warranty policy. A list of future warranty research topics is presented in the last part of this chapter. We hope that this will stimulate further interest among researchers and practitioners.

Jun Bai, Hoang Pham
8. Stationary Marked Point Processes

Many areas of engineering and statistics involve the study of a sequence of random events, described by points occurring over time (or space), together with a mark for each such point that contains some further information about it (type, class, etc.). Examples include image analysis, stochastic geometry, telecommunications, credit or insurance risk, discrete-event simulation, empirical processes, and general queueing theory. In telecommunications, for example, the events might be the arrival times of requests for bandwidth usage, and the marks the bandwidth capacity requested. In a mobile phone context, the points could represent the locations (at some given time) of all mobile phones, and the marks 1 or 0 according to whether the phone is in use or not. Such a stochastic sequence is called a random marked point process, an MPP for short. In a stationary stochastic setting (e.g., if we have moved our origin far away in time or space, so that moving further would not change the distribution of what we see) there are two versions of an MPP of interest, depending on how we choose our origin: point-stationary and time-stationary (space-stationary). The first randomly chooses an event point as the origin, whereas the second randomly chooses a time (or space) point as the origin. Fundamental mathematical relationships exist between these two versions, allowing for nice applications and computations. In what follows, we present this basic theory with emphasis on one-dimensional processes over time, but also include some recent results for d-dimensional Euclidean space. This chapter will primarily deal with marked point processes with points on the real line (time). Spatial point processes with points in d-dimensional Euclidean space will be touched upon in the final section; some of the deepest results in multiple dimensions have only come about recently. Topics covered include point- and time-stationarity, inversion formulas, the Palm distribution, Campbellʼs formula, MPPs jointly stationary with a stochastic process, the rate conservation law, conditional intensities, and ergodicity.

Karl Sigman
9. Modeling and Analyzing Yield, Burn-In and Reliability for Semiconductor Manufacturing: Overview

The demand for proactive techniques to model yield and reliability and to deal with various infant mortality issues is growing with increased integrated circuit (IC) complexity and new technologies toward the nanoscale. This chapter provides an overview of modeling and analysis of yield and reliability with an additional burn-in step as a fundamental means for yield and reliability enhancement. After the introduction, the second section reviews yield modeling. The notions of various yield components are introduced. Existing models, such as the Poisson model, compound Poisson models and other approaches to yield modeling, are presented. In addition to the critical area and defect size distributions on the wafers, other key factors for accurate yield modeling are also examined. This section addresses the issues in improving semiconductor yield, including how clustering may affect yield. The third section reviews reliability aspects of semiconductors, such as the properties of failure mechanisms and the typical bathtub failure rate curve, with an emphasis on the high rate of early failures. The issues for reliability improvement are addressed. The fourth section discusses several issues related to burn-in. The necessity for and effects of burn-in are examined, as are strategies for the level and type of burn-in. The literature on optimal burn-in policy is reviewed. Often the percentile residual life can be a good measure of performance in addition to the failure rate or reliability commonly used. The fifth section introduces proactive methods of estimating semiconductor reliability from yield information using yield–reliability relation models. Time-dependent and time-independent models are discussed. The last section concludes this chapter and addresses topics for future research and development.
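
For orientation, the classical Poisson and compound (negative binomial) yield formulas mentioned above can be evaluated in a few lines; the die area, defect density and clustering parameter below are invented for illustration.

    import numpy as np

    # hypothetical die critical area (cm^2) and defect density (defects/cm^2)
    A, D0 = 0.5, 0.8
    alpha = 2.0          # clustering parameter for the negative binomial model

    y_poisson = np.exp(-A * D0)                   # Poisson yield model
    y_negbin  = (1 + A * D0 / alpha) ** (-alpha)  # compound (negative binomial) model

    print("Poisson yield:           %.3f" % y_poisson)
    print("Negative binomial yield: %.3f" % y_negbin)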

Way Kuo, Kyungmee Kim, Taeho Kim

Process Monitoring and Improvement

Frontmatter
10. Statistical Methods for Quality and Productivity Improvement

The first section of this chapter introduces statistical process control (SPC) and robust design (RD), two important statistical methodologies for quality and productivity improvement. Section 10.1 describes in depth SPC theory and tools for monitoring independent and autocorrelated data with a single quality characteristic. The relationship between SPC methods and automatic process control methods is discussed and differences in their philosophies, techniques, efficiencies, and design are contrasted. SPC methods for monitoring multivariate quality characteristics are also briefly reviewed. Section 10.2 considers univariate RD, with emphasis on experimental design, performance measures and modeling of the latter. Combined and product arrays are featured, and the performance measures examined include signal-to-noise ratios (SNR), PerMIAs, process response, process variance and desirability functions. Of central importance is the decomposition of the expected value of squared-error loss into variance and off-target components, which sometimes allows the dimensionality of the optimization problem to be reduced. Section 10.3 deals with multivariate RD and demonstrates that the objective function for the multiple-characteristic case is typically formed by additive or multiplicative combination of the univariate objective functions. Some alternative objective functions are examined, as well as strategies for solving the optimization problem. Section 10.4 defines dynamic RD and summarizes related publications in the statistics literature, including some very recent entries. Section 10.5 lists RD case studies originating from applications in manufacturing, reliability and tolerance design.
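
As a tiny illustration of the robust-design performance measures listed above, the sketch below computes a Taguchi nominal-the-best signal-to-noise ratio for one hypothetical factor setting (the replicate responses are invented).

    import numpy as np

    # hypothetical replicated responses at one design setting
    y = np.array([10.2, 9.8, 10.1, 10.4, 9.9])

    ybar, s2 = y.mean(), y.var(ddof=1)

    # Taguchi nominal-the-best signal-to-noise ratio (in dB)
    snr = 10 * np.log10(ybar**2 / s2)
    print("SNR (nominal-the-best): %.2f dB" % snr)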

Wei Jiang, Terrence Murphy, Kwok-Leung Tsui
11. Statistical Methods for Product and Process Improvement

The first part of this chapter describes a process model and the importance of product and process improvement in industry. Six Sigma methodology is introduced as one of the most successful integrated statistical tools. The second section then describes the basic ideas behind Six Sigma methodology and the (D)MAIC(T) process for a better understanding of this integrated process improvement methodology. In the third section, “Product Specification Optimization”, optimization models are developed to determine optimal specifications that minimize the total cost to both the producer and the consumer, based on the present technology and the existing process capability. The total cost consists of the expected quality loss due to variability to the consumer, and the scrap or rework cost and inspection or measurement cost to the producer. We set up the specifications and use them as a countermeasure for inspection or product disposition only if this reduces the total cost compared with the expected quality loss without inspection. Several models are presented for various process distributions and quality loss functions. The fourth part, “Process Optimization”, demonstrates that the process can be improved during the design phase by reducing the bias or variance of the system output, that is, by changing the mean and variance of the quality characteristic of the output. Statistical methods for process optimization, such as experimental design, response surface methods, and Chebyshevʼs orthogonal polynomials, are reviewed. Integrated optimization models are then developed to minimize the total cost to the system of producers and customers by determining the means and variances of the controllable factors. Finally, a short summary concludes this chapter.
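
The expected quadratic quality loss that drives the specification models above decomposes into an off-target term and a variance term; here is a minimal numeric sketch with invented target, loss coefficient and process parameters.

    import numpy as np

    # hypothetical quality characteristic: target T, loss coefficient k,
    # process mean mu and standard deviation sigma
    T, k = 50.0, 2.0
    mu, sigma = 50.8, 1.5

    # E[L] = k * ((mu - T)^2 + sigma^2): off-target bias plus variability
    expected_loss = k * ((mu - T) ** 2 + sigma ** 2)
    print("expected quadratic quality loss per unit: %.2f" % expected_loss)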

Kailash Kapur, Qianmei Feng
12. Robust Optimization in Quality Engineering

Quality engineers often face the job of identifying process or product design parameters that optimize performance response. The first step is to construct a model, using historical or experimental data, that relates the design parameters to the response measures. The next step is to identify the best design parameters based on the model. Clearly, the model itself is only an approximation of the true relationship between the design parameters and the responses. Advances in optimization theory and computer technology have enabled quality engineers to obtain a good solution more efficiently by taking into account the inherent uncertainty in these empirically based models. Two widely used techniques for parameter optimization, described with examples in this chapter, are response surface methodology (RSM) and the Taguchi loss function. In both methods, the response model is assumed to be fully correct at each step. In this chapter we show how to enhance both methods by using robust optimization tools that acknowledge the uncertainty in the models to find even better solutions. We develop a family of models from the confidence region of the model parameters and show how to use sophisticated optimization techniques to find better design parameters over the entire family of approximate models. Section 12.1 of the chapter gives an introduction to the design parameter selection problem and motivates the need for robust optimization. Section 12.2 presents the robust optimization approach to the problem of optimizing empirically based response functions by developing a family of models from the confidence region of the model parameters. In Sect. 12.2 robust optimization is compared to traditional optimization approaches, where the empirical model is assumed to be true and the optimization is conducted without considering the uncertainty in the parameter estimates. Simulation is used to make the comparison in the context of response surface methodology, a widely used method to optimize products and processes that is briefly described in the section. Section 12.3 introduces a refined technique, called weighted robust optimization, where more-likely points in the confidence region of the empirically determined parameters are given heavier weight than less-likely points. We show that this method provides even more effective solutions than robust optimization without weights. Section 12.4 discusses Taguchiʼs loss function and how to leverage robust optimization methods to obtain better solutions when the loss function is estimated from empirical experimental data.
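
As a simple point of reference for the response-surface step described above (before any robustness considerations), the sketch below fits a one-factor second-order model by least squares and locates its stationary point; the design settings and responses are invented.

    import numpy as np

    # hypothetical single-factor experiment: design settings x and responses y
    x = np.array([-2., -1., 0., 1., 2.])
    y = np.array([12.1, 14.0, 15.2, 14.6, 12.8])

    # fit a second-order response surface y = b0 + b1*x + b2*x^2
    b2, b1, b0 = np.polyfit(x, y, 2)

    # stationary point of the fitted quadratic (a maximum here, since b2 < 0)
    x_star = -b1 / (2 * b2)
    print("fitted optimum at x = %.3f" % x_star)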

Susan Albin, Di Xu
13. Uniform Design and Its Industrial Applications

Uniform design is a kind of space-filling design whose application to industrial experiments, reliability testing and computer experiments is a novel endeavor. Uniform design is characterized by uniform scattering of the design points over the experimental domain, and hence is particularly suitable for experiments with an unknown underlying model and for experiments in which the entire experimental domain has to be adequately explored. An advantage of uniform design over traditional designs such as factorial design is that, even when the number of factors or the number of levels of the factors is large, the experiment can still be completed in a relatively small number of runs. In this chapter we shall introduce uniform design, the relevant underlying theories, and the methods of constructing uniform designs in the s-dimensional cube and in the (q − 1)-dimensional simplex for experiments with mixtures. We shall also give application examples of industrial experiments, accelerated stress testing and computer experiments.
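
A rough sketch of one common construction of uniform designs, the good-lattice-point method, is given below; the run size and generating vector are arbitrary illustrative choices, not a recommendation from the chapter.

    import numpy as np
    from math import gcd

    def glp_uniform_design(n, gens):
        # Good-lattice-point construction of an n-run design in the unit cube.
        # gens is a vector of generators coprime to n (a common textbook choice).
        assert all(gcd(h, n) == 1 for h in gens)
        k = np.arange(1, n + 1).reshape(-1, 1)
        u = (k * np.array(gens)) % n          # lattice points in {0, ..., n-1}
        return (u + 0.5) / n                  # centered points in (0, 1)

    # hypothetical 11-run design in 3 factors with generators (1, 3, 7)
    print(glp_uniform_design(11, [1, 3, 7]))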

Kai-Tai Fang, Ling-Yau Chan
14. Cuscore Statistics: Directed Process Monitoring for Early Problem Detection

This chapter presents the background to the Cuscore statistic, the development of the Cuscore chart, and how it can be used as a tool for directed process monitoring. In Sect. 14.1 an illustrative example shows how it is effective at providing an early signal to detect known types of problems, modeled as mathematical signals embedded in observational data. Section 14.2 provides the theoretical development of the Cuscore and shows how it is related to Fisherʼs score statistic. Sections 14.3, 14.4, and 14.5 then present the details of using Cuscores to monitor for signals in white noise, autocorrelated data, and seasonal processes, respectively. The capability to home in on a particular signal is certainly an important aspect of Cuscore statistics. However, Sect. 14.6 shows how they can be applied much more broadly to include the process model (i.e., a model of the process dynamics and noise) and process adjustments (i.e., feedback control). Two examples from industrial cases show how Cuscores can be devised and used appropriately in more complex monitoring applications. Section 14.7 concludes the chapter with a discussion and description of future work.
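
A minimal white-noise version of the idea: the Cuscore accumulates residuals weighted by a detector that matches the feared signal, so it grows quickly when that signal is actually present. The signal form, amplitude and noise level below are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(1)
    t = np.arange(100)

    detector = np.sin(2 * np.pi * t / 12)        # known signal pattern to look for
    noise = rng.normal(0, 1, size=t.size)

    y_in_control = noise                          # white noise only
    y_with_signal = noise + 0.8 * detector        # the feared signal hidden in noise

    # Cuscore-style statistic: cumulative sum of residuals times the detector
    q0 = np.cumsum(y_in_control * detector)
    q1 = np.cumsum(y_with_signal * detector)
    print("final Cuscore, no signal:   %.1f" % q0[-1])
    print("final Cuscore, with signal: %.1f" % q1[-1])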

Harriet Nembhard
15. Chain Sampling

A brief introduction to the concept of chain sampling is first presented. The chain sampling plan of type ChSP-1 is then reviewed, and a discussion of the design and application of ChSP-1 plans is presented in the second section of this chapter. Various extensions of chain sampling plans, such as the ChSP-4 plan, are discussed in the third part. The representation of the ChSP-1 plan as a two-stage cumulative results criterion plan, and its design, are discussed in the fourth part. The fifth section relates to the modification of the ChSP-1 plan. The sixth section of this chapter is on the relationship between chain sampling and deferred sentencing plans. A review of sampling inspection plans that are based on the ideas of chain or dependent sampling or deferred sentencing is also given in this section. The economics of chain sampling compared to quick switching systems is discussed in the seventh section. The eighth section extends attribute chain sampling to variables inspection. In the ninth section, chain sampling is compared with the CUSUM approach. The tenth section gives several other interesting extensions of chain sampling, such as chain sampling for mixed attribute and variables inspection. The final section gives concluding remarks.
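
For concreteness, the operating characteristic of a ChSP-1 plan can be written as Pa(p) = P0 + P1 · P0^i, where P0 and P1 are the binomial probabilities of zero and one nonconforming items in a sample of size n; a short sketch with an arbitrary illustrative plan follows.

    from scipy.stats import binom

    def chsp1_accept_prob(p, n, i):
        # ChSP-1 acceptance probability: accept on zero nonconforming, or on one
        # nonconforming provided the previous i samples contained none.
        p0 = binom.pmf(0, n, p)
        p1 = binom.pmf(1, n, p)
        return p0 + p1 * p0 ** i

    # hypothetical plan: sample size n = 5, i = 3 preceding clear samples
    for p in (0.01, 0.05, 0.10):
        print("p = %.2f  Pa = %.3f" % (p, chsp1_accept_prob(p, n=5, i=3)))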

Raj Govindaraju
16. Some Statistical Models for the Monitoring of High-Quality Processes

One important application of statistical models in industry is statistical process control. Many control charts have been developed and used in industry. They are easy to use, yet they have been developed based on sound statistical principles. However, for todayʼs high-quality processes, traditional control-charting techniques are not applicable in many situations. Research has been ongoing over the last two decades and new methods have been proposed. This chapter summarizes some of these techniques. High-quality processes are those with very low defect-occurrence rates. Control charts based on the cumulative count of conforming items are recommended for such processes. The use of such charts has opened up new frontiers in the research and applications of statistical control charts in general. In this chapter, several extended or modified statistical models are described. They are useful when the simple and basic geometric distribution is not appropriate or is insufficient. In particular, we present some extended Poisson distribution models that can be used for count data with large numbers of zero counts. We also extend the chart to the case of general time-between-event monitoring; such an extension can be useful in service or reliability monitoring. Traditionally, the exponential distribution is used for modeling the time between events, although other distributions such as the Weibull or gamma distribution can also be used in this context.
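
As a hedged illustration of a cumulative count of conforming (CCC) chart, probability limits can be read off the geometric distribution; the in-control fraction nonconforming and false-alarm rate below are invented.

    from scipy.stats import geom

    # hypothetical in-control fraction nonconforming (50 ppm) and false-alarm rate
    p, alpha = 50e-6, 0.0027

    # probability limits for the cumulative count of items inspected until a
    # nonconforming one is found (geometric distribution on 1, 2, ...)
    lcl = geom.ppf(alpha / 2, p)
    ucl = geom.ppf(1 - alpha / 2, p)
    print("LCL = %d items, UCL = %d items" % (lcl, ucl))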

Min Xie, Thong Goh
17. Monitoring Process Variability Using EWMA

During the last decade, the use of the exponentially weighted moving average (EWMA) statistic as a process-monitoring tool has become more and more popular in the statistical process-control field. While the properties and design strategies of the EWMA control chart for the mean have been thoroughly investigated, the use of the EWMA as a tool for monitoring process variability has received little attention in the literature. The goal of this chapter is to present some recent innovative EWMA-type control charts for the monitoring of process variability (i.e. the sample variance, sample standard deviation and the range). In the first section of this chapter, the definition of an EWMA sequence and its main properties will be presented, together with the commonly used procedures for the numerical computation of the average run length (ARL). The second section will be dedicated to the use of the EWMA as a monitoring tool for the process position, i.e. the sample mean and sample median. In the third section, the use of the EWMA for monitoring the sample variance, sample standard deviation and the range will be presented, assuming a fixed sampling interval (FSI) strategy. Finally, in the fourth section of this chapter, the variable sampling interval adaptive versions of the EWMA-S2 and EWMA-R control charts will be presented.
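
A minimal sketch of the EWMA recursion and its asymptotic control limits (for a mean-monitoring chart with invented data and conventional parameter choices) may help fix the notation; the variance-monitoring charts in the chapter build on the same recursion.

    import numpy as np

    def ewma_chart(x, lam=0.2, target=0.0, sigma=1.0, L=3.0):
        # EWMA statistic z_t = lam*x_t + (1-lam)*z_{t-1} with asymptotic limits
        z = np.empty(len(x))
        prev = target
        for i, xi in enumerate(x):
            prev = lam * xi + (1 - lam) * prev
            z[i] = prev
        width = L * sigma * np.sqrt(lam / (2 - lam))   # asymptotic limit half-width
        return z, target - width, target + width

    rng = np.random.default_rng(0)
    data = rng.normal(0, 1, 30)                        # invented in-control data
    z, lcl, ucl = ewma_chart(data)
    signal = bool(np.any((z < lcl) | (z > ucl)))
    print("limits: (%.3f, %.3f); any signal: %s" % (lcl, ucl, signal))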

Philippe Castagliola, Giovanni Celano, Sergio Fichera
18. Multivariate Statistical Process Control Schemes for Controlling a Mean

The quality of products produced and services provided can only be improved by examining the process to identify causes of variation. Modern production processes can involve tens to hundreds of variables, and multivariate procedures play an essential role when evaluating their stability and the amount of variation produced by common causes. Our treatment emphasizes the detection of a change in level of a multivariate process. After a brief introduction, in Sect. 18.1 we review several of the important univariate procedures for detecting a change in level among a sequence of independent random variables. These include Shewhartʼs X-bar chart, Pageʼs cumulative sum, Crosierʼs cumulative sum, and exponentially weighted moving-average schemes. Multivariate schemes are examined in Sect. 18.2. In particular, we consider the multivariate T2 chart and the related bivariate ellipse format chart, the cumulative sum of T chart, Crosierʼs multivariate scheme, and multivariate exponentially weighted moving-average schemes. An application to a sheet metal assembly process is discussed in Sect. 18.3 and the various multivariate procedures are illustrated. Comparisons are made between the various multivariate quality monitoring schemes in Sect. 18.4. A small simulation study compares average run lengths of the different procedures under some selected persistent shifts. When the number of variables is large, it is often useful to base the monitoring procedures on principal components. Section 18.5 discusses this approach. An example is also given using the sheet metal assembly data. Finally, in Sect. 18.6, we warn against using the standard monitoring procedures without first checking for independence among the observations. Some calculations, involving first-order autoregressive dependence, demonstrate that dependence causes a substantial deviation from the nominal average run length.
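
As a small illustration of the multivariate ideas, the sketch below computes a one-sample Hotelling T2 statistic for an invented bivariate data set and compares it with an F-based critical value; the chart-specific control limits in the chapter differ in their constants.

    import numpy as np
    from scipy.stats import f

    rng = np.random.default_rng(2)
    X = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=25)  # invented data
    n, p = X.shape
    mu0 = np.zeros(p)                                                   # in-control mean

    xbar = X.mean(axis=0)
    S = np.cov(X, rowvar=False)
    t2 = n * (xbar - mu0) @ np.linalg.solve(S, xbar - mu0)

    # one-sample Hotelling T2 test: T2 ~ p(n-1)/(n-p) * F(p, n-p) under H0
    crit = p * (n - 1) / (n - p) * f.ppf(0.99, p, n - p)
    print("T2 = %.2f, critical value = %.2f" % (t2, crit))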

Richard Johnson, Ruojia Li

Reliability Models and Survival Analysis

Frontmatter
19. Statistical Survival Analysis with Applications

This chapter discusses several important and interesting applications of statistical survival analysis which are relevant to both medical studies and reliability studies. Although proportional hazards models have been used more extensively in biomedical research, accelerated failure time models are much more popular in engineering and reliability research. Through several applications, this chapter not only offers some unified approaches to statistical survival analysis in biomedical research and reliability/engineering studies, but also sets up necessary connections between the statistical survival models used by biostatisticians and those used by statisticians working in engineering and reliability studies. The first application is the determination of sample size in a typical clinical trial when the mean or a certain percentile of the survival distribution is to be compared. The approach to the problem is based on an accelerated failure time model and therefore can be applied directly to designing reliability studies that compare the reliability of two or more groups of differently manufactured items. The other application discussed in this chapter is the statistical analysis of reliability data collected from several variations of step-stress accelerated life tests. The approach to the problem is based on the accelerated failure time model, but we point out that these methodologies can be applied directly to medical and clinical studies in which different doses of a therapeutic compound are administered to experimental subjects in a sequential order.

Chengjie Xiong, Kejun Zhu, Kai Yu
20. Failure Rates in Heterogeneous Populations

Most of the papers on failure rate modeling deal with homogeneous populations. Mixtures of distributions present an effective tool for modeling heterogeneity. In this chapter we consider nonasymptotic and asymptotic properties of mixture failure rates in different settings. After a short introduction, in the first section of this chapter we show (under rather general assumptions) that the mixture failure rate is ‘bent down’ compared with the corresponding unconditional expectation of the baseline failure rate, which has been proved in the literature for some specific cases. This property is due to the effect that ‘the weakest populations die out first’, which is proved mathematically in this section. This should be taken into account when analyzing failure data for heterogeneous populations in practice. We also consider the problem of mixture failure rate ordering for ordered mixing distributions. Two types of stochastic ordering are analyzed: ordering in the likelihood ratio sense and ordering of the variances when the means are equal. Mixing distributions with equal expectations and different variances can lead to the corresponding ordering of mixture failure rates on [0, ∞) in some specific cases. For a general mixing distribution, however, this ordering is only guaranteed for sufficiently small t. In the second section, the concept of proportional hazards (PH) in a homogeneous population is generalized to the heterogeneous case. For each subpopulation, the PH model is assumed to hold. It is shown that this proportionality is violated for the observed (mixture) failure rates. The corresponding bounds for a mixture failure rate are obtained in this case. A change point in the environment is discussed. Shocks – changing the mixing distribution – are also considered. It is shown that shocks with the stochastic properties described also bend down the initial mixture failure rate. Finally, the third section is devoted to new results on the asymptotic behavior of mixture failure rates. The suggested lifetime model generalizes all three conventional survival models (proportional hazards, additive hazards and accelerated life) and makes it possible to derive explicit asymptotic results. Some of the results obtained can be generalized to a wider class of lifetime distributions, but it appears that the class considered is ‘optimal’ in terms of the trade-off between the complexity of a model and the tractability (or applicability) of the results. It is shown that the asymptotic behavior of the mixture failure rate depends only on the behavior of the mixing distribution near zero, and not on the whole mixing distribution.
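
A standard closed-form example of the 'bent-down' effect: if each subpopulation has a constant failure rate z and z follows a gamma mixing (frailty) distribution, the observed mixture failure rate decreases in time even though every subpopulation's rate is constant. A tiny numeric sketch with invented parameter values:

    import numpy as np

    # exponential lifetimes with conditional rate z, z ~ gamma(shape=a, rate=b)
    a, b = 2.0, 1.0

    def mixture_failure_rate(t):
        # closed form for this example: lambda_m(t) = a / (b + t),
        # which equals the mean baseline rate a/b at t = 0 and then decreases
        return a / (b + t)

    for t in (0.0, 1.0, 5.0, 20.0):
        print("t = %5.1f  mixture failure rate = %.3f" % (t, mixture_failure_rate(t)))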

Maxim Finkelstein, Veronica Esaulova
21. Proportional Hazards Regression Models

The proportional hazards model plays an important role in analyzing data with survival outcomes. This chapter provides a summary of different aspects of this very popular model. The first part gives the definition of the model and shows how to estimate the regression parameters for survival data with or without ties. Hypothesis tests can be built based on these estimates. Formulas to estimate the cumulative hazard function and the survival function are also provided. Modified models for stratified data and data with time-dependent covariates are also discussed. The second part of the chapter discusses goodness-of-fit and model-checking techniques. These include testing the proportionality assumption, testing the functional form for a particular covariate and testing overall fit. The third part of the chapter extends the model to accommodate more complicated data structures. Several extended models, such as models with random effects, nonproportional models, and models for data with multivariate survival outcomes, are introduced. In the last part a real example is given. This serves as an illustration of the implementation of the methods and procedures discussed in this chapter.
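
For readers who want to see the estimation step in miniature, the sketch below maximizes the Cox partial likelihood for a single covariate with invented right-censored data and no tied event times; it is an illustration, not the chapterʼs procedure.

    import numpy as np
    from scipy.optimize import minimize

    # toy right-censored data: time, event indicator (1 = failure), one covariate
    time  = np.array([5., 8., 12., 16., 23., 27., 30., 34.])
    event = np.array([1,  1,  0,   1,   1,   0,   1,   1 ])
    x     = np.array([0., 1., 0.,  1.,  0.,  1.,  0.,  1.])

    def neg_log_partial_lik(beta):
        b = float(np.atleast_1d(beta)[0])
        ll = 0.0
        for i in np.where(event == 1)[0]:
            risk = time >= time[i]              # risk set at the i-th event time
            ll += x[i] * b - np.log(np.sum(np.exp(x[risk] * b)))
        return -ll

    res = minimize(neg_log_partial_lik, x0=0.0, method="BFGS")
    print("estimated log hazard ratio: %.3f" % res.x[0])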

Wei Wang, Chengcheng Hu
22. Accelerated Life Test Models and Data Analysis

Todayʼs consumers demand high quality and reliability in the products they buy. Accelerated life tests (ALT) are commonly used by manufacturers during product design to obtain reliability information on components and subsystems in a timely manner. The results obtained at high levels of the accelerating variables are then extrapolated to provide information about the product life under normal use conditions. The introduction and Section 22.1 describe the background and motivations for using accelerated testing. Sections 22.2 and 22.3 discuss statistical models for describing lifetime distributions in ALT. Commonly used ALT models have two parts: (a) a statistical distribution at fixed levels of the accelerating variable(s); and (b) a functional relationship between distribution parameters and the accelerating variable(s). We describe relationships for accelerating variables such as use rate, temperature, voltage, and voltage rate. We also discuss practical guidelines and potential problems in using ALT models. Section 22.4 describes and illustrates a strategy for analyzing ALT data. Both graphical and numerical methods are discussed for fitting an ALT model to data and for assessing its fit. These methods are thoroughly illustrated by fitting an ALT model with a single accelerating variable to data obtained from an actual ALT experiment. Extrapolation of the results at accelerated levels to normal use levels is also discussed. Section 22.5 presents statistical analysis of a wider variety of ALT data types that are encountered in practice. In particular, the examples involve ALTs with interval censoring and with two or more accelerating variables. Section 22.6 discusses practical considerations for interpreting statistical analysis of ALT data. This section emphasizes the important role of careful planning of an ALT to produce useful results. Section 22.7 discusses other kinds of accelerated tests often conducted in practice. Brief descriptions of each and specific applications in industry are also provided. Section 22.8 reviews some of the potential pitfalls the practitioner of accelerated testing may face. These are described within practical situations, and strategies for avoiding them are presented. Section 22.9 lists some computer software packages that are useful for analyzing ALT data.
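
A typical temperature-acceleration relationship of the kind described above is the Arrhenius model; the sketch below computes its acceleration factor for an invented activation energy and pair of temperatures.

    import numpy as np

    # Arrhenius acceleration factor for a temperature-accelerated test
    k_B = 8.617e-5                      # Boltzmann constant, eV/K
    Ea = 0.7                            # hypothetical activation energy, eV
    T_use, T_stress = 318.15, 358.15    # 45 C use vs. 85 C stress, in kelvin

    af = np.exp(Ea / k_B * (1.0 / T_use - 1.0 / T_stress))
    print("acceleration factor: %.1f" % af)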

Francis Pascual, William Meeker, Luis Escobar
23. Statistical Approaches to Planning of Accelerated Reliability Testing

This chapter presents a few statistical methods for designing test plans in which products are tested under harsher environments, with more severe stresses than usual operating conditions. Following a short introduction, three different types of testing conditions are dealt with in Sects. 23.2, 23.3, and 23.4; namely, life testing under constant stress, life testing in which stresses are increased in steps, and accelerated testing by monitoring degradation data. Brief literature surveys of the work done in these areas precede presentations of the methodologies in each of these sections. In Sect. 23.2, we present the conventional framework for designing accelerated test plans using the asymptotic variance of maximum likelihood estimators (MLE) derived from the Fisher information matrix. We then give two possible extensions of the framework for accelerated life testing under three different constant stress levels: one based on a nonlinear programming (NLP) formulation so that experimenters can specify the desired number of failures, and one based on an enlarged solution space so that the design of the test plan can be more flexible in view of the many possible limitations in practice. These ideas are illustrated using numerical examples and followed by a comparison across different test plans. We then present the planning of accelerated life testing (ALT) in which stresses are increased in steps and held constant for some time before the next increment. The design strategy is based on a target acceleration factor which specifies the desired time compression needed to complete the test compared to testing under use conditions. Using a scheme similar to backward induction in dynamic programming, an algorithm for planning multiple-step step-stress ALT is presented. In Sect. 23.4, we consider planning problems for accelerated degradation tests (ADT), in which degradation data, instead of lifetime data, are used to predict a productʼs reliability. We give a unifying framework for dealing with both constant-stress and step-stress ADT. An NLP model which minimizes cost subject to a precision constraint is formulated so that the tradeoff between collecting more data and the cost of conducting the test can be quantified.

Loon Tang
24. End-to-End (E2E) Testing and Evaluation of High-Assurance Systems

U.S. Department of Defense (DoD) end-to-end (E2E) testing and evaluation (T&E) technology for high-assurance systems has evolved from specification and analysis of thin threads, through system scenarios, to scenario-driven system engineering including reliability, security, and safety assurance, as well as dynamic verification and validation. Currently, E2E T&E technology is entering its fourth generation and being applied to the development and verification of systems in service-oriented architectures (SOA) and web services (WS). The technology includes a series of techniques, including automated generation of thin threads from system scenarios; automated dependency analysis; completeness and consistency analysis based on condition–event pairs in the system specification; automated test-case generation based on verification patterns; test-case generation based on the topological structure of Boolean expressions; automated code generation for system execution as well as for simulation; automated reliability assurance based on the system design structure; dynamic policy specification, analysis, enforcement and simulation; automated state-model generation; automated sequence-diagram generation; model checking on system specifications; and model checking based on test-case generation. E2E T&E technology has been successfully applied to several DoD command-and-control applications as well as civilian projects.

Raymond Paul, Wei-Tek Tsai, Yinong Chen, Chun Fan, Zhibin Cao, Hai Huang
25. Statistical Models in Software Reliability and Operations Research

Statistical models play an important role in the monitoring and control of the testing phase of the software development life cycle (SDLC). The first section of this chapter provides an introduction to software reliability growth modeling and to management problems for which optimal control is desired. It includes a brief literature survey and a description of optimization problems and solution methods. In the second section a framework is proposed for developing general software reliability models for both the testing and operational phases. Within the framework, pertinent factors such as testing effort, coverage, user growth etc. can be incorporated. A brief description of the usage models is provided in this section. It is shown how a new product sales growth model from marketing can be used for reliability growth modeling. The proposed models have been validated on software failure data sets. To produce reliable software, efficient management of the testing phase is essential. Three management problems, viz. release time, testing effort control and resource allocation, are discussed in Sects. 25.2 to 25.4. Using the operations research approach, i.e. with the help of the models, optimal management decisions can be made regarding the duration of the testing phase, the requirement and allocation of resources, the intensity of testing effort etc. These optimization problems can be of interest to both theoreticians and software test managers. This chapter discusses both of these aspects, viz. model development and optimization problems.
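
As a hedged illustration of the kind of software reliability growth model referred to above, the sketch below evaluates a Goel–Okumoto type NHPP mean value function with invented parameters and uses it for two quantities that feed release-time decisions.

    import numpy as np

    # Goel-Okumoto type NHPP mean value function m(t) = a * (1 - exp(-b t)),
    # with invented parameters (a = expected total faults, b = fault detection rate)
    a, b = 120.0, 0.1

    def m(t):
        return a * (1.0 - np.exp(-b * t))

    t_test = 40.0            # weeks of testing already performed
    x = 1.0                  # operational period after release

    remaining = a - m(t_test)                           # expected undetected faults
    reliability = np.exp(-(m(t_test + x) - m(t_test)))  # P(no failure in (t, t+x])
    print("expected remaining faults: %.1f" % remaining)
    print("conditional reliability over %.0f week(s): %.3f" % (x, reliability))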

P. Kapur, Amit Bardhan
26. An Experimental Study of Human Factors in Software Reliability Based on a Quality Engineering Approach

In this chapter, we focus on a software design-review process which is more effective than other processes for the elimination and prevention of software faults in software development. Then, we adopt a quality engineering approach to analyze the relationships among the quality of the design-review activities, i.e., software reliability, and human factors to clarify the fault-introduction process in the design-review process. We conduct a design-review experiment with graduate and undergraduate students as subjects. First, we discuss human factors categorized as predispositions and inducers in the design-review process, and set up controllable human factors in the design-review experiment. In particular, we lay out the human factors on an orthogonal array based on the method of design of experiments. Second, in order to select human factors that affect the quality of the design review, we perform a software design-review experiment reflecting an actual design process based on the method of design of experiments. To analyze the experimental results, we adopt a quality engineering approach, i.e., the Taguchi method. That is, applying the orthogonal array L18(2^1 × 3^7) to the human-factor experiment, we carry out an analysis of variance by using the signal-to-noise ratio (SNR), which can evaluate the stability of the quality characteristics, discuss effective human factors, and obtain the optimal levels for the selected predispositions and inducers. Further, classifying the faults detected by design-review work into descriptive-design and symbolic-design faults, we discuss the relationships among them in more detail.

Shigeru Yamada
27. Statistical Models for Predicting Reliability of Software Systems in Random Environments

After a brief overview of existing models in software reliability in Sect. 27.1, Sect. 27.2 discusses a generalized nonhomogeneous Poisson process (NHPP) model that can be used to derive most existing models in the software reliability literature. Section 27.3 describes a generalized random field environment (RFE) model incorporating both the testing phase and the operating phase in the software development cycle for estimating the reliability of software systems in the field. In contrast to some existing models that assume the same software failure rate for the software testing and field operation environments, this generalized model considers the random environmental effects on software reliability. Based on the generalized RFE model, Sect. 27.4 describes two specific RFE reliability models, the γ-RFE and β-RFE models, for predicting software reliability in field environments. Section 27.4 illustrates the models using telecommunication software failure data. Some further research considerations based on the generalized software reliability model are also discussed.

Hoang Pham, Xiaolin Teng

Regression Methods and Data Mining

Frontmatter
28. Measures of Influence and Sensitivity in Linear Regression

This chapter reviews diagnostic procedures for detecting outliers and influential observations in linear regression. First, the statistics for detecting single outliers and influential observations are presented, and their limitations for multiple outliers in high-leverage situations are discussed; second, diagnostic procedures designed to avoid masking are shown. We comment on the procedures by Hadi and Simonoff [28.1,2], Atkinson [28.3] and Swallow and Kianifard [28.4], which are based on finding a clean subset for estimating the parameters and then increasing its size by incorporating new homogeneous observations one by one, until a heterogeneous observation is found. We also discuss procedures for detecting high-leverage outliers in large data sets based on eigenvalue analysis of the influence and sensitivity matrices, as proposed by Peña and Yohai [28.5,6]. Finally we show that the joint use of simple univariate statistics, such as predictive residuals and Cookʼs distances, together with the sensitivity statistic proposed by Peña [28.7], can be a useful diagnostic tool for large high-dimensional data sets.
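
To make the single-observation diagnostics concrete, the sketch below computes leverages and Cookʼs distances directly from the hat matrix for an invented regression data set with one planted outlier.

    import numpy as np

    rng = np.random.default_rng(3)
    n = 30
    X = np.column_stack([np.ones(n), rng.normal(size=n)])   # intercept + one predictor
    y = X @ np.array([1.0, 2.0]) + rng.normal(scale=0.5, size=n)
    y[0] += 5.0                                              # plant one outlier

    H = X @ np.linalg.inv(X.T @ X) @ X.T                     # hat (leverage) matrix
    e = y - H @ y                                            # residuals
    p = X.shape[1]
    s2 = e @ e / (n - p)
    h = np.diag(H)

    cooks_d = e**2 * h / (p * s2 * (1 - h)**2)               # Cook's distance
    print("largest Cook's distance at observation", int(np.argmax(cooks_d)))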

Daniel Peña
29. Logistic Regression Tree Analysis

This chapter describes a tree-structured extension and generalization of the logistic regression method for fitting models to a binary-valued response variable. The technique overcomes a significant disadvantage of logistic regression, namely the difficulty of interpreting the model in the face of multi-collinearity and Simpsonʼs paradox. Section 29.1 summarizes the statistical theory underlying the logistic regression model and the estimation of its parameters. Section 29.2 reviews two standard approaches to model selection for logistic regression, namely model deviance relative to its degrees of freedom and the Akaike information criterion (AIC). A dataset on tree damage during a severe thunderstorm is used to compare the approaches and to highlight their weaknesses. A recently published partial one-dimensional model that addresses some of the weaknesses is also reviewed. Section 29.3 introduces the idea of a logistic regression tree model. The latter consists of a binary tree in which a simple linear logistic regression (i.e., a linear logistic regression using a single predictor variable) is fitted to each leaf node. A split at an intermediate node is characterized by a subset of values taken by a (possibly different) predictor variable. The objective is to partition the dataset into rectangular pieces according to the values of the predictor variables such that a simple linear logistic regression model adequately fits the data in each piece. Because the tree structure and the piecewise models can be presented graphically, the whole model can be easily understood. This is illustrated with the thunderstorm dataset using the LOTUS algorithm. Section 29.4 describes the basic elements of the LOTUS algorithm, which is based on recursive partitioning and cost-complexity pruning. A key feature of the algorithm is a correction for bias in variable selection at the splits of the tree. Without bias correction, the splits can yield incorrect inferences. Section 29.5 shows an application of LOTUS to a dataset on automobile crash tests involving dummies. This dataset is challenging because of its large size, its mix of ordered and unordered variables, and its large number of missing values. It also provides a demonstration of Simpsonʼs paradox. The chapter concludes with some remarks in Sect. 29.5.

Wei-Yin Loh
30. Tree-Based Methods and Their Applications

The first part of this chapter introduces the basic structure of tree-based methods using two examples. First, a classification tree is presented that uses e-mail text characteristics to identify spam. The second example uses a regression tree to estimate structural costs for seismic rehabilitation of various types of buildings. Our main focus in this section is the interpretive value of the resulting models. This brief introduction is followed by a more detailed look at how these tree models are constructed. In the second section, we describe the algorithm employed by classification and regression tree (CART), a popular commercial software program for constructing trees for both classification and regression problems. In each case, we outline the processes of growing and pruning trees and discuss available options. The section concludes with a discussion of practical issues, including estimating a treeʼs predictive ability, handling missing data, assessing variable importance, and considering the effects of changes to the learning sample. The third section presents several alternatives to the algorithms used by CART. We begin with a look at one class of algorithms – including QUEST, CRUISE, and GUIDE – which is designed to reduce potential bias toward variables with large numbers of available splitting values. Next, we explore C4.5, another program popular in the artificial-intelligence and machine-learning communities. C4.5 offers the added functionality of converting any tree to a series of decision rules, providing an alternative means of viewing and interpreting its results. Finally, we discuss chi-square automatic interaction detection (CHAID), an early classification-tree construction algorithm used with categorical predictors. The section concludes with a brief comparison of the characteristics of CART and each of these alternative algorithms. In the fourth section, we discuss the use of ensemble methods for improving predictive ability. Ensemble methods generate collections of trees using different subsets of the training data. Final predictions are obtained by aggregating over the predictions of individual members of these collections. The first ensemble method we consider is boosting, a recursive method of generating small trees that each specialize in predicting cases for which its predecessors perform poorly. Next, we explore the use of random forests, which generate collections of trees based on bootstrap sampling procedures. We also comment on the tradeoff between the predictive power of ensemble methods and the interpretive value of their single-tree counterparts. The chapter concludes with a discussion of tree-based methods in the broader context of supervised learning techniques. In particular, we compare classification and regression trees to multivariate adaptive regression splines, neural networks, and support vector machines.
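
CART itself is a commercial program, but the same ideas can be tried with open-source CART-style implementations; the sketch below (using scikit-learn and one of its bundled data sets) contrasts a single shallow tree with a random forest ensemble.

    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # a single shallow (CART-style) tree versus an ensemble of trees
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
    forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

    print("single tree accuracy:   %.3f" % tree.score(X_te, y_te))
    print("random forest accuracy: %.3f" % forest.score(X_te, y_te))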

Nan Lin, Douglas Noe, Xuming He
31. Image Registration and Unknown Coordinate Systems

This chapter deals with statistical problems involving unknown coordinate systems, either in Euclidean 3-space or on the unit sphere Ω3. We also consider the simpler cases of Euclidean 2-space and the unit circle Ω2. The chapter has five major sections. Although other problems of unknown coordinate systems have arisen, a very important problem of this class is the problem of image registration from landmark data. In this problem we have two images of the same object (such as satellite images taken at different times) or an image of a prototypical object and an actual object. It is desired to find the rotation, translation, and possibly scale change, which will best align the two images. Whereas many problems of this type are two-dimensional, it should be noted that medical imaging is often three-dimensional. After introducing some mathematical preliminaries we introduce the concept of M-estimators, a generalization of least squares estimation. In least squares estimation, the registration that minimizes the sum of squares of the lengths of the deviations is chosen; in M-estimation, the sum of squares of the lengths of the deviations is replaced by some other objective function. An important case is L1 estimation, which minimizes the sum of the lengths of the deviations; L1 estimation is often used when the possibility of outliers in the data is suspected. The second section of this chapter deals with the calculation of least squares estimates. Then, in the third section, we introduce an iterative modification of the least squares algorithm to calculate other M-estimates. Note that minimization usually involves some form of differentiation and hence this section starts with a short introduction to the geometry of the group of rotations and differentiation in the rotation group. Many statistical techniques are based upon approximation by derivatives and hence a little understanding of geometry is necessary to understand the later statistical sections. The fourth section discusses the statistical properties of M-estimates. A great deal of emphasis is placed upon the relationship between the geometric configuration of the landmarks and the statistical errors in the image registration. It is shown that these statistical errors are determined, up to a constant, by the geometry of the landmarks. The constant of proportionality depends upon the objective function and the distribution of the errors in the data. General statistical theory indicates that, if the data error distribution is (anisotropic) multivariate normal, least squares estimation is optimal. An important result of this section is that, even in this case when least squares estimation is theoretically the most efficient, the use of L1 estimation can guard against outliers with a very modest cost in efficiency. Here optimality and efficiency refer to the expected size of the statistical errors. In practice, data is often long-tailed and L1 estimation yields smaller statistical errors than least squares estimation. This will be the case with the three-dimensional image registration example given here. Finally, in the fifth section, we discuss diagnostics that can be used to determine which data points are most influential upon the registration. Thus, if the registration is unsatisfactory, these diagnostics can be used to determine which data points are most responsible and should be reexamined.
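
A common least squares solution to the landmark registration problem is the SVD-based (Kabsch/Procrustes-style) rotation estimate; the sketch below recovers an invented 3-D rotation and translation from noisy matched landmarks. It illustrates only the least squares case, not the M-estimates or diagnostics discussed in the chapter.

    import numpy as np

    def ls_rotation(A, B):
        # Least squares rotation R and translation t taking 3-D landmarks A onto B.
        # Rows of A and B are matched landmark coordinates.
        ca, cb = A.mean(axis=0), B.mean(axis=0)
        U, _, Vt = np.linalg.svd((A - ca).T @ (B - cb))
        d = np.sign(np.linalg.det(Vt.T @ U.T))       # guard against reflections
        R = Vt.T @ np.diag([1, 1, d]) @ U.T
        t = cb - R @ ca
        return R, t

    rng = np.random.default_rng(4)
    A = rng.normal(size=(6, 3))                      # hypothetical landmark set
    angle = np.deg2rad(30)
    R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                       [np.sin(angle),  np.cos(angle), 0],
                       [0, 0, 1]])
    B = A @ R_true.T + np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.01, size=A.shape)

    R_hat, t_hat = ls_rotation(A, B)
    print(np.round(R_hat, 3))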

Ted Chang
32. Statistical Genetics for Genomic Data Analysis

In this chapter, we briefly summarize the emerging statistical concepts and approaches that have recently been developed and applied to the analysis of genomic data such as microarray gene expression data. In the first section we introduce the general background and critical issues in statistical sciences for genomic data analysis. The second section describes a novel concept of statistical significance, the so-called false discovery rate, the rate of false positives among all positive findings, which has been suggested to control the error rate of numerous false positives in large screening biological data analysis. In the next section we introduce two recent statistical testing methods: the significance analysis of microarray (SAM) and local pooled error (LPE) tests. The latter in particular, which is significantly strengthened by pooling error information from adjacent genes at local intensity ranges, is useful for analyzing microarray data with limited replication. The fourth section introduces analysis of variance (ANOVA) and heterogeneous error modeling (HEM) approaches that have been suggested for analyzing microarray data obtained from multiple experimental and/or biological conditions. The last two sections describe data exploration and discovery tools largely termed unsupervised learning and supervised learning. The former approaches include several multivariate statistical methods for the investigation of coexpression patterns of multiple genes, and the latter approaches are used as classification methods to discover genetic markers for predicting important subclasses of human diseases. Most of the statistical software packages for the approaches introduced in this chapter are freely available at the open-source bioinformatics software web site (Bioconductor; http://www.bioconductor.org/).
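
The false discovery rate idea can be made concrete with the standard Benjamini–Hochberg step-up procedure; the p-values below are invented, and in practice the chapterʼs SAM and LPE tests would supply the gene-wise statistics.

    import numpy as np

    def benjamini_hochberg(pvals, q=0.05):
        # Return a boolean mask of rejected hypotheses at FDR level q
        # (standard Benjamini-Hochberg step-up procedure).
        p = np.asarray(pvals, dtype=float)
        m = p.size
        order = np.argsort(p)
        thresholds = q * np.arange(1, m + 1) / m
        below = p[order] <= thresholds
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.nonzero(below)[0].max()           # largest i with p_(i) <= i*q/m
            reject[order[:k + 1]] = True
        return reject

    # hypothetical p-values from gene-wise tests
    pvals = [0.0001, 0.004, 0.019, 0.03, 0.20, 0.45, 0.60, 0.74]
    print(benjamini_hochberg(pvals, q=0.05))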

Jae Lee
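
To make the false-discovery-rate idea above concrete, here is a minimal Python sketch of the Benjamini–Hochberg step-up procedure, a standard way of controlling the FDR; the p-values are hypothetical, and the chapter's SAM and LPE machinery is not reproduced.

```python
import numpy as np

def benjamini_hochberg(pvalues, q=0.05):
    """Benjamini-Hochberg step-up procedure: boolean mask of the hypotheses
    rejected while controlling the false discovery rate at level q."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    thresholds = q * np.arange(1, m + 1) / m          # BH critical values i*q/m
    passed = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()               # largest i with p_(i) <= i*q/m
        reject[order[: k + 1]] = True
    return reject

# hypothetical p-values from gene-wise tests
pvals = [0.0001, 0.0004, 0.019, 0.03, 0.2, 0.4, 0.7, 0.9]
print(benjamini_hochberg(pvals, q=0.05))
```
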
33. Statistical Methodologies for Analyzing Genomic Data

The purpose of this chapter is to describe and review a variety of statistical issues and methods related to the analysis of microarray data. In the first section, after a brief introduction of the DNA microarray technology in biochemical and genetic research, we provide an overview of four levels of statistical analyses. The subsequent sections present the methods and algorithms in detail. In the second section, we describe the methods for identifying significantly differentially expressed genes in different groups. The methods include fold change, different t-statistics, empirical Bayesian approach and significance analysis of microarrays (SAM). We further illustrate SAM using a publicly available colon-cancer dataset as an example. We also discuss multiple comparison issues and the use of false discovery rate. In the third section, we present various algorithms and approaches for studying the relationship among genes, particularly clustering and classification. In clustering analysis, we discuss hierarchical clustering, k-means and probabilistic model-based clustering in detail with examples. We also describe the adjusted Rand index as a measure of agreement between different clustering methods. In classification analysis, we first define some basic concepts related to classification. Then we describe four commonly used classification methods including linear discriminant analysis (LDA), support vector machines (SVM), neural network and tree-and-forest-based classification. Examples are included to illustrate SVM and tree-and-forest-based classification. The fourth section is a brief description of the meta-analysis of microarray data in three different settings: meta-analysis of the same biomolecule and same platform microarray data, meta-analysis of the same biomolecule but different platform microarray data, and meta-analysis of different biomolecule microarray data. We end this chapter with final remarks on future prospects of microarray data analysis.

Fenghai Duan, Heping Zhang
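
As a small illustration of the adjusted Rand index mentioned above as a measure of agreement between clustering methods, the following Python sketch computes it directly from the contingency table of two partitions; the cluster labels are hypothetical.

```python
import numpy as np
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same set of items."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    classes_a, classes_b = np.unique(a), np.unique(b)
    # contingency table: n_ij = items in cluster i of A and cluster j of B
    n = np.array([[np.sum((a == i) & (b == j)) for j in classes_b] for i in classes_a])
    sum_ij = sum(comb(int(x), 2) for x in n.ravel())
    sum_i = sum(comb(int(x), 2) for x in n.sum(axis=1))
    sum_j = sum(comb(int(x), 2) for x in n.sum(axis=0))
    total = comb(int(a.size), 2)
    expected = sum_i * sum_j / total
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)

# hypothetical cluster labels produced by two different clustering methods
print(adjusted_rand_index([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2]))
```
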
34. Statistical Methods in Proteomics

Proteomics technologies are rapidly evolving and attracting great attention in the post-genome era. In this chapter, we review two key applications of proteomics techniques: disease biomarker discovery and protein/peptide identification. For each of the applications, we state the major issues related to statistical modeling and analysis, review related work, discuss their strengths and weaknesses, and point out unsolved problems for future research. We organize this chapter as follows. Section 34.1 briefly introduces mass spectrometry (MS) and tandem mass spectrometry (MS/MS) with a few sample plots showing the data format. Section 34.2 focuses on MS data preprocessing. We first review approaches in peak identification and then address the problem of peak alignment. After that, we point out unsolved problems and propose a few possible solutions. Section 34.3 addresses the issue of feature selection. We start with a simple example showing the effect of a large number of features. Then we address the interaction of different features and discuss methods of reducing the influence of noise. We finish this section with some discussion on the application of machine learning methods in feature selection. Section 34.4 addresses the problem of sample classification. We describe the random forest method in detail in Sect. 34.5. In Sect. 34.6 we address protein/peptide identification. We first review database searching methods in Sect. 34.6.1 and then focus on de novo MS/MS sequencing in Sect. 34.6.2. After reviewing major protein/peptide identification programs like SEQUEST and MASCOT in Sect. 34.6.3, we conclude the section by pointing out some major issues that need to be addressed in protein/peptide identification. Proteomics technologies are considered major players in the analysis and understanding of protein function and biological pathways. The development of statistical methods and software for proteomics data analysis will continue to be the focus of proteomics for years to come.

Weichuan Yu, Baolin Wu, Tao Huang, Xiaoye Li, Kenneth Williams, Hongyu Zhao
35. Radial Basis Functions for Data Mining

This chapter deals with the design and applications of the radial basis function (RBF) model. It is organized into three parts. The first part, consisting of Sect. 35.1, describes the two data mining activities addressed here: classification and regression. Next, we discuss the important issue of the bias–variance tradeoff and its relationship to model complexity. The second part consists of Sects. 35.2 to 35.4. Section 35.2 describes the RBF model architecture and its parameters. In Sect. 35.3.1 we briefly describe the four common algorithms used for its design: clustering, orthogonal least squares, regularization, and gradient descent. In Sect. 35.3.2 we discuss an algebraic algorithm, the SG algorithm, which provides a step-by-step approach to RBF design. Section 35.4 presents a detailed example to illustrate the use of the SG algorithm on a small data set. The third part consists of Sects. 35.5 and 35.6. In Sect. 35.5 we describe the development of RBF classifiers for a well-known benchmark problem to determine whether Pima Indians have diabetes. We describe the need for and importance of partitioning the data into training, validation, and test sets. The training set is employed to develop candidate models, the validation set is used to select a model, and the generalization performance of the selected model is assessed using the test set. Section 35.6 describes a recent data mining application in bioinformatics, where the objective is to analyze the gene expression profiles of leukemia patients, whose classes are known, in order to predict the target cancer class. Finally, Sect. 35.7 provides concluding remarks and directs the reader to related literature. Although the material in this chapter is applicable to other types of basis functions, we have used only the Gaussian function for illustrations and case studies because of its popularity and good mathematical properties.

Miyoung Shin, Amrit Goel
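
The RBF model described above can be sketched very simply: choose centers, build a Gaussian basis (design) matrix, and fit the output-layer weights by linear least squares. The Python example below does exactly that on hypothetical one-dimensional data; it is not the chapter's SG algorithm, and the center-selection step here is a plain random subset rather than clustering or orthogonal least squares.

```python
import numpy as np

def rbf_design(X, centers, width):
    """Gaussian design matrix: phi_ij = exp(-||x_i - c_j||^2 / (2*width^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))                      # hypothetical inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)           # noisy target

centers = X[rng.choice(len(X), size=15, replace=False)]    # simple center selection
Phi = rbf_design(X, centers, width=0.8)
Phi = np.hstack([Phi, np.ones((len(X), 1))])               # bias term
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)                # output weights by least squares

y_hat = Phi @ w
print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))
```
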
36. Data Mining Methods and Applications

In this chapter, we provide a review of the knowledge discovery process, including data handling, data mining methods and software, and current research activities. The introduction defines and provides a general background to data mining and knowledge discovery in databases. In particular, the potential for data mining to improve manufacturing processes in industry is discussed. This is followed by an outline of the entire process of knowledge discovery in databases in the second part of the chapter. The third part presents data handling issues, including databases and preparation of the data for analysis. Although these issues are generally considered uninteresting to modelers, the largest portion of the knowledge discovery process is spent handling data. It is also of great importance since the resulting models can only be as good as the data on which they are based. The fourth part is the core of the chapter and describes popular data mining methods, separated as supervised versus unsupervised learning. In supervised learning, the training data set includes observed output values (“correct answers”) for the given set of inputs. If the outputs are continuous/quantitative, then we have a regression problem. If the outputs are categorical/qualitative, then we have a classification problem. Supervised learning methods are described in the context of both regression and classification (as appropriate), beginning with the simplest case of linear models, then presenting more complex modeling with trees, neural networks, and support vector machines, and concluding with some methods, such as nearest neighbor, that are only for classification. In unsupervised learning, the training data set does not contain output values. Unsupervised learning methods are described under two categories: association rules and clustering. Association rules are appropriate for business applications where precise numerical data may not be available, while clustering methods are more technically similar to the supervised learning methods presented in this chapter. Finally, this section closes with a review of various software options. The fifth part presents current research projects, involving both industrial and business applications. In the first project, data is collected from monitoring systems, and the objective is to detect unusual activity that may require action. For example, credit card companies monitor customersʼ credit card usage to detect possible fraud. While methods from statistical process control were developed for similar purposes, the difference lies in the quantity of data. The second project describes data mining tools developed by Genichi Taguchi, who is well known for his industrial work on robust design. The third project tackles quality and productivity improvement in manufacturing industries. Although some detail is given, considerable research is still needed to develop a practical tool for todayʼs complex manufacturing processes. Finally, the last part provides a brief discussion on remaining problems and future trends.

Kwok-Leung Tsui, Victoria Chen, Wei Jiang, Y. Aslandogan

Modeling and Simulation Methods

Frontmatter
37. Bootstrap, Markov Chain and Estimating Function

In this chapter, we first review bootstrap methods for constructing confidence intervals (regions). We then discuss the following three important properties of these methods: (i) invariance under reparameterization; (ii) automatic computation; and (iii) higher order accuracy. The greatest potential value of the bootstrap lies in complex situations, such as nonlinear regression or high-dimensional parameters. It is important to have bootstrap procedures that can be applied to these complex situations but still have the three desired properties. The main purpose of this chapter is to introduce two recently developed bootstrap methods: the estimating function bootstrap and the Markov chain marginal bootstrap. The estimating function bootstrap has all three desired properties and can also be applied to complex situations. The Markov chain marginal bootstrap is designed for high-dimensional parameters.

Feifang Hu
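
For readers new to the bootstrap, the following minimal Python sketch shows the basic nonparametric percentile interval for a sample mean; the data are simulated, and the chapter's estimating function and Markov chain marginal bootstrap methods are considerably more elaborate than this baseline.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(scale=2.0, size=50)          # hypothetical sample

B = 2000                                         # number of bootstrap resamples
boot_means = np.array([
    rng.choice(x, size=x.size, replace=True).mean() for _ in range(B)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])  # 95% percentile interval
print(f"sample mean = {x.mean():.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```
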
38. Random Effects

This chapter includes well-known as well as state-of-the-art statistical modeling techniques for drawing inference on correlated data, which occur in a wide variety of settings (quality-control studies of similar products made on different assembly lines, community-based studies of cancer prevention, and familial studies in linkage analysis, to name a few). The first section briefly introduces statistical models that incorporate random effect terms, which are increasingly being applied to the analysis of correlated data. An effect is classified as a random effect when inferences are to be made on an entire population, and the levels of that effect represent only a sample from that population. The second section introduces the linear mixed model for clustered data, which explicitly models complex covariance structure among observations by adding random terms into the linear predictor part of the linear regression model. The third section discusses its extension – generalized linear mixed models (GLMMs) – for correlated nonnormal data. The fourth section reviews several common estimating techniques for GLMMs, including the EM and penalized quasi-likelihood approaches, Markov chain Newton-Raphson, the stochastic approximation, and the S-U algorithm. The fifth section focuses on some special topics related to hypothesis tests of random effects, including score tests for various models. The last section is a general discussion of the content of the chapter and some other topics relevant to random effects models.

Yi Li
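
A minimal numerical illustration of a random effect: the Python sketch below simulates clustered data from a one-way random-intercept model and estimates the between- and within-cluster variance components (and the intra-class correlation) by the balanced ANOVA method of moments; the parameter values are hypothetical, and this is not one of the GLMM algorithms reviewed in the chapter.

```python
import numpy as np

rng = np.random.default_rng(7)
k, n = 30, 10                        # clusters and observations per cluster
sigma_b, sigma_e = 1.0, 2.0          # true between- and within-cluster SDs
b = rng.normal(0.0, sigma_b, size=k)                # random cluster effects
y = 5.0 + b[:, None] + rng.normal(0.0, sigma_e, size=(k, n))

cluster_means = y.mean(axis=1)
grand_mean = y.mean()
msb = n * np.sum((cluster_means - grand_mean) ** 2) / (k - 1)    # between-cluster mean square
msw = np.sum((y - cluster_means[:, None]) ** 2) / (k * (n - 1))  # within-cluster mean square

sigma_e2_hat = msw
sigma_b2_hat = max((msb - msw) / n, 0.0)
icc = sigma_b2_hat / (sigma_b2_hat + sigma_e2_hat)
print(f"between = {sigma_b2_hat:.2f}, within = {sigma_e2_hat:.2f}, ICC = {icc:.2f}")
```
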
39. Cluster Randomized Trials: Design and Analysis

The first section of this chapter gives an introduction to cluster randomized trials and the reasons why such trials are often chosen over simple randomized trials. It also argues that more advanced statistical methods are required for data obtained from such trials, since these data are correlated due to the nesting of persons within clusters. Traditional statistical techniques, such as the ordinary regression model, ignore this dependency and thereby result in incorrect conclusions with respect to the effect of treatment. The first section also argues that the design of cluster randomized trials is more complicated than that of simple randomized trials; not only does the total sample size need to be determined, but also the number of clusters and the number of persons per cluster. The second section describes and compares the multilevel regression model and the mixed effects analysis of variance (ANOVA) model. These models explicitly take into account the nesting of persons within clusters, and thereby the dependency of outcomes of persons within the same cluster. It is shown that the traditional regression model leads to an inflated type I error rate for treatment testing. Optimal sample sizes for cluster randomized trials are given in Sects. 39.3 and 39.4. These sample sizes can be shown to depend on the intra-class correlation coefficient, which measures the amount of variance in the outcome variable at the cluster level. A guess of the true value of this parameter must be available at the design stage in order to calculate the optimal sample sizes. Section 39.5 focuses on the robustness of the optimal sample size against incorrect guesses of this parameter. Section 39.6 focuses on optimal designs when the aim is to estimate the intra-class correlation with the greatest precision.

Mirjam Moerbeek
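
A common back-of-the-envelope version of the sample-size issue discussed above inflates the simple-randomized-trial sample size by the design effect 1 + (m - 1)ρ, where m is the cluster size and ρ the intra-class correlation. The Python sketch below implements this calculation for comparing two means; the planning values are hypothetical, and the chapter's optimal-design results are more refined.

```python
import math
from statistics import NormalDist

def cluster_trial_sample_size(delta, sigma, m, icc, alpha=0.05, power=0.80):
    """Persons per arm for a cluster randomized comparison of two means,
    obtained by inflating the individually randomized sample size by the
    design effect 1 + (m - 1)*icc, where m is the cluster size."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    n_individual = 2 * (z_a + z_b) ** 2 * sigma ** 2 / delta ** 2   # per arm, simple trial
    n_cluster = n_individual * (1 + (m - 1) * icc)                  # apply design effect
    return math.ceil(n_cluster), math.ceil(n_cluster / m)           # persons, clusters per arm

# hypothetical planning values: effect size 0.3 SD, 20 persons per cluster, ICC = 0.05
print(cluster_trial_sample_size(delta=0.3, sigma=1.0, m=20, icc=0.05))
```
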
40. A Two-Way Semilinear Model for Normalization and Analysis of Microarray Data

A proper normalization procedure ensures that the normalized intensity ratios provide meaningful measures of relative expression levels. We describe a two-way semilinear model (TW-SLM) two-way semilinear model (TW-SLM)for normalization and analysis of microarray data. This method does not make the usual assumptions underlying some of the existing methods. The TW-SLM also naturally incorporates uncertainty due to normalization into significance analysis of microarrays. We propose a semiparametric M-estimation method in the TW-SLM to estimate the normalization curves and the normalized expression values, and discuss several useful extensions of the TW-SLM. We describe a back-fitting algorithm for computation in the model. We illustrate the application of the TW-SLM by applying it to a microarray data set. We evaluate the performance of TW-SLM using simulation studies and consider theoretical results concerning the asymptotic distribution and rate of convergence of the least-squares estimators in the TW-SLM.

Jian Huang, Cun-Hui Zhang
41. Latent Variable Models for Longitudinal Data with Flexible Measurement Schedule

This chapter provides a survey of the development of latent variable models that are suitable for analyzing unbalanced longitudinal data. The chapter begins with an introduction, in which the marginal modeling approach (without the use of latent variables) for correlated responses such as repeatedly measured longitudinal data is described. The concepts of random effects and latent variables are introduced at the beginning of Sect. 41.1. Section 41.1.1 describes the linear mixed models of Laird and Ware for continuous longitudinal responses; Sect. 41.1.2 discusses generalized linear mixed models (with latent variables) for categorical responses; and Sect. 41.1.3 covers models with multilevel latent variables. Section 41.2.1 presents an extended linear mixed model of Laird and Ware for multidimensional longitudinal responses of different types. Section 41.2.2 covers measurement error models for multiple longitudinal responses. Section 41.3 describes linear mixed models with latent class variables, namely the latent class mixed model, which can be useful for either a single longitudinal response or multiple longitudinal responses. Section 41.4 studies the relationships between multiple longitudinal responses through structural equation models. Section 41.5 unifies all the above varieties of latent variable models under a single multilevel latent variable model formulation.

Haiqun Lin
42. Genetic Algorithms and Their Applications

The first part of this chapter describes the foundations of genetic algorithms. It includes hybrid genetic algorithms, adaptive genetic algorithms and fuzzy logic controllers. After a short introduction to genetic algorithms, the second part describes combinatorial optimization problems, including the knapsack problem, the minimum spanning tree problem, the set-covering problem, the bin-packing problem and the traveling-salesman problem; these are combinatorial optimization problems characterized by a finite number of feasible solutions. The third part describes network design problems. Network design and routing are important issues in the building and expansion of computer networks. In this part, the shortest-path problem, maximum-flow problem, minimum-cost-flow problem, centralized network design and multistage process-planning problem are introduced. These problems are typical network problems and have been studied for a long time. The fourth part describes scheduling problems. Many scheduling problems from manufacturing industries are quite complex in nature and very difficult to solve by conventional optimization techniques. In this part the flow-shop sequencing problem, job-shop scheduling, the resource-constrained project scheduling problem and multiprocessor scheduling are introduced. The fifth part introduces the reliability design problem, including simple genetic algorithms for reliability optimization, reliability design with redundant units and alternatives, network reliability design and tree-based network topology design. The sixth part describes logistic problems including the linear transportation problem, the multiobjective transportation problem, the bicriteria transportation problem with fuzzy coefficients and supply-chain management network design. Finally, the last part describes location and allocation problems including the location–allocation problem, the capacitated plant-location problem and the obstacle location–allocation problem.

Mitsuo Gen
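
As a toy illustration of a genetic algorithm applied to one of the combinatorial problems listed above, the Python sketch below solves a small 0/1 knapsack instance with tournament selection, one-point crossover and bit-flip mutation; the item values, weights and GA parameters are hypothetical, and none of the hybrid or adaptive variants discussed in the chapter are included.

```python
import numpy as np

rng = np.random.default_rng(3)
values  = np.array([10, 13, 7, 8, 12, 9, 5, 11])   # hypothetical item values
weights = np.array([ 5,  7, 3, 4,  6, 5, 2,  6])   # hypothetical item weights
capacity = 18

def fitness(pop):
    """Total value of each selection; over-capacity selections score zero."""
    return np.where(pop @ weights <= capacity, pop @ values, 0)

pop = rng.integers(0, 2, size=(30, len(values)))   # random initial population of bit strings
best, best_fit = None, -1

for _ in range(100):
    fit = fitness(pop)
    g = int(np.argmax(fit))
    if fit[g] > best_fit:                          # keep track of the best solution found
        best, best_fit = pop[g].copy(), int(fit[g])
    # tournament selection: each slot gets the better of two random individuals
    i, j = rng.integers(0, len(pop), size=(2, len(pop)))
    parents = np.where((fit[i] >= fit[j])[:, None], pop[i], pop[j])
    # one-point crossover between consecutive parents
    children = parents.copy()
    for k in range(0, len(pop) - 1, 2):
        c = rng.integers(1, len(values))
        children[k, c:], children[k + 1, c:] = parents[k + 1, c:], parents[k, c:]
    # bit-flip mutation
    flip = rng.random(children.shape) < 0.02
    pop = np.where(flip, 1 - children, children)

print("best selection:", best, "value:", best_fit, "weight:", int(best @ weights))
```
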
43. Scan Statistics

Section 43.1 introduces the concept of scan statistics and gives an overview of the types used to localize unusual clusters in continuous time or space, in sequences of trials or on a lattice. Section 43.2 focuses on scan statistics in one dimension. Sections 43.2.2 and 43.2.3 deal with clusters of events in continuous time. Sections 43.2.4 and 43.2.5 deal with success clusters in a sequence of discrete binary (s-f) trials. Sections 43.2.6 and 43.2.7 deal with the case where events occur in continuous time, but where we can only scan a discrete set of positions. Different approaches are used when reviewing data for clusters (the retrospective case in Sects. 43.2.2, 43.2.5, 43.2.6) and when carrying out ongoing surveillance for unusual clusters (the prospective case in Sects. 43.2.2, 43.2.3, 43.2.7). Section 43.2.7 describes statistics used to scan for clustering on a circle (are certain times of the day or year more likely to have accidents?). Section 43.3 describes statistics used to scan continuous space or a two-dimensional lattice for unusual clusters. Sections 43.2 and 43.3 focus on how unusual the largest number of events within a scanning window is. Section 43.4.1 deals with scanning for unusually sparse regions. In some cases the researcher is more interested in the number of clusters, rather than the size of the largest or smallest, and Sect. 43.4.2 describes results useful for this case. The double-scan statistic of Sect. 43.4.3 allows the researcher to test for unusual simultaneous or lagged clustering of two different types of events. Section 43.4.4 describes scan statistics that can be used on data with a complex structure.

Joseph Naus
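
A minimal illustration of a one-dimensional scan statistic: the Python sketch below computes the largest number of events falling in any window of fixed width and approximates its p-value by Monte Carlo under a null of uniformly scattered events; the event times, window width and simulation size are hypothetical.

```python
import numpy as np

def scan_statistic(event_times, window):
    """Largest number of events falling in any interval of length `window`."""
    t = np.sort(np.asarray(event_times, dtype=float))
    # slide a window starting at each event and count events in [t_i, t_i + window)
    return int(max(np.searchsorted(t, s + window, side="left") - i
                   for i, s in enumerate(t)))

rng = np.random.default_rng(11)
times = rng.uniform(0, 365, size=60)               # hypothetical event days over one year
obs = scan_statistic(times, window=30)

# Monte Carlo p-value under the null of events scattered uniformly over the year
sims = np.array([scan_statistic(rng.uniform(0, 365, size=60), window=30)
                 for _ in range(2000)])
p_value = float(np.mean(sims >= obs))
print("observed 30-day scan statistic:", obs, " approximate p-value:", p_value)
```
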
44. Condition-Based Failure Prediction

Machine reliability is improved if failures are prevented. Preventive maintenance (PM) can be performed in order to promote reliability, but only if failures can be predicted early enough. PM can be scheduled according to time or to condition but, whichever of these approaches is adopted, the key issue is whether a failure can be detected, or even predicted, early enough. This chapter discusses a way to predict failure, for use with PM, in which the state of a DC motor is estimated using the Kalman filter. The prediction consists of a simulation on a computer and an experiment performed on the DC motor. In the simulation, an exponential attenuator is placed at the output end of the motor model in order to simulate aging failure. Failure is ascertained by monitoring a state variable, the rotational speed of the motor. Failure times were generated by Monte Carlo simulation and predicted by the Kalman filter. One-step-ahead and two-step-ahead predictions are performed, and the resulting prediction errors are sufficiently small in both cases. In the experiment, the rotational speed of the motor was measured every 5 min for 80 days. The measurements were used to perform Kalman prediction and to verify the prediction accuracy. The resulting prediction errors were acceptable. Decreasing the time increment between measurements was found to increase the accuracy of the Kalman prediction. Consequently, it is shown that failure can be prevented (promoting reliability) by performing predictive maintenance based on the results of state estimation using the Kalman filter.

Shang-Kuo Yang
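
The following Python sketch illustrates one-step-ahead Kalman prediction for a scalar state (a slowly decaying rotational speed); the decay factor and noise variances are hypothetical and are not the motor model used in the chapter.

```python
import numpy as np

rng = np.random.default_rng(5)

# hypothetical degradation model: speed decays by factor a each step plus process noise
a, q, r = 0.999, 0.05, 2.0             # state transition, process and measurement noise variances
n = 200
x = np.empty(n); x[0] = 1500.0         # true rotational speed (rpm)
for k in range(1, n):
    x[k] = a * x[k - 1] + rng.normal(0, np.sqrt(q))
z = x + rng.normal(0, np.sqrt(r), size=n)   # noisy speed measurements

# scalar Kalman filter with one-step-ahead prediction
x_hat, P = z[0], 1.0
pred = np.empty(n); pred[0] = x_hat
for k in range(1, n):
    x_pred, P_pred = a * x_hat, a * a * P + q      # predict
    pred[k] = x_pred                               # one-step-ahead prediction of the speed
    K = P_pred / (P_pred + r)                      # Kalman gain
    x_hat = x_pred + K * (z[k] - x_pred)           # update with the new measurement
    P = (1 - K) * P_pred

print("RMS one-step prediction error:", np.sqrt(np.mean((pred[1:] - x[1:]) ** 2)))
```
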
45. Statistical Maintenance Modeling for Complex Systems

The first part of this chapter provides a brief introduction to statistical maintenance modeling subject to multiple failure processes. It includes a description of general probabilistic degradation processes. The second part discusses detailed reliability modeling for degraded systems subject to competing failure processes without maintenance actions. A generalized multi-state degraded-system reliability model with multiple competing failure processes, including degradation processes and random shocks, is presented. The operating condition of the multi-state system is characterized by a finite number of states. A methodology to generate the system states when multiple failure processes exist is also discussed. The model can be used not only to determine the reliability of the degraded systems in the context of multi-state functions but also to obtain the probabilities of being in a given state of the system. The third part describes the inspection–maintenance issues and reliability modeling for degraded repairable systems with competing failure processes. A generalized condition-based maintenance model for inspected degraded systems is discussed. An average long-run maintenance cost rate function is derived based on an expression for degradation paths and cumulative shock damage, which are measurable. An inspection sequence is determined based on the minimal maintenance cost rate. Upon inspection, a decision will be made on whether to perform preventive maintenance or not. The optimum preventive maintenance thresholds for degradation processes and inspection sequences are also determined based on a modified Nelder–Mead downhill simplex method. Finally, the last part is given over to the conclusions and a discussion of future perspectives for degraded-system maintenance modeling.

Wenjian Li, Hoang Pham
46. Statistical Models on Maintenance

This chapter discusses a variety of approaches to performing maintenance. The first section describes the importance of preparing for maintenance correctly, by collecting data on unit lifetimes and estimating the reliability of the units statistically using quantities such as their mean lifetimes, failure rates and failure distributions. Suppose that the time that the unit has been operational is known (or even just the calendar time since it was first used), and that its failure distribution has been estimated statistically. The second section of the chapter shows that the time to failure is approximately given by the reciprocal of the failure rate, and that the time before preventive maintenance is required is simply given by the pth percentile point of the failure distribution. Standard replacement policies, such as age replacement, in which a unit undergoes maintenance before it reaches a certain age, and periodic replacement, where the unit undergoes maintenance periodically, are also presented. Suppose that the failure of a unit can only be recorded at discrete times (so the unit completes a specific number of cycles before failure). In the third section, the age replacement and periodic replacement models from the previous section are converted into discrete models. Three replacement policies, in which the unit undergoes maintenance after a specific number of failures, episodes of preventive maintenance, or repairs, are also presented. The optimum number of units for a parallel redundant system is derived for the case where each unit fails according to a failure distribution and fails upon some shock with a certain probability. Suppose that the unit fails when the total amount of damage caused by shocks exceeds a certain failure level. The fourth section describes the replacement policy in which the unit undergoes maintenance before failure under such a cumulative damage model. The optimum damage level at which the unit should be replaced when it undergoes minimal repair upon failure is also derived analytically. The last part introduces the repair limit policy, where the unit is replaced instead of being repaired if the repair time is estimated to exceed a certain time limit, as well as the inspection with human error policy, where units are checked periodically and failed units are only detected and replaced upon inspection. Finally, the maintenance of a phased-array radar is analyzed as an example of the practical use of maintenance models. Two maintenance models are considered in this case, and policies that minimize the expected cost rates are obtained analytically and numerically.

Toshio Nakagawa
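
For the age replacement policy mentioned above, the long-run expected cost rate is C(T) = [c_p R(T) + c_f F(T)] / integral_0^T R(t) dt, where R is the reliability function, F = 1 - R, and c_p and c_f are the preventive and failure replacement costs. The Python sketch below minimizes this numerically over a grid for a Weibull lifetime; the cost ratio and Weibull parameters are hypothetical.

```python
import numpy as np

def cost_rate(T, shape, scale, c_p, c_f, n_grid=2000):
    """Expected cost per unit time under age replacement at age T:
    C(T) = [c_p*R(T) + c_f*F(T)] / integral_0^T R(t) dt, Weibull lifetime."""
    t = np.linspace(0.0, T, n_grid)
    R = np.exp(-(t / scale) ** shape)             # Weibull reliability function
    mean_cycle_length = np.trapz(R, t)            # expected length of a replacement cycle
    return (c_p * R[-1] + c_f * (1.0 - R[-1])) / mean_cycle_length

shape, scale = 2.5, 1000.0        # hypothetical Weibull parameters (wear-out: shape > 1)
c_p, c_f = 1.0, 10.0              # preventive vs. failure replacement costs

ages = np.linspace(100, 3000, 300)
rates = [cost_rate(T, shape, scale, c_p, c_f) for T in ages]
T_star = ages[int(np.argmin(rates))]
print(f"approximately optimal replacement age: {T_star:.0f} hours")
```
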

Applications in Engineering Statistics

Frontmatter
47. Risks and Assets Pricing

This chapter introduces the basic elements of risk and financial asset pricing. Asset pricing is considered in two essential situations, complete and incomplete markets, and the definition and use of a number of essential financial instruments is described. Specifically, stocks (as underlying processes), bonds and derivative products (in particular European and American call and put options) are considered. The intent of the chapter is neither to cover all the many techniques and approaches that are used in asset pricing, nor to provide a complete introduction to financial asset pricing and financial engineering. Rather, the intent is to outline, through applications and problems, the essential mathematical techniques and financial economic concepts used to assess the value of risky assets. An extensive set of references is also included to direct the motivated reader to further research in this broad and evolving domain of economic and financial engineering and mathematics that deals with asset pricing. The first part of the chapter (the introduction and Sect. 47.1) deals with a definition of risk and outlines the basic terminology used in asset pricing. Further, some essential elements of the Arrow–Debreu framework that underlies the fundamental economic approach to asset pricing are introduced. The second part (Sect. 47.2) develops the concepts of risk-neutral pricing, no arbitrage and complete markets. A number of examples are used to demonstrate how we can determine a probability measure under which risk-neutral pricing can be applied to value assets when markets are complete. In this section, a distinction between complete and incomplete markets is also introduced. Sections 47.3, 47.4 and 47.5 provide an introduction to, and examples of, basic financial approaches and instruments. First, Sect. 47.3 outlines the basic elements of the consumption capital asset-pricing model (with the CAPM stated as a special case). Section 47.4 introduces the basic elements of net present value and bonds, calculates the yield curve as well as the term structure of interest rates, and provides a brief discussion of default and rated bonds. Section 47.5 takes a traditional approach to the pricing of options using the risk-neutral approach (for complete markets). European and American options are considered and priced in a number of examples. The Black–Scholes model is introduced and solved, and extensions to option pricing with stochastic volatility, underlying stock prices with jumps, and options on bonds are introduced and solved for specific examples. The last section of the chapter focuses on incomplete markets and outlines techniques that are used to price assets when markets are incomplete. In particular, the following problems are considered: the pricing of rated bonds (whether default-prone or not), engineered risk-neutral pricing (based on data regarding options or other derivatives) and, finally, the maximum-entropy approach for calculating an approximate risk-neutral distribution.

Charles Tapiero
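
As a small concrete example of the option pricing discussed in Sect. 47.5, the Python sketch below evaluates the closed-form Black–Scholes price of a European call on a non-dividend-paying stock; the market parameters are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def black_scholes_call(S, K, r, sigma, T):
    """Black-Scholes price of a European call on a non-dividend-paying stock."""
    N = NormalDist().cdf
    d1 = (log(S / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * N(d1) - K * exp(-r * T) * N(d2)

# hypothetical market data: spot 100, strike 105, 5% rate, 20% volatility, 1 year
print(f"call price: {black_scholes_call(100.0, 105.0, 0.05, 0.20, 1.0):.4f}")
```
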
48. Statistical Management and Modeling for Demand of Spare Parts

In recent years increased emphasis has been placed on improving decision making in business and government. A key aspect of decision making is being able to predict the circumstances surrounding individual decision situations. Examining the diversity of requirements in planning and decision-making situations, it is clear that no single forecasting method or narrow set of methods can meet the needs of all decision-making situations. Moreover, these methods depend strongly on factors such as data quantity, pattern and accuracy, which reflect their inherent capabilities and adaptability, such as intuitive appeal, simplicity, ease of application and, not least, cost. Section 48.1 presents the demand-forecasting problem as one of the biggest challenges in the repair and overhaul industry; after this brief introduction, Sect. 48.2 summarizes the most important categories of forecasting methods; Sects. 48.3–48.4 approach the forecasting of spare parts firstly as a theoretical construct, although some industrial applications and results from field training are added, as in many other parts of this chapter. Section 48.5 addresses the question of the optimal stock level for spare parts, with particular regard to low-turnaround-index (LTI) parts conceived and designed for the satisfaction of a specific customer request, by the application of classical Poisson methods of minimal availability and minimum cost; similar considerations are drawn and compared in Sect. 48.6, which deals with models based on the binomial distribution. An innovative extension of binomial models based on the total cost function is discussed in Sect. 48.7. Finally, Sect. 48.8 adds the Weibull failure-rate function to the analysis of the LTI spare-parts stock level in a maintenance system with declared wear conditions.

Emilio Ferrari, Arrigo Pareschi, Alberto Regattieri, Alessandro Persona
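
A minimal version of the Poisson stock-level calculation mentioned above: find the smallest number of spares s such that the probability that Poisson demand over the resupply lead time does not exceed s meets a target availability. The Python sketch below implements this; the demand rate, lead time and target are hypothetical, and the chapter's cost-based and binomial extensions are not included.

```python
from math import exp, factorial

def poisson_stock_level(rate, lead_time, target_availability):
    """Smallest stock level s such that P(demand over the lead time <= s)
    meets the target availability, assuming Poisson demand."""
    mean_demand = rate * lead_time
    s, cumulative = 0, exp(-mean_demand)
    while cumulative < target_availability:
        s += 1
        cumulative += exp(-mean_demand) * mean_demand ** s / factorial(s)
    return s

# hypothetical LTI part: 0.4 failures/year, 0.5-year resupply lead time, 95% availability
print("stock level:", poisson_stock_level(rate=0.4, lead_time=0.5, target_availability=0.95))
```
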
49. Arithmetic and Geometric Processes

Section 49.1 introduces two special monotone processes: the arithmetic process (AP) and the geometric process (GP). A stochastic process is an AP (or a GP) if there exists some real number (or some positive real number) such that, after some additions (or multiplications), it becomes a renewal process (RP). Either is a stochastically monotone process and can be used to model a point process, i.e. point events occurring in a haphazard way in time or space, especially with a trend. For example, the events may be failures arising from a deteriorating machine, and such a series of failures is distributed haphazardly along a time continuum. Sections 49.2–49.5 discuss estimation procedures for a number K of independent, homogeneous APs (or GPs). More specifically: in Sect. 49.2, Laplaceʼs statistics are recommended for testing whether a process has a trend or K processes have a common trend, and a graphical technique is suggested for testing whether K processes come from a common AP (or GP) as well as having a common trend; in Sect. 49.3, three parameters – the common difference (or ratio), the intercept and the variance of errors – are estimated using simple linear regression techniques; in Sect. 49.4, a statistic is introduced for testing whether K processes come from a common AP (or GP); in Sect. 49.5, the mean and variance of the first average random variable of the AP (or GP) are estimated based on the results derived in Sect. 49.3. Section 49.6 mentions some simulation studies performed to evaluate various nonparametric estimators and to compare the estimates of the parameters obtained from the various estimators. Some suggestions for selecting the best estimators under three non-overlapping ranges of the common difference (or ratio) values are made based on the results of the simulation studies. In Sect. 49.7, ten real data sets are treated as examples to illustrate the fitting of AP, GP, homogeneous Poisson process (HPP) and nonhomogeneous Poisson process (NHPP) models. In Sect. 49.8, new repair–replacement models are proposed for a deteriorating system, in which the successive operating times of the system form an arithmetico-geometric process (AGP) and are stochastically decreasing, while the successive repair times after failure also constitute an AGP but are stochastically increasing. Two kinds of replacement policy are considered, one based on the working age (a continuous decision variable) of the system and the other determined by the number of failures (a discrete decision variable) of the system. These policies are considered together with the performance measures, namely loss (or its negation, profit), cost, and downtime (or its complement, availability). Applying the well-known results of renewal reward processes, expressions are derived for the long-run expected performance measure per unit total time, and for the long-run expected performance measure per unit operation time, under the two kinds of policy proposed. In Sect. 49.9, some conclusions on the applicability of an AP and/or a GP are drawn based on partial findings from four real case studies. Section 49.10 gives five concluding remarks. Finally, the derivations of some key results are outlined in the Appendix, followed by the results for both APs and GPs summarized in Table 49.6 for easy reference. Most of the content of this chapter is based on the authorʼs own original works that appeared in Leung et al. [49.1,2,3,4,5,6,7,8,9,10,11,12,13], while some is extracted from Lam et al. [49.14,15,16]. In this chapter, the procedures are, for the most part, discussed in reliability terminology. Of course, the methods are valid in any area of application (see Examples 1, 5, 6 and 9 in Sect. 49.7), in which case they should be interpreted accordingly.

Kit-Nam Leung
50. Six Sigma

The first part of this chapter describes what Six Sigma is, why we need Six Sigma, and how to implement Six Sigma in practice. A typical business structure for Six Sigma implementation is introduced, and potential failure modes of Six Sigma are also discussed. The second part describes the core methodology of Six Sigma, which consists of five phases: define, measure, analyze, improve, and control (DMAIC). Specific operational steps in each phase are described in sequence. Key tools to support the DMAIC process, including both statistical tools and management tools, are also presented. The third part highlights a specific Six Sigma technique for product development and service design, design for Six Sigma (DFSS), which is different from DMAIC. DFSS also has five phases: define, measure, analyze, design and verify (DMADV), spread over product development. Each phase is described and the corresponding key tools to support each phase are presented. In the fourth part, a real case study on printed circuit board (PCB) improvement is used to demonstrate the application of Six Sigma. The company and process background are provided. The DMAIC approach is followed throughout, and the key supporting tools are illustrated accordingly. At the end, the financial benefit of this case is realized through the reduction of the cost of poor quality (COPQ). Finally, the last part is given over to a discussion of future prospects and conclusions.

Fugee Tsung
51. Multivariate Modeling with Copulas and Engineering Applications

This chapter reviews multivariate modeling with copulas and provides novel applications in engineering. A copula separates the dependence structure of a multivariate distribution from its marginal distributions. Properties and statistical inferences of copula-based multivariate models are discussed in detail. Applications in engineering are illustrated via examples of bivariate process control and degradation analysis, using existing data in the literature. A software package has been developed to promote the development and application of copula-based methods. Section 51.1 introduces the concept of copulas and its connection to multivariate distributions. The most important result about copulas is Sklarʼs theorem, which shows that any continuous multivariate distribution has a canonical representation by a unique copula and all its marginal distributions. A general algorithm to simulate random vectors from a copula is also presented. Section 51.2 introduces two commonly used classes of copulas: elliptical copulas and Archimedean copulas. Simulation algorithms are also presented. Section 51.3 presents the maximum-likelihood inference of copula-based multivariate distributions given the data. Three likelihood approaches are introduced. The exact maximum-likelihood approach estimates the marginal and copula parameters simultaneously by maximizing the exact parametric likelihood. The inference functions for margins approach is a two-step approach, which estimates the marginal parameters separately for each margin in a first step, and then estimates the copula parameters given the marginal parameters. The canonical maximum-likelihood approach is for copula parameters only, using uniform pseudo-observations obtained by transforming all the margins by their empirical distribution functions. Section 51.4 presents two novel engineering applications. The first example is a bivariate process-control problem, where marginal normality seems appropriate but joint normality is suspicious. A Clayton copula provides a better fit to the data than a normal copula. Through simulation, the upper control limit of Hotellingʼs T2 chart based on normality is shown to be misleading when the true copula is a Clayton copula. The second example is a degradation analysis, where all the margins are skewed and heavy-tailed. A multivariate gamma distribution with a normal copula fits the data much better than a multivariate normal distribution. Section 51.5 concludes and points to references about other aspects of copula-based multivariate modeling that are not discussed in this chapter. An open-source software package for the R project has been developed to promote copula-related methodology development and applications. An introduction to the package and illustrations are provided in the Appendix.

Jun Yan
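
The general simulation algorithm mentioned in Sect. 51.1 is easy to sketch for a normal (Gaussian) copula: generate correlated standard normals, transform them to uniforms with the normal CDF, and then apply the inverse marginal CDFs. The Python example below does this with SciPy (assumed available); the correlation value and the gamma/Weibull margins are chosen arbitrarily, and the chapter's R package is not used here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
rho = 0.7                                               # hypothetical copula correlation
L = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))

# simulate from a bivariate normal copula
z = rng.standard_normal((5000, 2)) @ L.T                # correlated standard normals
u = stats.norm.cdf(z)                                   # dependent Uniform(0,1) pairs

# apply inverse marginal CDFs (gamma and Weibull margins, chosen arbitrarily)
x1 = stats.gamma(a=2.0, scale=1.5).ppf(u[:, 0])
x2 = stats.weibull_min(c=1.8, scale=3.0).ppf(u[:, 1])

rho_s, _ = stats.spearmanr(x1, x2)
print("sample Spearman correlation:", round(float(rho_s), 3))
```
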
52. Queuing Theory Applications to Communication Systems: Control of Traffic Flows and Load Balancing

The tremendous increase in traffic on modern communication systems, such as the World Wide Web, has made it imperative that users of these systems have some understanding not only of how they are fabricated but also of how the packets that traverse their links are scheduled to their hosts in an efficient and reliable manner. In this chapter, we investigate the role that modern queueing theory plays in achieving this aim. We also provide up-to-date and in-depth knowledge of how queueing techniques have been applied to areas such as prioritizing traffic flows, load balancing and congestion control on the modern internet. The Introduction gives a synopsis of the key topics of application covered in this chapter, i.e. congestion control using finite-buffer queueing models, load balancing, and how reliable transmission is achieved using various transmission control protocols. In Sect. 52.1, we provide a brief review of the key concepts of queueing theory, including a discussion of the performance metrics, scheduling algorithms and traffic variables underlying simple queues. A discussion of the continuous-time Markov chain is also presented, linking it with the lack-of-memory property of the exponential random variable and with simple Markovian queues. A class of queues, known as multiple-priority dual queues (MPDQ), is introduced and analyzed in Sect. 52.2. This type of queue consists of a dual queue and incorporates differentiated classes of customers in order to improve their quality of service. Firstly, MPDQs are simulated under different scenarios and their performance compared using a variety of performance metrics. Secondly, a full analysis of MPDQs is given using continuous-time Markov chains. Finally, we show how the expected waiting times of the different classes of customers are derived for an MPDQ. Section 52.3 describes current approaches to assigning tasks to a distributed system. It highlights the limitations of many task-assignment policies, especially when task sizes have a heavy-tailed distribution. For these so-called heavy-tailed workloads, several size-based load distribution policies are shown to perform much better than classical policies. Amongst these, the policies based on prioritizing traffic flows are shown to perform best of all. Section 52.4 gives a detailed account of how the balance between maximizing throughput and controlling congestion is achieved in modern communication networks. This is mainly accomplished through the use of transmission control protocols and the selective dropping of packets. It is demonstrated that queueing theory is extensively applied in this area to model the phenomena of reliable transmission and congestion control. The final section concludes with a brief discussion of further work in this area, an area which is growing at a rapid rate both in complexity and level of sophistication.

Panlop Zeephongsekul, Anthony Bedford, James Broberg, Peter Dimopoulos, Zahir Tari
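
As a reminder of the basic performance metrics reviewed in Sect. 52.1, the Python sketch below evaluates the standard steady-state formulas for the M/M/1 queue, using Little's law for the time-based measures; the arrival and service rates are hypothetical.

```python
def mm1_metrics(arrival_rate, service_rate):
    """Steady-state performance measures of an M/M/1 queue."""
    rho = arrival_rate / service_rate            # server utilization (must be < 1)
    if rho >= 1:
        raise ValueError("queue is unstable: arrival rate >= service rate")
    L = rho / (1 - rho)                          # mean number in system
    W = L / arrival_rate                         # mean time in system (Little's law)
    Lq = rho ** 2 / (1 - rho)                    # mean number waiting in queue
    Wq = Lq / arrival_rate                       # mean waiting time in queue
    return {"utilization": rho, "L": L, "W": W, "Lq": Lq, "Wq": Wq}

# hypothetical link: 80 packets/s arriving, 100 packets/s service capacity
print(mm1_metrics(arrival_rate=80.0, service_rate=100.0))
```
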
53. Support Vector Machines for Data Modeling with Software Engineering Applications

This chapter presents the basic principles of support vector machines (SVM) and their construction algorithms from an applications perspective. The chapter is organized into three parts. The first part consists of Sects. 53.2 and 53.3. In Sect. 53.2 we describe the data modeling issues in classification and prediction problems. In Sect. 53.3 we give an overview of the support vector machine (SVM) with an emphasis on its conceptual underpinnings. In the second part, consisting of Sects. 53.4–53.9, we present a detailed discussion of the support vector machine for constructing classification and prediction models. Sections 53.4 and 53.5 describe the basic ideas behind an SVM and are the key sections. Section 53.4 discusses the construction of the optimal hyperplane for the simple case of linearly separable patterns and its relationship to the Vapnik–Chervonenkis dimension. A detailed example is used for illustration. The relatively more difficult case of nonseparable patterns is discussed in Sect. 53.5. The use of inner-product kernels for nonlinear classifiers is described in Sect. 53.6 and is illustrated via an example. Nonlinear regression is described in Sect. 53.7. The issue of specifying SVM hyperparameters is addressed in Sect. 53.8, and a generic SVM construction flowchart is presented in Sect. 53.9. The third part details two case studies. In Sect. 53.10 we present the results of a detailed analysis of module-level NASA data for developing classification models. In Sect. 53.11, effort data from 75 projects is used to obtain nonlinear prediction models and analyze their performance. Section 53.12 presents some concluding remarks, current activities in support vector machines, and some guidelines for further reading.

Hojung Lim, Amrit Goel
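
A minimal classification example in the spirit of Sects. 53.4–53.5: the Python sketch below fits a soft-margin linear SVM to a small synthetic two-class data set using scikit-learn (assumed available); the data are simulated and are not the NASA or effort data analyzed in the chapter, and C is the kind of hyperparameter discussed in Sect. 53.8.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
# hypothetical module metrics: two features, two classes (e.g. fault-prone vs. not)
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X1 = rng.normal(loc=[2.5, 2.5], scale=1.0, size=(100, 2))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SVC(kernel="linear", C=1.0)        # soft-margin maximal-margin hyperplane
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
print("support vectors per class:", clf.n_support_)
```
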
54. Optimal System Design

The first section of this chapter describes various applications of optimal system design and the associated mathematical formulations. Special attention is given to the randomness associated with system characteristics. The problems are described from the reliability engineering point of view. The section includes a detailed state-of-the-art presentation of various spares optimization models and their applications. The second section describes the importance of optimal cost-effective designs. Detailed formulations of cost-effective designs for repairable and nonrepairable systems are discussed. Various cost factors such as failure cost, downtime cost, spares cost, and maintenance cost are considered. In order to apply these methods to real-life situations, various constraints including acceptable reliability and availability, weight and space limitations, and budget limitations are addressed. The third section describes the solution techniques and algorithms used for optimal system-design problems. The algorithms are broadly classified as exact algorithms, heuristics, meta-heuristics, approximations, and hybrid methods. The merits and demerits of these algorithms are described, and the importance of bounds on the optimal solutions is explained. The fourth section describes the usefulness of hybrid methods in solving large problems in a realistic time frame. A detailed description of the latest research findings relating to hybrid methods and their computational advantages is provided. One of the major advantages of these algorithms is that they find near-optimal solutions quickly and improve the solution quality iteratively. Further, each iteration improves the search efficiency by reducing the search space as a result of the improved bounds. The efficiency of the proposed method is demonstrated through numerical examples.

Suprasad Amari
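
As a toy illustration of a heuristic for optimal system design, the Python sketch below performs a greedy redundancy allocation for a series system of parallel subsystems, repeatedly adding the redundant unit with the largest reliability gain per unit cost until a budget is exhausted; the reliabilities, costs and budget are hypothetical, and this is not one of the algorithms presented in the chapter.

```python
import numpy as np

# hypothetical series system: per-unit reliability and cost of each subsystem
p = np.array([0.90, 0.85, 0.95])     # reliability of one unit in each subsystem
c = np.array([2.0, 3.0, 1.5])        # cost of adding one redundant unit
budget = 12.0

n = np.ones(len(p), dtype=int)       # start with one unit per subsystem
spent = 0.0

def system_reliability(n):
    """Series system of parallel subsystems with n_i identical units each."""
    return float(np.prod(1.0 - (1.0 - p) ** n))

while True:
    base = system_reliability(n)
    best_i, best_gain = None, 0.0
    for i in range(len(p)):
        if spent + c[i] > budget:                     # cannot afford another unit here
            continue
        trial = n.copy(); trial[i] += 1
        gain = (system_reliability(trial) - base) / c[i]   # reliability gain per unit cost
        if gain > best_gain:
            best_i, best_gain = i, gain
    if best_i is None:
        break
    n[best_i] += 1
    spent += c[best_i]

print("allocation:", n, "cost:", spent, "reliability:", round(system_reliability(n), 4))
```
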
Backmatter
Metadata
Title
Springer Handbook of Engineering Statistics
Editor
Hoang Pham, Prof.
Copyright Year
2006
Publisher
Springer London
Electronic ISBN
978-1-84628-288-1
Print ISBN
978-1-85233-806-0
DOI
https://doi.org/10.1007/978-1-84628-288-1