Sequential pattern mining algorithm for automotive warranty data

https://doi.org/10.1016/j.cie.2008.11.006

Abstract

This paper presents a sequential pattern mining algorithm that allows product and quality engineers to extract hidden knowledge from a large automotive warranty database. The algorithm uses the elementary set concept and database manipulation techniques to search for patterns or relationships among occurrences of warranty claims over time. These patterns are represented as IF–THEN sequential rules, where the IF portion of the rule includes one or more occurrences of warranty problems at one time and the THEN portion includes warranty problem(s) that occur at a later time. Once sequential patterns are generated, the algorithm uses rule strength parameters to filter out insignificant patterns, so that only significant rules are reported. Significant patterns provide knowledge of one or more product failures that lead to future product fault(s). The effectiveness of the algorithm is illustrated with a warranty data mining application from the automotive industry. A discussion of the sequential patterns generated by the algorithm and their interpretation for the automotive example is also provided.

Introduction

Many industries, including the automotive industry, are faced with the tasks of improving product quality and minimizing warranty costs. Product quality is a by-product of the effectiveness of product development processes and their production systems. Thus, product quality can be improved through continuous improvements in product design and development of robust manufacturing and assembly systems. However, no matter how well a product is designed and manufactured, it may fail in the usage environment, either by chance or by some assignable causes. When a product fails within a certain time period, the warranty is a manufacturer’s assurance to a buyer that the product will be repaired at no cost to the customer. In a service environment where dealers are more likely to replace than to repair, the cost of component failure during the warranty period can easily equal three to ten times the supplier’s unit price (Baird, 2000; Feng et al., 2001; Cali, 1993). Consequently, companies invest significant amounts of time and resources to monitor, document, and analyze product warranty data.

Product quality problems are monitored during the warranty period through the claims filed against the products. This process generates large volumes of warranty data records, such as product problems in the form of repair-related labor codes, problem descriptions, actions taken, repair dates, and repair costs (labor and parts). Sequential pattern analyses of these data records may provide significant benefits to product manufacturers. A sequential pattern analysis searches for patterns or relationships between data objects in a database that occur over time. The analysis is of particular interest to automotive Original Equipment Manufacturers (OEMs), because it identifies important sequential relationships between various product faults. For example, sequential pattern analysis results may reveal a fault pattern that shows how previous product failures may have led to other product fault(s) at a later time. This knowledge enables companies to effectively predict or discover the root causes of failures that are caused by, or are associated with, the earlier problems. This helps in formulating an action plan to remedy the problems and improve product quality, which leads to significant savings in warranty costs and the attainment of product goodwill.

In this paper, a sequential pattern mining algorithm for automotive warranty data is presented. The proposed algorithm is based on the elementary set concept and database manipulation techniques. The algorithm is constructed to search for significant sequential patterns in preprocessed data sets that are obtained from a large automotive warranty database. The sequential patterns are represented in the form of IF–THEN association rules, where the IF portion of the rule includes quality/warranty problems, represented as labor codes, that occurred at an earlier time, and the THEN portion includes labor codes that occurred at a later time. Once a set of unique sequential patterns is generated, the algorithm applies a set of thresholds to evaluate the significance of the rules, and only the rules that pass these thresholds are reported in the solution. The major differences between the proposed approach and those reported in the literature are presented at the end of this section.
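The IF–THEN rule form described above can be illustrated with a small sketch. This is not the algorithm of this paper: it does not use the elementary set concept, it considers only a single labor code on each side of a rule, and the data layout, the labor codes, the function name `sequential_rules`, and the use of support and confidence as the rule strength thresholds are all assumptions made for illustration.

```python
from collections import defaultdict


def sequential_rules(claims, min_support, min_confidence):
    """Generate IF a THEN b rules where code a is repaired before code b.

    claims: dict mapping a vehicle id to a list of (repair_day, labor_code).
    Support    = share of all vehicles in which a precedes b.
    Confidence = share of vehicles with code a in which a precedes b.
    (Both measures are illustrative assumptions, not the paper's parameters.)
    """
    n_vehicles = len(claims)
    if_count = defaultdict(int)    # vehicles containing the IF code
    rule_count = defaultdict(int)  # vehicles where IF precedes THEN

    for history in claims.values():
        for code in {code for _, code in history}:
            if_count[code] += 1
        # Collect each ordered pair once per vehicle.
        pairs = {
            (a, b)
            for day_a, a in history
            for day_b, b in history
            if day_a < day_b
        }
        for pair in pairs:
            rule_count[pair] += 1

    rules = []
    for (a, b), count in rule_count.items():
        support = count / n_vehicles
        confidence = count / if_count[a]
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return sorted(rules)


# Hypothetical claim histories: vehicle id -> (repair day, labor code).
example_claims = {
    "VIN1": [(10, "E100"), (40, "B200")],
    "VIN2": [(5, "E100"), (30, "B200"), (60, "C300")],
    "VIN3": [(7, "E100")],
    "VIN4": [(12, "B200"), (50, "C300")],
}

for a, b, support, confidence in sequential_rules(example_claims, 0.5, 0.6):
    print(f"IF {a} THEN {b} (support={support:.2f}, confidence={confidence:.2f})")
```

In this toy data set, the rule IF E100 THEN C300 is generated but filtered out, because it appears in only one of the four vehicles.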

Several association rule mining algorithms (Agrawal and Srikant, 1994; Agrawal and Shafer, 1996; Han and Kamber, 2006) and sequential pattern mining algorithms (Agrawal and Srikant, 1995; Thomas, 1998; Pei et al., 2004) have been reported in the literature. Agrawal and Srikant (1994) introduced an Apriori algorithm that generates significant association rules between items in a database such that the support and confidence of the rules are greater than user-specified thresholds. However, the algorithm generates a large number of candidate itemsets, whose sizes grow exponentially with the size of a database. To overcome this problem, Agrawal and Srikant (1995) introduced three different Apriori algorithms that define the problem of sequential pattern mining as finding the maximal (longest) sequences of items that have a certain user-specified minimum support. These algorithms use a candidate generation technique to address the scalability-related shortcomings of their previous approach. Bayardo and Agrawal (1999) proposed metrics for ranking association rules and introduced an algorithm that uses rule support and confidence for extracting the best rules from large data sets. Pei et al. (2004) proposed the efficient PrefixSpan approach for sequential pattern mining. In PrefixSpan, the global database is projected into a set of smaller (local) databases and sequential patterns are constructed by exploring frequently occurring datasets of local databases.
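The projection idea behind PrefixSpan can be sketched in a few lines. The version below is a simplified illustration rather than the algorithm of Pei et al. (2004): it assumes every element of a sequence is a single item (no itemsets), copies suffixes instead of using pseudo-projection, and uses a hypothetical function name and toy database.

```python
from collections import Counter


def prefixspan(sequences, min_support):
    """Return {pattern: support} for sequences made of single items.

    Support is the number of sequences that contain the pattern as a
    subsequence. Simplified sketch: no itemsets, no pseudo-projection.
    """
    results = {}

    def mine(prefix, projected):
        # Count each item at most once per projected (local) sequence.
        counts = Counter()
        for seq in projected:
            counts.update(set(seq))

        for item, support in sorted(counts.items()):
            if support < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = support

            # Build the local database: the suffix that follows the
            # first occurrence of the item in each sequence.
            local_db = []
            for seq in projected:
                if item in seq:
                    suffix = seq[seq.index(item) + 1:]
                    if suffix:
                        local_db.append(suffix)
            mine(pattern, local_db)

    mine((), sequences)
    return results


# Hypothetical sequence database with four sequences.
example_db = [["a", "b", "c"], ["a", "c"], ["b", "c"], ["a", "b"]]

for pattern, support in prefixspan(example_db, min_support=2).items():
    print(pattern, support)
```

Each recursive call works only on the suffixes that follow the current prefix, so the local databases shrink as patterns grow longer.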

Many new efficient algorithms have been proposed to mine sequential patterns. The differences between these algorithms are mostly related to how they improve computational time by imposing some constraints on the mining process, or to subtle differences in how they handle the sequential mining process. For example, Yun (2008) uses weight constraints to reduce the number of unimportant patterns; Chen, Cao, Li, and Qian (2008) incorporate user-defined constraints so that the discovered knowledge better meets user needs; Masseglia, Poncelet, and Teisseire (2008) introduce time constraints in early stages of the data mining process; and Chen and Huang (2008) and Fiot et al. (2007) use fuzzy set techniques and the K-means algorithm (Kuo, Chao, & Liu, 2009) to achieve better computational efficiency.

Kum, Chang, and Wang (2006) proposed a new sequential pattern mining method based on multiple alignment (rather than the usual support-based approach) for mining multiple databases. Multiple databases are mined and summarized at the local level, and only the summarized patterns are used in the global mining process. Laur, Symphor, Nock, and Poncelet (2007) introduced statistical supports to maximize mining precision and improve the computational efficiency of the incremental mining process. Kum, Chang, and Wang (2007) benchmarked the effectiveness of sequential pattern mining methods by comparing a support-based sequential pattern model with an approximate pattern model based on sequence alignment. Chen and Hu (2007) introduced concepts of recency (an ability to quickly adapt to changes in a database) and compactness, which can cause reasonable time spans for discovering data patterns. They have proposed algorithms that use these concepts to adapt to the frequency of changes in discovered patterns in the database. Lin, Chen, Hao, Chueh, and Chang (2008) introduced the notion of positive and negative sequential patterns, where positive patterns include the presence of an itemset of a pattern, and negative patterns are the ones with the absence of an itemset. Ren, Sun, and Guo (2008) developed an incremental sequential pattern mining process that stores the results from the previous mining and uses them to efficiently mine the database when additional data are added.

Typically, warranty data are strictly confidential for most companies because they relate to product quality and reliability and are therefore critical to consumers’ product goodwill. As a result, literature on the warranty data analysis of real-life applications is limited to a few published reports (see Blischke and Murthy, 1994; Majeske and Herrin, 1995; Lu, 1998). Most models and algorithms developed in warranty analysis studies involve warranty cost analysis and can be divided into two categories: (1) one-dimensional studies, which model product failures and warranty costs as a function of the warranty period (see Blischke and Murthy, 1996; Sahin and Polatoglu, 1998), and (2) two-dimensional studies, which model failures and perform warranty analysis by considering both warranty period and length or frequency of usage (see Murthy et al., 1995; Singpurwalla and Wilson, 1998; Majeske, 2007). In most studies, the warranty analysis concentrates on: (a) modeling of failure patterns to estimate the number of occurrences (or recurrences) of failures (components, subassemblies, or systems) over the warranty period, assuming all the usage conditions are statistically similar and all the warranty claims are reported with no delay, (b) modeling of rectification costs incurred by failures, and (c) modeling of the expected warranty costs (see Karim et al., 2001; Lawless, 1998; Polatoglu and Sahin, 1998; Suzuki et al., 2000; Suzuki et al., 2001; Majeske, 2007; Fredette and Lawless, 2007; Kulkarni & Resnick, 2008). Several studies developed empirical models based on the manufacturer’s field data (i.e., failures and costs over the warranty period) for the warranty cost analysis (see Robinson and McDonald, 1991; Lawless and Kalbfleisch, 1992; Hu and Lawless, 1996). Others use probability distribution functions and statistical models for estimating warranty costs with incomplete data (see Karim et al., 2001; Wang and Suzuki, 2001). More recent studies include Gutiérrez-Pulido, Aguirre-Torres, and Christen (2006), which used a utility-function-based method to determine the appropriate warranty length of a product (brake linings), and Jung and Bai (2007), which applied a bivariate reliability model to estimate the lifetime distribution for products. A comprehensive literature review on warranty data analysis can be found in Murthy and Djamaludin (2002).

Although a number of research studies have been reported on warranty analysis, most of them use statistical approaches for cost and/or reliability analysis (Majeske et al., 1997; Kalbfleisch et al., 1991; Hu and Lawless, 1996; Lawless, 1998), while very few have applied data mining techniques to warranty data (Hotz et al., 1999; Buddhakulsomsiri et al., 2006). Hotz et al. (1999) implemented a data mining support environment for planning warranty and goodwill costs in the automotive industry. Regression analysis and a back-propagation neural network were used to construct an automatic prediction tool based on the historical warranty data and goodwill costs. Hotz et al. (2001) later developed statistical and machine learning methods for detecting deviation of warranty costs and for the analysis of warranty and goodwill cost statements. Buddhakulsomsiri et al. (2006) implemented a data mining approach to explore the potential benefits of data mining in automotive warranty data analysis. Potential data mining tasks were identified, based on the type of knowledge to be mined. An association rule generation algorithm was developed for important mining tasks. The algorithm was applied to automotive warranty data to illustrate its effectiveness.

In this paper, a new data mining algorithm is presented that uses the elementary set concept of rough set theory (Pawlak, 1997), with some important modifications, together with database manipulation techniques for identifying significant sequential patterns from a large automotive warranty database. Specifically, the algorithm considers all the possible rules that may be generated from a data set rather than the rules determined from the upper and lower approximations of rough set theory. Furthermore, the algorithm proposed in this paper uses important database set operations to reduce the computation time of the rule generation (Buddhakulsomsiri et al., 2006; Siradeghyan et al., 2008). In addition, sequential mining of warranty data has some unique characteristics not encountered in typical data mining problems. In particular, the same product problem can occur more than once in a given day, which may result in a significant number of duplicate rules during the rule generation process. The proposed algorithm introduces an important procedure (Step 2 of the proposed algorithm) that effectively combines duplicate rules and improves the algorithm’s computational efficiency. We demonstrate the effectiveness of this procedure by showing the number of rules generated by the algorithm with and without the use of this procedure. Finally, this paper presents a unique and perhaps the first data mining application to the automotive warranty problems that arise over time.
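The effect of repeated same-day claims on the number of generated rules can be seen in a toy example. The sketch below is not Step 2 of the proposed algorithm, whose details are not given in this excerpt; it only assumes that a vehicle's history is a list of (repair day, labor code) records and shows how collapsing identical records before pairing avoids duplicate rules. The function names and the sample history are hypothetical.

```python
def raw_rules(history):
    """Pair every earlier record with every later record, duplicates included."""
    return [
        (a, b)
        for day_a, a in history
        for day_b, b in history
        if day_a < day_b
    ]


def combined_rules(history):
    """Collapse identical (day, code) records first, then keep unique rules."""
    unique_records = set(history)
    return sorted(set(raw_rules(unique_records)))


# Hypothetical history: code E100 is logged three times on day 10
# and code B200 twice on day 45 for the same vehicle.
example_history = [
    (10, "E100"), (10, "E100"), (10, "E100"),
    (45, "B200"), (45, "B200"),
]

print(len(raw_rules(example_history)))   # 3 * 2 = 6 copies of the same rule
print(combined_rules(example_history))   # a single rule remains
```

Here three same-day records of one code and two of another produce six copies of one rule, which reduce to a single rule once duplicates are combined.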

The remainder of the paper is organized as follows: Section 2 discusses the source and characteristics of automotive warranty data and the data preprocessing used to extract the data attributes necessary for sequential pattern mining. Section 3 presents the sequential pattern mining algorithm. Section 4 presents computational results of the algorithm when applied to a large automotive warranty data set, with a detailed discussion on sequential pattern generation and interpretation. Conclusions and future research directions are provided in Section 5.

Section snippets

Source of automotive warranty data and data preprocessing

The automotive warranty database contains vehicle attributes and warranty problem related data. Typically, automotive warranty data are obtained from: (1) manufacturing and assembly plants (e.g., vehicle identification number (VIN), production date, product options (attributes), plant ID, supplier data, and so on); (2) automobile dealerships (e.g., VIN, sales date); and (3) repair shops (e.g., repair-related labor code, repair date, mileage-at-repair, labor and part costs, and so on). These

Sequential pattern mining algorithm

The goal of the sequential pattern mining algorithm is to determine associations between two sets of labor codes that occur sequentially and frequently. Such associations provide knowledge about the temporal relationships between diverse product quality problems. The algorithm developed in this study is an extension of the association rule generation algorithm reported in Buddhakulsomsiri et al. (2006). The algorithm includes three different stages. Stage 1 uses the elementary set concept and

Computational results

All three stages of the Sequential Pattern Mining algorithm have been coded in the C#.NET programming environment, and the Oracle 9i database is used to organize and manipulate the automotive warranty data. The computational study presented in this section is conducted on actual automotive warranty data sets of a vehicle model that were collected over a 27-month period. A personal computer with a 2.8 GHz Pentium 4 processor and 512 MB of RAM is used in the experiment. The warranty data are analyzed in three-month

Conclusion

This paper presented a data mining algorithm for extracting significant sequential patterns from a large automotive warranty database. The algorithm used the elementary set concept and database manipulation techniques to search for patterns or relationships among occurrences of warranty claims over time. Significant patterns provided knowledge of one (or more) product failures that led to future product fault(s). These patterns were represented as IF–THEN sequential rules, where the IF portion

References (56)

  • H. Polatoglu et al. (1998). Probability distributions of cost, revenue and profit over a warranty cycle. European Journal of Operational Research.
  • K. Suzuki et al. Statistical analysis of reliability warranty data.
  • U. Yun (2008). A new framework for detecting weighted sequential patterns in large sequence databases. Knowledge-Based Systems.
  • R. Agrawal et al. (1996). Parallel mining of association rules. IEEE Transactions on Knowledge and Data Engineering.
  • Agrawal, R., & Srikant, R. (1994). Fast algorithms for mining association rules. In Proceedings of the 20th...
  • Agrawal, R., & Srikant, R. (1995). Mining sequential patterns. In P.S. Yu & A.L.P. Chen, Proceedings of the...
  • P. Baird (2000). Robert Bosch Corporation failure modes effects analysis (FMEA).
  • Bayardo, R. J., & Agrawal, R. (1999). Mining the most interesting rules. In Proceedings of the 5th ACM SIGKDD...
  • W.R. Blischke et al. (1994). Warranty cost analysis.
  • W.R. Blischke et al. (1996). Product warranty handbook.
  • J. Buddhakulsomsiri et al. (2006). Association rule generation algorithm for mining automotive warranty data. The special issue on data mining applications in engineering design, manufacturing, and logistics engineering. International Journal of Production Research.
  • J. Cali (1993). TQM for purchasing management.
  • V. Dhar et al. (1993). Abstract-driven pattern discovery in databases. IEEE Transactions on Knowledge and Data Engineering.
  • J. Feng et al. (2001). An optimization model for concurrent selection of tolerances and suppliers. Computers and Industrial Engineering.
  • C. Fiot et al. (2007). From crispness to fuzziness: Three algorithms for soft sequential pattern mining. IEEE Transactions on Fuzzy Systems.
  • M. Fredette et al. (2007). Finite-horizon prediction of recurrent events, with application to forecasts of warranty claims. Technometrics.
  • H. Garcia-Molina et al. (2001). Database systems: The complete book.
  • H. Gutiérrez-Pulido et al. (2006). A Bayesian approach for the determination of warranty length. Journal of Quality Technology.