
About this book

Optimization techniques have been widely adopted to implement various data mining algorithms. In addition to the well-known Support Vector Machines (SVMs), which are based on quadratic programming, different versions of Multiple Criteria Programming (MCP) have been extensively used in data separation. Since optimization-based data mining methods differ from statistics, decision tree induction, and neural networks, their theoretical inspiration has attracted many researchers interested in algorithm development for data mining.

Optimization based Data Mining: Theory and Applications focuses mainly on MCP and SVMs, especially their recent theoretical progress and real-life applications in various fields, including finance, web services, bio-informatics and petroleum engineering. These applications have attracted the interest of practitioners who look for new methods to improve the results of data mining for knowledge discovery.

Most of the material in this book comes directly from the research and application activities that the authors’ research group has conducted over the last ten years. Aimed at practitioners and graduates who have a fundamental knowledge of data mining, it demonstrates the basic concepts and foundations of how to use optimization techniques to deal with data mining problems.

Table of Contents

Frontmatter

Chapter 20. Intelligent Knowledge Management

Abstract
Data mining has become a popular technical means of extracting knowledge that is previously unknown to humans, yet potentially useful, from large amounts of incomplete, noisy, fuzzy and random data. Knowledge discovered by data mining algorithms from large-scale databases is often highly novel, frequently beyond the experience of experts, and its irreplaceability and complementarity have brought new opportunities for decision-making. Access to knowledge through data mining has therefore been of great interest for business applications such as business intelligence. However, from the perspective of knowledge management, knowledge discovery by data mining from large-scale databases faces several challenging problems; we therefore call the knowledge or hidden patterns discovered by data mining “rough knowledge”. Such knowledge has to be examined at a “second order” in order to derive the knowledge accepted by users or organizations. In this chapter, we define this new knowledge as “intelligent knowledge”, propose a framework for its management process (intelligent knowledge management, IKM), and present other theoretical results.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Part I

Chapter 1. Support Vector Machines for Classification Problems

Abstract
Support vector machines (SVMs), introduced by Vapnik in the early 1990s, are powerful techniques for machine learning and data mining. Recent breakthroughs have led to advancements in both theory and applications. SVMs were initially developed to solve classification problems, but they have since been extended to regression and clustering problems. Such standard SVMs require the solution of either a quadratic or a linear program. In this chapter, we introduce the basic concepts of SVMs: the maximum margin method, the dual problem, the soft margin, and kernel functions, and then present the standard algorithm C-support vector classification (C-SVC). In particular, for classification problems whose training sets contain nominal attributes, we build a new SVM that learns the distances between nominal attribute values, improving on the most popular approaches, which assume that all attribute values are equally distant from each other.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li
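
For orientation, the standard soft-margin C-SVC primal problem mentioned above is usually written as follows (the textbook formulation, which may differ in notation from the chapter):

\[
\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{\ell}\xi_{i}
\qquad \text{s.t.}\quad y_{i}\,(w\cdot x_{i}+b)\ \ge\ 1-\xi_{i},\quad \xi_{i}\ \ge\ 0,\quad i=1,\dots,\ell,
\]

where the penalty parameter C > 0 controls the trade-off between a wide margin and the slack variables \xi_i that allow training errors.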

Chapter 2. LOO Bounds for Support Vector Machines

Abstract
The success of support vector machines depends on the tuning of several parameters that affect the generalization error. An effective approach is to estimate the generalization error and then search for parameters that minimize this estimate. This requires estimators that are both effective and computationally efficient. The leave-one-out (LOO) method is the extreme case of cross-validation: a single point is excluded from the training set and the classifier is trained using the remaining points. It is then determined whether this new classifier correctly labels the point that was excluded. The process is repeated over the entire training set, and the LOO error is computed by averaging over these trials. The LOO error provides an almost unbiased estimate of the generalization error. However, one shortcoming of the LOO method is that it is highly time-consuming, so methods are sought to speed up the process. An effective approach is to approximate the LOO error by an upper bound that is a function of the parameters, and then search for parameters that minimize this upper bound. This approach has been developed successfully for both support vector classification and support vector regression machines. In this chapter we introduce further LOO bounds for several support vector machine algorithms.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li
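
For reference, the LOO error described above can be written as

\[
R_{\mathrm{LOO}} \;=\; \frac{1}{\ell}\sum_{i=1}^{\ell} \mathbf{1}\!\left(f^{(i)}(x_{i}) \neq y_{i}\right),
\]

where f^{(i)} is the classifier trained on the \ell-1 points remaining after (x_i, y_i) is removed and \mathbf{1}(\cdot) is the indicator function; the bounds discussed in the chapter upper-bound this quantity as a function of the SVM parameters.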

Chapter 3. Support Vector Machines for Multi-class Classification Problems

Abstract
Multi-class classification refers to constructing a decision function from an input space onto an unordered set of classes based on an independently and identically distributed training set. Currently, there are roughly two types of SVC algorithms for solving multi-class classification problems. One is the “decomposition-reconstruction” approach, which mainly makes direct use of binary SVC to tackle the multi-class classification task, such as the “one versus the rest” method, the “one versus one” method, the “error correcting output code” method, and the “one versus one versus rest” method. The other is the “all together” approach, which solves the multi-class classification problem through a single optimization formulation. In this chapter, we first introduce two such algorithms and then construct another multi-class classification algorithm based on support vector ordinal regression.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li
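
To illustrate the “decomposition-reconstruction” idea in general (not the specific algorithms constructed in this chapter), a minimal one-versus-rest sketch in Python, assuming scikit-learn is available, trains one binary SVC per class and predicts by the largest decision value:

import numpy as np
from sklearn.svm import SVC

def one_vs_rest_fit(X, y):
    """Train one binary SVC per class: class k versus all remaining classes."""
    classifiers = {}
    for k in np.unique(y):
        binary_labels = np.where(y == k, 1, -1)  # relabel: class k -> +1, the rest -> -1
        classifiers[k] = SVC(kernel="rbf", C=1.0).fit(X, binary_labels)
    return classifiers

def one_vs_rest_predict(classifiers, X):
    """Assign each sample to the class whose binary SVC returns the largest decision value."""
    classes = sorted(classifiers)
    scores = np.column_stack([classifiers[k].decision_function(X) for k in classes])
    return np.array(classes)[np.argmax(scores, axis=1)]

The one-versus-one method instead trains a classifier for every pair of classes and reconstructs the final label by voting.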

Chapter 4. Unsupervised and Semi-supervised Support Vector Machines

Abstract
As an important branch of unsupervised learning, clustering analysis aims at partitioning a collection of objects into groups or clusters so that members within each cluster are more closely related to one another than to objects assigned to different clusters. Clustering algorithms provide automated tools that help identify structure in an unlabeled set, and there is a rich body of prior work on this subject. Efficient convex optimization techniques, such as the quadratic and linear programming techniques used to train Support Vector Machines and other kernel machines, have had a profound impact on the field of machine learning. Furthermore, Semi-definite Programming (SDP) extends the toolbox of optimization methods used in machine learning beyond unconstrained, linear and quadratic programming techniques, and has provided effective algorithms to cope with difficult computational problems in optimization and to obtain good approximate solutions. In this chapter, we propose several support vector machine algorithms for unsupervised and semi-supervised problems based on SDP.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Chapter 5. Robust Support Vector Machines

Abstract
In real-world applications, the training data cannot usually be assumed to be known exactly due to measurement and statistical errors. Since the solutions of optimization problems are typically sensitive to perturbations of the training data, errors in the input data tend to be amplified in the decision function, often resulting in solutions far from optimal. It is therefore useful to explore formulations that yield discriminants robust to such estimation errors. In this chapter, we first establish robust versions of SVORM, which are represented as second-order cone programs (SOCPs). As the theoretical foundation, we study the relationship between the solutions of the SOCP and its dual problem; here the second-order cone in Hilbert space is involved. Then we establish a multi-class algorithm based on the above robust SVORM for general multi-class classification problems with perturbations. Furthermore, we construct a robust unsupervised and semi-supervised SVC for problems with uncertain information.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Chapter 6. Feature Selection via lp-Norm Support Vector Machines

Abstract
Although the support vector machine is a promising tool in machine learning, it does not directly provide feature importance. Identifying a subset of features that contribute most to classification is also an important task. The benefit of feature selection is twofold: it leads to parsimonious models, which are often preferred in many scientific problems, and it is also crucial for achieving good classification accuracy in the presence of redundant features. SVM can be combined with various feature selection strategies. Some of these are “filters”: general feature selection methods independent of SVM; others are wrapper-type methods: modifications of SVM that choose important features as well as conduct training and testing. In the machine learning literature there are several proposals for automatic feature selection within the SVM, some of which apply the l0-norm or l1-norm SVM and obtain competitive performance. In this chapter we propose two models, lp-norm C-support vector classification (lp-SVC) and lp-norm proximal support vector machine (lp-PSVM), which combine C-SVC and PSVM, respectively, with a feature selection strategy by introducing the lp-norm (0 < p < 1).
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li
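
In outline, the lp-norm idea replaces the penalty on w in C-SVC with a non-convex lp quasi-norm term; a generic form (which may differ in detail from the lp-SVC and lp-PSVM models of this chapter) is

\[
\min_{w,\,b,\,\xi}\ \|w\|_{p}^{p} + C\sum_{i=1}^{\ell}\xi_{i}
\qquad \text{s.t.}\quad y_{i}\,(w\cdot x_{i}+b)\ \ge\ 1-\xi_{i},\quad \xi_{i}\ \ge\ 0,
\]

with \|w\|_{p}^{p} = \sum_{j}|w_{j}|^{p} and 0 < p < 1; the concave penalty drives many components of w exactly to zero, which is what yields the automatic feature selection.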

Part II

Chapter 7. Multiple Criteria Linear Programming

Abstract
This chapter discusses the formulation of multiple criteria programming. In linear discriminant analysis, data separation can be achieved by two opposite objectives. The first objective separates the observations by minimizing the sum of the deviations (MSD) among the observations; the second maximizes the minimum distances (MMD) of observations from the critical value. According to the concept of Pareto optimality, we can seek the best tradeoff between the two measurements with the Multiple Criteria Linear Programming (MCLP) model, which is similar to the Support Vector Machine model but offers more control parameters and thus more flexibility for better separation of data within the framework of mathematical programming. MCLP models for multiple classes and for unbalanced training sets are then constructed separately. Furthermore, in order to ensure the existence of a solution, we add regularization terms to the objective function of MCLP and construct the regularized MCLP (RMCLP) model.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li
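
A commonly cited two-criteria formulation along these lines (the details of the models in this chapter may differ) assigns each observation A_i an overlapping deviation \alpha_i and a distance \beta_i from the boundary value b, and solves

\[
\min \sum_{i}\alpha_{i}, \quad \max \sum_{i}\beta_{i}
\qquad \text{s.t.}\quad A_{i}x = b + \alpha_{i} - \beta_{i}\ \ (A_{i}\in\mathrm{Bad}),\qquad
A_{i}x = b - \alpha_{i} + \beta_{i}\ \ (A_{i}\in\mathrm{Good}),\qquad \alpha_{i},\ \beta_{i}\ \ge\ 0,
\]

so that MSD (minimizing the sum of the \alpha_i) and MMD (maximizing the sum of the \beta_i) are traded off over the same set of linear constraints.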

Chapter 8. MCLP Extensions

Abstract
This chapter extends the MCLP model to deal with different problems: fuzzy MCLP models for fuzzy classification problems, kernel-based MCLP for nonlinear classification problems, and knowledge-based MCLP for classification problems with prior knowledge. To address the limitation that the MCLP model cannot identify and remove redundancy in the variable or attribute set, we construct a new method that effectively combines rough sets with the MCLP model for classification in data mining. Finally, we extend the MCLP model to regression problems by transforming them into classification problems.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Chapter 9. Multiple Criteria Quadratic Programming

Abstract
The purpose of this chapter is to propose a general optimization-based framework that unifies some existing optimization-based classification methods, to derive new models from this general framework, and to test the efficiency of these models on real problems. We adopt existing algorithms to solve these models; proposing new and efficient methods that exploit the structure of these models remains a topic for further study. Two new models are proposed in this chapter: the Multi-Criteria Convex Quadratic Programming (MCQP) model and the kernel-based MCQP model.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Chapter 10. Non-additive MCLP

Abstract
To model the interactions among attributes for classification, non-additive measures are studied in this chapter. Non-additive measures provide very important information regarding the interactions among attributes and are potentially useful for data mining. The concept of non-additive measures (also referred to as fuzzy measure theory) was initiated in the 1950s and has been well developed since the 1970s. Non-additive measures have been successfully used as a data aggregation tool for many applications such as information fusion, multiple regression and classification. Nonlinear integrals are the aggregation tools for non-additive measures. The Choquet integral, a nonlinear integral, is utilized to aggregate the feature attributes with respect to the non-additive measure. The non-additive MCLP classification models are constructed in this chapter, and because the use of non-additive measures increases the computational cost, two major solutions for reducing the number of non-additive measure coefficients are given: the hierarchical Choquet integral and the k-additive measure.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li
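
For reference, the discrete Choquet integral of non-negative feature values f(x_1), \dots, f(x_n) with respect to a non-additive measure \mu is usually defined as

\[
(C)\!\int f\,d\mu \;=\; \sum_{i=1}^{n}\bigl[f(x_{(i)}) - f(x_{(i-1)})\bigr]\,\mu(A_{(i)}),
\]

where the values are reordered so that 0 = f(x_{(0)}) \le f(x_{(1)}) \le \dots \le f(x_{(n)}) and A_{(i)} = \{x_{(i)},\dots,x_{(n)}\}; since \mu is defined on all subsets of the n attributes, it involves 2^n - 1 coefficients, which is the computational cost that the hierarchical Choquet integral and the k-additive measure are intended to reduce.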

Chapter 11. MC2LP

Abstract
Instead of finding the best boundary at random, we seek the best linear combination for the best classifier. That is, in addition to considering the criteria space that contains the tradeoffs of multiple criteria in MSD, this chapter constructs the MC2LP model, whose structure includes a constraint-level space representing all possible tradeoffs of resource availability levels (i.e. the tradeoff between the upper and lower boundaries), and also extends it to the multi-class classification problem. We also formulate the Minimal Error and Maximal Between-class Variance (MEMBV) model by using the objective function of Fisher’s LDA (maximizing the between-class variance) and the MC2LP model for relaxing the constraints.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Part III

Chapter 12. Firm Financial Analysis

Abstract
Financial institutions and banks are among the industries that have relatively complete and accurate data, and they were among the first to employ advanced analytic techniques on their data. Typical cases include stock investment, loan payment prediction, credit approval, bankruptcy prediction, and fraud detection. Classification is one of the most extensively used data mining methods in finance and banking. To test the applicability of the preceding multiple-criteria mathematical models in finance and banking, we select one model, the MCQP of Chap. 9, and apply it to three financial datasets. These datasets come from three countries and represent consumer credit card application data, credit approval data, and corporate bankruptcy data; they represent different problems in finance and banking. For comparison purposes, the result of MCQP is compared with four well-known classification tools: SPSS (linear discriminant analysis, LDA), decision-tree-based See5, SVMlight, and LibSVM.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Chapter 13. Personal Credit Management

Abstract
The most commonly used methods for predicting credit card defaulters are credit scoring models. Based on their applications in credit management, scoring models can be classified into two categories: the first concerns application scores and the second concerns behavior scores. Specifically, behavior scores are used to determine “raising or lowering the credit limit; how the account should be treated with regard to promotional or marketing decisions; and when action should be taken on a delinquent account”. Behavior scoring models utilize various techniques to identify attributes that can effectively separate credit cardholders’ behaviors. In this chapter, using a real-life credit card dataset, we first conduct the MCQP classification and then compare the performance of MCQP with MCLP, linear discriminant analysis (LDA), decision tree (DT), support vector machine (SVM), and neural network (NN) methods in terms of predictive accuracy.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Chapter 14. Health Insurance Fraud Detection

Abstract
Health insurance fraud detection is an important and challenging task. Traditionally, insurance companies use human inspections and heuristic rules to detect fraud. As the number of electronic insurance claims increases each year, it is difficult to detect insurance fraud in a timely manner by manual methods alone. In addition, new types of fraud emerge constantly, and SQL operations based on heuristic rules cannot identify these newly emerging fraud schemes. Such a situation demands more sophisticated analytical methods and techniques capable of detecting fraud activities in large databases. This chapter describes the application of three predictive models, MCLP, decision tree, and Naive Bayes classifier, to identify suspicious claims and assist manual inspections. The predictive models can label high-risk claims, helping investigators focus on suspicious records and accelerating the claim-handling process.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Chapter 15. Network Intrusion Detection

Abstract
Network intrusion refers to inappropriate, incorrect, or anomalous activities aimed at compromising computer networks. The early and reliable detection of network attacks is a pressing issue in today’s network security. Classification methods are one of the major tools in network intrusion detection. A successful network intrusion detection system needs high classification accuracy and a low false alarm rate. In this chapter, we apply the kernel-based MCLP model to network intrusion detection. The performance of this model is tested using two network datasets. The first dataset, NeWT, was collected by the STEAL lab at the University of Nebraska at Omaha; the second is the KDDCUP-99 dataset, which was provided by DARPA in 1998 for the evaluation of intrusion detection approaches.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Chapter 16. Internet Service Analysis

Abstract
According to industry statistics, as the Chinese Internet advanced over the past few years, the total market size of the Chinese VIP E-mail service had reached 640 million RMB by 2005. This huge market has dramatically intensified competition among E-mail service companies. Analyzing the patterns of lost customer accounts is therefore a significant research topic. This chapter focuses on this research in order to support decision-making for reducing the customer loss rate. A dataset with 230 attributes, 5498 current records and 5498 lost records has been collected from a well-known Chinese Internet company that held an IPO on Wall Street. We apply several MCLP models and SVM to seek a solution.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Chapter 17. HIV-1 Informatics

Abstract
The ability to identify neuronal damage in the dendritic arbor during HIV-1-associated dementia (HAD) is crucial for designing specific therapies for the treatment of HAD. In this chapter, we utilize a computer-based image analysis method to quantitatively assess HIV-1 viral protein gp120- and glutamate-mediated damage to individual neurons in cultured cortical neurons. Changes in the number of neurites, arbors, branch nodes, cell body area, and average arbor lengths were determined and compiled into a database. We further propose a two-class MCLP model to classify such HIV-1-mediated neuronal dendritic and synaptic damage. Given certain classes, including treatments with brain-derived neurotrophic factor (BDNF), glutamate, gp120, or non-treatment controls from our in vitro experimental systems, we use the two-class MCLP model to determine the data patterns between classes in order to gain insight into neuronal dendritic damage. This knowledge can be applied in principle to the design and study of specific therapies for the prevention or reversal of neuronal damage associated with HAD. Finally, the MCLP method is compared with a well-known artificial neural network algorithm to test the relative potential of different data mining applications in HAD research.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Chapter 18. Anti-gen and Anti-body Informatics

Abstract
Antibodies bind their antigens using residues that are part of the hypervariable loops. The properties of the antibody-antigen interaction site are primarily governed by the three-dimensional structure of the CDR loops (complementarity determining regions). The mode by which an antibody binds its corresponding antigen is conserved: the antibody structure is rearranged to recognize and bind the antigen with at least moderate affinity. Different types of antibody combining sites have been studied, such as cavity or pocket (hapten), groove (peptide, DNA, carbohydrate) and planar (protein). Much effort has focused on the characteristics of antibody structure, the antibody-antigen binding site, and the effect of mutations on the affinity and specificity of the antibody. In functional studies of antibody-antigen complexes, a few residues among the many contact residues in the antibody-antigen interface are tightly bound. The distance between an antibody’s interface residues and the antigen surface is one of the characteristics of antigen-antibody binding. In this chapter, we define three ranges of interaction distance between antibody residues and the antigen surface. The residues in the antibody structure belonging to these distance ranges are predicted by MCQP, LDA, decision tree and SVM in order to study the correlation between the characteristics of antibody surface residues and the antigen-antibody interaction.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Chapter 19. Geochemical Analyses

Abstract
World natural diamond production for 2004 is estimated at 156 million carats, which translated into 61.5 billion US dollars in worldwide jewelry sales. Even so, the current level of demand for diamonds of high color and quality is still not being met by the world’s producing diamond mines. Numerous companies are carrying out various phases of diamond exploration in Botswana, the world’s leading producer of gem-quality diamonds. Due to the extensive Kalahari sand cover (and the Karoo basalts underneath), sophisticated and innovative sampling and geophysical techniques are required to locate undiscovered kimberlites. The goal of this chapter is to build an analytical model for kimberlite identification. Three classification methods (LDA, decision tree and SVM) are applied to a dataset containing information about rock samples drilled in Botswana.
Yong Shi, Yingjie Tian, Gang Kou, Yi Peng, Jianping Li

Backmatter
