
2005 | Book

Classification — the Ubiquitous Challenge

Proceedings of the 28th Annual Conference of the Gesellschaft für Klassifikation e.V., University of Dortmund, March 9–11, 2004

Editors: Professor Dr. Claus Weihs, Professor Dr. Wolfgang Gaul

Publisher: Springer Berlin Heidelberg

Book Series: Studies in Classification, Data Analysis, and Knowledge Organization


Table of Contents

Frontmatter

(Semi-) Plenary Presentations

Classification and Data Mining in Musicology

Data in music are complex and highly structured. In this talk a number of descriptive and model-based methods are discussed that can be used as pre-processing devices before standard methods of classification, clustering etc. can be applied. The purpose of pre-processing is to incorporate prior knowledge in musicology and hence to filter out information that is relevant from the point of view of music theory. This is illustrated by a number of examples from classical music, including the analysis of scores and of musical performance.

Jan Beran
Bayesian Mixed Membership Models for Soft Clustering and Classification

The paper describes and applies a fully Bayesian approach to soft clustering and classification using mixed membership models. Our model structure has assumptions on four levels: population, subject, latent variable, and sampling scheme. Population level assumptions describe the general structure of the population that is common to all subjects. Subject level assumptions specify the distribution of observable responses given individual membership scores. Membership scores are usually unknown and hence we can also view them as latent variables, treating them as either fixed or random in the model. Finally, the last level of assumptions specifies the number of distinct observed characteristics and the number of replications for each characteristic. We illustrate the flexibility and utility of the general model through two applications using data from: (i) the National Long Term Care Survey, where we explore types of disability; (ii) abstracts and bibliographies from articles published in The Proceedings of the National Academy of Sciences. In the first application we use a Markov chain Monte Carlo implementation for sampling from the posterior distribution. In the second application, because of the size and complexity of the database, we use a variational approximation to the posterior. We also include a guide to other applications of mixed membership modeling.

Elena A. Erosheva, Stephen E. Fienberg
Predicting Protein Secondary Structure with Markov Models

The primary structure of a protein is the sequence of its amino acids. The secondary structure describes structural properties of the molecule such as which parts of it form sheets, helices or coils. Spatial and other properties are described by the higher-order structures. The classification task we are considering here is to predict the secondary structure from the primary one. To this end we train a Markov model on training data and then use it to classify parts of unknown protein sequences as sheets, helices or coils. We show how to exploit the directional information contained in the Markov model for this task. Classifications that are purely based on statistical models might not always be biologically meaningful. We present combinatorial methods to incorporate biological background knowledge to enhance the prediction performance.
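
A minimal sketch of the statistical core described above, in Python: one first-order Markov chain over amino acids is fitted per structural class, and a candidate segment is assigned to the class whose chain gives it the highest log-likelihood. The class labels, pseudocount and helper names are illustrative assumptions; the paper additionally exploits directional information and biological background knowledge, which is not shown here.

    from collections import defaultdict
    import math

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # standard one-letter codes assumed
    CLASSES = ("helix", "sheet", "coil")    # assumed class labels

    def train_markov(segments, pseudocount=1.0):
        """Estimate first-order transition probabilities from training segments."""
        counts = defaultdict(lambda: defaultdict(float))
        for seg in segments:
            for a, b in zip(seg, seg[1:]):
                counts[a][b] += 1.0
        trans = {}
        for a in AMINO_ACIDS:
            total = sum(counts[a].values()) + pseudocount * len(AMINO_ACIDS)
            trans[a] = {b: (counts[a][b] + pseudocount) / total for b in AMINO_ACIDS}
        return trans

    def log_likelihood(segment, trans):
        return sum(math.log(trans[a][b]) for a, b in zip(segment, segment[1:]))

    def classify(segment, models):
        """Assign the segment to the class whose Markov model explains it best."""
        return max(models, key=lambda c: log_likelihood(segment, models[c]))

    # usage sketch (training_segments is a hypothetical dict of labelled segments):
    # models = {c: train_markov(training_segments[c]) for c in CLASSES}
    # label = classify("MKTAYIAKQR", models)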

Paul Fischer, Simon Larsen, Claus Thomsen
Milestones in the History of Data Visualization: A Case Study in Statistical Historiography

The Milestones Project is a comprehensive attempt to collect, document, illustrate, and interpret the historical developments leading to modern data visualization and visual thinking. This paper provides an overview and brief tour of the milestones content, with a few illustrations of significant contributions to the history of data visualization. This forms one basis for exploring interesting questions and problems in the use of statistical and graphical methods to explore this history, a topic that can be called “statistical historiography.”

Michael Friendly
Quantitative Text Typology: The Impact of Word Length

The present study aims at the quantitative classification of texts and text types. By way of a case study, 398 Slovenian texts from different genres and authors are analyzed as to their word length. It is shown that word length is an important factor in the synergetic self-regulation of texts and text types, and that word length may significantly contribute to a new typology of discourse types.

Peter Grzybek, Ernst Stadlober, Emmerich Kelih, Gordana Antić
Cluster Ensembles

Cluster ensembles are collections of individual solutions to a given clustering problem which are useful or necessary to consider in a wide range of applications. Aggregating these to a “common” solution amounts to finding a consensus clustering, which can be characterized in a general optimization framework. We discuss recent conceptual and computational advances in this area, and indicate how these can be used for analyzing the structure in cluster ensembles by clustering their elements.
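
One standard way to aggregate such an ensemble into a consensus partition is evidence accumulation over a co-association matrix, sketched below; this is a common construction consistent with the summary, not necessarily the optimization framework of the paper, and the choice of average linkage is an assumption.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    def consensus_clustering(labelings, n_clusters):
        """labelings: list of 1-D integer label arrays, one per ensemble member."""
        labelings = np.asarray(labelings)
        n = labelings.shape[1]
        # co-association: fraction of ensemble members placing i and j together
        coassoc = np.zeros((n, n))
        for labels in labelings:
            coassoc += (labels[:, None] == labels[None, :])
        coassoc /= len(labelings)
        # turn agreement into a distance and cluster it hierarchically
        dist = 1.0 - coassoc
        np.fill_diagonal(dist, 0.0)
        Z = linkage(squareform(dist, checks=False), method="average")
        return fcluster(Z, t=n_clusters, criterion="maxclust")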

Kurt Hornik
Bootstrap Confidence Intervals for Three-way Component Methods

The two most common component methods for the analysis of three-way data, CANDECOMP/PARAFAC (CP) and Tucker3 analysis, are used to summarize a three-mode three-way data set by means of a number of component matrices, and, in the case of Tucker3, a core array. Until recently, no procedures for computing confidence intervals for the results from such analyses were available. Recently, such procedures have become available, by Riu and Bro (2003) for CP using the jack-knife procedure, and by Kiers (2004) for CP and Tucker3 analysis using the bootstrap procedure. The present paper reviews the latter procedures, discusses their performance as reported by Kiers (2004), and illustrates them on an example data set.

Henk A.L. Kiers
Organising the Knowledge Space for Software Components

Software development has become a distributed, collaborative process based on the assembly of off-the-shelf and purpose-built components. The selection of software components from component repositories and the development of components for these repositories requires an accessible information infrastructure that allows the description and comparison of these components.

General knowledge relating to software development is equally important in this context as knowledge concerning the application domain of the software. Both form two pillars on which the structural and behavioural properties of software components can be expressed. Form, effect, and intention are the essential aspects of process-based knowledge representation with behaviour as a primary property.

We investigate how this information space for software components can be organised in order to facilitate the required taxonomy, thesaurus, conceptual model, and logical framework functions. The focal point is an axiomatised ontology that, in addition to the usual static view on knowledge, also intrinsically addresses the dynamics, i.e. the behaviour of software. Modal logics are central here, providing a bridge between classical (static) knowledge representation approaches and behaviour and process description and classification.

We relate our discussion to the Web context, looking at Web services as components and the Semantic Web as the knowledge representation framework.

Claus Pahl
Multimedia Pattern Recognition in Soccer Video Using Time Intervals

In this paper we propose the Time Interval Multimedia Event (TIME) framework as a robust approach for recognition of multimedia patterns, e.g. highlight events, in soccer video. The representation used in TIME extends the Allen temporal interval relations and allows for proper inclusion of context and synchronization of the heterogeneous information sources involved in multimedia pattern recognition. For automatic classification of highlights in soccer video, we compare three different machine learning techniques, namely the C4.5 decision tree, Maximum Entropy, and the Support Vector Machine. It was found that by using the TIME framework the amount of video a user has to watch in order to see almost all highlights can be reduced considerably, especially in combination with a Support Vector Machine.

Cees G.M. Snoek, Marcel Worring
Quantitative Assessment of the Responsibility for the Disease Load in a Population

The concept of attributable risk (AR), introduced more than 50 years ago, quantifies the proportion of cases diseased due to a certain exposure (risk) factor. While valid approaches to the estimation of crude or adjusted AR exist, a problem remains concerning the attribution of AR to each of a set of several exposure factors. Inspired by mathematical game theory, namely the axioms of fairness and the Shapley value, introduced by Shapley in 1953, the concept of partial AR has been developed. The partial AR offers a unique solution for allocating shares of AR to a number of exposure factors of interest, as illustrated by data from the German Göttingen Risk, Incidence, and Prevalence Study (G.R.I.P.S.).
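
For orientation, the Shapley-value construction behind the partial AR can be written down directly. As a hedged sketch, assume the coalition value of a set S of exposure factors is its joint attributable risk AR(S), with AR of the empty set equal to zero; the paper's exact estimators are not reproduced here. For the set E of all exposure factors under study,

    AR_{partial}(i) = \sum_{S \subseteq E \setminus \{i\}} \frac{|S|!\,(|E|-|S|-1)!}{|E|!}\,\bigl[ AR(S \cup \{i\}) - AR(S) \bigr],

so that, by the efficiency axiom of the Shapley value, the partial risks of all factors sum to the total AR(E).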

Wolfgang Uter, Olaf Gefeller

Classification and Data Analysis

Bootstrapping Latent Class Models

This paper deals with improved measures of statistical accuracy for parameter estimates of latent class models. It introduces more precise confidence intervals for the parameters of this model, based on the parametric and nonparametric bootstrap. Moreover, the label-switching problem is discussed and a solution for handling it is introduced. The results are illustrated using a well-known dataset.

José G. Dias
Dimensionality of Random Subspaces

Significant improvement of classification accuracy can be obtained by aggregation of multiple models. Proposed methods in this field are mostly based on sampling cases from the training set, or on changing weights for cases. Reduction of classification error can also be achieved by random selection of variables for the training subsamples or directly for the model. In this paper we propose a method of feature selection for ensembles that significantly reduces the dimensionality of the subspaces.

Eugeniusz Gatnar
Two-stage Classification with Automatic Feature Selection for an Industrial Application

We address a current problem in industrial quality control, the detection of defects in a laser welding process. The process is observed by means of a high-speed camera, and the task is complicated by the fact that very high sensitivity is required in spite of a highly dynamic / noisy background and that large amounts of data need to be processed online. In a first stage, individual images are rated and these results are then aggregated in a second stage to come to an overall decision concerning the entire sequence. Classification of individual images is by means of a polynomial classifier, and both its parameters and the optimal subset of features extracted from the images are optimized jointly in the framework of a wrapper optimization. The search for an optimal subset of features is performed using a range of different sequential and parallel search strategies including genetic algorithms.

Sören Hader, Fred A. Hamprecht
Bagging, Boosting and Ordinal Classification

Since the introduction of bagging and boosting many new techniques have been developed within the field of classification via aggregation methods. Most of them have in common that the class indicator is treated as a nominal response without any structure. Since in many practical situations the class must be considered as an ordered categorical variable, it seems worthwhile to take this additional information into account. We propose several variants of bagging and boosting that make use of the ordinal structure, and we show how the predictive power might be improved. Comparisons are based not only on misclassification rates but also on general distance measures, which reflect the difference between true and predicted class.
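
To make the evaluation idea concrete, the sketch below contrasts the plain misclassification rate with a simple ordinal loss (mean absolute difference of ordered class codes); both the loss and the toy labels are illustrative assumptions rather than the specific measures used in the paper.

    import numpy as np

    def misclassification_rate(y_true, y_pred):
        return np.mean(np.asarray(y_true) != np.asarray(y_pred))

    def mean_ordinal_distance(y_true, y_pred):
        """Average absolute difference between true and predicted ordered class codes."""
        return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

    y_true = np.array([1, 2, 3, 3, 4])   # ordered class labels
    y_pred = np.array([1, 3, 3, 1, 4])
    print(misclassification_rate(y_true, y_pred))  # 0.4: both errors count the same
    print(mean_ordinal_distance(y_true, y_pred))   # 0.6: the 3 -> 1 error weighs more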

Klaus Hechenbichler, Gerhard Tutz
A Method for Visual Cluster Validation

Cluster validation is necessary because the clusters resulting from cluster analysis algorithms are not in general meaningful patterns. I propose a methodology to explore two aspects of a cluster found by any cluster analysis method: the cluster should be separated from the rest of the data, and the points of the cluster should not split up into further separated subclasses. Both aspects can be visually assessed by linear projections of the data onto the two-dimensional Euclidean space. Optimal separation of the cluster in such a projection can be attained by asymmetric weighted coordinates (Hennig (2002)). Heterogeneity can be explored by the use of projection pursuit indexes as defined in Cook, Buja and Cabrera (1993). The projection methods can be combined with splitting up the data set into clustering data and validation data. A data example is given.

Christian Hennig
Empirical Comparison of Boosting Algorithms

Boosting algorithms combine moderately accurate classifiers in order to produce highly accurate ones. The most important boosting algorithms are Adaboost and Arc-x(j). While belonging to the same family of algorithms, they differ in the way they combine classifiers: Adaboost uses a weighted majority vote while Arc-x(j) combines them through a simple majority vote. Breiman (1998) obtains the best results for Arc-x(j) with j = 4, but higher values were not tested. Two other values for j, j = 8 and j = 12, are tested and compared to the previous one and to Adaboost. Based on several real binary databases, empirical comparison shows that Arc-x4 outperforms all other algorithms.
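
As a reminder of the resampling rule being compared, the following is Breiman's (1998) Arc-x4 update written for a general exponent j; reading it as the paper's Arc-x(j) is an assumption based on the summary above. After t classifiers have been trained, case i, misclassified m_i times so far, is drawn for the next training sample with probability

    p_i^{(t+1)} = \frac{1 + m_i^{\,j}}{\sum_{k=1}^{n} \left( 1 + m_k^{\,j} \right)}, \qquad j \in \{4, 8, 12\},

and the final prediction is an unweighted majority vote over all classifiers, in contrast to Adaboost's weighted vote.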

Riadh Khanchel, Mohamed Limam
Iterative Majorization Approach to the Distance-based Discriminant Analysis

This paper proposes a method of finding a discriminative linear transformation that enhances the data's degree of conformance to the compactness hypothesis and its inverse. The problem formulation relies on inter-observation distances only, which is shown to improve non-parametric and non-linear classifier performance on benchmark and real-world data sets. The proposed approach is suitable for both binary and multiple-category classification problems, and can be applied as a dimensionality reduction technique. In the latter case, the number of necessary discriminative dimensions can be determined exactly. The sought transformation is found as a solution to an optimization problem using iterative majorization.

Serhiy Kosinov, Stéphane Marchand-Maillet, Thierry Pun
An Extension of the CHAID Tree-based Segmentation Algorithm to Multiple Dependent Variables

The CHAID algorithm has proven to be an effective approach for obtaining a quick but meaningful segmentation where segments are defined in terms of demographic or other variables that are predictive of a single categorical criterion (dependent) variable. However, response data may contain ratings or purchase history on several products, or, in discrete choice experiments, preferences among alternatives in each of several choice sets. We propose an efficient hybrid methodology combining features of CHAID and latent class modeling (LCM) to build a classification tree that is predictive of multiple criteria. The resulting method provides an alternative to the standard method of profiling latent classes in LCM through the inclusion of (active) covariates.

Jay Magidson, Jeroen K. Vermunt
Expectation of Random Sets and the ‘Mean Values’ of Interval Data

Several possibilities of defining the expectation of random p-dimensional intervals are proposed. After defining the expectation via reducing intervals to their extremal points, p-dimensional intervals (rectangles) are treated as Random Closed Sets (RCSs). In this framework Random Closed Rectangles (RCRs) are defined, and the properties of different definitions for expectations of RCSs, applied to RCRs, are studied. In addition, known mean values of interval data are integrated into this generalized approach.

Ole Nordhoff
Experimental Design for Variable Selection in Data Bases

This paper analyses the influence of 13 stylized facts of the German economy on the West German business cycles from 1955 to 1994. The method used in this investigation is Statistical Experimental Design with orthogonal factors. We are looking for all existing Plackett-Burman designs realizable by coded observations of these data. The plans are then analysed by regression with forward selection and various classification methods to extract the relevant variables for separating upswing and downswing of the cycles. The results are compared with already existing studies on this topic.

Constanze Pumplün, Claus Weihs, Andrea Preusser
KMC/EDAM: A New Approach for the Visualization of K-Means Clustering Results

In this work we introduce a method for classification and visualization. In contrast to simultaneous methods such as the Kohonen SOM, this new approach, called KMC/EDAM, runs through two stages. In the first stage the data is clustered by classical methods like K-means clustering. In the second stage the centroids of the obtained clusters are visualized in a fixed target space which is directly comparable to that of SOM.

Nils Raabe, Karsten Luebke, Claus Weihs
Clustering of Variables with Missing Data: Application to Preference Studies

Clustering of variables around latent components is a means of organizing multivariate data into meaningful subgroups. We extend the approach to situations with missing data. A straightforward method is to replace the missing values by some estimates and cluster the completed data set. This basic imputation method is improved by more sophisticated procedures which update the imputations within each group after an initial clustering of the variables. We compare the performance of the different imputation methods with the help of a simulation study.

Karin Sahmer, Evelyne Vigneau, Mostafa El Qannari, Joachim Kunert
Binary On-line Classification Based on Temporally Integrated Information

We present a method for on-line classification of triggered but temporally blurred events that are embedded in noisy time series. This means that the time point at which an event is initiated or a dynamical system is perturbed is known, e.g., the moment an injection of a therapeutic agent is given to a patient. From the ongoing monitoring of the system one has to derive a classification of the event or the induced change of the state of the system, e.g., whether the state of health improves or degrades. For simplification we assume that the reactions form two classes of interest. In particular the goal of the binary classification problem is to obtain the decision on-line, as fast and as reliable as possible.

To provide a probabilistic decision at every time-point t, the presented method gathers information across time by incorporating decisions from prior time-points using an appropriate weighting scheme. For this specific weighting we utilize the Bayes error to gain insight into the discriminative power between the instantaneous class distributions.

The effectiveness of this procedure is verified by its successful application in the context of a Brain Computer Interface, especially to the binary discrimination task of left against right imaginary hand-movements from ongoing raw EEG data.

Christin Schäfer, Steven Lemm, Gabriel Curio
Different Subspace Classification

We introduce the idea of Characteristic Regions to solve a classification problem. By identifying regions in which classes are dense (i.e. contain many observations) and also relevant (for discrimination) we can characterize the different classes. These Characteristic Regions are used to generate a classification rule. The result can be visualized so that the user is provided with insight into the data for easy interpretation.

Gero Szepannek, Karsten Luebke
Density Estimation and Visualization for Data Containing Clusters of Unknown Structure

A method for measuring the density of data sets that contain an unknown number of clusters of unknown sizes is proposed. This method, called Pareto Density Estimation (PDE), uses hyperspheres to estimate data density. The radius of the hyperspheres is derived from information-optimal sets. PDE leads to a tool for the visualization of probability density distributions of variables (PDEplot). For Gaussian mixture data this is an optimal empirical density estimation. A new kind of visualization of the density structure of high-dimensional data sets, the P-Matrix, is defined. The P-Matrix for a 79-dimensional data set from DNA array analysis is shown. The P-Matrix reveals local concentrations of data points representing similar gene expressions. The P-Matrix is also a very effective tool in the detection of clusters and outliers in data sets.
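
The hypersphere count underlying such a density estimate can be sketched as follows; deriving the Pareto radius from information-optimal sets is the paper's contribution, so the radius is passed in by hand here as a placeholder assumption.

    import numpy as np
    from scipy.spatial.distance import cdist

    def hypersphere_density(data, radius):
        """For each point, count the neighbours inside a hypersphere of given radius.

        data: (n, p) array; radius: stands in for the Pareto radius of PDE,
        which the paper derives from information-optimal sets.
        """
        dists = cdist(data, data)
        counts = (dists <= radius).sum(axis=1) - 1  # exclude the point itself
        return counts / len(data)

    # usage sketch:
    # density = hypersphere_density(np.random.randn(500, 3), radius=0.5)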

Alfred Ultsch
Hierarchical Mixture Models for Nested Data Structures

A hierarchical extension of the finite mixture model is presented that can be used for the analysis of nested data structures. The model permits a simultaneous model-based clustering of lower- and higher-level units. Lower-level observations within higher-level units are assumed to be mutually independent given cluster membership of the higher-level units. The proposed model can be seen as a finite mixture model in which the prior class membership probabilities are assumed to be random, which makes it very similar to the grade-of-membership (GoM) model. The new model is illustrated with an example from organizational psychology.

Jeroen K. Vermunt, Jay Magidson
Iterative Proportional Scaling Based on a Robust Start Estimator

Model selection procedures in graphical modeling are essentially based on the estimation of covariance matrices under conditional independence restrictions. Such model selection procedures can react heavily to the presence of outlying observations. One reason for this might be that the covariance estimation is influenced by outliers. Hence, a robust procedure to estimate a covariance matrix under conditional independence restrictions is needed. As a first step to robustify the model building process in graphical modeling we propose to use a modified iterative proportional scaling algorithm, starting with a robust covariance estimator.

Claudia Becker
Exploring Multivariate Data Structures with Local Principal Curves

A new approach to find the underlying structure of a multidimensional data cloud is proposed, which is based on a localized version of principal components analysis. More specifically, we calculate a series of local centers of mass and move through the data in directions given by the first local principal axis. One obtains a smooth “local principal curve” passing through the “middle” of a multivariate data cloud. The concept adapts to branched curves by considering the second local principal axis. Since the algorithm is based on a simple eigendecomposition, computation is fast and easy.

Jochen Einbeck, Gerhard Tutz, Ludger Evers
A Three-way Multidimensional Scaling Approach to the Analysis of Judgments About Persons

Judgments about persons may depend on (1) how coherently person attributes are linked within the stimulus person and (2) how strongly the given person information activates a social stereotype. These factors may determine the number of judgment dimensions, their salience and their relatedness. A three-way multidimensional scaling model is presented that measures these parameters and their change across stimulus persons or judgment conditions. The proposed approach involves a formal modelling of information integration in the judgment process. An application to experimental data shows the validity of the model.

Sabine Krolak-Schwerdt
Discovering Temporal Knowledge in Multivariate Time Series

An overview of the Time Series Knowledge Mining framework to discover knowledge in multivariate time series is given. A hierarchy of temporal patterns, which are not a priori given, is discovered. The patterns are based on the rule language Unification-based Temporal Grammar. A semiotic hierarchy of temporal concepts is built in a bottom-up manner from multivariate time instants. We describe the mining problem for each rule discovery step. Several of the steps can be performed with well known data mining algorithms. We present novel algorithms that perform two steps not covered by existing methods. First results on a dataset describing muscle activity during sports are presented.

Fabian Mörchen, Alfred Ultsch
A New Framework for Multidimensional Data Analysis

Our common sense tells us that continuous data contain more information than categorized data. To prove it, however, is not that straightforward, because most continuous variables are typically subjected to linear analysis, and categorized data to nonlinear analysis. This discrepancy prompts us to put both data types on a comparable basis, which leads to a number of problems, in particular, how to define information and how to capture both linear and nonlinear relations between variables, both continuous and categorical. This paper proposes a general framework for both types of data so that we may look at the original statement on information.

Shizuhiko Nishisato
External Analysis of Two-mode Three-way Asymmetric Multidimensional Scaling

An external analysis of two-mode three-way (object×object×source) asymmetric multidimensional scaling is introduced, which is similar to the external analysis of INDSCAL. The present external analysis discloses the asymmetry of each object, and source differences in symmetric and in asymmetric proximity relationships among objects respectively for an externally given configuration of objects. The present external asymmetric multidimensional scaling is applied to the university enrollment flow among Japanese prefectures.

Akinori Okada, Tadashi Imaizumi
The Relevance Vector Machine Under Covariate Measurement Error

This paper presents the application of two correction methods for covariate measurement error to nonparametric regression. We focus on a recent and, due to its sparsity properties, very promising smoothing approach coming from the area of machine learning, the Relevance Vector Machine (RVM), developed by Tipping (2000). Two correction methods for measurement error are then applied to the RVM: regression calibration (Carroll et al. (1995)) and the SIMEX method (Carroll et al. (1995)). We show why standard regression calibration fails and present a simulation study that indicates an improvement of the RVM regression in terms of bias when SIMEX correction is applied.

David Rummel

Applications

A Contribution to the History of Seriation in Archaeology

The honour of being the first to publish a seriation of archaeological finds by formal methods is attributed by David Kendall (1964) to Sir W. M. Flinders Petrie (1899). According to Harold Driver (1965), an American anthropologist, the earliest numerical seriation studies are those of Kidder (1915), Kroeber (1916), and Spier (1917). It seems, however, that a general acceptance of formal seriation methods did not begin until the pioneering publications of Ford and Willey (1949), G. W. Brainerd (1951), and W. S. Robinson (1951). Hole and Shaw published an algorithm for permutation search (1967), while the methods of Elisséeff (1965) and Goldmann (1968) finally led to correspondence analysis.

Peter Ihm
Model-based Cluster Analysis of Roman Bricks and Tiles from Worms and Rheinzabern

Chemical analysis of ancient ceramics has been used frequently to support archaeological interpretation. Often the dimensionality in the data has been high. Therefore multivariate statistical techniques like cluster analysis have been applied. Successful applications of simple model-based Gaussian clustering of Roman bricks and tiles have been reported by Mucha et al. (2001). Now more complex Gaussian models can be investigated because of an increase in sample size due to new findings excavated in Boppard. Additionally, these and previous successful simple models will be applied in a very local fashion considering two supposed brickyards only. Here, after giving a brief history of clustering Roman bricks and tiles, some cluster analysis models including different data transformations will be investigated in order to answer questions like: Is it possible to differentiate between the brickyards of Rheinzabern and Worms on the basis of chemical analysis? Do the bricks and tiles found in Boppard belong to the brickyards of Worms or Rheinzabern?

Hans-Joachim Mucha, Hans-Georg Bartel, Jens Dolata
Astronomical Object Classification and Parameter Estimation with the Gaia Galactic Survey Satellite

Gaia is a cornerstone mission of the European Space Agency (ESA) which will undertake a detailed survey of over 10^9 stars in our Galaxy. This will generate an extensive, multivariate, heterogeneous data set which presents numerous problems in classification, regression and time series analysis. I give a brief overview of the characteristics and requirements of this project and the challenges it provides.

Coryn A.L. Bailer-Jones
Design of Astronomical Filter Systems for Stellar Classification Using Evolutionary Algorithms

I present a novel method for designing filter systems for astrophysical surveys. The filter system is designed to optimally sample a stellar spectrum such that its astrophysical parameters (APs: temperature, chemical composition etc.) can be determined using supervised regression methods. The problem is addressed by casting it as an optimization problem: A figure-of-merit (FoM) is constructed which measures the ability of the filter system to ‘separate’, in a vectorial sense, stars with different APs; this FoM is then maximized with respect to the parameters of the filter system using an evolutionary algorithm. The resulting filter systems are found to be competitive in performance with conventionally designed systems.

Coryn A.L. Bailer-Jones
Analyzing Microarray Data with the Generative Topographic Mapping Approach

The Generative Topographic Mapping (GTM) approach of Bishop et al. (1998) is proposed as an alternative to the Self-Organizing Map (SOM) approach of Kohonen (1998) for the analysis of gene expression data from microarrays. It is applied, by way of example, to a microarray data set from renal tissue, and the results are compared with those derived by SOM. Furthermore, enhancements for the application of the GTM methodology to microarray data are proposed.

Isabelle M. Grimmenstein, Karsten Quast, Wolfgang Urfer
Test for a Change Point in Bernoulli Trials with Dependence

In Krauth (2003, 2004) we considered modified maximum likelihood estimates for the location of change points in Bernoulli sequences with first-order Markov dependence. Here, we address the more difficult problem of deriving a finite conditional conservative test for the existence of a change point in this situation. Our approach is based on the property of intercalary independence of Markov processes (Dufour and Torrès (2000)) and on the CUSUM statistic considered in Krauth (1999, 2000) in the case of independent binomial trials.

Joachim Krauth
Data Mining in Protein Binding Cavities

The molecular function of a protein is coupled to the binding of a substrate or an endogenous ligand to a well defined binding cavity. To detect functional relationships among proteins, their binding-site exposed physicochemical characteristics were described by assigning generic pseudocenters to the functional groups of the amino acids flanking a particular active site. These pseudocenters were assembled into small substructures and their spatial similarity with appropriate chemical properties was examined. If two substructures of two binding cavities are found to be similar, they form the basis for an expanded comparison of the complete cavities. Preliminary tests indicate the benefit of this method and motivate further studies.

Katrin Kupas, Alfred Ultsch
Classification of In Vivo Magnetic Resonance Spectra

We present the results of a systematic and quantitative comparison of methods from pattern recognition for the analysis of clinical magnetic resonance spectra. The medical question being addressed is the detection of brain tumor. In this application we find regularized linear methods to be superior to more flexible methods such as support vector machines, neural networks or random forests. The best preprocessing method for our spectral data is a smoothing and subsampling approach.

Björn H. Menze, Michael Wormit, Peter Bachert, Matthias Lichy, Heinz-Peter Schlemmer, Fred A. Hamprecht
Modifying Microarray Analysis Methods for Categorical Data — SAM and PAM for SNPs

Common and important tasks arising in microarray experiments are the identification of differentially expressed genes and the classification of biological samples. The SAM (Significance Analysis of Microarrays) procedure is a widely used method for dealing with the multiple testing problem concerned with the former task, whereas the PAM (Prediction Analysis of Microarrays) procedure is a method that can cope with the problems associated with the latter task.

In this presentation, we show how these two procedures developed for analyzing continuous gene expression data can be modified for the analysis of categorical SNP (Single Nucleotide Polymorphism) data.

Holger Schwender
Improving the Identification of Differentially Expressed Genes in cDNA Microarray Experiments

The identification of differentially expressed genes in DNA microarray experiments has led to promising results in DNA array analysis. The identification, as well as many other methods in cDNA array analysis, relies on correct calculations of differential colour intensity. It is shown that the calculation of logarithms of the ratio of the two colour intensities (LogRatio) has several disadvantages. The effects of numerical instabilities and rounding errors are demonstrated on published data. As an alternative to LogRatio calculation, relative differences (RelDiff) are proposed. The stability of RelDiffs against numerical and rounding errors is demonstrated to be much better than that of LogRatios. RelDiff values are linearly proportional to LogRatios in the range where genes are not differentially expressed. Relative differences map differential expression to a finite range. For most subsequent analysis this is a big advantage, in particular for the search for expression patterns. It has been reported that the variance of intensity measurements is a nonlinear function of intensity. This effect can be explained by an additive measurement error with constant variance. Applying the logarithm to such intensity measurements introduces the presumed nonlinear dependence. Thus in many cases no complicated variance stabilization transformation using nonlinear functions of the LogRatio expression values is necessary.
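
A sketch of the two summaries being contrasted, using one common definition of the relative difference (the exact scaling is an assumption and may differ from the paper's by a constant factor); for red and green intensities R and G:

    import numpy as np

    def log_ratio(R, G):
        return np.log2(R / G)            # unbounded, unstable for small intensities

    def rel_diff(R, G):
        return 2.0 * (R - G) / (R + G)   # bounded in [-2, 2]

    R = np.array([1000.0, 1.0, 500.0])
    G = np.array([500.0, 1000.0, 510.0])
    print(log_ratio(R, G))   # approx [ 1.00, -9.97, -0.03]
    print(rel_diff(R, G))    # approx [ 0.67, -2.00, -0.02]: near-proportional for R close to G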

Alfred Ultsch
PhyNav: A Novel Approach to Reconstruct Large Phylogenies

A novel method, PhyNav, is introduced to reconstruct the evolutionary relationship among contemporary species based on their genetic data. The key idea is the definition of the so-called minimal κ-distance subset which contains most of the relevant phylogenetic information from the whole dataset. For this reduced subset the subtree is created faster and serves as a scaffold to construct the full tree. Because many minimal subsets exist, the procedure is repeated several times and the best tree with respect to some optimality criterion is considered as the inferred phylogenetic tree. PhyNav gives encouraging results compared to other programs on both simulated and real datasets.

A program to reconstruct phylogenetic trees based on DNA or amino acid sequences is available (http://www.bi.uni-duesseldorf.de/software/phynav/).

Sy Le Vinh, Heiko A. Schmidt, Arndt von Haeseler
NewsRec, a Personal Recommendation System for News Websites

How individuals select and read news depends on the underlying reproduction medium. Today, the use of news websites is increasing. Online readers usually have to click on abstracts or headlines in order to see full articles. This kind of selection of information is less pleasant than in traditional newspapers, where glancing over the whole layout of double pages is possible. Personalization is a possible solution for this article selection problem. So far, most types of personalization are controlled by website owners. In our work, we discuss design aspects and empirical results of our personal recommendation system for news websites, which uses text classification techniques.

Christian Bomhardt, Wolfgang Gaul
Clustering of Large Document Sets with Restricted Random Walks on Usage Histories

Due to their time complexity, conventional clustering methods often cannot cope with large data sets like bibliographic data in a scientific library. We will present a method for clustering library documents according to usage histories that is based on the exploration of object sets using restricted random walks.

We will show that, given the particularities of the data, the time complexity of the algorithm is linear. For our application, the algorithm has proven to work well with more than one million objects, from the point of view of efficiency as well as with respect to cluster quality.

Markus Franke, Anke Thede
Fuzzy Two-mode Clustering vs. Collaborative Filtering

When users rate interesting objects one often gets two-mode data with missing values as a result. In the area of recommender systems, (automated) collaborative filtering has been used to analyze this kind of two-mode data. Like collaborative filtering, (fuzzy) two-mode clustering can be applied to handle so far unknown ratings of users concerning objects of interest. The aim of this paper is to suggest a new algorithm for (fuzzy) two-mode clustering and compare it to collaborative filtering.

Volker Schlecht, Wolfgang Gaul
Web Mining and Online Visibility

In order to attract web visitors via the internet, online activities have to be “visible” in the net. Thus, visibility measurement of web sites and strategies for optimizing Online Visibility are important. Here, web mining helps to define benchmarks with respect to competition and makes it possible to calculate visibility indices as predictors for site traffic.

We use information like keyword density, incoming links, and ranking positions in search engines to measure Online Visibility. We also mention physical and psychological drivers of Online Visibility and describe the appropriateness of different concepts for measurement issues.

Nadine Schmidt-Mänz, Wolfgang Gaul
Analysis of Recommender System Usage by Multidimensional Scaling

Recommender systems offer valuable information not only for web site visitors (who are supported during site navigation and/or buying process) but also for online shop owners (who can learn from the behavior of their web site visitors). We use data from large German online stores gathered between March and November 2003 to visualize search queries by customers together with products viewed most frequently or purchased most frequently. Comparisons of these visualizations lead to a better understanding of searching, viewing, and buying behavior of online shoppers and give important hints how to further improve the generation of recommendations.

Patrick Thoma, Wolfgang Gaul
On a Combination of Convex Risk Minimization Methods

A combination of methods from modern statistical machine learning theory based on convex risk minimization is proposed. An interesting pair for such a combination is kernel logistic regression to estimate conditional probabilities and ε-support vector regression to estimate conditional expectations. A strategy based on this combination can be helpful to detect and to model high-dimensional dependency structures in complex data sets, e.g. for constructing insurance tariffs.

Andreas Christmann
Credit Scoring Using Global and Local Statistical Models

This paper compares global and local statistical models that are used for the analysis of a complex data set of credit risks. The global model for discriminating clients with good or bad credit status depending on various customer attributes is based on logistic regression. In the local model, unsupervised learning algorithms are used to identify clusters of customers with homogeneous behavior. Afterwards, a model for credit scoring can be applied separately in the identified clusters. Both methods are evaluated with respect to practical constraints and asymmetric cost functions. It can be shown that local models are of higher discriminatory power, which leads to more transparent and convincing decision rules for credit assessment.
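
A minimal sketch of the local-model idea, assuming k-means for the unsupervised step and logistic regression within each identified cluster (scikit-learn calls; the paper's actual features, clustering method and asymmetric cost treatment are not reproduced here):

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.linear_model import LogisticRegression

    def fit_local_models(X, y, n_clusters=3, random_state=0):
        """Cluster customers, then fit one scoring model per cluster."""
        km = KMeans(n_clusters=n_clusters, random_state=random_state, n_init=10).fit(X)
        models = {c: LogisticRegression(max_iter=1000).fit(X[km.labels_ == c],
                                                           y[km.labels_ == c])
                  for c in range(n_clusters)}
        return km, models

    def predict_local(km, models, X_new):
        """Score each new customer with the model of its nearest cluster."""
        clusters = km.predict(X_new)
        return np.array([models[c].predict_proba(x.reshape(1, -1))[0, 1]
                         for c, x in zip(clusters, X_new)])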

Alexandra Schwarz, Gerhard Arminger
Informative Patterns for Credit Scoring: Support Vector Machines Preselect Data Subsets for Linear Discriminant Analysis

Pertinent statistical methods for credit scoring can be very simple, e.g. linear discriminant analysis (LDA), or more sophisticated, e.g. support vector machines (SVM). There is mounting evidence of the consistent superiority of SVM over LDA or related methods on real world credit scoring problems. Methods like LDA are preferred by practitioners owing to the simplicity of the resulting decision function and owing to the ease of interpreting single input variables. Can one productively combine SVM and simpler methods? To this end, we use SVM as the preselection method. This subset preselection results in a final classification performance consistently above that of the simple methods used on the entire data.

Ralf Stecking, Klaus B. Schebesch
Application of Support Vector Machines in a Life Assurance Environment

Since its introduction in Boser et al. (1992), the support vector machine has become a popular tool in a variety of classification and regression applications. In this paper we compare support vector machines and several more traditional statistical classification techniques when these techniques are applied to data from a life assurance environment. A measure proposed by Louw and Steel (2004) for ranking the input variables in a kernel method application is also applied to the data. We find that support vector machines are superior in terms of generalisation error to the traditional techniques, and that the information provided by the proposed measure of input variable importance can be utilised for reducing the number of input variables.

Sarel J. Steel, Gertrud K. Hechter
Continuous Market Risk Budgeting in Financial Institutions

In this contribution we develop a profit & loss-dependent, continuous market risk budgeting approach for financial institutions. Based on standard modelling of financial market stochastics we provide a method of risk limit adjustment adopting the idea of synthetic portfolio insurance. By varying the strike price of an implicit synthetic put option we are able to keep within limits while accepting a certain default probability.

Mario Straßberger
Smooth Correlation Estimation with Application to Portfolio Credit Risk

When estimating high-dimensional PD correlation matrices from short time series, the estimation error hinders the detection of a signal. We smooth the empirical correlation matrix by reducing the dimension of the parameter space from quadratic to linear order with respect to the dimension of the underlying random vector. Using the method by Plerou et al. (2002) we present evidence for a one-factor model. Using the noise-reduced correlation matrix leads to increased security of the economic capital estimate as estimated using the credit risk portfolio model CreditRisk+.

Rafael Weißbach, Bernd Rosenow
How Many Lexical-semantic Relations are Necessary?

In lexical semantics several meta-linguistic relations are used to model lexical structure. Their number and motivation vary from researcher to researcher. This article tries to show that one relation suffices to model the concept structure of the lexicon making use of intensional logic.

Dariusch Bagheri
Automated Detection of Morphemes Using Distributional Measurements

To simply take the distribution of linguistic elements as a basis for analysis was the methodological principle of the researchers of so-called “American Structuralism”. This paper deals with the detection of morphemes from a large corpus of German by simply applying a distributional procedure of counting the number of potential successors of a given sequence of letters of a word, a method reminiscent of proposals by Harris, Shannon and others. Morphemes can be heuristically read off from an increase in the potential successor count. Three different methods of identifying morpheme breaks are discussed and a proposal for improvement of the method by transforming graphemic to partial phonemic representation is put forward.

Christoph Benden
Classification of Author and/or Genre? The Impact of Word Length

190 Russian texts — letters and poems by three different authors — are analyzed as to their word length. The basic question concerns the quantitative classification of these texts as to authorship or as to text type. By way of multivariate analyses it is shown that word length is a characteristic of genre, rather than of authorship.

Emmerich Kelih, Gordana Antić, Peter Grzybek, Ernst Stadlober
Some Historical Remarks on Library Classification — a Short Introduction to the Science of Library Classification

Classification as a human activity in general becomes a scientific activity in librarianship. There are famous examples in this history of classification, among them the schemes of Conrad Gesner (1548) and the Princeton University Library (1901). At present we find a number of new tasks and obligations in this field.

Bernd Lorenz
Automatic Validation of Hierarchical Cluster Analysis with Application in Dialectometry

Successful applications of hierarchical cluster analysis in the area of quantitative linguistics were reported in the pioneering works by Goebl (1982, 1984, 1994). Often the dimensionality of linguistic data is high. Therefore multivariate statistical techniques like cluster analysis can to some degree support the researcher. However, there is much room left for heuristics. Cluster analysis methods can be generalized by taking weights of observations into account. Using special weights leads to well-known resampling techniques. Here we offer an automatic validation technique for hierarchical cluster analysis that can be considered as a so-called built-in validation of the number of clusters and of each cluster itself, respectively. Furthermore this built-in validation can be used to find the appropriate cluster analysis model. As an illustration of an application in linguistics, the validation of results of hierarchical clustering based on the adjusted Rand measure is presented.

Hans-Joachim Mucha, Edgar Haimerl
Discovering the Senses of an Ambiguous Word by Clustering its Local Contexts

As has been shown recently, it is possible to automatically discover the senses of an ambiguous word by statistically analyzing its contextual behavior in a large text corpus. However, this kind of research is still at an early stage. The results need to be improved and there is considerable disagreement on methodological issues. For example, although most researchers use clustering approaches for word sense induction, it is not clear what statistical features the clustering should be based on. Whereas so far most researchers cluster global co-occurrence vectors that reflect the overall behavior of a word in a corpus, in this paper we argue that it is more appropriate to use local context vectors. We support our view by comparing both approaches and by discussing their strengths and weaknesses.

Reinhard Rapp
Document Management and the Development of Information Spaces

Formal document structures, for example paragraphs and tables, are used to extract information in the course of the automatic recognition of the contents of OpenOffice text documents and HTML documents as part of a document management project. Using a natural language processing chain and a wrapping procedure, formal graphs can be created that structure the document-related information space based on a given information model. A combined text and layout analysis is carried out with open source components, aiming at representing information as a semantic network in a formal and visualizable manner. Scalable ways of retrieving information and processing knowledge are produced by uniting document-related information spaces to form thematic domains.

Ulfert Rist
Stochastic Ranking and the Volatility “Croissant”: A Sensitivity Analysis of Economic Rankings

Rankings of countries are calculated using I indicator variables. Clearly, any ranking based on an index depends on the weights used, and therefore we conduct a sensitivity analysis on the weights of the index to obtain a measure for the volatility of the performance rankings. The weights are simulated from uniform and beta distributions on the simplex. As a result we observe a volatility “croissant”: Countries in the top and the bottom of the ranking are less volatile than in the middle of the ranking. The methodology is shown for the standardized performance ranking (SPR) and the rank performance ranking (RPR).
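
The simulation step can be sketched as follows. Drawing weights uniformly on the simplex corresponds to a symmetric Dirichlet(1, ..., 1) distribution, and using the standard deviation of the simulated ranks as the volatility measure is an illustrative choice, not necessarily the paper's exact statistic.

    import numpy as np

    def rank_volatility(indicators, n_sim=10000, seed=0):
        """indicators: (countries, I) matrix of indicator values, higher = better."""
        rng = np.random.default_rng(seed)
        n, n_ind = indicators.shape
        ranks = np.empty((n_sim, n), dtype=int)
        for s in range(n_sim):
            w = rng.dirichlet(np.ones(n_ind))       # uniform weights on the simplex
            score = indicators @ w
            ranks[s] = (-score).argsort().argsort() + 1   # rank 1 = best country
        return ranks.mean(axis=0), ranks.std(axis=0)      # mean rank and its volatility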

Helmut Berrer, Christian Helmenstein, Wolfgang Polasek
Importance Assessment of Correlated Predictors in Business Cycles Classification

When trying to interpret estimated parameters the researcher is interested in the (relative) importance of the individual predictors. However, if the predictors are highly correlated, the interpretation of coefficients, e.g. as economic “multipliers”, is not applicable in standard regression or classification models. The goal of this paper is to develop a procedure to obtain such measures of importance for classification methods and to apply them to models for the classification of German business cycle phases.

Daniel Enache, Claus Weihs
Economic Freedom in the 25-Member European Union: Insights Using Classification Tools

In 2004, ten additional countries joined the European Union. As a result, the nature of the community and its member countries is predicted to change, including the economic freedom of individuals and organizations. This study uses classification tools to look at the Economic Freedom of the World index (EFI). Patterns of economic freedom are quite different between the current and the acceding EU members. On average, economic freedom in Europe has a good chance of increasing as a result of the expansion.

Clifford W. Sell
Intercultural Consumer Classifications in E-Commerce

Global consumer typologies are an effective instrument for identifying regional consumer clusters and addressing different client needs in a focused fashion. The objective of this study is to examine whether international online users are a homogeneous target group, or whether it is possible to identify segments by means of selected criteria for constructing typologies. To answer this research question, an online survey was conducted with interviewees from the cultural areas of France, Germany and the US; theoretically grounded constructs of purchasing behaviour on the internet were obtained and different cluster analyses were carried out. The results show that internet users can be divided into three clusters: the risk-averse doubters, the open-minded online shoppers and the reserved information seekers.

Hans H. Bauer, Marcus M. Neumann, Frank Huber
Reservation Price Estimation by Adaptive Conjoint Analysis

Though reservation prices are needed for many business decision processes, e.g., pricing new products, it often turns out to be difficult to measure them. Many researchers reuse conjoint analysis data with price as an attribute for this task (e.g., Kohli and Mahajan (1991)). In this setting the information whether a consumer would buy a product at all is not elicited, which makes reservation price estimation impossible. We propose an additional interview scene at the end of the adaptive conjoint analysis (Johnson (1987)) to estimate reservation prices for all product configurations. This is achieved by the use of product stimuli as well as price scales that are adapted to each respondent to reflect individual choice behavior. We present preliminary results from an ongoing large-sample conjoint interview of customers of a major mobile phone retailer in Germany.

Christoph Breidert, Michael Hahsler, Lars Schmidt-Thieme
Estimating Reservation Prices for Product Bundles Based on Paired Comparison Data

Reservation prices have evolved as an important tool for designing and pricing new products or bundles of products, where a reservation price for an item can be interpreted as the maximum amount of money a consumer is willing to pay for that item. In this paper, focusing on product bundles, two types of data collection - an already known one and a new one - based on direct elicitation of reservation prices using paired comparison data are discussed. Variants of conjoint analysis that were proposed so far in this context are used, an explicit evaluation of two methods is described, and an example based on empirical data from a seat system offered by a German car manufacturer demonstrates the applicability of the suggested methodology.

Bernd Stauß, Wolfgang Gaul
Classification of Perceived Musical Intervals

Tests were devised in which subjects were asked to judge the size of musical intervals in a musical context of pairs of successive intervals and chords performed by either harpsichord or violins. The judgements focused on the pitch intonation of one of the notes. Since subjects cannot base their judgements on beats, which are inaudible in this setting, results differ for one and the same interval depending on the musical context. Discrimination tools were applied in order to ascertain the significance of these differences. Furthermore, the fact that there is a region of a certain extent on the frequency continuum for ‘in tune’ intonation, and that there is a region of constant interval perception (the latter can be interpreted as a phenomenon of categorical perception), both contradict current consonance theories based on beats and roughness.

Jobst P. Fricke
In Search of Variables Distinguishing Low and High Achievers in Music Sight Reading Task

The unrehearsed performance of music, called ‘sight reading’ (SR), is a basic skill for all musicians. Despite the merits of expertise theory, there is no comprehensive model which can classify subjects into high and low performance groups. This study is the first to classify subjects in this way; it is based on an extensive experiment measuring the total SR performance of 52 piano students. Classification methods (cluster analysis, classification tree, linear discriminant analysis) were applied. Results of a linear discriminant analysis revealed a 2-class solution with 4 predictors (predictive error: 15%).

Reinhard Kopiez, Claus Weihs, Uwe Ligges, Ji In Lee
Automatic Feature Extraction from Large Time Series

The classification of high dimensional data like time series requires the efficient extraction of meaningful features. The systematization of statistical methods allows automatic approaches to combine these methods and construct a method tree which delivers suitable features. It can be shown that the combination of efficient methods also works efficiently, which is especially necessary for the feature extraction from large value series. The transformation from raw series data to feature vectors is illustrated by different classification tasks in the domain of audio data.

Ingo Mierswa
Identification of Musical Instruments by Means of the Hough-Transformation

In order to distinguish between the sounds of different musical instruments, certain instrument-specific sound features have to be extracted from the time series representing a given recorded sound.

The Hough Transform is a pattern recognition procedure that is usually applied to detect specific curves or shapes in digital pictures (Shapiro (1978)). Due to some similarity between pattern recognition and statistical curve fitting problems, it may as well be applied to sound data (as a special case of time series data).

The transformation is parameterized to detect sinusoidal curve sections in a digitized sound, the motivation being that certain sounds might be identified by certain oscillation patterns. The returned (transformed) data is the timepoints and amplitudes of detected sinusoids, so the result of the transformation is another ‘condensed’ time series.

This specific Hough Transform is then applied to sounds played by different musical instruments. The generated data are investigated for features that are specific to the musical instrument that played the sound. Several classification methods are tried out to distinguish between the instruments, and it turns out that RDA (regularized discriminant analysis, a hybrid method combining LDA and QDA; Friedman (1989)) performs best. The resulting error rate is better than that achieved by humans (Bruderer (2003)).

Christian Röver, Frank Klefenz, Claus Weihs
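
To make the idea of a Hough-style transform for sound concrete, the sketch below accumulates "votes" over a grid of candidate frequencies and phases for one short window of a synthetic tone; the grids, window length, and amplitude recovery are illustrative assumptions rather than the parameterization used in the paper.

# Brute-force accumulator over (frequency, phase) for one sound window
import numpy as np

fs = 8000
t = np.arange(512) / fs
window = 0.7 * np.sin(2 * np.pi * 440 * t + 1.0)             # synthetic test tone

freqs = np.linspace(100, 1000, 181)                           # candidate frequencies (Hz)
phases = np.linspace(0, 2 * np.pi, 32, endpoint=False)        # candidate phases
acc = np.zeros((len(freqs), len(phases)))                     # accumulator array
for i, f in enumerate(freqs):
    for j, p in enumerate(phases):
        template = np.sin(2 * np.pi * f * t + p)
        acc[i, j] = window @ template                         # correlation acts as "votes"

i, j = np.unravel_index(acc.argmax(), acc.shape)
amplitude = 2 * acc[i, j] / len(t)                            # least-squares amplitude estimate
print(f"detected {freqs[i]:.0f} Hz, amplitude {amplitude:.2f}")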
Support Vector Machines for Bass and Snare Drum Recognition

In this paper we attempt to extract information concerning percussive instruments from a musical audio signal. High-dimensional vectors of descriptors are computed from the signal and classified by means of Support Vector Machines (SVM). We investigate the performance on two important classes of drum sounds in Western popular music, bass and snare drums, possibly overlapping. The results are encouraging: SVM achieve a high accuracy and F1-measure, with linear kernels performing (nearly) as well as Gaussian kernels while requiring 1000 times less computation time.

Dirk Van Steelant, Koen Tanghe, Sven Degroeve, Bernard De Baets, Marc Leman, Jean-Pierre Martens
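
A minimal sketch of the comparison described above, assuming scikit-learn and simulated descriptor vectors in place of the real drum-sound descriptors: a linear and a Gaussian (RBF) kernel SVM are scored by cross-validated F1-measure.

# Linear vs. Gaussian kernel SVM on simulated high-dimensional descriptors
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=100, n_informative=15, random_state=0)

for kernel in ("linear", "rbf"):
    f1 = cross_val_score(SVC(kernel=kernel), X, y, cv=5, scoring="f1").mean()
    print(f"{kernel:6s} kernel: mean F1 = {f1:.3f}")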
Register Classification by Timbre

The aim of this analysis is to demonstrate that the high and the low musical registers (Soprano, Alto vs. Tenor, Bass) can be identified by timbre, i.e. after pitch information has been eliminated from the spectrum. This is achieved by means of pitch-free characteristics of the spectral densities of voices and instruments, namely the masses and widths of the peaks of the first 13 partials (cf. Weihs and Ligges (2003b)).

Different analyses based on the tones of the classical song “Tochter Zion” composed by G.F. Händel are presented. The results are very promising. For example, if the characteristics are averaged over all tones, then female and male singers can be distinguished without any error (prediction error of 0%). Moreover, stepwise linear discriminant analysis can be used to separate even the females together with 28 high instruments (“playing” the Alto version of the song) from the males together with 20 low instruments (playing the Bass version) with a prediction error of 4%. Individual tones are also analysed, and the statistical results are discussed and interpreted from an acoustical point of view.

Claus Weihs, Christoph Reuter, Uwe Ligges
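
The sketch below illustrates one plausible way to compute pitch-free characteristics of the kind mentioned above, namely the relative masses of bands around the first 13 partials of a synthetic tone; the band widths, the synthetic tone, and the normalization are assumptions, not the features of Weihs and Ligges (2003b).

# Relative spectral masses around the first 13 partials (pitch-free by normalization)
import numpy as np

def partial_masses(sound, fs, f0, n_partials=13, rel_width=0.03):
    spec = np.abs(np.fft.rfft(sound))
    freqs = np.fft.rfftfreq(len(sound), 1 / fs)
    masses = []
    for k in range(1, n_partials + 1):
        band = np.abs(freqs - k * f0) <= rel_width * k * f0   # band around the k-th partial
        masses.append(spec[band].sum())
    masses = np.array(masses)
    return masses / masses.sum()                              # relative masses

fs, f0 = 44100, 220.0
t = np.arange(int(0.5 * fs)) / fs
tone = sum(np.sin(2 * np.pi * k * f0 * t) / k for k in range(1, 14))   # synthetic tone
print(np.round(partial_masses(tone, fs, f0), 3))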
Classification of Processes by the Lyapunov Exponent

This paper deals with the problem of discriminating between well-predictable and not-well-predictable time series. One criterion for the separation is given by the size of the Lyapunov exponent, which was originally defined for deterministic systems. However, the Lyapunov exponent can also be analyzed and used for stochastic time series. Experimental results illustrate the classification of time series as well predictable or not well predictable.

Anja M. Busse
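
As a small worked example of the exponent itself (for a deterministic system; the stochastic extension discussed in the paper is not reproduced here), the sketch estimates the largest Lyapunov exponent of the logistic map by averaging the log-derivative along an orbit.

# Lyapunov exponent of the logistic map f(x) = r x (1 - x)
import numpy as np

def logistic_lyapunov(r, n=100_000, x0=0.3, burn_in=1000):
    x, total = x0, 0.0
    for i in range(n + burn_in):
        if i >= burn_in:
            total += np.log(abs(r * (1 - 2 * x)))   # log |f'(x)| along the orbit
        x = r * x * (1 - x)
    return total / n

print(logistic_lyapunov(3.5))   # negative: periodic, well predictable
print(logistic_lyapunov(4.0))   # positive (about log 2): chaotic, not well predictable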
Desirability to Characterize Process Capability

Over the past few years, new process capability indices have continuously been developed, most of them with the aim of adding some feature missing from earlier indices. Thus, for nearly every conceivable situation a special index now exists, which makes choosing a particular index as difficult as interpreting and comparing index values correctly.

In this paper we propose the use of the expected value of a certain type of function, the so-called desirability function, to assess the capability of a process. The resulting index may be used analogously to the classical indices related to C_p, but can be adapted to nearly any process and any specification. It even allows a comparison between different processes regardless of their distribution and may be extended straightforwardly to multivariate scenarios. Furthermore, its properties compare favorably to the properties of the “classical” indices.

Jutta Jessenberger, Claus Weihs
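
The sketch below shows one way such an index could look in practice: a two-sided Derringer/Suich-type desirability function whose expected value is estimated by the sample mean over (simulated) process observations; the specification limits, target, and weights are invented for illustration.

# Expected desirability as a capability measure, estimated from a sample
import numpy as np

def desirability(x, lsl, target, usl, s=1.0, t=1.0):
    d = np.zeros_like(x, dtype=float)                      # 0 outside the specification limits
    left = (x >= lsl) & (x <= target)
    right = (x > target) & (x <= usl)
    d[left] = ((x[left] - lsl) / (target - lsl)) ** s
    d[right] = ((usl - x[right]) / (usl - target)) ** t
    return d

rng = np.random.default_rng(2)
observations = rng.normal(loc=10.2, scale=0.4, size=5000)   # stand-in for process data
index = desirability(observations, lsl=9.0, target=10.0, usl=11.0).mean()
print(f"expected desirability (estimated): {index:.3f}")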
Application and Use of Multivariate Control Charts in a BTA Deep Hole Drilling Process

Deep hole drilling methods are used for producing holes with a high length-to-diameter ratio, good surface finish and straightness. The process is subject to dynamic disturbances usually classified as either chatter vibration or spiralling. In this paper, we focus on the application and use of multivariate control charts to monitor the process in order to detect chatter vibrations. The results show that chatter is detected and that some alarm signals occur at time points which can be connected to physical changes in the process.

Amor Messaoud, Winfried Theis, Claus Weihs, Franz Hering
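
A minimal sketch of one standard multivariate control chart, the Hotelling T² chart, on simulated data; the reference sample, the control limit choice, and the shifted observations are assumptions, and the drilling signals themselves are not used.

# Hotelling T^2 monitoring with a chi-square control limit (illustrative only)
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
p, n_ref = 3, 200
reference = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n_ref)      # phase I data
mean = reference.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(reference, rowvar=False))

def t_squared(x):
    d = x - mean
    return float(d @ cov_inv @ d)

ucl = stats.chi2.ppf(0.999, df=p)                           # approximate upper control limit
new_obs = rng.multivariate_normal(np.full(p, 2.5), np.eye(p), size=5)        # shifted process
for x in new_obs:
    t2 = t_squared(x)
    print(f"T^2 = {t2:6.2f}  {'ALARM' if t2 > ucl else 'ok'}")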
Determination of Relevant Frequencies and Modeling Varying Amplitudes of Harmonic Processes

When a process is dominated by a few important frequencies, the observations of this process can be modelled by a harmonic process (Bloomfield (2000)). If the amplitudes of these dominating frequencies vary over time, their dominance may not be apparent during the whole process.

To discriminate between frequencies relevant for such a process, we determine the distribution of the periodogram ordinates and use this distribution to derive a procedure for assessing the relevance of the frequencies. This procedure uses the standardized median (Gather and Schultze (1999)) to determine the variance of the error process. In a simulation study we show that this procedure is very efficient even under difficult conditions such as a low signal-to-noise ratio or AR(1) disturbances. Furthermore, we show that the transformation necessary to estimate the amplitudes from periodogram ordinates leads to a good normality approximation, which makes it especially easy to model the development of the amplitudes from these estimates.

Winfried Theis, Claus Weihs
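
To illustrate the flavour of such a relevance assessment, the sketch computes a periodogram, estimates the noise level robustly via the median of the ordinates, and flags ordinates above an exponential-tail threshold; the specific threshold rule and constants are assumptions, not the procedure of the paper (which uses the standardized median of Gather and Schultze (1999)).

# Flag periodogram ordinates that exceed a robust, noise-based threshold
import numpy as np

rng = np.random.default_rng(4)
n = 1024
t = np.arange(n)
f1, f2 = 50 / n, 123 / n                                    # two dominating Fourier frequencies
x = 2.0 * np.sin(2 * np.pi * f1 * t) + 1.5 * np.sin(2 * np.pi * f2 * t) + rng.normal(size=n)

freqs = np.fft.rfftfreq(n)[1:]
pgram = np.abs(np.fft.rfft(x))[1:] ** 2 / n                 # periodogram ordinates
noise_var = np.median(pgram) / np.log(2)                    # median-based noise variance estimate
threshold = noise_var * -np.log(0.01 / len(pgram))          # Bonferroni-type exponential limit
print("relevant frequencies:", freqs[pgram > threshold])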

Contest: Social Milieus in Dortmund

Introduction to the Contest “Social Milieus in Dortmund”

The goal and the data of the contest “Social Milieus in Dortmund” are introduced.

Ernst-Otto Sommerer, Claus Weihs
Application of a Genetic Algorithm to Variable Selection in Fuzzy Clustering

In order to group the observations of a data set into a given number of clusters, an ‘optimal’ subset out of a greater number of explanatory variables is to be selected. The problem is approached by maximizing a quality measure under certain restrictions that are supposed to keep the subset most representative of the whole data. The restrictions may be set either manually or generated from the data. A genetic optimization algorithm is developed to solve this problem.

The procedure is then applied to a data set describing features of sub-districts of the city of Dortmund, Germany, to detect different social milieus and investigate the variables making up the differences between these.

Christian Röver, Gero Szepannek
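
A hedged sketch of the general idea, assuming scikit-learn: a tiny genetic algorithm searches binary inclusion masks for a variable subset that maximizes a clustering quality measure. As a stand-in for the paper's fuzzy-clustering criterion and restrictions, the silhouette score of a k-means partition is used, and all data are simulated.

# Genetic algorithm over variable-inclusion masks with a clustering quality fitness
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(5)
X, _ = make_blobs(n_samples=170, centers=3, n_features=8, random_state=5)
X = np.hstack([X, rng.normal(size=(170, 12))])              # append 12 pure noise variables

def fitness(mask):
    if mask.sum() < 2:                                       # restriction: at least 2 variables
        return -1.0
    labels = KMeans(n_clusters=3, n_init=5, random_state=0).fit_predict(X[:, mask])
    return silhouette_score(X[:, mask], labels)

pop = rng.random((20, X.shape[1])) < 0.5                     # initial population of masks
for generation in range(15):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]                  # selection: keep the better half
    pa = parents[rng.integers(0, 10, 10)]                    # uniform crossover of parent pairs
    pb = parents[rng.integers(0, 10, 10)]
    children = np.where(rng.random(pa.shape) < 0.5, pa, pb)
    children = children ^ (rng.random(children.shape) < 0.05)    # bit-flip mutation
    pop = np.vstack([parents, children])

best = max(pop, key=fitness)
print("selected variables:", np.flatnonzero(best))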
Annealed κ-Means Clustering and Decision Trees

This paper describes a contribution to the GfKl 2004 Contest. The contest task is to cluster, classify and interpret the 170 districts of the city of Dortmund with respect to their ‘social milieux’. A data set containing 204 variables measured for every district is given.

We apply annealed κ-means clustering to the preprocessed contest data. Superparamagnetic clustering is used to foster insight into the natural partitions of the data. A stable and interpretable solution is obtained with κ = 3 clusters, dividing Dortmund into three social milieux. A decision tree is deduced from this cluster solution and is used for interpretation and rule generation. The tree offers the possibility to monitor and predict future assessments. To gain information about cluster solutions with κ > 3, a stability analysis based on a resampling approach is performed, resulting in further interesting insights.

Christin Schäfer, Julian Laub
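
A minimal sketch of the cluster-then-explain step (plain k-means rather than the annealed variant, and simulated district data in place of the contest data): the tree fitted to the cluster labels yields readable rules.

# Cluster into kappa = 3 groups, then derive interpretable rules with a decision tree
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

# 170 simulated "districts" with a handful of illustrative variables
X, _ = make_blobs(n_samples=170, centers=3, n_features=6, random_state=7)
labels = KMeans(n_clusters=3, n_init=10, random_state=7).fit_predict(X)

# a shallow decision tree turns the cluster solution into interpretable rules
tree = DecisionTreeClassifier(max_depth=3, random_state=7).fit(X, labels)
print(export_text(tree, feature_names=[f"var_{i}" for i in range(X.shape[1])]))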
Correspondence Clustering of Dortmund City Districts

We combine correspondence analysis (CA) and K-means clustering to divide Dortmund's districts into groups that are associated with particular variables and thus each represent a social cluster. CA visualizes associations between the rows and columns of a frequency matrix and can be used for dimension reduction. Based on the first three dimensions of the CA mapping, we find a stable partition into five clusters. We further identify variables that are highly associated with the cluster centroids and thus represent a cluster's social condition.

Stefanie Scheid
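
A hedged sketch of the combination, with a random stand-in for the frequency matrix: correspondence analysis is computed via an SVD of the standardized residual matrix, and k-means (with K = 5, as in the abstract) is run on the first three row coordinates.

# Correspondence analysis (via SVD) followed by K-means on the row coordinates
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(8)
N = rng.integers(1, 50, size=(170, 20)).astype(float)       # districts x categories counts

P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)                          # row and column masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))           # standardized residuals
U, sv, Vt = np.linalg.svd(S, full_matrices=False)
row_coords = (U * sv) / np.sqrt(r)[:, None]                  # principal row coordinates

labels = KMeans(n_clusters=5, n_init=10, random_state=8).fit_predict(row_coords[:, :3])
print(np.bincount(labels))                                   # cluster sizes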
Backmatter
Metadata
Title
Classification — the Ubiquitous Challenge
Editors
Professor Dr. Claus Weihs
Professor Dr. Wolfgang Gaul
Copyright Year
2005
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-28084-2
Print ISBN
978-3-540-25677-9
DOI
https://doi.org/10.1007/3-540-28084-7
