
2013 | Book

Algorithms from and for Nature and Life

Classification and Data Analysis


About this Book

This volume provides approaches and solutions to challenges occurring at the interface of research fields such as data analysis, data mining and knowledge discovery, computer science, operations research, and statistics. In addition to theory-oriented contributions, various application areas are included. Moreover, traditional classification research directions concerning network data, graphs, and social relationships, as well as statistical musicology, exemplify the fields of current interest tackled by the authors. The book comprises a total of 55 selected papers presented at the Joint Conference of the German Classification Society (GfKl), the German Association for Pattern Recognition (DAGM), and the Symposium of the International Federation of Classification Societies (IFCS) in 2011.

Table of Contents

Frontmatter

Invited

Frontmatter
Size and Power of Multivariate Outlier Detection Rules

Multivariate outliers are usually identified by means of robust distances. A statistically principled method for accurate outlier detection requires both availability of a good approximation to the finite-sample distribution of the robust distances and correction for the multiplicity implied by repeated testing of all the observations for outlyingness. These principles are not always met by the currently available methods. The goal of this paper is thus to provide data analysts with useful information about the practical behaviour of some popular competing techniques. Our conclusion is that the additional information provided by a data-driven level of trimming is an important bonus which ensures an often considerable gain in power.
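As a point of reference for the robust-distance approach discussed above, here is a minimal sketch (scikit-learn and SciPy, simulated data) that flags multivariate outliers by comparing squared robust Mahalanobis distances, based on the Minimum Covariance Determinant estimator, with a naive chi-square cutoff. It deliberately omits the finite-sample calibration and multiplicity correction that the chapter argues are needed for a statistically principled rule.

    import numpy as np
    from scipy.stats import chi2
    from sklearn.covariance import MinCovDet

    rng = np.random.default_rng(0)
    X = rng.multivariate_normal(np.zeros(3), np.eye(3), size=200)
    X[:5] += 6.0                                 # plant a few outliers

    mcd = MinCovDet(random_state=0).fit(X)       # robust location and scatter (MCD)
    d2 = mcd.mahalanobis(X)                      # squared robust distances
    cutoff = chi2.ppf(0.975, df=X.shape[1])      # asymptotic cutoff, no multiplicity correction
    print(np.where(d2 > cutoff)[0])              # indices of flagged observations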

Andrea Cerioli, Marco Riani, Francesca Torti
Clustering and Prediction of Rankings Within a Kemeny Distance Framework

Rankings and partial rankings are ubiquitous in data analysis, yet there is relatively little work in the classification community that uses the typical properties of rankings. We review the broader literature that we are aware of, and identify a common building block for both prediction of rankings and clustering of rankings, which is also valid for partial rankings. This building block is the Kemeny distance, defined as the minimum number of interchanges of two adjacent elements required to transform one (partial) ranking into another. The Kemeny distance is equivalent to Kendall's τ for complete rankings, but for partial rankings it is equivalent to Emond and Mason's extension of τ. For clustering, we use the flexible class of methods proposed by Ben-Israel and Iyigun (Journal of Classification 25:5–26, 2008), and define the disparity between a ranking and the center of a cluster as the Kemeny distance. For prediction, we build a prediction tree by recursive partitioning, and define the impurity measure of the subgroups formed as the sum of all within-node Kemeny distances. The median ranking characterizes subgroups in both cases.
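For intuition, the sketch below counts pairwise discordances between two complete rankings, which coincides with the Kemeny distance described above in the absence of ties; partial rankings and the Emond-Mason extension of τ are not handled in this toy version.

    from itertools import combinations

    def kemeny_distance(r1, r2):
        """Number of item pairs ordered differently by two complete rankings.
        r1, r2: dicts mapping item -> rank position (1 = best), same key set."""
        d = 0
        for a, b in combinations(r1, 2):
            if (r1[a] - r1[b]) * (r2[a] - r2[b]) < 0:   # pair (a, b) ordered differently
                d += 1
        return d

    print(kemeny_distance({"x": 1, "y": 2, "z": 3},
                          {"x": 2, "y": 1, "z": 3}))    # -> 1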

Willem J. Heiser, Antonio D’Ambrosio
Solving the Minimum Sum of L1 Distances Clustering Problem by Hyperbolic Smoothing and Partition into Boundary and Gravitational Regions

The article considers the minimum sum of distances clustering problem, where the distances are measured through the L1 or Manhattan metric (MSDC-L1). The mathematical modelling of this problem leads to a min-sum-min formulation which, in addition to its intrinsic bi-level nature, has the significant characteristic of being strongly non-differentiable.

We propose the AHSC-L1 method to solve this problem, by combining two techniques. The first technique is Hyperbolic Smoothing Clustering (HSC), which adopts a smoothing strategy using a special completely differentiable (C∞) class function. The second technique is the partition of the set of observations into two non-overlapping groups: "data in frontier" and "data in gravitational regions". We propose a classification of the gravitational observations by each component, which simplifies the calculation of the objective function and its gradient. The combination of these two techniques for the MSDC-L1 problem drastically simplifies the computational tasks.

Adilson Elias Xavier, Vinicius Layter Xavier, Sergio B. Villas-Boas

Clustering and Unsupervised Learning

Frontmatter
On the Number of Modes of Finite Mixtures of Elliptical Distributions

We extend the concept of the ridgeline from Ray and Lindsay (Ann Stat 33:2042–2065, 2005) to finite mixtures of general elliptical densities with possibly distinct density generators in each component. This can be used to obtain bounds for the number of modes of two-component mixtures of t distributions in any dimension. In case of proportional dispersion matrices, these have at most three modes, while for equal degrees of freedom and equal dispersion matrices, the number of modes is at most two. We also give numerical illustrations and indicate applications to clustering and hypothesis testing.

Grigory Alexandrovich, Hajo Holzmann, Surajit Ray
Implications of Axiomatic Consensus Properties

Since Arrow’s celebrated impossibility theorem, axiomatic consensus theory has been extensively studied. Here we are interested in implications between axiomatic properties and consensus functions on a profile of hierarchies. Such implications are systematically investigated using Formal Concept Analysis. All possible consensus functions are automatically generated on a set of hierarchies derived from a fixed set of taxa. The list of implications is presented and discussed.

Florent Domenach, Ali Tayari
Comparing Earth Mover’s Distance and its Approximations for Clustering Images

There are many different approaches to measure dissimilarities between images on the basis of color histograms. Some of them operate fast but generate results in contradiction to human perception. Others yield better results, especially the Earth Mover's Distance (EMD) (Rubner et al., Int J Comput Vis, 40(2):99–121, 2000), but its computational complexity prevents its usage in large databases (Ling et al., IEEE Trans Pattern Anal Mach Intell, 29(5):840–853, 2007). This paper presents a new, intuitively intelligible approximation of EMD. The empirical study tries to answer the question whether the good results of EMD justify its long computation time. We tested several distances with images that were perturbed by normally distributed errors and evaluated their results by means of the adjusted Rand index (Hubert et al., J Classif, 2:193–218, 1985).
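As a toy illustration of the trade-off discussed above (hypothetical 1-D histograms, not the chapter's approximation), the sketch below contrasts a cheap bin-wise L1 distance with the 1-D Wasserstein/EMD distance, and shows how clustering agreement could be scored with the adjusted Rand index.

    import numpy as np
    from scipy.stats import wasserstein_distance
    from sklearn.metrics import adjusted_rand_score

    bins = np.arange(8)
    h1 = np.array([4, 3, 2, 1, 0, 0, 0, 0], dtype=float)
    h2 = np.array([0, 0, 0, 0, 1, 2, 3, 4], dtype=float)     # same shape, shifted mass
    h1, h2 = h1 / h1.sum(), h2 / h2.sum()

    l1 = np.abs(h1 - h2).sum()                               # ignores where the mass moved
    emd = wasserstein_distance(bins, bins, u_weights=h1, v_weights=h2)
    print(l1, emd)

    # agreement between a clustering and a ground-truth labelling
    print(adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0]))   # -> 1.0 (labels only permuted)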

Sarah Frost, Daniel Baier
A Hierarchical Clustering Approach to Modularity Maximization

The problem of uncovering clusters of objects described by relationships that can be represented with the help of graphs is an application, which arises in fields as diverse as biology, computer science, and sociology, to name a few. To rate the quality of clusterings of undirected, unweighted graphs, modularity is a widely used goodness-of-fit index. As finding partitions of a graph’s vertex set, which maximize modularity, is NP-complete, various cluster heuristics have been proposed. However, none of these methods uses classical cluster analysis, where clusters based on (dis-)similarity data are sought. We consider the lengths of shortest paths between all vertex pairs as dissimilarities between the pairs of objects in order to apply standard cluster analysis methods. To test the performance of our approach we use popular real-world as well as computer generated benchmark graphs with known optimized cluster structure. Our approach is simple and compares favourably w.r.t. results known from the literature.
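A minimal sketch of this idea using NetworkX and SciPy on Zachary's karate club graph (an assumed example, not the chapter's benchmark set): shortest-path lengths serve as dissimilarities for standard hierarchical clustering, and the candidate partitions cut from the dendrogram are scored by modularity.

    import numpy as np
    import networkx as nx
    from networkx.algorithms.community import modularity
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import squareform

    G = nx.karate_club_graph()
    n = G.number_of_nodes()

    D = np.zeros((n, n))                         # all-pairs shortest-path dissimilarities
    for i, lengths in nx.shortest_path_length(G):
        for j, d in lengths.items():
            D[i, j] = d

    Z = linkage(squareform(D, checks=False), method="average")

    best_Q, best_k = -1.0, None
    for k in range(2, 9):                        # try several numbers of clusters
        labels = fcluster(Z, k, criterion="maxclust")
        comms = [set(int(i) for i in np.where(labels == c)[0]) for c in np.unique(labels)]
        Q = modularity(G, comms)
        if Q > best_Q:
            best_Q, best_k = Q, k
    print("best modularity %.3f with %d clusters" % (best_Q, best_k))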

Wolfgang Gaul, Rebecca Klages
Mixture Model Clustering with Covariates Using Adjusted Three-Step Approaches

When using mixture models, researchers may investigate the associations between cluster membership and covariates by introducing these variables in a (logistic) regression model for the prior class membership probabilities. However, a very popular alternative among applied researchers is a three-step approach in which, after estimating the mixture model (step 1) and assigning subjects to clusters (step 2), the cluster assignments are regressed on covariates (step 3). For mixture models for categorical responses, Bolck et al. (Political Anal 12:3–27, 2004) and Vermunt (Political Anal 18:450–469, 2010) showed that this approach may severely bias covariate effects downward, and moreover showed how to adjust for this bias. This paper generalizes their correction methods to be applicable also with mixture models for continuous responses, where the main complicating factor is that a complex multidimensional integral needs to be solved to obtain the classification errors needed for the corrections. We propose approximating this integral by a summation over the empirical distribution of the response variables. The simulation study showed that the approaches work well, except for the combination of very badly separated components and a small sample size.
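The unadjusted three-step procedure that the chapter starts from can be sketched with scikit-learn as follows (hypothetical simulated data): fit a mixture to the responses, assign observations to their modal class, then regress the assignments on a covariate. This naive version is exactly the one shown to be biased; the correction methods themselves are not reproduced here.

    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    z = rng.binomial(1, 0.5, size=500)                           # true latent class
    x = rng.normal(loc=z, scale=1.0)                             # covariate related to the class
    y = rng.normal(loc=np.where(z == 1, 2.0, -2.0), scale=1.0)   # continuous response

    gm = GaussianMixture(n_components=2, random_state=1).fit(y.reshape(-1, 1))   # step 1
    assign = gm.predict(y.reshape(-1, 1))                        # step 2: modal assignment
    step3 = LogisticRegression().fit(x.reshape(-1, 1), assign)   # step 3
    print(step3.coef_)    # covariate effect, attenuated by classification error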

Dereje W. Gudicha, Jeroen K. Vermunt
Efficient Spatial Segmentation of Hyper-spectral 3D Volume Data

Segmentation of noisy hyper-spectral imaging data using clustering requires special algorithms. Such algorithms should consider spatial relations between the pixels, since neighboring pixels should usually be clustered into one group. However, in case of a large spectral dimension (p), cluster algorithms suffer from the curse of dimensionality and have high memory requirements as well as long run-times. We propose to embed pixels from a window of w × w pixels into a feature space of dimension pw². The effect of implicit denoising due to the window is controlled by weights depending on the spatial distance. We propose either using Gaussian weights or data-adaptive weights based on the similarity of pixels. We use the FastMap algorithm for dimensionality reduction, and finally any vectorial clustering algorithm, like k-means, can be applied in this feature space. The proposed algorithm is evaluated on a large simulated imaging mass spectrometry dataset.
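A rough sketch of the windowed embedding on assumed toy data (Gaussian spatial weights only, and plain PCA standing in for FastMap): each pixel's w × w neighbourhood of spectra is flattened into a pw² feature vector, weighted by spatial distance, reduced, and clustered with k-means.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    H, W, p, w = 20, 20, 30, 3                   # image size, spectral dimension, window size
    cube = rng.normal(size=(H, W, p))
    cube[:, :10, :5] += 3.0                      # crude spatial/spectral structure

    r = w // 2
    yy, xx = np.mgrid[-r:r + 1, -r:r + 1]
    weights = np.exp(-(xx**2 + yy**2) / 2.0)     # Gaussian spatial weights

    pad = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="edge")
    feats = np.empty((H * W, p * w * w))
    for i in range(H):
        for j in range(W):
            win = pad[i:i + w, j:j + w, :] * weights[:, :, None]
            feats[i * W + j] = win.ravel()

    emb = PCA(n_components=10).fit_transform(feats)            # stand-in for FastMap
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
    print(labels.reshape(H, W)[:3])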

Jan Hendrik Kobarg, Theodore Alexandrov
Cluster Analysis Based on Pre-specified Multiple Layer Structure

Cluster analysis can be divided into two categories: hierarchical and non-hierarchical cluster analyses. In the present study, a method of cluster analysis which utilizes neither hierarchical nor non-hierarchical procedures is introduced. The present cluster analysis pre-specifies a structure having multiple layers, e.g., the species, the genus, the family, and the order. The highest layer, or layer 0, has one cluster to which all objects belong. The cluster at layer 0 has the pre-specified number of clusters at the next lower layer, or layer 1. Each cluster at layer 1 has the pre-specified number of clusters at the next lower layer, or layer 2, and so on. The cluster analysis classifies the objects into one of the clusters at all layers simultaneously. While the cluster structure is hierarchical, the procedure is not, which differs from the agglomerative or divisive algorithms of hierarchical cluster analysis. The algorithm tries to optimize the fitness measure at all layers simultaneously. The cluster analysis is applied to data on whisky malts.

Akinori Okada, Satoru Yokoyama
Factor PD-Clustering

Probabilistic Distance (PD) Clustering is a nonparametric probabilistic method to find homogeneous groups in multivariate datasets with J variables and n units. PD-Clustering runs an iterative algorithm and looks for a set of K group centers, maximising the empirical probabilities of belonging to a cluster for the n statistical units. As J becomes large, the solution tends to become unstable. This paper extends PD-Clustering to the context of factorial clustering methods and shows that the Tucker3 decomposition is a consistent transformation to project the original data into a subspace defined according to the same PD-Clustering criterion. The method consists of a two-step iterative procedure: a linear transformation of the initial data and PD-Clustering on the transformed data. The integration of the PD-Clustering and Tucker3 factorial steps makes the clustering more stable and allows us to consider datasets with large J and to handle clusters that do not have elliptical form.

Cristina Tortora, Mireille Gettler Summa, Francesco Palumbo

Statistical Data Analysis, Visualization and Scaling

Frontmatter
Clustering Ordinal Data via Latent Variable Models

Item response modelling is a well established method for analysing ordinal response data. Ordinal data are typically collected as responses to a number of questions or items. The observed data can be viewed as discrete versions of an underlying latent Gaussian variable. Item response models assume that this latent variable (and therefore the observed ordinal response) is a function of both respondent specific and item specific parameters. However, item response models assume a homogeneous population in that the item specific parameters are assumed to be the same for all respondents. Often a population is heterogeneous and clusters of respondents exist; members of different clusters may view the items differently. A mixture of item response models is developed to provide clustering capabilities in the context of ordinal response data. The model is estimated within the Bayesian paradigm and is illustrated through an application to an ordinal response data set resulting from a clinical trial involving self-assessment of arthritis.

Damien McParland, Isobel Claire Gormley
Sentiment Analysis of Online Media

A joint model for annotation bias and document classification is presented in the context of media sentiment analysis. We consider an Irish online media data set comprising online news articles with user annotations of negative, positive or irrelevant impact on the Irish economy. The joint model combines a statistical model for user annotation bias and a Naive Bayes model for the document terms. An EM algorithm is used to estimate the annotation bias model, the unobserved biases in the user annotations, the classifier parameters and the sentiment of the articles. The joint modeling of both the user biases and the classifier is demonstrated to be superior to estimation of the bias followed by the estimation of the classifier parameters.

Michael Salter-Townshend, Thomas Brendan Murphy
Visualizing Data in Social and Behavioral Sciences: An Application of PARAMAP on Judicial Statistics

In this paper, we describe a technique called PARAMAP for the visualization, scaling, and dimensionality reduction of data in the social and behavioral sciences. PARAMAP uses a criterion of maximizing continuity between higher-dimensional data and lower-dimensional derived data, rather than the distance-based criterion used by standard multidimensional scaling (MDS). We introduce PARAMAP using the example of scaling and visualizing the voting patterns of Justices in the US Supreme Court. We use data on the agreement rates between individual Justices and on the percentage of swing votes for Justices over time. We use PARAMAP, metric MDS, and nonmetric MDS approaches to create a voting space representation of the Justices in one and two dimensions. We test the results using a metric that measures neighborhood agreement of points between higher- and lower-dimensional solutions. PARAMAP produces smooth, easily interpretable solutions, with no clumping together of points.

Ulas Akkucuk, J. Douglas Carroll, Stephen L. France
Properties of a General Measure of Configuration Agreement

Variants of the Rand index of clustering agreement have been used to measure agreement between spatial configurations of points (Akkucuk 2004; Akkucuk and Carroll 2006; Chen 2006; Chen and Buja 2009). For these measures, the k-nearest neighbors of each point are compared across configurations. The agreement measure can be generalized across multiple values of k (France and Carroll 2007). The generalized agreement metric is denoted as ψ. In this paper, we further generalize ψ to the case of more than two configurations. We develop a partial agreement measure as a neighborhood agreement equivalent of the partial correlation coefficient. We demonstrate the use of ψ and partial ψ using an illustrative example. (MATLAB implementations of routines for calculating ψ and partial ψ are available at https://sites.google.com/site/psychminegroup/.)
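As a rough illustration of k-nearest-neighbor agreement between two configurations (a simplified variant, not the exact definition of ψ or its MATLAB routines), the sketch below averages, over all points, the overlap of their k-NN sets in two embeddings of the same objects.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def knn_agreement(A, B, k=5):
        """Mean overlap of k-nearest-neighbor sets between two configurations
        of the same n objects (rows correspond across A and B)."""
        def knn_sets(X):
            nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
            idx = nn.kneighbors(X, return_distance=False)[:, 1:]   # drop the point itself
            return [set(row) for row in idx]
        return np.mean([len(a & b) / k for a, b in zip(knn_sets(A), knn_sets(B))])

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    print(knn_agreement(X, X[:, :2], k=5))   # full data vs. a crude 2-D "projection"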

Stephen L. France
Convex Optimization as a Tool for Correcting Dissimilarity Matrices for Regular Minimality

Fechnerian scaling as developed by Dzhafarov and Colonius (e.g., Dzhafarov and Colonius, J Math Psychol 51:290–304, 2007) aims at imposing a metric on a set of objects based on their pairwise dissimilarities. A necessary condition for this theory is the law of Regular Minimality (e.g., Dzhafarov EN, Colonius H (2006) Regular minimality: a fundamental law of discrimination. In: Colonius H, Dzhafarov EN (eds) Measurement and representation of sensations. Erlbaum, Mahwah, pp. 1–46). In this paper, we solve the problem of correcting a dissimilarity matrix for Regular Minimality by phrasing it as a convex optimization problem in Euclidean metric space. In simulations, we demonstrate the usefulness of this correction procedure.

Matthias Trendtel, Ali Ünlü
Principal Components Analysis for a Gaussian Mixture

Given a p-dimensional random variable X, Principal Components Analysis defines its optimal representation in a lower-dimensional space. In this article we assume that X is distributed according to a mixture of two multivariate normal distributions and we project it onto an optimal vector space. We propose an original combination of principal components and linear discriminant analysis where the area under the ROC curve appears as the link between both methods. We represent X in terms of a small number of independent factors with maximum contribution to the area under the ROC curve of an optimal linear discriminant function. A practical example illustrates how these factors describe the differences between two categories in a simple classification problem.

Carlos Cuevas-Covarrubias
Interactive Principal Components Analysis: A New Technological Resource in the Classroom

Principal Components Analysis (PCA) is a mathematical technique widely used in multivariate statistics and pattern recognition. From a statistical point of view, PCA is an optimal linear transformation that eliminates the covariance structure of the data. From a geometrical point of view, it is simply a convenient rotation of axes. A successful PCA application depends, at a certain point, on the comprehension of this geometrical concept; however, visualizing this axes rotation can be an important challenge for many students. At the present time, undergraduate students are immersed in a social environment with an increasing amount of collaborative and interactive elements. This situation gives us the opportunity to incorporate new and creative alternatives of knowledge transmission. We present an interactive educational software tool called Mi-iPCA that helps students understand the geometrical foundations of Principal Components Analysis. Based on Nintendo's Wiimote, students manipulate the axes rotation interactively in order to obtain a diagonal covariance matrix. The graphical environment shows different projections of the data, as well as several statistics such as the percentage of variance explained by each component. Previous applications of this new pedagogical tool suggest that it constitutes an important didactic support in the classroom.

Carmen Villar-Patiño, Miguel Angel Mendez-Mendez, Carlos Cuevas-Covarrubias
One-Mode Three-Way Analysis Based on Result of One-Mode Two-Way Analysis

Several analysis models for proximity data have been introduced. While most of them were designed for one-mode two-way data, some models able to analyze one-mode three-way data have been suggested in recent years, including one-mode three-way multidimensional scaling and overlapping cluster analysis models. Furthermore, several studies have compared the results of analyses of one-mode two-way and one-mode three-way data generated from the same source. In the present study, the authors suggest an analysis of one-mode three-way data based on one-mode two-way analysis for overlapping clusters. To evaluate the necessity of one-mode three-way analysis, one-mode three-way data are first reconstructed from the clusters and weights obtained by one-mode two-way overlapping cluster analysis. Secondly, the reconstructed one-mode three-way data are subtracted from the original one-mode three-way data. The subtracted data are analyzed by a one-mode three-way overlapping cluster analysis model. The result of the analysis discloses the components of the proximities which can be expressed by one-mode three-way analysis but not by one-mode two-way analysis.

Satoru Yokoyama, Akinori Okada
Latent Class Models of Time Series Data: An Entropic-Based Uncertainty Measure

Latent class modeling has proven to be a powerful tool for identifying regimes in time series. Here, we focus on the classification uncertainty in latent class modeling of time series data with emphasis on entropy-based measures of uncertainty. Results are illustrated with an example.

José G. Dias
Regularization and Model Selection with Categorical Covariates

The challenge in regression problems with categorical covariates is the high number of parameters involved. Common regularization methods like the Lasso, which allow for selection of predictors, are typically designed for metric predictors. If independent variables are categorical, selection strategies should be based on modified penalties. For categorical predictor variables with many categories a useful strategy is to search for clusters of categories with similar effects. We focus on generalized linear models and present L1-penalty approaches for factor selection and clustering of categories. The methods proposed are investigated in simulation studies and applied to a real-world classification problem.
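For context, the baseline that such modified penalties refine is an ordinary L1-penalized GLM on dummy-coded categories, which penalizes each dummy coefficient separately rather than selecting whole factors or fusing similar categories. A minimal scikit-learn sketch on hypothetical data:

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder

    rng = np.random.default_rng(0)
    df = pd.DataFrame({"region": rng.choice(list("ABCDEFGH"), size=400),
                       "x": rng.normal(size=400)})
    effect = df["region"].map(dict(zip("ABCDEFGH", [1, 1, 1, 0, 0, 0, -1, -1])))
    y = (effect + df["x"] + rng.normal(size=400) > 0).astype(int)

    enc = ColumnTransformer([("cat", OneHotEncoder(), ["region"])], remainder="passthrough")
    model = make_pipeline(enc, LogisticRegression(penalty="l1", solver="liblinear", C=0.3))
    model.fit(df, y)
    # one coefficient per category; categories with similar effects are not fused
    print(model.named_steps["logisticregression"].coef_.round(2))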

Jan Gertheiss, Veronika Stelz, Gerhard Tutz
Factor Preselection and Multiple Measures of Dependence

Factor selection or factor reduction is carried out to reduce the complexity of data analysis problems (classification, regression) or to improve the fit of a model (via parameter estimation). In data mining there is a special need for a process by which relevant factors of influence are identified in order to achieve a balance between bias and noise. Insurance companies, for example, face data sets that contain hundreds of attributes or factors per object. With a large number of factors, the selection procedure requires a suitable process model. Such a process becomes compelling once data analysis is to be (semi-)automated. We suggest an approach that proceeds in two phases: In the first one, we cluster attributes that are highly correlated in order to identify factor combinations that, statistically speaking, are near duplicates. In the second phase, we choose factors from each cluster that are highly associated with a target variable. The implementation requires some form of non-linear canonical correlation analysis. We define a correlation measure for two blocks of factors that will be employed as a measure of similarity within the clustering process. Such measures, in turn, are based on multiple indices of dependence. Few such indices have been introduced in the literature, cf. Wolff (Stochastica 4(3):175–188, 1980). All of them, however, are hard to interpret if the number of dimensions considerably exceeds two. For that reason we come up with signed measures that can be interpreted in the usual way.

Nina Büchel, Kay. F. Hildebrand, Ulrich Müller-Funk
Intrablocks Correspondence Analysis

We propose a new method to describe contingency tables with double partition structures in columns and rows. Furthermore, we propose new superimposed representations, based on the introduction of variable dilations for the partial clouds associated with the partitions of the columns and the rows. We illustrate our contributions with the analysis of some mortality data in Spanish regions.

Campo Elías Pardo, Jorge Eduardo Ortiz
Determining the Similarity Between US Cities Using a Gravity Model for Search Engine Query Data

In this paper we use the gravity model to estimate the similarity of US cities based on data provided by Google Trends (GT). GT allows one to look up search terms and to obtain ranked lists of US cities according to the relative frequencies of requests for each term. The occurrences of the US cities on these ranked lists are used to determine the similarities with the gravity model. As search terms for GT, we use dictionaries derived from the General Inquirer (GI), containing the categories Economy and Politics/Legal. The estimated similarity scores are visualized with multidimensional scaling (MDS).

Paul Hofmarcher, Bettina Grün, Kurt Hornik, Patrick Mair

Bioinformatics and Biostatistics

Frontmatter
An Efficient Algorithm for the Detection and Classification of Horizontal Gene Transfer Events and Identification of Mosaic Genes

In this article we present a new algorithm for detecting partial and complete horizontal gene transfer (HGT) events which may give rise to the formation of mosaic genes. The algorithm uses a sliding window procedure that analyses sequence fragments along a given multiple sequence alignment (MSA). The size of the sliding window changes during the scanning process to better identify the blocks of transferred sequences. A bootstrap validation procedure incorporated in the algorithm is used to assess the bootstrap support of each predicted partial or complete HGT. The proposed technique can also be used to refine the results obtained by any traditional algorithm for inferring complete HGTs, and thus to classify the detected gene transfers as partial or complete. The new algorithm is applied to study the evolution of the gene rpl12e as well as the evolution of a complete set of 53 archaeal MSAs (i.e., 53 different ribosomal proteins) originally considered in Matte-Tailliez et al. (Mol Biol Evol 19:631–639, 2002).

Alix Boc, Pierre Legendre, Vladimir Makarenkov
Complexity Selection with Cross-validation for Lasso and Sparse Partial Least Squares Using High-Dimensional Data

Sparse regression and classification methods are commonly applied to high-dimensional data to simultaneously build a prediction rule and select relevant predictors. The well-known lasso regression and the more recent sparse partial least squares (SPLS) approach are important examples. In such procedures, the number of identified relevant predictors typically depends on a complexity parameter that has to be adequately tuned. Most often, parameter tuning is performed via cross-validation (CV). In the context of lasso-penalized logistic regression and SPLS classification, this paper addresses three important questions related to complexity selection: (1) Does the number of folds in CV affect the results of the tuning procedure? (2) Should CV be repeated several times to yield less variable tuning results? (3) Is complexity selection robust against resampling?
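The kind of sensitivity study described above can be mimicked with scikit-learn on synthetic data: tune the lasso penalty of a logistic regression by CV with different numbers of folds and with repeated CV, and inspect how stable the selected complexity parameter (and hence the number of selected predictors) is. This is only an illustration, not the chapter's experimental design.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold

    X, y = make_classification(n_samples=100, n_features=500, n_informative=10, random_state=0)
    grid = {"C": np.logspace(-2, 1, 6)}
    lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", max_iter=5000)

    for folds in (3, 5, 10):                                 # question (1): number of folds
        cv = RepeatedStratifiedKFold(n_splits=folds, n_repeats=3, random_state=0)  # question (2)
        search = GridSearchCV(lasso_logit, grid, cv=cv, scoring="roc_auc").fit(X, y)
        n_selected = int(np.sum(search.best_estimator_.coef_ != 0))
        print(folds, "folds: best C =", round(search.best_params_["C"], 3),
              "| selected predictors:", n_selected)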

Anne-Laure Boulesteix, Adrian Richter, Christoph Bernau
A New Effective Method for Elimination of Systematic Error in Experimental High-Throughput Screening

High-throughput screening (HTS) is a critical step of the drug discovery process. It involves measuring the activity levels of thousands of chemical compounds. Several technical and environmental factors can affect an experimental HTS campaign and thus cause systematic deviations from correct results. A number of error correction methods have been designed to address this issue in the context of experimental HTS (Brideau et al., J Biomol Screen 8:634–647, 2003; Kevorkov and Makarenkov, J Biomol Screen 10:557–567, 2005; Makarenkov et al., Bioinformatics 23:1648–1657, 2007; Malo et al., Nat Biotechnol 24:167–175, 2006). Despite their power to reduce the impact of systematic noise, all these methods introduce a bias when applied to data not containing any systematic error. We present a new method, proceeding by finding an approximate solution of an overdetermined system of linear equations, for eliminating systematic error from HTS screens by using prior knowledge of its exact location. This is an important improvement over the popular B-score method designed by Merck Frosst researchers (Brideau et al., J Biomol Screen 8:634–647, 2003) and widely used in modern HTS.
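To convey the flavour of removing additive plate effects, here is a generic NumPy sketch that fits row and column biases of a simulated HTS plate as an overdetermined least-squares system and subtracts them; it illustrates the general idea only and is not the authors' method or the B-score procedure referenced above.

    import numpy as np

    rng = np.random.default_rng(0)
    rows, cols = 8, 12                                  # a 96-well plate
    true = rng.normal(size=(rows, cols))                # compound activities
    measured = true + np.linspace(-1, 1, rows)[:, None] + np.linspace(0.5, -0.5, cols)[None, :]

    # design matrix: one indicator per row effect and one per column effect
    r_idx, c_idx = np.indices((rows, cols))
    A = np.hstack([np.eye(rows)[r_idx.ravel()], np.eye(cols)[c_idx.ravel()]])
    coef, *_ = np.linalg.lstsq(A, measured.ravel(), rcond=None)   # overdetermined system

    corrected = measured - (A @ coef).reshape(rows, cols)
    # row and column means of the corrected plate are (numerically) zero
    print(corrected.mean(axis=1).round(6), corrected.mean(axis=0).round(6))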

Vladimir Makarenkov, Plamen Dragiev, Robert Nadon
Local Clique Merging: An Extension of the Maximum Common Subgraph Measure with Applications in Structural Bioinformatics

We develop a novel similarity measure for node-labeled and edge-weighted graphs, which is an extension of the well-known maximum common subgraph (MCS) measure. Despite its common usage and appealing properties, the MCS also exhibits some disadvantages, notably a lack of flexibility and tolerance toward structural variation. In order to address these issues, we propose a generalization which is based on so-called quasi-cliques. A quasi-clique is a relaxation of a clique in the sense of being an “almost” complete subgraph. Thus, it increases flexibility and robustness toward structural variation. To construct a quasi-clique, we make use of a heuristic approach, in which so-called local cliques are determined first and combined into larger (quasi-)cliques afterward. We also present applications of our novel similarity measure to the retrieval and classification of protein binding sites.

Thomas Fober, Gerhard Klebe, Eyke Hüllermeier
Identification of Risk Factors in Coronary Bypass Surgery

In quality improvement in medical care one important aim is to prevent complications after a surgery and, particularly, to keep the mortality rate as small as possible. Therefore it is of great importance to identify which factors increase the risk of dying in the aftermath of a surgery. Based on data of 1,163 patients who underwent an isolated coronary bypass surgery in 2007 or 2008 at the Clinic of Cardiovascular Surgery in Düsseldorf, Germany, we select predictors that affect the in-hospital mortality. A forward search using the wrapper approach in conjunction with simple linear and also more complex classification methods such as gradient boosting and support vector machines is performed. Since the classification problem is highly imbalanced with certainly unequal but unknown misclassification costs, the area under the ROC curve (AUC) is used as the performance criterion for hyperparameter tuning as well as for variable selection. In order to get stable results and to obtain estimates of the AUC, the variable selection is repeated 25 times on different subsamples of the data set. It turns out that simple linear classification methods (linear discriminant analysis and logistic regression) are suitable for this problem since the AUC cannot be considerably increased by more complex methods. We identify the three most important predictors as the severity of cardiac insufficiency, the patient's age, and pulmonary hypertension. A comparison with full models trained on the same 25 subsamples shows that the classification performance in terms of AUC is not affected or only slightly decreased by variable selection.

Julia Schiffner, Erhard Godehardt, Stefanie Hillebrand, Alexander Albert, Artur Lichtenberg, Claus Weihs

Archaeology and Geography, Psychology and Educational Sciences

Frontmatter
Parallel Coordinate Plots in Archaeology

Parallel coordinate plots (PCPs) can be applied to explore multivariate data with more than three dimensions. This visualisation method is straightforward and easily intelligible for people without a statistical background. However, to our knowledge, PCPs have not yet been applied to archaeological data. This paper presents some examples of archaeological classifications which are clearly visible in PCPs. For this purpose a program has been written offering some additional options which are not supported in standard software for PCP generation. Some of the functionality of Geographic Information Systems (GIS) was introduced for PCPs: the program is able to create a thematic display based on a user-selected variable; optionally, multiple plots highlight each thematic colour. Another variable may control the width of the PCP lines. Moreover, an info-tool, zoom, and a find-function are supported. The resulting diagram can be saved in SVG format and in a GIS format.
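Parallel coordinate plots themselves are easy to reproduce with standard tooling; a minimal pandas/matplotlib sketch on the classic iris data (an assumed example, unrelated to the authors' archaeological program and its GIS-style extensions):

    import matplotlib.pyplot as plt
    from pandas.plotting import parallel_coordinates
    from sklearn.datasets import load_iris

    iris = load_iris(as_frame=True)
    df = iris.frame.copy()
    df["species"] = df["target"].map(dict(enumerate(iris.target_names)))
    df = df.drop(columns="target")

    # one coloured line per observation; the "species" column drives the thematic colour
    parallel_coordinates(df, class_column="species", alpha=0.4)
    plt.tight_layout()
    plt.show()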

Irmela Herzog, Frank Siegmund
Classification of Roman Tiles with Stamp PARDALIVS

Latterly, 14 Roman tiles were excavated in Nehren in the farthest eastern part of the former Roman province Gallia Belgica. Ten of them bear the stamp PARDALIVS, which had never been seen before. First, this new set of tiles is compared, based on their chemical composition, with all currently determined provenances of Roman tile making in the northern part of Germania Superior. The discriminant analysis indicates that the set of tiles from Nehren is different. However, this class does not look homogeneous. Therefore, second, exploratory data analysis including data visualizations is performed. Additionally, we investigated in detail the tiles of a provenance with a not yet identified location (NYI3), because their statistical difference to the tiles of Nehren is the lowest among the considered provenances. A serious problem is the small sample size. In order to increase it, we propose some combinations of bootstrapping and jittering to generate additional observations.

Hans-Joachim Mucha, Jens Dolata, Hans-Georg Bartel
Applying Location Planning Algorithms to Schools: The Case of Special Education in Hesse (Germany)

In Germany, children with special educational needs are still predominantly taught in special schools. Although segregative schooling stands in conflict with the legal situation, only little development towards inclusive concepts has been made. One substantial argument is the cost involved in the required, extensive reorganization of the educational system. By using location planning procedures, this paper describes how the organizational effects and the financial impact of implementing inclusive settings can be analyzed systematically. Schooling of children with learning disabilities in the German federal state of Hesse is used to exemplify and discuss the approach. The results indicate that especially in rural regions, where school enrollment is decreasing, preference should be given to inclusive concepts in order to provide a demand-oriented school system where schools are close to home for all students.

Alexandra Schwarz
Detecting Person Heterogeneity in a Large-Scale Orthographic Test Using Item Response Models

Achievement tests for students are constructed with the aim of measuring a specific competency uniformly for all examinees. This requires students to work on the items in a homogeneous way. The dichotomous logistic Rasch model is the model of choice for assessing these assumptions during test construction. However, it is also possible that various subgroups of the population either apply different strategies for solving the items or make specific types of mistakes, or that different items measure different latent traits. These assumptions can be evaluated with extensions of the Rasch model or other item response models. In this paper, the test construction of a new large-scale German orthographic test for eighth grade students is presented. In the process of test construction and calibration, a pilot version was administered to 3,227 students in Austria. In the first step of the analysis, the items yielded a poor model fit to the dichotomous logistic Rasch model. Further analyses found homogeneous subgroups in the sample which are characterized by different orthographic error patterns.

Christine Hohensinn, Klaus D. Kubinger, Manuel Reif
Linear Logistic Models with Relaxed Assumptions in R

Linear logistic models with relaxed assumptions (LLRA) are a flexible tool for item-based measurement of change or multidimensional Rasch models. Their key features are to allow for multidimensional items and mutual dependencies of items as well as imposing no assumptions on the distribution of the latent trait in the population. Inference for such models becomes possible within a framework of conditional maximum likelihood estimation. In this paper we introduce and illustrate new functionality from the R package eRm for fitting, comparing and plotting of LLRA models for dichotomous and polytomous responses with any number of time points, treatment groups and categorical covariates.

Thomas Rusch, Marco J. Maier, Reinhold Hatzinger

Text Mining, Social Networks and Clustering

Frontmatter
An Approach for Topic Trend Detection

The detection of topic trends is an important issue in textual data mining. For this task, textual documents collected over a certain time period are analysed by grouping them into homogeneous, time-window-dependent clusters. We use a vector space model and a straightforward vector cosine measure to evaluate document-document similarities in a time window and discuss how cluster-cluster similarities between subsequent windows can help to detect alterations of topic trends over time. Our method is demonstrated on an empirical data set of about 250 pre-classified, time-stamped documents. The results allow us to assess which method-specific parameters are valuable for further research.
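A bare-bones version of the vector-space machinery mentioned above (assumed toy documents, scikit-learn): tf-idf document vectors per time window, cosine similarities within a window, and a crude cluster-cluster similarity between the centroids of clusters from subsequent windows.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    window1 = ["stock markets fall", "markets react to rates", "central bank raises rates"]
    window2 = ["bank rates stay high", "new phone released", "phone sales surge"]

    vec = TfidfVectorizer().fit(window1 + window2)       # shared vocabulary across windows
    X1, X2 = vec.transform(window1), vec.transform(window2)

    print(cosine_similarity(X1).round(2))                # document-document similarities, window 1

    km1 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X1)
    km2 = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X2)
    # cluster-cluster similarities between subsequent windows (centroid cosine)
    print(cosine_similarity(km1.cluster_centers_, km2.cluster_centers_).round(2))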

Wolfgang Gaul, Dominique Vincent
Modified Randomized Modularity Clustering: Adapting the Resolution Limit

Fortunato and Barthélemy (Proc Nat Acad Sci USA 104(1):36–41, 2007) investigated the resolution limit of modularity clustering algorithms. They showed that the goal function of the standard modularity clustering algorithm of Newman and Girvan (Phys Rev E 69(2):026113, 2004) implies that the number of clusters chosen by the algorithm is approximately the square root of the number of edges. The existence of the resolution limit shows that the discovery of the number of clusters is not automatic. In this paper we report on two contributions to solving the problem of automatic cluster detection in graph clustering: we parametrize the goal function of modularity clustering by considering the number of edges as a free parameter, and we introduce permutation invariance as a general formal diagnostic to recognize good partitions of a graph. The second contribution results from the study of scaling bounds combined with the stability of graph partitions on various types of regular graphs. In this study the connection between the stability of graph partitions and the automorphism group of the graph was discovered.

Andreas Geyer-Schulz, Michael Ovelgönne, Martin Stein
Cluster It! Semiautomatic Splitting and Naming of Classification Concepts

In this paper, we present a semiautomatic approach to split overpopulated classification concepts (i.e. classes) into subconcepts and propose suitable names for the new concepts. Our approach consists of three steps: In a first step, meaningful term clusters are created and presented to the user for further curation and selection of possible new subconcepts. A graph representation and simple tf-idf weighting are used to create the cluster suggestions. The term clusters are used as seeds for the subsequent content-based clustering of the documents using k-means. At last, the resulting clusters are evaluated based on their correlation with the preselected term clusters, and proper terms for the naming of the clusters are proposed. We show that this approach efficiently supports the maintainer while avoiding the usual quality problems of fully automatic clustering approaches, especially with respect to the handling of outliers and the determination of the number of target clusters. The documents of the parent concept are directly assigned to the new subconcepts, favoring high precision.
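The seeding step can be imitated with scikit-learn as follows (a hypothetical miniature corpus): user-curated term clusters are turned into centroid vectors in tf-idf space and passed to k-means as explicit initial centers, so the document clustering starts from the preselected term groups.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["loans and interest rates", "mortgage interest payments",
            "football match results", "league results and scores"]
    term_clusters = [["loans", "interest", "mortgage"],      # curated seed terms
                     ["football", "league", "scores"]]

    vec = TfidfVectorizer()
    X = vec.fit_transform(docs)

    seeds = np.zeros((len(term_clusters), X.shape[1]))       # one seed centroid per term cluster
    for k, terms in enumerate(term_clusters):
        for t in terms:
            if t in vec.vocabulary_:
                seeds[k, vec.vocabulary_[t]] = 1.0
    seeds /= np.linalg.norm(seeds, axis=1, keepdims=True)

    km = KMeans(n_clusters=2, init=seeds, n_init=1).fit(X)   # seeded k-means
    print(km.labels_)    # documents assigned to the term-cluster-based subconcepts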

Dominik Stork, Kai Eckert, Heiner Stuckenschmidt

Banking and Finance

Frontmatter
A Theoretical and Empirical Analysis of the Black-Litterman Model

The Black-Litterman (BL) model aims to enhance asset allocation decisions by overcoming the weaknesses of standard mean-variance (MV) portfolio optimization. In this study we propose a method that enables the implementation of the BL model for a multi-asset portfolio allocation decision. Further, we empirically test the out-of-sample portfolio performance of BL-optimized portfolios in comparison to mean-variance (MV), minimum-variance, and adequate benchmark portfolios. Using an investment universe of global stock markets, bonds, and commodities, we find that for the period from January 2000 to August 2011 out-of-sample BL-optimized portfolios provide superior Sharpe ratios, even after controlling for different levels of risk aversion, realistic investment constraints, and transaction costs. Further, the BL approach is suitable for alleviating most of the shortcomings of MV optimization, in that the resulting portfolios are more diversified, less extreme, and, hence, economically more intuitive.
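For readers unfamiliar with the mechanics, the standard Black-Litterman posterior expected returns can be written in a few lines of NumPy; the numbers below (covariance, views, τ, risk aversion) are purely illustrative and are not taken from the chapter's empirical study.

    import numpy as np

    Sigma = np.array([[0.04, 0.01], [0.01, 0.09]])     # asset covariance matrix
    w_mkt = np.array([0.6, 0.4])                       # market-cap weights
    delta, tau = 2.5, 0.05                             # risk aversion, uncertainty scaling

    pi = delta * Sigma @ w_mkt                         # implied equilibrium returns
    P = np.array([[1.0, -1.0]])                        # one view: asset 1 outperforms asset 2
    Q = np.array([0.02])                               # ... by 2% per period
    Omega = np.diag(np.diag(P @ (tau * Sigma) @ P.T))  # view uncertainty

    A = np.linalg.inv(tau * Sigma)
    mu_bl = np.linalg.solve(A + P.T @ np.linalg.inv(Omega) @ P,
                            A @ pi + P.T @ np.linalg.inv(Omega) @ Q)
    w_bl = np.linalg.solve(delta * Sigma, mu_bl)       # unconstrained mean-variance weights
    print(mu_bl.round(4), w_bl.round(3))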

Wolfgang Bessler, Dominik Wolff
Vulnerability of Copula-VaR to Misspecification of Margins and Dependence Structure

Copula functions as tools for modeling multivariate distributions are well known in the theory of statistics and over the last decade have been gathering more and more popularity also in the field of finance. A copula-based model of a multivariate distribution includes both the dependence structure and the marginal distributions in such a way that the former may be analyzed separately from the latter. Its main advantage is the flexibility that allows one to merge margins of one type with a copula function of another, or even to bind margins of various types by a common copula into a single multivariate distribution. In this article copula functions are used to estimate Value at Risk (VaR). The goal is to investigate how misspecification of the marginal distributions and the dependence structure affects VaR. As dependence structures, normal and Student-t copulas are considered. The analysis is based on simulation studies.
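As background, a Gaussian-copula VaR simulation with Student-t margins might look as follows (illustrative parameters only); swapping the copula or the margins in such a sketch is exactly the kind of misspecification whose impact on VaR the chapter studies.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n, rho, df = 100_000, 0.5, 4
    R = np.array([[1.0, rho], [rho, 1.0]])

    # Gaussian copula: correlated normals -> uniforms -> Student-t margins
    z = rng.multivariate_normal(np.zeros(2), R, size=n)
    u = stats.norm.cdf(z)
    returns = stats.t.ppf(u, df) * 0.01             # daily returns of two assets (scaled)

    portfolio = returns.mean(axis=1)                # equally weighted portfolio
    var_99 = -np.quantile(portfolio, 0.01)          # 99% Value at Risk
    print(round(var_99, 4))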

Katarzyna Kuziak
Dynamic Principal Component Analysis: A Banking Customer Satisfaction Evaluation

An empirical study, based on a sample of 27,000 retail customers, has been carried out: the management of a national bank with a network spread across Italian regions wanted to analyze the loss in competitiveness of its retail services, probably due to a loss in customer satisfaction. The aim of the survey is to analyze the weaknesses of the retail services, identify possible recovery actions and evaluate their effectiveness across different waves (3 time lags). These issues lead our study towards the definition of a new path measure which exploits a dimension reduction obtained with Dynamic Principal Component Analysis (DPCA). The results, which show customer satisfaction configurations, are discussed in the light of possible marketing actions.

Caterina Liberati, Paolo Mariani
Comparison of Some Chosen Tests of Independence of Value-at-Risk Violations

Backtesting is the necessary statistical procedure to evaluate the performance of Value-at-Risk models. A satisfactory test should allow one to detect both deviations from the correct probability of violations and their clustering. Many researchers and practitioners underline the importance of the lack of any dependence in the hit series over time. If the independence condition is not met, it may be a signal that the Value-at-Risk model reacts too slowly to changes in the market. If the violation sequence exhibits a dependence other than first-order Markov dependence, the classical test of Christoffersen would fail to detect it. This article presents a test based on the analysis of durations, having power against more general forms of dependence and based on the same set of information as the Christoffersen test, i.e. the hit series. The aim of this article is to analyze the presented backtests, focusing on the aspects of limited data sets and the power of the tests. Simulated data representing asset returns are used here. This paper is a continuation of earlier research done by the author.
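For reference, the classical Christoffersen independence test on a 0/1 hit series (first-order Markov alternative) can be computed as below; the duration-based test discussed in the chapter, which has power against more general forms of dependence, is not reproduced here.

    import numpy as np
    from scipy.special import xlogy
    from scipy.stats import chi2

    def christoffersen_independence(hits):
        """LR test of independence of VaR violations against a first-order Markov chain."""
        hits = np.asarray(hits, dtype=int)
        pairs = list(zip(hits[:-1], hits[1:]))
        n00 = pairs.count((0, 0)); n01 = pairs.count((0, 1))
        n10 = pairs.count((1, 0)); n11 = pairs.count((1, 1))
        p01 = n01 / (n00 + n01)                     # P(hit today | no hit yesterday)
        p11 = n11 / (n10 + n11)                     # P(hit today | hit yesterday)
        p = (n01 + n11) / len(pairs)                # unconditional hit probability

        def loglik(pa, pb):
            return (xlogy(n00, 1 - pa) + xlogy(n01, pa)
                    + xlogy(n10, 1 - pb) + xlogy(n11, pb))

        lr = -2 * (loglik(p, p) - loglik(p01, p11))
        return lr, 1 - chi2.cdf(lr, df=1)

    rng = np.random.default_rng(0)
    hits = rng.binomial(1, 0.05, size=1000)         # i.i.d. violations: independence should hold
    print(christoffersen_independence(hits))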

Krzysztof Piontek
Fundamental Portfolio Construction Based on Mahalanobis Distance

In the classic Markowitz model, the portfolio risk is minimized at an assumed profitability level. The fundamental portfolio introduces an additional condition aimed at ensuring that the portfolio is only composed of companies in good economic condition. A synthetic indicator is constructed for each company, describing its economic and financial situation. There are many methods for constructing synthetic measures; this article applies the standard method of linear ordering. In models of fundamental portfolio construction, companies are most often ordered on the basis of Euclidean distance. Due to possible correlation between economic variables, the most appropriate measure of distance between enterprises is the Mahalanobis distance.

The aim of the article is to compare the composition of fundamental portfolios constructed on the basis of Euclidean distance with portfolios determined using the Mahalanobis distance.
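The key quantity is easy to state in code: the sketch below compares the Euclidean and Mahalanobis distances of each (hypothetical) company's fundamental indicators to a benchmark "pattern" object, using the inverse sample covariance to account for correlated variables.

    import numpy as np

    rng = np.random.default_rng(0)
    # rows: companies, columns: correlated fundamental indicators (e.g. ROA, ROE, margin)
    F = rng.multivariate_normal([0.05, 0.10, 0.20],
                                [[0.010, 0.008, 0.000],
                                 [0.008, 0.020, 0.000],
                                 [0.000, 0.000, 0.040]], size=50)
    pattern = F.max(axis=0)                          # an ideal "pattern" company

    S_inv = np.linalg.inv(np.cov(F, rowvar=False))
    diff = F - pattern
    d_mahal = np.sqrt(np.einsum("ij,jk,ik->i", diff, S_inv, diff))
    d_eucl = np.linalg.norm(diff, axis=1)

    # the resulting orderings of companies can differ between the two distances
    print(np.argsort(d_eucl)[:5], np.argsort(d_mahal)[:5])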

Anna Rutkowska-Ziarko
Multivariate Modelling of Cross-Commodity Price Relations Along the Petrochemical Value Chain

We aim to shed light on the relationship between the prices of crude oil and oil-based products along the petrochemical value chain. The analyzed commodities are tied in an integrated production process. This characteristic motivates the existence of long-run equilibrium price relationships. An understanding of the complex price relations between input and output products is important for petrochemical companies, which are exposed to price risk on both sides of their business. Their profitability is linked to the spread between input and output prices. Therefore, information about price relations along the value chain is valuable for risk management decisions. Using vector error correction models (VECM), we explore cross-commodity price relationships. We find that all prices downstream the value chain are cointegrated with the crude oil price, which is the driving price in the system. Furthermore, we assess whether the information about long-run cross-commodity relations, which is incorporated in the VECMs, can be utilized for forecasting prices of oil-based products. Rolling out-of-sample forecasts are computed and the forecasting performance of the VECMs is compared to the performance of naive forecasting models. Our study offers new insights into how economic relations between commodities linked in a production process can be used for price forecasts and offers implications for risk management in the petrochemical industry.

Myriam Thömmes, Peter Winker

Marketing and Management

Frontmatter
Lifestyle Segmentation Based on Contents of Uploaded Images Versus Ratings of Items

Clustering algorithms are standard tools for marketing purposes. In market segmentation, e.g., they are applied to derive homogeneous customer groups. Recently, however, the available resources for this purpose have expanded: in social networks, potential customers provide images which reflect their activities, interests, and opinions. To examine whether the contents of uploaded images lead to similar lifestyle segmentations as ratings of items, a comparison study was conducted among 478 people. In this paper we discuss the results of this study, which suggest that similar lifestyle segmentations can be found. We discuss advantages and disadvantages of the new approach to lifestyle segmentation.

Ines Daniel, Daniel Baier
Optimal Network Revenue Management Decisions Including Flexible Demand Data and Overbooking

In aviation network revenue management it is helpful to address consumers who are flexible w.r.t. certain flight characteristics, e.g., departure times, number of intermediate stops, and booking class assignments. While overbooking has some tradition and the offering of so-called flexible products, in which some of the mentioned characteristics are not fixed in advance, is gaining increasing importance, the simultaneous handling of both aspects is new. We develop a DLP (deterministic linear programming) model that considers flexible products and overbooking and use an empirical example for the explanation of our findings.

Wolfgang Gaul, Christoph Winkler
Non-symmetrical Correspondence Analysis of Abbreviated Hard Laddering Interviews

Hard laddering is a kind of semi-structured interview in a quantitative means-end chain (MEC) approach that yields a summary implication matrix (SIM). The SIM is based on pairwise associations between attributes (A), consequences (C), and personal values (V), and constitutes a base for developing hierarchical value maps (HVM). A new summary data presentation of the A-C-V triplets that form a summary ladder matrix (SLM) is presented. The structure of the SLM is examined with the use of non-symmetrical correspondence analysis. This approach permits us to identify the dependence structure in the SLM as well as the ladders that contribute most to the system's inertia.

Eugene Kaciak, Adam Sagan
Antecedents and Outcomes of Participation in Social Networking Sites

This study seeks to understand the factors that influence participation in online social networks and its outcomes. The proposed model integrates variables such as identification, satisfaction, degree of influence, usefulness and ease of use into a comprehensive framework. The empirical approach was based on an online survey of 336 young adults in Portugal, undertaken during November/December 2010. Research findings showed that identification, perceived usefulness, interaction preference, and extroversion are the most important factors influencing members' participation. The degree of influence and identification have an indirect effect on participation through perceived usefulness. Participation in social networking sites, in turn, is linked to higher levels of loyalty, actual use, and word-of-mouth. The results of this study have implications for researchers and practitioners.

Sandra Loureiro
User-Generated Content for Image Clustering and Marketing Purposes

The analysis of images for different purposes – particularly image clustering – has been the subject of several research streams in the past. Since the 1990s, query by image content and, somewhat later, content-based image retrieval have been topics of growing scientific interest. A literature review shows that research on image analysis, so far, is primarily related to computer science. However, since the advent of Flickr and other media-sharing platforms there is an ever growing data base of images which reflects individual preferences regarding activities or interests. Hence, these data are promising for observing implicit preferences and complementing classical efforts for several marketing purposes (see, e.g., Van House, Int J Hum-Comput Stud 67:1073–1086, 2009 or Baier D, Daniel I (2011) Image clustering for marketing purposes. In: W. Gaul, A. Geyer-Schulz, L. Schmidt-Thieme (eds) Challenges concerning the data analysis – Computer science – Optimization, vol. 43). Against this background, the present paper investigates options for clustering images on the basis of personal image preferences, e.g. to use the results for marketing purposes.

Diana Schindler
Logic Based Conjoint Analysis Using the Commuting Quantum Query Language

Recently, in computer science, the commuting quantum query language (CQQL) has been introduced for ranking search results in a database (Schmitt I, VLDB Very Large Data Base J 17(1):39–56, 2008). A user is asked to express search conditions and to rank presented samples, then, the CQQL system matches select-conditions with attribute values. Schmitt’s (VLDB Very Large Data Base J 17(1):39–56, 2008) approach is based on von Neumann’s (Grundlagen der Quantenmechanik. Springer, Berlin, 1932) quantum mechanics and quantum logic and is adapted in this paper to the usual conjoint analysis setting in marketing research where respondents are asked to evaluate selected attribute level combinations for choice prediction purposes. The results of an application show that the new approach competes well with the traditional approach.

Ingo Schmitt, Daniel Baier
Product Design Optimization Using Ant Colony And Bee Algorithms: A Comparison

In recent years, heuristic algorithms, especially swarm intelligence algorithms, have become popular for product design, where problem formulations often are NP-hard (Socha and Dorigo, Eur J Oper Res 185:1155–1173, 2008). Swarm intelligence algorithms offer an alternative for large-scale problems to reach near-optimal solutions, without constraining the problem formulations immoderately (Albritton and McMullen, Eur J Oper Res 176:498–520, 2007). In this paper, ant colony (Albritton and McMullen, Eur J Oper Res 176:498–520, 2007) and bee colony algorithms (Karaboga and Basturk, J Glob Optim 39:459–471, 2007) are compared. Simulated conjoint data for different product design settings are used for this comparison; their generation uses a Monte Carlo design similar to the one applied by Albritton and McMullen (Eur J Oper Res 176:498–520, 2007). The purpose of the comparison is to provide assistance on which algorithm should be applied in which product design setting.

Sascha Voekler, Daniel Krausche, Daniel Baier

Music Classification Workshop

Frontmatter
Comparison of Classical and Sequential Design of Experiments in Note Onset Detection

Design of experiments is an established approach to the parameter optimization of industrial processes. In many computer applications, however, it is usual to optimize the parameters via genetic algorithms. The main idea of this work is to apply design of experiments techniques to the optimization of computer processes. The major problem here is finding a compromise between model validity and costs, which increase with the number of experiments. The second relevant problem is choosing an appropriate model, which describes the relationship between parameters and target values. One of the recent approaches here is model combination. In this paper a musical note onset detection algorithm is optimized using design of experiments. The optimal algorithm parameter setting is sought in order to get the best onset detection accuracy. We try different design strategies including classical and sequential designs and compare several model combination strategies.

Nadja Bauer, Julia Schiffner, Claus Weihs
Recognising Cello Performers Using Timbre Models

In this paper, we compare timbre features of various cello performers playing the same instrument in solo cello recordings. Using an automatic feature extraction framework, we investigate the differences in sound quality of the players. The motivation for this study comes from the fact that the performer's influence on acoustical characteristics is rarely considered when analysing audio recordings of various instruments. We explore the phenomenon, known amongst musicians as the "sound" of a player, which enables listeners to differentiate one player from another when they perform the same piece of music on the same instrument. We analyse sets of spectral features extracted from cello recordings of five players and model timbre characteristics of each performer. The proposed features include harmonic and noise (residual) spectra, Mel-frequency spectra and Mel-frequency cepstral coefficients. Classifiers such as k-Nearest Neighbours and Linear Discriminant Analysis trained on these models are able to distinguish the five performers with high accuracy.
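A toy version of such a timbre pipeline, using librosa and scikit-learn (hypothetical file paths and labels; MFCC summary statistics only, not the full harmonic/residual feature set described above):

    import numpy as np
    import librosa
    from sklearn.neighbors import KNeighborsClassifier

    def mfcc_features(path, n_mfcc=13):
        """Summarize a recording by the mean and standard deviation of its MFCCs."""
        y, sr = librosa.load(path, sr=22050, mono=True)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    # hypothetical training data: paths to solo cello excerpts and performer labels
    train_files = ["player_a_01.wav", "player_a_02.wav", "player_b_01.wav", "player_b_02.wav"]
    train_labels = ["A", "A", "B", "B"]

    X = np.vstack([mfcc_features(f) for f in train_files])
    clf = KNeighborsClassifier(n_neighbors=1).fit(X, train_labels)
    print(clf.predict([mfcc_features("unknown_excerpt.wav")]))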

Magdalena Chudy, Simon Dixon
A Case Study About the Effort to Classify Music Intervals by Chroma and Spectrum Analysis

Recognition of harmonic characteristics from polyphonic music, in particular intervals, can be very hard if different instruments with their specific characteristics (overtones, formants, noisy components) are playing together at the same time. In our study we examined the impact of chroma features and the spectrum on the classification of single-tone pitches and music intervals played either by the same or by different instruments. After the analysis of the audio recordings which produced the most errors, we implemented two optimization approaches based on the energy envelope and the overtone distribution. The methods were compared in an experimental study. The results show that especially the integration of instrument-specific knowledge can significantly improve the overall performance.

Verena Mattern, Igor Vatolkin, Günter Rudolph
Computational Prediction of High-Level Descriptors of Music Personal Categories

Digital music collections are often organized by genre relationships or personal preferences. The target of automatic classification systems is to provide music management that limits the listener's effort for the labeling of a large number of songs. Many state-of-the-art methods utilize low-level audio features like spectral and time-domain characteristics, chroma, etc. for categorization. However, the impact of these features is very hard to understand; if the listener labels some music pieces as belonging to a certain category, this decision is indeed motivated by instrumentation, harmony, vocals, rhythm and further high-level descriptors from music theory. So it could be more reasonable to understand a classification model created from such intuitively interpretable features. For our study we annotated high-level characteristics (vocal alignment, tempo, key, etc.) for a set of personal music categories. Then we created classification models which predict these characteristics from the low-level audio features available in the AMUSE framework. The capability of this set of low-level features to classify the expert descriptors is investigated in detail.

Günther Rötter, Igor Vatolkin, Claus Weihs
High Performance Hardware Architectures for Automated Music Classification

Today, stationary systems like personal computers and even portable music playback devices provide storage capacities for huge music collections of several thousand files. Therefore, automated music classification is a very attractive feature for managing such multimedia databases. This type of application enhances user comfort by classifying songs into predefined categories like music genres or user-defined categories. However, automated music classification, based on audio feature extraction, is, firstly, extremely computation intensive, and secondly, has to be applied to enormous amounts of data. This is the reason why energy-efficient high-performance implementations for feature extraction are required. This contribution presents a dedicated hardware architecture for music classification applying typical audio features for discrimination (e.g., spectral centroid, zero crossing rate). For evaluation purposes, the architecture is mapped onto a Field Programmable Gate Array (FPGA). In addition, the same application is also implemented on a commercial Graphics Processing Unit (GPU). Both implementations are evaluated in terms of processing time and energy efficiency.
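As a software reference for the two features named above, here is a short NumPy sketch computing the zero crossing rate and the spectral centroid of one frame of a synthetic signal; a hardware implementation would compute the same quantities per frame.

    import numpy as np

    sr = 22050
    t = np.arange(sr) / sr
    signal = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1320 * t)

    def zero_crossing_rate(frame):
        signs = np.signbit(frame).astype(np.int8)
        return np.mean(np.abs(np.diff(signs)))       # fraction of sign changes

    def spectral_centroid(frame, sr):
        spectrum = np.abs(np.fft.rfft(frame))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
        return np.sum(freqs * spectrum) / np.sum(spectrum)   # magnitude-weighted mean frequency

    frame = signal[:2048]
    print(zero_crossing_rate(frame), spectral_centroid(frame, sr))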

Ingo Schmädecke, Holger Blume
Metadata
Title
Algorithms from and for Nature and Life
Edited by
Berthold Lausen
Dirk Van den Poel
Alfred Ultsch
Copyright Year
2013
Electronic ISBN
978-3-319-00035-0
Print ISBN
978-3-319-00034-3
DOI
https://doi.org/10.1007/978-3-319-00035-0