
## About this book

This volume provides approaches and solutions to challenges occurring at the interface of research fields such as data analysis, computer science, operations research, and statistics. It includes theoretically oriented contributions as well as papers from various application areas, where knowledge from different research directions is needed to find the best possible interpretation of data for the underlying problem situations. Besides traditional classification research, the book focuses on current interests in fields such as the analysis of social relationships and statistical musicology.

## Table of Contents

### Fuzzification of Agglomerative Hierarchical Crisp Clustering Algorithms

User-generated content from fora, weblogs and other social networks is a rapidly growing data source for which information extraction algorithms can provide convenient data access. Hierarchical clustering algorithms are used to present the topics covered in this data at different levels of abstraction. In recent years, there has been some research using hierarchical fuzzy algorithms to handle comments that deal with many different topics at once rather than a single one. The variants of the well-known fuzzy c-means algorithm used for this purpose are nondeterministic, and thus the cluster results are irreproducible. In this work, we present a deterministic algorithm that fuzzifies currently available agglomerative hierarchical crisp clustering algorithms and therefore allows arbitrary multi-assignments. It is shown how to reuse well-studied linkage metrics, and the monotonic behavior is analyzed for each of them. The proposed algorithm is evaluated using collections of the RCV1 and RCV2 corpus.

Mathias Bank, Friedhelm Schwenker

### An EM Algorithm for the Student-t Cluster-Weighted Modeling

Cluster-Weighted Modeling is a flexible statistical framework for modeling local relationships in heterogeneous populations on the basis of weighted combinations of local models. Besides the traditional approach based on Gaussian assumptions, here we consider Cluster-Weighted Modeling based on Student-t distributions. In this paper we present an EM algorithm for parameter estimation in Cluster-Weighted models according to the maximum likelihood approach.

Salvatore Ingrassia, Simona C. Minotti, Giuseppe Incarbone

### Analysis of Distribution Valued Dissimilarity Data

We deal with methods for analyzing complex structured data, in particular distribution-valued data. Nowadays, there are many requests to analyze various types of data including spatial data, time series data, functional data and symbolic data. The idea of symbolic data analysis proposed by Diday covers a large range of data structures. We focus on distribution-valued dissimilarity data and multidimensional scaling (MDS) for these kinds of data. MDS is a powerful tool for analyzing dissimilarity data. The purpose of MDS is to construct a configuration of the objects from dissimilarities between objects. In conventional MDS, the input dissimilarities are assumed to be (non-negative) real values. Dissimilarities between objects are sometimes given probabilistically; the dissimilarity data may then be represented as distributions. We assume that the distributions between objects $i$ and $j$ are non-central chi-square distributions $\chi^{2}(p, \delta_{ij}/\gamma_{ij})$ multiplied by a scalar (say $\gamma_{ij}$), i.e. $s_{ij} \sim \gamma_{ij}\,\chi^{2}(p, \delta_{ij}/\gamma_{ij})$. We propose a method of MDS under this assumption; the purpose of the method is to construct a configuration $x_{i} \sim N(\mu_{i}, \alpha_{i}^{2} I_{p})$, $i = 1, 2, \ldots, n$.

Masahiro Mizuta, Hiroyuki Minami

### An Overall Index for Comparing Hierarchical Clusterings

In this paper we suggest a new index for measuring the distance between two hierarchical clusterings. This index can be decomposed into the contributions pertaining to each stage of the hierarchies. We show the relations of such components with the currently used criteria for comparing two partitions. We obtain a similarity index as the complement to one of the suggested distances and we propose its adjustment for agreement due to chance. We consider the extension of the proposed distance and similarity measures to more than two dendrograms and their use for the consensus of classification and variable selection in cluster analysis.

I. Morlini, S. Zani

### An Accelerated K-Means Algorithm Based on Adaptive Distances

Widely used cluster analysis methods such as K-means and spectral clustering require some measure of (pairwise) distance on the multivariate space. Unfortunately, distances are often dependent on the scales of the variables. In applications, this can become a crucial point. Here we propose an accelerated K-means technique that consists of two steps. First, an appropriate weighted Euclidean distance is established on the multivariate space. This step is based on univariate assessments of the importance of the variables for the cluster analysis task. Here, additionally, one gets a crude idea of the minimum number of clusters K. Subsequently, a fast K-means step follows based on random sampling. It is especially suited for the purpose of data reduction of massive data sets. From a theoretical point of view, it resembles MacQueen’s idea of clustering data over a continuous space. However, the main difference is that our algorithm examines only a random sample in a single pass. The proposed algorithm is used to solve a segmentation problem in an application to ecology.

Hans-Joachim Mucha, Hans-Georg Bartel
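The two steps described in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the variable weights are assumed to be given (the abstract derives them from univariate assessments that are not detailed here), the function names are hypothetical, and the centre update is the classical MacQueen-style running mean applied once to a random sample.

```python
import random


def weighted_sq_dist(x, c, w):
    """Squared weighted Euclidean distance between point x and centre c."""
    return sum(wi * (xi - ci) ** 2 for xi, ci, wi in zip(x, c, w))


def single_pass_kmeans(data, k, weights, sample_size, seed=0):
    """One pass over a random sample with running-mean centre updates.

    `weights` is an assumed, externally supplied vector of variable weights.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, min(sample_size, len(data)))
    # The first k sampled points seed the centres.
    centers = [list(p) for p in sample[:k]]
    counts = [1] * k
    for x in sample[k:]:
        # Assign the point to the nearest centre under the weighted distance.
        j = min(range(k), key=lambda i: weighted_sq_dist(x, centers[i], weights))
        counts[j] += 1
        eta = 1.0 / counts[j]
        # Move the winning centre towards the point (running mean).
        centers[j] = [c + eta * (xi - c) for c, xi in zip(centers[j], x)]
    return centers
```

Because each centre is a running mean of the points assigned to it, every centre stays inside the bounding box of the sample; the quality of the result depends on the sample and on the seeding, which this sketch does not try to optimise.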

### Bias-Variance Analysis of Local Classification Methods

In recent years an increasing number of so-called local classification methods have been developed. Local approaches to classification are not new. Well-known examples are the k nearest neighbors method and classification trees (e.g. CART). However, the term ‘local’ is usually used without further explanation of its particular meaning; we know neither which properties local methods have nor for which types of classification problems they may be beneficial. In order to address these problems we conduct a benchmark study. Based on 26 artificial and real-world data sets, selected local and global classification methods are analyzed in terms of the bias-variance decomposition of the misclassification rate. The results support our intuition that local methods exhibit lower bias compared to global counterparts. This reduction comes at the price of an only slightly increased variance, such that the total error rate may be improved.

Julia Schiffner, Bernd Bischl, Claus Weihs

### Effect of Data Standardization on the Result of k-Means Clustering

When clustering is applied to multivariate data that include some large-scale variables, the clustering results depend on those variables more than the user intends. In such cases, we should standardize the data to control the dependency. For high-dimensional data, Doherty et al. (Appl Soft Comput 7:203–210, 2007) argued numerically that data standardization by variable range leads to almost the same results regardless of the kind of norm, although Aggarwal et al. (Lect Notes Comput Sci 1973:420–434, 2001) showed theoretically that a fraction norm reduces the effect of the curse of high dimensionality on k-means results more than the Euclidean norm does. However, they have not considered the effects of standardization and factors properly. In this paper, we verify the effects of six data standardization methods with various norms and examine factors that affect the clustering results for high-dimensional data. As a result, we show that data standardization with the fraction norm reduces the effect of the curse of high dimensionality and gives a more effective result than either data standardization with the Euclidean norm or the fraction norm without data standardization.
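To make the two ingredients of this abstract concrete, the following sketch shows standardization by variable range and a Minkowski-type distance with exponent p, where p < 1 gives a fractional distance. The function names are hypothetical, and range scaling is only one of the six standardization methods the abstract mentions; the others are not specified here.

```python
def range_standardize(rows):
    """Scale every column to [0, 1] by its range; constant columns map to 0."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(row, lo, hi)]
        for row in rows
    ]


def minkowski(x, y, p):
    """Minkowski-type distance; for 0 < p < 1 this is a fractional distance
    (it no longer satisfies the triangle inequality)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


scaled = range_standardize([[0, 10], [5, 20], [10, 30]])
print(scaled)                                 # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(minkowski(scaled[0], scaled[2], 2))     # Euclidean distance
print(minkowski(scaled[0], scaled[2], 0.5))   # fractional distance, prints 4.0
```

A k-means-style procedure would then use `minkowski` with the chosen p in its assignment step; how the centres are updated under a fractional distance is a separate design decision that this sketch leaves open.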

### A Case Study on the Use of Statistical Classification Methods in Particle Physics

Current research in experimental particle physics is dominated by high-profile and large-scale experiments. One of the major tasks in these experiments is the selection of interesting or relevant events. In this paper we propose to use statistical classification algorithms for this task. To illustrate our method we apply it to a Monte Carlo (MC) dataset from the BaBar experiment. One of the major obstacles in constructing a classifier for this task is the imbalanced nature of the dataset. Only about 0.5% of the data are interesting events. The rest are background or noise events. We show how ROC curves can be used to find a suitable cutoff value to select a reasonable subset of a stream for further analysis. Finally, we estimate the CP asymmetry of the $B^{\pm} \rightarrow D K^{\pm}$ decay using the samples extracted by the classifiers.

Claus Weihs, Olaf Mersmann, Bernd Bischl, Arno Fritsch, Heike Trautmann, Till Moritz Karbach, Bernhard Spaan
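As a rough illustration of choosing a cutoff from a ROC curve, the sketch below computes the empirical ROC points of a scored sample and picks the threshold with the highest true positive rate among those whose false positive rate stays within a budget. The selection rule and the names `roc_points` and `pick_cutoff` are assumptions made for this example; the abstract does not state which criterion the authors use.

```python
def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) for every distinct score, highest first.

    labels are 1 for interesting events and 0 for background.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / negatives, tp / positives))
    return points


def pick_cutoff(scores, labels, max_fpr):
    """Best-TPR threshold whose FPR does not exceed max_fpr (assumed rule)."""
    best = None
    for t, fpr, tpr in roc_points(scores, labels):
        if fpr <= max_fpr and (best is None or tpr > best[2]):
            best = (t, fpr, tpr)
    return best


scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 0, 0, 0]
print(pick_cutoff(scores, labels, max_fpr=0.25))  # (0.7, 0.25, 1.0)
```

With a class share as small as the 0.5% quoted in the abstract, the false positive budget largely determines how big the selected subset is, so in practice it would be set from the capacity of the downstream analysis.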

### Problems of Fuzzy c-Means Clustering and Similar Algorithms with High Dimensional Data Sets

Fuzzy c-means clustering and its derivatives are very successful on many clustering problems. However, fuzzy c-means clustering and similar algorithms have problems with high dimensional data sets and a large number of prototypes. In particular, we discuss hard c-means, noise clustering, fuzzy c-means with a polynomial fuzzifier function and its noise variant. A special test data set that is optimal for clustering is used to show weaknesses of said clustering algorithms in high dimensions. We also show that a high number of prototypes influences the clustering procedure in a similar way as a high number of dimensions. Finally, we show that the negative effects of high dimensional data sets can be reduced by adjusting the parameter of the algorithms, i.e. the fuzzifier, depending on the number of dimensions.

Roland Winkler, Frank Klawonn, Rudolf Kruse
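The role of the fuzzifier can be seen in the standard fuzzy c-means membership formula, which is general background rather than something taken from this abstract. In the sketch below (hypothetical function name), a data point's memberships are computed from its distances to the prototypes; when all distances are nearly equal, as tends to happen in high dimensions, the memberships flatten towards 1/c, and a fuzzifier m closer to 1 sharpens them again.

```python
def fcm_memberships(dists, m=2.0):
    """Standard fuzzy c-means memberships of one point.

    dists: distances from the point to each of the c prototypes.
    m: fuzzifier, m > 1.
    """
    zeros = [d == 0 for d in dists]
    if any(zeros):
        # The point coincides with one or more prototypes: share membership there.
        share = 1.0 / sum(zeros)
        return [share if z else 0.0 for z in zeros]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((di / dk) ** exponent for dk in dists) for di in dists]


print(fcm_memberships([1.0, 3.0], m=2.0))    # approximately [0.9, 0.1]
print(fcm_memberships([1.0, 1.05], m=2.0))   # nearly equal distances: close to [0.5, 0.5]
print(fcm_memberships([1.0, 1.05], m=1.1))   # smaller fuzzifier: noticeably crisper
```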

### Reduced Versus Complete Space Configurations in Total Information Analysis

In most multidimensional analyses, dimension reduction is a key concept and reduced space analysis is routinely used. Contrary to this traditional approach, total information analysis (TIA) (Nishisato and Clavel, Behaviormetrika 37:15–32, 2010) places its focal point on tapping every piece of information in data. The present paper demonstrates that the time-honored practice of reduced space analysis may have to be reconsidered, as its grasp of data structure may be compromised by ignoring intricate details of data. The paper presents numerical examples to make this point.

José G. Clavel, Shizuhiko Nishisato

### A Geometrical Interpretation of the Horseshoe Effect in Multiple Correspondence Analysis of Binary Data

When a set of binary variables is analyzed by multiple correspondence analysis, a quadratic relationship between individual scores corresponding to the two largest characteristic roots is often observed. This phenomenon is called the horseshoe effect, which is well known as an artifact in the analysis of the perfect scale in Guttman’s sense, and is also observed in the quasi scale as a result of random errors. In addition, although errors are unsystematic and symmetric, scores corresponding to erroneous response patterns lie only inside the horseshoe. This phenomenon, which we will call the filled horseshoe, is explained by the concept of an affine projection of a hypercube that represents binary data. The image of the hypercube on the plane has the form of a zonotope, which is a convex and centrally symmetric polygon, and it is shown that images forming the horseshoe must lie along the vertices of the zonotope, if it exists, and hence, other images must reside inside it.

Takashi Murakami

### Quantification Theory: Reminiscence and a Step Forward

After a sketch of topics in my life-long work on quantification theory, two robust clustering procedures are proposed to complement a newly developed scaling procedure, called total information analysis (TIA), with numerical examples.

Shizuhiko Nishisato

### Modelling Rater Differences in the Analysis of Three-Way Three-Mode Binary Data

Using a basic latent class model for the analysis of three-way three-mode data (i.e. raters by objects by attributes) to cluster raters is often problematic because the number of conditional probabilities increases rapidly when extra latent classes are added. To solve this problem, Meulders et al. (J Classification 19:277–302, 2002) proposed a constrained latent class model in which object-attribute associations are explained on the basis of latent features. In addition, qualitative rater differences are introduced by assuming that raters may only take into account a subset of the features. As this model involves a direct link between the number of features $F$ and the number of latent classes (i.e., $2^{F}$), estimation of the model becomes slow when many latent features are needed to fit the data. In order to solve this problem we propose a new model in which rater differences are modelled by assuming that features can be taken into account with a certain probability which depends on the rater class. An EM algorithm is used to locate the posterior mode of the model and a Gibbs sampling algorithm is developed to compute a sample of the observed posterior of the model. Finally, models with different types of rater differences are applied to marketing data and the performance of the models is compared using posterior predictive checks (see also Meulders et al., Psychometrika 68:61–77, 2003).

Michel Meulders

### Reconstructing One-Mode Three-way Asymmetric Data for Multidimensional Scaling

Some models have been proposed to analyze one-mode three-way data [e.g. De Rooij and Gower (J Classification 20:181–220, 2003), De Rooij and Heiser (Br J Math Stat Psychol 53:99–119, 2000)]. These models usually assume triadic symmetric relationships. Therefore, it is common to transform asymmetric data into symmetric proximity data when one-mode three-way asymmetric proximity data are analyzed using multidimensional scaling. However, valuable information among objects is lost by symmetrizing asymmetric proximity data. It is necessary to devise this transformation so that valuable information among objects is not lost. For one-mode two-way asymmetric data, a method that makes the overall sums of the rows and columns equal was proposed by Harshman et al. (Market Sci 1:205–242, 1982). Their method is effective for analyzing data in which differences among the overall sums of the rows and columns are caused by external factors. Therefore, the present study proposes a method that extends the method of Harshman et al. (Market Sci 1:205–242, 1982) to one-mode three-way asymmetric proximity data. The proposed method reconstructs one-mode three-way asymmetric data so that the overall sums of the rows, columns and depths are made equal.

### Analysis of Car Switching Data by Asymmetric Multidimensional Scaling Based on Singular Value Decomposition

Car switching or car trade-in data among car categories were analyzed by a procedure of asymmetric multidimensional scaling. The procedure, which deals with one-mode two-way asymmetric similarities, was originally introduced to derive the centrality of the asymmetric social network. In the present procedure, the similarity from one car category to another is represented not by the distance in a multidimensional space, as in conventional multidimensional scaling, but by the weighted sum of areas with a positive or negative sign along the dimensions of a multidimensional space. The result of the analysis shows that attributes already revealed in previous studies accounted for car switching among car categories in a different manner from those studies, and that the result can be interpreted more easily.

### Visualization of Asymmetric Clustering Result with Digraph and Dendrogram

Asymmetric cluster analysis is one of the most useful methods together with asymmetric multidimensional scaling (MDS) to analyze asymmetric (dis)similarity data. In both methods, visualization of the result of the analysis plays an important role in the analysis. Some methods for visualizing the result of the asymmetric clustering and MDS have been proposed (Saito and Yadohisa, Data Analysis of Asymmetric Structures, Marcel Dekker, New York, 2005). In this paper, we propose a new visualization method for the result of asymmetric agglomerative hierarchical clustering with a digraph and a dendrogram. The visualization can represent asymmetric (dis)similarities between pairs of any clusters, in addition to the information of a traditional dendrogram, which is illustrated by analyzing the symmetric part of asymmetric (dis)similarity data. This visualization enables an intuitive interpretation of the asymmetry in (dis)similarity data.

### Clustering Temporal Population Patterns in Switzerland (1850–2000)

Spatial planning and quantitative geography face a great challenge in handling the growing amount of geospatial data and new statistics. Techniques of data mining and knowledge discovery are therefore presented to examine by time intervals (=15 decades) the population development of 2,896 Swiss communities. The key questions are how many temporal patterns will occur and what their characteristics are. Relative difference (RelDiff) is proposed as an alternative to relative change calculation. The detection of temporal patterns is based on mixture models and the Bayes theorem. A procedure of information optimization aims at selecting relevant temporal patterns for clustering. The use of a k-Nearest Neighbor classifier is based on the assumption that similar relevant temporal patterns are a good point of reference for the whole population development. The classification result is explained by significance with already existing classifications (e.g. central-periphery). Spatial visualization leads to verification in the mind of the spatial analyst and supports the process of knowledge conversion.

Martin Behnisch, Alfred Ultsch

### p-adic Methods in Stereo Vision

The so-called essential matrix relates corresponding points of two images from the same scene in 3D, and allows solving the relative pose problem for the two cameras up to a global scaling factor, if the camera calibrations are known. We will discuss how Hensel’s lemma from number theory can be used to find geometric approximations to solutions of the equations describing the essential matrix. Together with recent p […] p, a p-adic version of the classical RANSAC in stereo vision. This approach is motivated by the observation that using p-adic numbers often leads to more efficient algorithms than their real or complex counterparts.
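For readers unfamiliar with Hensel’s lemma, the following toy sketch lifts a simple root of a one-variable integer polynomial from modulo p to modulo p^k. It only illustrates the lemma itself; the equations for the essential matrix are multivariate, and nothing here reflects how the chapter applies the lemma. The function name and the example polynomial are illustrative choices.

```python
def hensel_lift(f, df, root, p, k):
    """Lift a simple root of f modulo p to a root modulo p**k.

    f, df: the polynomial and its derivative as integer-valued callables.
    Requires f(root) = 0 (mod p) and df(root) != 0 (mod p).
    Uses pow(x, -1, p), available in Python 3.8 and later.
    """
    if f(root) % p != 0 or df(root) % p == 0:
        raise ValueError("root must be a simple root of f modulo p")
    modulus = p
    r = root % p
    for _ in range(k - 1):
        modulus *= p
        # The inverse of f'(r) modulo p is enough for one linear lifting step.
        inverse = pow(df(r), -1, p)
        r = (r - f(r) * inverse) % modulus
    return r


# 3 is a square root of 2 modulo 7; lift it to modulo 7**3 = 343.
r = hensel_lift(lambda x: x * x - 2, lambda x: 2 * x, root=3, p=7, k=3)
print(r, (r * r - 2) % 343)  # 108 0
```

Each step adds one p-adic digit of precision, which is the sense in which successive lifts can be read as increasingly fine approximations to a solution.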

### Individualized Error Estimation for Classification and Regression Models

Estimating the error of classification and regression models is one of the most crucial tasks in machine learning. While the global error is capable of measuring the quality of a model, local error estimates are even more interesting: on the one hand they contribute to a better understanding of prediction models (where the model works well and where it does not), on the other hand they may provide powerful means to build successful ensembles that select for each region the most appropriate model(s). In this paper we introduce an extremely localized error estimation, called individualized error estimation (IEE), that estimates the error of a prediction model $M$ for each instance $x$ individually. To solve the problem of individualized error estimation, we apply a meta model $M^{*}$. We systematically investigate various combinations of elementary models $M$ and meta models $M^{*}$ on publicly available real-world data sets. Further, we illustrate the power of IEE in the context of time series classification: on 35 publicly available real-world time series data sets, we show that IEE is capable of enhancing state-of-the-art time series classification methods.

Krisztian Buza, Alexandros Nanopoulos, Lars Schmidt-Thieme
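The idea of a meta model that predicts the error of another model for each instance can be sketched as follows. This is an assumed, simplified setup and not the chapter's procedure: the elementary model is any callable on one-dimensional inputs, the meta model is a plain k-nearest-neighbour regressor, and it is trained on the absolute errors the elementary model makes on a held-out set. All names are hypothetical.

```python
def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x (1-D inputs)."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def fit_error_meta_model(model, holdout_x, holdout_y, k=3):
    """Return a meta model that estimates |model(x) - y| for a new instance x.

    The meta model is a k-NN regressor fitted to the held-out absolute errors.
    """
    errors = [abs(model(x) - y) for x, y in zip(holdout_x, holdout_y)]
    return lambda x: knn_predict(holdout_x, errors, x, k)


# A deliberately poor elementary model that always predicts 0.
model = lambda x: 0.0
meta = fit_error_meta_model(model, [0, 1, 2, 10, 11, 12], [0, 0, 0, 5, 5, 5])
print(meta(1))   # 0.0: the model is accurate in this region
print(meta(11))  # 5.0: the model is expected to be off by about 5 here
```

An ensemble could use such per-instance estimates to pick, for each x, the elementary model whose estimated error is smallest, which is the use the abstract hints at.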

### Evaluation of Spatial Cluster Detection Algorithms for Crime Locations

This comparative analysis examines the suitability of commonly applied local cluster detection algorithms. The spatial distribution of an observed spatial crime pattern for Houston, TX, for August 2005 is examined by three different cluster detection methods: the Geographical Analysis Machine, the Besag and Newell statistic, and Kulldorff’s spatial scan statistic. The results suggest that the size and locations of the detected clusters are sensitive to the chosen parameters of each method. Results also vary among the methods. We thus recommend applying multiple different cluster detection methods to the same data and looking for commonalities between the results. Most confidence will then be given to those spatial clusters that are common to as many methods as possible.

Marco Helbich, Michael Leitner

### Checking Serial Independence of Residuals from a Nonlinear Model

In this paper the serial independence tests known as SIS (Serial Independence Simultaneous) and SICS (Serial Independence Chi-Square) are considered. These tests are here contextualized in the model validation phase for nonlinear models in which the underlying assumption of serial independence is tested on the estimated residuals. Simulations are used to explore the performance of the tests, in terms of size and power, once a linear/nonlinear model is fitted on the raw data. Results underline that both tests are powerful against various types of alternatives.

Luca Bagnato, Antonio Punzo

### Analysis of Network Data Based on Probability Neighborhood Cliques

The authors present the concept of a “probability neighborhood clique” intended to substantiate the idea of a “community”, i.e. of a dense subregion within a (simple) network. For that purpose the notion of a clique is generalized in a probabilistic way. The probability neighborhoods employed for that purpose are indexed by one or two tuning parameters to bring out the “degree of denseness” respectively a hierarchy within that community. The paper, moreover, reviews other degree based concepts of communities and addresses algorithmic aspects.

Andreas Baumgart, Ulrich Müller-Funk

### A Comparison of Agglomerative Hierarchical Algorithms for Modularity Clustering

Modularity is a popular measure for the quality of a cluster partition. Primarily, its popularity originates from its suitability for community identification through maximization. Many algorithms to maximize modularity have been proposed in recent years. Agglomerative hierarchical algorithms in particular have been shown to be fast and to find clusterings with high modularity. In this paper we present several of these heuristics, discuss their problems and point out why some algorithms perform better than others. In particular, we analyze the influence of search heuristics on the balancedness of the merge process and show why the uneven contraction of a graph due to an unbalanced merge process leads to clusterings with comparably low modularity.

Michael Ovelgönne, Andreas Geyer-Schulz
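As background for this abstract, the sketch below computes the modularity of a given partition of an undirected, unweighted graph using the usual Newman-Girvan definition: for each community, the fraction of edges inside it minus the squared fraction of edge endpoints attached to it. The function name and the example graph are illustrative; the agglomerative heuristics compared in the chapter are not reproduced.

```python
def modularity(edges, community):
    """Modularity Q of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs, each edge listed once.
    community: dict mapping every vertex to its community label.
    """
    m = len(edges)
    intra = {}    # number of edges inside each community
    degree = {}   # total degree of each community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            intra[cu] = intra.get(cu, 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items())


# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(modularity(edges, split))  # 5/14, about 0.357
```

An agglomerative algorithm starts from singleton communities and repeatedly merges the pair that changes Q most favourably; the abstract's point is that the order in which such merges are searched affects how balanced the merge process is.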

### Network Data as Contiguity Constraints in Modeling Preference Data

In recent decades, regression-like preference models have found widespread application in marketing research. Conjoint Analysis models in particular have been used increasingly to analyze consumer preferences and simulate product positioning. The typical data structure of this kind of model can be enriched by supplementary information observed on respondents. We suppose that relational data observed on pairs of consumers are available. In such a case, the existence of a consumer network is introduced in the Conjoint model as a set of contiguity constraints among the respondents. The proposed approach brings together the theoretical framework of Social Network Analysis with the explicative power of Conjoint Analysis models. The combined use of relational and choice data could be usefully exploited in the framework of relational and tribal marketing strategies.

Giuseppe Giordano, Germana Scepi

### Clustering Coefficients of Random Intersection Graphs

Two general random intersection graph models (active and passive) were introduced by Godehardt and Jaworski (Exploratory Data Analysis in Empirical Research, Springer, Berlin, Heidelberg, New York, pp.68–81, 2002). Recently the models have been shown to have wide real life applications. The two most important ones are: non-metric data analysis and real life network analysis. Within both contexts, the clustering coefficient of the theoretical graph models is studied. Intuitively, the clustering coefficient measures how much the neighborhood of the vertex differs from a clique. The experimental results show that in large complex networks (real life networks such as social networks, internet networks or biological networks) there exists a tendency to connect elements, which have a common neighbor. Therefore it is assumed that in a good theoretical network model the clustering coefficient should be asymptotically constant. In the context of random intersection graphs, the clustering coefficient was first studied by Deijfen and Kets (Eng Inform Sci, 23:661–674, 2009). Here we study a wider class of random intersection graphs than the one considered by them and give the asymptotic value of their clustering coefficient. In particular, we will show how to set parameters – the sizes of the vertex set, of the feature set and of the vertices’ feature sets – in such a way that the clustering coefficient is asymptotically constant in the active (respectively, passive) random intersection graph.

Erhard Godehardt, Jerzy Jaworski, Katarzyna Rybarczyk
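The intuition that the clustering coefficient measures how far a vertex's neighbourhood is from a clique can be made concrete with the common empirical, per-vertex variant: the share of pairs of neighbours that are themselves adjacent. This is a sketch of that empirical quantity only; the chapter studies the coefficient of theoretical random graph models, which may be defined differently (for example as a conditional probability).

```python
from itertools import combinations


def local_clustering(adj, v):
    """Share of pairs of neighbours of v that are adjacent to each other.

    adj: dict mapping each vertex to the set of its neighbours (undirected).
    Returns 0.0 for vertices with fewer than two neighbours.
    """
    neighbours = adj[v]
    d = len(neighbours)
    if d < 2:
        return 0.0
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return links / (d * (d - 1) / 2)


adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(local_clustering(adj, 0))  # one of three neighbour pairs is linked
print(local_clustering(adj, 1))  # 1.0: the neighbourhood is a clique
```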

### Immersive Dynamic Visualization of Interactions in a Social Network

This paper is focused on the visualization of dynamic social networks, i.e. graphs whose edges model social relationships which evolve over time. In order to overcome the problem of discontinuities of the graphical representations computed by discrete methods, the proposed approach is a continuous one which updates the changes as soon as they happen in the visual restitution. The vast majority of the continuous approaches are restricted to 2D supports which do not optimally match human perception capabilities. We here present TempoSpring, a new interactive 3D visualization tool for dynamic graphs. This innovative tool relies on a force-directed layout method to span the 3D space along with several immersive setups (active stereoscopic system/visualization in a dome) to offer an efficient user experience. TempoSpring was initially developed in a particular application context: the analysis of sociability networks in French medieval peasant society.

Nicolas Greffard, Fabien Picarougne, Pascale Kuntz

### Fuzzy Boolean Network Reconstruction

Genes interact with each other in complex networks that enable the processing of information inside the cell. For an understanding of the cellular functions, the identification of the gene regulatory networks is essential. We present a novel reverse-engineering method to recover networks from gene expression measurements. Our approach is based on Boolean networks, which require the assignment of the label “expressed” or “not expressed” to an individual gene. However, techniques like microarray analyses provide real-valued expression values, consequently the continuous data have to be binarized. Binarization is often unreliable, since noise on gene expression data and the low number of temporal measurement points frequently lead to an uncertain binarization of some values. Our new approach incorporates this uncertainty in the binarized data for the inference process. We show that this new reconstruction approach is less influenced by noise which is inherent in these biological systems.

Martin Hopfensitz, Markus Maucher, Hans A. Kestler

### GIRAN: A Dynamic Graph Interface to Neighborhoods of Related Articles

This contribution reports on the development of GIRAN (Graph Interface to Related Article Neighborhoods), a distributed web application featuring a Java applet user front-end for browsing recommended neighborhoods within the network of Wikipedia articles. The calculation of the neighborhood is based on a graph analysis considering articles as nodes and links as edges. The more the link structure of articles is similar to the article of current interest, the more they are considered related and hence recommended to the user. The similarity strength is depicted in the graph view by means of the width of the edges. A Java applet dynamically displays the neighborhood of related articles in a clickable graph centered around the document of interest to the user. The local view moves along the complete article network when the user shows a new preference by clicking on one of the presented nodes. The path of selected articles is stored, can be displayed within the graph, and is accessible by the user; the content of the article of current interest is displayed next to the graph view. The graph of recommended articles is presented in a radial tree layout based on a minimum spanning tree with animated graph transitions featuring interpolations by polar coordinates to avoid crisscrossings. Further graph search tools and filtering techniques like a selectable histogram of Wikipedia categories and a text search are available as well. This contribution portrays the graph analysis methods for thinning out the graph, the dynamic user interface, as well as the service-oriented architecture of the application back-end.

Andreas W. Neumann, Kiryl Batsiukov

### Power Tags as Tools for Social Knowledge Organization Systems

Web services that allow users to collaboratively index and describe web resources with folksonomies are popular. In broad folksonomies, tag distributions can be observed for every single resource. Popular tags can be understood as an “implicit consensus” where users have a shared understanding of tags as best matching descriptors for the resource. We call these high-frequency tags “power tags”. If the collective intelligence of the users becomes visible in tags, we can conclude that power tags obtain the characteristics of a community-controlled vocabulary which allows the building of a social knowledge organization system (KOS). The paper presents an approach for building a folksonomy-based social KOS and results of a research project in which the relevance of assigned tags for particular URLs in the social bookmarking system delicious has been evaluated. Results show which tags were considered relevant and whether relevant tags can be found among power tags.

Isabella Peters

### The Degree Distribution in Random Intersection Graphs

We study the degree distribution in a general random intersection graph introduced by Godehardt and Jaworski (Exploratory Data Analysis in Empirical Research, pp. 68–81, Springer, Berlin, 2002). The model has been shown to be useful in many applications, in particular in the analysis of the structure of data sets. Recently Bloznelis (Lithuanian Math J 48:38–45, 2008) and independently Deijfen and Kets (Eng Inform Sci 23:661–674, 2009) proved that in many cases the degree distribution in the model follows a power law. We present an enhancement of the result proved by Bloznelis. We are able to strengthen the result by omitting the assumption on the size of the feature set. The new result is of considerable importance, since it shows that a random intersection graph can be used not only as a model of scale-free networks, but also as a model of a more important class of networks: complex networks.

Katarzyna Rybarczyk

### Application of a Community Membership Life Cycle Model on Tag-Based Communities in Twitter

Social networks are the backbone of Web 2.0. More than 500 million users are part of social networks like Twitter, Facebook, discussion boards or other virtual online communities. In this work we report on a first empirical study of the conceptual community membership life-cycle model of Sonnenbichler (A Community Membership Life Cycle Model, Sunbelt XIX International Social Network Conference, University of California, San Diego, USA, 2009) applied to message data from the micro-blogging service Twitter. Based on hash tags we analyze ad-hoc communities of Twitter, and we operationalize the roles of the conceptual model with the help of activity levels and the local interaction structure of community members. We analyze the development of roles over the lifetime of the community. Our explorative analysis supports the existence of the roles of the conceptual model and is a first step towards the empirical validation of the model and its operationalization. Knowledge of a community’s life-cycle model is of high importance for community service providers, as it makes it possible to influence the group structure: stage transitions can be supported or hindered, e.g. to strengthen the binding of a user to a site and keep communities alive.

Andreas C. Sonnenbichler, Christopher Bazant

### Measuring the Influence of Network Structures on Social Interaction over Time

Communication decisions in networks can be described as a two-level decision process. The second decision, about event receivers, is a multinomial logistic regression model with an unknown vector of parameters. These parameters evaluate network structures that enforce or weaken the probability of choosing certain actors. However, in many cases those parameters may change over time. In this paper a sliding window approach is introduced that can be used to understand whether behavior evolves in an observed data set. For future work, it is proposed to develop a statistical test on normalized decision statistics.

### Identifying Artificial Actors in E-Dating: A Probabilistic Segmentation Based on Interactional Pattern Analysis

We propose different behaviour and interaction related indicators of artificial actors (bots) and show how they can be separated from natural users in a virtual dating market. A finite mixture classification model is applied to the different behavioural and interactional information to classify users into bot vs. non-bot categories. Finally, the validity of the classification model and the impact of bots on sociodemographic distributions and scientific analysis are discussed.

Andreas Schmitz, Olga Yanenko, Marcel Hebing

### Calculating a Distributional Similarity Kernel using the Nyström Extension

The analysis of distributional similarities induced by word co-occurrences is an established tool for extracting semantically related words from a large text corpus. Based on the co-occurrence matrix $C$, the basic kernel matrix $K = CC^{T}$ reflects word–word similarities. In order to considerably improve the results, a similarity kernel matrix is expressed as $G = U_{k}U_{k}^{T}$, where $U_{k}$ are the first $k$ eigenvectors of the eigendecomposition $K = U\Sigma U^{T}$. Clearly, the bottleneck of this technique is the high computational demand for calculating the eigendecomposition. In our study we speed up the calculation of the low-rank similarity kernel by means of the Nyström extension. We address in detail the inherent challenge of the Nyström method, namely selecting appropriate kernel matrix columns in such a way that the fast approximation process yields satisfactory results. To illustrate the effectiveness of our method, we have built a thesaurus containing 32,000 entries based on 0.5 billion corpus words (nouns, verbs, adjectives and adverbs) extracted from the Project Gutenberg text collection.
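For orientation, the textbook form of the Nyström extension can be written as below. This is a generic sketch, not the column-selection scheme studied in the chapter. The symbols $n$, $m$, $K_{nm}$, $K_{mm}$, $U_{m}$, $\Lambda_{m}$ and $\tilde{U}$ are notation introduced here and do not appear in the abstract: $K$ is taken to be $n \times n$, $K_{nm}$ holds $m \ll n$ selected columns of $K$, and $K_{mm}$ is the $m \times m$ block of $K$ at the selected rows and columns.

$$K \approx K_{nm}K_{mm}^{+}K_{nm}^{T}, \qquad K_{mm} = U_{m}\Lambda_{m}U_{m}^{T}, \qquad \tilde{U} = \sqrt{\frac{m}{n}}\,K_{nm}U_{m}\Lambda_{m}^{-1}$$

Under these assumptions only the small matrix $K_{mm}$ needs an eigendecomposition, and the first $k$ columns of $\tilde{U}$ stand in for $U_{k}$ in a low-rank kernel of the form $U_{k}U_{k}^{T}$. The columns of $\tilde{U}$ are only approximately orthonormal, so in practice they may need to be re-orthogonalized.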

Markus Arndt, Ulrich Arndt

### Text Categorization in R: A Reduced N-Gram Approach

For the majority of Natural Language Processing methods, identifying the language of the processed text is one of the key tasks. Corresponding Natural Language Processing techniques often have language-specific conditions, e.g., selecting the correct stop word list or the correct set of rules for stemming. Among the various approaches to language identification or, more generally, text categorization, a rather large proportion is based on the character N-gram approach pioneered by Cavnar and Trenkle. In this contribution we show how to produce language and document profiles using a reduced version of Cavnar and Trenkle’s original algorithm. In addition, the performance of N-gram based text classification employing both the original and the reduced approach is compared. For this purpose, two groups of language profiles were used. One is composed of heterogeneous text data and the other is based solely on articles from Wikipedia. Within this context we present the R package textcat. It enables the user to generate language profile databases as well as document profiles and to perform text classifications according to both the original and the reduced N-gram approach.
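The profile idea can be illustrated with a minimal Python sketch of a Cavnar and Trenkle style rank-order comparison. It is not the textcat package's API and does not implement the reduced variant. The function names, the underscore padding, and the defaults `n_max=3` and `top=300` are assumptions made for this illustration.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top=300):
    """Return the most frequent character n-grams (n = 1..n_max), ranked.

    Tokens are padded with "_" on both sides (an assumption of this sketch).
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Descending frequency; ties are broken alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from the language profile
    receive a fixed maximum penalty equal to the profile length."""
    max_penalty = len(lang_profile)
    rank = {gram: r for r, gram in enumerate(lang_profile)}
    return sum(
        abs(r - rank[gram]) if gram in rank else max_penalty
        for r, gram in enumerate(doc_profile)
    )


def classify(text, lang_profiles, n_max=3, top=300):
    """Pick the language whose profile is closest to the document profile."""
    doc = ngram_profile(text, n_max=n_max, top=top)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))
```

In use, one would build `lang_profiles` as a dictionary mapping language names to profiles computed from training text and then call `classify` on a new document. Real profiles need far more training text than a single sentence to be reliable.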

Wilhelm M. Geiger, Johannes Rauch, Patrick Mair, Kurt Hornik

### HOMALS for Dimension Reduction in Information Retrieval

The usual data base for multiple correspondence analysis/homogeneity analysis consists of objects, characterised by categorical attributes. Its aims and ends are visualisation, dimension reduction and, to some extent, factor analysis using alternating least squares. As for dimension reduction, there are strong parallels between vector-based methods in Information Retrieval (IR) like the Vector Space Model (VSM) or Latent Semantic Analysis (LSA). The latter uses singular value decomposition (SVD) to discard a number of the smallest singular values and that way generates a lower-dimensional retrieval space. In this paper, the HOMALS technique is exploited for use in IR by categorising metric term frequencies in term-document matrices. In this context, dimension reduction is achieved by minimising the difference in distances between objects in the dimensionally reduced space compared to the full-dimensional space. An exemplary set of documents will be submitted to the process and later used for retrieval.

Kay F. Hildebrand, Ulrich Müller-Funk

### Feature Reduction and Nearest Neighbours

Feature reduction is a major preprocessing step in the analysis of high-dimensional data, particularly from biomolecular high-throughput technologies. Reduction techniques are expected to preserve the relevant characteristics of the data, such as neighbourhood relations. We investigate the neighbourhood preservation properties of feature reduction empirically and theoretically. Our results indicate that nearest and farthest neighbours are more reliably preserved than other neighbours in a reduced feature set.

Ludwig Lausser, Christoph Müssel, Markus Maucher, Hans A. Kestler

### Musical Instrument Recognition by High-Level Features

In this work different high-level features and MFCC are taken into account to classify single piano and guitar tones. The features are called high-level because they try to reflect the physical structure of a musical instrument on temporal and spectral levels. Three spectral features and one temporal feature are used for the classification task. The spectral features characterize the distribution of overtones and the temporal feature the energy of a tone. After calculating the features for each tone, classification by statistical methods is carried out. Variable selection is used and an interpretation of the selected variables is presented.

Markus Eichhoff, Claus Weihs

### The Recognition of Consonance Is not Impaired by Intonation Deviations: A Revised Theory

The recognition of musical intervals is investigated by comparing neurobiological and theoretical models (Tramo et al., The Biological Foundations of Music, Annals of the New York Academy of Sciences, 930, pp. 92–116, 2001; Ebeling, Verschmelzung und neuronale Autokorrelation als Grundlage einer Konsonanztheorie, Lang, Peter, Frankfurt/Main, 2007). The actual analyses focus on pitch tolerances of consonance identification. The mechanisms by which the listener tolerates deviations from the exact frequency ratio in the recognition of consonance differ between the models and the neurobiological process (pulse width and time latency). The neurobiological process is characterized by spontaneous neural activity, which is described by a Poisson distribution. Event-related activities may be displayed in interspike-interval (ISI) histograms and peri-event-time histograms (PETH). Consonant musical intervals are characterized by periodicity in all-order ISI histograms. This result is explained by the frequency ratio of the interval of pitches. These ISI histograms also display subharmonics, which are explainable as artifacts due to methodical issues. In contrast, the periodicity indicates the frequency of the residue. In order to adapt the model to reality, the width of the statistical distribution of the neural impulses should be considered. The spike analysis for the recognition of periodicity is investigated on the basis of the statistical distribution and compared with the statistical results of the listener’s assessment of musical intervals. The experimental data were taken from a study dealing with the assessment of intervals in a musical context (Fricke, Classification: The Ubiquitous Challenge, pp. 585–592, Berlin, Springer, 2005).

Jobst Peter Fricke

### Applying Multiple Kernel Learning to Automatic Genre Classification

In this paper we demonstrate the advantages of multiple-kernel learning in the application to music genre classification. Multiple-kernel learning provides the possibility to adaptively tune the kernel settings to each group of features independently. Our experiments show the improvement of classification performance in comparison to the conventional support vector machine classifier.

Hanna Lukashevich

### Multi-Objective Evaluation of Music Classification

Music classification targets the management of personal music collections or the recommendation of new songs. Several steps are required here: feature extraction and processing, selection of the most relevant features, and training of classification models. The complete classification chain is evaluated by a selected performance measure. Often standard confusion matrix based metrics like accuracy are calculated. However, it can be valuable to compare the methods using further metrics depending on the current application scenario. For this work we created a large empirical study for different music categories using several feature sets, processing methods and classification algorithms. The correlation between different metrics is discussed, and ideas for better algorithm evaluation are outlined.

Igor Vatolkin

### Partition Based Feature Processing for Improved Music Classification

Identifying desired music among the vast number of tracks in today’s music collections has become a task of increasing interest for consumers. Music classification based on perceptual features promises to help sort a collection according to personal music categories determined by the user’s personal taste and listening habits. The limits of processing power and storage space available in real (e.g. mobile) devices make it necessary to reduce the amount of feature data used by such classification. This paper compares several methods for feature pruning. Experiments on realistic track collections show that an approach attempting to identify relevant song partitions not only reduces the amount of processed feature data by 90% but also helps to improve classification accuracy. They indicate that a combination of structural information and temporal continuity processing of partition based classification helps to substantially improve overall performance.

Igor Vatolkin, Wolfgang Theimer, Martin Botteck

### Software in Music Information Retrieval

Music Information Retrieval (MIR) software is often applied to identify rules that classify audio music pieces into certain categories, such as genres. In this paper we compare the abilities of six MIR software packages in ten categories, namely operating systems, user interface, music data input, feature generation, feature formats, transformations and features, data analysis methods, visualization methods, evaluation methods, and further development. The overall rankings are derived from the estimated scores for the analyzed criteria.

Claus Weihs, Klaus Friedrichs, Markus Eichhoff, Igor Vatolkin

### Conditional Factor Models for European Banks

The objective of this study is to analyze the risk factors and their time-variability that may be well suited to explain the behavior of European bank stock returns. In order to test for the relative importance of risk factors over time, we employ a novel democratic orthogonalization procedure proposed by Klein and Chow (Orthogonalized equity risk premia and systematic risk decomposition. Working Paper, West Virginia University, 2010). The time-variability in estimated coefficients is further modeled by conditional regression specifications that incorporate macroeconomic as well as stock market based information variables. In a final step, these conditional multifactor models are evaluated on their ability to capture return information related to traditional cross-sectional variables. Overall, we provide empirical evidence on time-varying relative factor contributions for explaining European bank stock returns. Moreover, we conclude that conditional multifactor models explain a significant portion of the size, value and momentum effects in cross-sectional regressions.

Wolfgang Bessler, Philipp Kurmann

### Classification of Large Imbalanced Credit Client Data with Cluster Based SVM

Credit client scoring on medium sized data sets can be accomplished by means of Support Vector Machines (SVM), a powerful and robust machine learning method. However, real life credit client data sets are usually huge, containing up to hundreds of thousands of records, with good credit clients vastly outnumbering the defaulting ones. Such data pose severe computational barriers for SVM and other kernel methods, especially if all pairwise data point similarities are requested. Hence, methods which avoid extensive training on the complete data are in high demand. A possible solution is clustering as preprocessing, followed by classification on the more informative resulting data, such as cluster centers. Clustering variants which avoid the computation of all pairwise similarities robustly filter useful information from the large imbalanced credit client data set, especially when used in conjunction with a symbolic cluster representation. Subsequently, we construct credit client clusters representing both client classes, which are then used for training a non standard SVM adaptable to our imbalanced class set sizes. We also show that SVM trained on symbolic cluster centers result in classification models that outperform traditional statistical models as well as SVM trained on all our original data.

Ralf Stecking, Klaus B. Schebesch

### Fault Mining Using Peer Group Analysis

There has been increasing interest in deploying data mining methods for fault detection. For the case where we have potentially large numbers of devices to monitor, we propose to use peer group analysis to identify faults. First, we identify the peer group of each device. This consists of other devices that have behaved similarly. We then monitor the behaviour of a device by measuring how well the peer group tracks the device. Should the device’s behaviour deviate strongly from its peer group we flag the behaviour as an outlier. An outlier is used to indicate the potential occurrence of a fault. A device exhibiting outlier behaviour from its peer group need not be an outlier to the population of devices. Indeed a device exhibiting behaviour typical for the population of devices might deviate sufficiently far from its peer group to be flagged as an outlier. We demonstrate the usefulness of this property for detecting faults by monitoring the data output from a collection of privately run weather stations across the UK.
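The monitoring idea described above can be sketched in a few lines of Python. The abstract does not state which similarity measure or outlier statistic the authors use, so the Euclidean distance on past readings, the standardized deviation from the peer mean, and the names `peer_group`, `is_outlier`, `k` and `threshold` are all assumptions made for illustration.

```python
from statistics import mean, stdev


def peer_group(history, target, k):
    """Return the k devices whose past readings are closest to the target's.

    `history` maps a device name to a list of past readings (equal lengths).
    Euclidean distance is an assumption of this sketch.
    """
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    others = [d for d in history if d != target]
    # Ties in distance are broken by device name for reproducibility.
    return sorted(others, key=lambda d: (dist(history[d], history[target]), d))[:k]


def is_outlier(current, target, peers, threshold=3.0):
    """Flag the target if it deviates strongly from its peers' current readings.

    Needs at least two peers, because a sample standard deviation is used.
    """
    values = [current[p] for p in peers]
    spread = stdev(values)
    if spread == 0:
        return current[target] != values[0]
    return abs(current[target] - mean(values)) / spread > threshold
```

Because the comparison is made against the peer group only, a reading that is typical for the whole population can still be flagged if the device's peers have moved elsewhere, which is the property the abstract highlights.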

David J. Weston, Niall M. Adams, Yoonseong Kim, David J. Hand

### Discovering Possible Patterns Associations Among Drug Prescriptions

The constant growth of stored data on medication prescriptions makes it possible to obtain powerful and useful information by applying data mining techniques. The information retrieved from the patterns found in medication prescription data can lead to a wide range of new management solutions and possible service optimizations. In this work we present a study of medication prescriptions in the northern region of Portugal. The main goal is to find possible relations among medication prescriptions themselves, and between the medication prescribed by a doctor and the lab associated with those medications. Since studies of this kind are not available in Portugal, our results provide valuable information for those working in the area who need to make decisions in order to optimize resources within health institutions.

Joana Fernandes, Orlando Belo

### Cluster Analytic Strategy for Identification of Metagenes Relevant for Prognosis of Node Negative Breast Cancer

Worldwide, breast cancer is the second leading cause of cancer deaths in women. To gain insight into the processes related to the course of the disease, human genetic data can be used to identify associations between gene expression and prognosis. Moreover, the expression data of groups of genes may be aggregated to metagenes that may be used for investigating complex diseases like breast cancer. Here we introduce a cluster analytic approach for the identification of potentially relevant metagenes. In the first step of our approach we used gene expression patterns over time of erbB2 breast cancer MCF7 cell lines to obtain promising sets of genes for a metagene calculation. For this purpose, two cluster analytic approaches for short time-series of gene expression data – DIB-C and STEM – were applied to identify gene clusters with similar expression patterns. Among these we next focussed on groups of genes with transcription factor (TF) binding site enrichment or associated with a GO group. These gene clusters were then used to calculate metagenes of the gene expression data of 766 breast cancer patients from three breast cancer studies. In the last step of our approach Cox models were applied to determine the effect of the metagenes on the prognosis. Using this strategy we identified new metagenes that were associated with the metastasis-free survival of patients.

Evgenia Freis, Silvia Selinski, Jan G. Hengstler, Katja Ickstadt

### Image Clustering for Marketing Purposes

Clustering algorithms are standard tools for marketing purposes. For example, in market segmentation, they are applied to derive homogeneous customer groups. Recently, however, the resources available for this purpose have expanded. In social networks, for example, potential customers provide images, as well as other information such as profiles, contact lists, music or videos, which reflect their activities, interests, and opinions. Also, consumers are becoming more and more accustomed to selecting or uploading personal images during an online dialogue. In this paper we discuss how the application of clustering algorithms to such uploaded image collections can be used for deriving market segments. Software prototypes are discussed and applied.

Daniel Baier, Ines Daniel

### PLS-MGA: A Non-Parametric Approach to Partial Least Squares-based Multi-Group Analysis

This paper adds to an often applied extension of Partial Least Squares (PLS) path modeling, namely the comparison of PLS estimates across subpopulations, also known as multi-group analysis. Existing PLS-based approaches to multi-group analysis have the shortcoming that they rely on distributional assumptions. This paper develops a non-parametric PLS-based approach to multi-group analysis: PLS-MGA. Both the existing approaches and the new approach are applied to a marketing example of customer switching behavior in a liberalized electricity market. This example provides first evidence of favorable operation characteristics of PLS-MGA.

Jörg Henseler

### Love and Loyalty in Car Brands: Segmentation Using Finite Mixture Partial Least Squares

This study seeks to understand the relationship among brand love, inner self, social self, and loyalty perceived by users of three car brands. The model estimation includes structural equation analysis, using the PLS approach and applying finite mixture partial least squares (FIMIX-PLS) to segment the sample. The research findings showed that area of residence and age are the main differences that characterize the two uncovered customer segments. Car users of the large segment live mainly in the big city Oporto and are younger than car users of the small segment. For this small group, social self does not contribute to enriching brand love: they do not attach much importance to what others think of them, and so social aspects and social image are not key factors in creating passion for and attraction to the car brand. Indirectly, social identification is not important for reinforcing the intention to recommend and to buy a car of the same brand in the future. On the other hand, the cosmopolitan car users of the large segment consider that the car brand image should fit their inner self and the social group they belong to in order to improve their love of the brand and the intention to recommend and to buy a car of the same brand in the future.

Sandra Loureiro

### Endogeneity and Exogeneity in Sales Response Functions

Endogeneity and exogeneity are topics that are mainly discussed in macroeconomics. We show that sales response functions (SRF) are exposed to the same problem if we assume that the control variables in an SRF reflect behavioral reactions of the supply side. The supply side actions cover a flexible marketing component which could interact with the sales responses if sales managers decide to react quickly to new market situations. A recent article by Kao et al. (Evaluating the effectiveness of marketing expenditures, Working Paper, Ohio State University, Fisher College of Business, 2005) suggested using a class of production functions under constraints to estimate the sales responses that are subject to marketing strategies. In this paper we demonstrate this approach with a simple SRF(1) model that contains one endogenous variable. Such models can be extended by further exogenous variables, leading to SRF-X models. The new modeling approach leads to a multivariate equation system and will be demonstrated using data from a pharma-marketing survey in German regions.

Wolfgang Polasek

### Lead User Identification in Conjoint Analysis Based Product Design

Nowadays, the lead user method [von Hippel, Manag Sci 32(7):791–805, 1986; Lüthje et al. (Res Pol 34(6):951–965, 2005)] and conjoint analysis [Green and Rao (J Market Res 8(3):355–363, 1971), Baier and Brusch (Conjointanalyse: Methoden - Anwendungen - Praxisbeispiele, Springer, Heidelberg, 2009)] are widely used methods for (new) product design. Both methods collect and analyze customers’ preferences and use them for (optimal) product design. However, whereas the lead user method primarily creates breakthrough innovations [see von Hippel et al. (Harv Bus Rev 77(5):47–57, 1999)], conjoint analysis is more capable for incremental innovations [Helm et al. (Int J Manag Decis Making 9(3):242–26, 2008), Baier and Brusch (Conjointanalyse: Methoden - Anwendungen - Praxisbeispiele, Springer, Heidelberg, 2009)]. In this paper we extend conjoint analysis by lead user identification for the design of breakthrough innovations. The new procedure is compared to standard conjoint analysis in an empirical setting.

Alexander Sänn, Daniel Baier

### Improving the Validity of Conjoint Analysis by Additional Data Collection and Analysis Steps

Depending on the concrete application field and the data collection situation, conjoint experiments can end up with a low internal validity of the estimated part-worth functions. One of the known reasons for this is the (missing) temporal stability and structural reliability of the respondents’ part-worth functions; another is the (missing) attentiveness of the respondents in an uncontrolled data collection environment, e.g. during an online interview with many parallel web applications (e.g. electronic mail, newspapers or web site browsing). Here, additional data collection and analysis has been proposed as a solution. Examples of internal sources of data are response latencies, eye movements, or mouse movements; examples of external sources are sales and market data. The authors suggest alternative procedures for conjoint data collection that deal with these potential sources of internal validity. A comparison in an adaptive conjoint analysis setting shows that the new procedures lead to a higher internal validity.

Sebastian Selka, Daniel Baier, Michael Brusch

### The Impact of Missing Values on PLS Model Fitting

The analysis of interactive marketing campaigns frequently requires the investigation of latent constructs. Consequently, structural equation modeling is well established in this domain. Noticeably, the Partial-Least-Squares (PLS) algorithm is gaining popularity in the analysis of interactive marketing applications, which may be attributed to its accuracy and robustness when data are not normally distributed. Moreover, the PLS algorithm also appraises incomplete data. This study reports on a simulation experiment in which a set of complete observations is blended with different patterns of missing values. We consider the impacts on the overall model fit, the outer model fit, and the assessment of significance by bootstrapping. Our results cast serious doubts on PLS algorithms’ ability to cope with missing values in a data set.

Moritz Parwoll, Ralf Wagner

### Teachers’ Typology of Student Categories: A Cluster Analytic Study

The present study demonstrates the application of cluster analysis to examine the typology of student categories of novice Luxembourgish teachers. Student categories are mental representations of groups of students into which teachers classify their students. The investigation of student categories is a relevant topic in education, because subsequent assessments of students may be biased by prior classification. Eighty-two novice Luxembourgish teachers were asked to mention types of students they became acquainted with during teaching and to describe these types by characterizing attributes. Twenty types of students and 65 characterizing attributes were frequently mentioned by the teachers. These data formed the basis of a hierarchical-agglomerative cluster analysis, using average-linkage and complete-linkage clustering methods. The average-linkage method resulted in 10 clusters, which were largely resembled by the resulting clusters of the complete-linkage method. This indicates a clear structure in the student categories of Luxembourgish novice teachers. The clusters are compared to Hofer’s (Informationsverarbeitung und Entscheidungsverhalten von Lehrern, Beiträge zu einer Handlungstheorie des Unterrichtens, Urban & Schwarzenberg, München, 1981) typology of student categories. The comparison leads to the assumption that the content of student categories may be partly influenced by educational and political discussion.

Thomas Hörstermann, Sabine Krolak-Schwerdt

Jonas Kunze, Andreas Geyer-Schulz

### Students Reading Motivation: A Multilevel Mixture Factor Analysis

Latent variable modeling is a commonly used data analysis tool in social sciences and other applied fields. The most popular latent variable models are factor analysis (FA) and latent class analysis (LCA). FA assumes that there are one or more continuous latent variables – called factors – determining the responses on a set of observed variables, while LCA assumes that there is an underlying categorical latent variable – latent classes. Mixture FA is a recently proposed combination of these two models which includes both continuous and categorical latent variables. It simultaneously determines the dimensionality (factors) and the heterogeneity (latent classes) of the observed data. Both in the social sciences and in the biomedical field, researchers often encounter multilevel data structures. These are usually analyzed using models with random effects. Here, we present a hierarchical extension of FA called multilevel mixture factor analysis (MMFA) (Varriale and Vermunt, Multilevel mixture factor models, Under review). As in multilevel LCA (Vermunt, Sociol Methodol 33:213–239, 2003), the between-group heterogeneity is modeled by assuming that higher-level units belong to one of $K$ latent classes. The key difference from standard mixture FA is that the discrete mixing distribution is at the group level rather than at the individual level. We present an application of MMFA in educational research. More specifically, an FA structure is used to measure the various dimensions underlying pupils’ reading motivation. We assume that there are latent classes of teachers which differ in their ability to motivate children.

Daniele Riggi, Jeroen K. Vermunt

### Short Term Dynamics of Tourist Arrivals: What Do Italian Destinations Have in Common?

This work aims to detect the common short term dynamics to yearly time series of 413 Italian tourist areas. We adopt the clustering technique of Abraham et al. (Scand J Stat. 30:581–595, 2003) who propose a two-stage method which fits the data by B-splines and partitions the estimated model coefficients using a $k$-means algorithm. The description of each cluster, which identifies a specific kind of dynamics, is made through simple descriptive cross tabulations in order to study how the location of the areas across the regions or their prevailing typology of tourism characterize each group.

Anna Maria Parroco, Raffaele Scuderi

### A Measure of Polarization for Tourism: Evidence from Italian Destinations

This paper proposes an index of polarization for tourism which links the axiomatic theory of Esteban and Ray with the classical hierarchical agglomerative clustering techniques. The index is aimed at analyzing the dynamics of the average length of stay across Italian destinations, and more specifically to detect whether the polarization within the set of clusters of places with similar values of the indicator has varied over time.

Raffaele Scuderi

### Backmatter
