
2006 | Book

From Data and Information Analysis to Knowledge Engineering

Proceedings of the 29th Annual Conference of the Gesellschaft für Klassifikation e.V. University of Magdeburg, March 9–11, 2005

Edited by: Professor Dr. Myra Spiliopoulou, Professor Dr. Rudolf Kruse, Dr. Christian Borgelt, Jun.-Professor Dr. Andreas Nürnberger, Professor Dr. Wolfgang Gaul

Publisher: Springer Berlin Heidelberg

Book series: Studies in Classification, Data Analysis, and Knowledge Organization


Table of Contents

Frontmatter

Plenaries and Semi-plenaries

Boosting and ℓ1-Penalty Methods for High-dimensional Data with Some Applications in Genomics

We consider Boosting and ℓ1-penalty (regularization) methods for prediction and model selection (feature selection) and discuss some relations among the approaches. While Boosting was originally proposed in the machine learning community (Freund and Schapire (1996)), ℓ1-penalization has been developed in numerical analysis and statistics (Tibshirani (1996)). Both methods are attractive for very high-dimensional data: they are computationally feasible and statistically consistent (e.g. Bayes risk consistent) even when the number of covariates (predictor variables) p is much larger than the sample size n and the true underlying function (mechanism) is sparse: e.g. we allow for arbitrary polynomial growth p = p_n = O(n^γ) for any γ > 0. We demonstrate high-dimensional classification, regression and graphical modeling and outline examples from genomic applications.

Peter Bühlmann
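
As a concrete illustration of the boosting side of the abstract above, here is a minimal sketch of componentwise L2-boosting for a sparse high-dimensional regression; the toy data, step length and stopping rule are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def l2_boost(X, y, n_steps=100, nu=0.1):
    """Componentwise L2-boosting: in each step, fit the current residuals by
    simple least squares on the single best predictor and take a shrunken
    step of length nu."""
    n, p = X.shape
    intercept = y.mean()
    residual = y - intercept
    coef = np.zeros(p)
    for _ in range(n_steps):
        # least-squares coefficient of the residual on each predictor separately
        betas = X.T @ residual / (X ** 2).sum(axis=0)
        sse = ((residual[:, None] - X * betas) ** 2).sum(axis=0)
        j = int(np.argmin(sse))          # best-fitting component
        coef[j] += nu * betas[j]         # shrunken update
        residual -= nu * betas[j] * X[:, j]
    return intercept, coef

# toy example with p >> n and a sparse true coefficient vector
rng = np.random.default_rng(0)
n, p = 50, 500
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.standard_normal(n)
intercept, coef = l2_boost(X, y)
print("selected components:", np.flatnonzero(np.abs(coef) > 0.1))
```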
Striving for an Adequate Vocabulary: Next Generation ‘Metadata’

Digital Libraries (DLs) in general and technical or cultural preservation applications in particular offer a rich set of multimedia objects like audio, music, images, videos, and 3D models. But instead of handling these objects consistently as regular documents — in the same way we handle text documents — most applications handle them differently. This is due to the fact that ‘standard’ tasks like content categorization, indexing, content representation or summarization have not yet been developed to a stage where DL technology could readily apply them to these types of documents. Instead, these tasks have to be done manually, making the activity almost prohibitively expensive. Consequently, the most pressing research challenge is the development of an adequate ‘vocabulary’ to characterize the content and structure of non-textual documents as the key to indexing, categorization, dissemination and access.

We argue that textual metadata items are insufficient for describing images, videos, 3D models, or audio adequately. A new type of generalized vocabulary is needed that makes it possible to express semantic information — a prerequisite for retrieval of generalized documents based on their content, rather than on static textual annotations. The crucial question is which methods and which types of technology will best support the definition of vocabularies and ontologies for non-textual documents.

We present one such method for the domain of 3D models. Our approach makes it possible to differentiate between the structure and the appearance of a 3D model, and we believe that this formalism can be generalized to other types of media.

Dieter Fellner, Sven Havemann
Scalable Swarm Based Fuzzy Clustering

Iterative fuzzy clustering algorithms are sensitive to initialization. Swarm based clustering algorithms are able to perform a broader search for the best extrema. A swarm inspired clustering approach which searches the space of fuzzy cluster centroids is discussed. An evaluation function based on fuzzy cluster validity was used. A swarm based clustering algorithm can be computationally intensive, and a data distributed approach to clustering is shown to be effective. It is shown that the swarm based clustering results in excellent data partitions. Further, it is shown that the use of a cluster validity metric as the evaluation function enables the discovery of the number of clusters in the data in an automated way.

Lawrence O. Hall, Parag M. Kanade
SolEuNet: Selected Data Mining Techniques and Applications

Data mining is concerned with the discovery of interesting patterns and models in data. In practice, data mining has become an established technology with applications in a wide range of areas that include marketing, health care, finance and environmental planning, as well as e-commerce and e-science. This paper presents selected data mining techniques and applications developed in the course of the SolEuNet 5FP IST project Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise (2000–2003).

Nada Lavrač
Inferred Causation Theory: Time for a Paradigm Shift in Marketing Science?

Over the last two decades the analytical toolbox for examining the properties needed to claim causal relationships has been significantly extended. New approaches to the theory of causality rely on the concept of ‘intervention’ instead of ‘association’. Under an axiomatic framework they elaborate the conditions for safe causal inference from nonexperimental data.

Inferred Causation Theory (ICT) (Spirtes et al., 2000; Pearl, 2000) teaches us that the same independence relationships (or covariance matrix) may have been generated by numerous other graphs representing the cause-effect hypotheses. ICT combines elements of graph theory, statistics, logic, and computer science. It is not limited to parametric models in need of quantitative (ratio or interval scaled) data, but also operates much more generally on the observed conditional independence relationships among a set of qualitative (categorical) observations. Causal inference does not appear to be restricted to experimental data. This is particularly promising for research domains such as consumer behavior, where policy makers and managers are unwilling to engage in experiments on real markets. A case example highlights the potential use of Inferred Causation methodology for analyzing marketing researchers’ belief systems about their scientific orientation.

Josef A. Mazanec
Text Mining in Action!

Text mining methods have been successfully used on different problems where text data is involved. Some text mining approaches are capable of handling text relying only on statistics such as the frequency of words or phrases, while others assume the availability of additional resources such as natural language processing tools for the language in which the text is written, lexicons, ontologies of concepts, aligned corpora in several languages, or additional data sources such as links between the text units or other non-textual data. This paper aims at illustrating the potential of text mining by presenting several approaches having some of the listed properties. For this purpose, we present research applications that were developed mainly inside European projects in collaboration with end-users, and research prototypes that do not necessarily involve end-users.

Dunja Mladenič
Identification of Real-world Objects in Multiple Databases

Object identification is an important issue for the integration of data from different sources. The identification task is complicated if no global and consistent identifier is shared by the sources. Then, object identification can only be performed through the identifying information that the objects’ data itself provides. Unfortunately, real-world data is dirty, hence identification mechanisms like natural keys mostly fail — we have to take care of the variations and errors in the data. Consequently, object identification can no longer be guaranteed to be fault-free. Several methods tackle the object identification problem, e.g. Record Linkage or the Sorted Neighborhood Method.

Based on a novel object identification framework, we assessed data quality and evaluated different methods on real data. One main result is that scalability is determined by the applied preselection technique and the usage of efficient data structures. As another result we can state that Decision Tree Induction achieves better correctness and is more robust than Record Linkage.

Mattis Neiling
Kernels for Predictive Graph Mining

In many application areas, graphs are a very natural way of representing structural aspects of a domain. While most classical algorithms for data analysis cannot directly deal with graphs, recently there has been increasing interest in approaches that can learn general classification models from graph-structured data. In this paper, we summarize and review the line of work that we have been following in recent years on making a particular class of methods suitable for predictive graph mining, namely the so-called kernel methods. Firstly, we state a result on fundamental computational limits to the possible expressive power of kernel functions for graphs. Secondly, we present two alternative graph kernels, one based on walks in a graph, the other based on cycle and tree patterns. The paper concludes with an empirical evaluation on a large chemical data set.

Stefan Wrobel, Thomas Gärtner, Tamás Horváth

Clustering

PRISMA: Improving Risk Estimation with Parallel Logistic Regression Trees

Logistic regression is a very powerful method to estimate models with binary response variables. With the previously suggested combination of tree-based approaches with local, piecewise valid logistic regression models in the nodes, interactions between the covariates are directly conveyed by the tree and can be interpreted more easily. We show that the restriction of partitioning the feature space only at the single best attribute limits the overall estimation accuracy. Here we suggest Parallel RecursIve Search at Multiple Attributes (PRISMA) and demonstrate how the method can significantly improve risk estimation models in heart surgery and successfully perform a benchmark on three UCI data sets.

Bert Arnrich, Alexander Albert, Jörg Walter
Latent Class Analysis and Model Selection

This paper discusses model selection for latent class (LC) models. A large experimental design is set up that allows comparing the performance of different information criteria for these models, some of which are compared for the first time. Furthermore, the level of separation of latent classes is controlled using a new procedure. The results show that AIC3 (Akaike information criterion with 3 as penalizing factor) outperforms other model selection criteria for LC models.

José G. Dias
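
The criteria compared in this abstract are simple functions of the maximized log-likelihood and the number of free parameters; the following sketch (with hypothetical fit values) shows how AIC3 differs from AIC and BIC only in its penalty factor.

```python
import math

def information_criteria(log_lik, n_params, n_obs):
    """Model selection criteria for a fitted latent class model.
    AIC3 replaces AIC's penalty factor 2 with 3."""
    return {
        "AIC":  -2 * log_lik + 2 * n_params,
        "AIC3": -2 * log_lik + 3 * n_params,
        "BIC":  -2 * log_lik + math.log(n_obs) * n_params,
    }

# hypothetical fits with 2, 3 and 4 latent classes: (log-likelihood, #parameters)
fits = {2: (-1250.4, 13), 3: (-1228.9, 20), 4: (-1225.1, 27)}
for k, (ll, npar) in fits.items():
    print(k, information_criteria(ll, npar, n_obs=500))
```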
An Indicator for the Number of Clusters: Using a Linear Map to Simplex Structure

The problem of clustering data can be formulated as a graph partitioning problem. In this setting, spectral methods for obtaining optimal solutions have received a lot of attention recently. We describe Perron Cluster Cluster Analysis (PCCA) and establish a connection to spectral graph partitioning. We show that in our approach a clustering can be efficiently computed by mapping the eigenvector data onto a simplex. To deal with the prevalent problem of noisy and possibly overlapping data we introduce the Min-chi indicator, which helps in confirming the existence of a partition of the data and in selecting the number of clusters with quite favorable performance. Furthermore, if no hard partition exists in the data, the Min-chi can guide in selecting the number of modes in a mixture model. We close by showing results on simulated data generated by a mixture of Gaussians.

Marcus Weber, Wasinee Rungsarityotin, Alexander Schliep

Discriminant Analysis

On the Use of Some Classification Quality Measure to Construct Mean Value Estimates Under Nonresponse

Several procedures have been developed for estimating the mean value of a population characteristic under nonresponse. Usually estimators use available auxiliary information as a basis for the nonresponse correction. Some of them rely on classification procedures which divide the population under study into subsets of units that are similar to sample respondents or sample nonrespondents. This makes it possible to approximate the proportions of the respondent and nonrespondent strata in the population. Nonrespondents are then subsampled and estimates of population parameters are constructed. Such estimators are more accurate than the standard estimator for a two-phase sample when the distributions of auxiliary variables in the respondent and nonrespondent strata differ significantly. However, when these distributions are similar, the improvement disappears and the classification-based estimator may be less accurate than the standard one. In this paper another mean value estimator is proposed in order to eliminate this disadvantage. It is constructed as a combination of a standard (unbiased) two-phase estimator and a classification-based estimator. The weights of this combination are functions of some classification quality measure. The proposed mean value estimator should behave like a classification-based estimator when auxiliary characteristics seem to be useful for classification, and behave like a standard estimator otherwise. The results of Monte Carlo simulation experiments aimed at assessing the properties of the proposed combined estimator are presented.

Wojciech Gamrot
A Wrapper Feature Selection Method for Combined Tree-based Classifiers

The aim of feature selection is to find the subset of features that maximizes the classifier performance. Recently, we have proposed a correlation-based feature selection method for classifier ensembles based on the Hellwig heuristic (CFSH).

In this paper we show that further improvement of the ensemble accuracy can be achieved by combining the CFSH method with the wrapper approach.

Eugeniusz Gatnar
Input Variable Selection in Kernel Fisher Discriminant Analysis

Variable selection serves a dual purpose in statistical classification problems: it enables one to identify the input variables which separate the groups well, and a classification rule based on these variables frequently has a lower error rate than the rule based on all the input variables. Kernel Fisher discriminant analysis (KFDA) is a recently proposed powerful classification procedure, frequently applied in cases characterized by large numbers of input variables. The important problem of eliminating redundant input variables before implementing KFDA is addressed in this paper. A backward elimination approach is employed, and a criterion which can be used for recursive elimination of input variables is proposed. The merit of the proposal is evaluated in a simulation study and in terms of its performance when applied to two benchmark data sets.

Nelmarie Louw, Sarel J. Steel
The Wavelet Packet Based Cepstral Features for Open Set Speaker Classification in Marathi

In this paper, a new method of feature extraction based on perceptually meaningful subband decomposition of the speech signal is described. Dialectal zone based speaker classification in the Marathi language has been attempted in the open set mode using a polynomial classifier. The method consists of dividing the speech signal into nonuniform subbands in approximate Mel-scale using an admissible wavelet packet filterbank and modeling each dialectal zone with 2nd and 3rd order polynomial expansions of the feature vector.

Hemant A. Patil, P. K. Dutta, T. K. Basu
A New Effective Algorithm for Stepwise Principle Components Selection in Discriminant Analysis

The problem of reducing the dimensionality of multivariate Gaussian observations is considered. The efficiency of a discriminant analysis procedure based on the well-known method of principal components selection is analytically investigated. The average decrease of the squared interclass distances is presented as a new criterion for feature selection directly connected with the classification error probability. A new stepwise discriminant analysis procedure in the space of principal components based on this criterion is proposed, and its efficiency is experimentally and analytically investigated.

Ekaterina Serikova, Eugene Zhuk
A Comparison of Validation Methods for Learning Vector Quantization and for Support Vector Machines on Two Biomedical Data Sets

We compare two comprehensive classification algorithms, support vector machines (SVM) and several variants of learning vector quantization (LVQ), with respect to different validation methods. The generalization ability is estimated by the “multiple-hold-out” (MHO) and the “leave-one-out” (LOO) cross validation method. The ξα-method, a further estimation method which is only applicable to SVM and is computationally more efficient, is also used.

Calculations on two different biomedical data sets generated from experimental data measured in our own laboratory are presented. The first data set contains 748 feature vectors extracted from posturographic signals which were obtained in investigations of balance control in upright standing of 48 young adults. The two classes are labelled as “without alcoholic impairment” and “with alcoholic impairment”. This classification task aims at the detection of small unknown changes in a relatively complex signal with high inter-individual variability.

The second data set contains 6432 feature vectors extracted from electroencephalographic and electrooculographic signals recorded during overnight driving simulations of 22 young adults. Short intrusions of sleep during driving, so-called microsleep events, were observed. They form the examples of the first class. The second class contains examples of fatigue states in which driving is still possible. If microsleep events happen in typical states of brain activity, the recorded signals should contain typical alterations, and therefore discrimination from signals of the second class, which do not refer to such states, should be possible.

Optimal kernel parameters of the SVM are found by searching for minimal test errors with all three validation methods. Results obtained on the two biomedical data sets show different optimal kernel parameters depending on the validation method. It is shown that the ξα-method seems to be biased and that therefore the LOO or MHO method should be preferred.

A comparison of eight different variants of LVQ and six other classification methods using MHO validation shows that SVM performs best for the second and more complex data set, while SVM, GRLVQ and OLVQ1 show nearly the same performance for the first data set.

David Sommer, Martin Golz
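
A rough sketch of how MHO and LOO validation could be set up for tuning an SVM kernel parameter, using scikit-learn on stand-in data (the posturographic/EEG feature vectors of the paper are not available here); the ξα-method is omitted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, ShuffleSplit, cross_val_score
from sklearn.svm import SVC

# stand-in data in place of the biomedical feature vectors
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

for gamma in (0.001, 0.01, 0.1):            # candidate RBF kernel parameters
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0)
    # "multiple hold-out": repeated random train/test splits
    mho = ShuffleSplit(n_splits=25, test_size=0.2, random_state=0)
    mho_err = 1.0 - cross_val_score(clf, X, y, cv=mho).mean()
    # leave-one-out cross-validation
    loo_err = 1.0 - cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()
    print(f"gamma={gamma}: MHO error={mho_err:.3f}, LOO error={loo_err:.3f}")
```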
Discriminant Analysis of Polythetically Described Older Palaeolithic Stone Flakes: Possibilities and Questions

Archaeological inventories of flaked stone artefacts are the most important sources for the reconstruction of mankind’s earliest history. It is necessary also to evaluate the blanks of tool production (“waste”), the most numerous artefact category, using statistical methods based on features like absolute measurements and form quotients of the pieces and their striking platforms, the flaking angles, and the dorsal degradation data. In Central Europe, three major chrono-technological groups of finds can be distinguished: from the Middle Pleistocene interglacial(s) 250,000 or 300,000 years ago, the Early Saalian glacial perhaps 200,000 years ago, and the Early Weichselian glacial 100,000-60,000 years ago — represented by the inventories from Wallendorf, Markkleeberg, and Königsaue B. In this study, the attempt has been undertaken to separate these flake inventories using linear discriminant analysis and to use the results for comparison with other artefact complexes with rather unclear chrono-technological positions.

Thomas Weber

Classification with Latent Variable Models

Model-based Density Estimation by Independent Factor Analysis

In this paper we propose a model-based density estimation method which is rooted in Independent Factor Analysis (IFA). IFA is a generative latent variable model whose structure closely resembles that of an ordinary factor model, but which assumes that the latent variables are mutually independent and distributed according to Gaussian mixtures. From these assumptions it follows that the observed data density can also be modelled as a mixture of Gaussian distributions. The number of free parameters is controlled through the dimension of the latent factor space. The model is proved to be a special case of the mixture of factor analyzers model that is less heavily parameterized than the original proposal by McLachlan and Peel (2000). We illustrate the use of IFA density estimation for supervised classification on both real and simulated data.

Daniela G. Calò, Angela Montanari, Cinzia Viroli
Identifying Multiple Cluster Structures Through Latent Class Models

Many studies addressing the problem of selecting or weighting variables for cluster analysis assume that all the variables define a unique classification of the units. However, it is also possible that different classifications of the units can be obtained from different subsets of variables. In this paper this problem is considered from a model-based perspective. Limitations and drawbacks of standard latent class cluster analysis are highlighted, and a new procedure able to overcome these difficulties is proposed. The results obtained from the application of this procedure to simulated and real data sets are presented and discussed.

Giuliano Galimberti, Gabriele Soffritti
Gene Selection in Classification Problems via Projections onto a Latent Space

The analysis of gene expression data involves the observation of a very large number of variables (genes) on a few units (tissues). In such a context, recourse to conventional classification methods may be difficult for both analytical and interpretative reasons. In this work a gene selection procedure for classification problems is addressed. The dimensionality reduction is based on the projections of genes along suitable directions obtained by Independent Factor Analysis (IFA). The performance of the proposed procedure is evaluated in the context of both supervised and unsupervised classification problems for different real data sets.

Marilena Pillati, Cinzia Viroli

Multiway Classification and Data Analysis

The Recovery Performance of Two-mode Clustering Methods: Monte Carlo Experiment

In this paper, a Monte Carlo study on the performance of two-mode cluster methods is presented. The synthetic data sets were generated to correspond to two types of data, consisting of overlapping as well as disjoint clusters. Furthermore, the data sets differed in cluster number, degrees of within-group homogeneity and between-group heterogeneity, as well as degree of cluster overlap. We found that the methods performed very differently depending on the type of data, the number of clusters, homogeneity and cluster overlap.

Sabine Krolak-Schwerdt, Michael Wiedenbeck
On the Comparability of Reliability Measures: Bifurcation Analysis of Two Measures in the Case of Dichotomous Ratings

The problem of analysing interrater agreement and reliability is known both in human decision making and in machine interaction. Several measures have been developed over the last 100 years for this purpose, with Cohen’s kappa coefficient being the most popular one. Due to methodological considerations, the validity of kappa-type measures for interrater agreement has been discussed in a variety of papers. However, a global comparison of the properties of these measures is still lacking. In our approach, we constructed an integral measure to evaluate the differences between two reliability measures for dichotomous ratings. Additionally, we studied bifurcation properties of the difference of these measures to quantify areas of minimal differences. From the methodological point of view, our integral measure can also be used to construct other measures for interrater agreement.

Thomas Ostermann, Reinhard Schuster

Ranking, Multi-label Classification, Preferences

On Active Learning in Multi-label Classification

In conventional multiclass classification learning, we seek to induce a prediction function from the domain of input patterns to a mutually exclusive set of class labels. As a straightforward generalization of this category of learning problems, so-called multi-label classification allows input patterns to be associated with multiple class labels simultaneously. Text categorization is a domain of particular relevance which can be viewed as an instance of this setting. While the process of labeling input patterns for generating training sets already constitutes a major issue in conventional classification learning, it becomes an even more substantial issue in the more complex multi-label classification setting. We propose a novel active learning strategy for reducing the labeling effort and conduct an experimental study on the well-known Reuters-21578 text categorization benchmark dataset to demonstrate the efficiency of our approach.

Klaus Brinker
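
A minimal sketch of uncertainty-based selection for multi-label active learning on synthetic data; the margin-based query rule and the one-vs-rest linear SVM are illustrative assumptions, not necessarily the strategy proposed in the paper.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# synthetic stand-in for a multi-label text corpus
X, Y = make_multilabel_classification(n_samples=500, n_features=50,
                                      n_classes=5, random_state=0)
rng = np.random.default_rng(0)
labeled = list(rng.choice(len(X), size=40, replace=False))
pool = [i for i in range(len(X)) if i not in labeled]

for _ in range(10):                              # ten active-learning rounds
    clf = OneVsRestClassifier(LinearSVC()).fit(X[labeled], Y[labeled])
    margins = np.abs(clf.decision_function(X[pool]))  # distances to the hyperplanes
    # query the pool example whose least certain label lies closest to a boundary
    query = pool[int(np.argmin(margins.min(axis=1)))]
    labeled.append(query)                        # an oracle would now supply Y[query]
    pool.remove(query)

print("queried examples:", labeled[-10:])
```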
From Ranking to Classification: A Statistical View

In applications related to information retrieval, the goal is not only to build a classifier for deciding whether a document x among a list χ is relevant or not, but to learn a scoring function s : χ → ℝ for ranking all possible documents with respect to their relevancy. Here we show how the bipartite ranking problem boils down to binary classification with dependent data when accuracy is measured by the AUC criterion. The natural estimate of the risk being of the form of a U-statistic, consistency of methods based on empirical risk minimization is studied using the theory of U-processes. Taking advantage of this specific form, we prove that fast rates of convergence may be achieved under general noise assumptions.

Stéphan Clémençon, Gábor Lugosi, Nicolas Vayatis
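
The AUC criterion mentioned above has a natural empirical estimate in the form of a U-statistic over positive/negative pairs; a small sketch with simulated scores:

```python
import numpy as np

def auc_u_statistic(scores_pos, scores_neg):
    """Empirical AUC written as a U-statistic: the fraction of
    (positive, negative) pairs ranked correctly by the scoring
    function, counting ties as 1/2."""
    diff = scores_pos[:, None] - scores_neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

rng = np.random.default_rng(0)
s_pos = rng.normal(1.0, 1.0, size=200)   # scores of relevant documents
s_neg = rng.normal(0.0, 1.0, size=300)   # scores of irrelevant documents
print(auc_u_statistic(s_pos, s_neg))
```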

PLS Path Modeling, PLS Regression and Classification

Assessing Unidimensionality within PLS Path Modeling Framework

In many applications and, in particular, in PLS path modeling, it is of paramount importance to assess whether a set of variables is unidimensional. For this purpose, different methods are discussed. In addition to methods generally used in PLS path modeling, methods for the determination of the number of components in principal components analysis are considered. Two original methods based on permutation procedures are also proposed. The methods are compared to each other by means of a simulation study.

Karin Sahmer, Mohamed Hanafi, Mostafa El Qannari
The Partial Robust M-approach

The PLS approach is a widely used technique to estimate path models relating various blocks of variables measured from the same population. It is frequently applied in the social sciences and in economics. In this type of application, deviations from normality and outliers may occur, leading to an efficiency loss or even biased results. In the current paper, a robust path model estimation technique is proposed, the partial robust M (PRM) approach. Its benefits are illustrated in an example.

Sven Serneels, Christophe Croux, Peter Filzmoser, Pierre J. Van Espen
Classification in PLS Path Models and Local Model Optimisation

In this paper, a methodology is proposed which can be used for the identification of classes of units showing homogeneous behavioural models estimated through PLS Path Modelling. The proposed methodology aims at discovering or validating the existence of classes of units in PLS Path models following a prediction-oriented logic, as has been proposed in the framework of PLS Regression with PLS Typological Regression. An application to a study on customer satisfaction and loyalty is shown.

Silvia Squillacciotti

Robust Methods in Multivariate Statistics

Hierarchical Clustering by Means of Model Grouping

In many applications we are interested in finding clusters of data that share the same properties, like linear shape. We propose a hierarchical clustering procedure that merges groups if they are fitted well by the same linear model. The representative orthogonal model of each cluster is estimated robustly using iterated LQS regressions. We apply the method to two artificial datasets, providing a comparison of results against other non-hierarchical methods that can estimate linear clusters.

Claudio Agostinelli, Paolo Pellizzari
Deepest Points and Least Deep Points: Robustness and Outliers with MZE

Multivariate outlier identification is often based on robust location and scatter estimates and usually performed relative to an elliptically shaped distribution. On the other hand, the idea of outlying observations is closely related to the notion of data depth, where observations with minimum depth are potential outliers. Here, we are not generally bound to the idea of an elliptical shape of the underlying distribution. Koshevoy and Mosler (1997) introduced zonoid trimmed regions which define a data depth. Recently, Paris Scholz (2002) and Becker and Paris Scholz (2004) investigated a new approach for robust estimation of convex bodies resulting from zonoids. We follow their approach and explore how the minimum volume zonoid (MZE) estimators can be used for multivariate outlier identification in the case of non-elliptically shaped null distributions.

Claudia Becker, Sebastian Paris Scholz
Robust Transformations and Outlier Detection with Autocorrelated Data

The analysis of regression data is often improved by using a transformation of the response rather than the original response itself. However, finding a suitable transformation can be strongly affected by the influence of a few individual observations. Outliers can have an enormous impact on the fitting of statistical models and can be hard to detect due to masking and swamping. These difficulties are enhanced in the case of models for dependent observations, since any anomalies are with respect to the specific autocorrelation structure of the model. In this paper we develop a forward search approach which is able to robustly estimate the Box-Cox transformation parameter under a first-order spatial autoregression model.

Andrea Cerioli, Marco Riani
Robust Multivariate Methods: The Projection Pursuit Approach

Projection pursuit was originally introduced to identify structures in multivariate data clouds (Huber, 1985). The idea of projecting data to a low-dimensional subspace can also be applied to multivariate statistical methods. The robustness of the methods can be achieved by applying robust estimators in the lower-dimensional space. Robust estimation in high dimensions can thus be avoided, which usually results in faster computation. Moreover, flat data sets, where the number of variables is much higher than the number of observations, can be analyzed more easily in a robust way.

We will focus on the projection pursuit approach for robust continuum regression (Serneels et al., 2005). A new algorithm is introduced and compared with the reference algorithm as well as with classical continuum regression.

Peter Filzmoser, Sven Serneels, Christophe Croux, Pierre J. Van Espen
Finding Persisting States for Knowledge Discovery in Time Series

Knowledge Discovery in time series usually requires symbolic time series. Many discretization methods that convert numeric time series to symbolic time series ignore the temporal order of values. This often leads to symbols that do not correspond to states of the process generating the time series. We propose a new method for meaningful unsupervised discretization of numeric time series called “Persist”, based on the Kullback-Leibler divergence between the marginal and the self-transition probability distributions of the discretization symbols. In evaluations with artificial and real life data it clearly outperforms existing methods.

Fabian Mörchen, Alfred Ultsch
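
A rough sketch of the idea behind such a persistence score: for every symbol, the self-transition probability is compared with the marginal probability via a symmetric Kullback-Leibler divergence; the exact formula used by Persist may differ in its details.

```python
import numpy as np

def persistence_score(symbols, n_symbols):
    """For every symbol, compare its self-transition probability with its
    marginal probability via a symmetric Kullback-Leibler divergence of the
    two Bernoulli distributions; positive values indicate persisting states."""
    def sym_kl(p, q):
        eps = 1e-12
        p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
        return ((p - q) * (np.log(p) - np.log(q))
                + (q - p) * (np.log(1 - p) - np.log(1 - q)))

    marginal = np.bincount(symbols, minlength=n_symbols) / len(symbols)
    score = 0.0
    for j in range(n_symbols):
        prev = symbols[:-1] == j
        if prev.sum() == 0:
            continue
        self_trans = np.mean(symbols[1:][prev] == j)  # P(s_t = j | s_{t-1} = j)
        sign = 1.0 if self_trans > marginal[j] else -1.0
        score += sign * sym_kl(self_trans, marginal[j]) / n_symbols
    return score

# a slowly switching (persisting) series scores higher than its shuffled version
rng = np.random.default_rng(0)
series = np.repeat(rng.integers(0, 3, size=60), 20)   # long runs of three symbols
print(persistence_score(series, 3), persistence_score(rng.permutation(series), 3))
```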

Data Mining and Explorative Multivariate Data Analysis

Restricted Co-inertia Analysis

In this paper, an extension of Co-inertia Analysis is proposed. This extension is based on an objective function which directly takes external information into account, in the form of linear restrictions on one set of variables, by rewriting the Co-inertia Analysis objective function according to the principle of the Restricted Eigenvalue Problem (Rao (1973)).

Pietro Amenta, Enrico Ciavolino
Hausman Principal Component Analysis

The aim of this paper is to obtain discrete-valued weights of the variables by constraining them to Hausman weights (−1, 0, 1) in principal component analysis. This is done in two steps: first, we start with the centroid method, which produces the most restricted optimal weights −1 and 1; then we extend the weights to −1, 0 or 1.

Vartan Choulakian, Luigi Dambra, Biagio Simonetti
Nonlinear Time Series Modelling: Monitoring a Drilling Process

Exponential autoregressive (ExpAr) time series models are able to reveal certain types of nonlinear dynamics such as fixed points and limit cycles. In this work, these models are used to model a drilling process. This modelling approach provides an on-line monitoring strategy for the process, using control charts, in order to detect dynamic disturbances and to secure production with high quality.

Amor Messaoud, Claus Weihs, Franz Hering

Text Mining

Word Length and Frequency Distributions in Different Text Genres

In this paper we study word length frequency distributions of a systematic selection of 80 Slovenian texts (private letters, journalistic texts, poems and cooking recipes). The adequacy of four two-parametric Poisson models is analyzed according to their goodness-of-fit properties, and the corresponding model parameter ranges are checked for their suitability to discriminate the given text sorts. As a result, the Singh-Poisson distribution seems to be the best choice for both problems: first, it is an appropriate model for three of the text sorts (private letters, journalistic texts and poems); and second, the parameter space of the model can be split into regions corresponding to all four text sorts.

Gordana Antić, Ernst Stadlober, Peter Grzybek, Emmerich Kelih
Bootstrapping an Unsupervised Morphemic Analysis

Unsupervised morphemic analysis may be divided into two phases: 1) establishment of an initial morpheme set, and 2) optimization of this generally imperfect first approximation. This paper focuses on the first phase, that is, the establishment of an initial morphemic analysis, whereby methodological questions regarding ‘unsupervision’ will be touched on. The basic algorithm for segmentation employed goes back to Harris (1955). Proposals for the antecedent transformation of graphemic representations into (partial) phonemic ones are discussed, as well as the postprocessing step of reapplying the initially gained morphemic candidates. Instead of directly using numerical (count) measures, a proposal is put forward which exploits numerical interpretations of a universal morphological assumption on morphemic order for the evaluation of the computationally gained segmentations and their quantitative properties.

Christoph Benden
Automatic Extension of Feature-based Semantic Lexicons via Contextual Attributes

We describe how a feature-based semantic lexicon can be automatically extended using large, unstructured text corpora. Experiments are carried out using the lexicon HaGenLex and the Wortschatz corpus. The semantic classes of nouns are determined via the adjectives that modify them. It turns out to be reasonable to combine several classifiers for single attributes into one for complex semantic classes. The method is evaluated thoroughly and possible improvements are discussed.

Chris Biemann, Rainer Osswald
Learning Ontologies to Improve Text Clustering and Classification

Recent work has shown improvements in text clustering and classification tasks by integrating conceptual features extracted from ontologies. In this paper we present text mining experiments in the medical domain in which the ontological structures used are acquired automatically in an unsupervised learning process from the text corpus in question. We compare results obtained using the automatically learned ontologies with those obtained using manually engineered ones. Our results show that both types of ontologies improve results on text clustering and classification tasks, whereby the automatically acquired ontologies yield an improvement competitive with the manually engineered ones.

Stephan Bloehdorn, Philipp Cimiano, Andreas Hotho
Discovering Communities in Linked Data by Multi-view Clustering

We consider the problem of finding communities in large linked networks such as web structures or citation networks. We review similarity measures for linked objects and discuss the k-Means and EM algorithms, based on text similarity, bibliographic coupling, and co-citation strength. We study the utilization of the principle of multi-view learning to combine these similarity measures. We explore the clustering algorithms experimentally using web pages and the CiteSeer repository of research papers and find that multi-view clustering effectively combines link-based and intrinsic similarity.

Isabel Drost, Steffen Bickel, Tobias Scheffer
Crosslinguistic Computation and a Rhythm-based Classification of Languages

This paper is in line with the principles of numerical taxonomy and with the program of holistic typology. It integrates the level of phonology with the morphological and syntactical level by correlating metric properties (such as the number of phonemes per syllable and the number of syllables per clause) with non-metric variables such as the number of morphological cases and adposition order. The study of crosslinguistic patterns of variation results in a division of languages into two main groups, depending on their rhythmical structure. Syllable-timed rhythm, as opposed to stress-timed rhythm, is closely associated with a lower complexity of syllables and a higher number of syllables per clause, with a rather high number of morphological cases and with a tendency to OV order and postpositions. These two fundamental types of language may be viewed as the “idealized” counterparts resulting from the very same and universal pattern of variation.

August Fenk, Gertraud Fenk-Oczlon
Using String Kernels for Classification of Slovenian Web Documents

In this paper we present an approach for classifying web pages obtained from the Slovenian Internet directory, where the web sites covering different topics are organized into a topic ontology. We tested two different methods for representing text documents, both in combination with the linear SVM classification algorithm. The first representation used is a standard bag-of-words approach with TFIDF weights and cosine distance used as the similarity measure. We compared this to string kernels, where text documents are compared not by words but by substrings. This removes the need for stemming or lemmatisation, which can be an important issue when documents are in languages other than English and tools for stemming or lemmatisation are unavailable or are expensive to make or learn. In highly inflected natural languages, such as Slovene, the same word can have many different forms, thus string kernels have an advantage here over the bag-of-words representation. In this paper we show that for the classification of documents written in a highly inflected natural language the situation is the opposite, and string kernels significantly outperform the standard bag-of-words representation. Our experiments also show that the advantage of string kernels is more evident for domains with an unbalanced class distribution.

Blaž Fortuna, Dunja Mladenič
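
The bag-of-words baseline described above can be sketched in a few lines with scikit-learn; the tiny corpus and labels are invented stand-ins, and the string-kernel variant is not shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# tiny invented stand-in corpus (the Slovenian directory data is not bundled here)
docs = ["novice o športu in nogometu", "recept za jabolčno pito",
        "rezultati košarkarske lige", "kuhanje in peka kruha",
        "prestopi nogometašev poleti", "sladica iz čokolade in smetane"]
labels = ["sport", "food", "sport", "food", "sport", "food"]

# bag-of-words with TFIDF weights and a linear SVM; cosine similarity is implicit
# because TfidfVectorizer L2-normalises the document vectors by default
model = make_pipeline(TfidfVectorizer(), LinearSVC())
print(cross_val_score(model, docs, labels, cv=3))
```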
Semantic Decomposition of Character Encodings for Linguistic Knowledge Discovery

Analysis and knowledge representation of linguistic objects tends to focus on larger units (e.g. words) than print medium characters. We analyse characters as linguistic objects in their own right, with meaning, structure and form. Characters have meaning (the symbols of the International Phonetic Alphabet denote phonetic categories, the character represented by the glyph ‘∪’ denotes set union), structure (they are composed of stems and parts such as descenders or diacritics, or are ligatures), and form (they have a mapping to visual glyphs). Character encoding initiatives such as Unicode tend to concentrate on the structure and form of characters and ignore their meaning in the sense discussed here. We suggest that our approach of including semantic decomposition and defining font-based namespaces for semantic character domains provides a long-term perspective of interoperability and tractability with regard to data-mining over characters by integrating information about characters into a coherent semiotically-based ontology. We demonstrate these principles in a case study of the International Phonetic Alphabet.

Dafydd Gibbon, Baden Hughes, Thorsten Trippel
Applying Collaborative Filtering to Real-life Corporate Data

In this paper, we present our experience in applying collaborative filtering to real-life corporate data. The quality of collaborative filtering recommendations is highly dependent on the quality of the data used to identify users’ preferences. To understand the influence that highly sparse, server-side collected data has on the accuracy of collaborative filtering, we ran a series of experiments in which we used publicly available datasets as well as a real-life corporate dataset that does not fit the profile of ideal data for collaborative filtering.

Miha Grcar, Dunja Mladenič, Marko Grobelnik
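
A minimal sketch of memory-based collaborative filtering of the kind evaluated in such experiments: predicting a missing rating from the most similar users; the ratings matrix and neighbourhood size are illustrative assumptions.

```python
import numpy as np

def predict_user_based(R, user, item, k=2):
    """Predict a missing rating as the similarity-weighted average of the
    ratings given by the k most similar users who rated the item
    (a 0 entry encodes 'not rated')."""
    rated = R[:, item] > 0
    rated[user] = False
    if not rated.any():
        return np.nan
    # cosine similarity between the active user and every other user
    norms = np.linalg.norm(R, axis=1) * np.linalg.norm(R[user]) + 1e-12
    sims = R @ R[user] / norms
    neighbours = [n for n in np.argsort(sims * rated)[::-1][:k] if rated[n]]
    w = sims[neighbours]
    return float(w @ R[neighbours, item] / (np.abs(w).sum() + 1e-12))

R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [1, 0, 4, 4]], dtype=float)
print(predict_user_based(R, user=1, item=1))   # estimate user 1's rating of item 1
```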
Quantitative Text Typology: The Impact of Sentence Length

This study focuses on the contribution of sentence length to a quantitative text typology. For this purpose, 333 Slovenian texts are analyzed with regard to their sentence length. By way of multivariate discriminant analyses (MDA) it is shown that a text typology based on sentence length only is indeed possible; this typology, however, does not coincide with traditional text classifications, such as text sorts or functional styles. Rather, a new categorization into specific discourse types seems reasonable.

Emmerich Kelih, Peter Grzybek, Gordana Antić, Ernst Stadlober
A Hybrid Machine Learning Approach for Information Extraction from Free Text

We present a hybrid machine learning approach for information extraction from unstructured documents by integrating a learned classifier based on Maximum Entropy Modeling (MEM) and a classifier based on our work on Data-Oriented Parsing (DOP). The hybrid behavior is achieved through a voting mechanism applied by an iterative tag-insertion algorithm. We have tested the method on a corpus of German newspaper articles about company turnover, and achieved an 85.2% F-measure using the hybrid approach, compared to 79.3% for MEM and 51.9% for DOP when running them in isolation.

Günter Neumann
Text Classification with Active Learning

In many real world machine learning tasks, labeled training examples are expensive to obtain, while at the same time a lot of unlabeled examples are available. One such class of learning problems is text classification. Active learning strives to reduce the required labeling effort while retaining accuracy by intelligently selecting the examples to be labeled. However, very little comparison exists between different active learning methods. The effects of the ratio of positive to negative examples on the accuracy of such algorithms have also received very little attention. This paper presents a comparison of the two most promising methods and their performance on a range of categories from the Reuters Corpus Vol. 1 news article dataset.

Blaž Novak, Dunja Mladenič, Marko Grobelnik
Towards Structure-sensitive Hypertext Categorization

Hypertext categorization is the task of automatically assigning category labels to hypertext units. Like text categorization, it remains in the area of function learning based on the bag-of-features approach. This scenario faces the problem of a many-to-many relation between websites and their hidden logical document structure. The paper argues that this relation is a prevalent characteristic which interferes with any effort to apply the classical apparatus of categorization to web genres. This is confirmed by a threefold experiment in hypertext categorization. In order to outline a solution to this problem, the paper sketches an alternative method of unsupervised learning which aims at bridging the gap between statistical and structural pattern recognition (Bunke et al. 2001) in the area of web mining.

Alexander Mehler, Rüdiger Gleim, Matthias Dehmer
Evaluating the Performance of Text Mining Systems on Real-world Press Archives

We investigate the performance of text mining systems for annotating press articles in two real-world press archives. Seven commercial systems are tested which recover the categories of a document as well as named entities and catchphrases. Using cross-validation we evaluate the precision-recall characteristic. Depending on the depth of the category tree, a breakeven of 39–79% is achieved. For one corpus, 45% of the documents can be classified automatically, based on the system’s confidence estimates. In a usability experiment the formal evaluation results are confirmed. It turns out that with respect to some features human annotators exhibit a lower performance than the text mining systems. This establishes a convincing argument to use text mining systems to support the indexing of large document collections.

Gerhard Paaß, Hugo de Vries
Part-of-Speech Induction by Singular Value Decomposition and Hierarchical Clustering

Part-of-speech induction involves the automatic discovery of word classes and the assignment of each word of a vocabulary to one or several of these classes. The approach proposed here is based on the analysis of word distributions in a large collection of German newspaper texts. Its main advantage over other attempts is that it combines the hierarchical clustering of context vectors with a previous step of dimensionality reduction that minimizes the effects of sampling errors.

Reinhard Rapp
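
A compact sketch of the described pipeline — dimensionality reduction of word context vectors by SVD followed by hierarchical clustering — on random stand-in counts; the corpus, the reduced dimensionality and the number of word classes are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# hypothetical word-by-context co-occurrence counts (rows = words)
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(30, 200)).astype(float)

# dimensionality reduction by truncated SVD to smooth out sampling noise
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
reduced = U[:, :10] * s[:10]                  # 10-dimensional context vectors

# agglomerative (Ward) clustering of the reduced vectors into word classes
Z = linkage(reduced, method="ward")
classes = fcluster(Z, t=5, criterion="maxclust")
print(classes)
```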
Near Similarity Search and Plagiarism Analysis

Existing methods for text plagiarism analysis are mainly based on “chunking”, a process of grouping a text into meaningful units, each of which gets encoded by an integer number. Together these numbers form a document’s signature or fingerprint. An overlap of two documents’ fingerprints indicates a possibly plagiarized text passage. Most approaches use MD5 hashes to construct fingerprints, which is bound up with two problems: (i) it is computationally expensive, (ii) a small chunk size must be chosen to identify matching passages, which additionally increases the effort for fingerprint computation, fingerprint comparison, and fingerprint storage.

This paper proposes a new class of fingerprints that can be considered as an abstraction of the classical vector space model. These fingerprints operationalize the concept of “near similarity” and enable one to quickly identify candidate passages for plagiarism. Experiments show that a plagiarism analysis based on our fingerprints leads to a speed-up by a factor of five and higher—without compromising the recall performance.

Benno Stein, Sven Meyer zu Eissen
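
For reference, a minimal sketch of the classical chunking-and-hashing fingerprint that the paper takes as its starting point (not the new fingerprint class it proposes); the chunk size and the truncated MD5 digest are illustrative choices.

```python
import hashlib

def fingerprint(text, chunk_size=5):
    """Chunk-based fingerprint: hash every sequence of chunk_size consecutive
    words and keep the set of (truncated) hash values."""
    words = text.lower().split()
    chunks = (" ".join(words[i:i + chunk_size])
              for i in range(len(words) - chunk_size + 1))
    return {hashlib.md5(c.encode()).hexdigest()[:8] for c in chunks}

doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "a quick brown fox jumps over the lazy dog near a small pond"
overlap = fingerprint(doc_a) & fingerprint(doc_b)
print(len(overlap), "shared chunks")   # shared chunks hint at copied passages
```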

Fuzzy Data Analysis

Objective Function-based Discretization

Decision tree learners inspect marginal class distributions of numerical attributes to infer a predicate that can be used as a decision node in the tree. Since such discretization techniques examine the marginal distribution only, they may fail completely to predict the class correctly even in cases for which a decision tree with a 100% classification rate exists. In this paper, an objective function-based clustering algorithm is modified to yield a discretization of numerical variables that overcomes these problems. The underlying clustering algorithm is the fuzzy c-means algorithm, which is modified (a) to take the class information into account and (b) to organize all cluster prototypes in a regular grid such that the grid rather than the individual clusters is optimized.

Frank Höppner
Understanding and Controlling the Membership Degrees in Fuzzy Clustering

Fuzzy cluster analysis uses membership degrees to assign data objects to clusters in order to better handle ambiguous data that share properties of different clusters. However, the introduction of membership degrees requires a new parameter called the fuzzifier. In this paper the good and bad effects of the fuzzifier on the clustering results are analysed and, based on these considerations, a more general approach to fuzzy clustering is proposed, providing better control over the membership degrees and their influence in fuzzy cluster analysis.

Frank Klawonn
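
The role of the fuzzifier can be seen directly in the standard fuzzy c-means membership formula; the small sketch below evaluates the memberships of three one-dimensional points for different fuzzifier values (data and centres are invented).

```python
import numpy as np

def fcm_memberships(data, centres, m):
    """Standard fuzzy c-means membership degrees for a fuzzifier m > 1:
    u_ij = 1 / sum_k (d_ij / d_ik)^(2 / (m - 1))."""
    d = np.linalg.norm(data[:, None, :] - centres[None, :, :], axis=2) + 1e-12
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

data = np.array([[0.0], [0.4], [1.0]])
centres = np.array([[0.0], [1.0]])
for m in (1.2, 2.0, 5.0):          # larger fuzzifier -> softer membership degrees
    print(m, fcm_memberships(data, centres, m).round(3))
```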
Autonomous Sensor-based Landing Systems: Fusion of Vague and Incomplete Information by Application of Fuzzy Clustering Techniques

Enhanced Vision Systems (EVS) are currently being developed with the goal of alleviating restrictions in airspace and airport capacity in low-visibility conditions. EVS relies on weather penetrating forward-looking sensors that augment the naturally existing visual cues in the environment and provide a real-time image of prominent topographical objects that may be identified by the pilot. In this paper an automatic analysis of millimetre wave radar images for Enhanced Vision Systems is presented. The core part of the system is a fuzzy rule based inference machine which controls the data analysis based on the uncertainty in the actual knowledge in combination with a-priori knowledge. Compared with standard TV or IR images, the quality of MMW images is rather poor and the data is highly corrupted with noise and clutter. Therefore, one main task of the inference machine is to handle uncertainties as well as ambiguities and inconsistencies to draw the right conclusions. The outputs of the different sensor data analysis processes are fused and evaluated within a fuzzy/possibilistic clustering algorithm whose results serve as input to the inference machine. The only a-priori knowledge used in the presented approach is the same that pilots already know from airport charts, which are available for almost every airport. The performance of the approach is demonstrated with real data acquired during extensive flight tests to several airports in Northern Germany.

Bernd Korn
Outlier Preserving Clustering for Structured Data Through Kernels

In this paper, we propose a kernel-based clustering algorithm that highlights both the major trends and the atypical behaviours present in a dataset, so as to provide a complete characterisation of the data; thanks to the kernel framework, the algorithm can be applied independently of the nature of the data without requiring any adaptation. We apply it to XML data describing student results on several exams: we propose a kernel to handle such data and present the results obtained on a real dataset.

Marie-Jeanne Lesot

Economics and Mining in Business Processes

Classification-relevant Importance Measures for the West German Business Cycle

When analyzing business cycle data, one observes that the relevant predictor variables are often highly correlated. This paper presents a method to obtain measures of importance for the classification of data in which such multicollinearity is present. In systems with highly correlated variables it is interesting to know what changes result when a certain predictor is changed by one unit and all other predictors change according to their correlation with the first, instead of performing a ceteris paribus analysis. The approach described in this paper uses directional derivatives to obtain such importance measures. It is shown how the interesting directions can be estimated, and different evaluation strategies for characteristics of classification models are presented. The method is then applied to linear discriminant analysis and multinomial logit for the classification of west German business cycle phases.

Daniel Enache, Claus Weihs, Ursula Garczarek
The Classification of Local and Branch Labour Markets in the Upper Silesia

The paper focuses on the differentiation of unemployment in Upper Silesia. All analyses have been carried out with reference to both particular professions and subregions of Silesia. The paper shows how the internal diversity of the province influences the situation of the labour markets. It could be interesting to compare the present and future classifications, because the time scope of the research covers the year preceding European integration.

Witold Hantke
An Overview of Artificial Life Approaches for Clustering

Recently, artificial life approaches for clustering have been proposed. However, research on artificial life is mainly concerned with the simulation of systems based on models of real life. In addition, artificial life methods have been utilized to solve optimization problems. This paper gives a short overview of artificial life and its applications in general. From this starting point we focus on artificial life approaches used for clustering. These approaches are characterized by the fact that solutions are emergent rather than predefined and preprogrammed. The data is seen as active rather than passive objects. New data can be added incrementally to the system. We present existing concepts for clustering with artificial life and highlight their differences and strengths.

David Kämpf, Alfred Ultsch
Design Problems of Complex Economic Experiments

Economic experiments are a source of valuable data about economic decision making. Although a lot of experiments exist, there are few in which the subjects face complex decision tasks.

Inventory management problems in supply chain management represent such complex decision tasks because of their time delay and nonlinearities.

The published experiments on inventory management in supply chains are reviewed and some of the design problems of these experiments are discussed. The paper especially focuses on incentives, presentation effects and concreteness of the experiment.

Jonas Kunze
Traffic Sensitivity of Long-term Regional Growth Forecasts

We estimate the sensitivity of the regional growth forecast made in the year 2002 to expected changes in the travel time (TT) matrix. We use a dynamic panel model with spatial effects where the spatial dimension enters the explanatory variables in different ways. The spatial dimension is based on the geographical distance between 227 cells in central Europe and on the travel time matrix based on average train travel times. The regressor variables are constructed from a) the average past growth rates, where the travel times are used as weights, b) the average travel times across all cells (made comparable by index construction), c) the gravity potential variables based on GDP per capita, employment, productivity and population, and d) dummy variables and other socio-demographic variables. We find that for the majority of the cells the relative differences in growth for the year 2020 are rather small. But there are differences as to how many regions will benefit from improved train networks: GDP, employment, and population forecasts respond differently.

Wolfgang Polasek, Helmut Berrer
Spiralling in BTA Deep-hole Drilling: Models of Varying Frequencies

One serious problem in deep-hole drilling is the formation of a dynamic disturbance called spiralling, which causes holes with several lobes. Since such lobes are a severe impairment of the bore hole, the formation of spiralling has to be prevented. Gessesse et al. (1994) explain spiralling by the coincidence of bending modes and multiples of the rotary frequency. This they derive from an elaborate finite elements model of the process.

In online measurements we detected slowly changing frequency patterns similar to those calculated by Gessesse et al. We therefore propose a method to estimate the parameters determining the change of frequencies over time from spectrogram data. This makes it possible to significantly simplify the use of the explanation of spiralling in practice, because the finite elements model has to be correctly modified for each machine and tool assembly, while the statistical method uses observable measurements. Estimating the variation of the frequencies as well as possible opens up the opportunity to prevent spiralling by, e.g., changing the rotary frequency.

Nils Raabe, Oliver Webber, Winfried Theis, Claus Weihs
Analysis of the Economic Development of Districts in Poland as a Basis for the Framing of Regional Policies

In 2004, six major socio-economic regions were established in Poland according to the Nomenclature of Territorial Units for Statistics (NUTS). The so-called NUTS1 regions were formed as groups of voivodships (NUTS2 regions). Some of the authorities expressed concerns that joining better developed regions with less developed ones might raise the GDP per capita statistic of a region, thus reducing the expected flows of structural funds. In the paper we try to verify these concerns. We study the level of economic development of all districts in Poland, and its spatial diversification, using synthetic taxonomic indexes of development and clustering methods.

Monika Rozkrut, Dominik Rozkrut

Banking and Finance

The Classification of Candlestick Charts: Laying the Foundation for Further Empirical Research

The academic discussion about technical analysis has a long tradition, in the American literature as well as in the German scientific community. Lo et al. (2000) laid the foundation for empirical research on the “classical” technical indicators (such as “head-and-shoulders” formations) with their paper “Foundations of Technical Analysis”.

The candlestick technique is based on the visual recognition of patterns called “candlesticks”, a special method of visualizing the behavior of asset prices. Candlesticks are very popular in Asia and their popularity is growing in western countries. Until now, little empirical research has been done on the performance of technical analysis with candlesticks, probably because no automatic and deterministic way to classify candlestick patterns has been developed thus far.

The purpose of this work is to lay the basis for future empirical investigations and to develop a systematic approach by which candlestick charts can be classified.
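
As a simplified illustration of what a deterministic treatment of candlesticks might start from (not the authors' classification scheme), the following sketch derives body and shadow features from open/high/low/close prices and applies coarse labelling rules; the thresholds and labels are assumptions.

```python
# A simplified illustration (not the authors' scheme): deterministic candlestick
# features from open/high/low/close prices, labelled by fixed rules.

def candle_features(o, h, l, c):
    """Return body, upper/lower shadow and colour of one candlestick."""
    body = abs(c - o)
    upper_shadow = h - max(o, c)
    lower_shadow = min(o, c) - l
    colour = "white" if c >= o else "black"   # rising vs. falling candle
    return body, upper_shadow, lower_shadow, colour

def label_candle(o, h, l, c, doji_threshold=0.1):
    """Very coarse rule-based label for a single candlestick."""
    body, up, low, colour = candle_features(o, h, l, c)
    rng = h - l if h > l else 1e-12
    if body / rng < doji_threshold:
        return "doji"                         # negligible body
    if low > 2 * body and up < body:
        return "hammer-like"                  # long lower shadow
    return f"{colour} candle"

print(label_candle(o=99.0, h=100.8, l=95.0, c=100.5))   # -> "hammer-like"
```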

Stefan Etschberger, Henning Fock, Christian Klein, Bernhard Zwergel
Modeling and Estimating the Credit Cycle by a Probit-AR(1)-Process

The loss distribution of a credit portfolio is considered within the framework of a Bernoulli-mixture model where in each rating grade the stochastic Bernoulli-parameter follows an autoregressive stationary process. Changes in the loss distribution are discussed when the unconditional view is replaced by a conditional view where information from the last period is taken into account. This relates to the lively debate among practitioners whether regulatory capital should incorporate point-in-time or through-the-cycle aspects. Calculations are carried out in a model estimated with real data from a large retail portfolio.
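
A minimal sketch of this type of model, with purely illustrative parameter values rather than estimates from the retail portfolio, could look as follows: an AR(1) systematic factor drives the Bernoulli default probability of a rating grade through a probit link.

```python
# A sketch of a Bernoulli-mixture model with an AR(1) systematic factor and a
# probit link. Parameter values are illustrative, not estimates from the paper.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
T, n_obligors = 40, 10_000            # periods and portfolio size
c, rho, sigma = -2.6, 0.6, 0.3        # probit intercept, AR(1) coefficient, innovation sd

x = np.zeros(T)                       # latent credit-cycle factor
for t in range(1, T):
    x[t] = rho * x[t - 1] + sigma * rng.standard_normal()

p = norm.cdf(c + x)                   # conditional default probability per period
defaults = rng.binomial(n_obligors, p)   # observed defaults per period
loss_rates = defaults / n_obligors

# Conditional vs. unconditional view: given last period's factor, next period's
# default probability is centred around norm.cdf(c + rho * x[-1]).
```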

Steffi Höse, Konstantin Vogl
Comparing and Selecting SVM-Kernels for Credit Scoring

Kernel methods for classification problems map data points into feature spaces where linear separation is performed. Detecting linear relations has been the focus of much research in statistics and machine learning, resulting in efficient and well-understood algorithms with many applications, including credit scoring problems. However, choosing more appropriate kernel functions, i.e. nonlinear feature mappings, may still improve classification performance. We show how different kernel functions contribute to the solution of a credit scoring problem, and how to select and compare such kernels.
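
A minimal sketch of how such a kernel comparison can be set up, assuming scikit-learn and a synthetic data set in place of the credit scoring data, is given below; the kernels are ranked by cross-validated AUC.

```python
# A minimal sketch of comparing SVM kernels by cross-validation
# (scikit-learn assumed; the data set is synthetic, not the paper's).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0, gamma="scale"))
    scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{kernel:8s}  mean AUC = {scores.mean():.3f} +/- {scores.std():.3f}")
```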

Ralf Stecking, Klaus B. Schebesch
Value at Risk Using the Principal Components Analysis on the Polish Power Exchange

In this article we present downside risk measures such as Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR). We establish these measures based on principal components analysis. Principal components analysis is usually applied to complex systems that depend on a large number of factors, where one wishes to identify the smallest number of new variables that explain as much of the variability in the system as possible. The first few principal components usually explain most of the historical variability. In our research we used the prices of electric energy from the Day Ahead Market (DAM) of the Polish Power Exchange from 30.03.03 to 27.03.04. We conclude by discussing practical applications of the results of our research to risk management on the Polish Power Exchange.
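
The following sketch, on simulated data rather than the DAM prices, illustrates the basic computation: approximate the portfolio profit and loss by the first few principal components and read off empirical VaR and CVaR; the weights and the number of retained components are assumptions.

```python
# Illustrative only: approximate a portfolio's P&L with the first few principal
# components of price changes, then read off empirical VaR and CVaR.
import numpy as np

rng = np.random.default_rng(2)
R = rng.normal(size=(250, 24))              # stand-in for 24 hourly DAM price changes
w = np.full(24, 1 / 24)                     # hypothetical portfolio weights

Rc = R - R.mean(axis=0)
eigval, eigvec = np.linalg.eigh(np.cov(Rc, rowvar=False))
order = np.argsort(eigval)[::-1]            # sort components by explained variance
k = 3                                       # keep the first few components
V = eigvec[:, order[:k]]

R_approx = Rc @ V @ V.T                     # reconstruction from k components
pnl = R_approx @ w                          # approximate portfolio P&L per day

alpha = 0.95
var = -np.quantile(pnl, 1 - alpha)          # Value-at-Risk (loss reported as positive)
cvar = -pnl[pnl <= -var].mean()             # Conditional Value-at-Risk (expected tail loss)
print(f"VaR(95%) = {var:.3f}, CVaR(95%) = {cvar:.3f}")
```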

Grażyna Trzpiot, Alicja Ganczarek

Marketing

A Market Basket Analysis Conducted with a Multivariate Logit Model

The following research is guided by the hypothesis that products chosen on a shopping trip in a supermarket can indicate the preference interdependencies between different products or brands. The bundle chosen on the trip can be regarded as the result of a global utility function. More specifically: the existence of such a function implies a cross-category dependence of brand choice behavior. It is hypothesized that the global utility function related to a product bundle results from the marketing-mix of the underlying brands. Several approaches exist to describe the choice of specific categories from a set of many alternatives. The models are discussed in brief; the multivariate logit approach is used to estimate a model with a German data set.

Yasemin Boztuğ, Lutz Hildebrandt
Solving and Interpreting Binary Classification Problems in Marketing with SVMs

Marketing problems often involve binary classification of customers into “buyers” versus “non-buyers” or “prefers brand A” versus “prefers brand B”. These cases require binary classification models such as logistic regression, linear, and quadratic discriminant analysis. A promising recent technique for the binary classification problem is the Support Vector Machine (Vapnik (1995)), which has achieved outstanding results in areas ranging from Bioinformatics to Finance. In this paper, we compare the performance of the Support Vector Machine against standard binary classification techniques on a marketing data set and elaborate on the interpretation of the obtained results.

Georgi Nalbantov, Jan C. Bioch, Patrick J. F. Groenen
Modeling the Nonlinear Relationship Between Satisfaction and Loyalty with Structural Equation Models
Marcel Paulssen, Angela Sommerfeld
Job Choice Model to Measure Behavior in a Multi-stage Decision Process

The article presents a job choice model that allows measuring the importance of items of employer images in a multi-stage decision process. Based on prior research, a model of the multi-stage decision process is presented that details how decisions are made at each stage. A method using logistic regression to empirically validate the model is described and compared to an alternative method. The results of applying the method are presented and discussed.

Thomas Spengler, Jan Malmendier
Semiparametric Stepwise Regression to Estimate Sales Promotion Effects

Kalyanam and Shively (1998) and van Heerde et al. (2001) have proposed semiparametric models to estimate the influence of price promotions on brand sales, and both obtained superior performance for their models compared to strictly parametric modeling. Following these researchers, we suggest another semiparametric framework which is based on penalized B-splines to analyze sales promotion effects flexibly. Unlike these researchers, we introduce a stepwise procedure with simultaneous smoothing parameter choice for variable selection. Applying this stepwise routine enables us to deal with product categories with many competitive items without imposing restrictions on the competitive market structure in advance. We illustrate the new methodology in an empirical application using weekly store-level scanner data.

Winfried J. Steiner, Christiane Belitz, Stefan Lang

Adaptivity and Personalization

Implications of Probabilistic Data Modeling for Mining Association Rules

Mining association rules is an important technique for discovering meaningful patterns in transaction databases. In the current literature, the properties of algorithms to mine association rules are discussed in great detail. We present a simple probabilistic framework for transaction data which can be used to simulate transaction data when no associations are present. We use such data and a real-world grocery database to explore the behavior of confidence and lift, two popular interest measures used for rule mining. The results show that confidence is systematically influenced by the frequency of the items in the left-hand side of rules and that lift performs poorly at filtering random noise in transaction data. The probabilistic data modeling approach presented in this paper is not only a valuable framework for analyzing interest measures but also provides a starting point for further research to develop new interest measures which are based on statistical tests and geared towards the specific properties of transaction data.
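
For reference, the two interest measures can be computed directly from transaction data; the toy baskets below are illustrative only.

```python
# The two interest measures discussed above:
# confidence(A -> B) = supp(A ∪ B) / supp(A),  lift(A -> B) = confidence / supp(B).
transactions = [
    {"milk", "bread"},
    {"milk", "butter"},
    {"bread", "butter"},
    {"milk", "bread", "butter"},
    {"bread"},
]

def support(itemset, transactions):
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    return support(lhs | rhs, transactions) / support(lhs, transactions)

def lift(lhs, rhs, transactions):
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

lhs, rhs = {"milk"}, {"bread"}
print(confidence(lhs, rhs, transactions))   # 2/3: bread occurs in 2 of the 3 milk baskets
print(lift(lhs, rhs, transactions))         # (2/3) / supp(bread) = (2/3) / (4/5) ≈ 0.83
```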

Michael Hahsler, Kurt Hornik, Thomas Reutterer
Copula Functions in Model Based Clustering

Model-based clustering is a common approach in cluster analysis. Here each cluster is characterized by some kind of model, for example a multivariate distribution, a regression, a principal component, etc. One of the best-known approaches in model-based clustering is the one proposed by Banfield and Raftery (1993), where each class is described by a multivariate normal distribution. Due to the eigenvalue decomposition, one gains flexibility in modeling the size, shape and orientation of the clusters, while still assuming a generally elliptical shape of the set of observations. In the paper we consider another proposal based on a general stochastic approach in two versions:

classification likelihood approach, where each observation comes from one of several populations;

mixture approach, where observations are distributed as a mixture of several distributions.

We propose the use of the copula approach, representing the multivariate distribution as a copula function of univariate marginal distributions. We give the theoretical basis for such an approach and algorithms for practical use.

The discussed methods are illustrated by some simulation studies and real examples using financial data.
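
The sketch below illustrates only the basic building block, not the clustering algorithm itself: the margins are mapped to pseudo-uniform observations via empirical CDFs and a Gaussian copula correlation is fitted; the data and the choice of copula family are assumptions.

```python
# Building block only (not the clustering algorithm): represent a bivariate
# distribution as a copula of its margins by mapping the data through empirical
# CDFs and fitting a Gaussian copula correlation.
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(3)
x = rng.gamma(shape=2.0, size=1000)
y = 0.5 * x + rng.normal(size=1000)          # dependent, non-normal margins

def empirical_cdf(v):
    """Map observations to (0, 1) via their ranks (pseudo-observations)."""
    return rankdata(v) / (len(v) + 1)

u = np.column_stack([empirical_cdf(x), empirical_cdf(y)])   # uniform margins
z = norm.ppf(u)                               # normal scores
rho = np.corrcoef(z, rowvar=False)[0, 1]      # Gaussian copula parameter
print(f"estimated copula correlation: {rho:.2f}")
```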

Krzysztof Jajuga, Daniel Papla
Attribute-aware Collaborative Filtering

One of the key challenges in large information systems such as online shops and digital libraries is to discover the relevant knowledge from the enormous volume of information. Recommender systems can be viewed as a way of reducing large information spaces and of personalizing information access by providing recommendations for information items based on prior usage.

Collaborative filtering, the most commonly used technique for this task, applies a nearest-neighbor algorithm but does not make use of object attributes. Several so-called content-based and hybrid recommender systems have been proposed that aim at improving recommendation quality by incorporating attributes into a collaborative filtering model.

In this paper, we present an adapted as well as two novel hybrid techniques for recommending items. To evaluate the performance of our approaches, we have conducted empirical evaluations on a movie dataset. These algorithms have been compared with several collaborative filtering and non-hybrid approaches that do not consider attributes. Our experimental evaluations show that our novel hybrid algorithms outperform state-of-the-art algorithms.
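
One simple attribute-aware variant, shown below as an assumption-laden sketch rather than the paper's algorithms, blends an item similarity computed from the rating matrix with one computed from item attributes using a mixing weight.

```python
# A sketch of one simple attribute-aware variant (not the paper's algorithms):
# blend rating-based and attribute-based item similarities with a weight `alpha`.
import numpy as np

def cosine_sim(M):
    """Pairwise cosine similarity between the columns of M."""
    norms = np.linalg.norm(M, axis=0, keepdims=True) + 1e-12
    Mn = M / norms
    return Mn.T @ Mn

ratings = np.array([[5, 0, 3, 0],             # users x items, 0 = unrated
                    [4, 0, 0, 2],
                    [0, 5, 4, 0],
                    [1, 4, 0, 5]], dtype=float)
attributes = np.array([[1, 0, 1],             # items x attributes (e.g. genres)
                       [0, 1, 0],
                       [1, 1, 0],
                       [0, 1, 1]], dtype=float)

alpha = 0.6                                   # weight of the collaborative part
sim = alpha * cosine_sim(ratings) + (1 - alpha) * cosine_sim(attributes.T)

# Score unrated items for user 0 by similarity-weighted sums of their ratings.
user = ratings[0]
scores = sim @ user
scores[user > 0] = -np.inf                    # do not recommend already rated items
print("recommended item:", int(np.argmax(scores)))
```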

Karen Tso, Lars Schmidt-Thieme

User and Data Authentication in IT Security

Towards a Flexible Framework for Open Source Software for Handwritten Signature Analysis

The human signature is still the most widely used and accepted form of personal authorisation and verification applied to documents and transactions. In this paper we describe the design and implementation of a flexible and highly configurable framework for the experimental construction and performance investigation of biometric signature verification systems. We focus on a design approach which encompasses a general process model for automatic signature processing and reference instances for specific data parsing, feature extraction and classification algorithms, and we detail the provision of a framework whereby unique signature systems can be easily constructed, trialed and assessed.

Richard Guest, Mike Fairhurst, Claus Vielhauer
Multimodal Biometric Authentication System Based on Hand Features

In this paper we present a multimodal biometric authentication system based on features of the human hand. A new approach to biometric authentication based on eigen-coefficients of the palm, of the fingers between the first and third phalanx, and of the fingertips is described. The system was tested on a database containing 10 grey-level images of the left hand and 10 grey-level images of the right hand of 43 people. Preliminary experimental results showed high accuracy of the system in terms of the correct recognition rate (99.49 %) and the equal error rate (0.025 %).
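
The eigen-coefficient idea itself can be sketched as follows (the paper's region extraction and the fusion of palm, finger and fingertip features are not reproduced): flattened grey-level images are projected onto leading principal components and matched by nearest neighbour.

```python
# A sketch of the eigen-coefficient idea only: PCA of flattened grey-level
# images and nearest-neighbour matching in the coefficient space.
import numpy as np

rng = np.random.default_rng(4)
train = rng.random((430, 64 * 64))            # stand-in: 430 flattened hand images
labels = np.repeat(np.arange(43), 10)         # 43 persons, 10 images each

mean = train.mean(axis=0)
U, S, Vt = np.linalg.svd(train - mean, full_matrices=False)
k = 40
basis = Vt[:k]                                # leading eigen-images

def eigen_coefficients(img):
    return (img - mean) @ basis.T             # project onto the eigen-space

train_coeff = eigen_coefficients(train)

def identify(img):
    d = np.linalg.norm(train_coeff - eigen_coefficients(img), axis=1)
    return labels[np.argmin(d)]               # closest template decides the identity

print(identify(train[17]))                    # matches itself -> person 1
```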

Nikola Pavešić, Tadej Savič, Slobodan Ribarić
Labelling and Authentication for Medical Imaging Through Data Hiding

In this paper two potential applications of data hiding technology in a medical scenario are considered. The first application refers to labelling: the watermarking algorithm provides the possibility of directly embedding the patient's data into a medical image. The second application regards authentication and integrity verification: data hiding is applied to verify whether and where the content has been modified or falsified since its distribution. Two algorithms developed for these specific purposes are presented.

Alessia De Rosa, Roberto Caldelli, Alessandro Piva
Hand-geometry Recognition Based on Contour Landmarks

This paper demonstrates the feasibility of a new method of hand-geometry recognition based on parameters derived from the contour of the hand. The contour can be modelled by parameters, or features, that can capture more details of the shape of the hand than what is possible with the standard geometrical features used in hand-geometry recognition. The set of features considered in this paper consists of the spatial coordinates of certain landmarks on the contour. The verification performance obtained with contour-based features is compared with the verification performance of other methods described in the literature.

Raymond Veldhuis, Asker Bazen, Wim Booij, Anne Hendrikse
A Cross-cultural Evaluation Framework for Behavioral Biometric User Authentication

Today, biometric techniques are based either on passive traits (e.g. iris scan, face) or on active methods (e.g. voice and handwriting). In our work we focus on the evaluation of the latter. These methods, also described as behavioral biometrics, are characterized by traits that are learnt and acquired over time. Several approaches for user authentication have been published, but to date they have not been evaluated under cultural aspects such as language, script and the personal background of users. Especially for handwriting, such cultural aspects can have a significant and essential influence, as different spoken and written languages are in use and the scripts used for handwriting differ in nature.

F. Wolf, T. K. Basu, P. K. Dutta, C. Vielhauer, A. Oermann, B. Yegnanarayana

Bioinformatics and Biostatistics

On External Indices for Mixtures: Validating Mixtures of Genes

Mixture models represent results of gene expression cluster analysis in a more natural way than ‘hard’ partitions. This is also true for the representation of gene labels, such as functional annotations, where one gene is often assigned to more than one annotation term. Another important characteristic of functional annotations is their higher degree of detail in relation to groups of co-expressed genes. In other words, genes with similar function should be grouped together, but the inverse does not hold. Both of these facts, however, have been neglected by the validation studies in the context of gene expression analysis presented so far. To overcome the first problem, we propose an external index extending the corrected Rand index for the comparison of two mixtures. To address the second and more challenging problem, we perform a clustering of terms from the functional annotation in order to cope with the difference in coarseness of the two mixtures to be compared. We resort to simulated and biological data to show the usefulness of our proposals. The results show that we can only differentiate between distinct solutions after applying the component clustering.

Ivan G. Costa, Alexander Schliep
Tests for Multiple Change Points in Binary Markov Sequences

In Krauth (2005) we derived a finite conditional conservative test for a change point in a Bernoulli sequence with first-order Markov dependence. This approach was based on the property of intercalary independence of Markov processes (Dufour and Torrès (2000)) and on the CUSUM statistic considered in Krauth (1999, 2000) for the case of independent binomial trials. Here, we derive finite conditional tests for multiple change points in binary first-order Markov sequences, using in addition conditional modified maximum likelihood estimates for multiple change points (Krauth (2004)) and exact Fisher tests.
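
The underlying CUSUM idea, stripped of the Markov dependence and of the conditional testing that the paper actually contributes, can be sketched as follows.

```python
# A sketch of the plain CUSUM idea only (the paper's conditional tests and the
# first-order Markov dependence are not reproduced): locate a candidate change
# point in a Bernoulli sequence by the maximum of the cumulative sums.
import numpy as np

rng = np.random.default_rng(5)
x = np.concatenate([rng.binomial(1, 0.2, 150),    # success probability 0.2 ...
                    rng.binomial(1, 0.6, 100)])   # ... jumps to 0.6 at position 150

p_hat = x.mean()
S = np.cumsum(x - p_hat)                # CUSUM path under the "no change" estimate
tau = int(np.argmax(np.abs(S))) + 1     # candidate change point (1-based)
print("candidate change point:", tau)
```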

Joachim Krauth
UnitExpressions: A Rational Normalization Scheme for DNA Microarray Data

A new normalization scheme for DNA microarray data, called UnitExpression, is introduced. The central idea is to derive a precise model of unexpressed genes; most of the expression rates in a typical microarray experiment belong to this category. Pareto probability density estimation (PDE) and EM are used to calculate a precise model of this distribution. UnitExpressions represent a lower bound on the probability that a gene on a microarray is expressed. With UnitExpressions, experiments from different microarrays can be compared even across different studies. UnitExpressions are compared to standardized LogRatios for distance calculation in hierarchical clustering.

Alfred Ultsch

Classification of High-dimensional Biological and Medical Data

A Ridge Classification Method for High-dimensional Observations

Current experimental techniques such as gene expression analysis with microarrays result in situations where the number of variables exceeds the number of observations by far. Application of the standard classification methodology then fails because of the singularity of the covariance matrix. One possibility to circumvent this problem is to use ridge estimates instead of the sample covariance matrix.

Raudys and Skurichina presented an analytic formula for the asymptotic error of the one-parametric ridge classification rule. Based on their approach, we derive a new formula which, unlike that of Raudys and Skurichina, is also valid in the case of a singular covariance matrix. Under suitable conditions the formula allows one to calculate the ridge parameter which minimizes the classification error. Simulation results are presented.
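
A sketch of the general ridge classification idea (with an ad hoc ridge parameter, not the one obtained from the authors' error formula) is given below.

```python
# General idea only: replace the singular pooled covariance by a ridge estimate
# S + lambda * I and use it in a linear discriminant rule when p greatly exceeds n.
import numpy as np

rng = np.random.default_rng(6)
p, n = 500, 40                                   # many more variables than observations
X0 = rng.normal(0.0, 1.0, size=(n, p))           # class 0
X1 = rng.normal(0.3, 1.0, size=(n, p))           # class 1, shifted mean

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2   # singular for p > n
lam = 1.0                                        # ridge parameter (chosen ad hoc here)
S_ridge = S + lam * np.eye(p)

w = np.linalg.solve(S_ridge, m1 - m0)            # discriminant direction
threshold = w @ (m0 + m1) / 2

def classify(x):
    return int(w @ x > threshold)                # 1 = class 1, 0 = class 0

print(classify(rng.normal(0.3, 1.0, size=p)))    # likely classified as 1
```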

Martin Grüning, Siegfried Kropf
Assessing the Trustworthiness of Clustering Solutions Obtained by a Function Optimization Scheme

We present a method for finding clustering structures which are good and trustworthy. The method analyzes re-clustering results obtained by varying the search path in the space of partitions. From the scatter of results, the joint optimum of given quality criteria is determined and the re-occurrence probability of this optimum (called optimum consensus) is estimated. Then the finest structure is determined that emerged robustly with scores typical of high partition quality. When applied to tumor gene expression benchmark data, the method assigned fewer tissue samples to a wrong class than methods based on either consensus or quality criteria alone.

Ulrich Möller, Dörte Radke
Variable Selection for Discrimination of More Than Two Classes Where Data are Sparse

In classification, with an increasing number of variables, the required number of observations grows drastically. In this paper we present an approach to put into effect the maximal possible variable selection, by splitting a K-class classification problem into pairwise problems. The principle makes use of the possibility that a variable that discriminates two classes will not necessarily do so for all such class pairs.

We further present the construction of a classification rule based on the pairwise solutions by the Pairwise Coupling algorithm according to Hastie and Tibshirani (1998). The suggested procedure can be applied to any classification method. Finally, situations with a lack of data in multidimensional spaces are investigated on different simulated data sets to illustrate the problem and the possible gain. The principle is compared to the classical approach of linear and quadratic discriminant analysis.
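
A simplified sketch of the pairwise principle follows; it replaces the Pairwise Coupling combination by plain majority voting and selects variables per class pair by a t-statistic, both of which are assumptions made only for illustration.

```python
# Simplified sketch of the pairwise principle: per-pair variable selection by a
# t-statistic, per-pair LDA, and combination by majority voting (not Pairwise
# Coupling). Data are synthetic.
import numpy as np
from itertools import combinations
from scipy.stats import ttest_ind
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
K, n_per_class, p = 4, 30, 50
X = np.vstack([rng.normal(loc=0.4 * k, size=(n_per_class, p)) for k in range(K)])
y = np.repeat(np.arange(K), n_per_class)

pairwise = {}
for a, b in combinations(range(K), 2):
    Xa, Xb = X[y == a], X[y == b]
    t, _ = ttest_ind(Xa, Xb, axis=0)
    selected = np.argsort(-np.abs(t))[:5]        # keep the 5 most discriminating variables
    clf = LinearDiscriminantAnalysis().fit(
        np.vstack([Xa, Xb])[:, selected],
        np.array([a] * len(Xa) + [b] * len(Xb)))
    pairwise[(a, b)] = (selected, clf)

def predict(x):
    votes = np.zeros(K)
    for (a, b), (selected, clf) in pairwise.items():
        votes[clf.predict(x[selected].reshape(1, -1))[0]] += 1
    return int(np.argmax(votes))

print(predict(X[5]))                             # an observation drawn from class 0
```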

Gero Szepannek, Claus Weihs

Medical and Health Sciences

The Assessment of Second Primary Cancers (SPCs) in a Series of Splenic Marginal Zone Lymphoma (SMZL) Patients

The purpose of this study is to estimate the risk of second primary cancer (SPC) in 129 consecutive patients with splenic marginal zone lymphoma (SMZL) diagnosed in three Italian haematological centres. The person-years method, in which the expected number of cases is derived as a sum of products of age- and sex-specific rates and the corresponding time at risk, was used. The SPC Standardized Incidence Ratio (SIR) was 2.03 with a 95% confidence interval of [1.05, 3.56] (p < 0.05), and the corresponding Absolute Excess Risk (AER) was 145.8 (per 10000 SMZL patients per year). Our findings evidence a high frequency of additional cancers in patients with SMZL and suggest that the incidence rate of SPCs is significantly different from that expected in the general population.
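
The person-years calculation behind SIR and AER can be sketched as follows; the reference rates and counts are made up and are not the study's data.

```python
# Sketch of the person-years calculation (rates and counts below are made up):
# expected cases = sum over strata of reference rate times person-years at risk.
import numpy as np

# hypothetical age/sex strata: reference incidence rate per 10,000 person-years
reference_rate = np.array([12.0, 25.0, 48.0, 90.0]) / 10_000
person_years = np.array([110.0, 240.0, 180.0, 70.0])   # time at risk per stratum
observed = 11                                           # observed second primary cancers

expected = float(np.sum(reference_rate * person_years))
sir = observed / expected                               # Standardized Incidence Ratio
aer = (observed - expected) / person_years.sum() * 10_000   # per 10,000 patient-years

print(f"expected = {expected:.2f}, SIR = {sir:.2f}, AER = {aer:.1f}")
```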

Stefano De Cantis, Anna Maria Taormina
Heart Rate Classification Using Support Vector Machines

This contribution describes a classification technique that improves the heart rate estimation during hemodialysis treatments. After the heart rate is estimated from the pressure signal of the dialysis machine, a classifier decides whether it is correctly identified and rejects it if necessary. As the classifier employs a support vector machine, special attention is paid to the automatic selection of its user parameters. In this context, a comparison between different optimization techniques is presented, including a gradient projection method as the latest development.

Michael Vogt, Ulrich Moissl, Jochen Schaab

Music Analysis

Visual Mining in Music Collections

We describe the MusicMiner system for organizing large collections of music with databionic mining techniques. Visualization based on perceptually motivated audio features and Emergent Self-Organizing Maps enables the unsupervised discovery of timbrally consistent clusters that may or may not correspond to musical genres and artists. We demonstrate the visualization capabilities of the U-Map. An intuitive browsing of large music collections is offered based on the paradigm of topographic maps. The user can navigate the sound space and interact with the maps to play music or show the context of a song.

Fabian Mörchen, Alfred Ultsch, Mario Nöcker, Christian Stamm
Modeling Memory for Melodies

The aim of the presented study was to find structural descriptions of melodies that influence recognition memory for melodies. 24 melodies were played twice to 42 test persons. In the second turn, some of the melodies were changed, and the subjects were asked whether they thought the melody was exactly the same as in the first turn or not. The variables used to predict the subjects' judgments comprise data about the subjects' musical experience, features of the original melody and its position in the music piece, and information about the change between the first and the second turn. Classification and regression methods were carried out and tested on a subsample. The prediction problem turned out to be difficult. The results seem to be strongly influenced by differences between the subjects and between the melodies that had not been recorded among the regressor variables.

Daniel Müllensiefen, Christian Hennig
Parameter Optimization in Automatic Transcription of Music

Based on former work on the automatic transcription of musical time series into sheet music (Ligges et al. (2002), Weihs and Ligges (2003, 2005)), in this paper the parameters of the transcription algorithm are optimized for various real singers. Moreover, the parameters of various artificial singer models derived from the models of Rossignol et al. (1999) and Davy and Godsill (2002) are estimated. In both cases, optimization is carried out with the Nelder-Mead (1965) search algorithm. In the modelling case, a hierarchical Bayes extension is estimated with WinBUGS (Spiegelhalter et al. (2004)) as well. In all cases, optimal parameters are compared to heuristic estimates from our former standard method.

Claus Weihs, Uwe Ligges

Data Mining Competition

GfKl Data Mining Competition 2005: Predicting Liquidity Crises of Companies

Data preprocessing and a careful selection of the training and classification method are key steps in building a predictive or classification model with high performance. Here, we present the winning approaches submitted to the 2005 GfKl Data Mining Competition. The task to be solved for the competition was the prediction of a possible liquidity crisis of a company. The binary classification was to be based on a set of 26 variables describing attributes of the companies with unknown semantics.

Jens Strackeljan, Roland Jonscher, Sigurd Prieur, David Vogel, Thomas Deselaers, Daniel Keysers, Arne Mauser, Ilja Bezrukov, Andre Hegerath
Backmatter
Metadata
Title: From Data and Information Analysis to Knowledge Engineering
Edited by: Professor Dr. Myra Spiliopoulou, Professor Dr. Rudolf Kruse, Dr. Christian Borgelt, Jun.-Professor Dr. Andreas Nürnberger, Professor Dr. Wolfgang Gaul
Copyright year: 2006
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-540-31314-4
Print ISBN: 978-3-540-31313-7
DOI: https://doi.org/10.1007/3-540-31314-1