
2014 | Book

Data Analysis, Machine Learning and Knowledge Discovery


About this book

Data analysis, machine learning and knowledge discovery are research areas at the intersection of computer science, artificial intelligence, mathematics and statistics. They cover general methods and techniques that can be applied to a vast set of applications such as web and text mining, marketing, medicine, bioinformatics and business intelligence. This volume contains the revised versions of selected papers in the field of data analysis, machine learning and knowledge discovery presented during the 36th annual conference of the German Classification Society (GfKl). The conference was held at the University of Hildesheim (Germany) in August 2012.

Table of Contents

Frontmatter

AREA Statistics and Data Analysis: Classification, Cluster Analysis, Factor Analysis and Model Selection

Frontmatter
On Limiting Donor Usage for Imputation of Missing Data via Hot Deck Methods

Hot deck methods impute missing values within a data matrix by using available values from the same matrix. The object from which these available values are taken for imputation is called the donor. Selection of a suitable donor for the receiving object can be done within imputation classes. The risk inherent to this strategy is that any donor might be selected for multiple value recipients. In extreme cases one donor can be selected for too many or even all values. To mitigate this risk of donor overusage, some hot deck procedures limit the number of times one donor may be selected for value donation. This study answers whether limiting donor usage is a superior strategy with respect to imputation variance and bias in parameter estimates.

Udo Bankhofer, Dieter William Joenssen
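
A minimal base-R sketch of the donor-limited hot deck idea described above; the single imputation class, the random donor choice and the usage limit are illustrative assumptions, not the authors' exact procedure.

  # Donor-limited random hot deck within one imputation class (illustrative sketch).
  hot_deck_impute <- function(x, max_use = 2) {
    # x: numeric vector with NAs; max_use: how often one donor may be selected
    donors <- which(!is.na(x))
    usage  <- setNames(rep(0, length(donors)), donors)
    for (i in which(is.na(x))) {
      eligible <- donors[usage[as.character(donors)] < max_use]
      if (length(eligible) == 0) eligible <- donors   # fall back if the limit exhausts all donors
      d <- if (length(eligible) == 1) eligible else sample(eligible, 1)
      x[i] <- x[d]
      usage[as.character(d)] <- usage[as.character(d)] + 1
    }
    x
  }

  set.seed(1)
  y <- c(4.1, NA, 3.8, NA, 5.0, NA, 4.4)
  hot_deck_impute(y, max_use = 1)   # each donor used at most once
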
The Most Dangerous Districts of Dortmund

In this paper the districts of Dortmund, a big German city, are ranked according to the risk of being involved in an offence there. In order to measure this risk, the offences reported in police press reports in the year 2011 (Presseportal, http://www.presseportal.de/polizeipresse/pm/4971/polizei-dortmund?start=0, 2011) were analyzed and weighted by their maximum penalty according to the German criminal code. The resulting danger index was used to rank the districts. Moreover, the socio-demographic influences on the different offences are studied. The most probable influences appear to be traffic density (Sierau, Dortmunderinnen und Dortmunder unterwegs—Ergebnisse einer Befragung von Dortmunder Haushalten zu Mobilität und Mobilitätsverhalten, Ergebnisbericht, Dortmund-Agentur/Graphischer Betrieb Dortmund 09/2006, 2006) and the share of older people. Also, the inner city districts appear to be much more dangerous than the outskirts of the city of Dortmund. However, can these results be trusted? According to the press office of Dortmund's police, offences might not be uniformly reported by the districts to the office, and small offences like pick-pocketing are never reported in police press reports. Therefore, this case could also be an example of how an unsystematic press policy may cause an unintended bias in public perception and media awareness.

Tim Beige, Thomas Terhorst, Claus Weihs, Holger Wormer
Benchmarking Classification Algorithms on High-Performance Computing Clusters

Comparing and benchmarking classification algorithms is an important topic in applied data analysis. Extensive and thorough studies of this kind produce a considerable computational burden and are therefore best delegated to high-performance computing clusters. We build upon our recently developed R packages BatchJobs (Map, Reduce and Filter operations from functional programming for clusters) and BatchExperiments (parallelization and management of statistical experiments). Using these two packages, such experiments can now be performed effectively and reproducibly with minimal effort for the researcher. We present benchmarking results for standard classification algorithms and study the influence of pre-processing steps on their performance.

Bernd Bischl, Julia Schiffner, Claus Weihs
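
As a rough illustration of the kind of workflow meant here, a small benchmark of one classifier over cross-validation folds might look as follows. The makeRegistry / batchMap / submitJobs / loadResults calls are assumed from the BatchJobs documentation; argument details may differ between package versions and this is not the authors' experimental setup.

  library(BatchJobs)                       # assumes the package is installed and configured for a cluster
  reg <- makeRegistry(id = "bench_lda")    # file-backed registry holding all jobs

  run_fold <- function(fold, k = 10) {
    library(MASS)
    idx   <- which(seq_len(nrow(iris)) %% k == fold - 1)
    model <- lda(Species ~ ., data = iris[-idx, ])
    mean(predict(model, iris[idx, ])$class == iris$Species[idx])   # fold accuracy
  }

  batchMap(reg, run_fold, fold = 1:10)     # one cluster job per fold
  submitJobs(reg)                          # send jobs to the configured scheduler
  waitForJobs(reg)
  unlist(loadResults(reg))                 # collect per-fold accuracies
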
Visual Models for Categorical Data in Economic Research

This paper is concerned with visualizing categorical data in qualitative data analysis (Friendly, Visualizing categorical data, SAS Press, 2000. ISBN 1-58025-660-0; Meyer et al., J. Stat. Softw., 2006; Meyer et al., vcd: Visualizing Categorical Data. R package version 1.0.9, 2008). Graphical methods for qualitative data and extensions using a variety of R packages are presented. This paper outlines a general framework for visual models for categorical data. These ideas are illustrated with a variety of graphical methods for large, multi-way contingency tables. Graphical methods are available in R in the vcd and vcdExtra packages, including the mosaic plot, association plot, sieve plot, double-decker plot and agreement plot. These R packages include methods for the exploration of categorical data, such as fitting and graphing, plots and tests for independence, and visualization techniques for log-linear models. Some graphs, e.g. mosaic displays, are well suited for detecting patterns of association in the process of model building; others are useful in model diagnosis and for graphical presentation and summaries. The use of log-linear analysis, as well as visualizing categorical data in economic research, is presented in this paper.

Justyna Brzezińska
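
For orientation, the kinds of displays mentioned above can be produced with the vcd package roughly as follows; HairEyeColor is a standard R contingency table used here only as a stand-in for the economic data discussed in the paper.

  library(vcd)
  data("HairEyeColor")
  tab <- margin.table(HairEyeColor, c(1, 2))   # collapse to a two-way Hair x Eye table

  mosaic(tab, shade = TRUE)                    # mosaic plot with residual-based shading
  assoc(tab, shade = TRUE)                     # association plot
  sieve(tab)                                   # sieve diagram
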
How Many Bee Species? A Case Study in Determining the Number of Clusters

It is argued that the determination of the best number of clusters k is crucially dependent on the aim of clustering. Existing supposedly “objective” methods of estimating k ignore this. k can be determined by listing a number of requirements for a good clustering in the given application and finding a k that fulfils them all. The approach is illustrated by application to the problem of finding the number of species in a data set of Australasian Tetragonula bees. Requirements here include two new statistics formalising the largest within-cluster gap and cluster separation. Due to the typical nature of expert knowledge, it is difficult to make requirements precise, and a number of subjective decisions are involved.

Christian Hennig
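
A toy sketch of the general idea (formalizing requirements and scanning k). The definitions of the within-cluster gap and separation statistics below are ad hoc assumptions meant only to illustrate the approach; they do not reproduce the statistics proposed in the paper.

  set.seed(2)
  X <- rbind(matrix(rnorm(100, 0), ncol = 2), matrix(rnorm(100, 4), ncol = 2))
  D <- as.matrix(dist(X))

  stats_for_k <- function(k) {
    cl <- cutree(hclust(dist(X), method = "average"), k)
    # largest nearest-neighbour gap inside any cluster
    within_gap <- max(sapply(unique(cl), function(g) {
      d <- D[cl == g, cl == g, drop = FALSE]
      if (nrow(d) < 2) 0 else max(apply(d + diag(Inf, nrow(d)), 1, min))
    }))
    # minimal distance between a cluster and the rest
    separation <- min(sapply(unique(cl), function(g) min(D[cl == g, cl != g])))
    c(k = k, within_gap = within_gap, separation = separation)
  }
  t(sapply(2:6, stats_for_k))   # choose the k that fulfils the stated requirements
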
Two-Step Linear Discriminant Analysis for Classification of EEG Data

We introduce a multi-step machine learning approach and use it to classify electroencephalogram (EEG) data. This approach works very well for high-dimensional spatio-temporal data with separable covariance matrix. At first all features are divided into subgroups and linear discriminant analysis (LDA) is used to obtain a score for each subgroup. Then LDA is applied to these scores, producing the overall score used for classification. In this way we avoid estimation of the high-dimensional covariance matrix of all spatio-temporal features. We investigate the classification performance with special attention to the small sample size case. We also present a theoretical error bound for the normal model with separable covariance matrix, which results in a recommendation on how subgroups should be formed for the data.

Nguyen Hoang Huy, Stefan Frenzel, Christoph Bandt
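
A minimal sketch of the two-step idea under simplifying assumptions (simulated data, an arbitrary split of the features into subgroups, MASS::lda in both steps); it is not the authors' estimator or error bound.

  library(MASS)
  set.seed(3)
  n <- 100; p <- 20
  X <- matrix(rnorm(n * p), n, p)
  y <- factor(rep(0:1, each = n / 2))
  X[y == 1, 1:5] <- X[y == 1, 1:5] + 1           # weak signal in the first few features

  groups <- split(1:p, rep(1:4, each = p / 4))   # step 1: divide the features into subgroups

  # Step 1: one LDA score per subgroup; Step 2: LDA on the collected scores.
  scores <- sapply(groups, function(g) {
    fit <- lda(X[, g, drop = FALSE], grouping = y)
    predict(fit, X[, g, drop = FALSE])$x[, 1]
  })
  final <- lda(scores, grouping = y)
  mean(predict(final, scores)$class == y)        # resubstitution accuracy of the two-step rule
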
Predictive Validity of Tracking Decisions: Application of a New Validation Criterion

Although tracking decisions are primarily based on students’ achievements, distributions of academic competences in secondary school strongly overlap between school tracks. However, the correctness of tracking decisions is usually judged by whether or not a student has kept the track she or he was initially assigned to. To overcome the neglect of misclassified students, we propose an alternative validation criterion for tracking decisions. We applied this criterion to a sample of N = 2,300 Luxembourgish 9th graders in order to identify misclassifications due to tracking decisions. In Luxembourg, students in secondary school attend either an academic track or a vocational track. Students’ scores on academic achievement tests were obtained at the beginning of 9th grade. The test-score distributions, separated by tracks, overlapped to a large degree. Based on the distributions’ intersection, we determined two competence levels. With respect to their individual scores, we assigned each student to one of these levels. It turned out that about 21 % of the students attended a track that did not match their competence level. Whereas the agreement between tracking decisions and actual tracks in 9th grade was fairly high (κ = 0.93), the agreement between tracking decisions and competence levels was only moderate (κ = 0.56).

Florian Klapproth, Sabine Krolak-Schwerdt, Thomas Hörstermann, Romain Martin
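
A small sketch of the two ingredients described above, i.e. locating the intersection of the two track-specific score distributions and computing Cohen's κ between two categorical assignments. The data and the density-based cut-point rule are simulated assumptions, not the study's procedure.

  set.seed(4)
  track <- factor(rep(c("academic", "vocational"), each = 500))
  score <- c(rnorm(500, 60, 10), rnorm(500, 45, 10))        # overlapping test-score distributions

  # Intersection of the two kernel density estimates defines a competence cut-point.
  m1 <- mean(score[track == "academic"]); m2 <- mean(score[track == "vocational"])
  grid <- seq(min(m1, m2), max(m1, m2), length.out = 512)
  d1 <- approx(density(score[track == "academic"]),  xout = grid)$y
  d2 <- approx(density(score[track == "vocational"]), xout = grid)$y
  cut_point <- grid[which.min(abs(d1 - d2))]
  level <- factor(ifelse(score >= cut_point, "academic", "vocational"))

  cohens_kappa <- function(a, b) {
    tab <- table(a, b)
    p0 <- sum(diag(tab)) / sum(tab)                          # observed agreement
    pe <- sum(rowSums(tab) * colSums(tab)) / sum(tab)^2      # chance agreement
    (p0 - pe) / (1 - pe)
  }
  cohens_kappa(track, level)   # agreement between attended track and competence level
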
DDα-Classification of Asymmetric and Fat-Tailed Data

The DDα-procedure is a fast nonparametric method for supervised classification of d-dimensional objects into q ≥ 2 classes. It is based on q-dimensional depth plots and the α-procedure, which is an efficient algorithm for discrimination in the depth space [0, 1]^q. Specifically, we use two depth functions that are well computable in high dimensions, the zonoid depth and the random Tukey depth, and compare their performance for different simulated data sets, in particular asymmetric elliptically and t-distributed data.

Tatjana Lange, Karl Mosler, Pavlo Mozharovskyi
The Alpha-Procedure: A Nonparametric Invariant Method for Automatic Classification of Multi-Dimensional Objects

A procedure, called the α-procedure, for the efficient automatic classification of multivariate data is described. It is based on a geometric representation of two learning classes in a proper multi-dimensional rectifying feature space and the stepwise construction of a separating hyperplane in that space. The dimension of the space, i.e. the number of features that is necessary for a successful classification, is determined step by step using two-dimensional repères (linear subspaces). In each step a repère and a feature are constructed in a way that they yield maximum discriminating power. Throughout the procedure the invariant, which is the object’s affiliation with a class, is preserved.

Tatjana Lange, Pavlo Mozharovskyi
Support Vector Machines on Large Data Sets: Simple Parallel Approaches

Support Vector Machines (SVMs) are well-known for their excellent performance in the field of statistical classification. Still, the high computational cost due to the cubic runtime complexity is problematic for larger data sets. To mitigate this, Graf et al. (Adv. Neural Inf. Process. Syst. 17:521–528, 2005) proposed the Cascade SVM. It is a simple, stepwise procedure, in which the SVM is iteratively trained on subsets of the original data set and support vectors of resulting models are combined to create new training sets. The general idea is to bound the size of all considered training sets and therefore obtain a significant speedup. Another relevant advantage is that this approach can easily be parallelized because a number of independent models have to be fitted during each stage of the cascade. Initial experiments show that even moderate parallelization can reduce the computation time considerably, with only minor loss in accuracy. We compare the Cascade SVM to the standard SVM and a simple parallel bagging method w.r.t. both classification accuracy and training time. We also introduce a new stepwise bagging approach that exploits parallelization in a better way than the Cascade SVM and contains an adaptive stopping-time to select the number of stages for improved accuracy.

Oliver Meyer, Bernd Bischl, Claus Weihs
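
A single-stage sketch of the cascade idea with e1071::svm (training on disjoint subsets, pooling the resulting support vectors, and retraining on the pooled set). Parallelization, multiple cascade stages and the stopping rule discussed in the paper are omitted; data are simulated.

  library(e1071)
  set.seed(5)
  n <- 2000
  X <- matrix(rnorm(n * 2), n, 2)
  y <- factor(ifelse(X[, 1]^2 + X[, 2]^2 > 1.5, "a", "b"))

  chunks <- split(sample(1:n), rep(1:4, length.out = n))   # 4 disjoint training subsets

  # Stage 1: fit an SVM per chunk and keep only its support vectors.
  sv_idx <- unlist(lapply(chunks, function(idx) {
    fit <- svm(X[idx, ], y[idx], kernel = "radial")
    idx[fit$index]                                         # indices of the support vectors
  }))

  # Stage 2: retrain on the union of support vectors (much smaller than n).
  final <- svm(X[sv_idx, ], y[sv_idx], kernel = "radial")
  mean(predict(final, X) == y)                             # accuracy of the combined model
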
Soft Bootstrapping in Cluster Analysis and Its Comparison with Other Resampling Methods

The bootstrap approach is resampling with replacement from the original data. Here we consider sampling from the empirical distribution of a given data set in order to investigate the stability of results of cluster analysis. Concretely, the original bootstrap technique can be formulated by choosing the following weights of observations: m_i = n, if the corresponding object i is drawn n times, and m_i = 0 otherwise. We call the weights of observations masses. In this paper, we present another bootstrap method, called soft bootstrapping, which consists of randomly changing the “bootstrap masses” to some degree. Soft bootstrapping can be applied to any cluster analysis method that makes (directly or indirectly) use of weights of observations. This resampling scheme is especially appropriate for small sample sizes because no object is totally excluded from the soft bootstrap sample. At the end we compare different resampling techniques with respect to cluster analysis.

Hans-Joachim Mucha, Hans-Georg Bartel
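
One way to write down the mass idea in base R. The mixing rule used to soften the masses is an assumption chosen for illustration; the paper's soft bootstrapping may define the random change differently.

  set.seed(6)
  n <- 10
  # Ordinary bootstrap masses: how often each object is drawn in a resample of size n.
  hard_mass <- as.vector(rmultinom(1, size = n, prob = rep(1 / n, n)))

  # "Soften" the masses by mixing with equal weights, so no object gets mass zero.
  soften <- function(m, lambda = 0.5) lambda * m + (1 - lambda) * rep(1, length(m))
  soft_mass <- soften(hard_mass)

  rbind(hard = hard_mass, soft = soft_mass)   # weights to pass to a weighted clustering method
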
Dual Scaling Classification and Its Application in Archaeometry

We consider binary classification based on the dual scaling technique. In the case of more than two classes many binary classifiers can be considered. The proposed approach goes back to Mucha (An intelligent clustering technique based on dual scaling. In: S. Nishisato, Y. Baba, H. Bozdogan, K. Kanefuji (eds.) Measurement and multivariate analysis, pp. 37–46. Springer, Tokyo, 2002) and it is based on the pioneering book of Nishisato (Analysis of categorical data: Dual scaling and its applications. The University of Toronto Press, Toronto, 1980). It is applicable to mixed data the statistician is often faced with. First, numerical variables have to be discretized into bins to become ordinal variables (data preprocessing). Second, the ordinal variables are converted into categorical ones. Then the data is ready for dual scaling of each individual variable based on the given two classes: each category is transformed into a score. Then a classifier can be derived from the scores simply in an additive manner over all variables. It will be compared with the simple Bayesian classifier (SBC). Examples and applications to archaeometry (provenance studies of Roman ceramics) are presented.

Hans-Joachim Mucha, Hans-Georg Bartel, Jens Dolata
Gamma-Hadron-Separation in the MAGIC Experiment

The MAGIC-telescopes on the canary island of La Palma are two of the largest Cherenkov telescopes in the world, operating in stereoscopic mode since 2009 (Aleksić et al., Astropart. Phys. 35:435–448, 2012). A major step in the analysis of MAGIC data is the classification of observations into a gamma-ray signal and hadronic background. In this contribution we introduce the data provided by the MAGIC telescopes, which has some distinctive features. These features include high class imbalance, unknown and unequal misclassification costs as well as the absence of reliably labeled training data. We introduce a method to deal with some of these features. The method is based on a thresholding approach (Sheng and Ling 2006) and aims at minimization of the mean square error of an estimator, which is derived from the classification. The method is designed to fit into the special requirements of the MAGIC data.

Tobias Voigt, Roland Fried, Michael Backes, Wolfgang Rhode

AREA Machine Learning and Knowledge Discovery: Clustering, Classifiers, Streams and Social Networks

Frontmatter
Implementing Inductive Concept Learning For Cooperative Query Answering

Generalization operators have long been studied in the area of Conceptual Inductive Learning (Michalski, A theory and methodology of inductive learning. In: Machine learning: An artificial intelligence approach (pp. 111–161). TIOGA Publishing, 1983; De Raedt, About knowledge and inference in logical and relational learning. In: Advances in machine learning II (pp. 143–153). Springer, Berlin, 2010). We present an implementation of these learning operators in a prototype system for cooperative query answering. The implementation can, however, also be used as a usual concept learning mechanism for concepts described in first-order predicate logic. We sketch an extension of the generalization process by a ranking mechanism on answers for the case that some answers are not related to what the user asked.

Maheen Bakhtyar, Nam Dang, Katsumi Inoue, Lena Wiese
Clustering Large Datasets Using Data Stream Clustering Techniques

Unsupervised identification of groups in large data sets is important for many machine learning and knowledge discovery applications. Conventional clustering approaches (k-means, hierarchical clustering, etc.) typically do not scale well for very large data sets. In recent years, data stream clustering algorithms have been proposed which can deal efficiently with potentially unbounded streams of data. This paper is the first to investigate the use of data stream clustering algorithms as light-weight alternatives to conventional algorithms on large non-streaming data. We discuss important issues including order dependence and report the results of an initial study using several synthetic and real-world data sets.

Matthew Bolaños, John Forrest, Michael Hahsler
Feedback Prediction for Blogs

The last decade has seen an unbelievable growth in the importance of social media. Due to the huge amounts of documents appearing in social media, there is an enormous need for the automatic analysis of such documents. In this work, we focus on the analysis of documents appearing in blogs. We present a proof-of-concept industrial application, developed in cooperation with Capgemini Magyarország Kft. The most interesting component of this software prototype predicts the number of feedbacks that a blog document is expected to receive. For the prediction, we used various prediction algorithms in our experiments. For these experiments, we crawled blog documents from the internet. As an additional contribution, we published our dataset in order to motivate research in this field of growing interest.

Krisztian Buza
Spectral Clustering: Interpretation and Gaussian Parameter

Spectral clustering consists in creating, from the spectral elements of a Gaussian affinity matrix, a low-dimensional space in which data are grouped into clusters. However, questions about the separability of clusters in the projection space and the choice of the Gaussian parameter remain open. By drawing back to some continuous formulation, we propose an interpretation of spectral clustering with Partial Differential Equations tools which provides clustering properties and defines bounds for the affinity parameter.

Sandrine Mouysset, Joseph Noailles, Daniel Ruiz, Clovis Tauber
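
A compact base-R sketch of the standard spectral clustering pipeline analysed above (Gaussian affinity, normalized Laplacian, spectral embedding, k-means). The choice of the Gaussian parameter sigma is exactly the open issue the paper addresses; it is simply fixed by hand in this toy example.

  set.seed(7)
  X <- rbind(matrix(rnorm(100, 0, 0.3), ncol = 2), matrix(rnorm(100, 2, 0.3), ncol = 2))
  k <- 2; sigma <- 0.5                             # sigma chosen by hand here

  A <- exp(-as.matrix(dist(X))^2 / (2 * sigma^2))  # Gaussian affinity matrix
  diag(A) <- 0
  Dhalf <- diag(1 / sqrt(rowSums(A)))
  L <- Dhalf %*% A %*% Dhalf                       # normalized affinity (I - L is the normalized Laplacian)

  U <- eigen(L, symmetric = TRUE)$vectors[, 1:k]   # spectral embedding
  U <- U / sqrt(rowSums(U^2))                      # row normalization
  clusters <- kmeans(U, centers = k, nstart = 10)$cluster
  table(clusters, rep(1:2, each = 50))             # recovered vs. true groups
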
On the Problem of Error Propagation in Classifier Chains for Multi-label Classification

So-called classifier chains have recently been proposed as an appealing method for tackling the multi-label classification task. In this paper, we analyze the influence of a potential pitfall of the learning process, namely the discrepancy between the feature spaces used in training and testing: while true class labels are used as supplementary attributes for training the binary models along the chain, the same models need to rely on estimations of these labels when making a prediction. We provide first experimental results suggesting that the attribute noise thus created can affect the overall prediction performance of a classifier chain.

Robin Senge, Juan José del Coz, Eyke Hüllermeier
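
A bare-bones classifier chain with logistic regression base learners, showing exactly the train/test discrepancy discussed above: true labels enter as features during training, predicted labels during prediction. Data are simulated; this is not the experimental setup of the paper.

  set.seed(8)
  n <- 300
  X <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
  Y <- data.frame(y1 = rbinom(n, 1, plogis(X$x1)),
                  y2 = rbinom(n, 1, plogis(X$x2)))
  Y$y3 <- rbinom(n, 1, plogis(Y$y1 + Y$y2 - 1))            # label depending on the other labels

  train <- 1:200; test <- 201:300
  labels <- names(Y)

  # Training: label j uses the TRUE values of labels 1..j-1 as extra attributes.
  chain <- lapply(seq_along(labels), function(j) {
    dat <- cbind(X, Y[, seq_len(j - 1), drop = FALSE])[train, , drop = FALSE]
    glm(Y[train, j] ~ ., data = dat, family = binomial)
  })

  # Prediction: label j has to rely on the PREDICTED values of labels 1..j-1.
  preds <- data.frame(matrix(nrow = length(test), ncol = 0))
  for (j in seq_along(labels)) {
    newdat <- cbind(X[test, , drop = FALSE], preds)
    names(newdat) <- c(names(X), labels[seq_len(j - 1)])
    p <- predict(chain[[j]], newdata = newdat, type = "response")
    preds[[labels[j]]] <- as.numeric(p > 0.5)
  }
  colMeans(preds == Y[test, ])                             # per-label test accuracy
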
Statistical Comparison of Classifiers for Multi-objective Feature Selection in Instrument Recognition

Many published articles on automatic music classification deal with the development and experimental comparison of algorithms; however, the final statements are often based on figures and simple statistics in tables, and only a few related studies apply proper statistical testing for a reliable discussion of results and for measuring the significance of the propositions. Therefore we provide two simple examples of a reasonable application of statistical tests for our previous study on recognizing instruments in polyphonic audio. This task is solved by multi-objective feature selection, starting from a large number of up-to-date audio descriptors and optimizing the classification error and the number of selected features at the same time by an evolutionary algorithm. The performance of several classifiers and their impact on the Pareto front are analyzed by means of statistical tests.

Igor Vatolkin, Bernd Bischl, Günter Rudolph, Claus Weihs
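
As a pointer to the kind of tests meant here, a paired comparison of classifier error rates across shared cross-validation folds can be done directly in base R; the error rates below are randomly generated placeholders, not results from the study.

  set.seed(9)
  # Error rates of three classifiers on the same 10 folds (simulated placeholders).
  errors <- cbind(rf  = 0.20 + rnorm(10, 0, 0.01),
                  svm = 0.19 + rnorm(10, 0, 0.01),
                  nb  = 0.25 + rnorm(10, 0, 0.01))

  friedman.test(errors)                             # global test over all classifiers (folds as blocks)
  wilcox.test(errors[, "rf"], errors[, "svm"],      # paired post-hoc comparison of two of them
              paired = TRUE)
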

AREA Data Analysis and Classification in Marketing

Frontmatter
The Dangers of Using Intention as a Surrogate for Retention in Brand Positioning Decision Support Systems

The purpose of this paper is to explore the dangers of using intention as a surrogate for retention in a decision support system (DSS) for brand positioning. An empirical study is conducted, using structural equation modeling and both data from the internal transactional database and a survey. The study is aimed at evaluating whether the DSS recommends different product benefits for brand positioning when intention is used as opposed to retention as the criterion variable. The results show that different product benefits are recommended contingent upon the criterion variable (intention vs. retention). The findings also indicate that the strength of the structural relationships is inflated when intention is used. This study is limited in that it investigates only one industry: the newspaper industry. This research provides guidance for brand managers in selecting the most appropriate benefit for brand positioning and advises against the use of intention as opposed to retention in DSSs. To the best of our knowledge this study is the first to challenge and refute the commonly held belief that intention is a valid surrogate for retention in a DSS for brand positioning.

Michel Ballings, Dirk Van den Poel
Multinomial SVM Item Recommender for Repeat-Buying Scenarios

Most of the common recommender systems deal with the task of generating recommendations for assortments in which a product is usually bought only once, like books or DVDs. However, there are plenty of online shops selling consumer goods like drugstore products, where the customer purchases the same product repeatedly. We call such scenarios repeat-buying scenarios (Böhm et al., Studies in classification, data analysis, and knowledge organization, 2001). For our approach we utilized the results of information geometry (Amari and Nagaoka, Methods of information geometry. Translation of mathematical monographs, vol 191, American Mathematical Society, Providence, 2000) and transformed customer data taken from a repeat-buying scenario into a multinomial space. Using the multinomial diffusion kernel from Lafferty and Lebanon (J Mach Learn Res 6:129–163, 2005) we developed the multinomial SVM (Support Vector Machine) item recommender system MN-SVM-IR to calculate personalized item recommendations for a repeat-buying scenario. We evaluated our SVM item recommender system in a tenfold cross-validation against the state-of-the-art recommender BPR-MF (Bayesian Personalized Ranking Matrix Factorization) developed by Rendle et al. (BPR: Bayesian personalized ranking from implicit feedback, 2009). The evaluation was performed on a real-world dataset from a large German online drugstore. It shows that the MN-SVM-IR outperforms the BPR-MF.

Christina Lichtenthäler, Lars Schmidt-Thieme
Predicting Changes in Market Segments Based on Customer Behavior

In modern marketing, knowing the development of different market segments is crucial. However, simply measuring the occurred changes is not sufficient when planning future marketing campaigns. Predictive models are needed to show trends and to forecast abrupt changes such as the elimination of segments, the splitting of a segment, or the like. For predicting changes, continuously collected data are needed. Behavioral data are suitable for spotting trends in customer segments as they can easily be recorded. For detecting changes in a market structure, fuzzy-clustering is used since gradual changes in cluster memberships can implicate future abrupt changes. In this paper, we introduce different measurements for the analysis of gradual changes that comprise the currentness of data and can be used in order to predict abrupt changes.

Anneke Minke, Klaus Ambrosi
Symbolic Cluster Ensemble based on Co-Association Matrix versus Noisy Variables and Outliers

Interval-valued data arise in practical situations such as recording monthly interval temperatures at meteorological stations, daily interval stock prices, etc. The ensemble approach, based on aggregating information provided by different models, has proved to be a very useful tool in the context of supervised learning. The main goal of this approach is to increase the accuracy and stability of the final classification. Recently the same techniques have been applied to cluster analysis, where a better solution can be obtained by combining a set of different clusterings. Ensemble clustering techniques are not a new problem, but their application to the symbolic data case is a quite new area. The article presents a proposal for applying the co-association based approach in cluster analysis when dealing with symbolic data containing noisy variables and outliers. In the empirical part, simulation experiment results based on artificial data (containing noisy variables and/or outliers) are compared. Besides that, ensemble clustering results for a real data set are shown (a segmentation example). In both cases the ensemble clustering results are compared with results obtained from a single clustering method.

Marcin Pełka
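
A base-R sketch of the co-association idea for ordinary (non-symbolic) data: run a base clusterer several times, count how often each pair of objects ends up in the same cluster, and cluster the resulting similarity matrix. The handling of interval-valued variables, noisy variables and outliers studied in the paper is not shown.

  set.seed(10)
  X <- rbind(matrix(rnorm(60, 0), ncol = 2), matrix(rnorm(60, 3), ncol = 2))
  n <- nrow(X); B <- 50

  coassoc <- matrix(0, n, n)
  for (b in 1:B) {
    cl <- kmeans(X, centers = 3, nstart = 1)$cluster    # deliberately unstable base clusterings
    coassoc <- coassoc + outer(cl, cl, "==")
  }
  coassoc <- coassoc / B                                 # fraction of runs in which i and j co-occur

  final <- cutree(hclust(as.dist(1 - coassoc), method = "average"), k = 2)
  table(final, rep(1:2, each = 30))                      # consensus clustering vs. true groups
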
Image Feature Selection for Market Segmentation: A Comparison of Alternative Approaches

The selection of variables (e.g. socio-demographic or psychographic descriptors of consumers, their buying intentions, buying frequencies, preferences) plays a decisive role in market segmentation. The inclusion as well as the exclusion of variables can influence the resulting classification decisively. Whereas this problem is always of importance it becomes overwhelming when customers should be grouped on the basis of describing images (e.g. photographs showing holidays experiences, usually bought products), as the number of potentially relevant image features is huge. In this paper we apply several general-purpose approaches to this problem: the heuristic variable selection by Carmone et al. (1999) and Brusco and Cradit (2001) as well as the model-based approach by Raftery and Dean (2004). We combine them with k-means, fuzzy c-means, and latent class analysis for comparisons in a Monte Carlo setting with an image database where the optimal market segmentation is already known.

Susanne Rumstadt, Daniel Baier
The Validity of Conjoint Analysis: An Investigation of Commercial Studies Over Time

Due to more and more online questionnaires and possible distraction—e.g. by mails, social network messages, or news reading during processing in an uncontrolled environment—one can assume that the (internal and external) validity of conjoint analyses decreases. We test this assumption by comparing the (internal and external) validity of commercial conjoint analyses over the last years. The research base consists of (disguised) recent commercial conjoint analyses of a leading international marketing research company in this field with about 1,000 conjoint analyses per year. The validity information is analyzed w.r.t. research objective, product type, period, incentives, and other categories, as well as w.r.t. other outcomes like interview length and response rates. The results show some interesting changes in the validity of these conjoint analyses. Additionally, new procedures to deal with this setting are shown.

Sebastian Selka, Daniel Baier, Peter Kurz
Solving Product Line Design Optimization Problems Using Stochastic Programming

In this paper, we try to apply stochastic programming methods to product line design optimization problems. Because of the estimated part-worths of the product attributes in conjoint analysis, there is a need to deal with the uncertainty caused by the underlying statistical data (Kall and Mayer, 2011, Stochastic linear programming: models, theory, and computation. International series in operations research & management science, vol. 156. New York, London: Springer). Inspired by the work of Georg B. Dantzig (1955, Linear programming under uncertainty. Management Science, 1, 197–206), we developed an approach to use the methods of stochastic programming for product line design issues. Three different approaches are compared using notional data of a yogurt market from Gaul and Baier (2009, Simulations- und optimierungsrechnungen auf basis der conjointanalyse. In D. Baier, & M. Brusch (Eds.), Conjointanalyse: methoden-anwendungen-praxisbeispiele (pp. 163–182). Berlin, Heidelberg: Springer). Stochastic programming methods like chance constrained programming are applied to the approach of Kohli and Sukumar (1990, Heuristics for product-line design using conjoint analyses. Management Science, 36, 1464–1478) and compared to the original approach and to that of Gaul, Aust and Baier (1995, Gewinnorientierte Produktliniengestaltung unter Beruecksichtigung des Kundennutzens. Zeitschrift fuer Betriebswirtschaftslehre, 65, 835–855). Besides the theoretical work, these methods are realized in self-written code with the help of the statistical software package R.

Sascha Voekler, Daniel Baier

AREA Data Analysis in Finance

Frontmatter
On the Discriminative Power of Credit Scoring Systems Trained on Independent Samples

The aim of this work is to assess the importance of the independence assumption in behavioral scorings created using logistic regression. We develop four sampling methods that control which observations associated with each client are to be included in the training set, avoiding a functional dependence between observations of the same client. We then calibrate logistic regressions with variable selection on the samples created by each method, plus one using all the data in the training set (biased base method), and validate the models on an independent data set. We find that the regression built using all the observations shows the highest area under the ROC curve and Kolmogorov–Smirnov statistics, while the regression that uses the least amount of observations shows the lowest performance and the highest variance of these indicators. Nevertheless, the fourth selection algorithm presented shows almost the same performance as the base method using just 14 % of the dataset and 14 fewer variables. We conclude that violating the independence assumption does not impact results strongly and, furthermore, that trying to control it by using less data can harm the performance of calibrated models, although a better sampling method does lead to equivalent results with a far smaller dataset.

Miguel Biron, Cristián Bravo
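
For reference, the two discrimination measures used in the comparison above can be computed in a few lines of base R; scores and default labels are simulated stand-ins.

  set.seed(11)
  default <- rbinom(1000, 1, 0.2)
  score   <- rnorm(1000, mean = ifelse(default == 1, 0.8, 0.2), sd = 0.5)  # model score

  # Area under the ROC curve via the rank-sum (Mann-Whitney) identity.
  n1 <- sum(default); n0 <- sum(default == 0)
  auc <- (sum(rank(score)[default == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

  # Kolmogorov-Smirnov statistic: maximal distance between the score ECDFs of the two classes.
  ks <- ks.test(score[default == 1], score[default == 0])$statistic

  c(AUC = auc, KS = unname(ks))
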
A Practical Method of Determining Longevity and Premature-Death Risk Aversion in Households and Some Proposals of Its Application

This article presents a concept of how to infer some information on the household preference structure from the expected trajectory of the cumulated net cash flow process that is indicated by the household members as the most acceptable variant. Under some assumptions, the financial planning style implies the cumulated surplus dynamics. The reasoning may be inverted to identify the financial planning style. To illustrate the concept, a sketch of a household financial planning model is proposed, taking into account longevity and premature-death risk aversion as well as the bequest motive. Then, a scheme of a procedure to identify and quantify preferences is presented. The results may be used in life-long financial planning suited to the preference structure.

Lukasz Feldman, Radoslaw Pietrzyk, Pawel Rokita
Correlation of Outliers in Multivariate Data

Conditional correlations of stock returns (also known as exceedance correlations) are commonly compared for downside moves and upside moves separately. So far, the results have shown an increase of correlation when the market goes down, and hence investors’ portfolios are less diversified. Unfortunately, when analysing empirical exceedance correlations in a multi-asset portfolio, each correlation may be based on different moments of time; thus high exceedance correlations for downside moves do not imply a lack of diversification in a bear market. This paper proposes calculating correlations under the condition that the Mahalanobis distance is greater than a given quantile of the chi-square distribution. The main advantage of the proposed approach is that each correlation is calculated from the same moments of time. Furthermore, when the data come from an elliptical distribution, the proposed conditional correlation does not change, in contrast to the exceedance correlation. Empirical results for selected stocks from the DAX30 show an increase of correlation in the bear market and a decrease of correlation in the bull market.

Bartosz Kaszuba
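
The proposed conditioning can be written directly in base R; the simulated two-asset returns and the 95% quantile are illustrative choices, not the paper's empirical setting.

  library(MASS)
  set.seed(12)
  R <- mvrnorm(2000, mu = c(0, 0), Sigma = matrix(c(1, 0.5, 0.5, 1), 2))  # two return series

  md   <- mahalanobis(R, center = colMeans(R), cov = cov(R))   # squared Mahalanobis distances
  tail <- md > qchisq(0.95, df = ncol(R))                      # joint outliers above the chi-square quantile

  c(overall  = cor(R)[1, 2],
    outliers = cor(R[tail, ])[1, 2])    # correlation computed on the same extreme time points
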
Value-at-Risk Backtesting Procedures Based on Loss Functions: Simulation Analysis of the Power of Tests

The definition of Value at Risk is quite general. There are many approaches that may lead to various VaR values. Backtesting is a necessary statistical procedure to test VaR models and select the best one. There are a lot of techniques for validating VaR models, but usually risk managers are not concerned about their statistical power. The goal of this paper is to compare the statistical power of specific backtesting procedures and also to examine the problem of limited data sets (as observed in practice). A loss function approach is usually used to rank correct VaR models, but it is also possible to evaluate VaR models by using that approach. This paper presents the idea of loss functions and compares the statistical power of backtests based on various loss functions with the Kupiec and Berkowitz approaches. Simulated data representing asset returns are used. This paper is a continuation of earlier research by the author.

Krzysztof Piontek
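
As background, the Kupiec proportion-of-failures test used as a benchmark above can be coded directly from its likelihood-ratio definition; the simulated returns and the simple normal VaR below are stand-ins for a real VaR model.

  set.seed(13)
  ret <- rnorm(500, 0, 0.01)                       # simulated daily returns
  p   <- 0.01
  VaR <- qnorm(p, mean(ret), sd(ret))              # one-day 99% VaR as a (negative) return quantile

  x     <- sum(ret < VaR)                          # number of VaR exceedances
  n_obs <- length(ret)

  # Kupiec POF likelihood-ratio statistic, asymptotically chi-square with 1 df.
  LR_pof <- -2 * (log((1 - p)^(n_obs - x) * p^x) -
                  log((1 - x / n_obs)^(n_obs - x) * (x / n_obs)^x))
  c(exceedances = x, LR = LR_pof, p.value = 1 - pchisq(LR_pof, df = 1))
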

AREA Data Analysis in Biostatistics and Bioinformatics

Frontmatter
Rank Aggregation for Candidate Gene Identification

Differences of molecular processes are reflected, among others, by differences in gene expression levels of the involved cells. High-throughput methods such as microarrays and deep sequencing approaches are increasingly used to obtain these expression profiles. Often differences of gene expression across different conditions such as tumor vs inflammation are investigated. Top scoring differential genes are considered as candidates for further analysis. Measured differences may not be related to a biological process as they can also be caused by variation in measurement or by other sources of noise. A method for reducing the influence of noise is to combine the available samples. Here, we analyze different types of combination methods, early and late aggregation and compare these statistical and positional rank aggregation methods in a simulation study and by experiments on real microarray data.

Andre Burkovski, Ludwig Lausser, Johann M. Kraus, Hans A. Kestler
Unsupervised Dimension Reduction Methods for Protein Sequence Classification

Feature extraction methods are widely applied in order to reduce the dimensionality of data for subsequent classification, thus decreasing the risk of noise fitting. Principal Component Analysis (PCA) is a popular linear method for transforming high-dimensional data into a low-dimensional representation. Non-linear and non-parametric methods for dimension reduction, such as Isomap, Stochastic Neighbor Embedding (SNE) and Interpol, are also used. In this study, we compare the performance of PCA, Isomap, t-SNE and Interpol as preprocessing steps for the classification of protein sequences. Using random forests, we compared the classification performance on two artificial and eighteen real-world protein data sets, including HIV drug resistance, HIV-1 co-receptor usage and protein functional class prediction, preprocessed with PCA, Isomap, t-SNE and Interpol. Significant differences between these feature extraction methods were observed. The prediction performance with Interpol converges towards a stable and significantly higher value compared to PCA, Isomap and t-SNE. This is probably due to the nature of protein sequences, where amino acids often depend on and affect each other to achieve, for instance, conformational stability. However, visualization of data reduced with Interpol is rather unintuitive compared to the other methods. We conclude that Interpol is superior to PCA, Isomap and t-SNE for feature extraction prior to classification, but is of limited use for visualization.

Dominik Heider, Christoph Bartenhagen, J. Nikolaj Dybowski, Sascha Hauke, Martin Pyka, Daniel Hoffmann
Three Transductive Set Covering Machines

We propose three transductive versions of the set covering machine with data dependent rays for classification in the molecular high-throughput setting. Utilizing both labeled and unlabeled samples, these transductive classifiers can learn information from both sample types, not only from labeled ones. These transductive set covering machines are based on modified selection criteria for their ensemble members. Via counting arguments we include the unlabeled information in the base classifier selection. One of the three methods we developed uniformly increased the classification accuracy; the other two showed mixed behaviour across the data sets. We could show that only by observing the order of unlabeled samples, not distances, we were able to increase classification accuracies, making these approaches useful even when very little information is available.

Florian Schmid, Ludwig Lausser, Hans A. Kestler

AREA Interdisciplinary Domains: Data Analysis in Music, Education and Psychology

Frontmatter
Tone Onset Detection Using an Auditory Model

Onset detection is an important step for music transcription and other tasks frequently encountered in music processing. Although several approaches have been developed for this task, none of them works well under all circumstances. In Bauer et al. (Einfluss der Musikinstrumente auf die Güte der Einsatzzeiterkennung, 2012) we investigated the influence of several factors, such as instrumentation, on the accuracy of onset detection. In this work, this investigation is extended by a computational model of the human auditory periphery. Instead of the original signal, the output of the simulated auditory nerve fibers is used. The main challenge here is combining the outputs of all auditory nerve fibers into one feature for onset detection. Different approaches are presented and compared. Our investigation shows that using the auditory model output leads to essential improvements of the onset detection rate for some instruments compared to previous results.

Nadja Bauer, Klaus Friedrichs, Dominik Kirchhoff, Julia Schiffner, Claus Weihs
A Unifying Framework for GPR Image Reconstruction

Ground Penetrating Radar (GPR) is a widely used technique for detecting buried objects in subsoil. Exact localization of buried objects is required, e.g. during environmental reconstruction works, to both accelerate the overall process and reduce overall costs. Radar measurements are usually visualized as images, so-called radargrams, that contain certain geometric shapes to be identified. This paper introduces a component-based image reconstruction framework to recognize overlapping shapes spanning a convex set of pixels. We assume an image to be generated by the interaction of several base component models, e.g. hand-made components or numerical simulations, distorted by multiple different noise components, each representing different physical interaction effects. We present initial experimental results on simulated and real-world GPR data, representing a first step towards a pluggable image reconstruction framework.

Andre Busche, Ruth Janning, Tomáš Horváth, Lars Schmidt-Thieme
Recognition of Musical Instruments in Intervals and Chords

Recognition of musical instruments in pieces of polyphonic music given as mp3- or wav-files is a difficult task because the onsets are unknown. Using source-filter models for sound separation is one approach. In this study, intervals and chords played by instruments of four families of musical instruments (strings, wind, piano, plucked strings) are used to build statistical models for the recognition of the musical instruments playing them by using the four high-level audio feature groups Absolute Amplitude Envelope (AAE), Mel-Frequency Cepstral Coefficients (MFCC) windowed and not-windowed as well as Linear Predictor Coding (LPC) to take also physical properties of the instruments into account (Fletcher, The physics of musical instruments, 2008). These feature groups are calculated for consecutive time blocks. Statistical supervised classification methods such as LDA, MDA, Support Vector Machines, Random Forest, and Boosting are used for classification together with variable selection (sequential forward selection).

Markus Eichhoff, Claus Weihs
ANOVA and Alternatives for Causal Inferences

Analysis of variance (ANOVA) is a procedure frequently used for analyzing experimental and quasi-experimental data in psychology. Nonetheless, there is confusion about which subtype to prefer for unbalanced data. Much of this confusion can be prevented when an adequate hypothesis is formulated first. In the present paper this is done by using a theory of causal effects. This is the starting point for the following simulation study on unbalanced two-way designs. Simulated data sets differed in the presence of an (average) effect, the degree of interaction, total sample size, stochasticity of subsample sizes, and whether there was confounding between the two factors (i.e. experimental vs. quasi-experimental design). Different subtypes of ANOVA as well as other competing procedures from the research on causal effects were compared with regard to type-I error rate and power. Results suggest that the different types of ANOVA should be used with care, especially in quasi-experimental designs and when there is interaction. Procedures developed within the research on causal effects are feasible alternatives that may serve better to answer meaningful hypotheses.

Sonja Hahn
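
The practical point about ANOVA subtypes can be seen directly in base R: with unbalanced data, sequential (Type I) sums of squares depend on the order in which the factors enter the model, which is one reason the choice of subtype matters. (Type II/III sums of squares are available, e.g., via car::Anova; only base R is used in this small simulated example.)

  set.seed(14)
  A <- factor(sample(c("a1", "a2"), 120, replace = TRUE, prob = c(0.7, 0.3)))  # unbalanced factor
  B <- factor(sample(c("b1", "b2"), 120, replace = TRUE))
  y <- 0.5 * (A == "a2") + 0.3 * (B == "b2") + rnorm(120)

  anova(lm(y ~ A * B))   # sequential (Type I) sums of squares: A entered first
  anova(lm(y ~ B * A))   # same data, factors reordered: different SS and p-values for A and B
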
Testing Models for Medieval Settlement Location

This contribution investigates two models for the spread of Medieval settlements in the landscape known as Bergisches Land in Germany. According to the first model, the spread was closely connected with the ancient trade routes on the ridges. The alternative model assumes that the settlements primarily developed in the fertile valleys. The models are tested in a study area for which the years are known when the small hamlets and villages were first mentioned in historical sources. It does not seem appropriate to apply straight-line distances in this context because the trade routes of that time include curves. Instead an adjusted distance metric is derived from the ancient trade routes. This metric is applied to generate a digital raster map so that each raster cell value corresponds to the adjusted distance to the nearest trade route (or fertile valley respectively). Finally, for each model a Kolmogorov–Smirnov test is applied to compare the adjusted distances of the Medieval settlements with the reference distribution derived from the appropriate raster map.

Irmela Herzog
Supporting Selection of Statistical Techniques

In this paper we describe the necessity for a semi-structured approach towards the selection of techniques in quantitative research. Deciding for a set of suitable techniques to work with a given dataset is a non-trivial and time-consuming task. Thus, structured support for choosing adequate data analysis techniques is required. We present a structural framework for organizing techniques and a description template to uniformly characterize techniques. We show that the former will provide an overview on all available techniques on different levels of abstraction, while the latter offers a way to assess a single method as well as compare it to others.

Kay F. Hildebrand
Alignment Methods for Folk Tune Classification

This paper studies the performance of alignment methods for folk music classification. An edit distance approach is applied to three datasets with different associated classification tasks (tune family, geographic region, and dance type), and compared with a baseline n-gram classifier. Experimental results show that the edit distance performs well for the specific task of tune family classification, yielding similar results to an n-gram model with a pitch interval representation. However, for more general classification tasks, where tunes within the same class are heterogeneous, the n-gram model is recommended.

Ruben Hillewaere, Bernard Manderick, Darrell Conklin
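
A toy illustration of the alignment approach: encode each tune as a string of pitch intervals and classify with the edit distance (base R's adist) and a leave-one-out 1-nearest-neighbour rule. The melodies below are invented placeholders, not data from the study.

  # Tunes as pitch sequences (MIDI-like numbers, invented for illustration).
  tunes <- list(t1 = c(60, 62, 64, 65, 67), t2 = c(60, 62, 64, 66, 67),
                t3 = c(67, 65, 64, 62, 60), t4 = c(67, 65, 63, 62, 60))
  family <- c("up", "up", "down", "down")

  # Interval representation: successive pitch differences mapped to printable characters.
  encode  <- function(p) rawToChar(as.raw(diff(p) + 100))
  strings <- sapply(tunes, encode)

  D <- adist(strings)                              # pairwise edit (Levenshtein) distances
  predict_1nn <- function(i) family[-i][which.min(D[i, -i])]
  sapply(seq_along(strings), predict_1nn)          # leave-one-out 1-NN labels
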
Comparing Regression Approaches in Modelling Compensatory and Noncompensatory Judgment Formation

Applied research on judgment formation, e.g. in education, is interested in identifying the underlying judgment rules from empirical judgment data. Psychological theories and empirical results on human judgment formation support the assumption of compensatory strategies, e.g. (weighted) linear models, as well as noncompensatory (heuristic) strategies as underlying judgment rules. Previous research repeatedly demonstrated that linear regression models well fitted empirical judgment data, leading to the conclusion that the underlying cognitive judgment rules were also linear and compensatory. This simulation study investigated whether a good fit of a linear regression model is a valid indicator of a compensatory cognitive judgment formation process. Simulated judgment data sets with underlying compensatory and noncompensatory judgment rules were generated to reflect typical judgment data from applied educational research. Results indicated that linear regression models well fitted even judgment data with underlying noncompensatory judgment rules, thus impairing the validity of the fit of the linear model as an indicator of compensatory cognitive judgment processes.

Thomas Hörstermann, Sabine Krolak-Schwerdt
Sensitivity Analyses for the Mixed Coefficients Multinomial Logit Model

For scaling items and persons in large scale assessment studies such as the Programme for International Student Assessment (PISA; OECD, PISA 2009 Technical Report. OECD Publishing, Paris, 2012) or the Progress in International Reading Literacy Study (PIRLS; Martin et al., PIRLS 2006 Technical Report. TIMSS & PIRLS International Study Center, Chestnut Hill, 2007), variants of the Rasch model (Fischer and Molenaar (Eds.), Rasch models: Foundations, recent developments, and applications. Springer, New York, 1995) are used. However, goodness-of-fit statistics for the overall fit of the models under varying conditions as well as specific statistics for the various testable consequences of the models (Steyer and Eid, Messen und Testen [Measuring and Testing]. Springer, Berlin, 2001) are rarely, if at all, presented in the published reports. In this paper, we apply the mixed coefficients multinomial logit model (Adams et al., The multidimensional random coefficients multinomial logit model. Applied Psychological Measurement, 21, 1–23, 1997) to PISA data under varying conditions for dealing with missing data. On the basis of various overall and specific fit statistics, we compare how sensitive this model is across changing conditions. The results of our study will help in quantifying how meaningful the findings from large scale assessment studies can be. In particular, we report that the proportion of missing values and the mechanism behind missingness are relevant factors for estimation accuracy, and that imputing missing values in large scale assessment settings may not lead to more precise results.

Daniel Kasper, Ali Ünlü, Bernhard Gschrey
Confidence Measures in Automatic Music Classification

Automatic music classification receives steady attention in the research community. Music can be classified, for instance, according to genre, style, mood, or the instruments played. Automatically retrieved class labels can be used for searching and browsing within large digital music collections. However, due to the variability and complexity of music data and due to imprecise class definitions, the classification of real-world music remains error-prone. The reliability of automatic class decisions is essential for many applications. The goal of this work is to enhance the automatic class labels with confidence measures that provide an estimate of the probability of correct classification. We explore state-of-the-art classification techniques applied to automatic music genre classification and investigate to what extent posterior class probabilities can be used as confidence measures. The experimental results demonstrate some inadequacy of these confidence measures, which is very important for practical applications.

Hanna Lukashevich
Using Latent Class Models with Random Effects for Investigating Local Dependence

In psychometric latent variable modeling approaches such as item response theory one of the most central assumptions is local independence (LI), i.e. stochastic independence of test items given a latent ability variable (e.g., Hambleton et al., Fundamentals of item response theory, 1991). This strong assumption, however, is often violated in practice resulting, for instance, in biased parameter estimation. To visualize the local item dependencies, we derive a measure quantifying the degree of such dependence for pairs of items. This measure can be viewed as a dissimilarity function in the sense of psychophysical scaling (Dzhafarov and Colonius, Journal of Mathematical Psychology 51:290–304, 2007), which allows us to represent the local dependencies graphically in the Euclidean 2D space. To avoid problems caused by violation of the local independence assumption, in this paper, we apply a more general concept of “local independence” to psychometric items. Latent class models with random effects (LCMRE; Qu et al., Biometrics 52:797–810, 1996) are used to formulate a generalized local independence (GLI) assumption held more frequently in reality. It includes LI as a special case. We illustrate our approach by investigating the local dependence structures in item types and instances of large scale assessment data from the Programme for International Student Assessment (PISA; OECD, PISA 2009 Technical Report, 2012).

Matthias Trendtel, Ali Ünlü, Daniel Kasper, Sina Stubben
The OECD’s Programme for International Student Assessment (PISA) Study: A Review of Its Basic Psychometric Concepts

The Programme for International Student Assessment (PISA; e.g., OECD, Sample tasks from the PISA 2000 assessment, 2002a; OECD, Learning for tomorrow’s world: first results from PISA 2003, 2004; OECD, PISA 2006: Science competencies for tomorrow’s world, 2007; OECD, PISA 2009 Technical Report, 2012) is an international large scale assessment study that aims to assess the skills and knowledge of 15-year-old students and, based on the results, to compare education systems across the participating (about 70) countries (with a minimum number of approx. 4,500 tested students per country). Initiator of this Programme is the Organisation for Economic Co-operation and Development (OECD; www.pisa.oecd.org). We review the main methodological techniques of the PISA study. Primarily, we focus on the psychometric procedure applied for scaling items and persons. PISA proficiency scale construction and the proficiency levels derived by discretization of the continua are discussed. For a balanced reflection of the PISA methodology, questions and suggestions on the reproduction of international item parameters, as well as on scoring, classifying and reporting, are raised. We hope that along these lines the PISA analyses can be better understood and evaluated, and if necessary, possibly be improved.

Ali Ünlü, Daniel Kasper, Matthias Trendtel, Michael Schurig
Music Genre Prediction by Low-Level and High-Level Characteristics

For music genre prediction, typically low-level audio signal features from the time, spectral or cepstral domains are taken into account. Another way is to use community-based statistics such as Last.FM tags. Whereas the first feature group often cannot be clearly interpreted by listeners, the second one suffers from erroneous or unavailable data for less popular songs. We propose a two-level approach combining the specific advantages of both groups: at first we create high-level descriptors which describe instrumental and harmonic characteristics of the music content, some of them derived from low-level features by supervised classification or from analysis of extended chroma and chord features. The experiments show that each categorization task requires its own feature set.

Igor Vatolkin, Günther Rötter, Claus Weihs

LIS WORKSHOP: Workshop on Classification and Subject Indexing in Library and Information Science

Frontmatter
Using Clustering Across Union Catalogues to Enrich Entries with Indexing Information

The federal system in Germany has created a segmented library landscape. Instead of a central entity responsible for cataloguing and indexing, regional library unions share the workload cooperatively among their members. One result of this approach is limited sharing of cataloguing and indexing information across union catalogues as well as heterogeneous indexing of items with almost equivalent content: different editions of the same work. In this paper, a method for clustering entries in library catalogues is proposed that can be used to reduce this heterogeneity as well as share indexing information across catalogue boundaries. In two experiments, the method is applied to several union catalogues and the results show that a surprisingly large number of previously not indexed entries can be enriched with indexing information. The quality of the indexing has been positively evaluated by human professionals and the results have already been imported into the production catalogues of two library unions.

Magnus Pfeffer
Text Mining for Ontology Construction

In the research project

NanOn: Semi-Automatic Ontology Construction—a Contribution to Knowledge Sharing in Nanotechnology

an ontology for chemical nanotechnology has been constructed. Parts of existing ontologies like CMO and ChEBI have been incorporated into the final ontology. The main focus of the project was to investigate the applicability of text mining methods for ontology construction and for automatic annotation of scientific texts. For this purpose, prototypical tools were developed, based on open source tools like GATE and OpenNLP. It could be shown that text mining methods which extract significant terms from relevant articles support conceptualisation done manually and ensure a better coverage of the domain. The quality of the annotation depends mostly on the completeness of the ontology with respect to synonymous and specific linguistic expressions.

Silke Rehme, Michael Schwantner
Data Enrichment in Discovery Systems Using Linked Data

The Linked Data Web is an abundant source for information that can be used to enrich information retrieval results. This can be helpful in many different scenarios, for example to enable extensive multilingual semantic search or to provide additional information to the users. In general, there are two different ways to enrich data: client-side and server-side. With client-side data enrichment, for instance by means of JavaScript in the browser, users can get additional information related to the results they are provided with. This additional information is not stored within the retrieval system and thus not available to improve the actual search. An example is the provision of links to external sources like Wikipedia, merely for convenience. By contrast, an enrichment on the server-side can be exploited to improve the retrieval directly, at the cost of data duplication and additional efforts to keep the data up-to-date. In this paper, we describe the basic concepts of data enrichment in discovery systems and compare advantages and disadvantages of both variants. Additionally, we introduce a JavaScript Plugin API that abstracts from the underlying system and facilitates platform independent client-side enrichments.

Dominique Ritze, Kai Eckert
Backmatter
Metadata
Title
Data Analysis, Machine Learning and Knowledge Discovery
Editors
Myra Spiliopoulou
Lars Schmidt-Thieme
Ruth Janning
Copyright Year
2014
Electronic ISBN
978-3-319-01595-8
Print ISBN
978-3-319-01594-1
DOI
https://doi.org/10.1007/978-3-319-01595-8
